A Topic-based Forensic Analysis and Visualization of an Email Network: Application to the Enron Dataset

Casey Kalinowski, M. Zakaria KURDI

Keywords: topic modeling, Social Network Analysis (SNA), Social Network Visualization, Term Frequency (TF), Latent Dirichlet Allocation (LDA).

Issue II, Volume I, Pages 1-22

This work visualizes an email network as graphs based on the topics of the emails. The first part explores three rule-based methods and one unsupervised method of topic detection applied to a large dataset. A keyword, or Term Frequency (TF), method serves as the baseline for comparison. Latent Dirichlet Allocation (LDA) combined with WordNet is also compared, as are two versions of conceptual topic detection, both involving a form of keyword extraction combined with WordNet. Our results show that LDA combined with WordNet has the highest precision but an F-measure comparable to that of the conceptual approaches. Through a series of examples, we then demonstrate how annotating the emails with topics sheds light on the underlying professional and social relationships within the email network, which can provide substantial help in application contexts such as forensic investigations. This annotation is also shown to provide quantitative feedback about the performance of the topic detection algorithms.
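
As an illustration only, and not the implementation evaluated in this work, the TF baseline and the precision and F-measure comparison can be sketched as follows. The stopword list, the sample email, the gold topic labels, and the function names `tf_keywords` and `precision_recall_f1` are all assumptions introduced for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "for",
             "is", "are", "we", "be", "will", "at", "this"}


def tf_keywords(text, k=3):
    """Return the k most frequent non-stopword tokens of a text."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def precision_recall_f1(predicted, gold):
    """Compare predicted topic terms with gold (hand-annotated) topic terms."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical email body and gold labels, made up for the example.
email = ("The gas contract is ready. Please review the gas contract "
         "pricing before the meeting.")
predicted = tf_keywords(email, k=3)
scores = precision_recall_f1(predicted, gold={"gas", "contract", "pricing"})
print(predicted, scores)
```

In this toy case two of the three extracted keywords match the gold topics. The third, a function word missing from the small stopword list, shows the kind of error that a frequency-only baseline can make.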
