Towards Understanding the Social Structure of Email and Spam Traffic

University dissertation from Chalmers University of Technology

Abstract: Email is a pervasive means of communication on the Internet. Email exchanges between individuals can be seen as social interactions between email sender(s) and receiver(s), and can thus be represented as a network. Networks of human interactions such as friendship relations, research collaborations, and phone calls have been widely studied to understand the characteristics, structure, and dynamics of such social interactions. In this thesis, we look into the social network properties of email networks generated from real traffic, and investigate how a vast amount of unsolicited email traffic (spam) affects these properties. Recent advances in Internet data collection and processing have facilitated the study of the characteristics of email traffic observed on the Internet. In our study, we have collected large-scale email datasets from traffic traversing a high-speed Internet backbone link and have generated email networks from the observed communications to analyze the structure and dynamics of these social interactions. Moreover, we aim at unveiling the distinguishing characteristics of legitimate and unsolicited email communications. We show that the networks of legitimate email traffic have the same structural and temporal properties that other social networks exhibit, and can therefore be modeled as small-world scale-free networks. However, unsolicited email communications cause deviations and anomalies in the structure of email networks, and these deviations from the expected social structural properties can be used to find the sources of spam email. We also show that email networks, similar to other social networks, have a community structure which can be found using different community detection algorithms. However, not all community detection algorithms can identify structural communities that coincide with the true logical communities of email networks, i.e., distinct communities of legitimate and unsolicited email.
Our study shows that a link-based community detection algorithm is more suitable for this purpose than the more widely used node-based algorithms. The possibility of using only the social structure of email traffic to identify the sources of spam and to separate unsolicited email from legitimate email can potentially be used to improve protection against spam and other types of malicious activity on the Internet.
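
To make the idea of an email network concrete, the following is a minimal Python sketch of how sender/receiver pairs might be turned into a graph and summarized by two standard structural measures, the degree distribution and the average clustering coefficient. It is an illustration only, not the method used in the thesis: the function names (build_email_network, degree_distribution, average_clustering) and the toy message list are hypothetical, and links are treated as undirected for simplicity.

```python
from collections import defaultdict
from itertools import combinations


def build_email_network(messages):
    """Build undirected adjacency sets from (sender, receiver) pairs.

    Assumption for this sketch: direction and message counts are ignored.
    """
    neighbors = defaultdict(set)
    for sender, receiver in messages:
        if sender != receiver:  # ignore self-addressed messages
            neighbors[sender].add(receiver)
            neighbors[receiver].add(sender)
    return dict(neighbors)


def degree_distribution(neighbors):
    """Map each degree k to the number of nodes that have degree k."""
    counts = defaultdict(int)
    for adjacent in neighbors.values():
        counts[len(adjacent)] += 1
    return dict(counts)


def average_clustering(neighbors):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    if not neighbors:
        return 0.0
    total = 0.0
    for adjacent in neighbors.values():
        k = len(adjacent)
        if k < 2:
            continue
        # Count links that exist between pairs of this node's neighbors.
        links = sum(1 for u, v in combinations(adjacent, 2) if v in neighbors[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(neighbors)


# Invented toy data: a triangle of mutual correspondents plus one
# hypothetical sender "s" that mails three otherwise unconnected addresses.
messages = [("a", "b"), ("b", "c"), ("c", "a"),
            ("s", "x"), ("s", "y"), ("s", "z")]
network = build_email_network(messages)
print(degree_distribution(network))
print(average_clustering(network))
```

In this toy input, the three mutual correspondents each have a local clustering coefficient of 1, while the star around the hypothetical sender "s" contributes 0. Real traffic would call for directed, weighted graphs and far larger datasets than this sketch handles.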
