Summary: Mining email social networks

November 17, 2008


By: Chris Malek

Category: Summaries

This paper is a good example of what I ultimately want to do: analyze email archives and revision control system logs (at least) of an open source software project to uncover the social structure of the community. While this paper is more interested in what is happening than in making a tool to uncover what is happening, it’s still quite relevant. It’s also one of the many papers about open source software communities that wouldn’t show up in a Web of Science search (because it’s a conference paper, not a journal paper) but which would prove very useful to me.

Of special interest to me are the technical aspects of how they did their work: figuring out the sequence of replies and attempting to deal with what they call email alias unmasking. In the email archives of a traditional firm, one can trace each email address back to a real person (one can look in the organizational directory to figure out who corresponds to what address) and each person typically has only one email address. Not so in the public forums used by OSS projects: there, a single person may use many aliases (foo@gmail.com, bar@yahoo.com, foo.bar@abccorp.com may all be the same person), and discovering that those aliases all belong to one person is important for discovering the true social structure.

Summary of

Bird, C., Gourley, A., Devanbu, P., Gertz, M., and Swaminathan, A. (2006). Mining email social networks. In MSR ‘06: Proceedings of the 2006 international workshop on Mining software repositories, pages 137-143, New York, NY, USA. ACM Press.

The authors looked at the Apache developers mailing list archive and CVS repository commit logs, considering messages covering a period of four or five years. They did so with the goal of studying communication and collaboration technologies (C&C) in software projects, particularly in open source software development. They are specifically interested in how activities in C&C correspond to development activities in the source code: what are the social properties of the developer network; do active communicators also make a lot of source code changes; do developers and non-developers play different social roles; and do the most active developers have the highest status among developers. They examined an open source project because most or all of its communications are purposely publicly available.

The authors looked at each participant in the mailing list, and divided the group into developers (those who contributed code or documentation changes to the CVS repository) and non-developers (those who didn’t). For each participant, they looked at how many messages the person sent, how many of their messages were replied to, and three social networking measures: in-degree (the number of edges arriving at a node in a directed graph; in this case, the number of different people to whom a person has replied), out-degree (the number of edges emerging from a node in a directed graph; in this case, the number of individuals who have replied to a person) and betweenness (the number of shortest paths that go through a node; high betweenness indicates that a person acts as a gatekeeper or broker, playing a role in many interactions). Note that these definitions imply that an edge runs from the author of a message to each person who replied to it. They also presented a directed sociogram of the Apache mailing list archive in which the arrows indicated who responded to whom more often (but didn’t do much with it).
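To make the two degree measures concrete, here is a minimal sketch of how they could be computed from a list of (replier, original sender) pairs. The function name `degree_measures`, the input format, and the choice to ignore self-replies are my own assumptions for illustration, not something taken from the paper.

```python
from collections import defaultdict


def degree_measures(replies):
    """Compute in-degree and out-degree per person.

    `replies` is an iterable of (replier, original_sender) pairs
    (a hypothetical input format). Following the definitions above:
      in-degree  = number of distinct people this person has replied to
      out-degree = number of distinct people who have replied to this person
    """
    replied_to = defaultdict(set)   # person -> people they replied to
    replied_by = defaultdict(set)   # person -> people who replied to them
    for replier, sender in replies:
        if replier == sender:
            continue  # assumption: self-replies carry no social information
        replied_to[replier].add(sender)
        replied_by[sender].add(replier)

    people = set(replied_to) | set(replied_by)
    return {
        person: {
            "in_degree": len(replied_to.get(person, ())),
            "out_degree": len(replied_by.get(person, ())),
        }
        for person in people
    }
```

Using sets means that ten replies from the same person count once, which matches the "number of different people" wording of the definitions.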

They found that messages sent, messages replied to, in-degree, and out-degree follow a Pareto distribution (a power law probability distribution; a few people send a lot, but most people send a little), the latter two showing a “long tailed degree distribution, characteristic of small world networks” (p. 141). There was a strong relationship between the number of messages sent by someone and the number of distinct people that respond to them (p. 141). They found a high correlation (Spearman rank correlation of 0.80) between messages sent and number of source changes made, indicating that C&C activity is correlated with development work (p. 141). There was a lower correlation between messages sent and document changes.
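For readers who haven’t met the Spearman rank correlation: it is the ordinary Pearson correlation computed on ranks rather than on raw values, so it measures whether two quantities rise together without assuming a linear relationship. A minimal sketch follows; the function names are mine and the numbers in the usage note are made up, not data from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up illustration: messages sent and source changes for five people.
messages_sent = [120, 45, 300, 8, 60]
source_changes = [40, 30, 95, 2, 10]
rho = spearman(messages_sent, source_changes)
```

With this invented data only two people swap places between the two rankings, so `rho` comes out at about 0.9.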

Developers do act as brokers or gatekeepers more than non-developers (p. 142), and generally have higher status (computed as what?), and developers who do more source code changes play more significant roles in the mailing list. Higher activity in source code changes is strongly correlated with higher activity in the mailing list; document changes are less strongly correlated. Generally, high in-degree, out-degree, and betweenness are correlated with status (how?) and source code change activity.

Data extraction

To determine who replied to whom, they used the message headers: the sender’s address and Message-ID: of each message, plus the In-Reply-To: header, which carries the Message-ID: of the message to which a message is a reply (if any). They suggest that you could look through the contents for quoted text attributions. The sender of a reply is “one who found the initial message of interest” (p. 139).
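Here is a rough sketch of how reply pairs could be pulled out of raw messages with Python’s standard `email` package. It assumes that each message carries a standard Message-ID: header and that replies name their parent in an In-Reply-To: header; the function name `reply_edges` and the decision to skip replies whose parent is missing from the archive are my assumptions, not details from the paper.

```python
from email import message_from_string
from email.utils import parseaddr


def sender_address(msg):
    """Lower-cased bare address from the From: header ('' if absent)."""
    return parseaddr(msg.get("From", ""))[1].lower()


def reply_edges(raw_messages):
    """Return (replier, original_sender) address pairs.

    `raw_messages` is a list of raw RFC 822 style message strings
    (a hypothetical input format; a real archive would need mbox parsing).
    """
    parsed = [message_from_string(raw) for raw in raw_messages]

    # First pass: remember who sent each message id.
    sender_of = {}
    for msg in parsed:
        msg_id = (msg.get("Message-ID") or "").strip()
        if msg_id:
            sender_of[msg_id] = sender_address(msg)

    # Second pass: link each reply to the sender of its parent.
    edges = []
    for msg in parsed:
        parent_id = (msg.get("In-Reply-To") or "").strip()
        if parent_id in sender_of:  # skip replies to messages we never saw
            edges.append((sender_address(msg), sender_of[parent_id]))
    return edges
```

Two passes are used so that a reply is still linked even when it appears in the archive before the message it answers.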

This is one of the few groups to deal explicitly with e-mail alias unmasking: many people have more than one e-mail address, and ensuring that we count all the e-mail from those different addresses as belonging to that person is not trivial. They used a clustering algorithm plus manual inspection to develop a lookup table of e-mail addresses to names. The similarity measure they used for the clustering is based on the fields in the From: line.

Using the Levenshtein distance, they compared normalized names to names and e-mail addresses to e-mail addresses; they also compared names to e-mail addresses, and took the maximum of the three scores (p. 139). They did this for all pairs of <name, e-mail> tuples. They used a similar method for unmasking CVS aliases.
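The comparison described above might look roughly like the sketch below. The Levenshtein distance is the standard dynamic-programming version; everything else (the normalization rules, turning a distance into a 0 to 1 similarity, comparing only the part of the address before the @, and the function names) is my own assumption, since I haven’t reproduced the paper’s exact rules here.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed scaling: 1.0 for identical strings, 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))


def compact(text):
    """Assumed normalization: lower-case, letters and digits only."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def local_part(address):
    """The part of an e-mail address before the @."""
    return address.split("@")[0]


def pair_score(first, second):
    """Max of name/name, address/address, and name/address similarity."""
    (name1, mail1), (name2, mail2) = first, second
    n1, n2 = compact(name1), compact(name2)
    m1, m2 = compact(local_part(mail1)), compact(local_part(mail2))
    name_to_name = similarity(n1, n2)
    mail_to_mail = similarity(m1, m2)
    name_to_mail = max(similarity(n1, m2), similarity(n2, m1))
    return max(name_to_name, mail_to_mail, name_to_mail)
```

A pair such as ("Foo Bar", "fb@gmail.com") and ("", "foo.bar@abccorp.com") scores 1.0 under these assumptions, because the first name matches the second address even though the names and addresses themselves differ. Pairs scoring above some threshold would then go to clustering and manual inspection.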

Social networking measures

They comment on connectedness, but don’t use it except to say that the most highly connected people in the Apache network are, in fact, the most productive developers (p. 140), and that they are doing further research into that.

The “small world network” is a statement about the mean shortest path and clustering of the network: paths between nodes are short while clustering is high. A power-law distribution of node degrees (few people are highly connected, and most people are not highly connected) is the defining property of scale-free networks, not an exponential distribution, and it is not part of the small-world definition as such, although a network can have both properties.

They used messages sent and out-degree to make the statement about number of messages sent vs. number of unique repliers. They’re doing further investigation into this.

They used betweenness with in-degree and out-degree to show that developers do act as brokers more than non-developers (p. 142), and generally have higher status (computed as what?), and that developers who do more source code changes play more significant roles in the mailing list.
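Betweenness is the least intuitive of the three measures, so here is a small sketch of how it can be computed on a directed, unweighted reply graph using Brandes’ algorithm (a standard technique I’m substituting in; I don’t know which implementation the authors used). The function name `betweenness` and the edge-list input are assumptions, and the scores are left unnormalized.

```python
from collections import defaultdict, deque


def betweenness(edges):
    """Unnormalized betweenness for a directed, unweighted graph.

    `edges` is an iterable of (source, target) pairs. A node's score is the
    sum, over all ordered pairs of other nodes, of the fraction of shortest
    paths between them that pass through that node.
    """
    neighbors = defaultdict(set)
    nodes = set()
    for u, v in edges:
        neighbors[u].add(v)
        nodes.update((u, v))

    score = dict.fromkeys(nodes, 0.0)
    for source in nodes:
        # Breadth-first search, counting shortest paths to every node.
        path_count = dict.fromkeys(nodes, 0)
        path_count[source] = 1
        distance = {source: 0}
        predecessors = defaultdict(list)
        visit_order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in neighbors.get(v, ()):
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)

        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = dict.fromkeys(nodes, 0.0)
        for w in reversed(visit_order):
            for v in predecessors[w]:
                share = path_count[v] / path_count[w]
                dependency[v] += share * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    return score
```

In a chain a → b → c, only b lies on a shortest path between two other nodes, so b scores 1.0 and the endpoints score 0.0; a broker who sits between many otherwise unconnected people accumulates a high score in the same way.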

2 Responses to “Summary: Mining email social networks”

  1. I share your vision and have spent a few years developing taxonomies to align human capital behavior management using digital event management and deep packet inspection (email, docs, etc.). I agree there should be an open source solution.

    enjoyed the read.
    Michael Brown

  2. Michael,

    Thank you for reading and commenting on what I’ve written here! I find it very encouraging that someone at your high level of expertise and responsibility looks around personally at what others in the field are working on.

    I really appreciate your comment because I didn’t know that there was such a business as the one you are in: skills management. Although perhaps I should have guessed, since I’ve had skills gap analysis recommended as an analysis to do on my team of programmers and systems administrators at Caltech. I’m now looking into what academics have done with skills assessment and management.

    About open source solutions, I’m not as concerned with what I make being open source as I am with supplying a tool that open source development communities can use for self-reflection.

    Chris
