I am just back from the
Web Science Doctoral Summer School 2011 hosted at DERI, Galway, Ireland. It was a busy, fun, and fulfilling week.
The talks were on a variety of topics, and it was good to be pushed outside my comfort zone. We divided into teams to do miniprojects, and a group of us from
Clique did a survey of our follow participants and analyzed the resulting network (more on this later). I would never have thought of doing a survey. ("surveys are for sociologists, but I'm a mathematician".)
I'd never used an API such as the Twitter or Facebook APIs before. I've done a fair bit of programming over the years, but the last time I did anything on the web was about 10 years ago with ASP and where Javascript was merely used for form validation. But we had to access these APIs to get the follow/friend networks for those people at the summer school, so it was useful to be forced to try new things like that.
One of the recurring topics was about breaking the boundaries between 'builders' and 'studiers' and it was heartening to see sociology and statistics and computer science, and so on, coming together to look at the world wide web - in fact, the general mood was that this had already been achieved. I've always found it unhelpful to label somebody as a 'sociologist' or as a 'computer scientist' or anything like that - we all have a variety of interests and these interests change dramatically over time. For example, I had no interest in networks prior to joining a PhD programme two years ago!
Bernie Hogan's talk was a good example of this 'interdisciplinarity in one person' - his slides (link anybody?) discusses some history of identity and privacy on the internet alongside technical examples based on the Facebook API. (Talking of Bernie, it was originally his idea to gather the survey for the miniproject we worked on).
On the other hand, the term 'Web Science' is perhaps a little too restrictive in some ways. Mobile telecommunications give rise to social networks, and are therefore relevant for research for many of us at the summer school. I don't mean this as a criticism of the Web Science 'project', it's just to reinforce that research topics overlap with each other pervasively.
A number of us had fun playing with
Fractional Calculus on our own time. it wasn't directly relevant to the summer school but it was fun to play with maths like this; any mathematician can tell you about the 1
st, 2
nd and 3
rd derivatives, but what about the 1.5
th derivative? If you ask me more, there's a danger I'll tell you about
negative probabilities or
negative temperature - you have been warned :-) . We also discussed
Zorn's Lemma. These experiences reminded me of the summer schools I attended as part of doing a Maths degree with the
Open University (they were hosted at the University of Nottingham). They were extremely good intellectual fun. The number of students at the Web Science Summer School was also about right. There were 74 participants; if there were many more we wouldn't get to really build up a relationship over the week. The organized social events (Segway tour of Galway, and a visit to the Aran Islands) also helped a lot!
Anyway, I promised I'd talk a little about our miniproject and tell you a little more about the survey we did. (In alphabetical order, we were To Tu Cuong, Samantha Lam, myself, Ursula Redmond, Arthur White.) We asked the participants to fill in survey recording who they knew before the summer school and who they talked to during the first few days of the school. 41 out of 74 responded,
giving us 1198 directed edges.
Update for clarity: At first, we created a simple directed graph linking A to B if A reported a relationship of either kind with B. This gave us 1198 edges. There were 624 edges in the 'before' network, and 957 edges in the 'during' network. The 1,198 edges is simply the union of these sets of edges.
We hope to make the data available in anonymized (to the best of our abilities) form, but for now I'll just give you a taste of what we found when we ran a clustering algorithm on it.
This is an
adjacency matrix. Each row corresponds to somebody who filled in the survey (a
respondent) and each column corresponds to a one of the
participants in the summer school. A black dot records the fact that the
respondent (claims to have) talked to the
participant during the school or knew them before hand.
The 33 rows at the bottom, all white, clearly respresent those participants who didn't fill in the survey. The names of the people in the rows are in the same order as those names on the columns, and hence you can see that this matrix is somewhat symmetric (in the upper left quadrant); if A claims to know B, it usually means that B also claims to know A. In other words, we had a good memory of who we had talked to!
The interesting thing for me is the red lines. I didn't just order the rows and columns arbitrarily - I ordered them according to the clusters I found with a stochastic block-modelling (SBM) algorithm I'm working on at the moment. The red lines mark the clusters found by the algorithm. The algorithm takes a network as input and places the nodes into clusters, where nodes are placed together if they connect to the same nodes or if they connect to nodes in the same clusters. It estimated the numbers of clusters as 7, and I demonstrate one of the 7-cluster clusterings here. This gives us 49
blocks between the red lines, and I'll now talk a little about the 7
block-rows and 7
block-columns.
By manually inspecting the names in the clusters, we saw that the second cluster represented the main
Organizers of the summer school. They talked to everyone (the second block-row) and everyone remembered talking to them (the second block-column). The first cluster typically represented those people, like me, who came from outside DERI and Galway to attend the school. I'll call this cluster the
Visitors. The third and fourth clusters were primarly made up of people based in DERI itself - hence they tended to know each other before the summer school. By looking at the blocks on the main diagonal we can see that these
Locals know each other well, as now do the
Visitors. But by looking at the off-diagonal blocks we can see that there is less interaction between
Locals and
Visitors.
We named the fifth cluster the
Quiet cluster. You can see from the matrix above that they didn't tend to talk to many people (the fifth block-row and the fifth block-column). In fact, it appears that no two
Quiet people ever talked to each other! We looked at the names of these people - it including some people based at DERI and some not based at DERI.
You might be wondering why the non-responders were split into two clusters (the last two block-rows). By looking at the corresponding block-columns we can see hints as to why the non-responders were divided into two groups by our clustering algorithm - a group that was known by the
Locals and a group that was not. The relevant blocks in the sixth column are quite dense and sparse respectively.
I'm quite happy with these results - we merely fed a simple directed network into the algorithm and the clusters that came out have a ready interpretation. Perhaps this dataset will be included in the draft paper we are writing on the algorithm (I'm collaborating with
Brendan Murphy,
Nial Friel and
Neil Hurley on a paper about this new algorithm.)
Addendum: Using the same clustering to order the rows and columns, but displaying only the 'before' edges, gives this adjacency matrix:
This network is sparser, and it's no surprise that
Organizers and
Locals knew each other before the summer school - most of the density in the matrix is between the
Organizers and the
Locals. Finally, I looked at the matrix where only the 'during' edges are included:
To my eyes, this is more difficult to interpret. It would appear that most of the density is at the top left, but it's relatively dense everywhere. It suggests that the grouped mixed well during the summer school.