Friday, November 4, 2011

Function application should be left-to-right, not right-to-left

The following command line is typical when processing data that is in a text file:

cat data_file.txt | egrep '^X' | sort | uniq | wc -l

This counts the number of unique lines in the text file that begin with X. The great thing is that it's quite readable. You can read the data left to right

"we take the data in data_file.txt, extract the lines that begin with X, sort it, discard the (consecutive) dupes, and then count the number of lines."

Unfortunately, most programming languages force you to do this back-to-front. In a many languages, you'd have to do something crazy like:

wc_l ( uniq ( sort ( filter (data_file, beginsWithX))))


Function application should be left-to-right, not right-to-left. That feels more natural. I've written a little tool that allows me to program simple tools in the Haskell language in this style. Instead of using sed and awk and perl and python from the command line, I hope to use this tool instead. I'm calling it l2r and it wraps around runhugs, a Haskell interpreter.
l2r "x <- cat \"edge_list.txt\"; x & lines & filter ( head # (=='X') ) & sort & uniq & unlines & putStr"
The import things are the ampersands (&). They act like the pipe (|) in the bash, allowing functions to be listed left-to-right in the natural way. x is simply a (lazy) string that holds the entire contents of the file.

You might have hoped that you could simply do something like
l2r "cat \"edge_list.txt\" & lines & filter ( head # (=='X') ) & sort & uniq & unlines & putStr"
instead. But you can't. I'm not going to explain it fully here, but there is an important difference between x and cat. x is a plain String, but cat is a special object called a Monad which describes a (potentially side-effect-ful) operation which reads in a String.

A more useful example for me in my research is:
l2r 'x <- cat "edge_list.txt"; x & lines & map words & map (take 2) & map (map readInt) & filter ( \(a:b:[]) -> a /= b ) & map sort & sort & uniq & map (map show) & map unwords & unlines & putStr'
as it is code which identifies all the unique undirected edges in an edge list representation of a network.

Finally, here is the l2r code itself:





Any questions or feedback to aaronmcdaid@gmail.com

Wednesday, July 13, 2011

Web Science Doctoral Summer School 2011

I am just back from the Web Science Doctoral Summer School 2011 hosted at DERI, Galway, Ireland. It was a busy, fun, and fulfilling week.

The talks were on a variety of topics, and it was good to be pushed outside my comfort zone. We divided into teams to do miniprojects, and a group of us from Clique did a survey of our follow participants and analyzed the resulting network (more on this later). I would never have thought of doing a survey. ("surveys are for sociologists, but I'm a mathematician".)

I'd never used an API such as the Twitter or Facebook APIs before. I've done a fair bit of programming over the years, but the last time I did anything on the web was about 10 years ago with ASP and where Javascript was merely used for form validation. But we had to access these APIs to get the follow/friend networks for those people at the summer school, so it was useful to be forced to try new things like that.

One of the recurring topics was about breaking the boundaries between 'builders' and 'studiers' and it was heartening to see sociology and statistics and computer science, and so on, coming together to look at the world wide web - in fact, the general mood was that this had already been achieved. I've always found it unhelpful to label somebody as a 'sociologist' or as a 'computer scientist' or anything like that - we all have a variety of interests and these interests change dramatically over time. For example, I had no interest in networks prior to joining a PhD programme two years ago!

Bernie Hogan's talk was a good example of this 'interdisciplinarity in one person' - his slides (link anybody?) discusses some history of identity and privacy on the internet alongside technical examples based on the Facebook API. (Talking of Bernie, it was originally his idea to gather the survey for the miniproject we worked on).

On the other hand, the term 'Web Science' is perhaps a little too restrictive in some ways. Mobile telecommunications give rise to social networks, and are therefore relevant for research for many of us at the summer school. I don't mean this as a criticism of the Web Science 'project', it's just to reinforce that research topics overlap with each other pervasively.


A number of us had fun playing with Fractional Calculus on our own time. it wasn't directly relevant to the summer school but it was fun to play with maths like this; any mathematician can tell you about the 1st, 2nd and 3rd derivatives, but what about the 1.5th derivative? If you ask me more, there's a danger I'll tell you about negative probabilities or negative temperature - you have been warned :-) . We also discussed Zorn's Lemma. These experiences reminded me of the summer schools I attended as part of doing a Maths degree with the Open University (they were hosted at the University of Nottingham). They were extremely good intellectual fun. The number of students at the Web Science Summer School was also about right. There were 74 participants; if there were many more we wouldn't get to really build up a relationship over the week. The organized social events (Segway tour of Galway, and a visit to the Aran Islands) also helped a lot!

Anyway, I promised I'd talk a little about our miniproject and tell you a little more about the survey we did. (In alphabetical order, we were To Tu Cuong, Samantha Lam, myself, Ursula Redmond, Arthur White.) We asked the participants to fill in survey recording who they knew before the summer school and who they talked to during the first few days of the school. 41 out of 74 responded, giving us 1198 directed edges. Update for clarity: At first, we created a simple directed graph linking A to B if A reported a relationship of either kind with B. This gave us 1198 edges. There were 624 edges in the 'before' network, and 957 edges in the 'during' network. The 1,198 edges is simply the union of these sets of edges.

We hope to make the data available in anonymized (to the best of our abilities) form, but for now I'll just give you a taste of what we found when we ran a clustering algorithm on it.

This is an adjacency matrix. Each row corresponds to somebody who filled in the survey (a respondent) and each column corresponds to a one of the participants in the summer school. A black dot records the fact that the respondent (claims to have) talked to the participant during the school or knew them before hand.

The 33 rows at the bottom, all white, clearly respresent those participants who didn't fill in the survey. The names of the people in the rows are in the same order as those names on the columns, and hence you can see that this matrix is somewhat symmetric (in the upper left quadrant); if A claims to know B, it usually means that B also claims to know A. In other words, we had a good memory of who we had talked to!

The interesting thing for me is the red lines. I didn't just order the rows and columns arbitrarily - I ordered them according to the clusters I found with a stochastic block-modelling (SBM) algorithm I'm working on at the moment. The red lines mark the clusters found by the algorithm. The algorithm takes a network as input and places the nodes into clusters, where nodes are placed together if they connect to the same nodes or if they connect to nodes in the same clusters. It estimated the numbers of clusters as 7, and I demonstrate one of the 7-cluster clusterings here. This gives us 49 blocks between the red lines, and I'll now talk a little about the 7 block-rows and 7 block-columns.

By manually inspecting the names in the clusters, we saw that the second cluster represented the main Organizers of the summer school. They talked to everyone (the second block-row) and everyone remembered talking to them (the second block-column). The first cluster typically represented those people, like me, who came from outside DERI and Galway to attend the school. I'll call this cluster the Visitors. The third and fourth clusters were primarly made up of people based in DERI itself - hence they tended to know each other before the summer school. By looking at the blocks on the main diagonal we can see that these Locals know each other well, as now do the Visitors. But by looking at the off-diagonal blocks we can see that there is less interaction between Locals and Visitors.

We named the fifth cluster the Quiet cluster. You can see from the matrix above that they didn't tend to talk to many people (the fifth block-row and the fifth block-column). In fact, it appears that no two Quiet people ever talked to each other! We looked at the names of these people - it including some people based at DERI and some not based at DERI.

You might be wondering why the non-responders were split into two clusters (the last two block-rows). By looking at the corresponding block-columns we can see hints as to why the non-responders were divided into two groups by our clustering algorithm - a group that was known by the Locals and a group that was not. The relevant blocks in the sixth column are quite dense and sparse respectively.


I'm quite happy with these results - we merely fed a simple directed network into the algorithm and the clusters that came out have a ready interpretation. Perhaps this dataset will be included in the draft paper we are writing on the algorithm (I'm collaborating with Brendan Murphy, Nial Friel and Neil Hurley on a paper about this new algorithm.)

Addendum: Using the same clustering to order the rows and columns, but displaying only the 'before' edges, gives this adjacency matrix:
This network is sparser, and it's no surprise that Organizers and Locals knew each other before the summer school - most of the density in the matrix is between the Organizers and the Locals. Finally, I looked at the matrix where only the 'during' edges are included:

To my eyes, this is more difficult to interpret. It would appear that most of the density is at the top left, but it's relatively dense everywhere. It suggests that the grouped mixed well during the summer school.

Thursday, May 12, 2011

Push and Pop Stream state in C++

I often find iostreams in C++ a little confusing. I sometimes find myself falling back on C-style printfs, but from now on I'm trying to do formatted text 'The Right Way' in C++. I've written some code to allow myself to write my code like this (the stack.push/pop are my contribution):

  cout << stack.push << fixed << setprecision(1) << x << stack.pop << endl;

Some manipulators (like setprecision()) are permanent, any change made will remain in place for the rest of the execution of the program. But I don't want to have to reset everything manually. So I wanted to be able to save the entire state of the format flags. So I added the stack.push and stack.pop manipulators to push the format flags and the precision, so that the manipulators are now effectively temporary.

NB: Does anyone have a different/better solution? I feel like I'm reinventing the wheel.

The code needed is at http://gist.github.com/968629 . You'll need to declare a variable called stack (or any other name you like) before you can use it.

   #include "FormatFlagStack.cpp"
   FormatFlagStack stack;
     cout << stack.push << fixed << setprecision(1) << x << stack.pop << endl;

Monday, August 9, 2010

Presenting at ASONAM conference

Here are slides (SlideShare, pdf) I presented on Wed 11th August, as part of the ASONAM 2010 conference.

(Update 2010-08-19: The conference paper itself is now available. Update 2010-11-03: Added SlideShare link.)

I describe the MOSES algorithm, an overlapping community finding algorithm.

- Aaron McDaid, Neil Hurley. Detecting highly overlapping communities with Model-based Overlapping Seed Expansion. ASONAM 2010.

This is not to be confused with GCE, another algorithm that I have contributed to for finding overlapping communities. 

- Lee, Reid, McDaid, Hurley. Detecting highly overlapping community structure by greedy clique expansion. Poster paper KDD 2010

Monday, July 26, 2010

YALM: Yet Another Lamdba Macro for C++

Many programming languages offer what are called lambdas, which allow you to describe an algorithm in the place where it makes sense in the code. I've run into various problems while attempting to use existing methods with C++, and hence I present my own solution. If there's any prior art, please let me know.

A good example of the limitation I wished to overcome is on Boost's Lambda library. Using Boost, it is *not* possible to write:

sort(a,b,  _1.size() < _2.size()  ); // Doesn't work in Boost

However, the following works with YALM:

sort(a,b,
YALM_START(bool,vector<int>,vector<int>)
{ return l.size() < r.size() ; }
YALM_END
);

If your expression is quite simple like this one and doesn't contain commas, it'll be possible to use this shorter syntax:

sort(vs.begin(), vs.end(),
YALM(bool,vector<int>,vector<int>,return l.size() < r.size()  )
);

This allows for greater expressiveness within the body of the statement-group, avoiding the ugly applications of boost::bind(). But I admit that it might be inconvenient to have to explicitly write in the types all the time. Perhaps this can be auto-detected?


Here is the code needed to make this work. There are obvious limitations, but it should be clear how to extend this to work for functions with just one parameter, or more than two parameters. Any suggestions?

Now for some technicalities for the curious:

The virtual method is required in order to work around a limitation whereby locally-defined struct types cannot be used in standard algorithms (I'm using g++ (Debian 4.3.2-1.1) 4.3.2):


no matching function for call to ‘sort(..., ..., main()::X&)’

The macros defined in yalm.hpp declare a local struct, which extends a globally-defined function type, allowing us to pass in a pointer-like object to the standard algorithm.