Last Thursday I went to Utrecht for the Haren Hackathon, a gathering of data and social media nerds to investigate a dataset of tweets concerning the rioting in Haren last month. Lots of fun stuff got made, but I only just now finished my contribution and put it online.
Basically I built an hour-by-hour “trending topics” visualisation from the dataset of 550,000 tweets that Harro Ranter brought to the Hackathon. The intention was to get a sense of the emotional tone through the course of the day. I must admit that this doesn’t come through as clearly as I had hoped (although the sudden emergence of “afgebroken” and “wereldoorlog” at 8pm is clear). The technique does clearly show the appearance at 10pm of the rumour that a 19-year-old girl had been trampled to death (thankfully this turned out to be untrue), and the cleanup discussion starting at 7 the next morning.
There is one improvement that I would like to make (although I probably won’t get around to it). The analysis I made simply went hour by hour, but the volume of tweets in those hours varies wildly, from fewer than 1,000 in the early hours of 21/9 to 110,000 in the hour before midnight of the same day. The statistical technique I used for the trending topics requires a reasonably large volume to give good results, but there is certainly room to divide up the hours from 9pm until midnight into smaller time periods, which would give a more fine-grained picture.
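To make the idea concrete, here is a rough sketch of how volume-dependent time periods could be chosen. This is not the code behind the visualisation: the function name `adaptive_bins`, the `min_per_bin` threshold and the `max_splits` cap are all hypothetical, and the sketch assumes tweets are spread fairly evenly within each hour, which will not be exactly true.

```python
from collections import Counter
from datetime import datetime, timedelta

def adaptive_bins(timestamps, min_per_bin=10_000, max_splits=6):
    """Return (start, end) periods: one per hour, with busy hours split evenly.

    min_per_bin and max_splits are illustrative values, not tuned ones.
    """
    # Count tweets per clock hour.
    hourly = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    bins = []
    for hour_start in sorted(hourly):
        # How many sub-periods this hour can support while keeping roughly
        # min_per_bin tweets in each (assuming an even spread within the hour).
        splits = max(1, min(max_splits, hourly[hour_start] // min_per_bin))
        step = timedelta(hours=1) / splits
        for i in range(splits):
            bins.append((hour_start + i * step, hour_start + (i + 1) * step))
    return bins

# Tiny made-up example: a quiet hour and a busy hour, with a low threshold.
sample = [datetime(2012, 9, 21, 3, 0)] * 5 + [datetime(2012, 9, 21, 23, 30)] * 25
periods = adaptive_bins(sample, min_per_bin=10)
```

With the sample above, the quiet hour stays whole and the busy hour is cut into two half-hour periods. A real version would want to check the actual count in each sub-period rather than trusting the even-spread assumption.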
Another feature that the visualisation really needs, although I don’t see an easy way to provide it, is a back-link from each term to some indication of the tweets that it represents. Sadly, Twitter search doesn’t give old results, and the code I used for the calculations doesn’t keep track of which documents the terms came from.1
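One way such a back-link could work, if the calculation were redone, is to build a simple inverted index alongside it that records which tweets each term appeared in. The sketch below is only an illustration: `build_term_index` is a hypothetical helper, the tokenisation is a crude word split rather than whatever the original code did, and the sample tweets are invented.

```python
import re
from collections import defaultdict

def build_term_index(tweets):
    """Map each lower-cased term to the set of tweet IDs that contain it."""
    index = defaultdict(set)
    for tweet_id, text in tweets:
        # Crude tokenisation: runs of word characters, lower-cased.
        for term in set(re.findall(r"\w+", text.lower())):
            index[term].add(tweet_id)
    return index

# Invented sample tweets as (id, text) pairs, for illustration only.
tweets = [
    (1, "Alles afgebroken in Haren"),
    (2, "Het lijkt wel een wereldoorlog"),
    (3, "Haren wordt opgeruimd"),
]
index = build_term_index(tweets)
print(sorted(index["haren"]))  # prints [1, 3]
```

With an index like this, each term in the visualisation could link to a page listing the stored tweets behind it, without depending on Twitter search at all.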