It starts with the data


Photo credit: @worldbankdata

What good can data do?

The World Bank and DataKind set out to explore this question during the Data Dive held March 16 and 17 in Washington DC (#data4good). People who rarely work together (coders, quants, data visualizers, procurement experts, economists, lawyers, students, senior managers, open data evangelists) ended up at the same table for 36 hours of intense work, united by their love of data. The goals were attractive. How can we measure poverty more often and more accurately? Can we detect fraud by looking at the data?

Photo credit: Jake Porway @jakeporway

It was my first time participating, and the first thing I learnt is that bringing your desktop computer into the land of laptops makes for a good conversation piece and several tweets.

The second lesson is rather a reminder: all data visualization starts with data gathering and verification. Hold your horses, get the data right. Delayed gratification is the best anyway. And delay our gratification, we did.

The World Bank has some rich and reliable data sets and, indeed, they directed us to a file with 77 dimensions for 13,628 World Bank projects between 1947 and 2013. Roughly one million data points for your viewing pleasure. The list of debarred firms was less inspiring: it had only firms currently debarred, no historical data, and the grounds for debarment had typos and structure problems. Thankfully, the wizardry of Taimur, Sameer and Jayesh meant that about halfway through the day we had a historical list scraped from the Wayback Machine of Archive.org. By the following morning, the grounds for debarment were clean.
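To give an idea of what cleaning the grounds for debarment can involve, here is a minimal sketch in Python. It is not what the team actually ran. The canonical labels and the similarity cutoff are assumptions made up for illustration, and the real list's wording is not reproduced here.

```python
import difflib
import re

# Hypothetical canonical labels, chosen only for this illustration.
CANONICAL_GROUNDS = ["fraud", "corruption", "collusion", "coercion"]


def normalize_ground(raw, cutoff=0.75):
    """Map a messy free-text ground to the closest canonical label.

    Returns None when nothing is similar enough, so that doubtful
    entries can be reviewed by hand instead of being guessed.
    """
    # Lowercase and squash anything that is not a letter into single spaces.
    cleaned = re.sub(r"[^a-z]+", " ", raw.lower()).strip()
    if not cleaned:
        return None
    matches = difflib.get_close_matches(
        cleaned, CANONICAL_GROUNDS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    for entry in ["  Coruption ", "FRAUD.", "unclear entry"]:
        print(repr(entry), "->", normalize_ground(entry))
```

Returning `None` rather than a best guess is deliberate: with information this delicate, an unmatched entry is safer than a wrongly matched one.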

Data Dive World Bank March 2013

It was not as silent as it looks. Photo credit: Neil Fantom.

But the real problem was the missing link between these two data sets. The debarment list contains no information about the project for which the firm or individual was debarred. Without it, it is impossible to explore the characteristics of projects in which cases are detected. This information exists somewhere and, in fact, it could be manually gathered from the determinations, which are made publicly available as scanned PDFs, a data person’s nightmare. Still, our three aforementioned wizards put their brains and digits to it, found an intermediary data set and, at the very end of the event, we had a debarment list with project names. I won’t link to it, however, as we did not have time to verify either the methodology or the results, and this is delicate information to get wrong.
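The linking step can be pictured as a two-hop lookup: from firm to project identifier through the intermediary data set, then from project identifier to the project record. The sketch below is only an illustration of that idea under my own assumptions. The function name, the field names and the sample rows are all invented, and it is not the method the team used.

```python
def link_debarments(debarred_firms, firm_to_project, projects):
    """Attach a project to each debarred firm via an intermediary mapping.

    debarred_firms:  list of firm names as they appear in the debarment list
    firm_to_project: intermediary mapping, lowercase firm name -> project id
    projects:        mapping, project id -> project record (a dict)

    Returns (linked, unmatched) so that every failed match stays visible
    and can be checked by hand.
    """
    linked, unmatched = [], []
    for firm in debarred_firms:
        key = firm.strip().lower()
        project_id = firm_to_project.get(key)
        project = projects.get(project_id)
        if project is None:
            unmatched.append(firm)
        else:
            linked.append((firm, project_id, project["name"]))
    return linked, unmatched


if __name__ == "__main__":
    # Entirely made-up sample rows, for illustration only.
    firms = ["Acme Works ", "Unknown Ltd"]
    intermediary = {"acme works": "P001"}
    project_table = {"P001": {"name": "Example Roads Project"}}
    print(link_debarments(firms, intermediary, project_table))
```

Keeping the unmatched firms in a separate list is the part that matters most here: it shows how much of the debarment list the intermediary data set actually covers before anyone draws conclusions from the linked rows.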

The event started Friday night, with some speeches and mingling, and finished Sunday morning with presentations. That leaves about 12 to 13 hours of work on Saturday, from 10 am to 11 pm. Receiving instructions, understanding the topic, seeing the data sets, thinking up questions for the data, figuring out the problems, brainstorming solutions, weeding out the wrong ones, implementing the promising ones, and seeing and checking the results took our group most of this precious time. We never got to the point where we could ask the questions we had early in the day. Reflecting upon the experience now, maybe we should have limited our questions to those that could be answered by the existing data. Make that the third lesson.

Data providers that make their data public would benefit from releasing it in the right format, sparing users a lot of scraping. Webpages like this were certainly created from a database in the first place, and yet we had one person spend the whole day just recreating it. World Bank: share the database. Since it is public information anyway, keep the master file on the server and update it right there.

These data issues are commonplace at such events, we’re told. I can believe it from my personal experience with data. I’m sure it’s fun to be a data visualizer fed with perfect data, but I have yet to encounter such a situation. Learning to test and clean the data is still, today, a skill that a data visualizer needs. Jon Schwabish recently started a discussion on Twitter that concluded that data processing is a defining skill of a data visualization expert, and I can only agree.


5 Responses to It starts with the data

  1. Good point about ‘missing’ data..we are getting there though! Keep pushing. Thanks so much for coming.

    • Francis G says:

      I’m glad I did. Someone said that this event might be the start of an open data revolution at the World Bank. I’d say that the revolution has started years ago thanks to people like you. Keep it up.

  2. Pingback: Diving with a view | Visual Rhetoric - Seeing is Believing

  3. Pingback: Data Volunteering with International Organizations at the World Bank

  4. Pingback: Scenes from a big data dive – the movie version | Voices from Eurasia - We help build better lives.
