Abstract for Ignite Open Data

Abstract for Ignite Open Data, an event with OCF and ODI.

SheetHub.com = open(open data portal)


SheetHub.com is a platform for sharing data, with the ambition of becoming the GitHub for data. In Taiwan, 7,000 datasets are currently hosted across 18 different portals. We are aggregating all of this data in one place by scraping, cleaning, and reorganizing it. In this talk, we will share the challenges we encountered, the technology we used, and the benefits of having all 7,000 datasets in one place.

The challenge

Data exists in many formats.

For tables, there are .csv, .tsv, .xls, .xlsx, .json, .xml, .pdf, .doc, and .docx formats. For geographical data, there are .geojson, .topojson, .kmz, .kml, and .shp formats. This is a major problem for developers and a nightmare for the average user (such as a journalist).

Open data portals also come in many different forms. We discovered that some of the files they publish do not comply with the de facto standards, such as RFC 4180 for .csv files. Even when a file does comply with the standard, the portal can be quite opinionated about the way it serves the data, making the data difficult to use.
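
One symptom of a .csv file that does not follow RFC 4180 is a record whose field count differs from the header, typically because a field containing a comma was not quoted. The following is a minimal sketch of such a check using Python's standard csv module. The function name inconsistent_rows and the sample strings are made up for illustration, and this is not the check SheetHub.com itself performs.

```python
import csv
import io


def inconsistent_rows(text: str) -> list[int]:
    """Return 1-based record numbers whose field count differs from the first record."""
    # newline="" keeps the original line endings so the csv module can
    # handle line breaks inside quoted fields itself.
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return []
    expected = len(rows[0])
    return [i for i, row in enumerate(rows, start=1) if len(row) != expected]


# The comma inside the second field is quoted, so the record has two fields.
good = 'name,note\r\n"Taipei","capital, north"\r\n'
# The same comma is unquoted here, so the record is read as three fields.
bad = 'name,note\r\nTaipei,capital, north\r\n'

print(inconsistent_rows(good))  # []
print(inconsistent_rows(bad))   # [2]
```

A check like this only catches one class of problem. Encoding issues and non-standard delimiters would need separate handling.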

The technology

The 5-star Open Data deployment scheme defines five levels of data quality:

  • Level 1: Data is available under an open license
  • Level 2: Data is structured
  • Level 3: Data uses non-proprietary formats
  • Level 4: Data uses URIs to denote things
  • Level 5: Data is linked to other data

SheetHub.com transforms two-star data into five-star data. It can parse a wide variety of formats, such as Excel, .csv, .xml, .shp, and .kml. Once a dataset is parsed, SheetHub.com splits it into rows and assigns each record a unique URL. It then compares every row against existing reference data and helps users create linked data.
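
The steps described above (split into rows, give each record a URL, match against reference data) can be sketched roughly as follows. Everything here is an assumption made for illustration: the example.org base URL, the /rows/<n> path layout, the function names, and the exact-match lookup are hypothetical and do not describe SheetHub.com's actual URL scheme or matching logic.

```python
import csv
import io
from urllib.parse import quote

# Placeholder base URL (hypothetical, not SheetHub.com's real scheme).
BASE_URL = "https://example.org"


def rows_with_urls(csv_text, owner, dataset):
    """Split a CSV into records and give each one a URL based on its position."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    records = []
    for index, row in enumerate(reader, start=1):
        url = f"{BASE_URL}/{quote(owner)}/{quote(dataset)}/rows/{index}"
        records.append({"url": url, "fields": dict(row)})
    return records


def link_to_reference(records, column, reference):
    """Attach a reference URI to each record whose value in `column` is a known name."""
    # Normalize lightly so that "Taipei " and "taipei" hit the same entry.
    lookup = {name.strip().lower(): uri for name, uri in reference.items()}
    for record in records:
        value = record["fields"].get(column) or ""
        record["same_as"] = lookup.get(value.strip().lower())
    return records


# Toy input and a toy reference table, both invented for this example.
data = "name,kind\r\nTaipei,city\r\nUnknown Place,city\r\n"
reference = {"Taipei": "https://example.org/ref/taipei"}

records = link_to_reference(rows_with_urls(data, "demo", "cities"), "name", reference)
for record in records:
    print(record["url"], record["same_as"])
```

Rows that find no match keep same_as set to None, which is where a user would step in to confirm or correct a link by hand.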

We are also developing refine.ls, an in-browser tool for cleaning up messy data. It is similar to Google Refine (now OpenRefine), but takes a programming-first approach based on JavaScript/LiveScript. The result is a set of small, very flexible programs for cleaning and manipulating data.
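
To give a feel for what "small, very flexible programs" can mean, here is a sketch of composable cleaning steps. It is written in Python rather than JavaScript/LiveScript, and none of the names (strip_whitespace, drop_thousands_separator, pipeline) come from refine.ls; they are hypothetical and only illustrate the style.

```python
from functools import reduce


def strip_whitespace(row):
    """Trim leading and trailing whitespace from every value in a row."""
    return {key: value.strip() for key, value in row.items()}


def drop_thousands_separator(column):
    """Build a step that removes commas from one column, e.g. "1,234" -> "1234"."""
    def step(row):
        return {**row, column: row[column].replace(",", "")}
    return step


def pipeline(*steps):
    """Combine steps into one function that applies them, in order, to each row."""
    def run(rows):
        return [reduce(lambda row, step: step(row), steps, row) for row in rows]
    return run


clean = pipeline(strip_whitespace, drop_thousands_separator("count"))
print(clean([{"name": " Taipei ", "count": "1,234 "}]))
# [{'name': 'Taipei', 'count': '1234'}]
```

Each step is a plain function from row to row, so steps can be reordered, reused across datasets, or tested on their own.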

The benefit

The New York Times reports that the average data scientist spends 50% to 80% of their time on "janitor work," such as searching for, cleaning, and manipulating data. These are (expensive) experts with advanced degrees, whose profession has been called "The Sexiest Job of the 21st Century." By providing "sanitized data," SheetHub.com has the potential to boost the productivity of the whole industry by a factor of 2 to 5.
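
The factor of 2 to 5 follows from the 50% to 80% figure under one simplifying assumption: the janitor work disappears entirely and the remaining work takes as long as before. If a fraction f of the time goes to janitor work, the same output then takes only 1 - f of the time, a speedup of 1 / (1 - f). The short sketch below works through the two endpoints; the function name speedup is ours, and the result is best read as an upper bound.

```python
def speedup(janitor_fraction: float) -> float:
    """Speedup if the given fraction of working time were eliminated entirely."""
    if not 0 <= janitor_fraction < 1:
        raise ValueError("janitor_fraction must be in [0, 1)")
    return 1 / (1 - janitor_fraction)


for fraction in (0.5, 0.8):
    print(fraction, round(speedup(fraction), 2))
# 0.5 2.0
# 0.8 5.0
```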