Quest for a broad categorization of publishable government data

I’m trying to build a big picture that unites the general themes of public sector data with the general classes of applications, application ideas and other uses of public data. My three big questions are:

  • What government data is in highest demand?
  • What data do governments currently provide, i.e. what is the general content of the data catalogues?
  • What are the general categories of publishable datasets that governments typically have?

I'm asking for your help in filling in the blanks in the yellow central column.

The questions above continue the discussion started by Steven Clift over a year ago on the public-egov-ig mailing list. I have summarised the e-mail discussion and done some further research on the topic. Here are some of my findings.

Demand: Most popular, most important, most valuable, most useful?

Within 45 days, each agency shall identify and publish online in an open format at least three high-value data sets.

Open Government Directive, 8.12.2009

Which datasets are “high-value”? Weather data and geospatial data, for example, are often mentioned as cases where re-using the data has direct business potential. Besides commercial re-use, there are other ways of valuing the data, such as public utility (for example public transport data), scientific research, transparency and democracy, and efficiency and cost savings for the government.

One way to approach the question is to ask citizens (or businesses or programmers) what they want. The City of Berlin has polled its citizens on which themes of PSI it should release as open data first. People chose 3 themes from a list of 20 options; the preliminary results show that the top themes were city planning, administration, environment, controlling, infrastructure, and population. A Spanish report on eGov (page 224) states that the two most important information sets at the regional level for citizens are organisation charts and public job vacancies.

Polling potential re-users in this way gives useful insights, but some things do not show up in the results. We should keep two things in mind: a) what will be popular in the future might not be popular today, and b) even if some data is not used by many people, it may lead to developments that have high value for society.

Chris Beer writes:

“In Australia the “most important” PSI based on citizen usage of services using the PSI is Weather information. Followed by (in no order) Tax, Employment and Social Security. … I’m betting that if you got similar information from other places you’d see a pattern emerge: people are interested in information that affects them.”

The pattern mentioned by Chris is easy to verify. I ran a few trials with Google Insights for Search to see which keywords people use in Google searches together with the word “tilasto” (“statistic” in Finnish): weather, housing prices, temperature, salary.

Chris continues:

“Does popular = important? Personally I think it’s all vital, and the concept of openness and transparency means ultimately, where possible, it all needs to go up, regardless of if it gets used regularly or not. … Thought for the day: Only one person may ever look at a dataset relating to GIS data and observatories for instance, but that one person might be Stephen Hawking, and that one use of that one dataset by that one person might change the world.”

Most people working with government open data initiatives (myself included) agree with Chris’s idea that “it all needs to go up”. Still, there is the burning question of prioritising and focusing: what to do first when there are no resources to do everything at once?

The wiki page of the Civic Commons project puts it well: Any open data initiative will begin with the question, “What do we open first?” or, “What’s the most important thing for us to open?”. To address those questions the site lists four strategies for setting open data priorities: Evidence, Feedback, Low Hanging Fruit, and High Return on Investment (ROI).

I would suggest using all these strategies simultaneously. The high-ROI strategy is especially important, since the feedback and evidence collected after releasing only the “low hanging fruits” may well be quite discouraging (see for example: Info released under Obama transparency order is of little value, critics say).

My own contribution to research on the demand side is a screen-scraped collection of open data apps from several application catalogues (including,, and others). The initial observation is that location-based applications are the most common, so public transportation data, safety-related data and other location information are the most “demanded” among the current applications. Deeper analysis is a work in progress, but it can be said that apps based on open data are generally still at an early stage of development.
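To illustrate how such a collection could be tallied, here is a minimal sketch. It assumes each scraped app has already been tagged by hand with the data themes it relies on; the record layout, the function name `theme_counts` and the sample entries are all hypothetical placeholders, not the actual collection or its results.

```python
from collections import Counter

# Hypothetical records: each scraped app entry lists the data themes it uses.
# These entries are illustrative placeholders, not data from the real collection.
apps = [
    {"name": "app-a", "themes": ["public transport", "location"]},
    {"name": "app-b", "themes": ["safety", "location"]},
    {"name": "app-c", "themes": ["public transport"]},
]


def theme_counts(records):
    """Count how many apps use each theme (a theme counts once per app)."""
    counts = Counter()
    for record in records:
        # set() guards against the same theme being listed twice for one app
        counts.update(set(record["themes"]))
    return counts


# Themes ordered from most to least used across the apps
ranking = theme_counts(apps).most_common()
print(ranking)
```

Counting a theme once per app keeps a single heavily tagged app from dominating the ranking. Such a count only reflects what current apps use, so it says nothing about demand for data that has not been released yet.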

Supply: Catalogues

Over the past few years we have witnessed a wave of national, regional, city-wide and independent data catalogues being launched all over the world (see catalogues from EU member states and international list of data catalogues). This has created interest in analysing their content, comparing the catalogues and building unifying search interfaces to them.

The Linked Data Research Centre at DERI has created a simple faceted browser over the catalog, while Li Ding and Jim Hendler from the Rensselaer Polytechnic Institute have visualised the content as a Cloud of Government Data.

Koumenides et al. (2010) researched the keyword annotations in four separate data catalogues (among them OPSI’s Information Asset Registers and Australia’s catalogue). They observed noteworthy variations between the four. In one catalogue, terms such as “health” and “social care” were the prevailing annotations; another had more environment-related annotations, e.g. “toxic release”, “chemical release”, “facilities”, etc. Australia’s catalogue gives emphasis to “education” and “environmental management”. OPSI, on the other hand, concentrates on governmental affairs, e.g. “office services”, “supplier contracts”, “complaints”, etc.

The current data catalogues mostly list datasets that are already open, so they represent only a “part of the whole”. It would help the discussion, and the process of prioritising what to open, if the “whole” were also represented somehow. The realm of potentially publishable datasets that governments typically hold (data that is not sensitive by nature, i.e. for privacy or security reasons) is vast but not unlimited.

It should be possible to build a general-level categorisation stating that a country (or a city) most probably holds publishable information in themes x, y, z and so on. Of course countries and cities differ and collect different information, and we could dive deep into questions like “what is considered government data (big databases vs. individual Excel sheets used by a few civil servants)?” and “how do we define a dataset (are housing price statistics for 2010 and 2011 different datasets)?”

Jonathan Gray writes:

Some kind of basic visual ‘map’ to PSI would be interesting, so that the public holdings of different countries could be directly compared and so you could get a sense of what was available and what was missing in different countries.

The report of the 2006 MEPSIR study is one good starting point for general categories of public sector information (PSI). The study divides PSI into six domains and several sub-domains. The main domains are: Business information, Geographic information, Legal information, Meteorological information, Social data and Transport information (see p. 25 chapter 8: Overview of results for the domains).

Besides MEPSIR, are there any other ready-made categorisations? I’m especially interested in city-level public data.

4 responses to “Quest for a broad categorization of publishable government data”

  1. For some ideas on categorization/taxonomy, take a look at JHS145_liite1.pdf.

  2. Hi Samuel, thanks for the hints. On a quick look, the categorization seems to be based on the functions/services of different public bodies. In open data the categorization should be closely tied to the actual content of the data, because people may use the same data for very different purposes. Anyway, this is better than nothing: from many of the function categories it is possible to guess the content categories. I have to look at this more closely.
