Turning open data into usable data

Figby, the magical data processing unicorn.

Figure.NZ takes data released by government agencies and other organisations and transforms it into usable forms like the tables, charts and maps you see on our website. So far, we’ve published 17,506,536 data points in 1,512 tables, and made 39,212 charts and 686 maps from them.

Our goal is to make it easy for anyone, regardless of their technical expertise or knowledge of the New Zealand data landscape, to find and use figures about New Zealand in formats they understand.

But what does that actually mean? What form do we get the data in, and what do we actually do to make it usable?

Where data comes from

When we track down data on government sites, it’s most often in Excel spreadsheets and sometimes in CSV files. Most of the time, the agencies make these files for their internal use; they’re for sharing figures with another person inside the agency so they’re designed for humans to read, and they work well for this.

The problem comes when we try and use machines to access the data. The files are often formatted in ways that make it hard to use them for other things, like making charts or maps, doing calculations, turning the file into a table in your database, or comparing with other datasets. This is because they have things like formatting to show information (like shading a cell yellow to show that there’s a note about the content), metadata embedded with the numbers (like an asterisk), tables across multiple pages, or they use slightly different terms to describe the same concept — especially when you’re looking at data from different providers.

Screenshot of ‘Oil Supply, Transformation, and Demand’ spreadsheet from a government agency. The spreadsheet uses indentation within a cell to visually show a relationship and footnotes — things a computer can’t interpret easily.

The formatting of the spreadsheets can mean that it’s super hard for a computer to tell what information like categories or subcategories (like fuel type in the screenshot above) are associated with each cell value. This is bad news if you’re a programmer, data scientist, or someone else trying to make something from this spreadsheet — you have to do a lot of hard work to make it useful.

Transforming data

Figure.NZ has built a system called Grace. It enables us to get the figures out of these files so we can turn them into something more useful. You can rest assured that what you see is what was in the original file because Grace doesn’t have the ability to edit the numbers when we process a spreadsheet.

However, while we’re processing them into database tables, we also do some other cool stuff that’s not supported by spreadsheets.

Typing Data

Excel’s ‘Format Cells’ options

‘Typing data’ is a technical term that means assigning a value to a particular categorisation. You might be familiar with typing in Excel where you can state that a cell range is a number or currency, and what format it takes — Excel has types like text, number, currency, and percentage, and these dictate how the cell content is formatted and interpreted (number of decimal places, alignment, showing a % sign).

Grace can do all of that, but it also has the ability to create custom types that are specific to the sorts of data we process. These range from classifications created by agencies like Stats NZ (such as regions, territorial authorities, types of industry), to things that describe numerical concepts (like age ranges) to categories of objects (like types of alcoholic beverage or ports in New Zealand).

We use two techniques to type data in Grace: units and enumerations.

Units

Grace has over 50 kinds of unit. These units are more complex than the normal ‘types’ you find in Excel or in a programming language.

A sample of the unit types we have in Grace.

Units enable us to say that figures we process (regardless of where they’re from) describe the same concept, whether that’s age, gender, region, job, or something else. This means they can be compared with each other. These concepts are things like NZD, age range, region, territorial authority, and types of alcoholic beverage. They also enable us to format figures and labels on charts in a consistent way, correcting for spelling mistakes and inconsistencies.

Every cell in a spreadsheet we process is assigned a unit. Our code has special instructions that tell us what to do with numbers assigned to each unit so we can turn them into useful values. For example, a spreadsheet might have a set of numbers that are age ranges, like ‘55–64’ and ‘above 64’. Our code recognises that for age ranges, ‘above 64’ is the same thing as ‘65 and above’ and ‘65+’, so when we create our table we change all of these labels to ‘65+’ so they’re the same for all data we hold. This doesn’t mean we combine different age ranges across different datasets (that’s a risky proposition because of rounding and confidentialisation), just that we use consistent naming structures.

Example of a chart created using the age range unit.

Each unit can also be assigned other properties, such as a custom sorting order for when the data is displayed on a chart. For example, our age range unit knows that it needs to sort ‘5–9’ before ‘10–14’. Without this, the default sorting is alphabetic and sorts ‘10–14’ before ‘5–9’ because 10 starts with a 1. This makes for confusing charts! Another example is region which has a custom sorting option that lists regions in order from the top to the bottom of the country.

There are 3 different types of unit in Grace:

  • Basic units (like you’d find in Excel or programming languages where something is a % or text or a number)
  • Complex units like ‘age group’ where we can specify that a range of numbers is an age range.
  • Enumerated units.

Enumerations

Some units are enumerated. In other words, we have a list of values that are valid for that unit. The ‘Region’ unit has a list of every region in NZ. When we assign the region unit to a table, Grace checks that only valid regions are present, and then links each figure up so we know that it is a region and which region it is. This means when we process the figure for the Wellington Region population, we know that it’s a number, and that it talks about a region, and that the region is the Wellington Region (and it’s the same Wellington Region other figures talk about). Grace is also smart enough to know that the label ‘Wellington’ in a list of regions means ‘Wellington Region’ even if it doesn’t say ‘region’ in the label.

Some of the tables with figures for the Wellington Region.

We have a lot of enumerated units, and many of them describe very specific concepts in NZ data collection like trophic state (a measure of the biological condition of lakes) and sentence (types of sentences issued by the justice system).

We’ve just added the ability to provide a definition for each item in an enumeration so that soon when we link a figure to a concept like the occupation ‘keyboard operator’, anyone who downloads or looks at a chart made from that data will be able to see exactly what the definition is without needing to go and look it up in another system.

Why do we do this?

The short answer is to make the figures easier for everyone to use! Typing the figures we process gives us a whole bunch of benefits, but two of the biggest ones are linking data and standardisation.

Linking the figures to a concept as defined by a unit enables us to do cool things like ‘Give me all the data from any provider that talks about the Wellington Region’. Or about women. Or about beer, or dogs, or diseases. It also enables us to build relationships between concepts (like occupations and industries), but we’ll talk more about that in another post.

Standardisation enables us to deal with the fragmentation in data in NZ. Sometimes we might start with 2 spreadsheets from different providers where one talks about Whanganui and one about Wanganui. They’re both talking about the same thing — Whanganui. The unit system enables us to standardise them both to Whanganui. It also makes it easy to update names. For example, if Auckland decided to use its Māori name, Tāmaki Makaurau, we only need to update it once and all the tables, charts, and maps would be updated too. This makes it easier to find relevant data.

For data experts, there’s another bonus: when you use data from Figure.NZ and you need to link two datasets together, you can be confident they’ll use the same spelling and formatting, regardless of the source. And you’ll always be able to double-check and see exactly what the source provided us — we link back to the source, and provide a copy of the spreadsheet we downloaded.

It also allows us to build things like Business Figures, where you can tell us your business or job type and where it is, and we’ll return all the relevant data we hold.

TL;DR:

We get spreadsheets formatted for humans to read, but not for them to use. We give you long-form tables with the exact same values extracted, typed, and with metadata attached correctly in a handy CSV file so you can take the figures and do awesome things with them without doing the boring tidying up work.

It also allows us to make lots of charts and maps very quickly, so that we can share the amazing stories of NZ held in the figures.

Before:

After:

And a chart made from it:

Chart of oil consumed for land transport in New Zealand. Source: Figure.NZ, license CC-BY-3.0

Turning open data into usable data was originally published in It Figures on Medium, where people are continuing the conversation by highlighting and responding to this story.

Powered by WPeMatico

Please follow and like us:
error0