Data Dictionaries

From WPRDC Wiki
Revision as of 20:44, 9 March 2022 by DRW (Move table out of list to improve parsing of Markdown)

What is a data dictionary?

Much like a regular dictionary defines words and tells the reader how to use them, a data dictionary explains all the columns (fields) of your data with enough detail that the reader can start using your data.

It lists each field name that appears in the table of data, provides a definition for the field, specifies the type of data in that field (e.g., "string" or "integer"), and gives an example value that could occur in that field.

Why are data dictionaries important?

Published data often has field names that are deliberately short, and they may wind up being cryptic or unclear, particularly to someone unfamiliar with the topic that the data describes. The field might be "weight", but if the value is 8, the user won't know if the item weighs 8 pounds or 8 kilograms or 8 tons.

Data dictionaries are a simple way to explain such details and make the data you publish more accessible to other people.

Suggested formats

Our integrated data dictionaries describe each column of your data using four fields.

The fields we recommend are:

  • "column": The name of the field. (Preferably formatted in all lowercase letters and with underscores instead of spaces or other punctuation.)
  • "type": The type of the field, coded as something like "text", "float", or "int". Table 1 below gives the full list of types for values that go into the CKAN datastore (which the WPRDC data portal runs on). Note that with CKAN's integrated data dictionaries, the type of a data dictionary entry comes from the type of the field in the CKAN datastore, so you shouldn't have to set it yourself. The exception is a non-ETL upload where you need to override the type that CKAN has inferred for the field.
  • "label": A short human-readable label for this field. If the field name is zip, an appropriate label might be "ZIP Code" or "Postal Code".
  • "description": A definition of the field (you could, for instance, include in here that the units of the field value are furlongs). This is also the place to put any other information relevant to the field, including information about how the field was calculated from another field or how the field was transformed for publication.

Table 1

CKAN type    Description
text         text string
int          integer
float        real number
boolean      a Boolean value (True or False)
date         a date without a time
time         a time without a date
timestamp    a date and a time together (a.k.a., a "datetime")
json         a JSON representation of some data (superuseful but by far the most obscure type on this list)
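
As a rough illustration of the four recommended fields, here is a minimal Python sketch that builds a small data dictionary and writes it out as CSV text. The entries are hypothetical (an imaginary dataset with "walrus_count", "weight", and "zip" fields), and the function name dictionary_to_csv is made up for this example. The header row simply reuses the four field names listed above; it is not necessarily the exact layout that any particular upload tool expects.

```python
import csv
import io

# CKAN datastore types listed in Table 1.
CKAN_TYPES = {"text", "int", "float", "boolean", "date", "time", "timestamp", "json"}

# Hypothetical entries for an imaginary dataset (illustration only).
ENTRIES = [
    {"column": "walrus_count", "type": "int", "label": "Walrus Count",
     "description": "Number of walruses observed during the survey."},
    {"column": "weight", "type": "float", "label": "Weight",
     "description": "Weight of the item, in kilograms."},
    {"column": "zip", "type": "text", "label": "ZIP Code",
     "description": "Five-digit ZIP Code of the survey location."},
]


def dictionary_to_csv(entries):
    """Return the data dictionary as CSV text, rejecting types not in Table 1."""
    for entry in entries:
        if entry["type"] not in CKAN_TYPES:
            raise ValueError(
                f"Unknown CKAN type {entry['type']!r} for field {entry['column']!r}"
            )
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["column", "type", "label", "description"]
    )
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


csv_text = dictionary_to_csv(ENTRIES)
print(csv_text)
```

Note that the "weight" description states the units, which is exactly the kind of detail the "description" field is meant to capture.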

We also like the Frictionless Data JSON Table Schema example approach to data dictionaries, but we're not quite ready for that yet.

Other suggestions: We like to name fields by making all the letters lowercase and converting spaces and other punctuation to the underscore character (_). So, we would convert the field name "Walrus Count" to the name "walrus_count". This is called "snake case".
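
The naming convention above can be sketched as a small Python helper. This is one possible interpretation, not an official WPRDC tool: the function name to_snake_case is hypothetical, and the sketch assumes that runs of consecutive spaces or punctuation should collapse into a single underscore and that leading or trailing underscores should be dropped. It only keeps the ASCII letters a-z and the digits 0-9, so accented letters would also be replaced.

```python
import re


def to_snake_case(name):
    """Lowercase the name and turn runs of spaces/punctuation into single underscores."""
    lowered = name.strip().lower()
    # Replace every run of characters other than a-z and 0-9 with one underscore.
    underscored = re.sub(r"[^a-z0-9]+", "_", lowered)
    # Drop underscores left over at either end (e.g., from trailing punctuation).
    return underscored.strip("_")


print(to_snake_case("Walrus Count"))  # walrus_count
```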

How to create a data dictionary

Option 1 (preferred and easiest): Use our fancy new integrated data dictionaries, which you can create through the CKAN web interface.

Option 2: You can use a spreadsheet program and then export the results to CSV.

If you do it this way, check it over by opening it in a text editor, to make sure that Excel didn't format anything (like dates) weirdly.
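
Alongside the visual check in a text editor, a short script can catch structural problems in the exported file. The sketch below is only an illustration: the function name check_dictionary_csv is hypothetical, and it assumes the CSV uses the four recommended field names as its header row. It does not detect reformatted dates or other changes to values, so it complements the manual check rather than replacing it.

```python
import csv
import io

# Assumed header: the four recommended data-dictionary fields.
EXPECTED_HEADER = ["column", "type", "label", "description"]


def check_dictionary_csv(text):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append(f"unexpected header: {rows[0] if rows else None}")
        return problems
    # The header is row 1, so data rows are numbered from 2.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append(
                f"row {row_number}: expected 4 values, found {len(row)}"
            )
        elif not row[0].strip():
            problems.append(f"row {row_number}: empty column name")
    return problems


# Made-up sample in which the last row is missing its description.
sample = (
    "column,type,label,description\n"
    "weight,float,Weight,Weight in kilograms\n"
    "zip,text,ZIP Code\n"
)
for problem in check_dictionary_csv(sample):
    print(problem)
```

To check a real export, read the file's contents into a string first. Opening it with encoding="utf-8-sig" strips the byte-order mark that some spreadsheet programs add, which would otherwise make the header comparison fail.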

Option 3: You can type it up by hand. It's not that hard if you have an example. (See here for examples about books.)

Option 4: Use this handy Python script I wrote: little-lexicographer.

Integrated data dictionaries

Our latest version of the data-portal software includes nifty built-in data-dictionary capabilities. As the publisher, you can edit the data dictionary through the management interface and then the user can view it right below the corresponding data table.

How to edit a resource's integrated data dictionary

1) From the resource page, click the “Manage” button.
2) Click on the “Data Dictionary” tab. You will see a long form with selectors and blanks for each field.
3) Optional: Use the “Type Override” selector to change the types for any fields that need to be changed.
4) You can also provide human-readable names for the fields in the “Label” blank and a longer description in the catch-all “Description” field.
5) Click “Save” at the bottom of the page.

Uploading integrated data dictionaries

Little lexicographer supports uploading properly formatted CSV files to the integrated data dictionary of an existing resource: https://github.com/WPRDC/little-lexicographer#uploading-integrated-data-dictionaries

Beyond data dictionaries

Some datasets benefit from extended documentation. For these, we have Data Guides!

We can also recommend the Datasheets for Datasets standard for a comprehensive approach to documenting data that you are publishing.