ETL

From WPRDC Wiki
Revision as of 20:52, 1 March 2022 by DRW (talk | contribs) (Add byte-order mark warning)
Jump to navigation Jump to search

ETL overview

ETL (an acronym for "Extract-Transform-Load") describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.

Most WPRDC ETL processes are written in rocket-etl, an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the Western Pennsylvania Regional Data Center open-data portal. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers, GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.

Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.

Writing ETL jobs

A useful tool for writing ETL jobs is Little Lexicographer. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.

Pitfalls

The byte-order mark showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is "Never use Excel".