Revision as of 18:27, 2 March 2022
ETL overview
ETL (an acronym for "Extract-Transform-Load") describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.
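As a minimal illustration of the pattern (hypothetical code, not rocket-etl itself), an ETL step reduces to three functions chained together:

```python
import csv
import io

def extract(source):
    """Extract: read rows from a source (a CSV string stands in for a file or URL)."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    """Transform: clean each row (here, upper-casing a name field)."""
    return [{**row, "name": row["name"].upper()} for row in rows]

def load(rows, destination):
    """Load: deliver rows to a destination (a list stands in for the CKAN datastore)."""
    destination.extend(rows)

source = "name,count\npittsburgh,3\n"
destination = []
load(transform(extract(source)), destination)
print(destination)
```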
Most WPRDC ETL processes are written in rocket-etl, an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the Western Pennsylvania Regional Data Center open-data portal. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers, GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.
Writing ETL jobs
A useful tool for writing ETL jobs is Little Lexicographer. While it was initially designed just to facilitate the writing of data dictionaries (by scanning each column, trying to determine the best type for it, and then dumping field names and types into a data dictionary template), Little Lexicographer can now also output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Little Lexicographer is also often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important for codes that may have leading zeros, which would be lost if the value were cast to an integer.
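The leading-zero hazard is easy to demonstrate: casting a ZIP code to an integer and back silently corrupts it, which is why code-like fields should get a string type in the schema even when type inference guesses integer:

```python
zip_codes = ["15213", "02134"]  # the second has a leading zero

# Naive numeric casting (what an over-eager type inference would do):
as_integers = [int(z) for z in zip_codes]
round_tripped = [str(n) for n in as_integers]

print(round_tripped)  # ['15213', '2134'] -- "02134" has lost its leading zero
```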
Snake case
Whenever possible, format column names in snake case. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so "FIELD NAME" becomes "field_name" and "# of pirates" should be changed to "number_of_pirates"). Reasons we prefer snake case: 1) Marshmallow already converts field names to snake case to some extent automatically. 2) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).
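A small helper covering the conversions described above might look like this (a sketch; not part of rocket-etl, and the "#" handling is one reasonable choice among several):

```python
import re

def to_snake_case(name):
    """Convert a column name to snake case: lower-case it, expand '#' to
    'number', and collapse runs of spaces/punctuation into single underscores."""
    name = name.replace("#", " number ")
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)  # spaces and punctuation -> underscores
    return name.strip("_").lower()

print(to_snake_case("FIELD NAME"))    # field_name
print(to_snake_case("# of pirates"))  # number_of_pirates
```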
Pitfalls
The byte-order mark showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is "Never use Excel".
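In Python, the simplest defense is to open possibly-Excel-derived files with the utf-8-sig codec, which consumes a leading BOM if present and is harmless if none is there:

```python
import csv
import io

# "\ufeff" is the byte-order mark Excel prepends; without utf-8-sig it would
# become part of the first field name ("\ufeffname" instead of "name").
raw_bytes = "\ufeffname,zip\nAlice,15213\n".encode("utf-8")

with io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8-sig") as f:
    reader = csv.DictReader(f)
    print(reader.fieldnames)  # ['name', 'zip'] -- BOM stripped
```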
Testing ETL jobs
Typical initial tests of a rocket-etl job can be invoked like this:
> python launchpad.py engine/payload/<name for publisher/project>/<script name>.py mute to_file

where the mute parameter prevents errors from being sent to the "etl-hell" Slack channel and the to_file parameter writes the output to the default location for the job in question. For instance, the job

> python launchpad.py engine/payload/robopgh/census.py mute to_file

would write its output to a file in the directory <PATH TO rocket-etl>/output_files/robopgh/. Note that the namespacing convention routes the output of robopgh jobs to a different directory than that of wormpgh jobs, but if there were two jobs in the robopgh payload folder that wrote to population.csv, each would overwrite the other's output. Since this namespacing exists for the convenience of testing and development, this level of collision avoidance seems sufficient for now. You can always override the default output file name by specifying the 'destination_file' parameter in the dict of parameters that defines the job (found in, for instance, the robopgh/census.py file).
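As a rough sketch of what such a parameter dict can look like, only 'destination_file' is documented above; the other keys are illustrative placeholders, not rocket-etl's actual schema:

```python
# Hypothetical job-parameters dict from a payload script like robopgh/census.py.
job_dicts = [
    {
        "job_code": "census",                       # illustrative placeholder
        "source_file": "census.csv",                # illustrative placeholder
        "destination_file": "population_2020.csv",  # overrides the default output file name
    },
]
print(job_dicts[0]["destination_file"])
```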
After running the job, examine the output. VisiData is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (> vd output_files/robopgh/population.csv) and invoking Shift+F on each column to calculate the histogram of its values. This is a quick way to catch empty columns, which are a sign either that the source file contains only null values or that there's an error in your ETL code, often because there's a typo in the name of the field you're trying to load from (how marshmallow transforms field names can be non-intuitive).
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?
Does the dataset as a whole make sense? For instance, look at counts over time, either by grouping records by year+month and aggregating to counts that you can visually scan, or by plotting record counts by date or timestamp.
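Outside VisiData, the same year+month sanity check takes only a few lines of Python (the `date` field name and the inline CSV are assumptions for illustration):

```python
import csv
import io
from collections import Counter

# Stand-in CSV; in practice you'd open something like output_files/robopgh/population.csv.
data = io.StringIO("date,value\n2021-01-05,1\n2021-01-20,2\n2021-02-11,3\n")

# Group records by YYYY-MM prefix and count them.
counts = Counter(row["date"][:7] for row in csv.DictReader(data))
for month, n in sorted(counts.items()):
    print(month, n)  # a suspicious gap or spike here merits investigation
```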
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.
If you can't figure out something about the data, ask someone else and/or the publisher.
Once you're satisfied with the output data you're getting, you can rerun the job with the test parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):
> python launchpad.py engine/payload/robopgh/census.py mute test
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the test parameter is not specified) as a safety feature. The parameter that controls this setting is PRODUCTION, which can be found in the engine/parameters/local_parameters.py file and which should be defined like this:
PRODUCTION = False
Only in production environments should PRODUCTION be set to True.
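A minimal sketch of how such a default-to-testbed safety gate can work (hypothetical logic and dataset names, not rocket-etl's actual implementation):

```python
PRODUCTION = False  # in rocket-etl, set in engine/parameters/local_parameters.py

def choose_dataset(command_line_params):
    """Target the production dataset only on a production instance or when
    'production' is explicitly passed; otherwise fall back to the private testbed."""
    if PRODUCTION or "production" in command_line_params:
        return "production-dataset"
    return "testbed-dataset"

print(choose_dataset(["mute", "test"]))        # testbed-dataset
print(choose_dataset(["mute", "production"]))  # production-dataset
```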
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:
> python launchpad.py engine/payload/robopgh/census.py mute production
Deploying ETL jobs
(To be written.)