# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates <code>y</code> and <code>x</code> (or <code>lat</code> and <code>long</code>) but <code>latitude</code> and <code>longitude</code> are already being used by other data tables, switch to <code>latitude</code> and <code>longitude</code>.
# '''Standardize column values.''' Where possible, transform columns to standardize their values. The first step is to look at the histogram of every column (<code>Shift+F</code> in VisiData!) and see if anything is irregular. For instance, if the <code>municipality</code> column has 1038 records with <code>municipality</code> == "Pittsburgh" and two with <code>municipality</code> == "PGH", add a <code>@pre_dump</code> decorator function to the schema to change all instances of "PGH" to "Pittsburgh" (see the sketch after this list). In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semi-randomly ordered. Principles of column organization: a) '''The "input" should be on the left and the "output" should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, equivalently, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. c) '''Prioritize important stuff.''' If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left a field is, the better the chance that the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). d) '''Maximize readability.''' Think like a user. How can you order the columns so that the sequence is logical?
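Here is a minimal sketch of the value-standardization idea from the list above, using a <code>@pre_dump</code> hook in a marshmallow schema. The schema and field names are hypothetical, and the production schemas in the ETL engine may structure this differently:

<syntaxhighlight lang="python">
from marshmallow import Schema, fields, pre_dump

class AssetSchema(Schema):  # hypothetical schema name
    name = fields.String()
    municipality = fields.String()
    latitude = fields.Float()
    longitude = fields.Float()

    @pre_dump
    def standardize_municipality(self, data, **kwargs):
        # Collapse known variant spellings into the canonical value
        # (assumes each record reaches the hook as a dict).
        if data.get('municipality') == 'PGH':
            data['municipality'] = 'Pittsburgh'
        return data
</syntaxhighlight>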
''After writing this section, I discovered that some of the ideas above also appear in the Urban Institute's [https://www.urban.org/sites/default/files/publication/104296/do-no-harm-guide.pdf Do No Harm Guide: Applying Equity Awareness in Data Visualization]. While it is focused on visualizations, some of its suggestions can be applied to designing data schemas (especially if your data table includes representations of race/ethnicity). The General Recommendations on page 41 provide a good overview.''


=== Pitfalls ===
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:
<code>> python launchpad.py engine/payload/robopgh/census.py mute production</code>
=== Schema-source comparison ===
When running ETL jobs, you will sometimes see console output indicating that 1) fields in the source file are not being used in the schema or 2) fields in the schema cannot be found in the source file. These are checks to ensure that the schema matches and accounts for all the fields in the source file. (While marshmallow does have some support for these kinds of comparisons, we've written our own code to handle them.)
If there's a field in the source file that you don't want to publish (and you don't want to keep getting console output about it), you can either list it in the <code>exclude</code> option in the schema's <code>Meta</code> class, or add the field to the schema but set it to <code>load_only=True</code> (see the sketch below).
If there's a field in the schema that is supposed to be loaded from the source file and published ("dumped", in marshmallow jargon) but can't be found in the source file, an error message will be printed to the console. If, additionally, the job is trying to push data to the CKAN datastore, an exception will be raised.
These checks are really helpful when writing/testing/modifying an ETL job, as they make it easy to find typos in field names or other errors that are preventing source data from getting to the output correctly.
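As a rough illustration of the two options for suppressing an unwanted source field (the schema and field names here are hypothetical, and the exact semantics of <code>exclude</code> depend on the ETL framework's own schema/source-comparison code):

<syntaxhighlight lang="python">
from marshmallow import Schema, fields

class InspectionSchema(Schema):  # hypothetical schema
    facility_name = fields.String()
    inspection_result = fields.String()
    # Option 2: load this column from the source file, but never dump
    # (publish) it, by marking it load_only.
    inspector_notes = fields.String(load_only=True)
    internal_id = fields.String()

    class Meta:
        # Option 1: list the unwanted column in exclude.
        exclude = ('internal_id',)
</syntaxhighlight>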


== Deploying ETL jobs ==
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the <code>test</code> or <code>to_file</code> command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.
# Schedule the job by writing a cron job: run <code>> crontab -e</code>, duplicate a launchpad line that's already in the crontab file, edit it to run the new ETL job, and adjust the schedule to match the desired ETL schedule (see the example entry below).
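For example, a crontab entry for the job shown earlier might look something like this (the repository path and the daily 6:10 a.m. schedule are placeholders, not the actual production values):

<pre>
# minute hour day-of-month month day-of-week   command
10 6 * * * cd /path/to/etl-repo && python launchpad.py engine/payload/robopgh/census.py mute production
</pre>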
== Retiring ETL jobs ==
Every ETL job has a life cycle. When an ETL job comes to the end of its existence because the data source has disappeared, we designate the dataset "orphaned". Metadata values for the update-frequency fields should be changed to the value that has "Historical" in its description. The dataset's <code>_etl</code> tag should be removed and replaced with the <code>_orphaned_etl</code> tag (a sketch of one way to make that swap appears below). If there is still a manual inventory of ETL jobs, that inventory should be updated.
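Here is one way the tag swap might be scripted with the ckanapi library (the site URL, API key handling, and dataset ID are assumptions; the update-frequency fields are left out of the code because their names depend on the portal's metadata schema):

<syntaxhighlight lang="python">
from ckanapi import RemoteCKAN

site = RemoteCKAN('https://data.wprdc.org', apikey='YOUR-API-KEY')  # assumed site URL
package = site.action.package_show(id='some-orphaned-dataset')  # hypothetical dataset ID

# Replace the "_etl" tag with "_orphaned_etl", keeping all other tags.
tags = [t for t in package['tags'] if t['name'] != '_etl']
tags.append({'name': '_orphaned_etl'})
site.action.package_patch(id=package['id'], tags=tags)

# The update-frequency metadata fields would also be patched to the
# "Historical" value here, using whatever field names the portal defines.
</syntaxhighlight>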
== Inventorying ETL jobs ==
When you add, delete, or modify ETL jobs, update [https://docs.google.com/spreadsheets/d/1amPsZ1RA9d9m41MXkMmMB8-VDmwAOxU72A35DlSWbT4/edit#gid=0 this Google Sheet]. (Eventually, we'll make ETL jobs self-tracking somehow, probably by adding the location of the script (and any other useful metadata) to the package <code>extras</code> metadata field.)
Currently, ETL jobs report metadata such as the <code>last_etl_update</code> date and the hash of the source file. Even the name of the last source file is being used in the new Data Rivers/Google Cloud Platform pipeline to check whether a given file is newer or older than the last one used.
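Purely as an illustration of the self-tracking idea mentioned above, the script location could be written into the package's <code>extras</code> metadata through the CKAN API. Everything here (dataset ID, extras key, site URL) is hypothetical:

<syntaxhighlight lang="python">
from ckanapi import RemoteCKAN

site = RemoteCKAN('https://data.wprdc.org', apikey='YOUR-API-KEY')  # assumed site URL
package = site.action.package_show(id='some-etl-backed-dataset')  # hypothetical dataset ID

# Overwrite (or add) an "etl_script" extra recording where the job's code lives.
extras = [e for e in package.get('extras', []) if e['key'] != 'etl_script']
extras.append({'key': 'etl_script', 'value': 'engine/payload/robopgh/census.py'})
site.action.package_patch(id=package['id'], extras=extras)
</syntaxhighlight>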
[[Category:Onboarding]]
