If there's a field in the schema that is supposed to be published ("dumped" in Marshmallow jargon) and is supposed to be loaded from the source file, but it can't be found in the source file, an error message will be printed to the console. If the job is additionally trying to push data to the CKAN datastore, an exception will be raised.
These checks are really helpful when writing, testing, or modifying an ETL job, as they make it easy to find typos in field names or other errors that prevent source data from getting to the output correctly.
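A minimal sketch of how such a check can work, assuming a CSV source file (the schema class, field names, and <code>check_source_columns()</code> helper here are hypothetical illustrations, not the actual framework code):

<syntaxhighlight lang="python">
# Sketch of a schema-source comparison check. The schema class, field
# names, and helper function are illustrative, not the actual ETL
# framework code.
import csv
from marshmallow import Schema, fields

class AssetSchema(Schema):
    name = fields.String()                      # loaded from source, published
    address = fields.String()                   # loaded from source, published
    fetched_at = fields.String(dump_only=True)  # computed by the job, not expected in the source

def check_source_columns(schema, source_path, pushing_to_datastore=False):
    """Print an error for every schema field that should be loaded from
    the source file but is missing from its header row; raise an
    exception if the job is also pushing data to the CKAN datastore."""
    with open(source_path, newline="") as f:
        source_columns = set(next(csv.reader(f)))  # header row
    missing = [name for name, field in schema.fields.items()
               if not field.dump_only  # dump-only fields aren't loaded from the source
               and name not in source_columns]
    for name in missing:
        print(f"ERROR: Schema field '{name}' can't be found in the source file.")
    if missing and pushing_to_datastore:
        raise ValueError(f"Source file is missing schema fields: {missing}")

check_source_columns(AssetSchema(), "source.csv", pushing_to_datastore=True)
</syntaxhighlight>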


== Deploying ETL jobs ==
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the <code>test</code> or the <code>to_file</code> command-line parameter can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.
# Schedule the job by writing a cron job: run <code>crontab -e</code>, duplicate a launchpad line that's already in the crontab file, edit it to run the new ETL job, and edit the schedule to match the desired ETL schedule (see the sample crontab entry below).
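As a sketch, a duplicated-and-edited crontab entry might look like the following (the interpreter, launchpad script path, job name, schedule, and log path are all hypothetical placeholders, not actual production values):

<pre>
# Hypothetical entry: run the new ETL job every day at 6:10 AM.
# The script path, job name, and log path are placeholders.
10 6 * * * python /home/etl/launchpad.py new_etl_job.py >> /home/etl/log/new_etl_job.log 2>&1
</pre>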
== Retiring ETL jobs ==
Every ETL job has a life cycle. When an ETL job comes to the end of its existence because the data source has disappeared, we designate its dataset "orphaned". The metadata values for the update-frequency fields should be changed to the value that has "Historical" in its description, and the dataset's "_etl" tag should be removed and replaced with the "_orphaned_etl" tag. If there is still a manual inventory of ETL jobs, that inventory should be updated.
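A minimal sketch of the tag swap, assuming it's done through the CKAN API with the <code>ckanapi</code> library (the site URL, API key, and dataset ID are hypothetical placeholders; the update-frequency fields can be changed the same way or through the web interface):

<syntaxhighlight lang="python">
# Hypothetical sketch of retiring a dataset: swap the "_etl" tag for
# "_orphaned_etl" via the CKAN API. The site URL, API key, and dataset
# ID are placeholders, not actual production values.
from ckanapi import RemoteCKAN

ckan = RemoteCKAN("https://data.example.org", apikey="YOUR-API-KEY")
package = ckan.action.package_show(id="some-orphaned-dataset")

# Replace the "_etl" tag with "_orphaned_etl", keeping all other tags.
tags = [t for t in package["tags"] if t["name"] != "_etl"]
tags.append({"name": "_orphaned_etl"})
ckan.action.package_patch(id=package["id"], tags=tags)
</syntaxhighlight>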
== Inventorying ETL jobs ==
When you add, delete, or modify ETL jobs, update [https://docs.google.com/spreadsheets/d/1amPsZ1RA9d9m41MXkMmMB8-VDmwAOxU72A35DlSWbT4/edit#gid=0 this Google Sheet]. (Eventually, we'll make ETL jobs self-tracking somehow, probably by adding the location of the script, along with any other useful metadata, to the package <code>extras</code> metadata field.)
Currently, ETL jobs already report some metadata, such as the <code>last_etl_update</code> date and the source file hash. The name of the last source file is even being used in the new Data Rivers/Google Cloud Platform pipeline to check whether a given file is newer or older than the last one processed.
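A minimal sketch of how a job could write this kind of metadata to the package <code>extras</code> field through the CKAN API (the extras keys, script path, site URL, API key, and dataset ID are all hypothetical placeholders):

<syntaxhighlight lang="python">
# Hypothetical sketch: record ETL metadata (script location, update time,
# source file hash and name) in a package's "extras" field via the CKAN
# API. The URL, API key, dataset ID, and extras keys are placeholders.
import hashlib
from datetime import datetime, timezone
from ckanapi import RemoteCKAN

def file_hash(path):
    """Return the SHA-256 hex digest of a source file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

ckan = RemoteCKAN("https://data.example.org", apikey="YOUR-API-KEY")
# Note: supplying "extras" to package_patch replaces the dataset's
# existing extras list, so include every key you want to keep.
ckan.action.package_patch(
    id="some-dataset",
    extras=[
        {"key": "etl_script_location", "value": "etl_jobs/example_job.py"},
        {"key": "last_etl_update", "value": datetime.now(timezone.utc).isoformat()},
        {"key": "source_file_hash", "value": file_hash("source.csv")},
        {"key": "last_source_file", "value": "source.csv"},
    ],
)
</syntaxhighlight>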
[[Category:Onboarding]]
