Difference between revisions of "ETL"

650 bytes added ,  21:27, 3 June 2023
Add "Inventorying ETL jobs" section
(Add "Inventorying ETL jobs" section)
 
Line 105: Line 105:
== Retiring ETL jobs ==
== Retiring ETL jobs ==
Every ETL job has a life cycle. When an ETL job comes to the end of its existence because the data source has disappeared, we designate this dataset "orphaned". Metadata values for the update frequency fields should be changed to the value that has "Historical" in its description. The dataset's "_etl" tag should be removed and replaced with the "_orphaned_etl" tag. If there is still a manual inventory of ETL jobs, that inventory should be updated.
Every ETL job has a life cycle. When an ETL job comes to the end of its existence because the data source has disappeared, we designate this dataset "orphaned". Metadata values for the update frequency fields should be changed to the value that has "Historical" in its description. The dataset's "_etl" tag should be removed and replaced with the "_orphaned_etl" tag. If there is still a manual inventory of ETL jobs, that inventory should be updated.
== Inventorying ETL jobs ==
When you add, delete, or modify ETL jobs, update [https://docs.google.com/spreadsheets/d/1amPsZ1RA9d9m41MXkMmMB8-VDmwAOxU72A35DlSWbT4/edit#gid=0 this Google Sheet]. (Eventually, we'll make ETL jobs self-tracking somehow, probably by adding the location of the script (and any other useful metadata) to the package <code>extras</code> metadata field.)
Currently ETL jobs are reporting as metadata things like last_etl_update date and the source file hash. Even the last source file name is being used in the new Data Rivers/Google Cloud Platform to check whether a given file is newer or older than the last one used.


[[Category:Onboarding]]
[[Category:Onboarding]]