Difference between revisions of "ETL"

1,119 bytes added , 17:17, 2 June 2023

→‎Testing ETL jobs: Add schema-source comparison section

@@ Line 75: / Line 75: @@
 In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:
 <code>> python launchpad.py engine/payload/robopgh/census.py mute production</code>
+=== Schema-source comparison ===
+When running ETL jobs, you will sometimes see console output indicating that 1) fields in the source file are not being used in the schema or 2) fields in the schema cannot be found in the source file. These checks ensure that the schema matches, and accounts for, every field in the source file. (While Marshmallow has some support for this kind of checking, we've written our own code for handling these comparisons.)
+If there's a field in the source file that you don't want to publish (and you don't want to keep getting console output about it), you can either list it in the <code>exclude</code> option of the schema's <code>Meta</code> class or add the field to the schema with <code>load_only=True</code>.
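The effect of these two options on the first check can be illustrated with plain Python sets. This is only a minimal sketch, not the ETL framework's actual comparison code: the function name <code>unused_source_fields</code>, the sample field names, and the dictionary standing in for a schema are all hypothetical.

```python
def unused_source_fields(source_fields, schema_fields, exclude=()):
    """Return source fields that the schema neither declares nor excludes."""
    accounted_for = set(schema_fields) | set(exclude)
    return sorted(set(source_fields) - accounted_for)


# Hypothetical column names found in a source file.
source = ["tract", "population", "internal_notes", "geo_id"]

# Hypothetical stand-in for a schema: field name -> load_only flag.
# "geo_id" is declared but load_only, so it is read but never published.
schema = {"tract": False, "population": False, "geo_id": True}

# Without an exclude list, "internal_notes" is reported as unused.
print(unused_source_fields(source, schema))

# Listing it in exclude silences the report.
print(unused_source_fields(source, schema, exclude=["internal_notes"]))

# Only fields that are not load_only would be published.
published = [name for name, load_only in schema.items() if not load_only]
print(published)
```

In this sketch both approaches stop the console output; the difference is that a <code>load_only</code> field is still read from the source file, which is useful if another field is derived from it.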
+If there's a field in the schema that is supposed to be published ("dumped" in Marshmallow jargon) and is supposed to be loaded from the source file, but it can't be found in the source file, an error message will be printed to the console. If the job is also trying to push data to the CKAN datastore, an exception will be raised.
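The behavior described above (print an error, and raise only when pushing to the datastore) can be sketched as follows. This is an illustration under assumptions, not the framework's real implementation: <code>check_schema_fields</code>, <code>MissingSourceFieldError</code>, the <code>push_to_datastore</code> flag, and the dictionary-based schema are hypothetical names invented for the example.

```python
class MissingSourceFieldError(Exception):
    """Hypothetical exception for schema fields absent from the source file."""


def check_schema_fields(schema_fields, source_fields, push_to_datastore=False):
    """Report schema fields that should be loaded and published but are missing.

    schema_fields maps a field name to a dict of flags (load_only, dump_only).
    A field is checked only if it is both loaded from the source (not
    dump_only) and published (not load_only).
    """
    missing = sorted(
        name
        for name, flags in schema_fields.items()
        if not flags.get("load_only", False)
        and not flags.get("dump_only", False)
        and name not in source_fields
    )
    if missing:
        print(f"ERROR: schema fields not found in source file: {missing}")
        if push_to_datastore:
            # Refuse to push incomplete data.
            raise MissingSourceFieldError(", ".join(missing))
    return missing


# Hypothetical schema: "geo_id" is never published, "loaded_at" is generated
# by the job rather than read from the source, so neither is checked.
schema = {
    "tract": {},
    "population": {},
    "geo_id": {"load_only": True},
    "loaded_at": {"dump_only": True},
}

# A local run only prints the error and returns the missing names.
missing = check_schema_fields(schema, ["tract", "geo_id"])
```

Calling the same function with <code>push_to_datastore=True</code> would raise the exception instead of letting the job continue.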
 == Deploying ETL jobs ==