Difference between revisions of "ETL"

912 bytes added ,  20:22, 1 June 2023
Add schema design section to ETL page
m
(Add schema design section to ETL page)
Line 25: Line 25:
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.


=== Snake case ===
=== Schema design ===


Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so "FIELD NAME" becomes "field_name" and "# of pirates" should be changed to "number_of_pirates"). Reasons we prefer snake case: 1) Marshmallow already converts field names to snake case to some extent automatically. 2) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so "FIELD NAME" becomes "field_name" and "# of pirates" should be changed to "number_of_pirates"). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates `y` and `x` (or `lat` and `long`) but `latitude` and `longitude` are already being used by other data tables, switch to `latitude` and `longitude`.
# '''Standardize column values.'''


=== Pitfalls ===
=== Pitfalls ===