Difference between revisions of "ETL"


Revision as of 20:49, 1 June 2023

2,175 bytes added , 20:49, 1 June 2023

Finish schema design section

DRW


@@ Line 29: / Line 29: @@
 After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.
 # '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out what the column means (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.
-# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so "FIELD NAME" becomes "field_name" and "# of pirates" should be changed to "number_of_pirates"). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).
+# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so <code>FIELD NAME</code> becomes <code>field_name</code> and <code># of pirates</code> should be changed to <code>number_of_pirates</code>). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).
-# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates `y` and `x` (or `lat` and `long`) but `latitude` and `longitude` are already being used by other data tables, switch to `latitude` and `longitude`.
+# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates <code>y</code> and <code>x</code> (or <code>lat</code> and <code>long</code>) but <code>latitude</code> and <code>longitude</code> are already being used by other data tables, switch to <code>latitude</code> and <code>longitude</code>.
-# '''Standardize column values.'''
+# '''Standardize column values.''' Where possible, transform columns to standardize their values. The first step is to look at the histogram of every column (<code>Shift+F</code> in VisiData!) and see if anything is irregular. For instance, if the <code>municipality</code> column has 1038 records with <code>municipality == "Pittsburgh"</code> and two with <code>municipality == "PGH"</code>, add a <code>@pre_dump</code>-decorated function to the schema that changes all instances of "PGH" to "Pittsburgh". In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!
+# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semi-randomly ordered. Principles of column organization: a) '''The "input" should be on the left and the "output" should be on the right.''' Which fields is the user most likely to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Prioritize important stuff.''' If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left a field is, the better the chance that the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). c) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. d) '''Maximize readability.''' Think like a user. How can you order the columns so that the sequence is logical?
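
The naming and value-standardization steps in the list above can be sketched in plain Python. This is a minimal illustration and not code from the ETL repository: the function names, the <code>SYMBOL_WORDS</code> table, and the <code>MUNICIPALITY_ALIASES</code> mapping are all hypothetical, and the snake-case helper only lower-cases and replaces spaces and punctuation (it does not split camelCase names).

```python
import re

# Hypothetical expansions for symbols that should become words rather than
# underscores (so "# of pirates" yields "number_of_pirates").
SYMBOL_WORDS = {"#": "number", "%": "percent", "&": "and"}


def to_snake_case(name: str) -> str:
    """Lower-case a column name and turn spaces/punctuation into underscores."""
    for symbol, word in SYMBOL_WORDS.items():
        name = name.replace(symbol, f" {word} ")
    # Collapse every run of characters other than a-z and 0-9 into one underscore.
    name = re.sub(r"[^0-9a-z]+", "_", name.lower())
    return name.strip("_")


# Hypothetical mapping from irregular values (found by inspecting the column
# histogram) to the standard value.
MUNICIPALITY_ALIASES = {"PGH": "Pittsburgh"}


def standardize_record(record: dict) -> dict:
    """Return a copy of the record with the municipality value standardized."""
    cleaned = dict(record)
    value = cleaned.get("municipality")
    if isinstance(value, str):
        value = value.strip()
        cleaned["municipality"] = MUNICIPALITY_ALIASES.get(value, value)
    return cleaned


print(to_snake_case("FIELD NAME"))
print(to_snake_case("# of pirates"))
print(standardize_record({"id": 1, "municipality": "PGH"}))
```

In a Marshmallow schema, logic like <code>standardize_record</code> would typically live in a method decorated with <code>@pre_dump</code>, as described in the list; it is kept as a standalone function here so the sketch runs without any third-party packages.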
 === Pitfalls ===