Difference between revisions of "ETL"

79 bytes added ,  20:58, 1 June 2023
no edit summary
(Finish schema design section)
Line 32: Line 32:
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates <code>y</code> and <code>x</code> (or <code>lat</code> and <code>long</code>) but <code>latitude</code> and <code>longitude</code> are already being used by other data tables, switch to <code>latitude</code> and <code>longitude</code>.
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates <code>y</code> and <code>x</code> (or <code>lat</code> and <code>long</code>) but <code>latitude</code> and <code>longitude</code> are already being used by other data tables, switch to <code>latitude</code> and <code>longitude</code>.
# '''Standardize column values.''' Where possible transform columns to standardize their values. The first step is to look at the histogram of every column (<code>Shift+F</code> in VisiData!) and see if anything is irregular. For instance, if the <code>municipality</code> column has 1038 records with <code>municipality</code> == "Pittsburgh" and two with <code>municipality</code> == "PGH", add to the schema a @pre_dump decorator function to change all instances of "PGH" to "Pittsburgh". In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!
# '''Standardize column values.''' Where possible transform columns to standardize their values. The first step is to look at the histogram of every column (<code>Shift+F</code> in VisiData!) and see if anything is irregular. For instance, if the <code>municipality</code> column has 1038 records with <code>municipality</code> == "Pittsburgh" and two with <code>municipality</code> == "PGH", add to the schema a @pre_dump decorator function to change all instances of "PGH" to "Pittsburgh". In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The "input" should be on the left and the "output" should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). c) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The "input" should be on the left and the "output" should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). c) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?


=== Pitfalls ===
=== Pitfalls ===