Difference between revisions of "ETL"

370 bytes added ,  16:35, 2 June 2023
(Reorder two statement in an ordered list.)
Line 33: Line 33:
# '''Standardize column values.''' Where possible transform columns to standardize their values. The first step is to look at the histogram of every column (<code>Shift+F</code> in VisiData!) and see if anything is irregular. For instance, if the <code>municipality</code> column has 1038 records with <code>municipality</code> == "Pittsburgh" and two with <code>municipality</code> == "PGH", add to the schema a @pre_dump decorator function to change all instances of "PGH" to "Pittsburgh". In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!
# '''Standardize column values.''' Where possible transform columns to standardize their values. The first step is to look at the histogram of every column (<code>Shift+F</code> in VisiData!) and see if anything is irregular. For instance, if the <code>municipality</code> column has 1038 records with <code>municipality</code> == "Pittsburgh" and two with <code>municipality</code> == "PGH", add to the schema a @pre_dump decorator function to change all instances of "PGH" to "Pittsburgh". In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The "input" should be on the left and the "output" should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. c) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The "input" should be on the left and the "output" should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. c) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?
''After writing this section, I discovered that some of the ideas above also appear in the Urban Institute's
[https://www.urban.org/sites/default/files/publication/104296/do-no-harm-guide.pdf Do No Harm Guide: Applying Equity Awareness in Data Visualization]. While it is focussed on visualizations, some of its suggestions can be applied to designing data schemas.''


=== Pitfalls ===
=== Pitfalls ===