Difference between revisions of "ETL"

283 bytes added ,  20:52, 1 March 2022
Add byte-order mark warning
(Add information about type inference)


A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.
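As a minimal sketch of why codes should stay strings, the snippet below parses a small CSV and casts only the non-code columns to integers. The sample data, the column names, and the <code>CODE_FIELDS</code> set are hypothetical and chosen purely for illustration; this is not Little Lexicographer's output. In a Marshmallow schema, the equivalent choice would be to declare such columns as <code>fields.String</code> rather than <code>fields.Integer</code>.

```python
import csv
import io

# Hypothetical sample data; the column names are illustrative only.
SAMPLE = "zip_code,census_tract,population\n02134,003100,1523\n15213,040500,987\n"

# Columns that hold codes rather than quantities stay as strings.
CODE_FIELDS = {"zip_code", "census_tract"}


def load_rows(text):
    """Parse CSV text, casting only the non-code columns to int."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, value in raw.items():
            row[name] = value if name in CODE_FIELDS else int(value)
        rows.append(row)
    return rows


rows = load_rows(SAMPLE)

# Casting a code to an integer silently drops its leading zero.
naive = int("02134")
```

Here <code>rows[0]["zip_code"]</code> keeps its leading zero, while <code>naive</code> no longer round-trips to the original five-character ZIP code.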
=== Pitfalls ===
A [http://wingolab.org/2017/04/byteordermark byte-order mark] can show up at the beginning of the first field name in your file, so that the name no longer matches the field name your schema expects. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is "Never use Excel".
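One way to guard against this, sketched here under the assumption that the file is UTF-8 encoded, is to decode it with Python's <code>utf-8-sig</code> codec, which drops a leading byte-order mark when one is present and otherwise behaves like plain UTF-8. The sample bytes and field names below are made up for illustration.

```python
import csv
import io

# Hypothetical file contents: a UTF-8 byte-order mark followed by CSV text.
raw = b"\xef\xbb\xbfname,zip_code\nAlice,02134\n"

# Decoding as plain UTF-8 leaves the BOM attached to the first field name.
with_bom = csv.DictReader(io.StringIO(raw.decode("utf-8"))).fieldnames

# Decoding as "utf-8-sig" strips a leading BOM if there is one.
without_bom = csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))).fieldnames
```

When reading from disk, the same effect should be available by passing <code>encoding="utf-8-sig"</code> to <code>open()</code>.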


[[Category:Onboarding]]
