Difference between revisions of "ETL"

868 bytes added, 20:48, 1 March 2022

Add information about type inference

@@ Line 1: / Line 1: @@
+== ETL overview ==
 ETL (an acronym for "Extract-Transform-Load") describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.
@@ Line 4: / Line 6: @@
 Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.
+== Writing ETL jobs ==
+A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. It was initially designed only to help write data dictionaries: it scans each column, tries to determine the best type for it, and dumps the field names and types into a data dictionary template. Little Lexicographer can now also output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so the assigned types must be reviewed manually. In particular, Little Lexicographer is often fooled by values that look numeric but are really codes, such as ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This matters most for codes that may have leading zeros, which would be lost if the value were cast to an integer.
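The review step described above can be illustrated with a minimal sketch. This is not Little Lexicographer's actual algorithm or output format; the function names (<code>infer_field_type</code>, <code>propose_schema_lines</code>), the <code>force_string</code> parameter, and the sample data are all hypothetical. It uses only the Python standard library and emits proposed Marshmallow field declarations as plain text, so a reviewer can see where a naive guess goes wrong and override it.

```python
import csv
import io


def looks_like_code(value):
    """True for all-digit values with a leading zero, e.g. '08540'."""
    return len(value) > 1 and value.isdigit() and value.startswith("0")


def infer_field_type(values):
    """Guess a Marshmallow field name for one column of string values."""
    values = [v.strip() for v in values if v and v.strip()]
    # A leading zero is a strong hint that the column holds codes.
    if not values or any(looks_like_code(v) for v in values):
        return "String"
    try:
        for v in values:
            int(v)
        return "Integer"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "Float"
    except ValueError:
        return "String"


def propose_schema_lines(csv_text, force_string=()):
    """Return one proposed field declaration (as text) per CSV column.

    Columns named in force_string are the reviewer's manual overrides
    for codes that the heuristic would otherwise type as numbers.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    lines = []
    for name in reader.fieldnames or []:
        if name in force_string:
            field_type = "String"
        else:
            field_type = infer_field_type(row[name] for row in rows)
        lines.append(f"{name} = fields.{field_type}()")
    return lines


# Hypothetical sample data: the tract column has no leading zeros,
# so the heuristic alone would propose Integer for it.
SAMPLE = (
    "zip_code,tract,count,price\n"
    "15213,42003010300,3,1.5\n"
    "08540,42003020100,10,2\n"
)

for line in propose_schema_lines(SAMPLE, force_string=("tract",)):
    print(line)
```

Note that a column containing only ZIP codes without leading zeros (for example, 15213 and 15217) would still be guessed as an integer by this sketch, which is why the manual override is needed rather than relying on the leading-zero check alone.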
 [[Category:Onboarding]]