<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.tessercat.net/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=DRW</id>
	<title>WPRDC Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.tessercat.net/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=DRW"/>
	<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/wiki/Special:Contributions/DRW"/>
	<updated>2026-04-08T01:07:50Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.36.2</generator>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Data_Guides&amp;diff=14400</id>
		<title>Data Guides</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Data_Guides&amp;diff=14400"/>
		<updated>2024-02-01T14:44:17Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* WPRDC Data Guides */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Links to detailed documents about data that include social and contextual information.&lt;br /&gt;
&lt;br /&gt;
== WPRDC Data Guides ==&lt;br /&gt;
[[Data Across Sectors for Health (DASH) Data Guide]] - A guide for [https://data.wprdc.org/dataset/?q=dash&amp;amp;sort=score+desc%2C+metadata_modified+desc&amp;amp;tags=ACHD+DASH these 29 health-related datasets].&lt;br /&gt;
&lt;br /&gt;
[[Guide to Property and Housing Data]]&lt;br /&gt;
&lt;br /&gt;
[[Allegheny County Property Assessment Data User Guide]] | ([https://data.wprdc.org/dataset/2b3df818-601e-4f06-b150-643557229491/resource/cc4bafd2-25b6-41d7-83aa-d16bc211b020/download/allegheny-county-property-assessment-data-user-guide.pdf PDF version])&lt;br /&gt;
&lt;br /&gt;
[[Crime, Courts, and Corrections in the City of Pittsburgh]]&lt;br /&gt;
&lt;br /&gt;
[[City of Pittsburgh 311 Data User Guide]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Data_Guides&amp;diff=14399</id>
		<title>Data Guides</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Data_Guides&amp;diff=14399"/>
		<updated>2024-02-01T14:43:25Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* WPRDC Data Guides */ Clarify nature of DASH data guide&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Links to detailed documents about data that include social and contextual information.&lt;br /&gt;
&lt;br /&gt;
== WPRDC Data Guides ==&lt;br /&gt;
[[Data Across Sectors for Health (DASH) Data Guide]] - A guide for [https://data.wprdc.org/dataset/?q=dash&amp;amp;sort=score+desc%2C+metadata_modified+desc&amp;amp;tags=ACHD+DASH these 29 health-related datasets].&lt;br /&gt;
&lt;br /&gt;
[[Guide to Property and Housing Data]]&lt;br /&gt;
&lt;br /&gt;
[[Allegheny County Property Assessment Data User Guide]] | ([https://data.wprdc.org/dataset/2b3df818-601e-4f06-b150-643557229491/resource/cc4bafd2-25b6-41d7-83aa-d16bc211b020/download/allegheny-county-property-assessment-data-user-guide.pdf PDF version])&lt;br /&gt;
&lt;br /&gt;
[[Crime, Courts, and Corrections in the City of Pittsburgh]]&lt;br /&gt;
&lt;br /&gt;
[[City of Pittsburgh 311 Data User Guide]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Data_Guides&amp;diff=14398</id>
		<title>Data Guides</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Data_Guides&amp;diff=14398"/>
		<updated>2024-02-01T14:40:56Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* WPRDC Data Guides */ Reorder guides&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Links to detailed documents about data that include social and contextual information.&lt;br /&gt;
&lt;br /&gt;
== WPRDC Data Guides ==&lt;br /&gt;
[[DASH Data Guide]]&lt;br /&gt;
&lt;br /&gt;
[[Guide to Property and Housing Data]]&lt;br /&gt;
&lt;br /&gt;
[[Allegheny County Property Assessment Data User Guide]] | ([https://data.wprdc.org/dataset/2b3df818-601e-4f06-b150-643557229491/resource/cc4bafd2-25b6-41d7-83aa-d16bc211b020/download/allegheny-county-property-assessment-data-user-guide.pdf PDF version])&lt;br /&gt;
&lt;br /&gt;
[[Crime, Courts, and Corrections in the City of Pittsburgh]]&lt;br /&gt;
&lt;br /&gt;
[[City of Pittsburgh 311 Data User Guide]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Data_Guides&amp;diff=14397</id>
		<title>Data Guides</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Data_Guides&amp;diff=14397"/>
		<updated>2024-02-01T14:36:29Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* WPRDC Data Guides */ Indicate that one link is just a PDF version of another data guide&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Links to detailed documents about data that include social and contextual information.&lt;br /&gt;
&lt;br /&gt;
== WPRDC Data Guides ==&lt;br /&gt;
[[DASH Data Guide]]&lt;br /&gt;
&lt;br /&gt;
[[Guide to Property and Housing Data]]&lt;br /&gt;
&lt;br /&gt;
[[Crime, Courts, and Corrections in the City of Pittsburgh]]&lt;br /&gt;
&lt;br /&gt;
[[City of Pittsburgh 311 Data User Guide]]&lt;br /&gt;
&lt;br /&gt;
[[Allegheny County Property Assessment Data User Guide]] | ([https://data.wprdc.org/dataset/2b3df818-601e-4f06-b150-643557229491/resource/cc4bafd2-25b6-41d7-83aa-d16bc211b020/download/allegheny-county-property-assessment-data-user-guide.pdf PDF version])&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Main_Page&amp;diff=14396</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Main_Page&amp;diff=14396"/>
		<updated>2024-01-24T14:40:01Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* Learning Resources */ Add Tutorials link&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;strong&amp;gt;Welcome to the [https://www.wprdc.org WPRDC] wiki!&amp;lt;/strong&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Learning Resources ==&lt;br /&gt;
&lt;br /&gt;
=== Topics ===&lt;br /&gt;
[[Topic:GIS|Mapping/GIS]]&lt;br /&gt;
&lt;br /&gt;
=== 👋 Getting started with open data ===&lt;br /&gt;
[[Data Guides]]&lt;br /&gt;
&lt;br /&gt;
=== 📕 Grimoires ===&lt;br /&gt;
[[List of Scripts|Scripts]]&lt;br /&gt;
&lt;br /&gt;
[[Useful Parcel Queries]]&lt;br /&gt;
&lt;br /&gt;
[[Working with certificates]]&lt;br /&gt;
&lt;br /&gt;
=== Tutorials ===&lt;br /&gt;
[[Tutorials|Tutorials on using data, SQL, and computers]]&lt;br /&gt;
&lt;br /&gt;
== About the WPRDC ==&lt;br /&gt;
We'll put a short description of the WPRDC here.&lt;br /&gt;
&lt;br /&gt;
[[The WPRDC|(See the project's page for more details)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Maintainer Information ==&lt;br /&gt;
Consult the [[mediawikiwiki:Special:MyLanguage/Help:Contents|User's Guide]] for information on using the wiki software.&lt;br /&gt;
* [[mediawikiwiki:Special:MyLanguage/Manual:Configuration_settings|Configuration settings list]]&lt;br /&gt;
* [[mediawikiwiki:Special:MyLanguage/Manual:FAQ|MediaWiki FAQ]]&lt;br /&gt;
* [https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ MediaWiki release mailing list]&lt;br /&gt;
* [[mediawikiwiki:Special:MyLanguage/Localisation#Translation_resources|Localise MediaWiki for your language]]&lt;br /&gt;
* [[mediawikiwiki:Special:MyLanguage/Manual:Combating_spam|Learn how to combat spam on your wiki]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14395</id>
		<title>Tutorials</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14395"/>
		<updated>2024-01-24T14:38:27Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add a link to a tutorial and another to a discussion of tutorials&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* [https://kbroman.org/dataorg/ Organizing Data in Spreadsheets] - Tips on how to store data in a spreadsheet in a way that makes it easy to work with sustainably. (Highly recommended if you ever want to publish your data.)&lt;br /&gt;
* [https://www.youtube.com/watch?v=7Ma8WIDinDc Data Cleaning Principles] - A video of a talk from csv,conf,v6 on how to approach and think about cleaning data. The slides, annotated with notes, are [https://kbroman.org/Talk_DataCleaning/data_cleaning_notes.pdf here] and the corresponding GitHub repo is [https://github.com/kbroman/dataorg here].&lt;br /&gt;
* [https://www.youtube.com/watch?v=Ssso_5X1UPs Creating Effective Figures and Tables] - A video of a talk about how to communicate data clearly with graphs and tables. [https://www.biostat.wisc.edu/~kbroman/presentations/graphs2018.pdf A PDF of the slides] and [https://github.com/kbroman/Talk_Graphs the GitHub repo] are also available.&lt;br /&gt;
&lt;br /&gt;
* [https://missing.csail.mit.edu/ The Missing Semester] - A series of MIT class videos teaching a lot of practical things that developers need to know.&lt;br /&gt;
* [https://mptc.io/ Modern Plain Text Computing] - A short seminar taught at Duke, aimed at students and researchers, covering how to think about and use computers, file systems, the shell, text editors, version control, and build systems. The philosophy is that this is essential knowledge for doing lots of research and making that research reproducible.&lt;br /&gt;
** [https://mastodon.social/@kjhealy/111791054059558470 A Mastodon thread discussing similar classes and resources]&lt;br /&gt;
* [https://gitlab.com/slackermedia/bashcrawl bashcrawl] - A text adventure to teach shell skills.&lt;br /&gt;
* [https://mystery.knightlab.com/ SQL Murder Mystery] - &amp;quot;The SQL Murder Mystery is designed to be both a self-directed lesson to learn SQL concepts and commands and a fun game for experienced SQL users to solve an intriguing crime.&amp;quot;&lt;br /&gt;
** &amp;quot;... If you really want to learn a lot about SQL, you may prefer a complete tutorial like [https://selectstarsql.com/ Select Star SQL].&amp;quot;&lt;br /&gt;
* [https://jsvine.github.io/intro-to-visidata/index.html An Introduction to VisiData] - One way to learn the best power tool for exploring and analyzing CSV files, SQLite databases, Excel files, and many other file formats. VisiData is also handy for editing, manipulating, and joining CSV files. Click [https://www.visidata.org/install/ here] to install it.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14394</id>
		<title>CKAN Administration</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14394"/>
		<updated>2024-01-04T17:07:20Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add URL for fetching CKAN version and installed extensions&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;More information on deploying/administering CKAN (like adding/removing extensions and editing the theme) may be found in [https://github.com/WPRDC/ckan-docker/wiki the wiki of the GitHub repo for our fork of the CKAN Docker repo].&lt;br /&gt;
&lt;br /&gt;
== Checking CKAN configuration ==&lt;br /&gt;
&lt;br /&gt;
* Check CKAN version and installed extensions: [https://data.wprdc.org/api/3/action/status_show https://data.wprdc.org/api/3/action/status_show]&lt;br /&gt;
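&lt;br /&gt;
A quick way to script this check from Python (a minimal sketch; it assumes only the third-party &amp;lt;code&amp;gt;requests&amp;lt;/code&amp;gt; library):&lt;br /&gt;
&amp;lt;code&amp;gt;import requests&lt;br /&gt;
# status_show reports the CKAN version and the enabled extensions&lt;br /&gt;
r = requests.get('https://data.wprdc.org/api/3/action/status_show')&lt;br /&gt;
status = r.json()['result']&lt;br /&gt;
print(status['ckan_version'], status['extensions'])&amp;lt;/code&amp;gt;&lt;br /&gt;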
&lt;br /&gt;
== Changes that can be made through the frontend ==&lt;br /&gt;
There's a lot of documentation on publishing data on our CKAN portal [https://github.com/WPRDC/data-guide/tree/master/docs here].&lt;br /&gt;
&lt;br /&gt;
A few samples (to eventually migrate over):&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/PublishingCKAN.md Our documentation for publishers on publishing data on the WPRDC]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/data_dictionaries.md How to create data dictionaries] ==&amp;gt; [[Data Dictionaries]]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/metadata_extras.md Some of our standard extra metadata fields]&lt;br /&gt;
&lt;br /&gt;
=== Canonical Views ===&lt;br /&gt;
If you want to set a map or data table to be on the dataset landing page, you create a corresponding &amp;quot;view&amp;quot; under one of the resources in the dataset and then click the &amp;quot;Canonical View&amp;quot; button for that view. The catch is that CKAN does not enforce that only one view may be canonical, so if multiple views have their &amp;quot;Canonical View&amp;quot; button selected, CKAN will pick one of them to display, and you will have to deselect the others to get the one you want onto the dataset landing page.&lt;br /&gt;
&lt;br /&gt;
=== Writing dataset descriptions ===&lt;br /&gt;
The description field supports some limited markup, which appears to be a subset of Markdown.&lt;br /&gt;
&lt;br /&gt;
* Starting a line with a single pound sign (#) renders the line in larger, title-style text, but two pound signs (##) do not produce a different font size, as they would in standard Markdown.&lt;br /&gt;
&lt;br /&gt;
* Dashes can be used to denote elements in an unordered list (though I haven't been able to get nested lists to work).&lt;br /&gt;
&lt;br /&gt;
* Use backticks to indicate that a monospace font should be used to represent code, like `this`.&lt;br /&gt;
&lt;br /&gt;
Images, links, bold and italic text all work.&lt;br /&gt;
&lt;br /&gt;
It seems to be limited to the [https://www.markdownguide.org/cheat-sheet/#basic-syntax original, very basic specification of Markdown].&lt;br /&gt;
&lt;br /&gt;
== Changes that can be made through the backend ==&lt;br /&gt;
=== Configuring the CKAN server ===&lt;br /&gt;
(The contents of this section were initially taken from the &amp;lt;code&amp;gt;ORIENTATION&amp;lt;/code&amp;gt; file in &amp;lt;code&amp;gt;/home/ubuntu&amp;lt;/code&amp;gt; on the CKAN production server.)&lt;br /&gt;
&lt;br /&gt;
* The main CKAN config file is at &amp;lt;code&amp;gt;/etc/ckan/default/production.ini&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To monitor HTTP requests in real-time: &amp;lt;code&amp;gt;&amp;gt; tail -f /var/log/nginx/access.log&amp;lt;/code&amp;gt;&lt;br /&gt;
  &lt;br /&gt;
* Service-worker activity (like the Express Loader uploading files to the datastore and background geocoding) can be found in: &amp;lt;code&amp;gt;/var/log/ckan-worker.log&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Edit templates here (changes to templates should show up when reloading the relevant web pages): &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/ckanext/wprdc/templates&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;templates/terms.html&amp;lt;/code&amp;gt; is the source for the pop-up version of the Terms of Use. There appears to be no template linked to the &amp;quot;Terms&amp;quot; hyperlink.&lt;br /&gt;
&lt;br /&gt;
* Create a file &amp;lt;code&amp;gt;templates/foo.html&amp;lt;/code&amp;gt; and then run &amp;lt;code&amp;gt;&amp;gt; sudo service supervisor restart&amp;lt;/code&amp;gt; and THEN load &amp;lt;code&amp;gt;data.wprdc.org/foo.html&amp;lt;/code&amp;gt; in your browser, and the page will be there.&lt;br /&gt;
&lt;br /&gt;
* Presumably &amp;lt;code&amp;gt;data.wprdc.org/foo/&amp;lt;/code&amp;gt; can be populated by creating a file at &amp;lt;code&amp;gt;templates/foo/index.html&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Managing the CKAN server ===&lt;br /&gt;
* To restart the Express Loader: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl restart ckan-worker:*&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To edit the background worker configuration (including increasing the number of background workers), &lt;br /&gt;
*# Edit the config file: &amp;lt;code&amp;gt;&amp;gt; vi /etc/supervisor/conf.d/supervisor-ckan-worker.conf&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Tell Supervisor to use the new configuration: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl reread&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Update the deployed configuration to start the desired number of workers: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl update&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Activate the virtual environment that lets you run &amp;lt;code&amp;gt;paster&amp;lt;/code&amp;gt; commands: &amp;lt;code&amp;gt;&amp;gt; . /usr/lib/ckan/default/bin/activate&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing the Docker containers ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt; &amp;gt; cd docker-ckan&lt;br /&gt;
&lt;br /&gt;
&amp;gt; docker ps&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
should return a list of running containers, which should include the following container names: &amp;lt;code&amp;gt;datapusher-plus&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;ckan&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;solr&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;redis&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Use&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; sudo docker-compose logs --tail=100 ckan&amp;lt;/code&amp;gt;&lt;br /&gt;
to show the last 100 lines of the log for the &amp;lt;code&amp;gt;ckan&amp;lt;/code&amp;gt; Docker instance.&lt;br /&gt;
&lt;br /&gt;
=== Adding/changing departments of publishers ===&lt;br /&gt;
To add or change the departments belonging to a particular publisher organization, edit the &amp;lt;code&amp;gt;dataset_schema.json&amp;lt;/code&amp;gt; file: &amp;lt;code&amp;gt;&amp;gt; vi /usr/lib/ckan/default/src/ckanext-scheming/ckanext/scheming/dataset_schema.json&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then run &amp;lt;code&amp;gt;&amp;gt; sudo service apache2 reload&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The extra-tricky part is that [https://github.com/WPRDC/ckanext-wprdctheme our GitHub repository that includes this JSON file] is installed in a different directory, &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/&amp;lt;/code&amp;gt;, but changes to the files in that directory (and its subdirectories) have no effect.&lt;br /&gt;
&lt;br /&gt;
=== Dealing with inadequate disk space ===&lt;br /&gt;
Once, we were seeing &amp;lt;code&amp;gt;OSError: write error&amp;lt;/code&amp;gt; in the CKAN Docker logs and had to increase disk space to make CKAN function again.&lt;br /&gt;
&lt;br /&gt;
Steve on increasing volume size on AWS:&lt;br /&gt;
&amp;gt; if you ever have to increase a volume, here’s what i followed to make the filesystem use the new space: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html&lt;br /&gt;
that server was a &amp;lt;code&amp;gt;Xen&amp;lt;/code&amp;gt; instance&lt;br /&gt;
&lt;br /&gt;
== Other changes ==&lt;br /&gt;
=== Using CKAN metadata instead of local caches ===&lt;br /&gt;
To avoid keeping local databases about datasets (for instance, when writing code to track some aspect of datasets), store such information (such as the last time an ETL job was run on a given package) in the 'extras' metadata field of the CKAN package, as much as possible. This stores information in a centralized location so ETL jobs can be run from multiple computers without any other coordination. The extras metadata fields are cataloged on the [[CKAN Metadata]] page.&lt;br /&gt;
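&lt;br /&gt;
For example, an ETL job could record its last run time with a &amp;lt;code&amp;gt;package_patch&amp;lt;/code&amp;gt; API call, roughly like this (a hedged sketch; the package ID and API key are placeholders):&lt;br /&gt;
&amp;lt;code&amp;gt;import datetime&lt;br /&gt;
import requests&lt;br /&gt;
# Placeholders -- substitute a real package ID and API key&lt;br /&gt;
payload = {'id': 'some-package-id', 'extras': [{'key': 'last_etl_update', 'value': datetime.datetime.utcnow().isoformat()}]}&lt;br /&gt;
requests.post('https://data.wprdc.org/api/3/action/package_patch', json=payload, headers={'Authorization': 'your-api-key'})&amp;lt;/code&amp;gt;&lt;br /&gt;
Caveat: supplying 'extras' replaces the whole extras list, so in practice fetch the existing extras first (e.g., via &amp;lt;code&amp;gt;package_show&amp;lt;/code&amp;gt;) and merge before patching.&lt;br /&gt;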
&lt;br /&gt;
=== Hacky workaround for adding new users to publishers ===&lt;br /&gt;
In addition to adding the users to the organizations through the CKAN front-end, you also have to add them to groups, using this URL: [http://wprdc.org/group-adder/]&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]] [[Category:CKAN]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14393</id>
		<title>CKAN Administration</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14393"/>
		<updated>2023-11-21T18:27:30Z</updated>

		<summary type="html">&lt;p&gt;DRW: Fix markup&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;More information on deploying/administering CKAN (like adding/removing extensions and editing the theme) may be found in [https://github.com/WPRDC/ckan-docker/wiki the wiki of the GitHub repo for our fork of the CKAN Docker repo].&lt;br /&gt;
&lt;br /&gt;
== Changes that can be made through the frontend ==&lt;br /&gt;
There's a lot of documentation on publishing data on our CKAN portal [https://github.com/WPRDC/data-guide/tree/master/docs here].&lt;br /&gt;
&lt;br /&gt;
A few samples (to eventually migrate over):&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/PublishingCKAN.md Our documentation for publishers on publishing data on the WPRDC]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/data_dictionaries.md How to create data dictionaries] ==&amp;gt; [[Data Dictionaries]]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/metadata_extras.md Some of our standard extra metadata fields]&lt;br /&gt;
&lt;br /&gt;
=== Canonical Views ===&lt;br /&gt;
If you want to set a map or data table to be on the dataset landing page, you create a corresponding &amp;quot;view&amp;quot; under one of the resources in the dataset and then click the &amp;quot;Canonical View&amp;quot; button for that view. The catch is that CKAN does not enforce that only one view may be canonical, so if multiple views have their &amp;quot;Canonical View&amp;quot; button selected, CKAN will pick one of them to display, and you will have to deselect the others to get the one you want onto the dataset landing page.&lt;br /&gt;
&lt;br /&gt;
=== Writing dataset descriptions ===&lt;br /&gt;
The description field supports some limited markup, which appears to be a subset of Markdown.&lt;br /&gt;
&lt;br /&gt;
* Starting a line with a single pound sign (#) renders the line in larger, title-style text, but two pound signs (##) do not produce a different font size, as they would in standard Markdown.&lt;br /&gt;
&lt;br /&gt;
* Dashes can be used to denote elements in an unordered list (though I haven't been able to get nested lists to work).&lt;br /&gt;
&lt;br /&gt;
* Use backticks to indicate that a monospace font should be used to represent code, like `this`.&lt;br /&gt;
&lt;br /&gt;
Images, links, bold and italic text all work.&lt;br /&gt;
&lt;br /&gt;
It seems to be limited to the [https://www.markdownguide.org/cheat-sheet/#basic-syntax original, very basic specification of Markdown].&lt;br /&gt;
&lt;br /&gt;
== Changes that can be made through the backend ==&lt;br /&gt;
=== Configuring the CKAN server ===&lt;br /&gt;
(The contents of this section were initially taken from the &amp;lt;code&amp;gt;ORIENTATION&amp;lt;/code&amp;gt; file in &amp;lt;code&amp;gt;/home/ubuntu&amp;lt;/code&amp;gt; on the CKAN production server.)&lt;br /&gt;
&lt;br /&gt;
* The main CKAN config file is at &amp;lt;code&amp;gt;/etc/ckan/default/production.ini&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To monitor HTTP requests in real-time: &amp;lt;code&amp;gt;&amp;gt; tail -f /var/log/nginx/access.log&amp;lt;/code&amp;gt;&lt;br /&gt;
  &lt;br /&gt;
* Service-worker activity (like the Express Loader uploading files to the datastore and background geocoding) can be found in: &amp;lt;code&amp;gt;/var/log/ckan-worker.log&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Edit templates here (changes to templates should show up when reloading the relevant web pages): &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/ckanext/wprdc/templates&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;templates/terms.html&amp;lt;/code&amp;gt; is the source for the pop-up version of the Terms of Use. There appears to be no template linked to the &amp;quot;Terms&amp;quot; hyperlink.&lt;br /&gt;
&lt;br /&gt;
* Create a file &amp;lt;code&amp;gt;templates/foo.html&amp;lt;/code&amp;gt; and then run &amp;lt;code&amp;gt;&amp;gt; sudo service supervisor restart&amp;lt;/code&amp;gt; and THEN load &amp;lt;code&amp;gt;data.wprdc.org/foo.html&amp;lt;/code&amp;gt; in your browser, and the page will be there.&lt;br /&gt;
&lt;br /&gt;
* Presumably &amp;lt;code&amp;gt;data.wprdc.org/foo/&amp;lt;/code&amp;gt; can be populated by creating a file at &amp;lt;code&amp;gt;templates/foo/index.html&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Managing the CKAN server ===&lt;br /&gt;
* To restart the Express Loader: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl restart ckan-worker:*&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To edit the background worker configuration (including increasing the number of background workers), &lt;br /&gt;
*# Edit the config file: &amp;lt;code&amp;gt;&amp;gt; vi /etc/supervisor/conf.d/supervisor-ckan-worker.conf&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Tell Supervisor to use the new configuration: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl reread&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Update the deployed configuration to start the desired number of workers: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl update&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Activate the virtual environment that lets you run &amp;lt;code&amp;gt;paster&amp;lt;/code&amp;gt; commands: &amp;lt;code&amp;gt;&amp;gt; . /usr/lib/ckan/default/bin/activate&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing the Docker containers ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt; &amp;gt; cd docker-ckan&lt;br /&gt;
&lt;br /&gt;
&amp;gt; docker ps&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
should return a list of running containers, which should include the following container names: &amp;lt;code&amp;gt;datapusher-plus&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;ckan&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;solr&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;redis&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Use&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; sudo docker-compose logs --tail=100 ckan&amp;lt;/code&amp;gt;&lt;br /&gt;
to show the last 100 lines of the log for the &amp;lt;code&amp;gt;ckan&amp;lt;/code&amp;gt; Docker instance.&lt;br /&gt;
&lt;br /&gt;
=== Adding/changing departments of publishers ===&lt;br /&gt;
To add or change the departments belonging to a particular publisher organization, edit the &amp;lt;code&amp;gt;dataset_schema.json&amp;lt;/code&amp;gt; file: &amp;lt;code&amp;gt;&amp;gt; vi /usr/lib/ckan/default/src/ckanext-scheming/ckanext/scheming/dataset_schema.json&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then run &amp;lt;code&amp;gt;&amp;gt; sudo service apache2 reload&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The extra-tricky part is that [https://github.com/WPRDC/ckanext-wprdctheme our GitHub repository that includes this JSON file] is installed in a different directory, &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/&amp;lt;/code&amp;gt;, but changes to the files in that directory (and its subdirectories) have no effect.&lt;br /&gt;
&lt;br /&gt;
=== Dealing with inadequate disk space ===&lt;br /&gt;
Once, we were seeing &amp;lt;code&amp;gt;OSError: write error&amp;lt;/code&amp;gt; in the CKAN Docker logs and had to increase disk space to make CKAN function again.&lt;br /&gt;
&lt;br /&gt;
Steve on increasing volume size on AWS:&lt;br /&gt;
&amp;gt; if you ever have to increase a volume, here’s what i followed to make the filesystem use the new space: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html&lt;br /&gt;
that server was a &amp;lt;code&amp;gt;Xen&amp;lt;/code&amp;gt; instance&lt;br /&gt;
&lt;br /&gt;
== Other changes ==&lt;br /&gt;
=== Using CKAN metadata instead of local caches ===&lt;br /&gt;
To avoid keeping local databases about datasets (for instance, when writing code to track some aspect of datasets), store such information (such as the last time an ETL job was run on a given package) in the 'extras' metadata field of the CKAN package, as much as possible. This stores information in a centralized location so ETL jobs can be run from multiple computers without any other coordination. The extras metadata fields are cataloged on the [[CKAN Metadata]] page.&lt;br /&gt;
&lt;br /&gt;
=== Hacky workaround for adding new users to publishers ===&lt;br /&gt;
In addition to adding the users to the organizations through the CKAN front-end, you also have to add them to groups, using this URL: [http://wprdc.org/group-adder/]&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]] [[Category:CKAN]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14392</id>
		<title>CKAN Administration</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14392"/>
		<updated>2023-11-21T18:26:55Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add link to WPRDC wiki on CKAN Docker deployment/administration&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;More information on deploying/administering CKAN (like adding/removing extensions and editing the theme) may be found in [https://github.com/WPRDC/ckan-docker/wiki the wiki of the GitHub repo for our fork of the CKAN Docker repo].&lt;br /&gt;
&lt;br /&gt;
== Changes that can be made through the frontend ==&lt;br /&gt;
There's a lot of documentation on publishing data on our CKAN portal [https://github.com/WPRDC/data-guide/tree/master/docs here].&lt;br /&gt;
&lt;br /&gt;
A few samples (to eventually migrate over):&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/PublishingCKAN.md Our documentation for publishers on publishing data on the WPRDC]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/data_dictionaries.md How to create data dictionaries] ==&amp;gt; [[Data Dictionaries]]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/metadata_extras.md Some of our standard extra metadata fields]&lt;br /&gt;
&lt;br /&gt;
=== Canonical Views ===&lt;br /&gt;
If you want to set a map or data table to be on the dataset landing page, you create a corresponding &amp;quot;view&amp;quot; under one of the resources in the dataset and then click the &amp;quot;Canonical View&amp;quot; button for that view. The catch is that CKAN does not enforce that only one view may be canonical, so if multiple views have their &amp;quot;Canonical View&amp;quot; button selected, CKAN will pick one of them to display, and you will have to deselect the others to get the one you want onto the dataset landing page.&lt;br /&gt;
&lt;br /&gt;
=== Writing dataset descriptions ===&lt;br /&gt;
The description field supports some limited markup, which appears to be a subset of Markdown.&lt;br /&gt;
&lt;br /&gt;
* Starting a line with a single pound sign (#) renders the line in larger, title-style text, but two pound signs (##) do not produce a different font size, as they would in standard Markdown.&lt;br /&gt;
&lt;br /&gt;
* Dashes can be used to denote elements in an unordered list (though I haven't been able to get nested lists to work).&lt;br /&gt;
&lt;br /&gt;
* Use backticks to indicate that a monospace font should be used to represent code, like `this`.&lt;br /&gt;
&lt;br /&gt;
Images, links, bold and italic text all work.&lt;br /&gt;
&lt;br /&gt;
It seems to be limited to the [https://www.markdownguide.org/cheat-sheet/#basic-syntax original, very basic specification of Markdown].&lt;br /&gt;
&lt;br /&gt;
== Changes that can be made through the backend ==&lt;br /&gt;
=== Configuring the CKAN server ===&lt;br /&gt;
(The contents of this section were initially taken from the &amp;lt;code&amp;gt;ORIENTATION&amp;lt;/code&amp;gt; file in &amp;lt;code&amp;gt;/home/ubuntu&amp;lt;/code&amp;gt; on the CKAN production server.)&lt;br /&gt;
&lt;br /&gt;
* The main CKAN config file is at &amp;lt;code&amp;gt;/etc/ckan/default/production.ini&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To monitor HTTP requests in real-time: &amp;lt;code&amp;gt;&amp;gt; tail -f /var/log/nginx/access.log&amp;lt;/code&amp;gt;&lt;br /&gt;
  &lt;br /&gt;
* Service-worker activity (like the Express Loader uploading files to the datastore and background geocoding) can be found in: &amp;lt;code&amp;gt;/var/log/ckan-worker.log&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Edit templates here (changes to templates should show up when reloading the relevant web pages): &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/ckanext/wprdc/templates&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;templates/terms.html&amp;lt;/code&amp;gt; is the source for the pop-up version of the Terms of Use. There appears to be no template linked to the &amp;quot;Terms&amp;quot; hyperlink.&lt;br /&gt;
&lt;br /&gt;
* Create a file &amp;lt;code&amp;gt;templates/foo.html&amp;lt;/code&amp;gt; and then run &amp;lt;code&amp;gt;&amp;gt; sudo service supervisor restart&amp;lt;/code&amp;gt; and THEN load &amp;lt;code&amp;gt;data.wprdc.org/foo.html&amp;lt;/code&amp;gt; in your browser, and the page will be there.&lt;br /&gt;
&lt;br /&gt;
* Presumably &amp;lt;code&amp;gt;data.wprdc.org/foo/&amp;lt;/code&amp;gt; can be populated by creating a file at &amp;lt;code&amp;gt;templates/foo/index.html&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Managing the CKAN server ===&lt;br /&gt;
* To restart the Express Loader: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl restart ckan-worker:*&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To edit the background worker configuration (including increasing the number of background workers), &lt;br /&gt;
*# Edit the config file: &amp;lt;code&amp;gt;&amp;gt; vi /etc/supervisor/conf.d/supervisor-ckan-worker.conf&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Tell Supervisor to use the new configuration: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl reread&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Update the deployed configuration to start the desired number of workers: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl update&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Activate the virtual environment that lets you run &amp;lt;code&amp;gt;paster&amp;lt;/code&amp;gt; commands: &amp;lt;code&amp;gt;&amp;gt; . /usr/lib/ckan/default/bin/activate&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing the Docker containers ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt; &amp;gt; cd docker-ckan&lt;br /&gt;
&lt;br /&gt;
&amp;gt; docker ps&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
should return a list of running containers, which should include the following container names: &amp;lt;code&amp;gt;datapusher-plus&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;ckan&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;solr&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;redis&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Use&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; sudo docker-compose logs --tail=100 ckan&amp;lt;/code&amp;gt;&lt;br /&gt;
to show the last 100 lines of the log for the &amp;lt;code&amp;gt;ckan&amp;lt;/code&amp;gt; Docker instance.&lt;br /&gt;
&lt;br /&gt;
=== Adding/changing departments of publishers ===&lt;br /&gt;
To add or change the departments belonging to a particular publisher organization, edit the &amp;lt;code&amp;gt;dataset_schema.json&amp;lt;/code&amp;gt; file: &amp;lt;code&amp;gt;&amp;gt; vi /usr/lib/ckan/default/src/ckanext-scheming/ckanext/scheming/dataset_schema.json&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then run &amp;lt;code&amp;gt;&amp;gt; sudo service apache2 reload&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The extra-tricky part is that [https://github.com/WPRDC/ckanext-wprdctheme our GitHub repository that includes this JSON file] is installed in a different directory, &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/&amp;lt;/code&amp;gt;, but changes to the files in that directory (and its subdirectories) have no effect.&lt;br /&gt;
&lt;br /&gt;
=== Dealing with inadequate disk space ===&lt;br /&gt;
Once, we were seeing &amp;lt;code&amp;gt;OSError: write error&amp;lt;/code&amp;gt; in the CKAN Docker logs and had to increase disk space to make CKAN function again.&lt;br /&gt;
&lt;br /&gt;
Steve on increasing volume size on AWS:&lt;br /&gt;
&amp;gt; if you ever have to increase a volume, here’s what i followed to make the filesystem use the new space: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html&lt;br /&gt;
that server was a &amp;lt;code&amp;gt;Xen&amp;lt;/code&amp;gt; instance&lt;br /&gt;
&lt;br /&gt;
== Other changes ==&lt;br /&gt;
=== Using CKAN metadata instead of local caches ===&lt;br /&gt;
To avoid keeping local databases about datasets (for instance, when writing code to track some aspect of datasets), store such information (such as the last time an ETL job was run on a given package) in the 'extras' metadata field of the CKAN package, as much as possible. This stores information in a centralized location so ETL jobs can be run from multiple computers without any other coordination. The extras metadata fields are cataloged on the [[CKAN Metadata]] page.&lt;br /&gt;
&lt;br /&gt;
=== Hacky workaround for adding new users to publishers ===&lt;br /&gt;
In addition to adding the users to the organizations through the CKAN front-end, you also have to add them to groups, using this URL: [http://wprdc.org/group-adder/]&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]] [[Category:CKAN]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=CKAN_Metadata&amp;diff=14391</id>
		<title>CKAN Metadata</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=CKAN_Metadata&amp;diff=14391"/>
		<updated>2023-11-21T18:22:58Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add &amp;quot;CKAN&amp;quot; as an additional category for this page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Metadata overview ==&lt;br /&gt;
Metadata is a structured framework for documenting data. Some people like to say it's data about data. It's essential if anyone hopes to find and use your data.&lt;br /&gt;
&lt;br /&gt;
The metadata standard we're using is adapted from the standards used by the City of San Francisco and data.gov (the U.S. Federal government's open data repository).&lt;br /&gt;
&lt;br /&gt;
Each CKAN dataset has default metadata fields. Some of these are filled in automatically by CKAN when a dataset is created or updated, while others are set by the publisher when the dataset is created and can be updated by the publisher or WPRDC staff (or, in some cases, programmatically, such as the updating of the Temporal Coverage metadata field by [https://github.com/WPRDC/pocket-watch our watchdog utility]).&lt;br /&gt;
&lt;br /&gt;
== WPRDC custom metadata ==&lt;br /&gt;
After upgrading our data portal to CKAN 2.7, it became possible to easily create new metadata subfields within the 'extras' metadata field for any dataset. This can be done through API calls or through the CKAN web interface (by editing the dataset package).&lt;br /&gt;
&lt;br /&gt;
Below is a partial list of 'extras' metadata fields in use on https://data.wprdc.org:&lt;br /&gt;
{| class=&amp;quot;wikitable sortable&amp;quot;&lt;br /&gt;
|+ Selected 'extras' metadata fields&lt;br /&gt;
|-&lt;br /&gt;
! Field name !! Use !! Used by&lt;br /&gt;
|-&lt;br /&gt;
| last_etl_update || Indicates when the ETL job last finished. || rocket-etl&lt;br /&gt;
|-&lt;br /&gt;
| time_field || Dict specifying the field name in a table that stores each record's timestamp (used for determining dataset freshness). || pocket-watch&lt;br /&gt;
|-&lt;br /&gt;
| no_updates_on || List of days (e.g., &amp;quot;weekends&amp;quot;) coding for when a table is not expected to update. || pocket-watch&lt;br /&gt;
|}&lt;br /&gt;
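&lt;br /&gt;
To read these fields back (say, to check when an ETL job last ran), something like the following works (a minimal sketch; the package ID is a placeholder, and it assumes the &amp;lt;code&amp;gt;requests&amp;lt;/code&amp;gt; library):&lt;br /&gt;
&amp;lt;code&amp;gt;import requests&lt;br /&gt;
# package_show returns the dataset's metadata, including the 'extras' list of key/value pairs&lt;br /&gt;
r = requests.get('https://data.wprdc.org/api/3/action/package_show', params={'id': 'some-package-id'})&lt;br /&gt;
extras = {e['key']: e['value'] for e in r.json()['result'].get('extras', [])}&lt;br /&gt;
print(extras.get('last_etl_update'))&amp;lt;/code&amp;gt;&lt;br /&gt;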
&lt;br /&gt;
[[Category:Onboarding]] [[Category:CKAN]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=CKAN_Tricks&amp;diff=14390</id>
		<title>CKAN Tricks</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=CKAN_Tricks&amp;diff=14390"/>
		<updated>2023-11-21T18:22:26Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add &amp;quot;CKAN&amp;quot; as an additional category for this page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Manage datasets ==&lt;br /&gt;
=== Undelete deleted datasets ===&lt;br /&gt;
If you know the URL of the deleted dataset AND you are logged in as an administrator, you can go to that URL in your web browser and see the deleted dataset. &amp;quot;[Deleted]&amp;quot; will have been appended to the title of the dataset, and the value of the metadata field &amp;lt;code&amp;gt;state&amp;lt;/code&amp;gt; will be equal to &amp;quot;deleted&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
To undelete such a dataset, just use the CKAN API or use the &lt;br /&gt;
[https://github.com/WPRDC/utility-belt/blob/master/gadgets.py#L767 set_package_parameters_to_values() function] to set the package's &amp;lt;code&amp;gt;state&amp;lt;/code&amp;gt; metadata value to &amp;quot;active&amp;quot;. For the latter option, invoke the function like this:&lt;br /&gt;
&amp;lt;code&amp;gt;set_package_parameters_to_values(&amp;quot;https://data.wprdc.org&amp;quot;, deleted_package_id, ['state'], ['active'], API_key)&amp;lt;/code&amp;gt;&lt;br /&gt;
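&lt;br /&gt;
If you'd rather hit the API directly, the equivalent is a &amp;lt;code&amp;gt;package_patch&amp;lt;/code&amp;gt; call (a sketch; the package ID and API key are placeholders, and it assumes the &amp;lt;code&amp;gt;requests&amp;lt;/code&amp;gt; library):&lt;br /&gt;
&amp;lt;code&amp;gt;import requests&lt;br /&gt;
deleted_package_id = 'some-package-id'  # placeholder&lt;br /&gt;
API_key = 'your-api-key'  # placeholder&lt;br /&gt;
# Setting state back to 'active' undeletes the package&lt;br /&gt;
requests.post('https://data.wprdc.org/api/3/action/package_patch', json={'id': deleted_package_id, 'state': 'active'}, headers={'Authorization': API_key})&amp;lt;/code&amp;gt;&lt;br /&gt;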
&lt;br /&gt;
== Queries ==&lt;br /&gt;
=== Make queries faster ===&lt;br /&gt;
[https://ckan.org/2017/08/10/faster-datastore-in-ckan-2-7/ Speed up datastore_search queries] by including the &amp;lt;code&amp;gt;include_total=False&amp;lt;/code&amp;gt; parameter to skip calculation of the total number of rows (which can reduce response time by a factor of 2).  The [https://docs.ckan.org/en/ckan-2.7.3/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_search datastore_search API call] lets you search a given datastore by column values and return subsets of the records. There's more on benchmarking CKAN performance [http://urbanopus.net/benchmarking-the-ckan-datastore-api/ here].&lt;br /&gt;
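&lt;br /&gt;
For instance (a minimal sketch; the resource ID is a placeholder, and it assumes the &amp;lt;code&amp;gt;requests&amp;lt;/code&amp;gt; library):&lt;br /&gt;
&amp;lt;code&amp;gt;import requests&lt;br /&gt;
# include_total=False skips the row-count calculation, which can roughly halve response time&lt;br /&gt;
r = requests.get('https://data.wprdc.org/api/3/action/datastore_search', params={'resource_id': 'some-resource-id', 'limit': 5, 'include_total': 'false'})&lt;br /&gt;
print(r.json()['result']['records'])&amp;lt;/code&amp;gt;&lt;br /&gt;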
&lt;br /&gt;
Another way to speed up datastore search queries is to index the fields used for filtering. Note that (at least when the primary key is a combination of fields) if you don't list each primary-key field as a separate field to index, those fields don't get indexed and queries take far longer.&lt;br /&gt;
&lt;br /&gt;
=== Avoiding stale query caches ===&lt;br /&gt;
Queries/API responses can be cached, depending on nginx settings. If you find that your SQL query is getting a stale response, try changing your query slightly. For instance, instead of `SELECT MAX(some_field) AS biggest FROM &amp;lt;resource-id&amp;gt;`, you could change the column alias (`SELECT MAX(some_field) AS biggest0413 FROM &amp;lt;resource-id&amp;gt;`) or add another field that you ignore (`SELECT MAX(some_field) AS biggest, MAX(some_field) AS whatever FROM &amp;lt;resource-id&amp;gt;`).&lt;br /&gt;
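&lt;br /&gt;
One way to automate the cache-busting (a sketch; the resource ID is a placeholder): fold a timestamp into the alias so the query text is unique on every run.&lt;br /&gt;
&amp;lt;code&amp;gt;import time&lt;br /&gt;
import requests&lt;br /&gt;
# A fresh alias each run means the cache has never seen this exact query before&lt;br /&gt;
sql = f'SELECT MAX(some_field) AS biggest{int(time.time())} FROM &amp;quot;some-resource-id&amp;quot;'&lt;br /&gt;
r = requests.get('https://data.wprdc.org/api/3/action/datastore_search_sql', params={'sql': sql})&amp;lt;/code&amp;gt;&lt;br /&gt;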
&lt;br /&gt;
== Scripts that interact with CKAN through the API ==&lt;br /&gt;
=== Run those CKAN-monitoring/modifying scripts from multiple servers by centralizing data ===&lt;br /&gt;
To avoid keeping local databases about datasets, store such information (such as the last time an ETL job was run on a given package) in the 'extras' metadata field of the CKAN package, as much as possible. This stores information in a centralized location so ETL jobs can be run from multiple computers without any other coordination. The extras metadata fields are currently cataloged on the [[CKAN_Metadata]] page.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]] [[Category:CKAN]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14389</id>
		<title>CKAN Administration</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14389"/>
		<updated>2023-11-21T18:22:02Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add &amp;quot;CKAN&amp;quot; as an additional category for this page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Changes that can be made through the frontend ==&lt;br /&gt;
There's a lot of documentation on publishing data on our CKAN portal [https://github.com/WPRDC/data-guide/tree/master/docs here].&lt;br /&gt;
&lt;br /&gt;
A few samples (to eventually migrate over):&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/PublishingCKAN.md Our documentation for publishers on publishing data on the WPRDC]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/data_dictionaries.md How to create data dictionaries] ==&amp;gt; [[Data Dictionaries]]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/metadata_extras.md Some of our standard extra metadata fields]&lt;br /&gt;
&lt;br /&gt;
=== Canonical Views ===&lt;br /&gt;
If you want to set a map or data table to be on the dataset landing page, you create a corresponding &amp;quot;view&amp;quot; under one of the resources in the dataset and then click the &amp;quot;Canonical View&amp;quot; button for that view. The catch is that CKAN does not enforce that only one view may be canonical, so if multiple views have their &amp;quot;Canonical View&amp;quot; button selected, CKAN will pick one of them to display, and you will have to deselect the others to get the one you want onto the dataset landing page.&lt;br /&gt;
&lt;br /&gt;
=== Writing dataset descriptions ===&lt;br /&gt;
The description field supports some limited markup, which appears to be a subset of Markdown.&lt;br /&gt;
&lt;br /&gt;
* Starting a line with a single pound sign (#) renders the line in larger, title-style text, but two pound signs (##) do not produce a different font size, as they would in standard Markdown.&lt;br /&gt;
&lt;br /&gt;
* Dashes can be used to denote elements in an unordered list (though I haven't been able to get nested lists to work).&lt;br /&gt;
&lt;br /&gt;
* Use backticks to indicate that a monospace font should be used to represent code, like `this`.&lt;br /&gt;
&lt;br /&gt;
Images, links, bold and italic text all work.&lt;br /&gt;
&lt;br /&gt;
It seems to be limited to the [https://www.markdownguide.org/cheat-sheet/#basic-syntax original, very basic specification of Markdown].&lt;br /&gt;
&lt;br /&gt;
== Changes that can be made through the backend ==&lt;br /&gt;
=== Configuring the CKAN server ===&lt;br /&gt;
(The contents of this section were initially taken from the &amp;lt;code&amp;gt;ORIENTATION&amp;lt;/code&amp;gt; file in &amp;lt;code&amp;gt;/home/ubuntu&amp;lt;/code&amp;gt; on the CKAN production server.)&lt;br /&gt;
&lt;br /&gt;
* The main CKAN config file is at &amp;lt;code&amp;gt;/etc/ckan/default/production.ini&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To monitor HTTP requests in real-time: &amp;lt;code&amp;gt;&amp;gt; tail -f /var/log/nginx/access.log&amp;lt;/code&amp;gt;&lt;br /&gt;
  &lt;br /&gt;
* Service-worker activity (like the Express Loader uploading files to the datastore and background geocoding) can be found in: &amp;lt;code&amp;gt;/var/log/ckan-worker.log&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Edit templates here (changes to templates should show up when reloading the relevant web pages): &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/ckanext/wprdc/templates&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;templates/terms.html&amp;lt;/code&amp;gt; is the source for the pop-up version of the Terms of Use. There appears to be no template linked to the &amp;quot;Terms&amp;quot; hyperlink.&lt;br /&gt;
&lt;br /&gt;
* Create a file &amp;lt;code&amp;gt;templates/foo.html&amp;lt;/code&amp;gt; and then run &amp;lt;code&amp;gt;&amp;gt; sudo service supervisor restart&amp;lt;/code&amp;gt; and THEN load &amp;lt;code&amp;gt;data.wprdc.org/foo.html&amp;lt;/code&amp;gt; in your browser, and the page will be there.&lt;br /&gt;
&lt;br /&gt;
* Presumably &amp;lt;code&amp;gt;data.wprdc.org/foo/&amp;lt;/code&amp;gt; can be populated by creating a file at &amp;lt;code&amp;gt;templates/foo/index.html&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Managing the CKAN server ===&lt;br /&gt;
* To restart the Express Loader: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl restart ckan-worker:*&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To edit the background worker configuration (including increasing the number of background workers), &lt;br /&gt;
*# Edit the config file: &amp;lt;code&amp;gt;&amp;gt; vi /etc/supervisor/conf.d/supervisor-ckan-worker.conf&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Tell Supervisor to use the new configuration: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl reread&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Update the deployed configuration to start the desired number of workers: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl update&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Activate the virtual environment that lets you run &amp;lt;code&amp;gt;paster&amp;lt;/code&amp;gt; commands: &amp;lt;code&amp;gt;&amp;gt; . /usr/lib/ckan/default/bin/activate&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing the Docker containers ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt; &amp;gt; cd docker-ckan&lt;br /&gt;
&lt;br /&gt;
&amp;gt; docker ps&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
should return a list of running containers, which should include the following container names: &amp;lt;code&amp;gt;datapusher-plus&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;ckan&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;solr&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;redis&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Use&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; sudo docker-compose logs --tail=100 ckan&amp;lt;/code&amp;gt;&lt;br /&gt;
to show the last 100 lines of the log for the &amp;lt;code&amp;gt;ckan&amp;lt;/code&amp;gt; Docker instance.&lt;br /&gt;
&lt;br /&gt;
=== Adding/changing departments of publishers ===&lt;br /&gt;
To add or change the departments belonging to a particular publisher organization, edit the &amp;lt;code&amp;gt;dataset_schema.json&amp;lt;/code&amp;gt; file: &amp;lt;code&amp;gt;&amp;gt; vi /usr/lib/ckan/default/src/ckanext-scheming/ckanext/scheming/dataset_schema.json&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then run &amp;lt;code&amp;gt;&amp;gt; sudo service apache2 reload&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The extra-tricky part is that [https://github.com/WPRDC/ckanext-wprdctheme our GitHub repository that includes this JSON file] is installed in a different directory, &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/&amp;lt;/code&amp;gt;, but changes to the files in that directory (and its subdirectories) have no effect.&lt;br /&gt;
&lt;br /&gt;
=== Dealing with inadequate disk space ===&lt;br /&gt;
Once, we were seeing &amp;lt;code&amp;gt;OSError: write error&amp;lt;/code&amp;gt; in the CKAN Docker logs and had to increase disk space to make CKAN function again.&lt;br /&gt;
&lt;br /&gt;
Steve on increasing volume size on AWS:&lt;br /&gt;
&amp;gt; if you ever have to increase a volume, here’s what i followed to make the filesystem use the new space: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html&lt;br /&gt;
that server was a &amp;lt;code&amp;gt;Xen&amp;lt;/code&amp;gt; instance&lt;br /&gt;
&lt;br /&gt;
== Other changes ==&lt;br /&gt;
=== Using CKAN metadata instead of local caches ===&lt;br /&gt;
To avoid keeping local databases about datasets (for instance, when writing code to track some aspect of datasets), store such information (such as the last time an ETL job was run on a given package) in the 'extras' metadata field of the CKAN package, as much as possible. This stores information in a centralized location so ETL jobs can be run from multiple computers without any other coordination. The extras metadata fields are cataloged on the [[CKAN Metadata]] page.&lt;br /&gt;
&lt;br /&gt;
=== Hacky workaround for adding new users to publishers ===&lt;br /&gt;
In addition to adding the users to the organizations through the CKAN front-end, you also have to add them to groups, using this URL: [http://wprdc.org/group-adder/]&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]] [[Category:CKAN]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14388</id>
		<title>CKAN Administration</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14388"/>
		<updated>2023-09-05T16:40:09Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add tips on managing Docker containers and increasing disk space under AWS&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Changes that can be made through the frontend ==&lt;br /&gt;
There's a lot of documentation on publishing data on our CKAN portal [https://github.com/WPRDC/data-guide/tree/master/docs here].&lt;br /&gt;
&lt;br /&gt;
A few samples (to eventually migrate over):&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/PublishingCKAN.md Our documentation for publishers on publishing data on the WPRDC]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/data_dictionaries.md How to create data dictionaries] ==&amp;gt; [[Data Dictionaries]]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/metadata_extras.md Some of our standard extra metadata fields]&lt;br /&gt;
&lt;br /&gt;
=== Canonical Views ===&lt;br /&gt;
If you want to set a map or data table to be on the dataset landing page, you create a corresponding &amp;quot;view&amp;quot; under one of the resources in the dataset and then click the &amp;quot;Canonical View&amp;quot; button for that view. The catch is that CKAN does not enforce that only one view may be canonical, so if multiple views have their &amp;quot;Canonical View&amp;quot; button selected, CKAN will pick one of them to display, and you will have to deselect the others to get the one you want onto the dataset landing page.&lt;br /&gt;
&lt;br /&gt;
=== Writing dataset descriptions ===&lt;br /&gt;
The description field supports some limited markup, which appears to be a subset of Markdown.&lt;br /&gt;
&lt;br /&gt;
* Starting a line with a single pound sign (#) renders the line in larger, title-style text, but two pound signs (##) do not produce a different font size, as they would in standard Markdown.&lt;br /&gt;
&lt;br /&gt;
* Dashes can be used to denote elements in an unordered list (though I haven't been able to get nested lists to work).&lt;br /&gt;
&lt;br /&gt;
* Use backticks to indicate that a monospace font should be used to represent code, like `this`.&lt;br /&gt;
&lt;br /&gt;
Images, links, bold and italic text all work.&lt;br /&gt;
&lt;br /&gt;
It seems to be limited to the [https://www.markdownguide.org/cheat-sheet/#basic-syntax original, very basic specification of Markdown].&lt;br /&gt;
&lt;br /&gt;
== Changes that can be made through the backend ==&lt;br /&gt;
=== Configuring the CKAN server ===&lt;br /&gt;
(The contents of this section were initially taken from the &amp;lt;code&amp;gt;ORIENTATION&amp;lt;/code&amp;gt; file in &amp;lt;code&amp;gt;/home/ubuntu&amp;lt;/code&amp;gt; on the CKAN production server.)&lt;br /&gt;
&lt;br /&gt;
* The main CKAN config file is at &amp;lt;code&amp;gt;/etc/ckan/default/production.ini&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To monitor HTTP requests in real-time: &amp;lt;code&amp;gt;&amp;gt; tail -f /var/log/nginx/access.log&amp;lt;/code&amp;gt;&lt;br /&gt;
  &lt;br /&gt;
* Service-worker activity (like the Express Loader uploading files to the datastore and background geocoding) can be found in: &amp;lt;code&amp;gt;/var/log/ckan-worker.log&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Edit templates here (changes to templates should show up when reloading the relevant web pages): &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/ckanext/wprdc/templates&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;templates/terms.html&amp;lt;/code&amp;gt; is the source for the pop-up version of the Terms of Use. There appears to be no template linked to the &amp;quot;Terms&amp;quot; hyperlink.&lt;br /&gt;
&lt;br /&gt;
* Create a file &amp;lt;code&amp;gt;templates/foo.html&amp;lt;/code&amp;gt; and then run &amp;lt;code&amp;gt;&amp;gt; sudo service supervisor restart&amp;lt;/code&amp;gt; and THEN load &amp;lt;code&amp;gt;data.wprdc.org/foo.html&amp;lt;/code&amp;gt; in your browser, and the page will be there.&lt;br /&gt;
&lt;br /&gt;
* Presumably &amp;lt;code&amp;gt;data.wprdc.org/foo/&amp;lt;/code&amp;gt; can be populated by creating a file at &amp;lt;code&amp;gt;templates/foo/index.html&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Managing the CKAN server ===&lt;br /&gt;
* To restart the Express Loader: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl restart ckan-worker:*&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To edit the background worker configuration (including increasing the number of background workers):&lt;br /&gt;
*# Edit the config file: &amp;lt;code&amp;gt;&amp;gt; vi /etc/supervisor/conf.d/supervisor-ckan-worker.conf&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Tell Supervisor to use the new configuration: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl reread&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Update the deployed configuration to start the desired number of workers: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl update&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Activate the virtual environment that lets you run &amp;lt;code&amp;gt;paster&amp;lt;/code&amp;gt; commands: &amp;lt;code&amp;gt;&amp;gt; . /usr/lib/ckan/default/bin/activate&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing the Docker containers ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt; &amp;gt; cd docker-ckan&lt;br /&gt;
&lt;br /&gt;
&amp;gt; docker ps&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
should return a list of running containers, which should include the following container names: &amp;lt;code&amp;gt;datapusher-plus&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;ckan&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;solr&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;redis&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Use&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; sudo docker-compose logs --tail=100 ckan&amp;lt;/code&amp;gt;&lt;br /&gt;
to show the last 100 lines of the log for the &amp;lt;code&amp;gt;ckan&amp;lt;/code&amp;gt; Docker instance.&lt;br /&gt;
&lt;br /&gt;
=== Adding/changing departments of publishers ===&lt;br /&gt;
To add or change the departments belonging to a particular publisher organization, edit the &amp;lt;code&amp;gt;dataset_schema.json&amp;lt;/code&amp;gt; file: &amp;lt;code&amp;gt;&amp;gt; vi /usr/lib/ckan/default/src/ckanext-scheming/ckanext/scheming/dataset_schema.json&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then run &amp;lt;code&amp;gt;&amp;gt; sudo service apache2 reload&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The extra tricky part about this one is that [https://github.com/WPRDC/ckanext-wprdctheme our GitHub repository that includes this JSON file] is installed in a different directory (&amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/&amp;lt;/code&amp;gt;), but changes to the files in that directory (and its subdirectories) do nothing.&lt;br /&gt;
&lt;br /&gt;
=== Dealing with inadequate disk space ===&lt;br /&gt;
Once, we were seeing &amp;lt;code&amp;gt;OSError: write error&amp;lt;/code&amp;gt; in the CKAN Docker logs and had to increase disk space to make CKAN function again.&lt;br /&gt;
&lt;br /&gt;
Steve on increasing volume size on AWS:&lt;br /&gt;
&amp;gt; if you ever have to increase a volume, here’s what i followed to make the filesystem use the new space: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html&lt;br /&gt;
that server was a &amp;lt;code&amp;gt;Xen&amp;lt;/code&amp;gt; instance&lt;br /&gt;
&lt;br /&gt;
== Other changes ==&lt;br /&gt;
=== Using CKAN metadata instead of local caches ===&lt;br /&gt;
To avoid keeping local databases about datasets (for instance, when writing code to track some aspect of datasets), store that information (such as the last time an ETL job was run on a given package) in the 'extras' metadata field of the CKAN package, as much as possible. This keeps the information in a centralized location, so ETL jobs can be run from multiple computers without any other coordination. The extras metadata fields are cataloged on the [[CKAN Metadata]] page.&lt;br /&gt;
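&lt;br /&gt;
For example, here's a minimal sketch of reading and updating an extras field with the [https://github.com/ckan/ckanapi ckanapi] client (the API key, package ID, and &amp;lt;code&amp;gt;last_etl_update&amp;lt;/code&amp;gt; key are illustrative assumptions):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import datetime&lt;br /&gt;
import ckanapi&lt;br /&gt;
&lt;br /&gt;
API_KEY = 'YOUR-CKAN-API-KEY'  # assumption: a key with edit rights&lt;br /&gt;
ckan = ckanapi.RemoteCKAN('https://data.wprdc.org', apikey=API_KEY)&lt;br /&gt;
package = ckan.action.package_show(id='some-package-id')&lt;br /&gt;
extras = {e['key']: e['value'] for e in package.get('extras', [])}&lt;br /&gt;
print(extras.get('last_etl_update'))  # read the stored value&lt;br /&gt;
&lt;br /&gt;
# Record the time of this ETL run back to the package metadata.&lt;br /&gt;
extras['last_etl_update'] = datetime.datetime.utcnow().isoformat()&lt;br /&gt;
ckan.action.package_patch(&lt;br /&gt;
    id='some-package-id',&lt;br /&gt;
    extras=[{'key': k, 'value': v} for k, v in extras.items()],&lt;br /&gt;
)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;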
&lt;br /&gt;
=== Hacky workaround for adding new users to publishers ===&lt;br /&gt;
In addition to adding the users to the organizations through the CKAN front-end, you also have to add them to groups, using this URL: [http://wprdc.org/group-adder/]&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14387</id>
		<title>CKAN Administration</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14387"/>
		<updated>2023-09-05T16:36:59Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add tips on managing Docker containers and increasing disk space under AWS&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Changes that can be made through the frontend ==&lt;br /&gt;
There's a lot of documentation on publishing data on our CKAN portal [https://github.com/WPRDC/data-guide/tree/master/docs here].&lt;br /&gt;
&lt;br /&gt;
A few samples (to eventually migrate over):&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/PublishingCKAN.md Our documentation for publishers on publishing data on the WPRDC]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/data_dictionaries.md How to create data dictionaries] ==&amp;gt; [[Data Dictionaries]]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/metadata_extras.md Some of our standard extra metadata fields]&lt;br /&gt;
&lt;br /&gt;
=== Canonical Views ===&lt;br /&gt;
If you want a map or data table to appear on the dataset landing page, you create a corresponding &amp;quot;view&amp;quot; under one of the resources in the dataset and then click the &amp;quot;Canonical View&amp;quot; button for that view. The catch is that CKAN does not enforce that only one view may be canonical, so if multiple views have their &amp;quot;Canonical View&amp;quot; buttons depressed, CKAN will pick one of them to display, and you will have to unclick the others to get the view you want onto the dataset landing page.&lt;br /&gt;
&lt;br /&gt;
=== Writing dataset descriptions ===&lt;br /&gt;
The description field supports some limited markup, which appears to be a subset of Markdown.&lt;br /&gt;
&lt;br /&gt;
* Starting a line with a single pound sign (#) renders the line in bigger, title-style text, but two pound signs (##) do not give a different font size, as they would in standard Markdown.&lt;br /&gt;
&lt;br /&gt;
* Dashes can be used to denote elements in an unordered list (though I haven't been able to get nested lists to work).&lt;br /&gt;
&lt;br /&gt;
* Use backticks to indicate that a sans serif font should be used to represent code, like `this`.&lt;br /&gt;
&lt;br /&gt;
Images, links, bold and italic text all work.&lt;br /&gt;
&lt;br /&gt;
It seems like it's limited to the original, [https://www.markdownguide.org/cheat-sheet/#basic-syntax very basic specification of Markdown].&lt;br /&gt;
&lt;br /&gt;
== Changes that can be made through the backend ==&lt;br /&gt;
=== Configuring the CKAN server ===&lt;br /&gt;
(The contents of this section were initially taken from the &amp;lt;code&amp;gt;ORIENTATION&amp;lt;/code&amp;gt; file in &amp;lt;code&amp;gt;/home/ubuntu&amp;lt;/code&amp;gt; on the CKAN production server.)&lt;br /&gt;
&lt;br /&gt;
* The main CKAN config file is at &amp;lt;code&amp;gt;/etc/ckan/default/production.ini&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To monitor HTTP requests in real time: &amp;lt;code&amp;gt;&amp;gt; tail -f /var/log/nginx/access.log&amp;lt;/code&amp;gt;&lt;br /&gt;
  &lt;br /&gt;
* Service-worker activity (like the Express Loader uploading files to the datastore and background geocoding) can be found in: &amp;lt;code&amp;gt;/var/log/ckan-worker.log&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Edit templates here (changes to templates should show up when reloading the relevant web pages): &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/ckanext/wprdc/templates&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;templates/terms.html&amp;lt;/code&amp;gt; is the source for the pop-up version of the Terms of Use. There appears to be no template linked to the &amp;quot;Terms&amp;quot; hyperlink.&lt;br /&gt;
&lt;br /&gt;
* Create a file &amp;lt;code&amp;gt;templates/foo.html&amp;lt;/code&amp;gt; and then run &amp;lt;code&amp;gt;&amp;gt; sudo service supervisor restart&amp;lt;/code&amp;gt; and THEN load &amp;lt;code&amp;gt;data.wprdc.org/foo.html&amp;lt;/code&amp;gt; in your browser, and the page will be there.&lt;br /&gt;
&lt;br /&gt;
* Presumably &amp;lt;code&amp;gt;data.wprdc.org/foo/&amp;lt;/code&amp;gt; can be populated by creating a file at &amp;lt;code&amp;gt;templates/foo/index.html&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Managing the CKAN server ===&lt;br /&gt;
* To restart the Express Loader: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl restart ckan-worker:*&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To edit the background worker configuration (including increasing the number of background workers):&lt;br /&gt;
*# Edit the config file: &amp;lt;code&amp;gt;&amp;gt; vi /etc/supervisor/conf.d/supervisor-ckan-worker.conf&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Tell Supervisor to use the new configuration: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl reread&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Update the deployed configuration to start the desired number of workers: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl update&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Activate the virtual environment that lets you run &amp;lt;code&amp;gt;paster&amp;lt;/code&amp;gt; commands: &amp;lt;code&amp;gt;&amp;gt; . /usr/lib/ckan/default/bin/activate&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing the Docker containers ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; cd docker-ckan&lt;br /&gt;
&amp;gt; docker ps&amp;lt;/code&amp;gt;&lt;br /&gt;
should return a list of running containers, which should include the following container names: &amp;quot;datapusher-plus&amp;quot;, &amp;quot;ckan&amp;quot;, &amp;quot;solr&amp;quot;, and &amp;quot;redis&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Use&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; sudo docker-compose logs --tail=100 ckan&amp;lt;/code&amp;gt;&lt;br /&gt;
to show the last 100 lines of the log for the &amp;lt;code&amp;gt;ckan&amp;lt;/code&amp;gt; Docker instance.&lt;br /&gt;
&lt;br /&gt;
=== Adding/changing departments of publishers ===&lt;br /&gt;
To add or change the departments belonging to a particular publisher organization, edit the &amp;lt;code&amp;gt;dataset_schema.json&amp;lt;/code&amp;gt; file: &amp;lt;code&amp;gt;&amp;gt; vi /usr/lib/ckan/default/src/ckanext-scheming/ckanext/scheming/dataset_schema.json&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then run &amp;lt;code&amp;gt;&amp;gt; sudo service apache2 reload&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The extra tricky part about this one is that [https://github.com/WPRDC/ckanext-wprdctheme our GitHub repository that includes this JSON file] is installed in a different directory: &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/&amp;lt;/code&amp;gt; but changes to the files in that directory (and subdirectories) do nothing.&lt;br /&gt;
&lt;br /&gt;
=== Dealing with inadequate disk space ===&lt;br /&gt;
Once, we were seeing &amp;lt;code&amp;gt;OSError: write error&amp;lt;/code&amp;gt; in the CKAN Docker logs and had to increase disk space to make CKAN function again.&lt;br /&gt;
&lt;br /&gt;
Steve on increasing volume size on AWS:&lt;br /&gt;
&amp;gt; if you ever have to increase a volume, here’s what i followed to make the filesystem use the new space: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html&lt;br /&gt;
that server was a Xen instance&lt;br /&gt;
&lt;br /&gt;
== Other changes ==&lt;br /&gt;
=== Using CKAN metadata instead of local caches ===&lt;br /&gt;
To avoid keeping local databases about datasets (for instance, when writing code to track some aspect of datasets), store that information (such as the last time an ETL job was run on a given package) in the 'extras' metadata field of the CKAN package, as much as possible. This keeps the information in a centralized location, so ETL jobs can be run from multiple computers without any other coordination. The extras metadata fields are cataloged on the [[CKAN Metadata]] page.&lt;br /&gt;
&lt;br /&gt;
=== Hacky workaround for adding new users to publishers ===&lt;br /&gt;
In addition to adding the users to the organizations through the CKAN front-end, you also have to add them to groups, using this URL: [http://wprdc.org/group-adder/]&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Crime,_Courts,_and_Corrections_in_the_City_of_Pittsburgh&amp;diff=14386</id>
		<title>Crime, Courts, and Corrections in the City of Pittsburgh</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Crime,_Courts,_and_Corrections_in_the_City_of_Pittsburgh&amp;diff=14386"/>
		<updated>2023-08-15T15:55:33Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* City of Pittsburgh Annual Police Statistical Reports */ Add newer reports and link to list of all reports&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This guide to Crime, Courts, and Corrections was created to make it easier to find, understand, and use information about public safety in the City of Pittsburgh, and the criminal justice and corrections systems in Allegheny County and the State of Pennsylvania. While public agencies are now sharing a growing amount of data, this information does not have value to most people without context or an understanding of underlying systems, laws, policies, and processes.&lt;br /&gt;
&lt;br /&gt;
Release of the data in this document allows the City to be a participant in the National Police Data Initiative. Founded in 2015, the Initiative is an outgrowth of President Obama’s Task Force on 21st Century Policing designed to reduce crime by building trust between citizens and police. Information sharing is viewed as a major trust-building step law enforcement organizations can take, and over 50 jurisdictions are now participating in the Initiative. For more information, please see the National Police Data Initiative’s FAQ.&lt;br /&gt;
&lt;br /&gt;
The Western Pennsylvania Regional Data Center is pleased to have worked with our City and County partners to publish much of the information referenced in this guide as open data. We’re also grateful to have had their help in assembling and editing this document. This guide also illustrates a critical partnership between the City’s Bureau of Police and the County Department of Human Services. Many of the tools featured in this guide were developed through this City-County collaboration.&lt;br /&gt;
&lt;br /&gt;
This guide was inspired by Crime and Punishment in Chicago, a Website developed by the Smart Chicago Collaborative.&lt;br /&gt;
&lt;br /&gt;
= Tools =&lt;br /&gt;
A number of tools have been developed to make it easier for people to use and understand the data contained in this guide, and a growing number of these tools are being developed by public sector staff at the City and County, the university community, journalists, and civic technologists. We’ve included links to some of them here. Please let us know if there are others that should be included in this list.&lt;br /&gt;
&lt;br /&gt;
=== Allegheny County Jail Population Management Dashboards ===&lt;br /&gt;
Allegheny County developed a series of [https://tableau.alleghenycounty.us/t/PublicSite/views/AC_JailPopulationManagement_Final/Home?iframeSizedToWindow=true&amp;amp;:embed=y&amp;amp;:showAppBanner=false&amp;amp;:display_count=no&amp;amp;:showVizHome=no publicly available, interactive dashboards] to allow users to explore different aspects of the Allegheny County Jail population. Dashboards provide information on the daily population, alternative housing, bookings, releases, length of stay, service usage, and involvement with the justice system prior to booking.&lt;br /&gt;
&lt;br /&gt;
=== City of Pittsburgh Annual Police Statistical Reports ===&lt;br /&gt;
&lt;br /&gt;
* [https://apps.pittsburghpa.gov/redtail/images/18173_2021_Annual_Report_Final.pdf 2021 Annual Report]&lt;br /&gt;
* [https://apps.pittsburghpa.gov/redtail/images/14012_FINAL_DRAFT_7_Annual_Report_2020.pdf 2020 Annual Report]&lt;br /&gt;
* [https://apps.pittsburghpa.gov/redtail/images/9640_2019_Annual_Report_Final.pdf 2019 Annual Report]&lt;br /&gt;
* [https://apps.pittsburghpa.gov/redtail/images/6371_2018_Annual_Report_Draft_-_Final.pdf 2018 Annual Report]&lt;br /&gt;
* [https://apps.pittsburghpa.gov/redtail/images/4801_2017_Annual_Report_Final_1.3.19.pdf 2017 Annual Report]&lt;br /&gt;
* [https://apps.pittsburghpa.gov/redtail/images/1236_2016_Annual_Report.pdf 2016 Annual Report]&lt;br /&gt;
* [http://apps.pittsburghpa.gov/pghbop/ANNUAL_REPORT_DRAFT_2015_May_31.pdf 2015 Annual Report]&lt;br /&gt;
* [http://apps.pittsburghpa.gov/pghbop/2014_Annual_Report_Final_draft.pdf 2014 Annual Report]&lt;br /&gt;
* [http://apps.pittsburghpa.gov/dps/2013_Annual_Report_draft_(final).pdf 2013 Annual Report]&lt;br /&gt;
* [http://apps.pittsburghpa.gov/pghbop/2012_Annual_Report_v2.pdf 2012 Annual Report]&lt;br /&gt;
&lt;br /&gt;
[https://pittsburghpa.gov/police/police-reports List of all Pittsburgh Annual Police Statistical Reports]&lt;br /&gt;
&lt;br /&gt;
=== Homicides ===&lt;br /&gt;
Homicide information in the [https://tableau.alleghenycounty.us/t/PublicSite/views/CJ_Homicides_PGH_8-22-17_v2/Home?%3Aembed=y&amp;amp;%3AshowAppBanner=false&amp;amp;%3AshowShareOptions=true&amp;amp;%3Adisplay_count=no&amp;amp;%3AshowVizHome=no City of Pittsburgh] and [https://tableau.alleghenycounty.us/t/PublicSite/views/Homicides_In_Allegheny_County/HomicideTrends?iframeSizedToWindow=true&amp;amp;%3Aembed=y&amp;amp;%3AshowAppBanner=false&amp;amp;%3Adisplay_count=no&amp;amp;%3AshowVizHome=no Allegheny County] are available through two separate dashboards. Data for the City dates back to 2010, with data from the County available as early as 2007. These tools include data on the number of homicides, clearance rates, weapons used, time of the incident, day of the week, demographics on victims and perpetrators, and benchmarks to other cities. They were developed through a collaboration of the Allegheny County Department of Human Services and the Allegheny County Office of the Medical Examiner.&lt;br /&gt;
&lt;br /&gt;
=== Gun Violence ===&lt;br /&gt;
Learn more about gun violence in your neighborhood on [https://tableau.alleghenycounty.us/t/PublicSite/views/CJ_GunViolence_PGH_8-22-17_v2/Home?:embed=y&amp;amp;:showAppBanner=false&amp;amp;:showShareOptions=true&amp;amp;:display_count=no&amp;amp;:showVizHome=no this dashboard developed by the Allegheny County Department of Human Services and the City of Pittsburgh]. This tool includes data on shootings, aggravated assaults involving a gun, and shots fired. The data, available back to 2010, also includes information on the time of day shootings occur.&lt;br /&gt;
&lt;br /&gt;
=== Overall Trends in Violence ===&lt;br /&gt;
[https://tableau.alleghenycounty.us/t/PublicSite/views/CJ_Overall_Violence_Trends_PGH_8-22-17_v2/Home?:embed=y&amp;amp;:showAppBanner=false&amp;amp;:showShareOptions=true&amp;amp;:display_count=no&amp;amp;:showVizHome=no This dashboard] contains information on trends in violence by neighborhood back to 2010. The tool contains information on homicides, shootings, assaults, and calls for shots fired, and also includes information on the number of incidents involving the use of a firearm. It was developed through a collaboration of the Allegheny County Department of Human Services and the City of Pittsburgh.&lt;br /&gt;
&lt;br /&gt;
=== Crime in Pittsburgh ===&lt;br /&gt;
The [https://tableau.alleghenycounty.us/t/PublicSite/views/CJ_UCR_PGH_8-22-17_v3/Home_1?iframeSizedToWindow=true&amp;amp;%3Aembed=y&amp;amp;%3AshowAppBanner=false&amp;amp;%3Adisplay_count=no&amp;amp;%3AshowVizHome=no&amp;amp;%3Aorigin=viz_share_link Crime Dashboard] provides information on the types of incidents occurring in your neighborhood. The tool, developed through a collaboration of the Allegheny County Department of Human Services and the City of Pittsburgh, also includes information on the number of crimes cleared, the times that incidents occur, and the victims, back to 2005.&lt;br /&gt;
&lt;br /&gt;
=== Burgh's Eye View ===&lt;br /&gt;
With [https://pittsburghpa.shinyapps.io/BurghsEyeView/?_inputs_&amp;amp;dates=%5B%222017-02-14%22%2C%222017-02-24%22%5D&amp;amp;dept_select=null&amp;amp;filter_select=%22%22&amp;amp;GetScreenWidth=1920&amp;amp;hier=null&amp;amp;origin_select=null&amp;amp;report_ Burgh's Eye View] you can easily see all kinds of data about Pittsburgh, including 311 requests, building permits, code violations, and public safety incidents. But City data isn't a log of when and where we put our parking chairs. It's a huge collection of information about the world around us that can help us understand what's happening in our neighborhoods, and lead us to ideas and decisions that can make where we live better.&lt;br /&gt;
&lt;br /&gt;
= Data =&lt;br /&gt;
This guide contains information on the types of data describing aspects of the policing, criminal justice, and corrections systems in the City of Pittsburgh, Allegheny County, and Pennsylvania. Each of the categories in this section of the guide describes the types of data that are available, provides links to where the data can be found, and offers additional context to help people use and interpret the data accurately and responsibly.&lt;br /&gt;
&lt;br /&gt;
=== Victimization ===&lt;br /&gt;
[[Crime, Courts, and Corrections Guide:Victimization]]&lt;br /&gt;
&lt;br /&gt;
Data on the victims of reported homicides in Pittsburgh is available through an interactive data visualization developed through a partnership of the City of Pittsburgh and Allegheny County Department of Human Services.&lt;br /&gt;
&lt;br /&gt;
=== Incidents ===&lt;br /&gt;
[[Crime, Courts, and Corrections Guide:Incidents]]&lt;br /&gt;
&lt;br /&gt;
Crime incident reports are often created following a police investigation. Incident reports may also be generated from calls for service.&lt;br /&gt;
&lt;br /&gt;
'''''Police Incident Blotter data updates every day, and can be found on the Regional Data Center’s open data portal.'''''&lt;br /&gt;
&lt;br /&gt;
=== Prison ===&lt;br /&gt;
[[Crime, Courts, and Corrections Guide:Prison]]&lt;br /&gt;
&lt;br /&gt;
The Pennsylvania Department of Corrections manages 25 state correctional institutions and a number of other correctional facilities in Pennsylvania. Data on the monthly prison population by facility, and annual admissions and releases by county are available, along with information on inmates through the State’s Inmate locator tool.&lt;br /&gt;
&lt;br /&gt;
=== Non-Traffic Citations ===&lt;br /&gt;
[[Crime, Courts, and Corrections Guide:Non-Traffic Citations]]&lt;br /&gt;
&lt;br /&gt;
Non-Traffic Citations in Pittsburgh are given for minor criminal offenses, and are often called summary offenses. The types of offenses that often result in a citation include loitering, disorderly conduct, harassment, public drunkenness, and low-level retail theft.&lt;br /&gt;
&lt;br /&gt;
'''''Non-Traffic Citation data updates every day, and can be found on the Regional Data Center’s open data portal.'''''&lt;br /&gt;
&lt;br /&gt;
=== Jail ===&lt;br /&gt;
[[Crime, Courts, and Corrections Guide:Jail]]&lt;br /&gt;
&lt;br /&gt;
The population of the Allegheny County Jail changes on a daily basis as people are arrested, released on bail, exonerated, sentenced, transferred to another facility, paroled, or released.&lt;br /&gt;
&lt;br /&gt;
'''''Jail Census data updates every day, and can be found on the Regional Data Center’s open data portal.'''''&lt;br /&gt;
&lt;br /&gt;
=== Courts ===&lt;br /&gt;
[[Crime, Courts, and Corrections:Courts]] &lt;br /&gt;
&lt;br /&gt;
The Allegheny County Department of Court Records’ Criminal Division is the custodian of criminal records in Allegheny County. While the county manages this information, the data is owned by the Unified Judicial System of Pennsylvania.&lt;br /&gt;
&lt;br /&gt;
=== Call for Service ===&lt;br /&gt;
[[Crime, Courts, and Corrections:Call for Service]]&lt;br /&gt;
&lt;br /&gt;
In Allegheny County, data on emergency calls for service are managed by the Allegheny County Emergency Services. Non-emergency calls for service are made through the City of Pittsburgh’s 311 system managed by the City’s Department of Innovation and Performance.&lt;br /&gt;
&lt;br /&gt;
'''''Data for non-emergency 311 calls for service is updated hourly, and can be found on the Regional Data Center’s open data portal.'''''&lt;br /&gt;
&lt;br /&gt;
=== Arrests ===&lt;br /&gt;
[[Crime, Courts, and Corrections:Arrests]]&lt;br /&gt;
&lt;br /&gt;
Data on people taken into custody by City of Pittsburgh officers are available as open data. More serious crimes, such as misdemeanors and felony offenses, are more likely to result in arrests; however, arrests may also occur for other reasons, such as parole violations or failure to appear for trial.&lt;br /&gt;
&lt;br /&gt;
'''''Arrest data updates every day, and can be found on the Regional Data Center’s open data portal.'''''&lt;br /&gt;
&lt;br /&gt;
= Policies =&lt;br /&gt;
Policies and procedures govern the way officers interact with citizens, report incidents, and ensure accountability. Sharing policies and data on police-community interactions demonstrates the desire of City leaders to build trust with residents, and highlights their commitment to transparency.&lt;br /&gt;
&lt;br /&gt;
The City of Pittsburgh is making the following data available on the Regional Data Center’s open data portal:&lt;br /&gt;
&lt;br /&gt;
* [https://data.wprdc.org/dataset/police-community-outreach Police Community Outreach Events]&lt;br /&gt;
* [https://data.wprdc.org/dataset/officer-training Officer Training]&lt;br /&gt;
* [https://data.wprdc.org/dataset/police-civil-actions Police Civil Action Litigation]&lt;br /&gt;
* [http://apps.pittsburghpa.gov/dps/Use_of_Force_in_the_City_of_Pittsburgh.pdf Use of Force]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Crime,_Courts,_and_Corrections_Guide:Incidents&amp;diff=14385</id>
		<title>Crime, Courts, and Corrections Guide:Incidents</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Crime,_Courts,_and_Corrections_Guide:Incidents&amp;diff=14385"/>
		<updated>2023-08-15T15:26:17Z</updated>

		<summary type="html">&lt;p&gt;DRW: Fix typo&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;''This guide is part of a larger guide on [[Crime, Courts, and Corrections in the City of Pittsburgh]].''&lt;br /&gt;
&lt;br /&gt;
Crime incident reports are often created following a police investigation. Incident reports may also be generated from calls for service.&lt;br /&gt;
&lt;br /&gt;
Incident data is published on a nightly basis by the City of Pittsburgh. The quality of incident level data may improve over time, and data published on the open data portal will reflect these changes. The data is presented in two separate files on the open data portal.&lt;br /&gt;
&lt;br /&gt;
* The Blotter will contain only the previous thirty days of reported crimes. The initial incident data can often change in the month following the initial incident report. Records older than thirty days will be deleted from this file, and valid incidents will be moved to the archived dataset. Appropriate use of this file includes notifying community members of recent incidents.&lt;br /&gt;
* Once quality control and coding procedures have been run against the data by the Police Bureau, the data will be published to the archived data file thirty days after the initial report. Data in the archived file will be of higher quality and is the file most appropriate for reporting crime statistics. Data within this file is also subject to change. Archived data (2005-2015) is also being shared as part of this release.&lt;br /&gt;
&lt;br /&gt;
Pittsburgh Police do respond to incidents outside the City borders from time to time. For this reason, some incidents are mapped to locations outside the City. These can occur when City police assist police in other jurisdictions, when City police pursue a suspect across the City line to another municipality, and when a City officer happens to respond to an incident while outside the City.&lt;br /&gt;
&lt;br /&gt;
=== What's Included in the Data ===&lt;br /&gt;
&lt;br /&gt;
===== Publicly Available =====&lt;br /&gt;
&lt;br /&gt;
* Location generalized to a block (police zone for sex crimes)&lt;br /&gt;
* Type of incident&lt;br /&gt;
* Date&lt;br /&gt;
* Time&lt;br /&gt;
* Whether or not the incident has been cleared&lt;br /&gt;
&lt;br /&gt;
===== Not Publicly Available =====&lt;br /&gt;
&lt;br /&gt;
* Victim’s identity&lt;br /&gt;
* Actual incident location&lt;br /&gt;
&lt;br /&gt;
=== Where to Find the Data ===&lt;br /&gt;
&lt;br /&gt;
* Most-recent [https://data.wprdc.org/dataset/police-incident-blotter 30 days of incident data (The Blotter)] is published through the Western Pennsylvania Regional Data Center and is updated daily.&lt;br /&gt;
* [https://data.wprdc.org/dataset/uniform-crime-reporting-data Uniform Crime Reporting Data Recent Archive (Over 30 days old)] is published through the Western Pennsylvania Regional Data Center and is updated daily.&lt;br /&gt;
* [https://pittsburghpa.gov/publicsafety/index.html Pittsburgh Public Safety Press Releases] contain additional information on selected incidents&lt;br /&gt;
* This data is also featured on the City of Pittsburgh's [https://pittsburghpa.shinyapps.io/BurghsEyeView/ Burgh's Eye View] mapping tool&lt;br /&gt;
&lt;br /&gt;
=== Things to Know ===&lt;br /&gt;
&lt;br /&gt;
* Sex crimes will only be reported at the police zone level to protect victim confidentiality. All other crimes will be reported at the block level (based on street address range).&lt;br /&gt;
* Incident data is published using the UCR hierarchical classification system developed by the FBI. Multiple crimes may be included in the same incident, and incidents are coded by the highest-level offense.&lt;br /&gt;
* Unfounded incidents will be removed from the database. As the status or classification of an incident changes, these changes will be reflected on the open data portal. When using data, it is a good practice to cite when it was accessed.&lt;br /&gt;
* Incidents solely reported by other police departments operating in the City (campus police, Port Authority, etc.) are not captured in this data.&lt;br /&gt;
* Archived data (2005-15) may first be published without coordinates, but coordinates will be made available as additional geocoding processes are run on the data.&lt;br /&gt;
* Not all incidents are reported. Incident reporting rates may vary from one community or person to the next.&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Data_Guides&amp;diff=14384</id>
		<title>Data Guides</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Data_Guides&amp;diff=14384"/>
		<updated>2023-08-10T16:33:47Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* WPRDC Data Guides */ Add link to Property Assessments Data Guide&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Links to detailed documents that provide social and contextual information about data.&lt;br /&gt;
&lt;br /&gt;
== WPRDC Data Guides ==&lt;br /&gt;
[[DASH Data Guide]]&lt;br /&gt;
&lt;br /&gt;
[[Guide to Property and Housing Data]]&lt;br /&gt;
&lt;br /&gt;
[[Crime, Courts, and Corrections in the City of Pittsburgh]]&lt;br /&gt;
&lt;br /&gt;
[[City of Pittsburgh 311 Data User Guide]]&lt;br /&gt;
&lt;br /&gt;
[[Allegheny County Property Assessment Data User Guide]]&lt;br /&gt;
&lt;br /&gt;
[https://data.wprdc.org/dataset/2b3df818-601e-4f06-b150-643557229491/resource/cc4bafd2-25b6-41d7-83aa-d16bc211b020/download/allegheny-county-property-assessment-data-user-guide.pdf Property Assessments Data User Guide (pdf)]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Tooling&amp;diff=14383</id>
		<title>Tooling</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Tooling&amp;diff=14383"/>
		<updated>2023-07-20T19:07:11Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* Data tools */  Add &amp;quot;Modern CSV&amp;quot; and tips for safely editing CSVs in Excel&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Data tools ==&lt;br /&gt;
* [https://www.visidata.org/ VisiData] - Terminal user interface for a data exploration/manipulation tool that can handle large datasets.&lt;br /&gt;
** If you REALLY don't want to use VisiData because you don't want to use the terminal, [https://www.moderncsv.com/ Modern CSV] looks like a decent CSV editor that can handle large files and won't mangle your CSV file (unlike Excel).&lt;br /&gt;
*** If, in a pinch, you absolutely have to use Excel to edit a CSV file (which you really shouldn't): 1) make a copy of your CSV file, saving it as a text file (with the &amp;quot;txt&amp;quot; extension); 2) open it in Excel using the text import function; 3) make all of the fields text fields; 4) edit the file in Excel; and 5) when finished, use &amp;quot;Save as...&amp;quot; to save it as a CSV file. (Or skip the spreadsheet entirely and script the edit; see the sketch after this list.)&lt;br /&gt;
* [https://github.com/jqnatividad/qsv qsv] - &amp;quot;command line program for indexing, slicing, analyzing, splitting, enriching, validating &amp;amp; joining CSV files. Commands are simple, fast and composable.&amp;quot; Forked from xsv by CKAN Joel, so it's got some CKAN-specific features in the works.&lt;br /&gt;
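&lt;br /&gt;
If you'd rather script a CSV edit than risk a spreadsheet mangling it, here's a minimal Python sketch (the file and column names are hypothetical); the &amp;lt;code&amp;gt;csv&amp;lt;/code&amp;gt; module reads every field as text, so codes and leading zeros survive the round trip:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import csv&lt;br /&gt;
&lt;br /&gt;
# Read every field as a string; utf-8-sig also strips any Excel BOM.&lt;br /&gt;
with open('data.csv', newline='', encoding='utf-8-sig') as f:&lt;br /&gt;
    rows = list(csv.DictReader(f))&lt;br /&gt;
&lt;br /&gt;
for row in rows:&lt;br /&gt;
    row['city'] = row['city'].strip()  # example edit&lt;br /&gt;
&lt;br /&gt;
with open('data_edited.csv', 'w', newline='', encoding='utf-8') as f:&lt;br /&gt;
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())&lt;br /&gt;
    writer.writeheader()&lt;br /&gt;
    writer.writerows(rows)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;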
&lt;br /&gt;
=== Anonymization ===&lt;br /&gt;
* [https://www.open-diffix.org/ Open Diffix] - Free, open-source desktop tool (and eventually Postgres extension) for anonymizing data.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14382</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14382"/>
		<updated>2023-06-03T21:27:57Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add &amp;quot;Inventorying ETL jobs&amp;quot; section&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
**  Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server ]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
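&lt;br /&gt;
For example, a reviewed schema might declare code-like columns as strings even though they look numeric (a minimal sketch with hypothetical field names):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from marshmallow import Schema, fields&lt;br /&gt;
&lt;br /&gt;
class AddressSchema(Schema):&lt;br /&gt;
    street_address = fields.String(allow_none=True)&lt;br /&gt;
    # ZIP codes and Census tracts are codes, not numbers; casting&lt;br /&gt;
    # '07001' to an integer would silently drop the leading zero.&lt;br /&gt;
    zip_code = fields.String(allow_none=True)&lt;br /&gt;
    census_tract = fields.String(allow_none=True)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;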
&lt;br /&gt;
=== Schema design ===&lt;br /&gt;
&lt;br /&gt;
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.&lt;br /&gt;
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.&lt;br /&gt;
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;lt;code&amp;gt;FIELD NAME&amp;lt;/code&amp;gt; becomes &amp;lt;code&amp;gt;field_name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;# of pirates&amp;lt;/code&amp;gt; should be changed to &amp;lt;code&amp;gt;number_of_pirates&amp;lt;/code&amp;gt;). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates &amp;lt;code&amp;gt;y&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;lat&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;long&amp;lt;/code&amp;gt;) but &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt; are already being used by other data tables, switch to &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt;.&lt;br /&gt;
# '''Standardize column values.''' Where possible, transform columns to standardize their values. The first step is to look at the histogram of every column (&amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; in VisiData!) and see if anything is irregular. For instance, if the &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; column has 1038 records with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;Pittsburgh&amp;quot; and two with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;PGH&amp;quot;, add to the schema a &amp;lt;code&amp;gt;@pre_dump&amp;lt;/code&amp;gt; decorator function to change all instances of &amp;quot;PGH&amp;quot; to &amp;quot;Pittsburgh&amp;quot; (see the sketch after this list). In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!&lt;br /&gt;
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The &amp;quot;input&amp;quot; should be on the left and the &amp;quot;output&amp;quot; should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. c) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?&lt;br /&gt;
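&lt;br /&gt;
Here's a minimal sketch of the value standardization described in item 4 (the field name and variant spellings are illustrative):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from marshmallow import Schema, fields, pre_dump&lt;br /&gt;
&lt;br /&gt;
class FacilitySchema(Schema):&lt;br /&gt;
    municipality = fields.String(allow_none=True)&lt;br /&gt;
&lt;br /&gt;
    @pre_dump&lt;br /&gt;
    def standardize_values(self, data, **kwargs):&lt;br /&gt;
        # Collapse variant spellings to one canonical value.&lt;br /&gt;
        if data.get('municipality') == 'PGH':&lt;br /&gt;
            data['municipality'] = 'Pittsburgh'&lt;br /&gt;
        return data&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;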
&lt;br /&gt;
''After writing this section, I discovered that some of the ideas above also appear in the Urban Institute's &lt;br /&gt;
[https://www.urban.org/sites/default/files/publication/104296/do-no-harm-guide.pdf Do No Harm Guide: Applying Equity Awareness in Data Visualization]. While it is focused on visualizations, some of its suggestions can be applied to designing data schemas (especially if your data table includes representations of race/ethnicity). The General Recommendations on page 41 provide a good overview.''&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data (see the sketch after this list).&lt;br /&gt;
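&lt;br /&gt;
A minimal sketch of that timestamp handling (assuming Python 3.9+ for &amp;lt;code&amp;gt;zoneinfo&amp;lt;/code&amp;gt;; the field names are illustrative):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from datetime import datetime, timezone&lt;br /&gt;
from zoneinfo import ZoneInfo&lt;br /&gt;
&lt;br /&gt;
utc_dt = datetime.now(timezone.utc)&lt;br /&gt;
local_dt = utc_dt.astimezone(ZoneInfo('America/New_York'))&lt;br /&gt;
&lt;br /&gt;
record = {&lt;br /&gt;
    'utc_timestamp': utc_dt.isoformat(),      # unambiguous; use for primary keys&lt;br /&gt;
    'local_timestamp': local_dt.isoformat(),  # published for readability&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
(Relatedly, opening a file with &amp;lt;code&amp;gt;encoding='utf-8-sig'&amp;lt;/code&amp;gt; in Python strips the byte-order mark described in the first pitfall.)&lt;br /&gt;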
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the 'destination_file' parameter in the dict of parameters that define the job (found in, for instance, the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns, which are a sign either that the source file has only null values in it or that there's an error in your ETL code, often because there's a typo in the name of the field you're trying to load from (how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms the field names can often be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Schema-source comparison ===&lt;br /&gt;
&lt;br /&gt;
When running ETL jobs, you will sometimes see console output indicating that 1) fields in the source file are not being used in the schema or 2) fields in the schema cannot be found in the source file. These are checks to ensure that the schema matches and accounts for all the fields in the source file. (While marshmallow has some support for this kind of checking, we've written our own code for handling these comparisons.)&lt;br /&gt;
&lt;br /&gt;
If there's a field in the source file that you don't want to publish (and you don't want to keep getting console output about it), you can either list it in the &amp;lt;code&amp;gt;exclude&amp;lt;/code&amp;gt; option in the schema's &amp;lt;code&amp;gt;Meta&amp;lt;/code&amp;gt; class, or add the field to the schema but set it to &amp;lt;code&amp;gt;load_only=True&amp;lt;/code&amp;gt; (see the sketch below).&lt;br /&gt;
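&lt;br /&gt;
Here's a minimal sketch of both options (the field names are hypothetical; the schema-source comparison semantics are rocket-etl's own):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from marshmallow import Schema, fields&lt;br /&gt;
&lt;br /&gt;
class PermitSchema(Schema):&lt;br /&gt;
    permit_id = fields.String()&lt;br /&gt;
    issue_date = fields.Date(allow_none=True)&lt;br /&gt;
    # Read from the source file but never dumped to the output:&lt;br /&gt;
    internal_notes = fields.String(load_only=True)&lt;br /&gt;
    # Declared only so it can be excluded below:&lt;br /&gt;
    raw_geometry = fields.String()&lt;br /&gt;
&lt;br /&gt;
    class Meta:&lt;br /&gt;
        exclude = ('raw_geometry',)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;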
&lt;br /&gt;
If there's a field in the schema that is supposed to be published (&amp;quot;dumped&amp;quot; in Marshmallow jargon) and is supposed to be loaded from the source file, but it can't be found in the source file, an error message will be printed to the console. If additionally the job is trying to push data to the CKAN datastore, an exception will be raised.&lt;br /&gt;
&lt;br /&gt;
These checks are really helpful when writing/testing/modifying an ETL job, as they make it easy to find typos in field names or other errors that are preventing source data from getting to the output correctly.&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by writing a cron job: &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt; + duplicate a launchpad line that's already in the crontab file + edit it to run the new ETL job and edit the schedule to match the desired ETL schedule.&lt;br /&gt;
&lt;br /&gt;
== Retiring ETL jobs ==&lt;br /&gt;
Every ETL job has a life cycle. When an ETL job comes to the end of its existence because the data source has disappeared, we designate this dataset &amp;quot;orphaned&amp;quot;. Metadata values for the update frequency fields should be changed to the value that has &amp;quot;Historical&amp;quot; in its description. The dataset's &amp;quot;_etl&amp;quot; tag should be removed and replaced with the &amp;quot;_orphaned_etl&amp;quot; tag. If there is still a manual inventory of ETL jobs, that inventory should be updated.&lt;br /&gt;
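&lt;br /&gt;
A minimal sketch of the tag swap using the [https://github.com/ckan/ckanapi ckanapi] client (the package ID is illustrative; the update-frequency metadata would be edited similarly):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import ckanapi&lt;br /&gt;
&lt;br /&gt;
API_KEY = 'YOUR-CKAN-API-KEY'  # assumption: a key with edit rights&lt;br /&gt;
ckan = ckanapi.RemoteCKAN('https://data.wprdc.org', apikey=API_KEY)&lt;br /&gt;
package = ckan.action.package_show(id='some-orphaned-dataset')&lt;br /&gt;
&lt;br /&gt;
# Swap the '_etl' tag for '_orphaned_etl'.&lt;br /&gt;
names = [t['name'] for t in package['tags'] if t['name'] != '_etl']&lt;br /&gt;
names.append('_orphaned_etl')&lt;br /&gt;
ckan.action.package_patch(&lt;br /&gt;
    id='some-orphaned-dataset',&lt;br /&gt;
    tags=[{'name': name} for name in names],&lt;br /&gt;
)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;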
&lt;br /&gt;
== Inventorying ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
When you add, delete, or modify ETL jobs, update [https://docs.google.com/spreadsheets/d/1amPsZ1RA9d9m41MXkMmMB8-VDmwAOxU72A35DlSWbT4/edit#gid=0 this Google Sheet]. (Eventually, we'll make ETL jobs self-tracking somehow, probably by adding the location of the script (and any other useful metadata) to the package &amp;lt;code&amp;gt;extras&amp;lt;/code&amp;gt; metadata field.)&lt;br /&gt;
&lt;br /&gt;
Currently, ETL jobs report metadata such as the &amp;lt;code&amp;gt;last_etl_update&amp;lt;/code&amp;gt; date and the source-file hash. The last source file name is even being used in the new Data Rivers/Google Cloud Platform setup to check whether a given file is newer or older than the last one used.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14381</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14381"/>
		<updated>2023-06-03T20:31:41Z</updated>

		<summary type="html">&lt;p&gt;DRW: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
**  Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server ]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
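&lt;br /&gt;
A quick illustration of the leading-zeros problem (the sample value is hypothetical):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Casting a code to an integer destroys leading zeros, and the damage&lt;br /&gt;
# survives the round trip back to a string.&lt;br /&gt;
zip_code = '07001'&lt;br /&gt;
as_integer = int(zip_code)&lt;br /&gt;
print(as_integer)       # 7001 (the leading zero is gone)&lt;br /&gt;
print(str(as_integer))  # '7001', not '07001'&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;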
&lt;br /&gt;
=== Schema design ===&lt;br /&gt;
&lt;br /&gt;
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.&lt;br /&gt;
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.&lt;br /&gt;
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;lt;code&amp;gt;FIELD NAME&amp;lt;/code&amp;gt; becomes &amp;lt;code&amp;gt;field_name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;# of pirates&amp;lt;/code&amp;gt; should be changed to &amp;lt;code&amp;gt;number_of_pirates&amp;lt;/code&amp;gt;). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates &amp;lt;code&amp;gt;y&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;lat&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;long&amp;lt;/code&amp;gt;) but &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt; are already being used by other data tables, switch to &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt;.&lt;br /&gt;
# '''Standardize column values.''' Where possible, transform columns to standardize their values. The first step is to look at the histogram of every column (&amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; in VisiData!) and see if anything is irregular. For instance, if the &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; column has 1038 records with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;Pittsburgh&amp;quot; and two with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;PGH&amp;quot;, add to the schema a &amp;lt;code&amp;gt;@pre_dump&amp;lt;/code&amp;gt;-decorated function to change all instances of &amp;quot;PGH&amp;quot; to &amp;quot;Pittsburgh&amp;quot; (see the sketch after this list). In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!&lt;br /&gt;
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The &amp;quot;input&amp;quot; should be on the left and the &amp;quot;output&amp;quot; should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. c) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?&lt;br /&gt;
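&lt;br /&gt;
A minimal sketch of these conventions in a Marshmallow schema (the field names and values are invented for illustration, and the hook uses marshmallow 3's signature):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Snake-case field names, code-like IDs kept as strings, and a @pre_dump&lt;br /&gt;
# hook that standardizes column values before they are dumped.&lt;br /&gt;
from marshmallow import Schema, fields, pre_dump&lt;br /&gt;
&lt;br /&gt;
class FacilitySchema(Schema):&lt;br /&gt;
    facility_id = fields.String()&lt;br /&gt;
    facility_name = fields.String()&lt;br /&gt;
    municipality = fields.String()&lt;br /&gt;
    latitude = fields.Float()&lt;br /&gt;
    longitude = fields.Float()&lt;br /&gt;
&lt;br /&gt;
    @pre_dump&lt;br /&gt;
    def standardize_values(self, data, **kwargs):&lt;br /&gt;
        if data.get('municipality') == 'PGH':&lt;br /&gt;
            data['municipality'] = 'Pittsburgh'&lt;br /&gt;
        return data&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;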
&lt;br /&gt;
''After writing this section, I discovered that some of the ideas above also appear in the Urban Institute's &lt;br /&gt;
[https://www.urban.org/sites/default/files/publication/104296/do-no-harm-guide.pdf Do No Harm Guide: Applying Equity Awareness in Data Visualization]. While it is focused on visualizations, some of its suggestions can be applied to designing data schemas (especially if your data table includes representations of race/ethnicity). The General Recommendations on page 41 provide a good overview.''&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;. (The first sketch after this list shows a way to guard against stray byte-order marks.)&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour, and another day (prior to 2022) has the same local timestamp twice (demonstrated in the second sketch after this list). The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
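&lt;br /&gt;
A sketch of a BOM-tolerant way to open a CSV file (the file name is a placeholder; the &amp;lt;code&amp;gt;utf-8-sig&amp;lt;/code&amp;gt; codec strips a leading BOM if one is present and is harmless otherwise):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Read a CSV that may begin with a byte-order mark.&lt;br /&gt;
import csv&lt;br /&gt;
&lt;br /&gt;
with open('source.csv', newline='', encoding='utf-8-sig') as f:&lt;br /&gt;
    reader = csv.DictReader(f)&lt;br /&gt;
    print(reader.fieldnames)  # first field name no longer starts with '\ufeff'&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And a demonstration of the repeated-local-timestamp problem (requires Python 3.9+ for zoneinfo; &amp;lt;code&amp;gt;America/New_York&amp;lt;/code&amp;gt; stands in for the local time zone):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# During the fall-back transition, the same wall-clock time occurs twice,&lt;br /&gt;
# so a local timestamp cannot serve as a unique primary key.&lt;br /&gt;
from datetime import datetime, timezone&lt;br /&gt;
from zoneinfo import ZoneInfo&lt;br /&gt;
&lt;br /&gt;
tz = ZoneInfo('America/New_York')&lt;br /&gt;
first = datetime(2021, 11, 7, 1, 30, tzinfo=tz)           # EDT (fold=0)&lt;br /&gt;
second = datetime(2021, 11, 7, 1, 30, tzinfo=tz, fold=1)  # EST (fold=1)&lt;br /&gt;
&lt;br /&gt;
print(first.replace(tzinfo=None) == second.replace(tzinfo=None))  # True: identical local timestamps&lt;br /&gt;
print(first.astimezone(timezone.utc))   # 2021-11-07 05:30:00+00:00&lt;br /&gt;
print(second.astimezone(timezone.utc))  # 2021-11-07 06:30:00+00:00&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;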
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the &amp;lt;code&amp;gt;destination_file&amp;lt;/code&amp;gt; parameter in the dict of parameters that defines the job (found in, for instance, the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns, which indicate either that the source file has only null values in that column or that there's an error in your ETL code (often a typo in the name of the field you're trying to load from; how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms field names can be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan, as in the sketch below, or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
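&lt;br /&gt;
A quick way to do that grouping outside of VisiData (a sketch; the file name and the &amp;lt;code&amp;gt;created_on&amp;lt;/code&amp;gt; timestamp column are assumptions):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Count records per month to visually scan for gaps or spikes.&lt;br /&gt;
import csv&lt;br /&gt;
from collections import Counter&lt;br /&gt;
&lt;br /&gt;
counts = Counter()&lt;br /&gt;
with open('output_files/robopgh/population.csv', newline='') as f:&lt;br /&gt;
    for row in csv.DictReader(f):&lt;br /&gt;
        counts[row['created_on'][:7]] += 1  # 'YYYY-MM' prefix of an ISO timestamp&lt;br /&gt;
&lt;br /&gt;
for month in sorted(counts):&lt;br /&gt;
    print(month, counts[month])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;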
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Schema-source comparison ===&lt;br /&gt;
&lt;br /&gt;
When running ETL jobs, you will sometimes see console output indicating that 1) fields in the source file are not being used in the schema or 2) fields in the schema cannot be found in the source file. These are checks to ensure that the schema matches and accounts for all the fields in the source file. (While marshmallow has some support for operations like these, we've written our own code to handle these comparisons.)&lt;br /&gt;
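&lt;br /&gt;
The underlying idea is a simple set comparison; here's a simplified sketch (not rocket-etl's actual code), with an invented schema and file name:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Compare the CSV header against the schema's declared field names.&lt;br /&gt;
import csv&lt;br /&gt;
from marshmallow import Schema, fields&lt;br /&gt;
&lt;br /&gt;
class PopulationSchema(Schema):&lt;br /&gt;
    neighborhood = fields.String()&lt;br /&gt;
    population = fields.Integer()&lt;br /&gt;
&lt;br /&gt;
with open('source.csv', newline='') as f:&lt;br /&gt;
    source_fields = set(csv.DictReader(f).fieldnames)&lt;br /&gt;
&lt;br /&gt;
schema_fields = set(PopulationSchema().fields)&lt;br /&gt;
print('In the source but not in the schema:', source_fields - schema_fields)&lt;br /&gt;
print('In the schema but not in the source:', schema_fields - source_fields)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;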
&lt;br /&gt;
If there's a field in the source file that you don't want to publish (and you don't want to keep getting console output about it), you can either list it in the &amp;lt;code&amp;gt;exclude&amp;lt;/code&amp;gt; option in the schema's &amp;lt;code&amp;gt;Meta&amp;lt;/code&amp;gt; class or add the field to the schema but set it to &amp;lt;code&amp;gt;load_only=True&amp;lt;/code&amp;gt;. Both options are sketched below.&lt;br /&gt;
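&lt;br /&gt;
A sketch of both options (the field names are invented for illustration):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from marshmallow import Schema, fields&lt;br /&gt;
&lt;br /&gt;
# Option 1: list the unwanted field in Meta.exclude (the field is still&lt;br /&gt;
# declared here so that plain marshmallow accepts the exclude).&lt;br /&gt;
class OptionOneSchema(Schema):&lt;br /&gt;
    facility_name = fields.String()&lt;br /&gt;
    internal_notes = fields.String()&lt;br /&gt;
&lt;br /&gt;
    class Meta:&lt;br /&gt;
        exclude = ('internal_notes',)&lt;br /&gt;
&lt;br /&gt;
# Option 2: declare the field but mark it load-only, so it is never dumped.&lt;br /&gt;
class OptionTwoSchema(Schema):&lt;br /&gt;
    facility_name = fields.String()&lt;br /&gt;
    internal_notes = fields.String(load_only=True)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;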
&lt;br /&gt;
If there's a field in the schema that is supposed to be published (&amp;quot;dumped&amp;quot; in Marshmallow jargon) and is supposed to be loaded from the source file, but it can't be found in the source file, an error message will be printed to the console. If additionally the job is trying to push data to the CKAN datastore, an exception will be raised.&lt;br /&gt;
&lt;br /&gt;
These checks are really helpful when writing/testing/modifying an ETL job, as they make it easy to find typos in field names or other errors that are preventing source data from getting to the output correctly.&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by adding a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, then edit the copy to run the new ETL job and adjust its schedule to match the desired ETL schedule.&lt;br /&gt;
&lt;br /&gt;
== Retiring ETL jobs ==&lt;br /&gt;
Every ETL job has a life cycle. When an ETL job reaches the end of its life because the data source has disappeared, we designate its dataset &amp;quot;orphaned&amp;quot;. The metadata values for the update-frequency fields should be changed to the value that has &amp;quot;Historical&amp;quot; in its description, and the dataset's &amp;quot;_etl&amp;quot; tag should be removed and replaced with the &amp;quot;_orphaned_etl&amp;quot; tag. If there is still a manual inventory of ETL jobs, that inventory should be updated as well.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14380</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14380"/>
		<updated>2023-06-02T17:19:28Z</updated>

		<summary type="html">&lt;p&gt;DRW: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
**  Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server ]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Schema design ===&lt;br /&gt;
&lt;br /&gt;
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.&lt;br /&gt;
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.&lt;br /&gt;
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;lt;code&amp;gt;FIELD NAME&amp;lt;/code&amp;gt; becomes &amp;lt;code&amp;gt;field_name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;# of pirates&amp;lt;/code&amp;gt; should be changed to &amp;lt;code&amp;gt;number_of_pirates&amp;lt;/code&amp;gt;). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates &amp;lt;code&amp;gt;y&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;lat&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;long&amp;lt;/code&amp;gt;) but &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt; are already being used by other data tables, switch to &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt;.&lt;br /&gt;
# '''Standardize column values.''' Where possible transform columns to standardize their values. The first step is to look at the histogram of every column (&amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; in VisiData!) and see if anything is irregular. For instance, if the &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; column has 1038 records with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;Pittsburgh&amp;quot; and two with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;PGH&amp;quot;, add to the schema a @pre_dump decorator function to change all instances of &amp;quot;PGH&amp;quot; to &amp;quot;Pittsburgh&amp;quot;. In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!&lt;br /&gt;
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The &amp;quot;input&amp;quot; should be on the left and the &amp;quot;output&amp;quot; should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. c) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?&lt;br /&gt;
&lt;br /&gt;
''After writing this section, I discovered that some of the ideas above also appear in the Urban Institute's &lt;br /&gt;
[https://www.urban.org/sites/default/files/publication/104296/do-no-harm-guide.pdf Do No Harm Guide: Applying Equity Awareness in Data Visualization]. While it is focused on visualizations, some of its suggestions can be applied to designing data schemas (especially if your data table includes representations of race/ethnicity). The General Recommendations on page 41 provide a good overview.''&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour, and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the &amp;lt;code&amp;gt;destination_file&amp;lt;/code&amp;gt; parameter in the dict of parameters that defines the job (found in, for instance, the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns, which indicate either that the source file has only null values in that column or that there's an error in your ETL code (often a typo in the name of the field you're trying to load from; how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms field names can be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Schema-source comparison ===&lt;br /&gt;
&lt;br /&gt;
When running ETL jobs, you will sometimes see console output indicating that 1) fields in the source file are not being used in the schema or 2) fields in the schema cannot be found in the source file. These are checks to ensure that the schema matches and accounts for all the fields in the source file. (While marshmallow has some support for operations like these, we've written our own code to handle these comparisons.)&lt;br /&gt;
&lt;br /&gt;
If there's a field in the source file that you don't want to publish (and you don't want to keep getting console output about it), you can either list it in the &amp;lt;code&amp;gt;exclude&amp;lt;/code&amp;gt; option in the schema's &amp;lt;code&amp;gt;Meta&amp;lt;/code&amp;gt; class or add the field to the schema but set it to &amp;lt;code&amp;gt;load_only=True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If there's a field in the schema that is supposed to be published (&amp;quot;dumped&amp;quot; in Marshmallow jargon) and is supposed to be loaded from the source file, but it can't be found in the source file, an error message will be printed to the console. If additionally the job is trying to push data to the CKAN datastore, an exception will be raised.&lt;br /&gt;
&lt;br /&gt;
These checks are really helpful when writing/testing/modifying an ETL job, as they make it easy to find typos in field names or other errors that are preventing source data from getting to the output correctly.&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by adding a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, then edit the copy to run the new ETL job and adjust its schedule to match the desired ETL schedule.&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14379</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14379"/>
		<updated>2023-06-02T17:17:56Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* Testing ETL jobs */  Add schema-source comparison section&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
**  Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server ]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Schema design ===&lt;br /&gt;
&lt;br /&gt;
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.&lt;br /&gt;
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.&lt;br /&gt;
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;lt;code&amp;gt;FIELD NAME&amp;lt;/code&amp;gt; becomes &amp;lt;code&amp;gt;field_name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;# of pirates&amp;lt;/code&amp;gt; should be changed to &amp;lt;code&amp;gt;number_of_pirates&amp;lt;/code&amp;gt;). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates &amp;lt;code&amp;gt;y&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;lat&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;long&amp;lt;/code&amp;gt;) but &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt; are already being used by other data tables, switch to &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt;.&lt;br /&gt;
# '''Standardize column values.''' Where possible transform columns to standardize their values. The first step is to look at the histogram of every column (&amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; in VisiData!) and see if anything is irregular. For instance, if the &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; column has 1038 records with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;Pittsburgh&amp;quot; and two with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;PGH&amp;quot;, add to the schema a @pre_dump decorator function to change all instances of &amp;quot;PGH&amp;quot; to &amp;quot;Pittsburgh&amp;quot;. In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!&lt;br /&gt;
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The &amp;quot;input&amp;quot; should be on the left and the &amp;quot;output&amp;quot; should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. c) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?&lt;br /&gt;
&lt;br /&gt;
''After writing this section, I discovered that some of the ideas above also appear in the Urban Institute's &lt;br /&gt;
[https://www.urban.org/sites/default/files/publication/104296/do-no-harm-guide.pdf Do No Harm Guide: Applying Equity Awareness in Data Visualization]. While it is focused on visualizations, some of its suggestions can be applied to designing data schemas (especially if your data table includes representations of race/ethnicity). The General Recommendations on page 41 provide a good overview.''&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour, and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the &amp;lt;code&amp;gt;destination_file&amp;lt;/code&amp;gt; parameter in the dict of parameters that defines the job (found in, for instance, the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns, which indicate either that the source file has only null values in that column or that there's an error in your ETL code (often a typo in the name of the field you're trying to load from; how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms field names can be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Schema-source comparison ===&lt;br /&gt;
&lt;br /&gt;
When running ETL jobs, you will sometimes see console output indicating that 1) fields in the source file are not being used in the schema or 2) fields in the schema cannot be found in the source file. These are checks to ensure that the schema matches and accounts for all the fields in the source file. (While marshmallow has some support for operations like these, we've written our own code to handle these comparisons.)&lt;br /&gt;
&lt;br /&gt;
If there's a field in the source file that you don't want to publish (and you don't want to keep getting console output about it), you can either list it in the &amp;lt;code&amp;gt;exclude&amp;lt;/code&amp;gt; option in the schema's &amp;lt;code&amp;gt;Meta&amp;lt;/code&amp;gt; class or add the field to the schema but set it to &amp;lt;code&amp;gt;load_only=True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If there's a field in the schema that is supposed to be published (&amp;quot;dumped&amp;quot; in Marshmallow jargon) and is supposed to be loaded from the source file, but it can't be found in the source file, an error message will be printed to the console. If additionally the job is trying to push data to the CKAN datastore, an exception will be raised.&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by adding a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, then edit the copy to run the new ETL job and adjust its schedule to match the desired ETL schedule.&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14378</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14378"/>
		<updated>2023-06-02T16:43:50Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* Schema design */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
** Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code that analyzes the data.json file and pulls the desired files over HTTP (see the sketch after this list).&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
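&lt;br /&gt;
For illustration, a minimal Python sketch of the data.json approach (the catalog URL is hypothetical, and rocket-etl's real extraction code is more involved):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import requests&lt;br /&gt;
&lt;br /&gt;
# Fetch a GIS server's data.json catalog (hypothetical URL).&lt;br /&gt;
catalog = requests.get('https://gis.example.org/data.json').json()&lt;br /&gt;
&lt;br /&gt;
# Each dataset lists its downloadable files under 'distribution'.&lt;br /&gt;
for dataset in catalog['dataset']:&lt;br /&gt;
    for dist in dataset.get('distribution', []):&lt;br /&gt;
        if dist.get('mediaType') == 'text/csv':&lt;br /&gt;
            print(dataset['title'], dist['downloadURL'])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;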
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
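&lt;br /&gt;
For illustration, a minimal sketch of the kind of Marshmallow schema this produces (the table and field names are hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from marshmallow import Schema, fields&lt;br /&gt;
&lt;br /&gt;
class PopulationSchema(Schema):  # hypothetical table&lt;br /&gt;
    neighborhood = fields.String()&lt;br /&gt;
    population = fields.Integer()&lt;br /&gt;
    # ZIP codes and Census tract IDs are codes, not numbers:&lt;br /&gt;
    # keep them as strings so leading zeros survive.&lt;br /&gt;
    zip_code = fields.String()&lt;br /&gt;
    census_tract = fields.String()&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;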
&lt;br /&gt;
=== Schema design ===&lt;br /&gt;
&lt;br /&gt;
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.&lt;br /&gt;
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.&lt;br /&gt;
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;lt;code&amp;gt;FIELD NAME&amp;lt;/code&amp;gt; becomes &amp;lt;code&amp;gt;field_name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;# of pirates&amp;lt;/code&amp;gt; should be changed to &amp;lt;code&amp;gt;number_of_pirates&amp;lt;/code&amp;gt;). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates &amp;lt;code&amp;gt;y&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;lat&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;long&amp;lt;/code&amp;gt;) but &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt; are already being used by other data tables, switch to &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt;.&lt;br /&gt;
# '''Standardize column values.''' Where possible, transform columns to standardize their values. The first step is to look at the histogram of every column (&amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; in VisiData!) and see if anything is irregular. For instance, if the &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; column has 1038 records with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;Pittsburgh&amp;quot; and two with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;PGH&amp;quot;, add to the schema a &amp;lt;code&amp;gt;@pre_dump&amp;lt;/code&amp;gt; decorator function to change all instances of &amp;quot;PGH&amp;quot; to &amp;quot;Pittsburgh&amp;quot; (see the sketch after this list). In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!&lt;br /&gt;
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The &amp;quot;input&amp;quot; should be on the left and the &amp;quot;output&amp;quot; should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. c) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?&lt;br /&gt;
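&lt;br /&gt;
A minimal sketch of the &amp;lt;code&amp;gt;@pre_dump&amp;lt;/code&amp;gt; standardization mentioned in step 4, using the marshmallow 2.x hook signature (the schema and field names are hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from marshmallow import Schema, fields, pre_dump&lt;br /&gt;
&lt;br /&gt;
class FacilitySchema(Schema):  # hypothetical schema&lt;br /&gt;
    facility = fields.String()&lt;br /&gt;
    municipality = fields.String()&lt;br /&gt;
&lt;br /&gt;
    @pre_dump&lt;br /&gt;
    def standardize_municipality(self, data):&lt;br /&gt;
        # Collapse the 'PGH' variant into the standard value.&lt;br /&gt;
        if data.get('municipality') == 'PGH':&lt;br /&gt;
            data['municipality'] = 'Pittsburgh'&lt;br /&gt;
        return data&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;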
&lt;br /&gt;
''After writing this section, I discovered that some of the ideas above also appear in the Urban Institute's &lt;br /&gt;
[https://www.urban.org/sites/default/files/publication/104296/do-no-harm-guide.pdf Do No Harm Guide: Applying Equity Awareness in Data Visualization]. While it is focused on visualizations, some of its suggestions can be applied to designing data schemas (especially if your data table includes representations of race/ethnicity). The General Recommendations on page 41 provide a good overview.''&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data (see the sketch after this list).&lt;br /&gt;
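&lt;br /&gt;
A sketch of mitigations for both pitfalls above, in Python (the file name and time zone are assumptions; &amp;lt;code&amp;gt;zoneinfo&amp;lt;/code&amp;gt; requires Python 3.9+):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import csv&lt;br /&gt;
from datetime import datetime, timezone&lt;br /&gt;
from zoneinfo import ZoneInfo&lt;br /&gt;
&lt;br /&gt;
# 'utf-8-sig' silently strips an Excel-style byte-order mark if present.&lt;br /&gt;
with open('source.csv', encoding='utf-8-sig', newline='') as f:&lt;br /&gt;
    rows = list(csv.DictReader(f))&lt;br /&gt;
&lt;br /&gt;
# Keep the unambiguous UTC timestamp (for primary keys) alongside&lt;br /&gt;
# the reader-friendly local one.&lt;br /&gt;
utc_now = datetime.now(timezone.utc)&lt;br /&gt;
local_now = utc_now.astimezone(ZoneInfo('America/New_York'))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;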
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that wrote to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the &amp;lt;code&amp;gt;destination_file&amp;lt;/code&amp;gt; parameter in the dict of parameters that defines the job (found, for instance, in the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
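&lt;br /&gt;
For illustration, a hypothetical fragment of such a job dict (only &amp;lt;code&amp;gt;destination_file&amp;lt;/code&amp;gt; is the point here; the other keys and values are placeholders, and real job dicts carry more parameters):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Hypothetical fragment of robopgh/census.py.&lt;br /&gt;
job_dicts = [&lt;br /&gt;
    {&lt;br /&gt;
        'source_file': 'census.csv',  # placeholder&lt;br /&gt;
        # Override the default output file name:&lt;br /&gt;
        'destination_file': 'census_population.csv',&lt;br /&gt;
    },&lt;br /&gt;
]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;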
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns, which indicate either that the source file contains only null values or that there's an error in your ETL code (often a typo in the name of the field you're trying to load from; how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms field names can be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
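&lt;br /&gt;
One way to do that grouping outside VisiData, assuming a pandas environment and a hypothetical &amp;lt;code&amp;gt;created_at&amp;lt;/code&amp;gt; timestamp column:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import pandas as pd&lt;br /&gt;
&lt;br /&gt;
df = pd.read_csv('output_files/robopgh/population.csv',&lt;br /&gt;
                 parse_dates=['created_at'])  # hypothetical field&lt;br /&gt;
# Count records per year+month; gaps and spikes stand out on a quick scan.&lt;br /&gt;
print(df.groupby(df['created_at'].dt.to_period('M')).size())&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;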
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by writing a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, and edit it to run the new ETL job on the desired schedule.&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14377</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14377"/>
		<updated>2023-06-02T16:40:42Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* Schema design */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
** Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Schema design ===&lt;br /&gt;
&lt;br /&gt;
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.&lt;br /&gt;
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.&lt;br /&gt;
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;lt;code&amp;gt;FIELD NAME&amp;lt;/code&amp;gt; becomes &amp;lt;code&amp;gt;field_name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;# of pirates&amp;lt;/code&amp;gt; should be changed to &amp;lt;code&amp;gt;number_of_pirates&amp;lt;/code&amp;gt;). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates &amp;lt;code&amp;gt;y&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;lat&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;long&amp;lt;/code&amp;gt;) but &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt; are already being used by other data tables, switch to &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt;.&lt;br /&gt;
# '''Standardize column values.''' Where possible, transform columns to standardize their values. The first step is to look at the histogram of every column (&amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; in VisiData!) and see if anything is irregular. For instance, if the &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; column has 1038 records with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;Pittsburgh&amp;quot; and two with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;PGH&amp;quot;, add to the schema a &amp;lt;code&amp;gt;@pre_dump&amp;lt;/code&amp;gt; decorator function to change all instances of &amp;quot;PGH&amp;quot; to &amp;quot;Pittsburgh&amp;quot;. In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!&lt;br /&gt;
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The &amp;quot;input&amp;quot; should be on the left and the &amp;quot;output&amp;quot; should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. c) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?&lt;br /&gt;
&lt;br /&gt;
''After writing this section, I discovered that some of the ideas above also appear in the Urban Institute's &lt;br /&gt;
[https://www.urban.org/sites/default/files/publication/104296/do-no-harm-guide.pdf Do No Harm Guide: Applying Equity Awareness in Data Visualization]. While it is focused on visualizations, some of its suggestions can be applied to designing data schemas, particularly those on page 41 (General Recommendations).''&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that wrote to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the &amp;lt;code&amp;gt;destination_file&amp;lt;/code&amp;gt; parameter in the dict of parameters that defines the job (found, for instance, in the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns, which indicate either that the source file contains only null values or that there's an error in your ETL code (often a typo in the name of the field you're trying to load from; how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms field names can be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by writing a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, and edit it to run the new ETL job on the desired schedule.&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14376</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14376"/>
		<updated>2023-06-02T16:35:10Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* Schema design */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
** Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Schema design ===&lt;br /&gt;
&lt;br /&gt;
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.&lt;br /&gt;
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.&lt;br /&gt;
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;lt;code&amp;gt;FIELD NAME&amp;lt;/code&amp;gt; becomes &amp;lt;code&amp;gt;field_name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;# of pirates&amp;lt;/code&amp;gt; should be changed to &amp;lt;code&amp;gt;number_of_pirates&amp;lt;/code&amp;gt;). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates &amp;lt;code&amp;gt;y&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;lat&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;long&amp;lt;/code&amp;gt;) but &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt; are already being used by other data tables, switch to &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt;.&lt;br /&gt;
# '''Standardize column values.''' Where possible, transform columns to standardize their values. The first step is to look at the histogram of every column (&amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; in VisiData!) and see if anything is irregular. For instance, if the &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; column has 1038 records with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;Pittsburgh&amp;quot; and two with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;PGH&amp;quot;, add to the schema a &amp;lt;code&amp;gt;@pre_dump&amp;lt;/code&amp;gt; decorator function to change all instances of &amp;quot;PGH&amp;quot; to &amp;quot;Pittsburgh&amp;quot;. In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!&lt;br /&gt;
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The &amp;quot;input&amp;quot; should be on the left and the &amp;quot;output&amp;quot; should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. c) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?&lt;br /&gt;
&lt;br /&gt;
''After writing this section, I discovered that some of the ideas above also appear in the Urban Institute's &lt;br /&gt;
[https://www.urban.org/sites/default/files/publication/104296/do-no-harm-guide.pdf Do No Harm Guide: Applying Equity Awareness in Data Visualization]. While it is focused on visualizations, some of its suggestions can be applied to designing data schemas.''&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that wrote to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the &amp;lt;code&amp;gt;destination_file&amp;lt;/code&amp;gt; parameter in the dict of parameters that defines the job (found, for instance, in the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns, which indicate either that the source file contains only null values or that there's an error in your ETL code (often a typo in the name of the field you're trying to load from; how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms field names can be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by writing a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, and edit it to run the new ETL job on the desired schedule.&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14375</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14375"/>
		<updated>2023-06-01T21:17:04Z</updated>

		<summary type="html">&lt;p&gt;DRW: Reorder two statement in an ordered list.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
** Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Schema design ===&lt;br /&gt;
&lt;br /&gt;
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.&lt;br /&gt;
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.&lt;br /&gt;
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;lt;code&amp;gt;FIELD NAME&amp;lt;/code&amp;gt; becomes &amp;lt;code&amp;gt;field_name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;# of pirates&amp;lt;/code&amp;gt; should be changed to &amp;lt;code&amp;gt;number_of_pirates&amp;lt;/code&amp;gt;). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates &amp;lt;code&amp;gt;y&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;lat&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;long&amp;lt;/code&amp;gt;) but &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt; are already being used by other data tables, switch to &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt;.&lt;br /&gt;
# '''Standardize column values.''' Where possible, transform columns to standardize their values. The first step is to look at the histogram of every column (&amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; in VisiData!) and see if anything is irregular. For instance, if the &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; column has 1038 records with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;Pittsburgh&amp;quot; and two with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;PGH&amp;quot;, add to the schema a &amp;lt;code&amp;gt;@pre_dump&amp;lt;/code&amp;gt; decorator function to change all instances of &amp;quot;PGH&amp;quot; to &amp;quot;Pittsburgh&amp;quot;. In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!&lt;br /&gt;
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The &amp;quot;input&amp;quot; should be on the left and the &amp;quot;output&amp;quot; should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. c) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?&lt;br /&gt;
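&lt;br /&gt;
A minimal sketch of items 2 and 4 above (the snake-case helper is our own convention rather than Little Lexicographer or Marshmallow functionality, and the field and value names are hypothetical):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
from marshmallow import Schema, fields, pre_dump&lt;br /&gt;
&lt;br /&gt;
def to_snake_case(name):&lt;br /&gt;
    # Lower-case the name and collapse runs of spaces/punctuation into&lt;br /&gt;
    # single underscores; symbols that stand for words still need manual&lt;br /&gt;
    # mapping (here, '#' becomes 'number').&lt;br /&gt;
    name = name.strip().lower().replace('#', 'number')&lt;br /&gt;
    return re.sub(r'[^a-z0-9]+', '_', name).strip('_')&lt;br /&gt;
&lt;br /&gt;
class InspectionSchema(Schema):&lt;br /&gt;
    municipality = fields.String()&lt;br /&gt;
&lt;br /&gt;
    @pre_dump&lt;br /&gt;
    def standardize_municipality(self, data, **kwargs):&lt;br /&gt;
        # Pre-clean the value so 'PGH' and 'Pittsburgh' never coexist.&lt;br /&gt;
        if data.get('municipality') == 'PGH':&lt;br /&gt;
            data['municipality'] = 'Pittsburgh'&lt;br /&gt;
        return data&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;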
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;. Opening the file with Python's &amp;lt;code&amp;gt;utf-8-sig&amp;lt;/code&amp;gt; codec strips the byte-order mark if one is present (see the sketch below).&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp, as in the sketch below. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
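&lt;br /&gt;
A minimal defensive pattern for both pitfalls (the file name and time zone are assumptions):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import csv&lt;br /&gt;
from datetime import datetime, timezone&lt;br /&gt;
from zoneinfo import ZoneInfo  # Python 3.9+&lt;br /&gt;
&lt;br /&gt;
# 'utf-8-sig' transparently strips a leading byte-order mark, if any.&lt;br /&gt;
with open('source.csv', encoding='utf-8-sig', newline='') as f:&lt;br /&gt;
    rows = list(csv.DictReader(f))&lt;br /&gt;
&lt;br /&gt;
# 2021-11-07 01:30 occurred twice in US Eastern time; fold=0/1 picks the&lt;br /&gt;
# occurrence, and only the UTC version is unambiguous.&lt;br /&gt;
local = datetime(2021, 11, 7, 1, 30, fold=1, tzinfo=ZoneInfo('America/New_York'))&lt;br /&gt;
utc = local.astimezone(timezone.utc)  # use this one as the primary key&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;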
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the &amp;lt;code&amp;gt;destination_file&amp;lt;/code&amp;gt; parameter in the dict of parameters that defines the job (found in, for instance, the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
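&lt;br /&gt;
Only &amp;lt;code&amp;gt;destination_file&amp;lt;/code&amp;gt; is documented here; the other keys in this fragment are assumptions for illustration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Hypothetical fragment of the job-defining dict in robopgh/census.py.&lt;br /&gt;
job_dicts = [&lt;br /&gt;
    {&lt;br /&gt;
        'source_file': 'population.csv',             # illustrative&lt;br /&gt;
        'destination_file': 'census_population.csv', # overrides the default output name&lt;br /&gt;
    },&lt;br /&gt;
]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;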
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns; an empty column either means that the source file contains only null values for that column or signals an error in your ETL code, often a typo in the name of the field you're trying to load from (how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms field names can be non-intuitive). A scripted version of this check is sketched below.&lt;br /&gt;
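&lt;br /&gt;
If you would rather script the empty-column check than eyeball histograms, here is a minimal sketch (the output path matches the example above):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import csv&lt;br /&gt;
&lt;br /&gt;
# Count non-empty values per column and report columns that are all empty.&lt;br /&gt;
with open('output_files/robopgh/population.csv', newline='') as f:&lt;br /&gt;
    reader = csv.DictReader(f)&lt;br /&gt;
    counts = {name: 0 for name in reader.fieldnames}&lt;br /&gt;
    for row in reader:&lt;br /&gt;
        for name, value in row.items():&lt;br /&gt;
            if value not in (None, ''):&lt;br /&gt;
                counts[name] += 1&lt;br /&gt;
&lt;br /&gt;
print('Empty columns:', [name for name, n in counts.items() if n == 0])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;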
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp). One way to do the grouping is sketched below.&lt;br /&gt;
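&lt;br /&gt;
A short pandas sketch for the year+month grouping (the timestamp column name is an assumption):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import pandas as pd&lt;br /&gt;
&lt;br /&gt;
df = pd.read_csv('output_files/robopgh/population.csv', parse_dates=['created_utc'])&lt;br /&gt;
# Count records per calendar month; scan the result for gaps or spikes.&lt;br /&gt;
monthly_counts = df.groupby(df['created_utc'].dt.to_period('M')).size()&lt;br /&gt;
print(monthly_counts)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;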
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by writing a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, edit it to run the new ETL job, and adjust the schedule to match the desired ETL schedule. A hypothetical crontab line is shown below.&lt;br /&gt;
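&lt;br /&gt;
For example, a crontab entry might look like this (the installation path and schedule are assumptions):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Run the census ETL job every day at 06:10 (server time).&lt;br /&gt;
10 6 * * * cd /home/etl/rocket-etl &amp;amp;&amp;amp; python launchpad.py engine/payload/robopgh/census.py&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;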
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14374</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14374"/>
		<updated>2023-06-01T20:58:59Z</updated>

		<summary type="html">&lt;p&gt;DRW: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
**  Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server ]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Schema design ===&lt;br /&gt;
&lt;br /&gt;
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.&lt;br /&gt;
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.&lt;br /&gt;
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;lt;code&amp;gt;FIELD NAME&amp;lt;/code&amp;gt; becomes &amp;lt;code&amp;gt;field_name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;# of pirates&amp;lt;/code&amp;gt; should be changed to &amp;lt;code&amp;gt;number_of_pirates&amp;lt;/code&amp;gt;). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates &amp;lt;code&amp;gt;y&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;lat&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;long&amp;lt;/code&amp;gt;) but &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt; are already being used by other data tables, switch to &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt;.&lt;br /&gt;
# '''Standardize column values.''' Where possible transform columns to standardize their values. The first step is to look at the histogram of every column (&amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; in VisiData!) and see if anything is irregular. For instance, if the &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; column has 1038 records with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;Pittsburgh&amp;quot; and two with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;PGH&amp;quot;, add to the schema a @pre_dump decorator function to change all instances of &amp;quot;PGH&amp;quot; to &amp;quot;Pittsburgh&amp;quot;. In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!&lt;br /&gt;
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The &amp;quot;input&amp;quot; should be on the left and the &amp;quot;output&amp;quot; should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). c) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. This principle also applies to lists of geographic regions and other features. d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Savings Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the to_file parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the 'destination_file' parameter in the dict of parameters that define the job (found in, for instance, &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns (which is either a sign that the source file has only null values in it or that there's an error in your ETL code, often because there's a typo in the name of the field you're trying to load from... How [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms the field names can often be non-intuitive.).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts than you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meangingful commit description&amp;gt;&amp;quot;)&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt; &amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by writing a cron job: &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt; + duplicate a launchpad line that's already in the crontab file + edit it to run the new ETL job and edit the schedule to match the desired ETL schedule.&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14373</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14373"/>
		<updated>2023-06-01T20:49:56Z</updated>

		<summary type="html">&lt;p&gt;DRW: Finish schema design section&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
**  Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server ]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Schema design ===&lt;br /&gt;
&lt;br /&gt;
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.&lt;br /&gt;
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.&lt;br /&gt;
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;lt;code&amp;gt;FIELD NAME&amp;lt;/code&amp;gt; becomes &amp;lt;code&amp;gt;field_name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;# of pirates&amp;lt;/code&amp;gt; should be changed to &amp;lt;code&amp;gt;number_of_pirates&amp;lt;/code&amp;gt;). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates &amp;lt;code&amp;gt;y&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;x&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;lat&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;long&amp;lt;/code&amp;gt;) but &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt; are already being used by other data tables, switch to &amp;lt;code&amp;gt;latitude&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;longitude&amp;lt;/code&amp;gt;.&lt;br /&gt;
# '''Standardize column values.''' Where possible transform columns to standardize their values. The first step is to look at the histogram of every column (&amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; in VisiData!) and see if anything is irregular. For instance, if the &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; column has 1038 records with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;Pittsburgh&amp;quot; and two with &amp;lt;code&amp;gt;municipality&amp;lt;/code&amp;gt; == &amp;quot;PGH&amp;quot;, add to the schema a @pre_dump decorator function to change all instances of &amp;quot;PGH&amp;quot; to &amp;quot;Pittsburgh&amp;quot;. In some cases, just converting an address field to upper case will go a long way toward standardizing it. You can think of this step as pre-cleaning the data. The Holy Grail of column standardization would be using the same values in every identically named column across the entire data portal. Maybe someday!&lt;br /&gt;
# '''Organize the column names.''' Often the source file comes with some record IDs on the left, followed by some highly relevant fields (e.g., names of things), but then the rest of the columns may be semirandomly ordered. Principles of column organization: a) '''The &amp;quot;input&amp;quot; should be on the left and the &amp;quot;output&amp;quot; should be on the right.''' Which fields is the user likeliest to use to look up a record (like you would look up a word in a dictionary)? Put those furthest to the left (or, at the top of the schema). Primary keys and unique identifiers should go on the far left. Things like the results of inspections are closer to outputs, and should be moved to the right. b) '''Prioritize important stuff'''. If there are fields you think are likely to be of most interest to the user, shift them as far left as you can (subject to other constraints). The further left the field is, the better chance the user will be able to see it in the Data Tables view (or their tabular data explorer of choice). c) '''Group similar fields together.''' Obviously street address, city, state, and ZIP code should be grouped together and presented in the canonical order. d) '''Maximize readability'''. Think like a user. How can you order the columns so that the sequence is logical?&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Savings Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the to_file parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the 'destination_file' parameter in the dict of parameters that define the job (found in, for instance, &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns (which is either a sign that the source file has only null values in it or that there's an error in your ETL code, often because there's a typo in the name of the field you're trying to load from... How [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms the field names can often be non-intuitive.).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts than you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meangingful commit description&amp;gt;&amp;quot;)&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt; &amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by writing a cron job: &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt; + duplicate a launchpad line that's already in the crontab file + edit it to run the new ETL job and edit the schedule to match the desired ETL schedule.&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14372</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14372"/>
		<updated>2023-06-01T20:22:28Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add schema design section to ETL page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
**  Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server ]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Schema design ===&lt;br /&gt;
&lt;br /&gt;
After running [https://github.com/WPRDC/little-lexicographer Little Lexicographer] on the source file you want to write an ETL job for and reviewing the proposed schema types for correctness, review the column names.&lt;br /&gt;
# '''Make column names clear.''' If you don't understand the meaning of the column from reading the column name and looking at sample values, figure out the column (by reading the data dictionary and documentation or asking someone closer to the source of the data) and then give it a meaningful name.&lt;br /&gt;
# '''Use snake case.''' Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;quot;FIELD NAME&amp;quot; becomes &amp;quot;field_name&amp;quot; and &amp;quot;# of pirates&amp;quot; should be changed to &amp;quot;number_of_pirates&amp;quot;). Reasons we prefer snake case: a) Marshmallow already converts field names to snake case to some extent automatically. b) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
# '''Standardize column names.''' Choose names that are already in use in other data tables published by the same publisher. For instance, if the source data calls the geocoordinates `y` and `x` (or `lat` and `long`) but `latitude` and `longitude` are already being used by other data tables, switch to `latitude` and `longitude`.&lt;br /&gt;
# '''Standardize column values.'''&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Savings Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the to_file parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the 'destination_file' parameter in the dict of parameters that define the job (found in, for instance, &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns (which is either a sign that the source file has only null values in it or that there's an error in your ETL code, often because there's a typo in the name of the field you're trying to load from... How [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms the field names can often be non-intuitive.).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts than you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meangingful commit description&amp;gt;&amp;quot;)&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt; &amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by adding a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, edit the copy to run the new ETL job, and adjust its schedule to match the desired ETL schedule, as in the example below.&lt;br /&gt;
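For reference, a scheduled launchpad line might look roughly like this (the installation path and schedule here are hypothetical; copy the real ones from an existing line):&lt;br /&gt;
&amp;lt;code&amp;gt;10 6 * * * cd /home/etl/rocket-etl &amp;amp;&amp;amp; python launchpad.py engine/payload/robopgh/census.py production&amp;lt;/code&amp;gt;&lt;br /&gt;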
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14371</id>
		<title>CKAN Administration</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14371"/>
		<updated>2023-03-20T18:40:27Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add Canonical Views section&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Changes that can be made through the frontend ==&lt;br /&gt;
There's a lot of documentation on publishing data on our CKAN portal [https://github.com/WPRDC/data-guide/tree/master/docs here].&lt;br /&gt;
&lt;br /&gt;
A few samples (to eventually migrate over):&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/PublishingCKAN.md Our documentation for publishers on publishing data on the WPRDC]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/data_dictionaries.md How to create data dictionaries] ==&amp;gt; [[Data Dictionaries]]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/metadata_extras.md Some of our standard extra metadata fields]&lt;br /&gt;
&lt;br /&gt;
=== Canonical Views ===&lt;br /&gt;
If you want a map or data table to appear on the dataset landing page, create a corresponding &amp;quot;view&amp;quot; under one of the resources in the dataset and then click the &amp;quot;Canonical View&amp;quot; button for that view. The catch is that CKAN does not enforce that only one view may be canonical, so if multiple views have &amp;quot;Canonical View&amp;quot; selected, CKAN will pick one of them to display, and you will have to deselect the others to get the one you want to appear on the dataset landing page.&lt;br /&gt;
&lt;br /&gt;
=== Writing dataset descriptions ===&lt;br /&gt;
The description field supports some limited markup, which appears to be a subset of Markdown.&lt;br /&gt;
&lt;br /&gt;
* Starting a line with a single pound sign (#) indicates that the line should be rendered in bigger, title-style text, but two pound signs (##) do not give a different font size, as they do in standard Markdown.&lt;br /&gt;
&lt;br /&gt;
* Dashes can be used to denote elements in an unordered list (though I haven't been able to get nested lists to work).&lt;br /&gt;
&lt;br /&gt;
* Use backticks to indicate that a sans serif font should be used to represent code, like `this`.&lt;br /&gt;
&lt;br /&gt;
Images, links, bold and italic text all work.&lt;br /&gt;
&lt;br /&gt;
It seems like it's limited to the original [https://www.markdownguide.org/cheat-sheet/#basic-syntax very basic specification of Markdown].&lt;br /&gt;
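&lt;br /&gt;
A minimal, hypothetical description using only the markup noted above might look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;# Pirate Census Data&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;Monthly counts of pirates. Field names are in `snake_case`.&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;- number_of_pirates&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;- as_of_date&amp;lt;/code&amp;gt;&lt;br /&gt;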
&lt;br /&gt;
== Changes that can be made through the backend ==&lt;br /&gt;
=== Configuring the CKAN server ===&lt;br /&gt;
(The contents of this section were initially taken from the &amp;lt;code&amp;gt;ORIENTATION&amp;lt;/code&amp;gt; file in &amp;lt;code&amp;gt;/home/ubuntu&amp;lt;/code&amp;gt; on the CKAN production server.)&lt;br /&gt;
&lt;br /&gt;
* The main CKAN config file is at &amp;lt;code&amp;gt;/etc/ckan/default/production.ini&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To monitor HTTP requests in real-time: &amp;lt;code&amp;gt;&amp;gt; tail -f /var/log/nginx/access.log&amp;lt;/code&amp;gt;&lt;br /&gt;
  &lt;br /&gt;
* Service-worker activity (like the Express Loader uploading files to the datastore and background geocoding) can be found in: &amp;lt;code&amp;gt;/var/log/ckan-worker.log&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Edit templates here (changes to templates should show up when reloading the relevant web pages): &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/ckanext/wprdc/templates&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;templates/terms.html&amp;lt;/code&amp;gt; is the source for the pop-up version of the Terms of Use. There appears to be no template linked to the &amp;quot;Terms&amp;quot; hyperlink.&lt;br /&gt;
&lt;br /&gt;
* Create a file &amp;lt;code&amp;gt;templates/foo.html&amp;lt;/code&amp;gt; and then run &amp;lt;code&amp;gt;&amp;gt; sudo service supervisor restart&amp;lt;/code&amp;gt; and THEN load &amp;lt;code&amp;gt;data.wprdc.org/foo.html&amp;lt;/code&amp;gt; in your browser, and the page will be there.&lt;br /&gt;
&lt;br /&gt;
* Presumably &amp;lt;code&amp;gt;data.wprdc.org/foo/&amp;lt;/code&amp;gt; can be populated by creating a file at &amp;lt;code&amp;gt;templates/foo/index.html&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Managing the CKAN server ===&lt;br /&gt;
* To restart the Express Loader: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl restart ckan-worker:*&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To edit the background worker configuration (including increasing the number of background workers; see the illustrative stanza after these steps),&lt;br /&gt;
*# Edit the config file: &amp;lt;code&amp;gt;&amp;gt; vi /etc/supervisor/conf.d/supervisor-ckan-worker.conf&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Tell Supervisor to use the new configuration: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl reread&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Update the deployed configuration to start the desired number of workers: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl update&amp;lt;/code&amp;gt;&lt;br /&gt;
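&lt;br /&gt;
In a Supervisor config, the worker count is typically controlled by the standard &amp;lt;code&amp;gt;numprocs&amp;lt;/code&amp;gt; option; the stanza on the server will differ in its details, but the relevant part looks roughly like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;[program:ckan-worker]&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;numprocs=2&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;process_name=%(program_name)s-%(process_num)02d&amp;lt;/code&amp;gt;&lt;br /&gt;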
&lt;br /&gt;
* Activate the virtual environment that lets you run &amp;lt;code&amp;gt;paster&amp;lt;/code&amp;gt; commands: &amp;lt;code&amp;gt;&amp;gt; . /usr/lib/ckan/default/bin/activate&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Adding/changing departments of publishers ===&lt;br /&gt;
To add or change the departments belonging to a particular publisher organization, edit the &amp;lt;code&amp;gt;dataset_schema.json&amp;lt;/code&amp;gt; file (see the illustrative fragment at the end of this section): &amp;lt;code&amp;gt;&amp;gt; vi /usr/lib/ckan/default/src/ckanext-scheming/ckanext/scheming/dataset_schema.json&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then run &amp;lt;code&amp;gt;&amp;gt; sudo service apache2 reload&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The extra tricky part is that [https://github.com/WPRDC/ckanext-wprdctheme our GitHub repository that includes this JSON file] is installed in a different directory (&amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/&amp;lt;/code&amp;gt;), but changes to the files in that directory (and its subdirectories) have no effect.&lt;br /&gt;
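&lt;br /&gt;
Since the file follows the ckanext-scheming schema format, the department list should be a field definition with a &amp;lt;code&amp;gt;choices&amp;lt;/code&amp;gt; array. An illustrative fragment (the field and department names here are hypothetical):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;{ &amp;quot;field_name&amp;quot;: &amp;quot;department&amp;quot;, &amp;quot;label&amp;quot;: &amp;quot;Department&amp;quot;,&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;  &amp;quot;choices&amp;quot;: [ { &amp;quot;value&amp;quot;: &amp;quot;dcp&amp;quot;, &amp;quot;label&amp;quot;: &amp;quot;Department of City Planning&amp;quot; } ] }&amp;lt;/code&amp;gt;&lt;br /&gt;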
&lt;br /&gt;
== Other changes ==&lt;br /&gt;
=== Using CKAN metadata instead of local caches ===&lt;br /&gt;
To avoid keeping local databases about datasets (for instance, when writing code to track some aspect of datasets), store such information (such as the last time an ETL job was run on a given package) in the 'extras' metadata field of the CKAN package, as much as possible. This stores information in a centralized location so ETL jobs can be run from multiple computers without any other coordination. The extras metadata fields are cataloged on the [[CKAN Metadata]] page.&lt;br /&gt;
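&lt;br /&gt;
As a sketch of what this looks like in practice (the extras key and &amp;lt;code&amp;gt;package_id&amp;lt;/code&amp;gt; are hypothetical; note that &amp;lt;code&amp;gt;package_patch&amp;lt;/code&amp;gt; replaces the whole extras list, so fetch the existing extras with &amp;lt;code&amp;gt;package_show&amp;lt;/code&amp;gt; first and resubmit them along with your change):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;import requests&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;extras = [{'key': 'time_of_last_etl_job', 'value': '2023-03-20T18:40:27Z'}]  # hypothetical key&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;requests.post('https://data.wprdc.org/api/3/action/package_patch',&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;              json={'id': package_id, 'extras': extras},&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;              headers={'Authorization': API_key})&amp;lt;/code&amp;gt;&lt;br /&gt;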
&lt;br /&gt;
=== Hacky workaround for adding new users to publishers ===&lt;br /&gt;
In addition to adding the users to the organizations through the CKAN front-end, you also have to add them to groups, using this URL: [http://wprdc.org/group-adder/]&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Scraping_Data&amp;diff=14370</id>
		<title>Scraping Data</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Scraping_Data&amp;diff=14370"/>
		<updated>2023-02-15T18:37:40Z</updated>

		<summary type="html">&lt;p&gt;DRW: Fix markup&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;So you want to scrape data from some web page on a regular basis, huh?&lt;br /&gt;
&lt;br /&gt;
Here are some ways you could do it:&lt;br /&gt;
&lt;br /&gt;
* A pure Google-Sheets-based approach, which can be set to run hourly, daily, or on other schedules: https://www.computerworld.com/article/3684733/how-to-create-automatically-updating-google-sheet.html&lt;br /&gt;
* A pure GitHub-based approach (exploiting GitHub actions): https://simonwillison.net/2020/Oct/9/git-scraping/&lt;br /&gt;
** [https://githubnext.com/projects/flat-data Flat Data] - An extension of Simon Willison's git-scraping that includes a GUI for viewing the captured data (as long as it's stored in a public GitHub repo)&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Scraping_Data&amp;diff=14369</id>
		<title>Scraping Data</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Scraping_Data&amp;diff=14369"/>
		<updated>2023-02-15T18:34:40Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add Flat Data to Scraping Data page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;So you want to scrape data from some web page on a regular basis, huh?&lt;br /&gt;
&lt;br /&gt;
Here are some ways you could do it:&lt;br /&gt;
&lt;br /&gt;
* A pure Google-Sheets-based approach, which can be set to run hourly, daily, or on other schedules: https://www.computerworld.com/article/3684733/how-to-create-automatically-updating-google-sheet.html&lt;br /&gt;
&lt;br /&gt;
* A pure GitHub-based approach (exploiting GitHub actions): https://simonwillison.net/2020/Oct/9/git-scraping/&lt;br /&gt;
  * [https://githubnext.com/projects/flat-data Flat Data] - An extension of Simon Willison's git-scraping that includes a GUI for viewing the captured data (as long as it's stored in a public GitHub repo)&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Scraping_Data&amp;diff=14368</id>
		<title>Scraping Data</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Scraping_Data&amp;diff=14368"/>
		<updated>2023-02-10T18:48:58Z</updated>

		<summary type="html">&lt;p&gt;DRW: Create &amp;quot;Scraping Data&amp;quot; page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;So you want to scrape data from some web page on a regular basis, huh?&lt;br /&gt;
&lt;br /&gt;
Here are some ways you could do it:&lt;br /&gt;
&lt;br /&gt;
A pure Google-Sheets-based approach, which can be set to run hourly, daily, or on other schedules:&lt;br /&gt;
https://www.computerworld.com/article/3684733/how-to-create-automatically-updating-google-sheet.html&lt;br /&gt;
&lt;br /&gt;
A pure GitHub-based approach (exploiting GitHub actions):&lt;br /&gt;
https://simonwillison.net/2020/Oct/9/git-scraping/&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14279</id>
		<title>CKAN Administration</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=CKAN_Administration&amp;diff=14279"/>
		<updated>2022-10-06T19:48:07Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add notes about CKAN's description support for Markdown&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Changes that can be made through the frontend ==&lt;br /&gt;
There's a lot of documentation on publishing data on our CKAN portal [https://github.com/WPRDC/data-guide/tree/master/docs here].&lt;br /&gt;
&lt;br /&gt;
A few samples (to eventually migrate over):&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/PublishingCKAN.md Our documentation for publishers on publishing data on the WPRDC]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/data_dictionaries.md How to create data dictionaries] ==&amp;gt; [[Data Dictionaries]]&lt;br /&gt;
* [https://github.com/WPRDC/data-guide/blob/master/docs/metadata_extras.md Some of our standard extra metadata fields]&lt;br /&gt;
&lt;br /&gt;
== Writing dataset descriptions ==&lt;br /&gt;
The description field supports some limited markup, which appears to be a subset of Markdown.&lt;br /&gt;
&lt;br /&gt;
* Starting a line with a single pound sign (#) indicates that the line should be rendered in bigger, title-style text, but two pound signs (##) do not give a different font size, as they do in standard Markdown.&lt;br /&gt;
&lt;br /&gt;
* Dashes can be used to denote elements in an unordered list (though I haven't been able to get nested lists to work).&lt;br /&gt;
&lt;br /&gt;
* Use backticks to indicate that a sans serif font should be used to represent code, like `this`.&lt;br /&gt;
&lt;br /&gt;
Images, links, bold and italic text all work.&lt;br /&gt;
&lt;br /&gt;
It seems like it's limited to the original [https://www.markdownguide.org/cheat-sheet/#basic-syntax very basic specification of Markdown].&lt;br /&gt;
&lt;br /&gt;
== Changes that can be made through the backend ==&lt;br /&gt;
=== Configuring the CKAN server ===&lt;br /&gt;
(The contents of this section were initially taken from the &amp;lt;code&amp;gt;ORIENTATION&amp;lt;/code&amp;gt; file in &amp;lt;code&amp;gt;/home/ubuntu&amp;lt;/code&amp;gt; on the CKAN production server.)&lt;br /&gt;
&lt;br /&gt;
* The main CKAN config file is at &amp;lt;code&amp;gt;/etc/ckan/default/production.ini&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To monitor HTTP requests in real-time: &amp;lt;code&amp;gt;&amp;gt; tail -f /var/log/nginx/access.log&amp;lt;/code&amp;gt;&lt;br /&gt;
  &lt;br /&gt;
* Service-worker activity (like the Express Loader uploading files to the datastore and background geocoding) can be found in: &amp;lt;code&amp;gt;/var/log/ckan-worker.log&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Edit templates here (changes to templates should show up when reloading the relevant web pages): &amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/ckanext/wprdc/templates&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;templates/terms.html&amp;lt;/code&amp;gt; is the source for the pop-up version of the Terms of Use. There appears to be no template linked to the &amp;quot;Terms&amp;quot; hyperlink.&lt;br /&gt;
&lt;br /&gt;
* Create a file &amp;lt;code&amp;gt;templates/foo.html&amp;lt;/code&amp;gt; and then run &amp;lt;code&amp;gt;&amp;gt; sudo service supervisor restart&amp;lt;/code&amp;gt; and THEN load &amp;lt;code&amp;gt;data.wprdc.org/foo.html&amp;lt;/code&amp;gt; in your browser, and the page will be there.&lt;br /&gt;
&lt;br /&gt;
* Presumably &amp;lt;code&amp;gt;data.wprdc.org/foo/&amp;lt;/code&amp;gt; can be populated by creating a file at &amp;lt;code&amp;gt;templates/foo/index.html&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Managing the CKAN server ===&lt;br /&gt;
* To restart the Express Loader: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl restart ckan-worker:*&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* To edit the background worker configuration (including increasing the number of background workers), &lt;br /&gt;
*# Edit the config file: &amp;lt;code&amp;gt;&amp;gt; vi /etc/supervisor/conf.d/supervisor-ckan-worker.conf&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Tell Supervisor to use the new configuration: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl reread&amp;lt;/code&amp;gt;&lt;br /&gt;
*# Update the deployed configuration to start the desired number of workers: &amp;lt;code&amp;gt;&amp;gt; sudo supervisorctl update&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Activate the virtual environment that lets you run &amp;lt;code&amp;gt;paster&amp;lt;/code&amp;gt; commands: &amp;lt;code&amp;gt;&amp;gt; . /usr/lib/ckan/default/bin/activate&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Adding/changing departments of publishers ===&lt;br /&gt;
To add or change the departments belonging to a particular publisher organization, edit the &amp;lt;code&amp;gt;dataset_schema.json&amp;lt;/code&amp;gt; file: &amp;lt;code&amp;gt;&amp;gt; vi /usr/lib/ckan/default/src/ckanext-scheming/ckanext/scheming/dataset_schema.json&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then run &amp;lt;code&amp;gt;&amp;gt; sudo service apache2 reload&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The extra tricky part is that [https://github.com/WPRDC/ckanext-wprdctheme our GitHub repository that includes this JSON file] is installed in a different directory (&amp;lt;code&amp;gt;/usr/lib/ckan/default/src/ckanext-wprdctheme/&amp;lt;/code&amp;gt;), but changes to the files in that directory (and its subdirectories) have no effect.&lt;br /&gt;
&lt;br /&gt;
== Other changes ==&lt;br /&gt;
=== Using CKAN metadata instead of local caches ===&lt;br /&gt;
To avoid keeping local databases about datasets (for instance, when writing code to track some aspect of datasets), store such information (such as the last time an ETL job was run on a given package) in the 'extras' metadata field of the CKAN package, as much as possible. This stores information in a centralized location so ETL jobs can be run from multiple computers without any other coordination. The extras metadata fields are cataloged on the [[CKAN Metadata]] page.&lt;br /&gt;
&lt;br /&gt;
=== Hacky workaround for adding new users to publishers ===&lt;br /&gt;
In addition to adding the users to the organizations through the CKAN front-end, you also have to add them to groups, using this URL: [http://wprdc.org/group-adder/]&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14278</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14278"/>
		<updated>2022-07-11T14:11:08Z</updated>

		<summary type="html">&lt;p&gt;DRW: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
**  Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server ]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Snake case ===&lt;br /&gt;
&lt;br /&gt;
Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;quot;FIELD NAME&amp;quot; becomes &amp;quot;field_name&amp;quot; and &amp;quot;# of pirates&amp;quot; should be changed to &amp;quot;number_of_pirates&amp;quot;). Reasons we prefer snake case: 1) Marshmallow already converts field names to snake case to some extent automatically. 2) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
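&lt;br /&gt;
A minimal Python sketch of such a conversion (rocket-etl's actual handling may differ):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;import re&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;def to_snake_case(name):&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;    name = name.lower().replace('#', 'number')  # '# of pirates' becomes 'number of pirates'&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;    return re.sub(r'[^a-z0-9]+', '_', name).strip('_')  # 'FIELD NAME' becomes 'field_name'&amp;lt;/code&amp;gt;&lt;br /&gt;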
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data. (A short illustration of the ambiguity follows this list.)&lt;br /&gt;
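&lt;br /&gt;
A quick Python (3.9+) illustration of the fall-back ambiguity, using Pittsburgh's time zone and the November 2021 transition:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;from datetime import datetime, timezone&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;from zoneinfo import ZoneInfo&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;tz = ZoneInfo('America/New_York')&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;first = datetime(2021, 11, 7, 1, 30, tzinfo=tz)  # 1:30 a.m. EDT&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;second = datetime(2021, 11, 7, 1, 30, fold=1, tzinfo=tz)  # same local clock time, but EST&amp;lt;/code&amp;gt;&lt;br /&gt;
&amp;lt;code&amp;gt;print(first.astimezone(timezone.utc), second.astimezone(timezone.utc))  # 05:30 vs. 06:30 UTC&amp;lt;/code&amp;gt;&lt;br /&gt;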
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the 'destination_file' parameter in the dict of parameters that define the job (found in, for instance, the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns (either a sign that the source file has only null values in it or that there's an error in your ETL code, often because there's a typo in the name of the field you're trying to load from; how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms the field names can be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by adding a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, edit the copy to run the new ETL job, and adjust its schedule to match the desired ETL schedule.&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14277</id>
		<title>Tutorials</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14277"/>
		<updated>2022-06-06T02:47:43Z</updated>

		<summary type="html">&lt;p&gt;DRW: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* [https://kbroman.org/dataorg/ Organizing Data in Spreadsheets] - Tips on how to store data in a spreadsheet in a way that will make it easiest to sustainably work with. (Highly recommended if you ever want to publish your data.)&lt;br /&gt;
* [https://www.youtube.com/watch?v=7Ma8WIDinDc Data Cleaning Principles] - A video of a talk from csv,conf6 on how to approach and think about cleaning data. The slides, annotated with notes, are [https://kbroman.org/Talk_DataCleaning/data_cleaning_notes.pdf here] and the corresponding GitHub repo is [https://github.com/kbroman/dataorg here].&lt;br /&gt;
* [https://www.youtube.com/watch?v=Ssso_5X1UPs Creating Effective Figures and Tables] - A video of a talk about how to communicate data clearly with graphs and tables. [https://www.biostat.wisc.edu/~kbroman/presentations/graphs2018.pdf A PDF of the slides] and [https://github.com/kbroman/Talk_Graphs the GitHub repo] are also available.&lt;br /&gt;
&lt;br /&gt;
* [https://missing.csail.mit.edu/ The Missing Semester] - A series of MIT class videos teaching a lot of practical things that developers need to know.&lt;br /&gt;
* [https://gitlab.com/slackermedia/bashcrawl bashcrawl] - A text adventure to teach shell skills.&lt;br /&gt;
* [https://mystery.knightlab.com/ SQL Murder Mystery] - &amp;quot;The SQL Murder Mystery is designed to be both a self-directed lesson to learn SQL concepts and commands and a fun game for experienced SQL users to solve an intriguing crime.&amp;quot;&lt;br /&gt;
** &amp;quot;... If you really want to learn a lot about SQL, you may prefer a complete tutorial like [https://selectstarsql.com/ Select Star SQL].&amp;quot;&lt;br /&gt;
* [https://jsvine.github.io/intro-to-visidata/index.html An Introduction to VisiData] - One way to learn the best power tool for exploring and analyzing CSV files, SQLite databases, Excel files, and many other file formats. VisiData is also handy for editing, manipulating, and joining CSV files. Click [https://www.visidata.org/install/ here] to install it.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14276</id>
		<title>Tutorials</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14276"/>
		<updated>2022-06-06T02:23:57Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add talk on creating plots and tables and add a link to the data-cleaning entry&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* [https://kbroman.org/dataorg/ Organizing Data in Spreadsheets] - Tips on how to store data in a spreadsheet in a way that will make it easiest to sustainably work with. &lt;br /&gt;
* [https://www.youtube.com/watch?v=7Ma8WIDinDc Data Cleaning Principles] - A video of a talk from csv,conf6 on how to approach and think about cleaning data. The slides, annotated with notes, are [https://kbroman.org/Talk_DataCleaning/data_cleaning_notes.pdf here] and the corresponding GitHub repo is [https://github.com/kbroman/dataorg here].&lt;br /&gt;
* [https://www.youtube.com/watch?v=Ssso_5X1UPs Creating Effective Figures and Tables] - A video of a talk about how to communicate data clearly with graphs and tables. [https://www.biostat.wisc.edu/~kbroman/presentations/graphs2018.pdf A PDF of the slides] and [https://github.com/kbroman/Talk_Graphs the GitHub repo] are also available.&lt;br /&gt;
&lt;br /&gt;
* [https://missing.csail.mit.edu/ The Missing Semester] - A series of MIT class videos teaching a lot of practical things that developers need to know.&lt;br /&gt;
* [https://gitlab.com/slackermedia/bashcrawl bashcrawl] - A text adventure to teach shell skills.&lt;br /&gt;
* [https://mystery.knightlab.com/ SQL Murder Mystery] - &amp;quot;The SQL Murder Mystery is designed to be both a self-directed lesson to learn SQL concepts and commands and a fun game for experienced SQL users to solve an intriguing crime.&amp;quot;&lt;br /&gt;
** &amp;quot;... If you really want to learn a lot about SQL, you may prefer a complete tutorial like [https://selectstarsql.com/ Select Star SQL].&amp;quot;&lt;br /&gt;
* [https://jsvine.github.io/intro-to-visidata/index.html An Introduction to VisiData] - One way to learn the best power tool for exploring and analyzing CSV files, SQLite databases, Excel files, and many other file formats. VisiData is also handy for editing, manipulating, and joining CSV files. Click [https://www.visidata.org/install/ here] to install it.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14275</id>
		<title>Tutorials</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14275"/>
		<updated>2022-06-06T02:10:44Z</updated>

		<summary type="html">&lt;p&gt;DRW: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* [https://kbroman.org/dataorg/ Organizing Data in Spreadsheets] - Tips on how to store data in a spreadsheet in a way that will make it easiest to sustainably work with. &lt;br /&gt;
* [https://www.youtube.com/watch?v=7Ma8WIDinDc Data Cleaning Principles] - A video of a talk from csv,conf6 on how to approach and think about cleaning data. The slides, annotated with notes, are [https://kbroman.org/Talk_DataCleaning/data_cleaning_notes.pdf here].&lt;br /&gt;
&lt;br /&gt;
* [https://missing.csail.mit.edu/ The Missing Semester] - A series of MIT class videos teaching a lot of practical things that developers need to know.&lt;br /&gt;
* [https://gitlab.com/slackermedia/bashcrawl bashcrawl] - A text adventure to teach shell skills.&lt;br /&gt;
* [https://mystery.knightlab.com/ SQL Murder Mystery] - &amp;quot;The SQL Murder Mystery is designed to be both a self-directed lesson to learn SQL concepts and commands and a fun game for experienced SQL users to solve an intriguing crime.&amp;quot;&lt;br /&gt;
** &amp;quot;... If you really want to learn a lot about SQL, you may prefer a complete tutorial like [https://selectstarsql.com/ Select Star SQL].&amp;quot;&lt;br /&gt;
* [https://jsvine.github.io/intro-to-visidata/index.html An Introduction to VisiData] - One way to learn the best power tool for exploring and analyzing CSV files, SQLite databases, Excel files, and many other file formats. VisiData is also handy for editing, manipulating, and joining CSV files. Click [https://www.visidata.org/install/ here] to install it.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14274</id>
		<title>Tutorials</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14274"/>
		<updated>2022-06-06T02:02:58Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add tutorials on data organization and data cleaning&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* [https://kbroman.org/dataorg/ Organizing Data in Spreadsheets] - Tips on how to store data in a spreadsheet in a way that will make it easiest to sustainably work with. &lt;br /&gt;
* [https://www.youtube.com/watch?v=7Ma8WIDinDc Data Cleaning Principles] - A video of a talk from csv,conf6 on how to approach and think about cleaning data. The slides are [https://kbroman.org/Talk_DataCleaning/data_cleaning.pdf here].&lt;br /&gt;
&lt;br /&gt;
* [https://missing.csail.mit.edu/ The Missing Semester] - A series of MIT class videos teaching a lot of practical things that developers need to know.&lt;br /&gt;
* [https://gitlab.com/slackermedia/bashcrawl bashcrawl] - A text adventure to teach shell skills.&lt;br /&gt;
* [https://mystery.knightlab.com/ SQL Murder Mystery] - &amp;quot;The SQL Murder Mystery is designed to be both a self-directed lesson to learn SQL concepts and commands and a fun game for experienced SQL users to solve an intriguing crime.&amp;quot;&lt;br /&gt;
** &amp;quot;... If you really want to learn a lot about SQL, you may prefer a complete tutorial like [https://selectstarsql.com/ Select Star SQL].&amp;quot;&lt;br /&gt;
* [https://jsvine.github.io/intro-to-visidata/index.html An Introduction to VisiData] - One way to learn the best power tool for exploring and analyzing CSV files, SQLite databases, Excel files, and many other file formats. VisiData is also handy for editing, manipulating, and joining CSV files. Click [https://www.visidata.org/install/ here] to install it.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Programming_Tutorials&amp;diff=14273</id>
		<title>Programming Tutorials</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Programming_Tutorials&amp;diff=14273"/>
		<updated>2022-06-06T01:56:57Z</updated>

		<summary type="html">&lt;p&gt;DRW: DRW moved page Programming Tutorials to Tutorials: Generalizing from programming tutorials to just tutorials&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[Tutorials]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14272</id>
		<title>Tutorials</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Tutorials&amp;diff=14272"/>
		<updated>2022-06-06T01:56:57Z</updated>

		<summary type="html">&lt;p&gt;DRW: DRW moved page Programming Tutorials to Tutorials: Generalizing from programming tutorials to just tutorials&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* [https://missing.csail.mit.edu/ The Missing Semester] - A series of MIT class videos teaching a lot of practical things that developers need to know.&lt;br /&gt;
* [https://gitlab.com/slackermedia/bashcrawl bashcrawl] - A text adventure to teach shell skills.&lt;br /&gt;
* [https://mystery.knightlab.com/ SQL Murder Mystery] - &amp;quot;The SQL Murder Mystery is designed to be both a self-directed lesson to learn SQL concepts and commands and a fun game for experienced SQL users to solve an intriguing crime.&amp;quot;&lt;br /&gt;
** &amp;quot;... If you really want to learn a lot about SQL, you may prefer a complete tutorial like [https://selectstarsql.com/ Select Star SQL].&amp;quot;&lt;br /&gt;
* [https://jsvine.github.io/intro-to-visidata/index.html An Introduction to VisiData] - One way to learn the best power tool for exploring and analyzing CSV files, SQLite databases, Excel files, and many other file formats. VisiData is also handy for editing, manipulating, and joining CSV files. Click [https://www.visidata.org/install/ here] to install it.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14271</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14271"/>
		<updated>2022-04-28T17:27:58Z</updated>

		<summary type="html">&lt;p&gt;DRW: Fix markup on /* Deploying ETL jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
**  Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server ]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Snake case ===&lt;br /&gt;
&lt;br /&gt;
Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;quot;FIELD NAME&amp;quot; becomes &amp;quot;field_name&amp;quot; and &amp;quot;# of pirates&amp;quot; should be changed to &amp;quot;number_of_pirates&amp;quot;). Reasons we prefer snake case: 1) Marshmallow already converts field names to snake case to some extent automatically. 2) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour and another day (prior to 2022) has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the 'destination_file' parameter in the dict of parameters that define the job (found in, for instance, the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns (either a sign that the source file has only null values in it or that there's an error in your ETL code, often because there's a typo in the name of the field you're trying to load from; how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms the field names can be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by adding a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, edit the copy to run the new ETL job, and adjust its schedule to match the desired ETL schedule.&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=CKAN_Tricks&amp;diff=14270</id>
		<title>CKAN Tricks</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=CKAN_Tricks&amp;diff=14270"/>
		<updated>2022-04-28T17:06:59Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add CKAN Tricks page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Manage datasets ==&lt;br /&gt;
=== Undelete deleted datasets ===&lt;br /&gt;
If you know the URL of the deleted dataset AND you are logged in as an administrator, you can go to that URL in your web browser and see the deleted dataset. &amp;quot;[Deleted]&amp;quot; will have been appended to the title of the dataset, and the value of the metadata field &amp;lt;code&amp;gt;state&amp;lt;/code&amp;gt; will be equal to &amp;quot;deleted&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
To undelete such a dataset, just use the CKAN API or use the &lt;br /&gt;
[https://github.com/WPRDC/utility-belt/blob/master/gadgets.py#L767 set_package_parameters_to_values() function] to set the package's &amp;lt;code&amp;gt;state&amp;lt;/code&amp;gt; metadata value to &amp;quot;active&amp;quot;. For the latter option, invoke the function like this:&lt;br /&gt;
&amp;lt;code&amp;gt;set_package_parameters_to_values(&amp;quot;https://data.wprdc.org&amp;quot;, deleted_package_id, ['state'], ['active'], API_key)&amp;lt;/code&amp;gt;&lt;br /&gt;
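&lt;br /&gt;
Alternatively, here is a minimal sketch of the same operation as a direct API call, using the ckanapi Python client (the package ID is a placeholder):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import ckanapi&lt;br /&gt;
&lt;br /&gt;
# Connect as an administrator; API_key is your admin API key.&lt;br /&gt;
ckan = ckanapi.RemoteCKAN('https://data.wprdc.org', apikey=API_key)&lt;br /&gt;
&lt;br /&gt;
# package_patch updates only the fields you pass, so setting 'state'&lt;br /&gt;
# back to 'active' undeletes the dataset without touching anything else.&lt;br /&gt;
ckan.action.package_patch(id=deleted_package_id, state='active')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;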
&lt;br /&gt;
== Queries ==&lt;br /&gt;
=== Make queries faster ===&lt;br /&gt;
[https://ckan.org/2017/08/10/faster-datastore-in-ckan-2-7/ Speed up datastore_search queries] by including the &amp;lt;code&amp;gt;include_total=False&amp;lt;/code&amp;gt; parameter to skip calculation of the total number of rows (which can reduce response time by a factor of 2).  The [https://docs.ckan.org/en/ckan-2.7.3/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_search datastore_search API call] lets you search a given datastore by column values and return subsets of the records. There's more on benchmarking CKAN performance [http://urbanopus.net/benchmarking-the-ckan-datastore-api/ here].&lt;br /&gt;
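&lt;br /&gt;
For illustration, here's how the parameter might be passed in a direct API call (the resource ID is a placeholder):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import requests&lt;br /&gt;
&lt;br /&gt;
# 'include_total': 'false' skips the (slow) total-row count.&lt;br /&gt;
response = requests.get('https://data.wprdc.org/api/3/action/datastore_search',&lt;br /&gt;
                        params={'resource_id': 'SOME-RESOURCE-ID',&lt;br /&gt;
                                'limit': 500,&lt;br /&gt;
                                'include_total': 'false'})&lt;br /&gt;
records = response.json()['result']['records']&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;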
&lt;br /&gt;
Another way to speed up datastore search queries is to index the fields used for filtering. Note that (at least when the primary key is a combination of fields), if you don't list each primary-key field as a separate field to index, those fields don't get indexed and queries take much longer.&lt;br /&gt;
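&lt;br /&gt;
A sketch of requesting such indexes through the CKAN API's &amp;lt;code&amp;gt;datastore_create&amp;lt;/code&amp;gt; call (the resource ID and field names here are hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import ckanapi&lt;br /&gt;
&lt;br /&gt;
ckan = ckanapi.RemoteCKAN('https://data.wprdc.org', apikey=API_key)&lt;br /&gt;
&lt;br /&gt;
# List each primary-key field separately under 'indexes' so that each&lt;br /&gt;
# one actually gets its own index.&lt;br /&gt;
ckan.action.datastore_create(&lt;br /&gt;
    resource_id='SOME-RESOURCE-ID',&lt;br /&gt;
    primary_key=['zone', 'start_utc'],  # hypothetical composite key&lt;br /&gt;
    indexes=['zone', 'start_utc'],&lt;br /&gt;
    force=True)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;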
&lt;br /&gt;
=== Avoiding stale query caches ===&lt;br /&gt;
Queries/API responses can be cached, depending on the nginx settings. If you find that your SQL query is getting a stale response, try changing your query slightly. For instance, instead of &amp;lt;code&amp;gt;SELECT MAX(some_field) AS biggest FROM &amp;lt;resource-id&amp;gt;&amp;lt;/code&amp;gt;, you could change the assigned variable name (&amp;lt;code&amp;gt;SELECT MAX(some_field) AS biggest0413 FROM &amp;lt;resource-id&amp;gt;&amp;lt;/code&amp;gt;) or add another field that you ignore (&amp;lt;code&amp;gt;SELECT MAX(some_field) AS biggest, MAX(some_field) AS whatever FROM &amp;lt;resource-id&amp;gt;&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
== Scripts that interact with CKAN through the API ==&lt;br /&gt;
=== Run those CKAN-monitoring/modifying scripts from multiple servers by centralizing data ===&lt;br /&gt;
To avoid keeping local databases about datasets, store such information (such as the last time an ETL job was run on a given package) in the 'extras' metadata field of the CKAN package whenever possible. This keeps the information in a centralized location so ETL jobs can be run from multiple computers without any other coordination. The extras metadata fields are currently cataloged on the [[CKAN_Metadata]] page.&lt;br /&gt;
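&lt;br /&gt;
A minimal sketch of that read-modify-write pattern (the extras key and value are hypothetical; note that &amp;lt;code&amp;gt;package_patch&amp;lt;/code&amp;gt; replaces the whole extras list, so read it first):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import ckanapi&lt;br /&gt;
&lt;br /&gt;
ckan = ckanapi.RemoteCKAN('https://data.wprdc.org', apikey=API_key)&lt;br /&gt;
&lt;br /&gt;
# Read the current extras, update one key, and write the full list back.&lt;br /&gt;
package = ckan.action.package_show(id=package_id)&lt;br /&gt;
extras = {e['key']: e['value'] for e in package.get('extras', [])}&lt;br /&gt;
extras['last_etl_run'] = '2022-04-28T17:00:00Z'  # hypothetical key and value&lt;br /&gt;
ckan.action.package_patch(&lt;br /&gt;
    id=package_id,&lt;br /&gt;
    extras=[{'key': k, 'value': v} for k, v in extras.items()])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;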
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14269</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14269"/>
		<updated>2022-04-22T14:26:27Z</updated>

		<summary type="html">&lt;p&gt;DRW: Update daylight-savings-time warnings&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
** Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
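&lt;br /&gt;
As an illustration of that manual review step, a hand-corrected Marshmallow schema for a table with code-like columns might look like this sketch (the field names are hypothetical):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from marshmallow import Schema, fields&lt;br /&gt;
&lt;br /&gt;
class PopulationSchema(Schema):&lt;br /&gt;
    # ZIP codes and tract IDs are codes, not numbers: keep them as&lt;br /&gt;
    # strings so that leading zeros (e.g., '06770') survive.&lt;br /&gt;
    zip_code = fields.String()&lt;br /&gt;
    census_tract = fields.String()&lt;br /&gt;
    population = fields.Integer()  # a genuine count, so an integer is fine&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;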
&lt;br /&gt;
=== Snake case ===&lt;br /&gt;
&lt;br /&gt;
Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;quot;FIELD NAME&amp;quot; becomes &amp;quot;field_name&amp;quot; and &amp;quot;# of pirates&amp;quot; should be changed to &amp;quot;number_of_pirates&amp;quot;). Reasons we prefer snake case: 1) Marshmallow already converts field names to snake case to some extent automatically. 2) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
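&lt;br /&gt;
A small helper along these lines, as a sketch (the spelling-out of '#' follows the pirate example above):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
def snake_case(name):&lt;br /&gt;
    # Spell out '#' so '# of pirates' becomes 'number_of_pirates'.&lt;br /&gt;
    name = name.replace('#', 'number')&lt;br /&gt;
    # Lowercase, then collapse runs of spaces/punctuation to underscores.&lt;br /&gt;
    name = re.sub(r'[^a-z0-9]+', '_', name.lower())&lt;br /&gt;
    return name.strip('_')&lt;br /&gt;
&lt;br /&gt;
assert snake_case('FIELD NAME') == 'field_name'&lt;br /&gt;
assert snake_case('# of pirates') == 'number_of_pirates'&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;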
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year (prior to 2023) in a series of hourly local timestamps skips an hour and another day (prior to 2022) has the same local timestamp twice (see the sketch below). The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
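&lt;br /&gt;
The ambiguity is easy to demonstrate in Python (this sketch uses America/New_York as a stand-in for the local zone and needs Python 3.9+ for zoneinfo):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from datetime import datetime, timezone&lt;br /&gt;
from zoneinfo import ZoneInfo&lt;br /&gt;
&lt;br /&gt;
eastern = ZoneInfo('America/New_York')&lt;br /&gt;
&lt;br /&gt;
# 2021-11-07 01:30 happened twice in US Eastern time; fold=1 selects&lt;br /&gt;
# the second occurrence.&lt;br /&gt;
first = datetime(2021, 11, 7, 1, 30, tzinfo=eastern)&lt;br /&gt;
second = datetime(2021, 11, 7, 1, 30, fold=1, tzinfo=eastern)&lt;br /&gt;
&lt;br /&gt;
# Identical local timestamps, distinct UTC timestamps: only the UTC&lt;br /&gt;
# value is safe to use as a primary key.&lt;br /&gt;
print(first.astimezone(timezone.utc))   # 2021-11-07 05:30:00+00:00&lt;br /&gt;
print(second.astimezone(timezone.utc))  # 2021-11-07 06:30:00+00:00&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;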
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the &amp;lt;code&amp;gt;destination_file&amp;lt;/code&amp;gt; parameter in the dict of parameters that defines the job (found in, for instance, the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file), as sketched below.&lt;br /&gt;
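&lt;br /&gt;
Roughly, the override sits in the job definition like this (only &amp;lt;code&amp;gt;destination_file&amp;lt;/code&amp;gt; is taken from the text above; the surrounding structure is a guess and the other job parameters are elided):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
job_dicts = [&lt;br /&gt;
    {&lt;br /&gt;
        # ... the rest of this job's parameters go here ...&lt;br /&gt;
        'destination_file': 'robopgh_population.csv',  # overrides the default population.csv&lt;br /&gt;
    },&lt;br /&gt;
]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;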
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns, which are a sign either that the source file has only null values in it or that there's an error in your ETL code, often because there's a typo in the name of the field you're trying to load from (how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms field names can be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by writing a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, edit it to run the new ETL job, and set the schedule to the desired ETL schedule.&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14268</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14268"/>
		<updated>2022-04-22T14:06:02Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* Deploying ETL jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
** Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Snake case ===&lt;br /&gt;
&lt;br /&gt;
Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;quot;FIELD NAME&amp;quot; becomes &amp;quot;field_name&amp;quot; and &amp;quot;# of pirates&amp;quot; should be changed to &amp;quot;number_of_pirates&amp;quot;). Reasons we prefer snake case: 1) Marshmallow already converts field names to snake case to some extent automatically. 2) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year in a series of hourly local timestamps skips an hour and another day has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the 'destination_file' parameter in the dict of parameters that defines the job (found in, for instance, the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns, which are a sign either that the source file has only null values in it or that there's an error in your ETL code, often because there's a typo in the name of the field you're trying to load from (how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms field names can be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
Once tested, an ETL job can be deployed by 1) moving the source code for the ETL job to a production server and 2) scheduling the job to run automatically.&lt;br /&gt;
&lt;br /&gt;
Assuming that you are developing the ETL job on a separate computer and in a dev branch of &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt;, this is a typical deployment workflow:&lt;br /&gt;
# Use &amp;lt;code&amp;gt;&amp;gt; git add -p&amp;lt;/code&amp;gt; to construct atomic commits (each of which should thematically cluster changes) and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;&amp;lt;Meaningful commit description&amp;gt;&amp;quot;&amp;lt;/code&amp;gt; to commit them. Repeat until all the code that needs to be deployed has been committed. If you need to add a new file (like &amp;quot;sky_maintenance.py&amp;quot;), try &amp;lt;code&amp;gt;&amp;gt; git add sky_maintenance.py&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;&amp;gt; git commit -m &amp;quot;Add ETL job for sky-maintenance data&amp;quot;&amp;lt;/code&amp;gt;.&lt;br /&gt;
# If you have any other changes to your dev branch that aren't ready for deployment, type &amp;lt;code&amp;gt;&amp;gt; git stash save&amp;lt;/code&amp;gt; to temporarily stash those changes (so you can switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch).&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git checkout master&amp;lt;/code&amp;gt; lets you switch to the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# &amp;lt;code&amp;gt;&amp;gt; git merge dev&amp;lt;/code&amp;gt; merges the changes committed to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch into the &amp;lt;code&amp;gt;master&amp;lt;/code&amp;gt; branch.&lt;br /&gt;
# Push the changes to GitHub: &amp;lt;code&amp;gt;&amp;gt; git push&amp;lt;/code&amp;gt;&lt;br /&gt;
# Switch back to the &amp;lt;code&amp;gt;dev&amp;lt;/code&amp;gt; branch: &amp;lt;code&amp;gt;&amp;gt; git checkout dev&amp;lt;/code&amp;gt;&lt;br /&gt;
# Restore the stashed code: &amp;lt;code&amp;gt;&amp;gt; git stash pop&amp;lt;/code&amp;gt;&lt;br /&gt;
# Shell into the production server with &amp;lt;code&amp;gt;ssh&amp;lt;/code&amp;gt;.&lt;br /&gt;
# Navigate to wherever the &amp;lt;code&amp;gt;rocket-etl&amp;lt;/code&amp;gt; directory is.&lt;br /&gt;
# Pull the changes from GitHub: &amp;lt;code&amp;gt;&amp;gt; git pull&amp;lt;/code&amp;gt;&lt;br /&gt;
# At this point, it's usually best to test the ETL job to make sure it will work in the production environment. Either the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; command-line parameters can be used if you're not ready to publish data to the production dataset. Failure at this stage usually means that some code or parameter that was supposed to be committed to the git repository didn't get committed or is not defined on the production server.&lt;br /&gt;
# Schedule the job by writing a cron job: run &amp;lt;code&amp;gt;&amp;gt; crontab -e&amp;lt;/code&amp;gt;, duplicate a launchpad line that's already in the crontab file, edit it to run the new ETL job, and set the schedule to the desired ETL schedule.&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Tooling&amp;diff=14267</id>
		<title>Tooling</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Tooling&amp;diff=14267"/>
		<updated>2022-03-22T20:19:57Z</updated>

		<summary type="html">&lt;p&gt;DRW: /* Data tools */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Data tools ==&lt;br /&gt;
* [https://www.visidata.org/ VisiData] - Terminal user interface for a data exploration/manipulation tool that can handle large datasets.&lt;br /&gt;
* [https://github.com/jqnatividad/qsv qsv] - &amp;quot;command line program for indexing, slicing, analyzing, splitting, enriching, validating &amp;amp; joining CSV files. Commands are simple, fast and composable.&amp;quot; Forked from xsv by CKAN Joel, so it's got some CKAN-specific features in the works.&lt;br /&gt;
&lt;br /&gt;
=== Anonymization ===&lt;br /&gt;
* [https://www.open-diffix.org/ Open Diffix] - Free, open-source desktop tool (and eventually Postgres extension) for anonymizing data.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Tooling&amp;diff=14266</id>
		<title>Tooling</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Tooling&amp;diff=14266"/>
		<updated>2022-03-22T20:16:44Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add Tooling page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Data tools ==&lt;br /&gt;
* [https://www.visidata.org/ VisiData] - Terminal user interface for a data exploration/manipulation tool that can handle large datasets.&lt;br /&gt;
* [https://github.com/jqnatividad/qsv qsv] - &amp;quot;command line program for indexing, slicing, analyzing, splitting, enriching, validating &amp;amp; joining CSV files. Commands are simple, fast and composable.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
=== Anonymization ===&lt;br /&gt;
* [https://www.open-diffix.org/ Open Diffix] - Free, open-source desktop tool (and eventually Postgres extension) for anonymizing data.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14263</id>
		<title>ETL</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=ETL&amp;diff=14263"/>
		<updated>2022-03-13T04:28:52Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add timestamp pitfalls&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== ETL overview ==&lt;br /&gt;
&lt;br /&gt;
ETL (an acronym for &amp;quot;Extract-Transform-Load&amp;quot;) describes a data process that obtains data from some source location, transforms it, and delivers it to some output destination.&lt;br /&gt;
&lt;br /&gt;
Most WPRDC ETL processes are written in [https://github.com/WPRDC/rocket-etl/ rocket-etl], an ETL framework customized for use with a) CKAN and b) the specific needs and uses of the [https://data.wprdc.org Western Pennsylvania Regional Data Center open-data portal]. It has been extended to allow the use of command-line parameters to (for instance) override the source and destination locations (pulling data instead from a local file or outputting data to a file, for the convenience of testing pipelines). It can pull data from web sites, FTP servers,  GIS servers that use the data.json standard, and Google Cloud storage, and can deliver data either to the CKAN datastore or the CKAN filestore. It supports CKAN's Express Loader feature to allow faster loading of large data tables.&lt;br /&gt;
&lt;br /&gt;
Some WPRDC ETL processes are still in an older framework; once they're all migrated over, it will be possible to extract a catalog of all ETL processes by parsing the job parameters in the files that represent the ETL jobs.&lt;br /&gt;
&lt;br /&gt;
== Getting data ==&lt;br /&gt;
&lt;br /&gt;
Some of the sources we get data from:&lt;br /&gt;
&lt;br /&gt;
* FTP servers&lt;br /&gt;
** Here's the [https://docs.ipswitch.com/MOVEit/Transfer2019_1/API/Rest/#_getapi_v1_files_id_download-1_0 API documentation for the MOVEit FTP server]&lt;br /&gt;
* APIs&lt;br /&gt;
** Google Cloud infrastructure could count as an API&lt;br /&gt;
** Some custom-built APIs by individual vendors&lt;br /&gt;
* GIS servers&lt;br /&gt;
** Historically this was done through CKAN's &amp;quot;Harvester&amp;quot; program.&lt;br /&gt;
** Now we are switching to writing ETL code to analyze the data.json file and pull the desired files over HTTP.&lt;br /&gt;
* Plain old web sites&lt;br /&gt;
&lt;br /&gt;
== Writing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
A useful tool for writing ETL jobs is [https://github.com/WPRDC/little-lexicographer Little Lexicographer]. While initially designed to just facilitate the writing of data dictionaries (by scanning each column and trying to determine the best type for it, then dumping field names and types into a data dictionary template), Little Lexicographer now also has the ability to output a proposed Marshmallow schema for a CSV file. Its type detection is not perfect, so manual review of the assigned types is necessary. Also, Little Lexicographer is often fooled by seemingly numeric values like ZIP codes; if a value is a code (like a ZIP code or a US Census tract), we treat it as a string. This is especially important in the case of codes that may have leading zeros that would be lost if the value were cast to an integer.&lt;br /&gt;
&lt;br /&gt;
=== Snake case ===&lt;br /&gt;
&lt;br /&gt;
Whenever possible, format column names in [https://en.wikipedia.org/wiki/Snake_case snake case]. This means you should convert everything to lower case and change all spaces and punctuation to underscores (so &amp;quot;FIELD NAME&amp;quot; becomes &amp;quot;field_name&amp;quot; and &amp;quot;# of pirates&amp;quot; should be changed to &amp;quot;number_of_pirates&amp;quot;). Reasons we prefer snake case: 1) Marshmallow already converts field names to snake case to some extent automatically. 2) Snake case field names do not need to be quoted or escaped in PostgreSQL queries (making queries of the CKAN datastore easier).&lt;br /&gt;
&lt;br /&gt;
=== Pitfalls ===&lt;br /&gt;
&lt;br /&gt;
* The [http://wingolab.org/2017/04/byteordermark byte-order mark] showing up at the beginning of the first field name in your file. Excel seems to add this character by default (unless the user tells it not to). As usual, the moral of the story is &amp;quot;Never use Excel&amp;quot;.&lt;br /&gt;
* Using a local timestamp instead of a UTC timestamp as a primary key often leads to problems. Because of Daylight Saving Time, one day each year in a series of hourly local timestamps skips an hour and another day has the same local timestamp twice. The [https://www.caktusgroup.com/blog/2019/03/21/coding-time-zones-and-daylight-saving-time/ general] [https://www.jamesridgway.co.uk/why-storing-datetimes-as-utc-isnt-enough/ advice] is to store (and publish) both the UTC timestamp and the local timestamp. We use the UTC timestamp for primary keys and other data operations, but also publish the local timestamp to make it easier for the user to understand the data.&lt;br /&gt;
&lt;br /&gt;
== Testing ETL jobs ==&lt;br /&gt;
&lt;br /&gt;
Typical initial tests of a rocket-etl job can be invoked like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/&amp;lt;name for publisher/project&amp;gt;/&amp;lt;script name&amp;gt;.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where the &amp;lt;code&amp;gt;mute&amp;lt;/code&amp;gt; parameter prevents errors from being sent to the &amp;quot;etl-hell&amp;quot; Slack channel and the &amp;lt;code&amp;gt;to_file&amp;lt;/code&amp;gt; parameter writes the output to the default location for the job in question. For instance, the job&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute to_file&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would write its output to a file in the directory &amp;lt;code&amp;gt;&amp;lt;PATH TO rocket-etl&amp;gt;/output_files/robopgh/&amp;lt;/code&amp;gt;. Note that the namespacing convention routes the output of &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; jobs to a different directory than that of &amp;lt;code&amp;gt;wormpgh&amp;lt;/code&amp;gt; jobs, but if there were two jobs in the &amp;lt;code&amp;gt;robopgh&amp;lt;/code&amp;gt; payload folder that write to &amp;lt;code&amp;gt;population.csv&amp;lt;/code&amp;gt;, each job would overwrite the output of the other. As this namespacing is for the convenience of testing and development, this level of collision avoidance seems sufficient for now. It's always possible to alter the default output file name by specifying the 'destination_file' parameter in the dict of parameters that defines the job (found in, for instance, the &amp;lt;code&amp;gt;robopgh/census.py&amp;lt;/code&amp;gt; file).&lt;br /&gt;
&lt;br /&gt;
After running the job, examine the output. [https://www.visidata.org/ VisiData] is an excellent tool for rapidly examining and navigating CSV files. As a first step, it's a good idea to go through each column in the output and make sure that the results make sense. Often this can be done by opening the file in VisiData (&amp;lt;code&amp;gt;&amp;gt; vd output_files/robopgh/population.csv&amp;lt;/code&amp;gt;) and invoking &amp;lt;code&amp;gt;Shift+F&amp;lt;/code&amp;gt; on each column to calculate the histogram of its values. This is a quick way to catch empty columns, which are a sign either that the source file has only null values in it or that there's an error in your ETL code, often because there's a typo in the name of the field you're trying to load from (how [https://marshmallow.readthedocs.io/en/2.x-line/why.html marshmallow] transforms field names can be non-intuitive).&lt;br /&gt;
&lt;br /&gt;
Try to understand what the records in the data represent. Are there any transformations that could be made to help the user understand the data?&lt;br /&gt;
&lt;br /&gt;
Does the set of data as a whole make sense? For instance, look at counts over time (either by grouping records by year+month and aggregating to counts that you can visually scan or by [https://www.visidata.org/docs/graph/ plotting] record counts by date or timestamp).&lt;br /&gt;
&lt;br /&gt;
Are the field names clear? If not, change them to something clearer. Are they unreasonably long when a shorter name would do? Shorten them to something that is still clear.&lt;br /&gt;
&lt;br /&gt;
If you can't figure out something about the data, ask someone else and/or the publisher.&lt;br /&gt;
&lt;br /&gt;
Once you're satisfied with the output data you're getting, you can rerun the job with the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter to push the resulting output to the default testbed dataset (a private dataset used for testing ETL jobs):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute test&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Development instances of rocket-etl should be configured to load to this testbed dataset by default (that is, even if the &amp;lt;code&amp;gt;test&amp;lt;/code&amp;gt; parameter is not specified) as a safety feature. The parameter that controls this setting is &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt;, which can be found in the &amp;lt;code&amp;gt;engine/parameters/local_parameters.py&amp;lt;/code&amp;gt; file and which should be defined like this:&lt;br /&gt;
&amp;lt;code&amp;gt;PRODUCTION = False&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Only in production environments should &amp;lt;code&amp;gt;PRODUCTION&amp;lt;/code&amp;gt; be set to &amp;lt;code&amp;gt;True&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In a development environment, to run an ETL job and push the results to the production version of the dataset, do this:&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;gt; python launchpad.py engine/payload/robopgh/census.py mute production&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying ETL jobs ==&lt;br /&gt;
(To be written.)&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=File_Formats&amp;diff=14262</id>
		<title>File Formats</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=File_Formats&amp;diff=14262"/>
		<updated>2022-03-09T20:48:14Z</updated>

		<summary type="html">&lt;p&gt;DRW: Add File Formats page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== GeoJSON ==&lt;br /&gt;
* [https://macwright.com/2015/03/23/geojson-second-bite.html More than you ever wanted to know about GeoJSON]&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
	<entry>
		<id>https://wiki.tessercat.net/index.php?title=Data_Dictionaries&amp;diff=14261</id>
		<title>Data Dictionaries</title>
		<link rel="alternate" type="text/html" href="https://wiki.tessercat.net/index.php?title=Data_Dictionaries&amp;diff=14261"/>
		<updated>2022-03-09T20:44:44Z</updated>

		<summary type="html">&lt;p&gt;DRW: Move table out of list to improve parsing of Markdown&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== What is a data dictionary? ==&lt;br /&gt;
&lt;br /&gt;
Much like a regular dictionary defines words and tells the reader how to use them, a data dictionary explains all the columns (fields) of your data with enough detail that the reader can start using your data.&lt;br /&gt;
&lt;br /&gt;
It lists each field name that appears in the table of data, provides a definition for the field, specifies the type of data in that field (e.g., &amp;quot;string&amp;quot; or &amp;quot;integer&amp;quot;), and gives an example value that could occur in that field.&lt;br /&gt;
&lt;br /&gt;
== Why are data dictionaries important? ==&lt;br /&gt;
&lt;br /&gt;
Published data often has field names that are deliberately short, and they may wind up being cryptic or unclear, particularly to someone unfamiliar with the topic that the data describes. The field might be &amp;quot;weight&amp;quot;, but if the value is 8, the user won't know if the item weighs 8 pounds or 8 kilograms or 8 tons.&lt;br /&gt;
&lt;br /&gt;
Data dictionaries are a simple way to explain such details and make the data you publish more accessible to other people. &lt;br /&gt;
&lt;br /&gt;
== Suggested formats ==&lt;br /&gt;
&lt;br /&gt;
Our integrated data dictionaries use a four-field format to define your data's fields.&lt;br /&gt;
&lt;br /&gt;
The fields we recommend are:&lt;br /&gt;
* &amp;quot;column&amp;quot;: The name of the field. (Preferably formatted in all lowercase letters and with underscores instead of spaces or other punctuation.)&lt;br /&gt;
* &amp;quot;type&amp;quot;: The type of the field, coded as something like &amp;quot;text&amp;quot; or &amp;quot;float&amp;quot; or &amp;quot;int&amp;quot;. The [http://docs.ckan.org/en/latest/maintaining/datastore.html#field-types full list of types for values that go into the CKAN datastore] (which the WPRDC data portal runs on) is shown in Table 1 below. Note that when using CKAN's integrated data dictionaries, the type of a data dictionary entry is set by the type of the field in the CKAN datastore, so you shouldn't have to set this yourself (unless you need to override the type that CKAN inferred for the field, which applies only to non-ETL uploads).&lt;br /&gt;
* &amp;quot;label&amp;quot;: A short human-readable label for this field. If the field name is &amp;lt;code&amp;gt;zip&amp;lt;/code&amp;gt;, an appropriate label might be &amp;quot;ZIP Code&amp;quot; or &amp;quot;Postal Code&amp;quot;. &lt;br /&gt;
* &amp;quot;description&amp;quot;: A definition of the field (you could, for instance, include in here that the units of the field value are furlongs). This is also the place to put any other information relevant to the field, including information about how the field was calculated from another field or how the field was transformed for publication.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+ Table 1&lt;br /&gt;
|-&lt;br /&gt;
! CKAN type !! description&lt;br /&gt;
|-&lt;br /&gt;
| text || text string&lt;br /&gt;
|-&lt;br /&gt;
| int || integer&lt;br /&gt;
|-&lt;br /&gt;
| float || real number&lt;br /&gt;
|-&lt;br /&gt;
| boolean || a Boolean value (True or False)&lt;br /&gt;
|-&lt;br /&gt;
| date || a date without a time&lt;br /&gt;
|-&lt;br /&gt;
| time || a time without a date&lt;br /&gt;
|-&lt;br /&gt;
| timestamp || a date and a time together (a.k.a., a &amp;quot;datetime&amp;quot;)&lt;br /&gt;
|-&lt;br /&gt;
| json || a JSON representation of some data (super useful but by far the most obscure type on this list)&lt;br /&gt;
|}&lt;br /&gt;
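&lt;br /&gt;
Put together, a few hypothetical rows of a four-field data dictionary in CSV form could look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
column,type,label,description&lt;br /&gt;
walrus_count,int,Walrus Count,&amp;quot;Number of walruses observed during the survey&amp;quot;&lt;br /&gt;
weight,float,Weight (kg),&amp;quot;Weight of the item, in kilograms&amp;quot;&lt;br /&gt;
zip,text,ZIP Code,&amp;quot;Five-digit USPS ZIP code, kept as text to preserve leading zeros&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;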
&lt;br /&gt;
We also like [https://opendata.stackexchange.com/a/319 the Frictionless Data JSON Table Schema example] approach to data dictionaries, but we're not quite ready for that yet.&lt;br /&gt;
&lt;br /&gt;
Other suggestions: We like to name fields by making all the letters lowercase and converting spaces and other punctuation to the underscore character (&amp;quot;_&amp;quot;). So, we would convert the field name &amp;quot;Walrus Count&amp;quot; to the name &amp;quot;walrus_count&amp;quot;. This is called &amp;quot;snake case&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
== How to create a data dictionary ==&lt;br /&gt;
&lt;br /&gt;
Option 1 (preferred and easiest): Use our fancy new integrated data dictionaries, which you can create through the CKAN web interface.&lt;br /&gt;
&lt;br /&gt;
Option 2: You can use a spreadsheet program and then export the results to CSV.&lt;br /&gt;
&lt;br /&gt;
If you do it this way, check it over by opening it in a text editor, to make sure that Excel didn't format anything (like dates) weirdly.&lt;br /&gt;
&lt;br /&gt;
Option 3: You can type it up by hand. It's not that hard if you have an example.&lt;br /&gt;
(See [https://github.com/WPRDC/little-lexicographer/tree/master/examples here] for examples about books.)&lt;br /&gt;
&lt;br /&gt;
Option 4: Use this handy Python script I wrote: [https://github.com/WPRDC/little-lexicographer little-lexicographer].&lt;br /&gt;
&lt;br /&gt;
== Integrated data dictionaries ==&lt;br /&gt;
&lt;br /&gt;
Our latest version of the data-portal software includes nifty built-in data-dictionary capabilities. As the publisher, you can edit the data dictionary through the management interface and then the user can view it right below the corresponding data table.&lt;br /&gt;
&lt;br /&gt;
=== How to edit a resource's integrated data dictionary ===&lt;br /&gt;
&lt;br /&gt;
# From the resource page, click the “Manage” button.&lt;br /&gt;
# Click on the “Data Dictionary” tab. You will see a long form with selectors and blanks for each field.&lt;br /&gt;
# Optional: Use the “Type Override” selector to change the types for any fields that need to be changed.&lt;br /&gt;
# You can also provide human-readable names for the fields in the “Label” blank and a longer description in the catch-all “Description” field.&lt;br /&gt;
# Click “Save” at the bottom of the page.&lt;br /&gt;
&lt;br /&gt;
=== Uploading integrated data dictionaries ===&lt;br /&gt;
&lt;br /&gt;
[https://github.com/WPRDC/little-lexicographer#uploading-integrated-data-dictionaries Little Lexicographer] supports uploading properly formatted CSV files to the integrated data dictionary of an existing resource.&lt;br /&gt;
&lt;br /&gt;
== Beyond data dictionaries ==&lt;br /&gt;
&lt;br /&gt;
Some datasets benefit from extended documentation. For these, we have [https://tools.wprdc.org/guides/ Data Guides]!&lt;br /&gt;
&lt;br /&gt;
We can also recommend the [https://arxiv.org/abs/1803.09010 Datasheets for Datasets standard] for a comprehensive approach to documenting data that you are publishing.&lt;br /&gt;
&lt;br /&gt;
[[Category:Onboarding]]&lt;/div&gt;</summary>
		<author><name>DRW</name></author>
	</entry>
</feed>