CKAN Tricks

Revision as of 17:06, 28 April 2022 by DRW (Add CKAN Tricks page)

Manage datasets

Undelete deleted datasets

If you know the URL of the deleted dataset and you are logged in as an administrator, you can go to that URL in your web browser and see the deleted dataset. "[Deleted]" will have been appended to the title of the dataset, and the value of the state metadata field will be "deleted".

To undelete such a dataset, set the package's state metadata value to "active", either through the CKAN API directly or with the set_package_parameters_to_values() function. For the latter option, invoke the function like this: set_package_parameters_to_values("https://data.wprdc.org", deleted_package_id, ['state'], ['active'], API_key)
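For the direct API route, here is a minimal sketch using only the Python standard library. It assumes the package_patch action is the one used to change the state field and that the API key is sent in the Authorization header; the helper names build_undelete_request and undelete_package are hypothetical and are not part of any existing utility library.

```python
import json
import urllib.request


def build_undelete_request(site, package_id, api_key):
    """Build a package_patch request that sets a package's state to "active".

    Assumption: package_patch is the API action used; only "id" and "state"
    are sent, so other metadata fields are left alone.
    """
    url = f"{site.rstrip('/')}/api/3/action/package_patch"
    body = json.dumps({"id": package_id, "state": "active"}).encode("utf-8")
    headers = {"Authorization": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def undelete_package(site, package_id, api_key):
    """Send the request and return the decoded CKAN response (network call)."""
    request = build_undelete_request(site, package_id, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling undelete_package("https://data.wprdc.org", deleted_package_id, API_key) should then play the same role as the set_package_parameters_to_values() call above, provided the key belongs to an account allowed to edit the package.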

Queries

Make queries faster

The datastore_search API call lets you search a given datastore by column values and return subsets of the records. To speed up datastore_search queries, include the include_total=False parameter, which skips calculation of the total number of rows (this can reduce response time by a factor of 2). There's more on benchmarking CKAN performance here.
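As a rough illustration, the following sketch builds a datastore_search URL with include_total turned off. The helper name datastore_search_url, the default limit, and the example resource ID and filter are all hypothetical; only the datastore_search action and its include_total, filters, and limit parameters come from the CKAN API.

```python
import json
from urllib.parse import urlencode


def datastore_search_url(site, resource_id, filters=None, limit=100,
                         include_total=False):
    """Build a datastore_search GET URL; include_total defaults to off."""
    params = {
        "resource_id": resource_id,
        "limit": limit,
        # Sent as a lowercase string because this is a URL query parameter.
        "include_total": "true" if include_total else "false",
    }
    if filters:
        # filters is a JSON object mapping column names to required values.
        params["filters"] = json.dumps(filters)
    return f"{site.rstrip('/')}/api/3/action/datastore_search?{urlencode(params)}"


# Hypothetical resource ID and column name, for illustration only.
url = datastore_search_url("https://data.wprdc.org", "some-resource-id",
                           filters={"zip": "15213"}, limit=5)
```

With include_total off, the response should simply omit the row count, so any script that reads result["total"] needs to stop relying on it.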

Another way to speed up datastore_search queries is to index the fields used in the filtering. Note that (at least when the primary key is a combination of fields) the primary key fields are not indexed individually unless you list each one as a separate field to index, and queries that filter on them take much longer.
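One way to avoid forgetting a key field is to build the list of indexes from the primary key itself. The sketch below assumes the indexes are requested through the primary_key and indexes parameters of datastore_create; the helper name build_index_payload and the field names in the example are hypothetical.

```python
def build_index_payload(resource_id, primary_key_fields, filter_fields=()):
    """Build a datastore_create payload that indexes every key field.

    Each primary key field is listed individually under "indexes", followed
    by any other fields used in filters. Duplicates are dropped while the
    original order is kept.
    """
    indexes = list(dict.fromkeys([*primary_key_fields, *filter_fields]))
    return {
        "resource_id": resource_id,
        "primary_key": list(primary_key_fields),
        "indexes": indexes,
    }


# Hypothetical composite key (parcel_id, year) plus one extra filter field.
payload = build_index_payload("some-resource-id",
                              ["parcel_id", "year"], ["year", "zip"])
```

The resulting dictionary would be sent as the JSON body of a datastore_create call; whether the indexes were actually created is worth confirming by timing a filtered query before and after.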

Avoid stale query caches

Query and API responses can be cached, depending on the nginx settings. If you find that your SQL query is getting a stale response, try changing your query slightly so that it no longer matches the cached request. For instance, instead of `SELECT MAX(some_field) AS biggest FROM <resource-id>`, you could change the column alias (`SELECT MAX(some_field) AS biggest0413 FROM <resource-id>`) or add another field that you ignore (`SELECT MAX(some_field) AS biggest, MAX(some_field) AS whatever FROM <resource-id>`).
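The alias trick can be automated by appending a changing suffix to the alias on every run. This is a minimal sketch, assuming the query goes through the datastore_search_sql action and that a Unix timestamp is an acceptable suffix; the function names are hypothetical.

```python
import time
from urllib.parse import urlencode


def cache_busting_sql(field, resource_id, suffix=None):
    """Return a MAX() query whose column alias changes with the suffix."""
    if suffix is None:
        # Assumption: the current Unix time (in seconds) is unique enough.
        suffix = str(int(time.time()))
    return f'SELECT MAX("{field}") AS biggest{suffix} FROM "{resource_id}"'


def datastore_search_sql_url(site, sql):
    """Build the GET URL for a datastore_search_sql query."""
    query = urlencode({"sql": sql})
    return f"{site.rstrip('/')}/api/3/action/datastore_search_sql?{query}"
```

Because the alias changes on every call, read the value from the only column of the returned record instead of looking it up by a fixed name.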

Scripts that interact with CKAN through the API

Run CKAN-monitoring/modifying scripts from multiple servers by centralizing data

To avoid keeping local databases of information about datasets, store that information (such as the last time an ETL job was run on a given package) in the 'extras' metadata field of the CKAN package as much as possible. This keeps the information in a centralized location, so ETL jobs can be run from multiple computers without any other coordination. The extras metadata fields are currently cataloged on the CKAN_Metadata page.
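As an illustration, the sketch below records the time of the latest ETL run in a package's extras. It assumes extras are represented as a list of {"key": ..., "value": ...} dictionaries (as returned by package_show) and that the whole list has to be sent back when it is updated; the key name last_etl_update and both function names are hypothetical, so check the CKAN_Metadata page for the names actually in use.

```python
from datetime import datetime, timezone


def upsert_extra(extras, key, value):
    """Return a copy of a CKAN extras list with `key` set to `value`."""
    # Drop any existing entry for this key, keep the rest, then append.
    updated = [dict(extra) for extra in extras if extra.get("key") != key]
    updated.append({"key": key, "value": value})
    return updated


def build_last_run_patch(package, when=None):
    """Build a package_patch payload recording when the ETL job last ran.

    `package` is a package dictionary such as the one package_show returns.
    """
    when = when or datetime.now(timezone.utc)
    extras = upsert_extra(package.get("extras", []),
                          "last_etl_update", when.isoformat())
    return {"id": package["id"], "extras": extras}
```

Merging into the existing list, instead of sending only the new entry, is meant to keep the other extras from being overwritten when the payload is submitted.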