By Mike Lynch
ARDC funding - Data and Services Discovery projects - Institutional Role in a Data Commons
There are a lot of specialized repository applications, from small (Omeka) to large (Hydra, Fedora), all designed as special-purpose homes for datasets and metadata which provide APIs for getting things in and out.
Experience has shown that these solutions don’t scale. Eventually, an institution will have to store a dataset that is too big to move in or out through the API, or too big for the repository to hold, and will have to fall back on a workaround such as putting the data on disk and pointing to it from a record in the repository.
The Oxford Common File Layout (OCFL) is a standard for laying out arbitrary collections of files as structured repositories. It grew out of a push from the repositories community for a repository structure that wasn’t locked in to a particular application, didn’t have scaling problems, and was easy to migrate to and from.
An OCFL repository is a collection of OCFL Objects, laid out according to a simple standard like PairTree. Objects are immutable and versioned. More details can be found in our FAIR Repositories presentation.
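To illustrate how an object identifier can be turned into a directory inside the repository, here is a minimal sketch in Python. It is not the Pairtree algorithm itself: to avoid Pairtree's character-escaping rules it hashes the identifier with SHA-256 and then splits the digest into two-character segments in the pairtree style. The function name, the number of segments and the example identifier are assumptions made for illustration, not the layout used in our repository.

```python
import hashlib
from pathlib import PurePosixPath


def object_path(object_id: str, segments: int = 3, width: int = 2) -> PurePosixPath:
    """Map an object identifier to a relative directory path.

    Simplified, pairtree-style layout (an assumption, not the Pairtree spec):
    hash the identifier, use the first few fixed-width slices of the digest
    as nested directories, and use the full digest as the object directory.
    """
    digest = hashlib.sha256(object_id.encode("utf-8")).hexdigest()
    prefix = [digest[i * width:(i + 1) * width] for i in range(segments)]
    return PurePosixPath(*prefix, digest)


if __name__ == "__main__":
    # Hypothetical identifier, used only to show the shape of the result.
    print(object_path("https://example.org/dataset/1"))
```

Short, fixed-width directory names keep any single directory from holding too many entries, which is the property that makes this family of layouts attractive for large repositories.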
We’re using OCFL for our data publications repository, and will later use it for an internal data repository with access control.
RO-Crate is a standard for describing individual datasets which evolved from two previous standards, DataCrate and Research Objects. Datasets are described using JSON-LD, a standard for building linked data descriptions in JSON. The ontology is Schema.org, which is widely supported in industry, and used by Google’s new dataset search engine.
A “crated” dataset is a directory with an arbitrary file hierarchy inside it, and an RO-Crate JSON-LD document with contextual metadata (title, description, contributors, licences) and descriptions of some or all of the contents. An RO-Crate doesn’t have to describe every file inside it, as this would be impractical for some datasets.
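As a rough picture of what the JSON-LD document contains, the sketch below builds a minimal crate in Python. It assumes the RO-Crate 1.1 conventions (a metadata file named ro-crate-metadata.json, a descriptor entity pointing at the root dataset "./"); earlier drafts of the standard used slightly different file names. The helper name make_crate and the example title, licence URL and file are hypothetical.

```python
import json

CONTEXT = "https://w3id.org/ro/crate/1.1/context"
METADATA_FILE = "ro-crate-metadata.json"  # RO-Crate 1.1 naming (assumed)


def make_crate(name, description, licence_url, files=None):
    """Return a minimal RO-Crate JSON-LD document as a dict.

    `files` maps relative paths to human-readable labels. It may be empty
    or omitted, since a crate need not describe every file it contains.
    """
    files = files or {}
    descriptor = {
        "@id": METADATA_FILE,
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "license": {"@id": licence_url},
        "hasPart": [{"@id": path} for path in files],
    }
    parts = [
        {"@id": path, "@type": "File", "name": label}
        for path, label in files.items()
    ]
    return {"@context": CONTEXT, "@graph": [descriptor, root, *parts]}


if __name__ == "__main__":
    crate = make_crate(
        "Example survey data",  # hypothetical dataset
        "Responses collected for an illustrative survey.",
        "https://creativecommons.org/licenses/by/4.0/",
        {"data/responses.csv": "Survey responses"},
    )
    print(json.dumps(crate, indent=2))
```

Calling make_crate without any files yields a crate whose root dataset has an empty hasPart list, which corresponds to describing the dataset without listing its contents.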
An RO-Crate doesn’t even have to have any files other than the RO-Crate files. This allows the system to support metadata-only publication, where the actual data is only available on request.
An RO-Crate also contains an HTML document which is generated automatically from the JSON-LD. This provides a human-readable view of the metadata with links to the data payloads. It can be used locally or act as the landing page for an RO-Crate published on a web server.
OCFL and RO-Crate are the standards we’re using to lay out the repository. To deliver them, we’re using Solr, an efficient search engine, and nginx, a high-performance open-source web server.
Metadata from the RO-Crates is indexed into Solr, including licences representing lists of users who are allowed to access the dataset. Solr is also used to provide a discovery interface via a lightweight single-page application.
A simple extension to nginx allows it to resolve incoming URLs to paths in the OCFL repository and serve the appropriate version of the RO-Crate metadata and payload files.
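The core of that lookup can be sketched in Python, assuming the standard OCFL inventory structure: a "manifest" mapping content digests to stored paths, and a per-version "state" mapping digests to logical paths. This is an illustration of the idea only, not the nginx extension itself. The function name and the example inventory (with shortened placeholder digests) are hypothetical.

```python
from typing import Optional


def resolve_content_path(inventory: dict, logical_path: str,
                         version: Optional[str] = None) -> str:
    """Find where a logical file of a given version is stored on disk.

    Returns a path relative to the OCFL object root. Defaults to the
    head (most recent) version when no version is given.
    """
    version = version or inventory["head"]
    state = inventory["versions"][version]["state"]
    for digest, logical_paths in state.items():
        if logical_path in logical_paths:
            # The manifest lists every stored copy of this content;
            # any of them will do, so take the first.
            return inventory["manifest"][digest][0]
    raise KeyError(f"{logical_path!r} is not part of {version}")


# Illustrative inventory; digests are shortened placeholders, not real hashes.
EXAMPLE_INVENTORY = {
    "head": "v2",
    "manifest": {
        "aaa111": ["v1/content/ro-crate-metadata.json"],
        "bbb222": ["v1/content/data/a.csv"],
        "ccc333": ["v2/content/ro-crate-metadata.json"],
    },
    "versions": {
        "v1": {"state": {"aaa111": ["ro-crate-metadata.json"],
                         "bbb222": ["data/a.csv"]}},
        "v2": {"state": {"ccc333": ["ro-crate-metadata.json"],
                         "bbb222": ["data/a.csv"]}},
    },
}

if __name__ == "__main__":
    print(resolve_content_path(EXAMPLE_INVENTORY, "data/a.csv"))
```

In the example, data/a.csv did not change between versions, so a request for it in v2 still resolves to the copy stored under v1. This is how OCFL keeps versions immutable without duplicating unchanged files.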
Before serving a file, the nginx handler looks up the Solr index and checks whether the authenticated user has authority to download it.
Nginx enforces licences on both the search results and payloads.
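The licence check amounts to comparing the licences recorded against a dataset in the index with the licences held by the authenticated user. The Python sketch below shows that comparison for both a single download and a list of search results. It assumes each indexed document carries a multi-valued field called "licence"; that field name, the function names and the example licence identifiers are all hypothetical, and the real check runs inside the nginx handler against Solr.

```python
from typing import Dict, Iterable, List

LICENCE_FIELD = "licence"  # hypothetical name of the indexed field


def may_download(doc: Dict, user_licences: Iterable[str]) -> bool:
    """Allow access only if the user holds at least one of the dataset's licences.

    A document with no licence information is denied by default.
    """
    required = set(doc.get(LICENCE_FIELD, []))
    return bool(required & set(user_licences))


def visible_results(docs: Iterable[Dict], user_licences: Iterable[str]) -> List[Dict]:
    """Filter search results down to those the user is allowed to see."""
    held = set(user_licences)
    return [doc for doc in docs if may_download(doc, held)]


if __name__ == "__main__":
    docs = [
        {"id": "open-dataset", LICENCE_FIELD: ["public"]},
        {"id": "restricted-dataset", LICENCE_FIELD: ["project-x-members"]},
        {"id": "unlabelled-dataset"},
    ]
    # Every user is assumed to hold the "public" licence in this sketch.
    print([d["id"] for d in visible_results(docs, ["public"])])
```

Denying access when no licence is recorded is a deliberate choice in this sketch: an indexing error then hides a dataset rather than exposing it.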
The OCFL/RO-Crate and nginx stack is meant to accommodate datasets from a wide range of disciplines and sources, including small web uploads, existing repositories and data capture.
As part of the ARDC Data and Services Discovery project, we collaborated with Nick Thieberger and Marco La Rosa of PARADISEC and Euwe Ermita’s team at the State Library of NSW. Both of these institutions have rich digital humanities collections stored in custom repositories behind APIs, and in short one- or two-day workshops we were able to make substantial progress in crosswalking their datasets into OCFL.
This work is licensed under a Creative Commons Attribution 3.0 Australia License.