Useful data: developing standards for describing clinical research data across repositories

Movements towards increased transparency have resulted in more and more documents, journal publications and original datasets relating to specific clinical trials being included in a wide range of repositories. This is a great direction for the field, but if we are to use these ‘data objects’ most effectively, we need to establish consistency in the way they are described (and therefore searchable) across these repositories. Steve Canham and Christian Ohmann from the European Clinical Research Infrastructure Network have recently published a Methodology article in Trials proposing standards for consistent descriptions of these data objects. Here, both authors highlight its importance for the field and summarize their proposal.

Trialists are under increasing pressure, from funders, journal editors, and a general cultural shift towards full transparency in science, to make the original data and documents of their studies available to others.

Systems are slowly evolving to support such access, including the development of various data repositories. But a fundamental issue, ‘discoverability’, needs to be resolved before the full promise of this new transparency can be realized, because the various ‘data objects’ (the generic term for any document or dataset available in electronic format) may be scattered across different repositories, publishers and institutions.

Specifically, we need systems, for both humans and machines, that can be used to locate and characterize the various datasets and documents that exist for a study, together with the information needed to apply for access to them. But we believe that can only happen in a cost-effective (and therefore sustainable) way if the various data objects are described in a consistent fashion at source – providing data that can be harvested by software systems on a regular basis.

In short, we assert that the clinical research community needs to agree a simple, consistent metadata scheme for clinical research data objects, and deploy it, or at least map to it, across all the various locations where such data objects are to be found.

Any such schema has to:

  1. Unambiguously identify the research study or studies that the data object is about (or generated from / used within).
  2. Characterize the research object itself – e.g. its type, authorship, contents, size, language, etc.
  3. Describe the object’s location and the access regime under which it is available. If not public, the regime needs to be described in sufficient detail for a potential user to be able to apply for access.
  4. Be lightweight enough to be easily applicable, especially by those who generate the data objects in the first place.
  5. So far as possible, make use of elements from existing data schemas.

Published today in Trials, we propose such a schema, which is based on the widely used DataCite standard to describe the data or document itself, but with two extensions to cover the specific needs of clinical researchers.

These provide: a) study identification data, including clinical trial registry IDs, and b) data covering location, ownership and access to the data object. Table 1 summarizes our proposals, and indicates which of the data points we see as mandatory, recommended or optional.

| Mandatory | Recommended | Optional |
|---|---|---|
| A.1 Source Study Title* | A.2 Study Identifier records*; A.3 Study Topics* | |
| B.1 DOI (1); B.3 Object Title; B.5 Version | | B.2 Object Other Identifiers*; B.4 Object Additional Titles* |
| C.1 Creators* | C.2 Contributors* | |
| D.1 Creation year | D.2 Dates* | |
| E.1 Resource Type General | E.2 Resource Type; E.3 Description*; E.5 Language; E.6 Related Identifiers* | E.4 Subjects (of data object)* |
| F.1 Publisher; F.3 Access Type; F.4 Access Details (2); F.5 Access Contact (2); F.6 Resources* | F.2 Other Hosting Institutions* | F.7 Rights* |

(1) Mandatory for publicly accessible data objects, recommended for all others.

(2) Mandatory if access is non-public.
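
To make the mandatory column and the two conditional notes concrete, here is a minimal Python sketch of a completeness check. It is an illustration only, not part of the published proposal: the snake_case keys, the `"public"` access value and the `missing_mandatory` function are all hypothetical names chosen for this example, and the list of mandatory elements reflects our reading of Table 1.

```python
# Elements that are always mandatory, as read from Table 1.
# The snake_case keys are hypothetical; only the A.1 ... F.6 labels come from the table.
ALWAYS_MANDATORY = (
    "source_study_title",     # A.1
    "object_title",           # B.3
    "version",                # B.5
    "creators",               # C.1
    "creation_year",          # D.1
    "resource_type_general",  # E.1
    "publisher",              # F.1
    "access_type",            # F.3
    "resources",              # F.6
)


def missing_mandatory(record: dict) -> list[str]:
    """Return the keys of mandatory elements that are absent or empty."""
    missing = [key for key in ALWAYS_MANDATORY if not record.get(key)]

    # Assumed vocabulary: "public" marks a publicly accessible object.
    # Any other value, or no value at all, is treated as non-public here.
    if record.get("access_type") == "public":
        # Note (1): a DOI is mandatory for publicly accessible data objects.
        if not record.get("doi"):
            missing.append("doi")
    else:
        # Note (2): access details and contact are mandatory if access is non-public.
        for key in ("access_details", "access_contact"):
            if not record.get(key):
                missing.append(key)
    return missing


# Placeholder values only; none of these describe a real study or data object.
example = {
    "source_study_title": "Example Trial",
    "object_title": "Example Trial: study protocol",
    "version": "1.0",
    "creators": ["Example Author"],
    "creation_year": 2016,
    "resource_type_general": "Text",
    "publisher": "Example Repository",
    "access_type": "public",
    "resources": ["https://repository.example/objects/1"],
    "doi": "10.0000/example.1",
}

print(missing_mandatory(example))
```

A repository could run a check of this kind when a data object is deposited, so that gaps are caught at source rather than when the record is later harvested.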

We are well aware of the problem, summarized by the well-known cartoon at https://xkcd.com/927/, that trying to develop any new common standard risks merely adding one more to the list of existing standards (even if these are, currently, usually specific to particular repository systems). We also concede there may be issues with developing unambiguous identifiers for both studies and data objects, but we do not believe that these are insurmountable.

We argue instead that the use of a common metadata schema is an absolute prerequisite for implementing systems that can discover and index clinical research data objects on an ongoing basis. It is therefore key to supporting data sharing and the effective cataloging of clinical research resources in general.
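
As a rough sketch of why a shared schema matters for indexing, the Python below groups records harvested from different sources under the study identifiers they cite. Everything in it is an assumption made for illustration: the `study_identifiers` key, the `index_by_study` function, the placeholder registry IDs and the source names are hypothetical, and a real harvester would also have to fetch and refresh the records.

```python
from collections import defaultdict


def index_by_study(records: list[dict]) -> dict[str, list[dict]]:
    """Group harvested records under each study identifier they cite."""
    index = defaultdict(list)
    for record in records:
        # A data object may relate to more than one study.
        for study_id in record.get("study_identifiers", []):
            index[study_id].append(record)
    return dict(index)


# Placeholder records, as if harvested from three different sources.
harvested = [
    {"object_title": "Protocol", "publisher": "Repository A",
     "study_identifiers": ["REGISTRY-0001"]},
    {"object_title": "Anonymised dataset", "publisher": "Repository B",
     "study_identifiers": ["REGISTRY-0001"]},
    {"object_title": "Results paper", "publisher": "Journal C",
     "study_identifiers": ["REGISTRY-0002"]},
]

index = index_by_study(harvested)
for obj in index["REGISTRY-0001"]:
    print(obj["object_title"], "@", obj["publisher"])
```

The grouping is only this simple because every record exposes its study identifiers in the same place; without a common schema, each source would need its own mapping before any index could be built.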

Our initial proposals are made with the intention of initiating a debate amongst interested stakeholders, and inviting others (repositories, trialists and standards development organizations) to comment upon them and to discuss ways in which these proposals, or any schema that evolves from them, can be implemented as a widely used standard. Please contact us with your comments!
