The world needs One Repo

There are many, many repositories out there for scientists to store their research, and although this is great, searching for one thing across them all is not possible. Mike Taylor introduces a new project aiming to overcome this issue, The One Repo.

Suppose you’re a dinosaur lover, and you want to find information about sauropods. You go to Google, type ‘sauropod’ and hit the Search button, and you get a pretty good list of the best Web-pages about sauropods. Easy.

Now suppose you’re a dinosaur researcher, and you want to find open-access scholarly papers about sauropods. More than that: suppose you want to text-mine them, or analyze the papers’ metadata to see how open-access publication rates have changed through time, or what proportion of the authors are female.

What if you need to verify that the papers conform to an institutional open-access policy? For this kind of thing, you need structured results, not just links to pages – which means that Google doesn’t give you what you need.

Google Scholar is a bit better – it understands the concept of a scholarly paper, as opposed to a general Web page, and it has some metadata. But there is no way to access this data except via the Web-facing user interface. (And you’re explicitly not allowed to use crawlers to scrape the metadata out of the page.)

Repositories are the answer

The solution is repositories, mostly run by universities and other institutions and so known as Institutional Repositories (IRs). The problem is that there are at least 4000 of them out there, and no good way to search across them all.

Instead, to find every hit, you’d need to submit your query to 4000 separate IRs – all different cosmetically and often varying in much deeper ways – and somehow collate all the results. Clearly that’s a non-starter.
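To make the collation problem concrete, here is a minimal sketch of merging result lists from several repositories and removing duplicates. The record fields ("title", "doi"), the repository names and the function names are all assumptions made for illustration; real IRs return far less uniform data, and a record that lacks a DOI will not be matched against a copy that has one.

```python
def normalise_title(title):
    """Lower-case a title and collapse whitespace so trivial variants match."""
    return " ".join(title.lower().split())


def collate(results_by_repo):
    """Merge per-repository result lists into one de-duplicated list.

    results_by_repo maps a (hypothetical) repository name to a list of
    dicts with a "title" and an optional "doi".
    """
    merged = {}
    for repo, records in results_by_repo.items():
        for rec in records:
            doi = rec.get("doi")
            # DOIs are case-insensitive, so compare them in lower case;
            # fall back to the normalised title when there is no DOI.
            key = doi.lower() if doi else normalise_title(rec["title"])
            entry = merged.setdefault(
                key, {"title": rec["title"], "doi": doi, "sources": []}
            )
            entry["sources"].append(repo)
    return list(merged.values())


# Hypothetical results from two repositories holding the same paper.
results = {
    "repo-a": [{"title": "Sauropod neck posture", "doi": "10.1234/ABC"}],
    "repo-b": [
        {"title": "Sauropod Neck  Posture", "doi": "10.1234/abc"},
        {"title": "Another paper"},
    ],
}
papers = collate(results)
```

Even this toy version has to guess at what counts as "the same paper", which hints at why doing it by hand across thousands of repositories is impractical.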

The good news is that some important steps have been taken towards aggregating the contents of IRs. In the UK, JISC’s CORE (Connecting Repositories) project brings together content from most of Britain’s universities; the OpenAIRE repository contains the papers from many European-funded projects; the SHARE Notify service gathers similar information for federally funded research in the USA.

But while these projects include content from some repositories outside their geographical areas, they understandably focus on their homes: for example, while CORE aggregates 248 UK IRs, only 15 are currently included from Asia; and of the 78 content providers presently aggregated by SHARE, 58 are in the USA.

All in one place

That’s perfectly reasonable: they’re concentrating on their missions. But what the wider world needs is one repo: a single place where you can find any open-access paper, wherever it was written or published. A place where you can search on a Web user interface, or via an API, or just download all the data – and where all the data and services are free to use, redistribute and remix.

Introducing The One Repo

That’s what we’re working on at Index Data. We’re building The One Repo, with sponsorship from SPARC Europe and advice from some of the wisest heads in open access. We’re re-aggregating the aggregations from CORE, SHARE and others, and – crucially – filling in the long tail.

We’re mopping up all the repositories that slip through the cracks. Some of them, like SSRN (the Social Science Research Network), are big and important, but don’t get aggregated by existing services because they don’t provide a harvesting API.

But we can handle sites like this, because we have mature tools that let us screen-scrape services that are only available as user-facing Web-sites. It turns out that many of the known IRs lack APIs: for example, more than a thousand of the repositories registered with OpenDOAR do not support the OAI-PMH protocol. The long tail is long; but it’s hugely important.
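For readers unfamiliar with OAI-PMH, the sketch below shows roughly what harvesting looks like for a repository that does support it: a ListRecords request returns XML records (here in the oai_dc Dublin Core format), plus a resumptionToken when more pages remain. The endpoint URL, the sample record and the function names are hypothetical and for illustration only; this is not the harvester used by The One Repo.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # A follow-up request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract identifier, title and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text for c in rec.iterfind(".//dc:creator", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)


# A made-up response, standing in for what a real endpoint would return.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical paper about sauropods</dc:title>
          <dc:creator>Author, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repo.example.org/oai")
records, token = parse_records(SAMPLE)
```

A harvester would repeat the request with each returned token until none comes back. Repositories without such an endpoint offer nothing this regular, which is why they need per-site scraping instead.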

Equally important are the many small publishers: for example, scholarly societies that publish one or two journals in their fields. We want to pick all these up, as well as the big, well-known open-access publishers (BioMed Central, PLOS, and so on).

Developing the repository

Needless to say, this is a big job. We have most of the software infrastructure in place, but the much longer-term task of content acquisition stretches out ahead. We need to build harvesters for those thousand-plus repositories that have no APIs, and for numerous publishers. So we’re keeping an eye out for partner organisations that can sponsor the development of some of these harvesters. (Do drop me a line if you’re interested.)

Our commitment is to keep the service completely free – not just zero-cost, but making all the data fully open for any and every purpose. There are lots of cool things that we and our partners are looking forward to doing with it. But our fondest hope is that other people will find all sorts of new uses for the data that we’ve not thought of.
