Contribute to improve the OpenAIRE Research Graph

You can explore and test the beta release of the OpenAIRE Research Graph via the OpenAIRE BETA Explore Portal or via data dumps made available in Zenodo.

Help us make the graph ready for its first production release by providing your feedback by the end of December 2019.
Go to the OpenAIRE Research Graph Trello Board to report content quality issues, including missing metadata records, wrong values, mistakes in the detection of duplicates and anything else that looks "weird" or wrong.

Find complete information about the OpenAIRE Research Graph, how to test it, and how to help improve it on our blog.

OpenAIRE Research Graph Dumps

The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices in daily research activities. Conceived as a public and transparent good, populated from data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back into the hands of the scientific community.

Imagine a vast collection of research products all linked together, contextualised and openly available. For the past ten years OpenAIRE has been working to gather this valuable record. OpenAIRE is pleased to announce the beta release of its Research Graph, a massive collection of metadata and links between scientific products (articles, datasets, software, and other research products) and entities such as organisations, funders, funding streams, projects, communities, and data sources.

As of today, the OpenAIRE Research Graph aggregates around 450 million metadata records, with links, collected from 10,000 data sources trusted by scientists, including repositories registered in OpenDOAR, Open Access journals registered in DOAJ, Crossref, Unpaywall, ORCID and Microsoft Academic Graph. After cleaning, deduplication, and fine-grained classification, these narrow down to ~100 million publications, ~8 million datasets, ~200K software research products, and 8 million other products, linked together with semantic relations. More than 10 million full-texts of Open Access publications are mined by algorithms to enrich metadata records with additional properties and links among research products, funders, projects, communities, and organizations. Thanks to the mining algorithms, the graph is completed with 480 million semantic relations.

The OpenAIRE Research Graph is available via our BETA Explore Portal, and you can download it from Zenodo.

Get the dumps

The OpenAIRE Research Graph is exported as several dump files available on Zenodo (go to DOI), so you can download only the parts you are interested in.

  • publications: metadata records about research literature (includes types of publications listed here)
  • datasets: metadata records about research data (includes the subtypes listed here)
  • software: metadata records about research software (includes the subtypes listed here)
  • orps: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
  • organizations: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
  • content_providers: metadata records about providers whose content is available in the OpenAIRE Research Graph. They include institutional and thematic repositories, journals, aggregators, and funders' databases.
  • results_by_funder: metadata records about research results funded by a given funder. Each result includes information about its type (publications, datasets, software or other) and its specific sub-type (check the list of sub-types for publications, datasets, software, and other research products).

The up-to-date list of funders available on OpenAIRE BETA can be found here on the BETA Explore portal.

In the same Zenodo community you can also find the dumps of ScholeXplorer and DOIBoost.

The dumps contain XML records compliant with the OpenAIRE data model and with the oaf metadata format (the same format as the records exported via OAI-PMH).

Keep reading for instructions on how to consume the dumps.

Consume the dumps

Each dump is a gzipped JSON file with many lines. Each line has the form:
{"_id":{"$oid":"59b82504895be144859a9804"},"body":{"$binary":"base64(zip(XML_record))","$type":"00"}}
where the body field contains the base64 encoding of the compressed XML record.
To get the XML records you have to:
  1. Decompress (gunzip) the file
  2. Read each line and keep only the value of the $binary field
  3. Base64-decode that value
  4. Unzip the decoded bytes
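The steps above, applied to a single line of an already decompressed dump, can be sketched in Python as follows. This is a minimal sketch, not an official OpenAIRE tool: the function name decode_dump_line is hypothetical, and it assumes that the compressed payload is a regular zip archive (which is what the use of bsdtar suggests) and that the XML is UTF-8 encoded.

```python
import base64
import io
import json
import zipfile


def decode_dump_line(line: str) -> str:
    """Return the XML stored in one JSON line of a decompressed dump.

    Hypothetical helper: assumes the payload is a zip archive holding
    UTF-8 encoded XML.
    """
    document = json.loads(line)
    # Keep only the value of the $binary field inside "body".
    encoded = document["body"]["$binary"]
    # Base64-decode it to obtain the compressed bytes.
    compressed = base64.b64decode(encoded)
    # Unzip the bytes in memory and concatenate every archive entry
    # (typically there is a single entry holding the record).
    with zipfile.ZipFile(io.BytesIO(compressed)) as archive:
        parts = [archive.read(name) for name in archive.namelist()]
    return b"".join(parts).decode("utf-8")
```

Calling decode_dump_line on one line read from the decompressed file returns the XML record as a string, ready to be printed or passed to an XML parser.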
For example, to print the XMLs on the standard output you can run this command on macOS, Linux, and other Unix-like systems:
gunzip -c file.json.gz | jq '.body."$binary"' -r | while IFS= read -r line; do echo "$line" | base64 --decode | bsdtar -x -O ; done
where
  • file.json.gz is the name you gave to the downloaded file dump;
  • jq is a command to parse JSON files. It is not installed by default, but you can easily find it in official package repositories. Click here for installation instructions.
  • base64 and bsdtar are two command-line tools that are typically pre-installed.
Note that you should decide what to do with the resulting XML records (keep parsing them inline or store them somewhere). We suggest starting with a few records to test and decide what to do, by adding a head command after the gunzip, like:
gunzip -c file.json.gz | head -n 10 | jq '.body."$binary"' -r | while IFS= read -r line; do echo "$line" | base64 --decode | bsdtar -x -O ; done
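If jq or bsdtar are not available, the same pipeline can be approximated with the Python standard library alone. The sketch below streams a dump file and yields one XML string per record, with an optional limit that plays the role of the head command. The function name iter_xml_records and its limit parameter are hypothetical, and the code assumes that each payload is a zip archive containing UTF-8 encoded XML.

```python
import base64
import gzip
import io
import itertools
import json
import zipfile
from typing import Iterator, Optional


def iter_xml_records(path: str, limit: Optional[int] = None) -> Iterator[str]:
    """Yield XML records from a gzipped dump file.

    Hypothetical helper: `limit` caps the number of lines read, like
    `head -n`; None means read the whole file.
    """
    # gzip.open in text mode decompresses on the fly, line by line.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in itertools.islice(lines, limit):
            if not line.strip():
                continue  # skip blank lines
            # Extract and base64-decode the value of body.$binary.
            payload = base64.b64decode(json.loads(line)["body"]["$binary"])
            # Assumption: the payload is a zip archive; emit each entry.
            with zipfile.ZipFile(io.BytesIO(payload)) as archive:
                for name in archive.namelist():
                    yield archive.read(name).decode("utf-8")
```

For instance, looping over iter_xml_records("file.json.gz", limit=10) and printing each item shows the records from the first ten lines, after which you can decide whether to parse them inline or write them to disk. Because the file is read as a stream, the whole dump never has to fit in memory.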

Cite us

If you use the OpenAIRE Research Graph for research purposes, please cite it as:
Manghi, Paolo, Atzori, Claudio, Bardi, Alessia, Shirrwagen, Jochen, Dimitropoulos, Harry, La Bruzzo, Sandro, … Summan, Friedrich. (2019). OpenAIRE Research Graph Dump [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3516917
If you want to cite a specific version, please follow the suggestion on Zenodo. For the current version (1.0.0-beta), please use:
Manghi, Paolo, Atzori, Claudio, Bardi, Alessia, Shirrwagen, Jochen, Dimitropoulos, Harry, La Bruzzo, Sandro, … Summan, Friedrich. (2019). OpenAIRE Research Graph Dump (Version 1.0.0-beta) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3516918
The OpenAIRE Research Graph includes data from Microsoft Academic Graph (MAG): please also acknowledge MAG following this guideline.

License

The OpenAIRE Research Graph is released under a CC-BY license.

OpenAIRE is working to produce dumps that contain only metadata records that can be re-distributed under the CC0 license: stay tuned!

OpenAIRE
OpenAIRE-Advance receives funding from the European Union's Horizon 2020 Research and Innovation programme under Grant Agreement No. 777541.
   Unless otherwise indicated, all materials created by OpenAIRE are licenced under CC ATTRIBUTION 4.0 INTERNATIONAL LICENSE.