Help improve the OpenAIRE Research Graph
Help us make the graph ready for its first production release by providing your feedback by the end of December 2019.
Go to the OpenAIRE Research Graph Trello Board to report content quality issues, including missing metadata records, wrong values, mistakes in the detection of duplicates and anything else that looks "weird" or wrong.
Our blog has complete information about the OpenAIRE Research Graph, including how to test it and how to help improve it.
OpenAIRE Research Graph Dumps
The OpenAIRE Research Graph is one of the largest open scholarly record collections worldwide, key in fostering Open Science and establishing its practices in daily research activities. Conceived as a public and transparent good, populated from data sources trusted by scientists, the Graph aims at bringing discovery, monitoring, and assessment of science back into the hands of the scientific community.
Imagine a vast collection of research products all linked together, contextualised and openly available. For the past ten years OpenAIRE has been working to gather this valuable record. OpenAIRE is pleased to announce the beta release of its Research Graph, a massive collection of metadata and links between scientific products such as articles, datasets, software, and other research products, and entities such as organizations, funders, funding streams, projects, communities, and data sources.
As of today, the OpenAIRE Research Graph aggregates around 450 million metadata records with links, collected from 10,000 data sources trusted by scientists, including repositories registered in OpenDOAR, Open Access journals registered in DOAJ, Crossref, Unpaywall, ORCID and Microsoft Academic Graph. After cleaning, deduplication, and fine-grained classification, these narrow down to ~100 million publications, ~8 million datasets, ~200K software research products, and 8 million other products, linked together with semantic relations. More than 10 million full texts of Open Access publications are mined by algorithms to enrich metadata records with additional properties and links among research products, funders, projects, communities, and organizations. Thanks to the mining algorithms, the graph is completed with 480 million semantic relations.
Get the dumps
- publications: metadata records about research literature (includes types of publications listed here)
- datasets: metadata records about research data (includes the subtypes listed here)
- software: metadata records about research software (includes the subtypes listed here)
- orps: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
- organizations: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, and funders.
- content_providers: metadata records about providers whose content is available in the OpenAIRE Research Graph. They include institutional and thematic repositories, journals, aggregators, and funders' databases.
- results_by_funder: metadata records about research results funded by a given funder. Each result includes information about its type (publications, datasets, software or other) and its specific sub-type (check the list of sub-types for publications, datasets, software, and other research products).
The up-to-date list of funders available on OpenAIRE BETA can be found here on the BETA Explore portal.
In the same Zenodo community you can also find the dumps of ScholeXplorer and DOIBoost.
The dumps contain XML records compliant with the OpenAIRE data model and with the oaf metadata format (the same format as the records exported via OAI-PMH):
- See the description of the OpenAIRE data model
- See the oaf XML schema
- See the oaf XML schema documentation (generated via Oxygen XML Editor)
Keep reading for instructions on how to consume the dumps.
Consume the dumps
Each line of a dump file is a JSON record whose body field contains the base64 encoding of the compressed XML record.
In order to get the XMLs you have to:
- Unzip the file
- Get only the value of the body field
- Read each line and base64 decode it
- Unzip the decoded string
gunzip -c file.json.gz | jq '.body."$binary"' -r | while IFS= read -r line; do echo "$line" | base64 --decode | bsdtar -x -O ; done
- file.json.gz is the name you gave to the downloaded dump file;
- jq is a command to parse JSON files. It is not installed by default, but you can easily find it in the official repositories. Click here for installation instructions.
- base64 and bsdtar are two tools that are typically pre-installed.
To try the command on a sample, for example the first 10 records, add the head command after the gunzip command:
gunzip -c file.json.gz | head -n 10 | jq '.body."$binary"' -r | while IFS= read -r line; do echo "$line" | base64 --decode | bsdtar -x -O ; done
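For readers who prefer not to install jq or bsdtar, the same steps can be sketched in Python using only the standard library. This is a minimal sketch, not an official OpenAIRE tool: the function names iter_xml_records and decompress_body are hypothetical, and the compression container of each body is an assumption (the code checks for a ZIP archive first, then gzip, and otherwise tries raw zlib). It also assumes, as the jq filter above suggests, that each line is a JSON object whose body."$binary" value is a base64 string.

```python
import base64
import gzip
import io
import json
import zipfile
import zlib
from itertools import islice
from typing import Iterator, Optional


def decompress_body(raw: bytes) -> str:
    """Decompress one base64-decoded body, guessing the container from its magic bytes."""
    if raw[:4] == b"PK\x03\x04":
        # Assumption: a ZIP archive holding a single XML member.
        with zipfile.ZipFile(io.BytesIO(raw)) as archive:
            return archive.read(archive.namelist()[0]).decode("utf-8")
    if raw[:2] == b"\x1f\x8b":
        # gzip stream
        return gzip.decompress(raw).decode("utf-8")
    # Assumption: anything else is a raw zlib stream.
    return zlib.decompress(raw).decode("utf-8")


def iter_xml_records(dump_path: str, limit: Optional[int] = None) -> Iterator[str]:
    """Yield the XML record stored in each line of a gzipped dump file.

    limit mirrors `head -n`: only the first `limit` lines are read.
    """
    with gzip.open(dump_path, "rt", encoding="utf-8") as lines:
        for line in islice(lines, limit):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            encoded = record["body"]["$binary"]
            yield decompress_body(base64.b64decode(encoded))
```

Calling iter_xml_records("file.json.gz", limit=10) and printing each yielded string corresponds to the head -n 10 variant of the command; omitting limit processes the whole file. Because the function is a generator, records are decoded one at a time rather than loaded into memory all at once.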
If you use the OpenAIRE Research Graph for research purposes, please cite it as:
Manghi, Paolo, Atzori, Claudio, Bardi, Alessia, Shirrwagen, Jochen, Dimitropoulos, Harry, La Bruzzo, Sandro, … Summan, Friedrich. (2019). OpenAIRE Research Graph Dump [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3516917
If you want to cite a specific version, please follow the suggestion on Zenodo. For the current version (1.0.0-beta), please use: Manghi, Paolo, Atzori, Claudio, Bardi, Alessia, Shirrwagen, Jochen, Dimitropoulos, Harry, La Bruzzo, Sandro, … Summan, Friedrich. (2019). OpenAIRE Research Graph Dump (Version 1.0.0-beta) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3516918
The OpenAIRE Research Graph includes data from Microsoft Academic Graph (MAG); please also acknowledge MAG, following this guideline.
The OpenAIRE Research Graph is released under a CC-BY license.
OpenAIRE is working to produce dumps that contain only metadata records that can be redistributed under the CC0 license: stay tuned!