What if you had to drive cross-country to somewhere you’d never been before, using a map made up of only landmarks (no highway systems or street names), on roads with varying or nonexistent traffic rules? Would you be able to reach your destination? Odds are yes, eventually, but it would likely take a while, and you probably wouldn’t end up using the most direct route.
The idea of a world without consistent traffic laws, modern-day road maps, or (gasp!) GPS is hard to imagine. But, in reality, it’s not very far off. For example, in the US, the interstate highway system is only 65 years old. And the earliest generalized road maps and state-wide driving laws date back less than 120 years — which is arguably surprising given the breadth of stagecoach travel and traffic accidents.
In many ways, navigating the digital scholarly communication landscape is like the early days of driving — the routes and “rules of the road” are still being figured out. Only, instead of using automobiles to get from point A to point B, discovery services are the vehicles, and the “roads” are data pathways that machines can travel by. Now, the challenge is building out and maintaining those data pathways, codifying standards to connect them, and developing norms for moving research information between digital tools and systems to support discovery, assessment, and reuse.
What steps should scholarly publishers be taking to promote better routes and “rules of the road” for research travel, and what are the possibilities? Below are some highlights from the 2021 NISO Plus conference.
PIDs to support (meta)data linking
To create routes between research outputs, giving discovery services machine-readable metadata they can use to find and draw links between content is essential — and the more standardized, the better.
Most publishers are familiar with registering Digital Object Identifiers (DOIs), a type of Persistent Identifier (PID), to create lasting records for online research outputs. Registering DOIs for journal articles and other scholarly content and adding DOI links to references when possible is one of the best steps publishers can take to support research linking and discovery. But publishers shouldn’t stop at creating DOIs for articles. There are many other PIDs to consider adding to article-level metadata to support research discovery, assessment, and reuse. Additional PIDs can also expand the potential reach of content when included in metadata registered with discovery services like Crossref.
During the NISO Plus session “Linked Data and the Future of Information Sharing,” Christian Herzog, CEO of Dimensions, and Shelley Stall, Senior Director of Data Leadership at the American Geophysical Union, discussed emerging PIDs that link research outputs not only through the content they reference but also through the scholars, institutions, and funders associated with them. Among the PIDs they said all publishers should consider adding to their metadata are:
- ORCID identifiers for authors and their history of research contributions
- Institutional IDs such as those developed by GRID, which provided the seed data set for the community-led Research Organization Registry (ROR)
- Grant IDs and funder IDs, such as those in The Open Funder Registry
Speaking to an example Dimensions record, Herzog explained how adding the few elements above to research metadata can open up innumerable content linking possibilities: “taking only person relationships into account, citing relationships between documents, affiliation resolution, or who funded the activities represented in this document, as you can see, we already have 2.2 billion of the most basic relationships in the database.” He added, “the sky’s the limit” for the possible applications of data linking — from using funder IDs to find APC subsidy options for authors to using author IDs to find relevant peer reviewers for articles.
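To make the linking idea concrete, here is a minimal sketch of how shared PIDs can turn separate article records into relationships. It is not how Dimensions builds its database; the record layout, the field names, and every identifier value below are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical article records. All identifiers are made-up placeholders.
records = [
    {"doi": "10.1234/a", "orcids": ["ORCID-1", "ORCID-2"], "ror": "ROR-1", "funder": "FUNDER-1"},
    {"doi": "10.1234/b", "orcids": ["ORCID-2"], "ror": "ROR-2", "funder": "FUNDER-1"},
    {"doi": "10.1234/c", "orcids": ["ORCID-3"], "ror": "ROR-2", "funder": None},
]


def link_by_shared_pids(records):
    """Map each pair of DOIs to the set of (relationship, PID) values they share."""
    # Index every PID to the set of articles that carry it.
    index = defaultdict(set)
    for rec in records:
        for orcid in rec.get("orcids", []):
            index[("author", orcid)].add(rec["doi"])
        if rec.get("ror"):
            index[("institution", rec["ror"])].add(rec["doi"])
        if rec.get("funder"):
            index[("funder", rec["funder"])].add(rec["doi"])

    # Any two articles that share a PID are linked by that PID.
    links = defaultdict(set)
    for key, dois in index.items():
        for a, b in combinations(sorted(dois), 2):
            links[(a, b)].add(key)
    return dict(links)


for pair, shared in sorted(link_by_shared_pids(records).items()):
    print(pair, sorted(shared))
```

Even in this toy set, three records with a handful of identifiers yield author, funder, and institution links, and records with no PIDs in common stay unconnected.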
Building off of Herzog’s presentation, Stall discussed the importance of applying PIDs to underlying research datasets in addition to journal articles. She emphasized that journals should encourage scholars to cite research data used in papers to (harkening back to our intro analogy) build out more robust “roads” for research discovery, reuse, and assessment. Speaking to the importance of using ORCIDs to track authorship of not only articles but also data sets, she added, “it’s really important to realize that authors of a paper are not necessarily the creators of the data sets the paper uses.”
Stall acknowledged that one of the main challenges to supporting data linking for publishers is producing consistent data citations. She encouraged publishers to look at how they are validating citations to ensure their machine-readable metadata includes accurate PIDs and to provide authors with examples of properly formatted data references.
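As one illustration of what validating data citations could involve, the sketch below checks a single reference for a dataset type, a title, and a well-formed DOI. The reference schema (`type`, `title`, `doi`) is an assumption made for this example, and the DOI pattern is only a syntax heuristic: it cannot confirm that a DOI is actually registered or resolves.

```python
import re

# Heuristic DOI syntax check: "10.", a 4 to 9 digit registrant code, "/", a suffix.
# This is an assumption for illustration, not a registration lookup.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(value):
    """Strip common URL or 'doi:' prefixes and lowercase (DOIs are case-insensitive)."""
    value = value.strip()
    value = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", value, flags=re.IGNORECASE)
    return value.lower()


def check_data_citation(ref):
    """Return a list of problems found in one reference dict (hypothetical schema)."""
    problems = []
    if ref.get("type") != "dataset":
        problems.append("reference is not typed as a dataset")
    if not ref.get("title"):
        problems.append("missing dataset title")
    doi = normalize_doi(ref.get("doi") or "")
    if not doi:
        problems.append("missing DOI")
    elif not DOI_PATTERN.match(doi):
        problems.append("DOI is not well formed")
    return problems


# Placeholder reference, not a real dataset.
print(check_data_citation({"type": "dataset", "title": "", "doi": "doi:11.1234/x"}))
```

A check like this could run at submission or during production so that malformed data references are caught before metadata is deposited.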
FAIR data principles to promote new pathways for research discovery and reuse
Producing standardized machine-readable metadata and encouraging data citations are ways publishers can help create new routes to find and connect research. But, to ensure they don’t lead to dead-end data “roads,” there’s a need to go a step further — or, as some would say, GO FAIR. By that, we mean adopting the FAIR data principles to make research metadata and data Findable, Accessible, Interoperable, and Reusable.
During the NISO Plus session “FAIR data principles and why they matter,” Brian Cody, Scholastica’s CEO and Co-founder, Stephen Howe, Senior Product Manager at Copyright Clearance Center, and Paul Stokes, Product Manager at Jisc, discussed opportunities and challenges surrounding publishers adopting the FAIR data principles.
Kicking off the session, Cody first walked through some “un-FAIR” data examples, including articles without data citations, data citation links going to dead ends, and discovery services being unable to decipher unstandardized article-level metadata. Speaking to concrete steps publishers can take to make their (meta)data FAIR(er), he outlined the following recommendations:
- Find an appropriate FAIR data repository and recommend that authors upload data sets to that repository in journal guidelines
- Have authors fill out a data availability statement as part of their manuscript submission
- Link to data sets in article citations and include publication types, appropriate PIDs, and titles for cited datasets in article-level metadata
- Collect ORCIDs from all authors to include in article-level metadata
- Join GO FAIR groups around different data conventions by discipline
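Two of the recommendations above, data availability statements and ORCIDs for all authors, lend themselves to simple automated checks. The sketch below is one possible approach, not a tool described in the session. The submission schema and the `submission_gaps` helper are hypothetical; the check digit calculation follows the ISO 7064 MOD 11-2 scheme that ORCID iDs use.

```python
import re


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the 0000-0000-0000-000X layout and the final check character."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]", orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_digit(compact[:15]) == compact[15]


def submission_gaps(submission):
    """List missing FAIR-related items in a submission dict (hypothetical schema)."""
    gaps = []
    if not submission.get("data_availability_statement"):
        gaps.append("no data availability statement")
    for author in submission.get("authors", []):
        if not is_valid_orcid(author.get("orcid") or ""):
            gaps.append(f"missing or invalid ORCID for {author.get('name', 'unknown author')}")
    return gaps


# 0000-0002-1825-0097 is ORCID's documented sample iD for the fictional Josiah Carberry.
example = {
    "data_availability_statement": "",
    "authors": [
        {"name": "Josiah Carberry", "orcid": "0000-0002-1825-0097"},
        {"name": "Second Author", "orcid": None},
    ],
}
print(submission_gaps(example))
```

A passing checksum only shows the iD is well formed. Confirming that it belongs to the named author still requires authenticated collection through ORCID itself.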
Howe and Stokes next spoke to the challenges and opportunities of FAIR data, beginning with a presentation by Howe on the types of projects FAIR data can support. He used the COVID Author Graph, a knowledge graph that CCC created showing connections between researchers working on coronavirus-related topics, as an example. The Author Graph was designed to help publishers find qualified peer reviewers for COVID-19 papers more quickly, demonstrating one of the many potential benefits of data linking to expedite research discovery and even publishing processes.
Howe said the primary challenge to building tools like the COVID Author Graph is finding the FAIR data needed to reliably extract author information and relationships. “For example, less than 10% of the authors in the raw data, and I’m talking about over 800,000 author instances, had an ORCID […] and then ISNI and GRID were even less prevalent, with sometimes as little as 10 records. A standard identifier is only good if it is used.” In this way, moving beyond developing data sharing standards to enforcing, or at least encouraging, proper formatting could be akin to developing data traffic rules.
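Publishers curious about how their own metadata compares could measure identifier coverage directly. The following is a small sketch under assumed field names (`orcid`, `isni`, `grid`); the sample author instances are invented and do not reproduce the CCC data.

```python
def identifier_coverage(author_instances, fields=("orcid", "isni", "grid")):
    """Return the share of author instances with a non-empty value for each field."""
    total = len(author_instances)
    coverage = {}
    for field in fields:
        present = sum(1 for author in author_instances if author.get(field))
        coverage[field] = present / total if total else 0.0
    return coverage


# Invented sample data for illustration only.
sample = [
    {"name": "Author A", "orcid": "ORCID-1", "isni": None, "grid": None},
    {"name": "Author B", "orcid": None, "isni": None, "grid": "GRID-1"},
    {"name": "Author C", "orcid": None, "isni": None, "grid": None},
    {"name": "Author D", "orcid": None, "isni": None, "grid": None},
]

for field, share in identifier_coverage(sample).items():
    print(f"{field}: {share:.0%}")
```

Tracking a number like this over time is one way to see whether author guidelines and submission requirements are actually increasing identifier use.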
Playing the devil’s advocate, Stokes followed up with a presentation on why he believes “FAIR is not good, or rather not good enough.” Stokes argued that there are flaws in the concept of FAIR data that stakeholders need to address to ensure data is accurate and sustainable, both in terms of organizational and world resources, given the high monetary and environmental costs of data storage. He proposed an extension to the FAIR principles: “FAIREST — Findable, Accessible, Interoperable, Reusable, Environmentally friendly, Sustainable, and Trustworthy.”
Despite the challenges of implementing FAIR data principles, the session made clear that to support research discovery and assessment, improving the findability, accessibility, and interoperability of available research data is essential. Publishers can begin by looking to norms developing within their disciplines to determine best practices for adopting data guidelines like those outlined in the FAIR principles and continually improving their data sharing requirements and citation practices over time.
With (meta)data standards, the possible routes are endless
The ability for machines to draw connections between research is virtually boundless, as exhibited by tools like Dimensions and projects like the CCC COVID Author Graph. During the “Linked Data and the Future of Information Sharing” session, Stall emphasized how (meta)data production and sharing standards can broaden the potential for data route creation and mapping beyond the development of major discovery tools or projects. She discussed the “150 years of Nature” data graphic as an example. “Wouldn’t it be interesting if we took that structure and made it even more connected? What if we were able to, instead of building it for just 150 year celebrations, have that kind of structure immediately? We could see the connected data sets, the connected software, the models, if there was a clinical trial, et cetera. And we could really take this further.”
When it comes to the widespread implementation of publisher (meta)data standards, there’s still a ways to go to answer that nagging question, “are we there yet?” But as publishers take steps like supporting data linking via PIDs and furthering the FAIR data principles, there’s no doubt we’re getting closer.