In pursuit of rich article-level metadata: 5 elements journal publishers should prioritize

In the quest to improve journal discoverability, “rich metadata” has become something of a Holy Grail. And, while not quite as abstract as the object of Arthurian lore, it has proven a pretty elusive target.

Today, most journal publishers are aware of the value of adding machine-readable metadata to research outputs, that is, data that provides information about digital objects and makes it possible for online databases and search engines to discover and draw connections between them. However, their visions of what it means to have “rich metadata” and those of other stakeholders can vary significantly.

As Jabin White, Vice President of Content Management for JSTOR and Portico, so clearly put it, “my idea of ‘rich metadata’ and your idea of ‘rich metadata’ are likely two different things.” For example, a publisher may have extensive article-level metadata to support search via its website. But if that metadata doesn’t adhere to shared industry norms, an external discovery service likely won’t be able to derive much value from it. That’s why initiatives aimed at understanding the metadata needs of different scholarly communication stakeholders, like Metadata 2020, and the development of metadata formatting conventions are so important.

To get closer to attaining coveted “rich” metadata and ultimately contribute to a “richer” scholarly communication ecosystem, journals first need to have machine-readable metadata that is clean, consistent, and as interoperable as possible. For publishers, that means collecting metadata in their journal submission forms (ideally via verified fields) and producing XML article-level metadata for deposit-based scholarly indexes and discovery services in the Journal Article Tag Suite (JATS) markup language developed by NISO. You can learn more about JATS here. From there, journals should add as many descriptive elements to their metadata as possible to support the needs of different research stakeholders. This blog post gives an overview of the five elements of richer metadata most commonly requested by stakeholders across disciplines, which publishers should prioritize once they’ve built out a foundation of basic metadata (e.g., journal title, article title, and author names).

A quick note before we dig in: you may not be able to add all of the below elements to your article-level metadata right away, and that’s OK. We recommend approaching journal metadata strategy with an Agile mindset: identify one or two low-hanging-fruit steps you can take to make your metadata better, complete them, and then identify the next best step to take. With each iteration, you’ll be expanding the discovery potential of your articles.

1. Persistent Identifiers (PIDs) beyond the DOI

We’re starting off with more of a metadata category than an element per se, but we think grouping Persistent Identifiers is helpful in this case. For those unfamiliar, Persistent Identifiers (aka PIDs) are long-lasting online references that can be applied to digital content (e.g., articles or datasets), people, or organizations. In addition to creating persistent labels for online content or entities, PIDs serve as persistent links to them. This is very important since the original links to information online change far more often than you’d think.

Most journal publishers are familiar with registering Digital Object Identifiers (DOIs) for their articles to ensure readers can always find the version of record even if the name, location, or some other aspect of the original content changes. But there are other PIDs that publishers should seek to adopt to have more consistent and robust metadata (and there’s even a whole festival dedicated to the breadth of PIDs out there!), including:

ORCID identifiers for primary and contributing authors (the more ORCIDs you can include in metadata the better)
ROR institutional identifiers
Grant and funder identifiers, such as those in The Open Funder Registry from Crossref

Including PIDs in article-level metadata enables stakeholder search engines and systems throughout the scholarly communication ecosystem to more easily find, ingest, exchange, and draw links between articles and other digital outputs associated with them (e.g., datasets). For example, a discovery tool could use funder metadata to link all articles funded by the same organization to help scholars quickly find content associated with that organization and see the types of research it tends to support. During the COVID-19 pandemic, discovery services have used ORCIDs to help publishers quickly find scholars with relevant research experience to support more rapid peer review of coronavirus-related submissions. The metadata linking potential of PIDs was a theme discussed during NISO Plus 2021. PIDs also support research usage and impact tracking.

Of the PIDs above, funder IDs are arguably becoming one of the most important for publishers if they plan to publish fully Open Access (OA) articles. Some OA funders like members of cOAlition S, the organizations behind Plan S, now require funding information to be included in the article-level metadata of the research they support.
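To make the PIDs above a little more concrete, here is a minimal Python sketch of how an ORCID iD, a ROR ID, and a funder ID might be written into JATS-style XML. The element and attribute names follow common JATS tagging practice as we understand it, and the function names and every identifier value are hypothetical placeholders, so treat this as an illustration to check against your own JATS version rather than a definitive template.

```python
import xml.etree.ElementTree as ET


def build_contrib(given_names, surname, orcid, institution, ror_id):
    """Build a JATS-style <contrib> with an ORCID iD and a ROR-identified affiliation."""
    contrib = ET.Element("contrib", {"contrib-type": "author"})
    # ORCID iDs are conventionally written as full https URLs.
    orcid_el = ET.SubElement(contrib, "contrib-id", {"contrib-id-type": "orcid"})
    orcid_el.text = f"https://orcid.org/{orcid}"
    name = ET.SubElement(contrib, "name")
    ET.SubElement(name, "surname").text = surname
    ET.SubElement(name, "given-names").text = given_names
    # The affiliation pairs the institution name with its ROR ID.
    wrap = ET.SubElement(ET.SubElement(contrib, "aff"), "institution-wrap")
    ror_el = ET.SubElement(wrap, "institution-id", {"institution-id-type": "ror"})
    ror_el.text = f"https://ror.org/{ror_id}"
    ET.SubElement(wrap, "institution").text = institution
    return contrib


def build_funding_group(funder_name, funder_doi, award_id):
    """Build a JATS-style <funding-group> with a funder ID and an award number."""
    group = ET.Element("funding-group")
    award = ET.SubElement(group, "award-group")
    wrap = ET.SubElement(ET.SubElement(award, "funding-source"), "institution-wrap")
    ET.SubElement(wrap, "institution").text = funder_name
    # Assumption: the funder ID is an Open Funder Registry DOI tagged with type "doi".
    funder_el = ET.SubElement(wrap, "institution-id", {"institution-id-type": "doi"})
    funder_el.text = funder_doi
    ET.SubElement(award, "award-id").text = award_id
    return group


# Every value below is a made-up placeholder, not a real identifier.
contrib_xml = ET.tostring(
    build_contrib("Jo", "Example", "0000-0000-0000-0000", "Example University", "00example00"),
    encoding="unicode",
)
funding_xml = ET.tostring(
    build_funding_group("Example Funder", "10.13039/000000000000", "ABC-123"),
    encoding="unicode",
)
print(contrib_xml)
print(funding_xml)
```

In practice these fragments would sit inside the article's front matter, and the identifier values would come from verified fields in your submission form rather than being typed by hand.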

2. Copyright licenses

Speaking of metadata necessary for journals to comply with new OA initiatives, copyright license information is another big one. For example, journals must include copyright licenses in their article-level metadata to be Plan S compliant.

Beyond fulfilling OA funder mandates, including copyright information in metadata can also make articles more discoverable, especially for OA journals. Many indexes and discovery services (including Google Scholar) support search and filtering by copyright license to make it easier for readers to find OA content. In recent years, Creative Commons even launched its own search engine to help people find content they know they can re-use under a CC license.

Since the copyright licenses applied to journal articles rarely change once set, this is a metadata element that publishers may be able to hard code into their HTML and/or XML article-level metadata, making it an “easy” metadata enrichment win.
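As a rough illustration of what "hard coding" a license could look like, the Python sketch below builds a JATS-style permissions block with a fixed Creative Commons URL. The function name, the default CC BY 4.0 choice, and the wording of the license paragraph are our own assumptions for the example; the tagging follows common JATS practice, so verify it against the JATS version you deposit.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ALI_NS = "http://www.niso.org/schemas/ali/1.0/"
ET.register_namespace("xlink", XLINK_NS)
ET.register_namespace("ali", ALI_NS)

# Assumption: the journal publishes everything under one license, so the URL is a constant.
JOURNAL_LICENSE_URL = "https://creativecommons.org/licenses/by/4.0/"


def build_permissions(holder, year, license_url=JOURNAL_LICENSE_URL):
    """Build a JATS-style <permissions> element with a machine-readable license URL."""
    permissions = ET.Element("permissions")
    ET.SubElement(permissions, "copyright-statement").text = f"© {year} {holder}"
    ET.SubElement(permissions, "copyright-year").text = str(year)
    ET.SubElement(permissions, "copyright-holder").text = holder
    # The license URL appears both as an xlink:href attribute and as ali:license_ref.
    license_el = ET.SubElement(permissions, "license", {f"{{{XLINK_NS}}}href": license_url})
    ET.SubElement(license_el, f"{{{ALI_NS}}}license_ref").text = license_url
    ET.SubElement(license_el, "license-p").text = (
        "This article is distributed under the terms of the license linked above."
    )
    return permissions


permissions_xml = ET.tostring(build_permissions("The Authors", 2021), encoding="unicode")
print(permissions_xml)
```

Because only the holder and year vary from article to article, a production system can fill this in automatically with no extra work from editors.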

3. Article Abstracts

Another way publishers can enrich their article-level metadata is by including open abstracts in it. Since the launch of the “Initiative for Open Abstracts” (I4OA), there has been mounting support from a range of scholarly communication stakeholders to make the abstracts of all research outputs openly available via trusted repositories in machine-readable formats. I4OA encourages publishers to submit open abstracts with all of their DOI registration metadata to Crossref whenever possible.

When article-level metadata includes open abstracts, it provides discovery services with rich descriptive information they can use to better “understand” the contents of the article, surface it, and return more detailed search results to potential readers. In this way, including open abstracts in machine-readable metadata can help maximize the visibility and impact of articles. It also opens up many possibilities for text and data mining. Discovery services, and even individual researchers, can develop automated text and data mining processes to analyze large amounts of information more efficiently and use it to inform research or create new discovery tools.

By depositing machine-readable abstracts into discovery services, journal publishers can help enable text and data mining of content and, in turn, the development of Machine Learning (ML) and Artificial Intelligence (AI) research tools on a wide scale. For example, when publishers submit open abstracts to Crossref, they become available via both its database search and open REST API, which discovery services and even individual researchers can use to access Crossref metadata directly for machine analysis and learning. Many proponents of I4OA believe making all research abstracts open is an achievable first step towards making all research more openly accessible in the future.
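For readers curious what accessing an open abstract through the Crossref REST API might look like, here is a minimal Python sketch. It assumes the public works endpoint at api.crossref.org and that a deposited abstract comes back as a JATS-flavored string in the "abstract" field of the response; the helper names and the sample payload are hypothetical.

```python
import json
import re
from urllib.parse import quote
from urllib.request import urlopen

API_ROOT = "https://api.crossref.org/works/"


def works_url(doi):
    """Return the REST API URL for a single DOI."""
    return API_ROOT + quote(doi, safe="/")


def extract_abstract(payload):
    """Return the abstract as plain text, or None if no abstract was deposited."""
    raw = payload.get("message", {}).get("abstract")
    if raw is None:
        return None
    # Abstracts arrive with JATS tags such as <jats:p>; strip them for text mining.
    text = re.sub(r"<[^>]+>", " ", raw)
    return " ".join(text.split())


def fetch_abstract(doi):
    """Fetch one record over the network and return its plain-text abstract."""
    with urlopen(works_url(doi)) as response:
        return extract_abstract(json.load(response))


# Offline example using a made-up payload shaped like an API response.
sample = {"message": {"abstract": "<jats:p>Open abstracts aid discovery.</jats:p>"}}
print(extract_abstract(sample))
```

A record with no deposited abstract simply lacks the field, which is why the helper returns None rather than an empty string.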

4. Open citations

As with open abstracts, including open citations in the machine-readable metadata of journal articles is another way to increase their discoverability, analysis, and linking potential. The need for and benefits of open citations are being championed by many of the same research stakeholders as those in I4OA via the sister “Initiative for Open Citations” (I4OC). Like I4OA, I4OC calls for publishers to include open machine-readable citations in the article-level metadata they deposit to discovery services such as Crossref.

As noted on the I4OC website, the number of scholarly publications is “estimated to double every nine years.” With this explosion of new content, the need for machine-readable citations has never been greater. They support the development of tools that help scholars more quickly identify and follow links between research and analyze the nature of citations (e.g., supporting, contradictory, or neutral).

To make journal citation information open, publishers must, of course, first add citations to their article-level metadata. From there, they need to indicate that those citations are open.

If you’re already depositing XML article-level metadata to Crossref, you can request that Crossref turn on “reference distribution” to make all of your citations open by default. This is another quick metadata enrichment “win” for many publishers since it doesn’t require any changes to the actual metadata code. Once article citation metadata is open in Crossref, it is available to any interested party through all of Crossref’s Metadata Delivery services, including the REST API.
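To show what open citation metadata can look like to someone consuming it, the Python sketch below pulls the cited DOIs out of a Crossref-style record. It assumes open references appear as a "reference" list in the REST API response, with a "DOI" key on entries that have been matched to a DOI; the function names and the sample payload are hypothetical.

```python
def cited_dois(payload):
    """Return the DOIs of all references that carry one, lowercased for matching."""
    references = payload.get("message", {}).get("reference", [])
    return [ref["DOI"].lower() for ref in references if "DOI" in ref]


def unmatched_references(payload):
    """Return the references that have no DOI and so cannot be linked automatically."""
    references = payload.get("message", {}).get("reference", [])
    return [ref for ref in references if "DOI" not in ref]


# Made-up payload shaped like an API response with open references.
sample = {
    "message": {
        "reference": [
            {"key": "ref1", "DOI": "10.5555/ABC123"},
            {"key": "ref2", "unstructured": "Example, J. (2020). A book with no DOI."},
        ]
    }
}
print(cited_dois(sample))
print(len(unmatched_references(sample)))
```

The references without DOIs are the ones worth a second look, since adding identifiers to them is what makes citation links followable by machines.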

5. Core HTML meta tags for Google Scholar

When it comes to article-level metadata, most journal publishers are hyper-focused on producing machine-readable XML, which is indeed important since it’s necessary for indexing in many scholarly discovery services. But there’s another type of metadata that publishers should also be prioritizing: meta tags in the HTML of their journal article web pages. Crawler-based search engines like Google and Google Scholar rely on the metadata in HTML meta tags to find and interpret digital content.

All of the above metadata elements that we’ve discussed can be added to HTML article-level metadata to enrich it. If you’re just getting started adding HTML meta tags to your article web pages, we can’t stress enough how important it is to host each of your articles on its own web page, whether you publish via HTML or PDF, because it is a Google Scholar inclusion requirement. From there, there are three core HTML metadata elements you absolutely need for Google Scholar indexing:

The title of the article
The full name of at least the primary author
The year of publication

If an article webpage is missing any of the above HTML meta tag fields, even if it has other HTML metadata, Google Scholar will process it as if it has NO meta tags at all. So to show up in Google Scholar searches, you need to at least have these core elements in the HTML meta tags for all of your articles.
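Here is a small Python sketch of how the three core tags could be generated for an article page. It uses the Highwire Press-style tag names (citation_title, citation_author, citation_publication_date) that Google Scholar's inclusion guidelines describe; the function name and the sample values are hypothetical, and the function refuses to emit anything if a core field is missing.

```python
from html import escape


def scholar_meta_tags(title, authors, publication_date):
    """Return HTML meta tags for the three core fields, one author per tag."""
    if not title or not authors or not publication_date:
        raise ValueError("title, at least one author, and a publication date are required")
    pairs = [("citation_title", title)]
    pairs += [("citation_author", author) for author in authors]
    pairs.append(("citation_publication_date", publication_date))
    # Escape quotes and angle brackets so titles cannot break the markup.
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in pairs
    )


# Placeholder values for illustration only.
tags = scholar_meta_tags('A "sample" article', ["Jo Example", "Sam Placeholder"], "2021")
print(tags)
```

The output belongs in the head section of each article's own web page, alongside any richer tags you add later.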

Putting it all together

As noted at the start of this post, don’t be discouraged if you can’t add all of the above elements to your article-level metadata. Remember, any metadata enrichment step you take, no matter how big or small, will make a difference in your article discoverability. We recommend taking an iterative approach to metadata enrichment and, above all, focusing on quality over quantity. So be sure to first establish a means of producing clean JATS XML metadata and HTML meta tags for all of your articles and then begin layering on new metadata elements from there.

You don’t have to go on the metadata enrichment journey alone either. Service providers can help you produce machine-readable metadata for all of your articles and enrich it. For example, Scholastica automatically generates machine-readable JATS XML metadata for all articles typeset by our digital-first production service, including automated citation metadata enrichment via machine learning. We also automatically produce machine-readable HTML metadata for all articles published using our Open Access journal hosting platform. And when journals use Scholastica’s peer review management system, they can automatically apply metadata collected during peer review to articles submitted to our production service and/or published via Scholastica’s OA journal hosting platform to save even more time. You can learn more about how Scholastica is helping journals produce richer machine-readable metadata in this blog post.

Also, be sure to use the metadata checking resources available to you. For example, if you’re registering DOIs for articles via Crossref, you can use the Crossref Participation Reports tool to quickly see which elements your metadata includes and which are missing.
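If you want a quick local check in the same spirit, the Python sketch below computes the share of records that include each of the elements discussed in this post. It is not how Participation Reports works internally; it simply assumes records shaped like Crossref REST API items (with "abstract", "license", "funder", "reference", and per-author "ORCID" fields), and the function names and sample records are hypothetical.

```python
def has_orcid(record):
    """True if at least one author in the record carries an ORCID iD."""
    return any("ORCID" in author for author in record.get("author", []))


# Each check reports whether one enrichment element is present in a record.
CHECKS = {
    "abstracts": lambda record: bool(record.get("abstract")),
    "licenses": lambda record: bool(record.get("license")),
    "funders": lambda record: bool(record.get("funder")),
    "references": lambda record: bool(record.get("reference")),
    "orcids": has_orcid,
}


def coverage(records):
    """Return the fraction of records (0.0 to 1.0) that include each element."""
    total = len(records)
    if total == 0:
        return {name: 0.0 for name in CHECKS}
    return {
        name: sum(1 for record in records if check(record)) / total
        for name, check in CHECKS.items()
    }


# Two made-up records: one enriched, one with basic metadata only.
records = [
    {
        "abstract": "<jats:p>Example.</jats:p>",
        "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
        "author": [{"family": "Example", "ORCID": "https://orcid.org/0000-0000-0000-0000"}],
    },
    {"author": [{"family": "Placeholder"}]},
]
report = coverage(records)
for element, share in report.items():
    print(f"{element}: {share:.0%}")
```

Running something like this after each enrichment step makes it easy to see which element to tackle next.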

We hope these metadata enrichment tips have been helpful! If you have any questions or additional suggestions, please feel free to post them in the comments section!
