Image Credit: Franck V. on Unsplash

Web-based abstracting and indexing databases (A&Is) are one of the top means that scholars use to conduct research. These virtual treasure troves of information seem to “understand” the content that they house and can respond to search queries in a matter of seconds.

Of course, A&Is can’t “read” human text (at least not yet!). They process content using information available in machine-readable markup languages or computer code. Journal publishers that want their articles to show up in relevant A&Is must submit article information to them in machine-readable formats.

If you only publish journal articles in human-readable formats, like PDFs, you’re likely missing out on valuable indexing opportunities. Let’s dig deeper to explore:

  1. How indexes process information
  2. Ways to produce machine-readable article files and submit them to indexes
  3. JATS XML - the standard indexing format

Finally, we’ll put it all together and look at how you can start producing machine-readable articles!

Indexes ingest information in machine-readable formats

Let’s take a walk in the proverbial shoes of an academic index, shall we? Indexes are hungry for knowledge! But they can only ingest information given to them in machine-readable formats.

There are two ways to feed hungry indexes:

  1. Manually entering article metadata into index deposit forms
  2. Submitting machine-readable article files to indexes

If you don’t produce machine-readable article files, manual data entry is your only option. In this case, the form acts as a conduit to convert the article data you enter into machine-readable metadata that the index can understand.

From the outset, the manual approach is limited, as not all indexes offer the option of manual data entry. Many indexes, like MEDLINE, will only accept articles submitted as XML files (more on that later). When indexes do allow for manual data entry, it’s a tedious process for publishers. And, even if publishers can carve out time and resources for this level of manual work, manually entered data is often inadequate. Indexes require rich metadata to process articles in a meaningful way.

The second option, depositing machine-readable article files into indexes, is better for publishers and A&Is. First, it’s a lot faster for publishers because it eliminates the need for manual data entry. Indexes can ingest and “understand” machine-readable article files as they are. Machine-readable article files also result in higher-quality indexing because they contain rich metadata.

Extensible Markup Language (XML) is the standard markup language used by academic journal indexes. Let’s take a look at the options for producing machine-readable article files and depositing them into A&Is.

Ways to produce machine-readable article files and submit them to indexes

There are two types of machine-readable files that indexes use to process article information: front-matter XML article files and full-text XML article files. It’s safe to say that all indexes will require front-matter article files. Depending on the index you’re applying to, you may also need to produce full-text article files. Let’s take a look at both types of files and what they should include.

Front-matter XML files contain the front matter of the article - but do not include the article’s actual body text. Core front matter metadata includes:

  • Journal title
  • Publisher name
  • Article title
  • Authors’ names
  • Article abstract

Front-matter XML files can also include other rich metadata such as authors’ ORCIDs.
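To make the list above concrete, here is a minimal sketch of a front-matter file built with Python’s standard library. The element names follow the JATS tag set discussed later in this post, but the function name, the sample values, and the placeholder ORCID are all made up for illustration. A real deposit would need more fields (identifiers, publication dates, and so on), and this output is not validated against the JATS DTD.

```python
import xml.etree.ElementTree as ET


def build_front_matter(journal_title, publisher, article_title, authors, abstract):
    """Return a JATS-style <article> element that holds front matter only.

    `authors` is a list of (surname, given_names, orcid_or_None) tuples.
    This helper is hypothetical and covers just the core fields listed above.
    """
    article = ET.Element("article", {"article-type": "research-article"})
    front = ET.SubElement(article, "front")

    # Journal-level metadata: journal title and publisher name.
    journal_meta = ET.SubElement(front, "journal-meta")
    title_group = ET.SubElement(journal_meta, "journal-title-group")
    ET.SubElement(title_group, "journal-title").text = journal_title
    publisher_el = ET.SubElement(journal_meta, "publisher")
    ET.SubElement(publisher_el, "publisher-name").text = publisher

    # Article-level metadata: title, authors (with optional ORCID), abstract.
    article_meta = ET.SubElement(front, "article-meta")
    article_titles = ET.SubElement(article_meta, "title-group")
    ET.SubElement(article_titles, "article-title").text = article_title

    contrib_group = ET.SubElement(article_meta, "contrib-group")
    for surname, given_names, orcid in authors:
        contrib = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
        if orcid:
            contrib_id = ET.SubElement(contrib, "contrib-id", {"contrib-id-type": "orcid"})
            contrib_id.text = orcid
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given_names

    abstract_el = ET.SubElement(article_meta, "abstract")
    ET.SubElement(abstract_el, "p").text = abstract
    return article


# Sample values are placeholders, not real publication data.
article = build_front_matter(
    journal_title="Journal of Examples",
    publisher="Example Press",
    article_title="A Sample Article",
    authors=[("Doe", "Jane", "https://orcid.org/0000-0000-0000-0000")],
    abstract="A one-sentence abstract.",
)
print(ET.tostring(article, encoding="unicode"))
```

Notice that the file has a front element but no body element: that is what makes it a front-matter file rather than a full-text one.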

As the name suggests, full-text XML article files contain the complete article text in machine-readable language. Both of these formats are superior to manual data entry, but full-text XML is the most robust option because it also allows for text and data mining.
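Here is a rough illustration of why full-text XML lends itself to text and data mining. Because the body text is tagged, a program can pull out every paragraph and count terms without guessing at the layout. The sample article, the function names, and the very simple word-splitting rule are all assumptions for this sketch; real mining pipelines are considerably more involved.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A tiny made-up full-text article using JATS-style element names.
SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group><article-title>Soil moisture and crop yield</article-title></title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Methods</title>
      <p>We measured soil moisture at <italic>twelve</italic> sites.</p>
    </sec>
    <sec>
      <title>Results</title>
      <p>Yield rose with soil moisture at most sites.</p>
    </sec>
  </body>
</article>"""


def body_paragraphs(xml_text):
    """Return the plain text of every <p> inside <body> (empty if no body)."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    if body is None:
        return []  # a front-matter-only file has nothing to mine
    # itertext() flattens inline markup such as <italic> into plain text.
    return ["".join(p.itertext()).strip() for p in body.iter("p")]


def term_counts(paragraphs):
    """Count lowercase words after trimming basic punctuation."""
    words = []
    for text in paragraphs:
        words.extend(w.strip(".,;:()").lower() for w in text.split())
    return Counter(w for w in words if w)


paragraphs = body_paragraphs(SAMPLE)
print(term_counts(paragraphs).most_common(3))
```

A front-matter file passed to the same function returns an empty list, which is the practical difference between the two file types from a mining point of view.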

When publishers are ready to deposit either front-matter or full-text XML files into indexes, they can usually do so in one of two ways: uploading article files to indexes in batches (usually via an FTP server) or setting up automatic article deposits via an API content deposit feed. API stands for “Application Programming Interface” and is essentially a channel that different software applications can use to communicate with each other.
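For a sense of what a batch upload might look like, here is a hedged sketch using Python’s built-in zipfile and ftplib modules. The function names, file names, host, and credentials are placeholders. Each index publishes its own deposit instructions (packaging rules, naming conventions, and often a secure variant of FTP), so treat this as an outline of the idea rather than any particular index’s procedure.

```python
import io
import zipfile
from ftplib import FTP


def bundle_articles(xml_by_filename):
    """Zip a batch of XML files in memory and return the buffer, rewound."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for filename, xml_text in sorted(xml_by_filename.items()):
            archive.writestr(filename, xml_text)
    buffer.seek(0)  # rewind so the upload reads from the start
    return buffer


def upload_batch(ftp, remote_name, bundle):
    """Send the bundle over an already-connected ftplib.FTP-style object."""
    return ftp.storbinary(f"STOR {remote_name}", bundle)


def deposit(host, user, password, remote_name, xml_by_filename):
    """Connect, log in, and upload one batch.

    The host and credentials are assumptions: the index would supply them.
    """
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        return upload_batch(ftp, remote_name, bundle_articles(xml_by_filename))
```

Keeping the bundling step separate from the connection step means the packaging can be checked locally before anything is sent to an index.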

JATS XML - the standard indexing format

In conversations and documentation regarding indexing, you’ve likely come across the term “JATS” at some point, and you may be wondering what it means. JATS stands for “Journal Article Tag Suite.” Whereas XML is a general-purpose markup language, JATS is an XML vocabulary: a defined set of tags (elements and attributes) for describing journal articles. It is maintained as a standard (ANSI/NISO Z39.96) by the National Information Standards Organization (NISO). JATS is considered the technical standard for journal articles and is preferred or required by many academic indexes, including all National Library of Medicine (NLM) indexes - PubMed, PubMed Central, and MEDLINE.

Formatting articles in JATS XML is a best practice and will enable you to add journal articles to indexes more quickly and easily.
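Because JATS uses fixed tag names, a short script can check whether a file contains the core front-matter fields before it is deposited. The sketch below is only a pre-flight check under stated assumptions: the field list mirrors the core metadata named earlier in this post, the function name is made up, and it does not replace validating against the official JATS DTD or an index’s own validator.

```python
import xml.etree.ElementTree as ET

# Paths use JATS element names for the core front-matter fields.
CORE_FIELDS = {
    "journal title": "front/journal-meta/journal-title-group/journal-title",
    "publisher name": "front/journal-meta/publisher/publisher-name",
    "article title": "front/article-meta/title-group/article-title",
    "author name": "front/article-meta/contrib-group/contrib/name/surname",
    "abstract": "front/article-meta/abstract",
}


def missing_core_fields(xml_text):
    """Return labels of core fields that are absent or empty in the file."""
    root = ET.fromstring(xml_text)
    missing = []
    for label, path in CORE_FIELDS.items():
        element = root.find(path)
        # Treat an element with no text anywhere inside it as missing.
        if element is None or not "".join(element.itertext()).strip():
            missing.append(label)
    return missing


# A deliberately incomplete, made-up file: only the article title is present.
incomplete = """<article>
  <front>
    <article-meta>
      <title-group><article-title>A Sample Article</article-title></title-group>
    </article-meta>
  </front>
</article>"""
print(missing_core_fields(incomplete))
```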

Putting it all together and getting started

Now that you know why producing journal articles in machine-readable formats is better for abstracting and indexing, you may be wondering how to get started. As you’ve likely gathered from this post, machine-readable article production is pretty technical.

You may have access to technical staff that can help with XML article production at your publishing organization. If not, you can still get the XML article files you need with the help of a service provider like Scholastica. Scholastica automatically produces front-matter JATS XML files for all journals that use our OA publishing software and full-text JATS XML files for all journals that use our typesetting service. You can learn more about how Scholastica is helping OA journal publishers automate indexing steps in this post.