Image Credit: Stefan Vladimirov on Unsplash

With an image of burning paper in the background, the title of a 2018 article in The Atlantic declares somewhat ominously, “The Scientific Paper Is Obsolete.” The subheader below offers hope: “here’s what’s next.”

What is next? Are we getting there? The article, written by James Somers, a former contributing editor at The Atlantic, isn’t arguing for the end of scientific publishing, as one might quickly infer from the title, but rather that the predominantly paper-based presentation of research is becoming outmoded. Somers argues that printing journals and publishing paper-based PDF articles online does not allow for the dynamic research presentation that digital scholars and readers need. He poses the question, “What would you get if you designed the scientific paper from scratch today?” The article goes on to explore computational notebooks such as Mathematica as an alternative.

Whether computational notebooks will become a mainstream publishing format is yet to be determined. As Somers notes, it would require some drastic changes in the academic research and publishing landscape. And whether the PDF is indeed obsolete remains a matter of debate. We’ve discussed the limitations of PDFs in the past, including their lack of support for multimedia files and how painstaking they are to read on most mobile devices. But PDFs also have their benefits. They are the best format for scholars who prefer to print digital journal articles and read physical copies, because they maintain a consistent layout. PDFs can also be saved to e-readers, making them easily portable. So there are reasons not to give up on PDFs just yet.

However, putting the debate over the utility of PDFs aside, what is evident is that using a paper-based publishing format online does not allow for the most robust user experience possible, either for human readers or for the growing number of machine readers of scholarly content (think search engines, indexes, aggregators, etc.). Machines need to be able to process information in computer markup languages, and scholars need to be able to add dynamic elements to papers, mine data, and simply read articles on their mobile devices without having to squint uncomfortably.

A change taking root at journals across disciplines, one with the potential to vastly expand the online accessibility and utility of journal articles, is producing articles in digital-first formats in addition to the PDF, namely HTML and full-text XML. Creating these additional article formats doesn’t have to require added production steps. Using digitally-driven typesetting processes, journals can produce multiple article files at the same time. You can learn more about how Scholastica does this here.

Why these other file types in addition to the PDF? Let’s break it down.

Search is king in scholarly research

Let’s first look at one of the biggest digital optimization barriers of PDF publishing: search limitations. In the digital publishing landscape, search is king. The 2018 “How Readers Discover Content in Scholarly Publications” report, produced by Renew Publishing Consultants, states that “academic researchers (therefore excluding students, etc.) rate academic search engines as the most important discovery resource when searching for journal articles.” This includes searchable academic indexes and databases (e.g., PubMed Central) as well as mainstream scholarly search engines like Google Scholar.

There are two main models for how indexes and search engines collect and process information to know what to return in search results: (1) metadata and content deposits, and (2) web crawlers. In both of these scenarios, the PDF is substandard. In the case of metadata and content deposits, PDFs are not a machine-readable option. Most academic indexes and archives prefer or require metadata and content deposits to be submitted in JATS-compliant XML. Journals that don’t produce machine-readable XML article files can manually enter metadata into some indexes, but this is a tedious process and often results in meager metadata inputs that don’t give indexes the rich information they need to process content as effectively as possible.
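To make the deposit model concrete, here is a minimal, hypothetical sketch of JATS-style article metadata built with Python’s standard library. The element names loosely follow the JATS tag set, but the journal, author, and identifier values are invented, and a real deposit would need to validate against the full JATS schema rather than this simplified structure.

```python
# Simplified, hypothetical sketch of JATS-style article metadata.
# Element names loosely follow the JATS tag set; all values are invented.
import xml.etree.ElementTree as ET

article = ET.Element("article", {"article-type": "research-article"})
front = ET.SubElement(article, "front")

# Journal-level metadata
journal_meta = ET.SubElement(front, "journal-meta")
journal_title_group = ET.SubElement(journal_meta, "journal-title-group")
ET.SubElement(journal_title_group, "journal-title").text = "Example Journal of Open Research"
ET.SubElement(journal_meta, "issn", {"pub-type": "epub"}).text = "0000-0000"

# Article-level metadata
article_meta = ET.SubElement(front, "article-meta")
ET.SubElement(article_meta, "article-id", {"pub-id-type": "doi"}).text = "10.0000/example.2020.0001"
title_group = ET.SubElement(article_meta, "title-group")
ET.SubElement(title_group, "article-title").text = "An Example Article Title"

contrib_group = ET.SubElement(article_meta, "contrib-group")
author = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
name = ET.SubElement(author, "name")
ET.SubElement(name, "surname").text = "Doe"
ET.SubElement(name, "given-names").text = "Jane"

pub_date = ET.SubElement(article_meta, "pub-date", {"pub-type": "epub"})
ET.SubElement(pub_date, "year").text = "2020"

print(ET.tostring(article, encoding="unicode"))
```

The point is simply that every field is an explicit, machine-readable element, which is what lets indexes ingest deposits automatically rather than relying on manual entry.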

In the case of crawler-based search engines and indexes, PDFs are also limiting. While PDFs can be indexed by many search engines, they are overwhelmingly not search optimized for a few reasons. First, PDF articles often lack adequate HTML metadata because many journals do not host them on designated website pages but rather link out directly to the PDF itself. To index journal websites effectively, crawlers must be able to locate each journal article on an individual webpage that has HTML article-level metadata associated with it. PDF files by themselves do not provide crawlers with the descriptive metadata they need.

To optimize articles for search engines, journals should host each of their articles on a designated article webpage (i.e., a unique URL) that includes bibliographic HTML article-level metadata. This is actually a requirement to be indexed in Google Scholar. Journals can fulfill this best practice by creating individual webpages for the PDF versions of each of their articles, ideally with an in-webpage PDF viewer, rather than linking directly to PDF files. Displaying PDF articles within webpages helps search engines “understand” how they connect with other content on the website and makes it more likely that additional pages will be crawled.
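As a rough illustration, the sketch below renders the kind of article-level meta tags a crawler such as Google Scholar looks for in an article page’s head. The tag names follow the widely used Highwire Press “citation_*” convention; the article details and URL are hypothetical.

```python
# Hypothetical article details rendered as Highwire Press style "citation_*"
# meta tags, the kind of bibliographic markup scholarly crawlers read from an
# article page's <head>. All values and the URL are invented for illustration.
article = {
    "citation_title": "An Example Article Title",
    "citation_author": "Doe, Jane",
    "citation_publication_date": "2020/06/15",
    "citation_journal_title": "Example Journal of Open Research",
    "citation_pdf_url": "https://www.example.org/articles/example-article.pdf",
}

# Render one <meta> tag per field, ready to place in the article page's <head>.
head_tags = "\n".join(
    f'<meta name="{name}" content="{value}">' for name, value in article.items()
)
print(head_tags)
```

Embedding tags like these on each article’s webpage gives crawlers the bibliographic description that a bare PDF file cannot supply on its own.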

However, even when hosted on individual webpages, PDFs can still present linking limitations. Adding clickable references throughout a PDF is often tedious, whereas HTML is more conducive to interactive links within the text. HTML articles also provide a better user experience when it comes to linking because they enable users to open references in separate tabs while maintaining their place in the article they’re reading. PDFs often open links in the same tab, causing readers to lose their place.

Finally, a major limitation of PDFs in modern search is that they are not mobile-friendly. Today, many search engines, including Google, are moving to mobile-first indexing, wherein crawlers index and rank mobile-friendly content before content that is not responsive on mobile devices. So to be competitive in mainstream search, which is becoming more important in many disciplines, mobile-friendliness is key.

Readers are becoming more mobile

Beyond search, there are also obvious reader benefits to producing articles in mobile-friendly HTML. As mobile usage grows among the general public, it is also increasing within academia. The 2015 “How Readers Discover Content in Scholarly Publications” report found that, while researchers still primarily used desktop computers and laptops for research and writing, there was a clear increase in mobile usage among respondents across disciplines. This was particularly true in low-income countries, where reported mobile usage nearly doubled and where mobile devices are often more accessible than computers. As the entire world becomes more mobile, scholars must be able to quickly and easily access research on the go.

Opportunities for enhanced reader experience and usage

Publishing PDFs alone is also holding back many journals from new opportunities to expand their article reading and usage potential online. For example, some publishers and content databases are beginning to introduce enhanced metadata to articles. Enhanced metadata includes semantic elements about content that can help databases and search engines better process not only what the content is but what it is about. Jabin White, Vice President of Content Management for JSTOR and Portico, gave a great overview of enhanced metadata in a recent interview. As he noted, having full-text XML is an important precursor for publishers, or content aggregators and distributors like JSTOR, to be able to efficiently create enhanced metadata.

In addition to supporting the creation of enhanced metadata, full-text XML files are also more conducive to text and data mining, wherein online scripts or machine-learning tools are used to analyze article information. For example, a scholar might employ text and data mining to compile an aggregate of articles that reference a particular subject, or to analyze related data sets within different articles. New machine-learning tools like Scite, which analyzes article citations to determine whether they are supporting or contradictory, have demonstrated that while it is possible to mine PDFs, it is technically very complex. For tools like Scite, which currently have to analyze both PDF and XML files because PDFs remain the predominant publishing format, producing all articles in machine-readable XML would be a major step forward.
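As a rough sketch of why full-text XML lowers the barrier to mining, the example below parses a simplified, hypothetical JATS-style article with Python’s standard library, extracts the abstract, and counts the references that mention a search term. Attempting the same against a PDF would first require reconstructing the text from its print-oriented layout.

```python
# Minimal text-mining sketch over a simplified, hypothetical JATS-style article.
# The XML snippet and element paths are illustrative, not a complete JATS document.
import xml.etree.ElementTree as ET

article_xml = """
<article>
  <front>
    <article-meta>
      <abstract><p>We study open access publishing and article-level metadata.</p></abstract>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <ref><mixed-citation>Smith J. Open access and discoverability. 2018.</mixed-citation></ref>
      <ref><mixed-citation>Lee K. Typesetting workflows for journals. 2019.</mixed-citation></ref>
    </ref-list>
  </back>
</article>
"""

root = ET.fromstring(article_xml)

# Pull the abstract text directly from its element, no layout reconstruction needed.
abstract = "".join(root.find("./front/article-meta/abstract/p").itertext())

# Count references whose citation text mentions the search term.
term = "open access"
matching_refs = [
    "".join(ref.itertext())
    for ref in root.findall("./back/ref-list/ref")
    if term in "".join(ref.itertext()).lower()
]

print("Abstract:", abstract)
print(f"References mentioning '{term}':", len(matching_refs))
```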

A flexible reading experience for different needs

In considering the pros and cons of the PDF, as well as the different human- and machine-readable file types, what becomes apparent is that each file type serves different needs. For journals that want to give readers the option to easily print copies of articles, PDFs remain best. But to provide an effective digital reading experience for both humans and machines, producing articles in digitally compatible HTML and XML files is paramount.

What is next in the future of academic publishing? Time will tell, but it starts with producing truly digital content for modern readers.
