So much to read, and so little time: that’s the dilemma faced by researchers across academic disciplines. The number of scholarly journal articles published each year continues to climb, now growing at a rate of 4% a year, according to the 2018 STM report. And, with the expansion of interdisciplinary research in recent years, scholars have more content to keep up with than ever before. How can scholars stay abreast of all the literature related to their disciplines and know where to spend their precious reading time?
A solution once thought futuristic, and now being discussed in more tangible terms, is artificial intelligence (AI): the development of computer applications capable of performing tasks that would normally require human intellect. In recent years, AI applications like scite, a deep learning platform that analyzes article citations to see whether they are supporting or contradictory, have emerged to help scholars sift through digital research piles and separate the signal from the noise.
How are AI applications like scite developing? And what are key AI opportunities and challenges in the academic publishing sector?
We caught up with scite co-founder and CEO Josh Nicholson to learn more about scite and his thoughts on the role of AI in academic publishing now and in the future. He shares his thoughts in the interview below.
Q&A with Josh Nicholson
What do you think is the present state of AI in academic publishing, and are there specific industry-wide changes or initiatives you think will be needed to support AI tools like scite in the future?
JN: In general, there has been some apprehension toward nearly all companies using artificial intelligence in academic publishing. Almost invariably, if you look at recent conferences in this space, you’ll find sessions with titles like “Trust, bias and other issues in AI.” This is not surprising to me, however, given that what we and others are offering is in many cases entirely new. Our perspective is that we want to work with members of the scientific community to find out how to make science more reliable, together. And while these meetings have highlighted some apprehension, they have also highlighted the willingness of service providers, librarians, publishers, and scientists to come together and learn from one another. I think it is precisely these interactions that will be key to the success of tools using AI, like scite. Obviously this doesn’t mean all AI tools should be or will be adopted, but I think those that engage with and learn from members of the community have a better chance.
For scite, we’re obviously biased, but we think that letting researchers easily browse or search citation contexts to see how an article has been cited opens up a wealth of scientific articles that would otherwise go unread for lack of time. I think this represents something valuable to the space, not because it is AI per se but because it helps readers, publishers, authors, and virtually anyone looking at scientific articles.
Do you think AI tools like scite that can help process information for researchers are becoming more necessary today than before (with increased research outputs, etc.), or has it just taken time for technology to get to the point of making tools like scite possible? Is it a combination of the two?
JN: I think it is a combination of the two. It’s been said as early as the 1950s that “scholars are going to drown in a flood of information,” and if you compare the volume of publications then with the volume now, that “flood” looks like a puddle. Moreover, Eugene Garfield discussed introducing “citation markers,” or what we would call citation classifications at scite, as early as 1964. Thus, the need and even the idea for scite have been around for a while, but it could not be implemented until now. What has previously held back the implementation of something like scite has been the scale and difficulty of the problem. There are over a billion citations in the academic record and no uniform writing style. Advances in deep learning and natural language processing models have made it possible to solve this problem now, which is what we are doing at scite, although it is still very difficult.
scite works with XML and PDF versions of articles — how have the strengths and limitations of different file types factored into your work? Are there future content production needs that you think will or could factor into the success of machine learning tools like scite?
JN: To classify a citation statement as supporting, contradicting, or mentioning, we first need to extract the citation statement itself. This represents not only a technical challenge but a business challenge as well. The business challenge is that most publications are not open access, so one needs to either pay publishers for access to text-mine scientific articles or provide value in another way so that the content can be freely licensed. With scite, we are able to gain access because we only display a snippet of text (the citation context) with a link back to the full-text version of record. Thus, by letting scite index content, publishers get increased discovery of their articles at no cost to them.
Once we have access to full-text articles, we need to parse the citation context (the three sentences surrounding the in-text citation), match the citation to the reference, and then match the reference to the correct metadata in Crossref. This process is relatively easy when working with structured text like JATS XML (we can process over 1M XML documents a day) but becomes extremely difficult when working with PDFs, where we must use multiple machine learning models to extract the information. The problem of extracting and matching citations in PDFs is exacerbated by the wide variety of PDF layouts, citation styles, and information in references.
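For readers curious what the structured-text case might look like in practice, here is a minimal Python sketch of the first two steps described above: pulling a three-sentence context around each in-text citation in a JATS-style XML document and matching it to its reference entry. It is an illustration only and not scite's actual pipeline. The function name, the sample document, the DOI, the naive sentence splitter, and the reading of "three sentences" as the citing sentence plus one neighbor on each side are all assumptions made for this example. The final step, looking up the reference in Crossref, is left out because it requires a network call.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified JATS-style document (assumption for illustration).
SAMPLE_JATS = """
<article>
  <body>
    <p>Cell division is tightly regulated. Our results confirm that protein X is required for mitosis <xref ref-type="bibr" rid="B1">[1]</xref>. Further work is needed.</p>
  </body>
  <back>
    <ref-list>
      <ref id="B1"><element-citation><article-title>Protein X in mitosis</article-title><pub-id pub-id-type="doi">10.1000/example.1</pub-id></element-citation></ref>
    </ref-list>
  </back>
</article>
"""

MARKER = re.compile(r"\[\[([^\]]+)\]\]")


def flatten_paragraph(p):
    """Return paragraph text with each direct-child bibliographic xref
    replaced by a [[ref-id]] marker. Nested xrefs are not handled."""
    parts = [p.text or ""]
    for child in p:
        if child.tag == "xref" and child.get("ref-type") == "bibr":
            parts.append(f"[[{child.get('rid')}]]")
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


def extract_citation_contexts(xml_text):
    """Return one dict per in-text citation: reference id, DOI (if the
    reference entry carries one), and the surrounding sentences."""
    root = ET.fromstring(xml_text)

    # Step 2 of the text: map each reference id to its DOI, when present.
    dois = {}
    for ref in root.iter("ref"):
        doi = next((e.text for e in ref.iter("pub-id")
                    if e.get("pub-id-type") == "doi"), None)
        dois[ref.get("id")] = doi

    # Step 1 of the text: citing sentence plus one sentence on each side.
    results = []
    for p in root.iter("p"):
        # Naive splitter: breaks on ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", flatten_paragraph(p).strip())
        for i, sentence in enumerate(sentences):
            for ref_id in MARKER.findall(sentence):
                context = " ".join(sentences[max(0, i - 1):i + 2])
                results.append({"ref_id": ref_id,
                                "doi": dois.get(ref_id),
                                "context": context})
    return results


if __name__ == "__main__":
    for item in extract_citation_contexts(SAMPLE_JATS):
        print(item)
```

Because the XML already tags both the in-text citation and the reference entry, the matching here is a simple dictionary lookup. A PDF offers no such tags, which is why that case calls for the machine learning models mentioned in the answer above.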
Do you see the potential for widespread use of AI in the academic sector in the future? What are the main opportunities and challenges in your opinion?
JN: Yes, I think there are already numerous examples of AI being used in the academic sector, and there will be more and more. AI, to me, is just like software development in general. Right now the word connotes something magical and, perhaps because of this, seems untrustworthy. But software development was received in the same way when it was being introduced. As more uses of AI become mainstream and solve real problems, people will begin to trust it more, and it will be just another part of the tech stack. The challenge, just as with software development in general, is making sure that it is used to solve a problem that people care about. This relates to user experience, business models, marketing, and similar concerns just as much as to AI.