
2025 may well be declared “the year of AI pilots” in scholarly journal publishing (more than likely carrying over into 2026!) as stakeholder discussions shift from theoretical applications of AI to the experiences of editorial teams testing new AI tools.
What can we learn from recent case studies?
I sat down with Scholastica Co-Founder and CEO Brian Cody to discuss some key takeaways from emerging AI use cases shared at industry conferences and events this year. Below is a transcript of our conversation (edited for readability).
You can also watch the recording on YouTube here!
Discussion transcript
DP: I thought I’d kick off the conversation with a broad question. What surprised you the most in conversations about AI in peer review and publishing in general at conferences or during Peer Review Week?
BC: This year, I went to more conferences than I have in the past, including EASE, ALPSP, SSP, ISMTE, and The Peer Review Congress. What was very stark to me, compared to last year, was how much the conversation had shifted away from “AI is going to ruin publishing” and “what do we need to fight?” I think that reflects a maturing of the conversation. It also reflects people being more open to some of the benefits that AI might have, whereas two years ago, it felt very doomsday. Now there’s a mix of discussions about threats to scholarly quality that come with AI, but also potential improvements to scholarly quality or efficiency that can come with it.
DP: It’s true, we’re kind of moving from the idea of AI into actual implementation as tools emerge and people get more comfortable with them. Now is a period to try tools and determine what works and what doesn’t, carefully and thoughtfully, especially when it comes to peer review processes. That leads into my next question.
DP: What can we learn from publisher pilots in terms of distinguishing the signal from the noise in AI research integrity checks?
BC: Speaking with publishers at conferences and publishers we work with at Scholastica, I think that with the change in the trajectory of AI conversations I was describing before, more people started looking at the different tools now available and thinking, “I need a tool for this, right?”
In RFPs, you would see people saying they wanted to add an AI tool to their workflow, even if they weren’t using it yet. They were sort of anticipating, I’m going to need that soon. It’s sort of like, well, I’m building a house. I need some hammers. I need some screwdrivers, of course.
As people have done pilots, I think the first outcome is that most of these solutions aren’t plug-and-play. There are some things, such as duplicate image checkers or image manipulation checkers, that are probably closest to that end of the gradient. But for many tools, such as those for scope checks, ethical checks, data analysis, and identifying LLM-generated text, the problems they target affect fields differently. These are things that need to be calibrated or fine-tuned for your particular goals or corpus.
For many of these different tools, you’ll get feedback from them, but they don’t give a “yes” or “no” answer; they give you a score. A similar example familiar to most in the space is iThenticate or Similarity Check, which doesn’t give you a thumbs up or thumbs down. It gives you some data that can inform your decision, which is helpful. Similarly, a lot of these AI tools give you a reading, but you have to incorporate that into your decision tree. I think many people started to see the reality of the need for calibration. Like, if I get a 67%, what does my team need to do with that? Or it’s not green or red, I got a yellow. What do I do with that?
So, I think people started to see that for many of these tools, it’s not a case of just being able to add it off the shelf. You need to fine-tune it. Then you also need to decide how your workflow should change based on the feedback. I think there’s a much more sophisticated conversation about the cost-benefit of different tools. For example, what would have been missed that we’re catching with this tool now? But then, also, how much time are we spending either interpreting or incorporating those data or findings into our workflow, or checking for missing things?
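To make that calibration point concrete, here is a rough Python sketch of what a score-based triage step could look like. The check name, thresholds, and actions are illustrative assumptions, not any specific vendor’s output or Scholastica’s workflow; a real team would calibrate them against its own corpus and goals.

```python
# Minimal sketch: turning a research integrity tool's score into a workflow
# action. All names and thresholds are hypothetical placeholders.

from dataclasses import dataclass


@dataclass
class IntegrityCheckResult:
    manuscript_id: str
    check_name: str   # e.g., "image_duplication" (hypothetical label)
    score: float      # a 0-100 reading from the tool, not a verdict


def triage(result: IntegrityCheckResult,
           caution_threshold: float = 40.0,
           escalation_threshold: float = 70.0) -> str:
    """Map a tool score onto a next step for the editorial team."""
    if result.score >= escalation_threshold:
        return "route to research integrity staff for manual review"
    if result.score >= caution_threshold:
        return "flag for the handling editor; consider querying the authors"
    return "proceed with the normal workflow"


# The "67%" reading mentioned above lands in the yellow zone here, which
# still calls for a human judgment rather than an automatic decision.
print(triage(IntegrityCheckResult("MS-001", "image_duplication", 67.0)))
```

The takeaway is that the score is an input to a human decision tree, not the decision itself.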
DP: For sure, I really enjoyed the EASE session that you moderated on research integrity checks. Kim Eggleton was discussing piloting different tools at IOP Publishing and how their team was considering logistics, like: Where in the workflow should we use this tool? Do we have to run it on every single manuscript, or just a subset of manuscripts? How do we factor that in with budgeting? And it’s interesting because, as she said, there are lots of different considerations about how to fit the tool into the workflow, even before you get to the point of using it. And then you have to train your team on how to interpret the resulting data because, to your point, it’s not a binary “yes” or “no” answer.
BC: In discussions at conferences, the issue of cost came up too. If you have many submissions coming in, that can add up. You’re using services that use a lot of data and electricity, right?
So, when do you run the check? Do you do it at submission? Do you do it after an initial technical check? Do you do it only right before a manuscript is accepted? There are pros and cons to consider. For some people, if they’re really worried about how they’re utilizing their reviewer resources, they might want to run more checks up front. But, again, that increases the cost. Especially for smaller publishers, that might actually be cost-prohibitive. They have to consider the bang for the buck. How much are we catching without the tool? How big an issue is this research integrity check for our specific publisher? They have to decide how much of a problem there is versus, you know, the money costs, the time costs, and the workflow effects.
DP: There has been much discussion about how to identify false positive AI research integrity warnings and also the risk of silent failures. I’m curious to hear your thoughts on managing those situations. What insights have you garnered from case study examples?
BC: I think this is a plus one to the case studies people are sharing at conferences and during the Peer Review Week celebration. Now that there have been pilots, people can share results, not just talk about how we’re trying to figure this out. One example that comes to mind is from ISMTE. There was a speaker from a urology publisher talking about how they tried a tool for a particular research integrity issue. They ran a couple of hundred manuscripts through it, sort of knowing which ones had problems. And what they realized at the end was they had been really worried about false positives, or the tool flagging things incorrectly, and they tuned for that, but then they found there were more false negatives. That made them pull back and say, going back to the workflow discussion, where is this check positioned in our workflow, and how much trust are we giving it?
I appreciated the audience discussion at ISMTE. People were considering things like: if you catch five issues with a tool, but without it you would only have caught one, what’s the improvement? Right? If your old way of doing an initial manual check would have caught 10, and the tool is also going to catch 10, do you layer it in? Or do you say it’s not worth the additional cost? In some cases, that cost can be significant.
I think about the COPE guidelines: human intervention can’t be supplanted by tools. Tools that can flag things we might have missed, or give additional data indicating it might be worth doing another check, can be really helpful. I think people are correctly asking, if we didn’t have these tools, what time would we put into this? If the tool can help us conduct a more thorough check than we might have done otherwise, that’s a win-win. However, the idea of trusting tools to replace a process is generally not panning out to be as promising as some people were hoping a year ago, I think.
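For readers who want to picture the kind of tally a pilot like that produces, here is a minimal sketch of scoring a tool against manuscripts whose problems are already known. The counts are invented for illustration; they are not from the ISMTE example or any real pilot.

```python
# Minimal sketch: comparing a tool's flags against manuscripts with known
# problems to surface false positives and false negatives. Sample data is
# invented purely for illustration.

def pilot_summary(results: list[tuple[bool, bool]]) -> dict[str, int]:
    """Each tuple is (known_problem, tool_flagged) for one manuscript."""
    summary = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for known_problem, flagged in results:
        if known_problem and flagged:
            summary["true_pos"] += 1
        elif flagged:
            summary["false_pos"] += 1
        elif known_problem:
            summary["false_neg"] += 1  # the "silent failure" case
        else:
            summary["true_neg"] += 1
    return summary


# Illustrative run: tuned hard against false positives, the tool still misses
# several known problems, the false-negative surprise described above.
sample = ([(True, True)] * 5 + [(True, False)] * 4 +
          [(False, True)] * 1 + [(False, False)] * 190)
print(pilot_summary(sample))
```

Laying the counts out this way also makes the cost-benefit question explicit: how many of those catches would the existing manual check have found anyway?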
DP: At Scholastica, we have an ear to the ground to determine what customers need as tools evolve and emerge. I’m curious, how are you thinking about AI implementations for Scholastica products based on industry learnings and customer feedback?
BC: One of our North Stars is to follow the publisher. We’ve conducted assessments on what it would take to integrate with new third-party tools from a technical standpoint, but we want to see publishers start using them first. I’ve spoken to multiple publishers that have gone through a pilot process for a tool, and at the end, the ROI was less clear. Again, this will be different for everyone. There are publishers, either in particular fields or at a particular scale, that are clearly reaping benefits from new research integrity tools. There are lots of publishers where it’s less clear. They might not have been targeted as much by paper mills, for example. Hence, those tools might not be as high a priority. We’ve had technical assessments where, at the end, the publisher ultimately decided they weren’t going to use the tool. For many categories of research integrity tools, it’s still early on. There aren’t lots of tried and true solutions yet.
Internally, we’re also working to develop solutions that offer clear time-saving benefits to editors. I think there’s a lot of opportunity to help with tasks that are not related to editorial decision-making, but take up time, like workflow steps that can be automated. And so we see a lot of promise in that and related features we’re developing.
DP: I know that one of the areas that we’ve gotten customer questions around and that many people are talking about in general is AI writing detection tools. Recently, there have been discussions about whether we’ll need to shift from focusing on detection to declarations as AI writing tools become more ubiquitous. What do you think are the primary considerations for publishers right now in that area?
BC: This is something I’ve been hearing a lot about across conferences this year. There have been many different examples, and I’ll summarize them this way: at the end of these AI writing detection tool pilots, the question boils down to, when do we care?
One example was a pilot at a publisher trying to catch AI writing very early, starting with the abstract, because, again, there’s a volume cost to running full manuscripts through tools. If you stop at the abstract, you’re avoiding downstream cost, right? That was one of the pilots that I found really interesting. They were able to fine-tune the tool to ensure low false positive and false negative rates, and they felt very confident, but then they had this issue where they said, we’re identifying AI-generated content in these abstracts, but do we care? And I think compared to a year ago, that makes sense. Now, the discussion might be, is this actually saving us time? If we have authors who are non-native English speakers publishing in English-only journals, could AI writing support save time on copyediting and other steps down the road? I think that question “when do we care?” is really important.
Another question with declarations is what to do if authors declare one thing and we detect something else. If someone says, “I didn’t use this tool,” and we say it looks like you did, when do we need to escalate that? There are all these cases where you need to consider the relationship with the author and the flag being raised.
I spoke to someone in an ethics department where, anytime something was detected, it went to them. They had to sort of create rules on the fly. When do we address this with the author? When do we not? That was a huge amount of internal process to develop. By the end of it, they were saying we actually don’t need a lot of these cases to come to us, because that’s just creating work. It’s slowing down peer review.
Again, because all this is changing so rapidly, I think that’s a healthy place to be. I think churn in our processes right now is healthy. We’re iterating and improving. Especially for early adopters — they’ll be the ones figuring this out, and then others will be able to sort of follow those models.
DP: As a final question, I’m curious if there’s anything about AI implementation you’ve been reading, watching, or absorbing in any format that you’d like to share?
BC: I write code for a small part of my time at Scholastica, and I’m using AI tools around that. It’s really interesting for things like writing tests. It can make you way more productive. I think it helps most when you can act as a manager for the task, where you can evaluate the LLM output, versus asking it to do something you don’t know anything about. Where I can evaluate the code and evaluate the testing, that’s really interesting to me.
I would say, think outside your peer review process. Think about back office operations. I’m trying to read about emerging tools and use cases that aren’t necessarily specific to journals but deal with automating tasks that are not interesting to humans. I’ll give the example of copying things from a PDF into a new table for reporting. You might be able to have LLMs help with that and even check the work. You might say: given this output, compare it against these five PDFs to see if it looks right.
I think we’re in a zone where LLMs are very good at things that take you time but that you’re an expert at, and I think part of the trick right now is to identify those tasks in your day-to-day: I’m good at this, but it takes me time; I could have an LLM do it for me, and I can evaluate its output. So that’s a lot of what I’m reading and thinking about.
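As a rough illustration of that “extract, then verify” pattern, here is a minimal Python sketch. The call_llm function is a placeholder for whichever LLM client you use, and the prompts and report fields are assumptions made for illustration, not a description of any Scholastica feature or specific workflow.

```python
# Minimal sketch of an "extract, then verify" back-office helper.
# call_llm() is a stub; swap in your own LLM provider's client.

import csv
from pathlib import Path


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError("Wire this up to your LLM client of choice.")


def extract_rows(pdf_text: str) -> str:
    """Ask the model to pull reporting fields out of one document's text."""
    return call_llm(
        "Extract the article title, corresponding author, and submission "
        "date from the following text as one CSV row:\n\n" + pdf_text
    )


def verify_rows(pdf_text: str, csv_rows: str) -> str:
    """A second pass that checks the first pass's work against the source."""
    return call_llm(
        "Compare these CSV rows against the source text and list any "
        f"mismatches:\n\nCSV:\n{csv_rows}\n\nSource:\n{pdf_text}"
    )


def build_report(pdf_texts: list[str], out_path: Path) -> None:
    """Assemble the reporting table, printing verification notes for a human."""
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "corresponding_author", "submitted"])
        for text in pdf_texts:
            rows = extract_rows(text)
            print(verify_rows(text, rows))  # a person still reviews this
            for row in csv.reader(rows.splitlines()):
                writer.writerow(row)
```

The division of labor is the same one described above: the model handles the tedious copying, and the person who knows the material evaluates the output.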
DP: I like the idea of thinking about where an AI assistant could fit in those kinds of situations.
Well, thank you for taking the time for this chat! We’re excited to share it, and we look forward to other people’s continued insights as well. This conversation is certainly going to continue for a very long time.
BC: Thanks, I really enjoyed it.
Have a question related to AI pilots in scholarly publishing or additional examples to share? Let us know in the blog comments or on social media. You can find Scholastica on LinkedIn and Bluesky.