Image Credit: Artem Sapegin on Unsplash

In the expanding online scholarly communication ecosystem, humans are no longer the only consumers of journal articles. Today, there are various machine “readers” — from aggregators to institutional repositories to search engines. Machines, of course, cannot “read” human text. They process content via the information made available to them in computer markup languages, relying heavily on machine-readable metadata.

Consequently, the potential for a piece of content to be recognized, organized, and distributed by and between different online systems to reach the screens of human readers depends in large part on the quality of its metadata. It’s up to publishers to think through all of the potential consumers of their content, and what information those systems, and the people using them, will need to find that content and process it effectively for their own goals.

Mapping out all of the potential end-users of content and producing necessary machine-readable metadata for them can be a daunting endeavor for many publishers. Even approaching a clear definition of metadata can be tricky. As noted by Jabin White, Vice President of Content Management for JSTOR and Portico, in a Metadata 2020 guest article, while metadata is often described in academic publishing, accurately but rather opaquely, as “data about data,” it is in truth much more. White explains that “at its best and most effective, metadata is a description of content and an expression of the goals that someone has for how that content can be used.”

We caught up with White to learn more about the many roles of metadata in the scholarly publishing ecosystem, and what publishers need to know to produce more effective machine-readable metadata. In the interview below, he shares his thoughts on how metadata quality can be improved and how publishers can move from basic metadata concepts to creating enhanced metadata.

Q&A with Jabin White

In a recent report surveying executives from 25 leading publishers, metadata was rated as the top element “essential to our business” but the second lowest in terms of capability. Why do you think that is?

JW: I have two thoughts about that. First, people love saying the word metadata, but they don’t necessarily love doing what it takes to create quality metadata. I also have this notion that metadata is a selfish thing. In other words, you have metadata for your content set that is great for what you’re trying to do, and I have metadata for my content set that is great for what I’m trying to do. The problem comes when I say, “I have some great metadata to send you, here you go.” And then you look at it and say, “this doesn’t meet my needs at all!”

We can’t agree on what the definition of good metadata is because our needs are selfish. I need different things from metadata than you do. The trick is trying to find common ground so that, when I send you my metadata, it has enough information to explain the value of my content to you. We haven’t quite reached that point yet.

Where do you think the publishing community stands in terms of establishing metadata norms to create a better shared understanding of metadata?

JW: I think we’re doing alright, but we’re still kind of all in silos. That’s why I’m so enthusiastic about Metadata 2020. They are very clear — all they want to do is start conversations about metadata. Those conversations will lead to people exchanging metadata with more regard for what others are going to do with it. Five or ten years ago, none of that was happening. We fell into the trap of saying, “my metadata is fantastic for me, I don’t know what your problem is.” The only way to get out of that mindset is by talking to each other.

When we’re looking at author affiliations and other common things, we do pretty well. But everybody still has their own stuff that they need to do with content and different metadata needs for that. I’m optimistic that we can talk and make the collective metadata experience better. But I’m not super optimistic that we’re going to reach one metadata standard to rule them all. I don’t think that’s practical, nor do I think it’s achievable.

What are key concepts publishers should know to produce better metadata? And to what extent is knowledge of the technical coding aspect of metadata necessary?

JW: I don’t think you have to have a technical background to make a good metadata plan. Let’s try doing a thought experiment. Say I’m a book publisher, and I’ve got a book with 20 chapters on the Civil War. Who is going to be interested in that? Let’s start with libraries, so they can have it in their digital equivalent of a card catalog, which OCLC sort of feeds into. Then there are search engines, sales systems, readers, and so on. You have to think about all of the different people and systems that are going to need to know what the book is about.

So I’ve got 20 chapters in this book. One chapter is an introduction to how we got to this place, and the other 19 are about battles. I don’t think we do this a lot yet, because it’s a new concept to most publishing types, but every one of those chapters can and probably should exist on its own. The same would go for journal articles. We never used to have to think about that because we had these convenient things called “books” and “journals” that were bound and never changed. Online, that world is over. When somebody comes along and says, “nice book you’ve got there, I’m going to break it up into the 20 chapters and send them all off in different directions,” unless you have good metadata, it’s going to be an unholy mess. So you need to think, if this chapter is standing alone and it’s on the Battle of Vicksburg, how am I going to let the user know that it’s a Civil War battle that took place in Mississippi and all of the other pieces of metadata they need?

This is all being technically minded, to be fair, but you don’t need to know how to code to do that. If you start to think about the different ways your content is going to be consumed and used and sold and put into other systems, that is all you need to do to think about metadata. Then the question becomes, “now that I’ve got all of these use cases, what are the pieces of information about the individual chapters and the book as a whole that these people and systems are going to need to do their thing?” At the end of the day, if I’m a publisher or I’m an author, all I care about is people using my content. So when it comes to metadata, I should be doing whatever I need to do to tag this chapter, or this book, or this journal article to be everything to every system or person that wants to use it.

In your Metadata 2020 article, you say publishers need to be careful about limiting metadata uses. How might publishers limit metadata uses, and how can they avoid doing this?

JW: Sticking with the Civil War book example, let’s say I have a chapter about the first time that they used a hot air balloon to scout the enemy’s position. That chapter is about the Civil War, but it’s also about ballooning, right? If all I did was tag everything for the Civil War, and then somebody came along and did a search for the history of ballooning and my chapter didn’t show up, without knowing it I would have limited the effectiveness of my work. That’s just one example. If I thought for 10 minutes, I bet I could come up with 30 different uses for that same chapter.

I’m not saying that you have to think about all of this at the time of publication. But you don’t want to put a hard stop on your metadata. Ideally, you want to be able to have floating metadata that can be updated. So we have to build systems that are flexible enough to account for the people and use cases that come about after the time of publication. And that can bend people’s minds.
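To make the ballooning example concrete, here is a minimal Python sketch of a chapter-level record whose subject tags can be extended after publication. The `ChapterMetadata` class, its field names, the chapter and book titles, and the `find_by_subject` helper are all hypothetical, invented for illustration; they do not describe any real publisher system.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ChapterMetadata:
    """Hypothetical chapter-level record; field names are illustrative only."""
    title: str
    book_title: str
    subjects: Set[str] = field(default_factory=set)


def find_by_subject(chapters: List[ChapterMetadata], subject: str) -> List[str]:
    """Return titles of chapters tagged with the subject (case-insensitive)."""
    wanted = subject.casefold()
    return [
        c.title
        for c in chapters
        if any(s.casefold() == wanted for s in c.subjects)
    ]


# A chapter tagged only for the Civil War (titles are made up).
chapter = ChapterMetadata(
    title="Balloons over the Battlefield",
    book_title="A History of the Civil War",
    subjects={"American Civil War"},
)
catalog = [chapter]

# A search on ballooning misses the chapter entirely.
before = find_by_subject(catalog, "ballooning")

# The record stays open to updates, so a new use case can be added later.
chapter.subjects.add("Ballooning")
after = find_by_subject(catalog, "ballooning")

print(before)
print(after)
```

Keeping `subjects` as an open, mutable set rather than a single fixed category is one simple way to avoid putting a "hard stop" on the record.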

When people use the terms “rich” or “enhanced” metadata, what do they mean? Or what do you think they should mean?

JW: This is my opinion — and I’m not the metadata president, nor am I running! I want to preface with that. Personally, I don’t like the term “rich metadata.” Going back to my original point, my idea of “rich metadata” and your idea of “rich metadata” are likely two different things.

Enhanced metadata, on the other hand, is getting at an important difference between descriptive metadata and semantic metadata. So first we have descriptive metadata, which some people like to refer to as bibliographic metadata. This includes things like the title, the publication date, the publisher, and so on. Then there is enhanced metadata, which, in my opinion, is when you introduce more semantic metadata to start to explain what the content is about and help systems understand how it relates to other content. When someone searches for a topic that your book chapter or journal article covers, enhanced metadata can increase the chances that a search engine is going to pick up the keys in your metadata to say this is the best article to serve up. Showing up higher in search results can happen for a number of reasons. It can be because of semantic metadata. Or it can be because of usage data, where 10 million people searched the web for that same term and chose the same article, so the search engine deduces that’s the best one. Google does all of this really well. All of a sudden that chapter or that article gets embedded with enhanced metadata that says, “send this up the line when someone does a search for this thing.” That’s the way it should work.

The next level is to be able to say, “you did a search on the Civil War and you found an article on the Battle of Vicksburg. I know semantically, because I embedded the knowledge in my metadata, that Grant was the Union general in the Battle of Vicksburg. Did you also know that we have these articles over here on Grant?” So I can actually discern from a search engine or a browsing experience that if you’re interested in the Battle of Vicksburg you may be interested in Grant. And I can do all kinds of things with that metadata to enrich the experience of the user.
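The "did you also know" step can be sketched in Python by storing semantic statements as simple (subject, relation, object) triples and following them to related articles. The relation names, the `TRIPLES` and `ARTICLES` data, and the article titles are assumptions made up for this illustration; a production system would more likely use an established vocabulary or knowledge graph.

```python
# Hypothetical semantic statements as (subject, relation, object) triples.
TRIPLES = [
    ("Battle of Vicksburg", "part_of", "American Civil War"),
    ("Battle of Vicksburg", "located_in", "Mississippi"),
    ("Battle of Vicksburg", "union_general", "Ulysses S. Grant"),
]

# Hypothetical catalog: article title -> the topic the article is about.
ARTICLES = {
    "The Siege of Vicksburg": "Battle of Vicksburg",
    "Grant's Early Career": "Ulysses S. Grant",
    "Grant as President": "Ulysses S. Grant",
}


def related_topics(topic):
    """Topics directly linked from `topic` by any relation."""
    return sorted({obj for subj, _relation, obj in TRIPLES if subj == topic})


def suggest_articles(topic):
    """Articles about topics that the metadata links to `topic`."""
    related = set(related_topics(topic))
    return sorted(title for title, about in ARTICLES.items() if about in related)


topics = related_topics("Battle of Vicksburg")
suggestions = suggest_articles("Battle of Vicksburg")
print(topics)
print(suggestions)
```

A reader who lands on the Vicksburg article can then be shown the two Grant articles, because the link between the battle and the general lives in the metadata rather than in the article text.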

What is the relationship between full-text XML and semantic metadata?

JW: Full-text XML doesn’t necessarily lead to semantic metadata — you still have to put work in for that — but it is a leg up in getting started. When you have full-text XML, you just knocked it out of the park with structural metadata. At the end of the day, it’s still structural metadata. All you’re doing is describing the content. But, by having full-text XML, you just made your life a lot easier if you ever want to start creating semantic metadata.
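As a rough illustration of why full-text XML is a head start, the Python sketch below reads structural information straight out of a small XML fragment using the standard library. The fragment and its element names (`article`, `sec`, `title`, `p`) are a simplified, hypothetical example, not a complete or valid instance of any particular publishing schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified full-text fragment (element names are assumptions).
SAMPLE_XML = """
<article>
  <title>The Siege of Vicksburg</title>
  <sec>
    <title>Background</title>
    <p>Text of the first section.</p>
  </sec>
  <sec>
    <title>The Campaign</title>
    <p>Text of the second section.</p>
  </sec>
</article>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# Structural metadata comes almost for free: the markup already says
# which text is the article title and which text is a section heading.
article_title = root.findtext("title")
section_titles = [sec.findtext("title") for sec in root.iter("sec")]

print(article_title)
print(section_titles)
```

Nothing here says what the article is about; that semantic layer still has to be added. The structure simply gives later tagging work clean units (sections, titles, paragraphs) to attach to.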

How do you think small publishers can best approach metadata with limited resources?

JW: A lot of smaller publishers don’t have the resources right now to get into enhanced metadata. That is one of the benefits of joining an aggregator like JSTOR. We can do certain things at a multiple-title level that are not feasible for some smaller publishers.

But even small publishers can have a controlled metadata vocabulary. The challenge, of course, becomes when you try to share that with outsiders. Then you have to do this really scary thing called “cross walking,” where you’re going from your thesaurus to theirs and things might not match up. That’s the biggest pain. The only thing that would prevent a small publisher from doing that is scale. You have to be able to put work into building and maintaining your metadata vocabulary. But it is possible, and you can decide the extent of that vocabulary as a publisher and make it as good as you can.
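Crosswalking can be pictured as a lookup table from one controlled vocabulary to another, with explicit handling for terms that have no match. The Python sketch below assumes two tiny, made-up vocabularies; the terms, the `CROSSWALK` table, and the `crosswalk` function are hypothetical and only meant to show the shape of the problem.

```python
# Hypothetical mapping from a publisher's local terms to a partner's terms.
CROSSWALK = {
    "Civil War, U.S.": "American Civil War",
    "Ballooning": "Balloons (Aeronautics)",
}


def crosswalk(terms, mapping):
    """Translate local terms; collect the ones with no equivalent."""
    mapped, unmatched = [], []
    for term in terms:
        target = mapping.get(term)
        if target is None:
            # No equivalent in the partner vocabulary: flag for human review.
            unmatched.append(term)
        else:
            mapped.append(target)
    return mapped, unmatched


local_terms = ["Civil War, U.S.", "Ballooning", "Military scouting"]
mapped, unmatched = crosswalk(local_terms, CROSSWALK)
print(mapped)
print(unmatched)
```

The unmatched list is where the ongoing maintenance work shows up: each leftover term needs someone to decide whether to map it, broaden it, or leave it out.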