Where’s My Link?
A Perspective on Why Linking is Not Perfect
Diana Bittern
Director, Software Product Management
Ovid Technologies
If an institution has purchased the rights to content,
the institutional users should have access to that content
from any application, regardless of where that content resides.
Introduction
Institutions continue to invest significant resources to purchase
electronic full text for their patrons. Then, additional investments
and resources are required to purchase and implement link
resolution systems that promise to connect end-user students
and researchers from a bibliographic citation to the full
text article in the cited electronic journal. In essence,
much time and money are spent to fulfill the concept of “one-click
access” to full text.
It is hardly surprising to hear the battle
cries of librarians when their users innocently ask, “Where’s
my link?”, meaning the link that they, as subscribing
users, expect to take them to the article online. As information
providers, publishers and aggregators struggle with the task
of bringing order out of chaos and giving users reliable resource
integration through linking, this paper attempts to illustrate
some of the realities of linking and the steps being taken
to overcome the myriad anomalies in matching bibliographic
citations to the full text.
To help clear the muddy water around
linking, this paper covers the following issues and the
solutions that have, at least, been proposed for them:
- Metadata (and their Information Providers)
- Proliferation of ISSNs
- Embargoed titles
- CrossRef, the Silver Bullet
The Metadata [non]Standard
The creation of a link relies on several
key ingredients. Linking is about matching metadata, or the
data that defines an article. OpenURL attempts to standardize
matching by defining required metadata elements and an originating
resource, so that the necessary information will be available
to make a match when presented to a target destination.
The most common metadata matching information
for article level linking is IVIP, or ISSN,
Volume, Issue, Page. A URL “template” establishes
a placeholder for this information, extracted from a bibliographic
database citation, and is sent to the ‘target’
where, presuming a match can be made, a full
text experience results. Some URL templates include additional
syntax such as journal title, author(s), and article title.
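
The template idea can be sketched in a few lines of Python. This is a
minimal illustration only: the host name, the parameter names and the
citation field names are invented for the example and do not represent
any vendor's actual link syntax.

```python
from urllib.parse import quote

# Hypothetical template: the host and parameter names are invented for this sketch.
TEMPLATE = (
    "https://fulltext.example.org/link"
    "?issn={issn}&volume={volume}&issue={issue}&page={page}"
)
IVIP_FIELDS = ("issn", "volume", "issue", "page")


def build_link(citation):
    """Fill the template from a citation dict; return None if an IVIP element is missing."""
    values = {}
    for field in IVIP_FIELDS:
        value = citation.get(field)
        if value is None or str(value).strip() == "":
            return None  # no reliable match is possible without all four elements
        # Percent-encode each value so that characters such as "/" survive in a URL.
        values[field] = quote(str(value).strip(), safe="")
    return TEMPLATE.format(**values)
```

A citation that lacks any one of the four elements produces no link at
all in this sketch, which is the conservative choice: a missing link is
easier to explain to a user than a dead one.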
Using the IVIP model is preferred because
it reduces the possibilities of variations that result in
mismatches. Title and Author metadata can be troublesome,
because exact matching is marred by the presence (or absence)
of punctuation and articles, and by name conventions
(author’s last name, first initial, honorific, etc.), which often
differ among information providers.
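
To see why title matching is fragile, consider what it takes to make
two renderings of the same journal title compare equal. The following
is a rough sketch, assuming English-language titles and ASCII
punctuation; the function name and the list of articles are choices
made for this illustration.

```python
import re
import string

LEADING_ARTICLES = ("the ", "a ", "an ")
# Translation table that deletes ASCII punctuation only.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_title(title):
    """Reduce a title to a comparison key: lowercase, no punctuation, no leading article."""
    text = title.lower().translate(_STRIP_PUNCTUATION)
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    for article in LEADING_ARTICLES:
        if text.startswith(article):
            return text[len(article):]
    return text
```

Even this much only handles the easy cases. Variations such as “&”
versus “and”, abbreviated journal titles, or author names with and
without honorifics would each need further rules, which is why the
IVIP elements are the safer basis for a match.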
Yet, even the most ‘standard’
metadata can produce problems. For example, take the ISSUE
metadata element. If we presume that the starting point for
researchers is the bibliographic database (acting as an online
card catalog) and that there are scores of information providers
who deliver bibliographic databases, it is not hard to imagine
differences emerging. Out of 10 bibliographic database providers,
there could be 10 different ways of representing a supplement,
a combined issue, or an issue part.
Some databases combine source information
into a single field, such as 2001(6)2;112: publication year
2001; Volume 6, Issue 2, Page 112. It is possible to extract
required metadata elements from the source field, provided
that there are no variations. Enter a twist, such as combined
issues 2 and 3, or page S33, and the extraction formula needs
to be augmented with filters, Perl expressions, or rules for
handling anomalies (assuming that all the anomalies can be
identified).
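
As an illustration of such an extraction formula, here is a sketch
using a Python regular expression. It assumes the YEAR(VOLUME)ISSUE;PAGE
layout of the example above, and it assumes that a combined issue is
written as “2-3” or “2/3” and a supplement page as a letter followed by
digits; real databases will vary.

```python
import re

# Assumed layout: YEAR(VOLUME)ISSUE;PAGE, for example "2001(6)2;112".
SOURCE_RE = re.compile(
    r"""^\s*
    (?P<year>\d{4})
    \((?P<volume>\d+)\)
    (?P<issue>\d+(?:[-/]\d+)?)   # a single issue, or a combined issue
    ;
    (?P<page>[A-Za-z]?\d+)       # a plain page, or a prefixed page
    """,
    re.VERBOSE,
)


def parse_source(source):
    """Return year, volume, issue and page as strings, or None if the field does not fit."""
    match = SOURCE_RE.match(source)
    return match.groupdict() if match else None
```

Anything that does not fit the pattern returns None rather than a
guess, so each newly discovered anomaly shows up as a failed parse that
can be reviewed and turned into a new rule.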
Multiply the number of designations for
title and author information by the variations for extracting
source information, across all databases used in your institution.
The development and maintenance of linking formulae become
a daunting task.
The International Standard Serial Number (ISSN)
Speaking of standards, the journal identifier,
or ISSN, is THE identifier for a serial, right? Not quite.
Although imperfect, the ISSN has long defined a journal in
the traditional sense, and in theory, a single ISSN identified
a single print version. In the pre-digital world, ISSN’s
were somewhat unreliable, since an ISSN could arbitrarily
change (or not) with minor title changes, publisher changes,
etc.
In the digital world, ISSN’s are even less reliable,
since they are now also being registered for the electronic
version of a journal. The electronic version may contain different
content, often breaks the IVIP rule (who needs Page numbers in
an electronic version?), and can precede the arrival of the
print version of the publication, making it a more attractive
target for online linking services.
Linking software relies on ISSN as a match element, but it
no longer works as reliably as it once did. End users first
encountered this problem a few years back when NLM, with no
warning, began using electronic ISSN (EISSN) as the indexing
standard for its MEDLINE database. Linking products use a
‘knowledgebase’ to identify journals and coverage
information. In other words, for each bibliographic citation,
the software checks the ISSN against a user profile of ISSN’s and years
of coverage to determine whether to show a Full Text link.
Ovid’s OpenLinks product, for example, successfully
used the Print ISSN as the basis of its knowledgebase, to
link across its 100 bibliographic databases. That is, until
MEDLINE’s change broke this link model.
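
The coverage check, and the way an electronic ISSN defeats it, can be
shown with a small sketch. The profile structure and the ISSN values
below are placeholders invented for the example; they are not a
description of how OpenLinks stores its data.

```python
# Hypothetical holdings profile: print ISSN -> (first covered year, last covered year).
PROFILE = {
    "1234-5678": (1996, 2003),
}


def show_full_text_link(citation, profile=PROFILE):
    """Show a link only if the citation's ISSN is held and its year is within coverage."""
    coverage = profile.get(citation.get("issn"))
    if coverage is None:
        return False  # ISSN not in the profile, so no link
    first_year, last_year = coverage
    return first_year <= citation["year"] <= last_year


# The same article, indexed once under the print ISSN and once under an electronic ISSN.
print(show_full_text_link({"issn": "1234-5678", "year": 2001}))  # held: link shown
print(show_full_text_link({"issn": "8765-4321", "year": 2001}))  # EISSN unknown: no link
```

The second citation describes content the institution has paid for,
yet no link appears, because the profile knows the journal only by its
print ISSN.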
The rules for assigning ISSN’s to different versions
of a journal (print, CD-ROM, electronic), as well as rules
for naming journal subsets, are currently under review as part
of ISO (International Organization for Standardization)
3297, the official number for the ISSN standard. NISO’s
directive is to answer the requirements of publishers who
need to delineate different media of their titles, while at
the same time answering the technical uses of ISSN to provide
links and coverage information in software solutions.
For those who confront A&I database linking, reference
linking, and OpenURL, this challenge is already well known.
I recently participated in a survey from the ISSN Review Committee,
and made the point that for many, the key use of ISSN is for
linking from a bibliographic database or cited reference to
the full text of the journal content.
Once the standard-makers have spoken, the resulting mandate
must be adopted by the publishers and information providers.
Where changes require that bibliographic databases be reloaded,
it will take time and pressure from the user community to
enforce the revised standard.
Meanwhile, the pressure from end users for high quality link
reliability has pushed technology leaders to come up with
their own answers to the problem. At Ovid, the problem is
addressed through a Pub-ID database that stores ISSN and name
elements, and provides what we refer to as ISSN-Mapping across
resources. All ISSN numbers are linked in a relational database,
by name (fuzzy matching logic at work here) and number. If
a journal changes publishers, its name may change and its
ISSN definitely changes. The database stores ISSN’s
that refer to the same journal name (Print/Electronic) and
maps older versions of the journal.
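
The mapping idea can be sketched as follows. This is an illustration
under stated assumptions, not the Pub-ID database itself: a Python
dictionary stands in for the relational tables, the publication ID and
ISSN values are placeholders, and the fuzzy name matching is left out.

```python
# Hypothetical mapping: every known ISSN for a journal points at one publication ID.
ISSN_TO_PUB_ID = {
    "1234-5678": "PUB-0001",  # print ISSN
    "8765-4321": "PUB-0001",  # electronic ISSN
    "1111-2222": "PUB-0001",  # older ISSN, from before a change of publisher
}


def has_link(citation_issn, customer_issns, mapping=ISSN_TO_PUB_ID):
    """True if the cited ISSN and any ISSN on the customer's list name the same journal."""
    pub_id = mapping.get(citation_issn)
    if pub_id is None:
        return False  # the cited ISSN is unknown to the mapping
    held = {mapping[issn] for issn in customer_issns if issn in mapping}
    return pub_id in held
```

With this in place, a citation indexed under the electronic ISSN still
matches a customer list that contains only an old print ISSN, because
both resolve to the same publication ID before they are compared.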
So, whether the MEDLINE database indexes EISSN, whether the
customer’s journal import list includes some ‘old’
ISSN’s, or whether a journal changes hands, the A&I
database user will be presented with a valid link and will
reach the Holy Grail: the full text.
While our efforts are duly noted, relying on individual
publishers and aggregators to come up with sophisticated
technology work-arounds to make linking “perfect”
ultimately does not best serve the end-user.
Embargoes
The world of free journal linking brings
enthusiastic response from customers, until they have to deal
with the question of embargoes. What can be done to address
rolling coverage or that 6-month subscriber embargo or a journal
that offers open access to its archives after 12 months?
Journal embargoes are of varying durations,
ranging from 6 months, to 9 months, and even 24 months, and
can be applied by aggregators, where access to the content
is delayed by a period of 3+ months; or by publishers who
make access to certain journals free after a period of 12+
months.
In order for a linking application to process
a request that accounts for an embargo, it must have reliable
information about the MONTH of the publication. An OpenURL
can be crafted to include a Month field. However, the majority
of desirable resources (bibliographic databases and cited
references) do not have reliable Month metadata.
Many Information Providers do not use Month at all, but use
Publication Year, or Entry Week. Of Ovid’s top 10 bibliographic
databases, only MEDLINE has a field called EP (Electronic
Date of Publication) from which one could reliably parse Month
information. So, the existence of this information benefits
a very small number of databases – and virtually no
cited references.
There are two scenarios to address this issue,
in which librarians use the publication year to determine
whether to show a link for an embargoed title. However, neither
is ideal, since one results in dead-end links, and the other
excludes otherwise valid links. To illustrate, let’s
use a title called the Journal of Parapsychology that has
a 3 month embargo. The options are:
- Define coverage by rounding Backwards
to the Previous Year: This journal would be missing
links for up to 9 months, reflecting the most recent issues.
If the coverage is set to 2002, and it is now December 2003,
the result is that a user would not see links for citations
from January through September of 2003, which are
valid under the terms of the embargo.
- Make coverage reflect the Current
Year: This journal with a 3 month embargo will show
‘dead links’ for the embargo period.
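
The arithmetic behind the two options can be worked through in a short
sketch. It assumes that an issue becomes available once the number of
months elapsed since publication reaches the embargo length, which is
consistent with the example above; the function names are invented for
the illustration.

```python
def month_index(year, month):
    """Count months on a single scale so that two dates can be subtracted."""
    return year * 12 + (month - 1)


def is_past_embargo(pub_year, pub_month, now_year, now_month, embargo_months):
    elapsed = month_index(now_year, now_month) - month_index(pub_year, pub_month)
    return elapsed >= embargo_months


def compare_year_options(now_year, now_month, embargo_months):
    """Split the current year's issue months into (missed, dead) lists."""
    missed, dead = [], []
    for month in range(1, now_month + 1):
        if is_past_embargo(now_year, month, now_year, now_month, embargo_months):
            missed.append(month)  # valid, but hidden if coverage stops at the previous year
        else:
            dead.append(month)    # shown if coverage includes the current year, but embargoed
    return missed, dead


missed, dead = compare_year_options(2003, 12, 3)
```

For December 2003 and a 3 month embargo, `missed` holds months 1
through 9 (the valid links lost by rounding back to 2002) and `dead`
holds months 10 through 12 (the dead-end links shown when coverage
includes 2003).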
Ovid is addressing the embargo issue programmatically
by providing two types of embargo coverage: subscriber and
free.
In Ovid’s knowledgebase, journal identifiers
will now include embargo information, where relevant. For
a specific supplier whose terms mandate them, subscriber
embargoes can be defined as N months, and links will appear
only if the embargo period is honored. For journals that become
“free-after-N-months,” that information will be
stored in the knowledgebase and activated for all users automatically
through the linking application.
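
One way to picture the two embargo types is as a rule attached to a
journal entry. The sketch below is an interpretation for illustration
only: the labels “subscriber” and “free”, the class and function names,
and the exact link logic are assumptions, and the sketch presumes that
a reliable publication month is available, which the discussion above
shows is often not the case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embargo:
    kind: str    # "subscriber" or "free"; labels assumed for this sketch
    months: int  # the N in "N months"


def months_elapsed(pub_year, pub_month, now_year, now_month):
    return (now_year - pub_year) * 12 + (now_month - pub_month)


def show_link(embargo, is_subscriber, pub_year, pub_month, now_year, now_month):
    """Decide whether to display a full text link for one citation."""
    if embargo is None:
        return is_subscriber  # no embargo: ordinary subscription access
    lifted = months_elapsed(pub_year, pub_month, now_year, now_month) >= embargo.months
    if embargo.kind == "subscriber":
        return is_subscriber and lifted  # subscribers wait out the embargo
    if embargo.kind == "free":
        return is_subscriber or lifted   # everyone gets the link once it is free
    raise ValueError(f"unknown embargo kind: {embargo.kind!r}")
```

Under these assumptions a subscriber embargo hides the link from
everyone until the period has passed, while a free-after-N-months rule
adds links for non-subscribers without taking any away from
subscribers.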
CrossRef
CrossRef is a technical and marketing phenomenon.
In its short lifetime, it has gained tremendous brand awareness
with a following of over 300 member publishers, and has expanded
from its original theme of reference linking among member
publishers to a much broader one of improving overall access
to scholarly information on the internet.
CrossRef’s expanded mission now includes
forays into areas such as cross-search and forward linking
(links to articles that cite a viewed article). But
as with any large and fast-growing enterprise, there have
been many challenges along the way, and there is plenty of
work to solidify the original goal of CrossRef linking: providing
persistent [link] identifiers to the full text.
Member publishers deposit DOI’s, or Digital Object Identifiers,
in CrossRef’s metadata database for the sole purpose
of facilitating linking. The original purpose of the CrossRef
initiative was to promote cited-reference linking among participating
e-publishers. Member publishers agreed to submit ‘persistent
identifiers’ (DOI numbers) with relevant metadata (such
as ISSN, volume, issue, page, and other information such as
author(s), journal title, article title, etc.) and a URL to
the article’s location.
A publisher preparing to publish an article electronically
can query the CrossRef metadata database with reference metadata,
and where there is a successful match, embed full text links
to references. The reader clicks on a reference link to be
transported to the full text of that cited article at the
publisher’s web site.
CrossRef members are honor-bound to deposit DOI’s for
all of their electronically published content, both current
and archival. Most have done a thorough job of depositing,
but the database is far from complete. CrossRef is self-policing
and recognizes the varying technical capabilities and resource
constraints among its members. However, a missing CrossRef
DOI is a missed full text link, plain and simple.
Metadata requests to the CrossRef database assume the presence
of certain elements in order to create and resolve links.
We return to the familiar IVIP model where volume, issue and
page information, combined with the journal name, author(s)
and article title are used, in some combination, to create
link queries. The absence of one or more ‘required’
metadata elements causes a link not to appear.
Wolters Kluwer is one of several large publishers that locally
host the CrossRef Database. Recently, Ovid (part of the Wolters
Kluwer Health Division) undertook an exhaustive analysis to
compare our respective link repositories, and uncovered data
that was not present in Ovid’s local CrossRef database.
CrossRef and Ovid have since been working closely to implement
validation processes to ensure that locally hosted CrossRef
databases deliver the same high quality link information as
CrossRef’s own metadata database.
In the course of our investigations, we
also identified a significant number of DOI’s that were
deposited by publishers into CrossRef’s database with
insufficient data to allow metadata matching under the most
common link rules. In some cases, these DOI’s were deposited
in advance of print, so that Issue or Page information was
missing. In other cases, the DOI is an article from an Electronic-Only
journal and has no Issue or Page information.
CrossRef responded by notifying the member
publishers about missing metadata so that it can be re-deposited
to allow successful matching. In addition, vendors are exploring
alternative metadata matching schemes, using different combinations
of ‘required’ metadata and new algorithms to extend
the reach of link reliability.
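
What an alternative matching scheme might look like can be sketched as
an ordered list of rules, each naming a different combination of
elements. The rules, field names and sample DOI below are assumptions
made for this illustration; they do not describe CrossRef’s query
interface or any vendor’s actual algorithm.

```python
# Match rules, tried in order. Each rule lists the fields that must all be
# present in the citation and equal in the deposited record.
MATCH_RULES = [
    ("issn", "volume", "issue", "page"),  # the usual IVIP rule
    ("issn", "volume", "page"),           # tolerate a missing issue
    ("issn", "year", "first_author"),     # for deposits with no issue or page
]


def find_doi(citation, records, rules=MATCH_RULES):
    """Try each rule in order; return the DOI when a rule matches exactly one record."""
    for rule in rules:
        if not all(citation.get(field) for field in rule):
            continue  # the citation lacks a field this rule needs
        matches = [
            record for record in records
            if all(record.get(field) == citation[field] for field in rule)
        ]
        if len(matches) == 1:
            return matches[0]["doi"]  # accept only an unambiguous match
    return None


# A record deposited ahead of print, with no volume, issue or page (placeholder values).
records = [
    {"doi": "10.9999/example.1", "issn": "1234-5678",
     "year": "2003", "first_author": "smith"},
]
citation = {"issn": "1234-5678", "volume": "12", "issue": "4", "page": "301",
            "year": "2003", "first_author": "smith"}
doi = find_doi(citation, records)
```

Here the IVIP rule fails because the deposit has no issue or page, but
the third rule still finds the record. Insisting on exactly one match
is a deliberate trade-off in this sketch: looser rules reach more
articles, but only a unique hit is safe to present as a link.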
CrossRef has been working overtime to beef up its infrastructure,
to reduce the time it takes to process DOI deposits, and to
expand its own capabilities to provide error reporting and
statistical information to its membership. While CrossRef
is not perfect in its implementation, it is still a valuable
resource and boon to the information industry.
Wrap Up
So what have we learned? Today’s linking systems continue
to deliver increasingly reliable link performance. Over time,
the results of this painstakingly slow work of creating order out of linking
chaos will improve as awareness of all the details spreads
among the parties responsible for creating and maintaining
the primary resources and delivery systems.
The issues covered in this review
are not unique to a single application vendor; they are common
to all, and require constant vigilance, analysis and innovation
in order to deliver reliable resource integration. Success
will be achieved when end-users proclaim, “Here’s
my link!”