Effective Self-Training Author Name Disambiguation in Scholarly Digital Libraries
Anderson A. Ferreira, Adriano Veloso, Marcos André Gonçalves, Alberto H. F. Laender
Departamento de Ciência da Computação
Universidade Federal de Minas Gerais
Belo Horizonte, Brazil
{ferreira, adrianov, mgoncalv, laender}@dcc.ufmg.br
ABSTRACT
Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples in order to distinguish ambiguous author names are among the most effective solutions for the problem, but they require skilled human annotators in a laborious and continuous process of manually labeling citations in order to provide enough training examples. Thus, such systems urgently need to address (i) the automatic acquisition of examples and (ii) highly effective disambiguation even when only a few examples are available. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that deals with these two issues. The first step eliminates the need for any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. The second step uses a supervised disambiguation method that is able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information
Retrieval; I.5.2 [Pattern Recognition]: Classifier design
and evaluation
General Terms
Algorithms, Experimentation
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
JCDL’10, June 21–25, 2010, Gold Coast, Queensland, Australia.
Copyright 2010 ACM XXXXXXXXXX/10/06 ...$10.00.
Keywords
Name Disambiguation, Bibliographic Citations
1. INTRODUCTION
Several scholarly digital libraries (DLs), such as DBLP¹, CiteSeer², MEDLINE³ and BDBComp⁴, provide features and services that facilitate literature research and discovery as well as other types of functionality. Such systems may list millions of bibliographic citation records (here understood as a set of bibliographic features, such as author and coauthor names, work title and publication venue title, of a particular publication) and have become an important source of information for academic communities since they allow the search and discovery of relevant publications in a centralized way. Also, studies about the DL content can lead to interesting results such as coverage of topics, tendencies, quality and impact of publications of a specific sub-community or individuals, patterns of collaboration in social networks, etc. These types of analysis and information, which are used, for example, by funding agencies for decisions on grants and for individuals' promotions, presuppose high quality content [20, 22].
Citation management within DLs involves a number of tasks. One in particular, author name disambiguation, has required a lot of attention from the DL research community due to its inherent difficulty. Specifically, name ambiguity is a problem which occurs when a set of citation records contains ambiguous author names (the same author may appear under distinct names, or distinct authors may have similar names). This problem may be caused by a number of reasons, including the lack of standards and common practices, and the decentralized generation of content (e.g., by means of automatic harvesting).
The name disambiguation task may be formulated as follows. Let C = {c1, c2, ..., ck} be a set of citation records. For each author in a citation record ci, an authorship record ri is created to represent his/her participation in that citation. The objective is to produce a disambiguation function which is used to partition the set of authorship records into n sets {a1, a2, ..., an}, so that each partition ai contains (ideally all and only) the authorship records in which the i-th author appears.
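The formulation above can be illustrated with a minimal Python sketch. The class and field names (`Citation`, `AuthorshipRecord`, `citation_id`, `coauthors`) are hypothetical and chosen only for illustration; the sketch merely shows how one authorship record per author can be derived from each citation record, using the three attributes mentioned in the abstract (author names, work title and publication venue).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Citation:
    """A citation record: author names, work title and publication venue."""
    authors: Tuple[str, ...]
    title: str
    venue: str


@dataclass(frozen=True)
class AuthorshipRecord:
    """One author's participation in one citation (hypothetical layout)."""
    citation_id: int
    author_name: str
    coauthors: Tuple[str, ...]
    title: str
    venue: str


def authorship_records(citations: List[Citation]) -> List[AuthorshipRecord]:
    """Create one authorship record for each author of each citation."""
    records = []
    for cid, c in enumerate(citations):
        for pos, name in enumerate(c.authors):
            # Every other author of the same citation is a coauthor.
            coauthors = c.authors[:pos] + c.authors[pos + 1:]
            records.append(AuthorshipRecord(cid, name, coauthors, c.title, c.venue))
    return records
```

A disambiguation function would then map each `AuthorshipRecord` to one of the n author partitions.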
¹ http://dblp.uni-trier.de
² http://citeseer.ist.psu.edu
³ http://medline.cos.com
⁴ http://www.lbd.dcc.ufmg.br/bdbcomp
To disambiguate the bibliographic citations of a digital library, first we may split the set of authorship records into groups of ambiguous authors, called ambiguous groups (i.e., groups of citations having authors with similar names). The ambiguous groups may be obtained, for instance, by using a blocking method [27]. Blocking methods address scalability issues, avoiding the need for comparisons among all authorship records.
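As an illustration of blocking, the sketch below groups author names by a simple key made of the first initial and the last name. This key is an assumption chosen for illustration, not necessarily the blocking method of [27]; the function names `blocking_key` and `ambiguous_groups` are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def blocking_key(name: str) -> str:
    """Assumed heuristic key: first initial plus last name, lowercased."""
    parts = name.lower().replace(".", " ").split()
    if not parts:
        return ""
    return f"{parts[0][0]} {parts[-1]}"


def ambiguous_groups(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group names that share a blocking key into one ambiguous group."""
    groups = defaultdict(list)
    for name in names:
        groups[blocking_key(name)].append(name)
    return dict(groups)
```

Only records inside the same group need to be compared with each other, which is what avoids comparisons among all authorship records.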
The challenges of dealing with name ambiguity in bibliographic DLs have led to a myriad of disambiguation methods [4, 5, 8, 9, 14, 15, 16, 17, 18, 19, 23, 25, 26, 27, 28, 33, 34, 35, 41]. However, despite the fact that most of these methods were demonstrated to be relatively effective (in terms of error rate or similar metrics), none of them provides a perfect and final solution for the problem (i.e., they produce errors). Existing disambiguation methods usually follow either an unsupervised or a supervised approach. In the former case, the methods exploit similarities between authorship records in order to place in the same group those records that belong to the same author. In the latter case, the methods exploit a set of training examples, from which a disambiguation function is derived and then used to place authorship records in the corresponding group.
Supervised methods are usually the most effective ones for name disambiguation. In more detail, we are given as input a set of authorship records called the training data (denoted as D) that consists of examples or, more specifically, records for which the correct authorship is known. Each example is composed of a set F of m features {f1, f2, ..., fm} along with a special variable called the author. This author variable draws its value from a discrete set of labels {a1, a2, ..., an}, where each label uniquely identifies an author. The training examples are used to produce a disambiguation function (i.e., the disambiguator) that relates the features in the training examples to the correct author. The test set (denoted as T) for the disambiguation task consists of a set of authorship records for which the features are known while the correct author is unknown. The disambiguator, which is a function from {f1, f2, ..., fm} to {a1, a2, ..., an}, is used to predict the correct author for the records in the test set. In this context, the disambiguator essentially divides the records in T into n sets {a1, a2, ..., an}, where ai contains (ideally all and only) the authorship records in which the i-th author is included. Alternatively, the disambiguator may take as input a pair of authorship records and output a binary decision on whether these records belong to the same author.
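To make the supervised setting concrete, the following sketch derives a very simple disambiguator from training data D and applies it to a record from T. The feature-voting scheme is a stand-in chosen for brevity and is not the method proposed in this paper; the names `train` and `predict` and the string encoding of features (e.g., "coauthor:a veloso") are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


def train(D: List[Tuple[Set[str], str]]) -> Dict[str, Counter]:
    """Count, for every feature, how often it occurs with each author label."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for features, author in D:
        for f in features:
            votes[f][author] += 1
    return votes


def predict(votes: Dict[str, Counter], features: Iterable[str]) -> Optional[str]:
    """Return the label with the most feature votes, or None without evidence."""
    tally: Counter = Counter()
    for f in features:
        # Features never seen in training contribute no votes.
        tally.update(votes.get(f, {}))
    return tally.most_common(1)[0][0] if tally else None
```

Here the disambiguator is the function `predict` with the learned `votes` fixed; a record whose features never occurred in D yields `None` rather than a forced guess.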
Although successful applications of supervised methods have been reported [9, 14, 17, 35, 36, 37], the acquisition of training examples requires skilled human annotators to manually label authorship records. DLs are very dynamic systems, thus manual labeling of large volumes of examples is infeasible. Further, the disambiguation task presents nuances that impose the need for methods with specific abilities. For instance, since it is not reasonable to assume that all possible authors are included in the training data, disambiguation methods must be able to detect unseen authors, for whom no label was previously assigned (i.e., there are no authorship records for these authors in the training data).
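One simple way to picture unseen-author detection is a confidence threshold: if no known label is predicted with enough confidence, the record is assigned a fresh label. The sketch below assumes this thresholding scheme purely for illustration; the function name `assign_author`, the `min_score` parameter and the "unseen_k" naming are hypothetical and are not taken from the method described in this paper.

```python
from typing import Dict, Set


def assign_author(scores: Dict[str, float], known_labels: Set[str],
                  min_score: float = 0.5) -> str:
    """Return the best known label, or create a new one if confidence is low.

    scores maps each known author label to a confidence in [0, 1].
    known_labels is updated in place when a new label is created.
    """
    if scores:
        best = max(scores, key=scores.get)
        if scores[best] >= min_score:
            return best
    # No sufficiently confident prediction: treat the record as an unseen author.
    n = len(known_labels) + 1
    while f"unseen_{n}" in known_labels:
        n += 1
    new_label = f"unseen_{n}"
    known_labels.add(new_label)
    return new_label
```

The threshold trades off two kinds of error: set too low, unseen authors are merged into existing ones; set too high, known authors are split.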
Unsupervised methods, on the other hand, require no manual labeling effort, since they simply group authorship records into clusters by maximizing intra-cluster similarity while minimizing inter-cluster similarity. Obviously, the choice of a proper similarity measure is of paramount importance, and a natural choice is to employ similarity measures based on highly discriminative features, such as coauthor names. In this case, the resulting clusters are very likely to be pure, in the sense that each cluster is likely to contain only authorship records of the same author. The drawback, however, is that some authors are likely to have their authorship records fragmented into several (pure) clusters, compromising the effectiveness of unsupervised methods.
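The purity and fragmentation trade-off can be seen in a small sketch that places two authorship records of the same ambiguous group in one cluster whenever they share at least one coauthor name. Exact name matching and the union-find grouping are assumptions made for illustration; the function name `cluster_by_coauthors` is hypothetical.

```python
from typing import Dict, List, Set


def cluster_by_coauthors(records: List[Set[str]]) -> List[List[int]]:
    """Cluster record indices so that records sharing a coauthor end up together.

    records[i] is the set of coauthor names of the i-th authorship record.
    """
    parent = list(range(len(records)))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen: Dict[str, int] = {}
    for i, coauthors in enumerate(records):
        for name in coauthors:
            if name in first_seen:
                parent[find(i)] = find(first_seen[name])
            else:
                first_seen[name] = i

    clusters: Dict[int, List[int]] = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

Records with no coauthor in common with any other record, including single-author works, each remain in a cluster of their own, which is exactly the fragmentation described above.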
In this paper, we propose a hybrid disambiguation method, which will hereafter be referred to as SAND (standing for Self-training Associative Name Disambiguator). SAND exploits the strengths of both unsupervised and supervised methods. Specifically, it works in two steps. In the unsupervised step, recurring patterns in the coauthorship graph are exploited in order to produce pure clusters of authorship records. Then, in the supervised step, a subset of the extracted clusters is provided as training data, from which a disambiguation function is derived. The final result is a highly effective and extremely practical disambiguator, as will be shown in a set of experiments using citation records extracted from the DBLP and BDBComp collections. The results show that SAND outperforms unsupervised methods by more than 27% on the DBLP collection and 4% on the BDBComp collection. Improvements when compared against supervised methods are also reported.
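The hand-off between the two steps can be sketched as follows: each selected cluster becomes a set of training examples that share one automatically generated label, and the remaining records are left for the supervised step to classify. The selection criterion used here (a minimum cluster size) is an assumption for illustration only, since this section does not state how the subset of clusters is chosen; `clusters_to_training`, `min_size` and the "author_k" labels are hypothetical names.

```python
from typing import Hashable, List, Sequence, Tuple


def clusters_to_training(
    clusters: Sequence[Sequence[Hashable]], min_size: int = 2
) -> Tuple[List[Tuple[Hashable, str]], List[Hashable]]:
    """Turn selected clusters into labeled examples; return the rest unlabeled.

    Assumed criterion: a cluster is selected when it has at least min_size records.
    """
    training: List[Tuple[Hashable, str]] = []
    remaining: List[Hashable] = []
    for k, cluster in enumerate(clusters):
        if len(cluster) >= min_size:
            # All records of a selected cluster receive the same generated label.
            training.extend((record, f"author_{k}") for record in cluster)
        else:
            remaining.extend(cluster)
    return training, remaining
```

No manual labeling is involved at any point: the labels are cluster identifiers produced by the unsupervised step.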
The rest of this paper is organized as follows. In Section 2 we discuss related work. In Section 3 we describe the proposed hybrid method, SAND. In Section 4 we present the evaluation of SAND and compare its effectiveness with that of other representative methods. Finally, in Section 5 we conclude the paper.
2. RELATED WORK
The name disambiguation methods proposed in the literature adopt a wide spectrum of solutions [32] that include approaches based on manual assignment by librarians [31], collaborative efforts⁵, unsupervised techniques [4, 5, 8, 15, 16, 18, 19, 23, 25, 26, 28, 33, 34, 41] and supervised techniques [9, 14, 17, 35, 36, 37].
The unsupervised methods, i.e.,
Answered 2 days After Jul 09, 2022

Solution

Priyang Shaileshbhai answered on Jul 11 2022
LITERATURE REVIEWS
WRITEUP LIBRARY USER’S TUTORIAL SYSTEM
In their study, Vlasenko and Ivanova found that the Internet and its information channels are more than just a repository for knowledge; they are also popular modes of communication that can assist interlocutors in gaining beneficial experience. Students increasingly use digital technology to virtualize their education and work, which provides various advantages such as easy access, faster information transmission, and more opportunities for learning.    
This kind of practical training deprives education of the procedures and work principles typical of the professional environment, including those that building facilities might require.
Although the benefits of multimedia for language teaching are undeniable, there are many disadvantages as well. Many of these are related to issues rooted in language learning or the educational process itself. For example, it has been shown that students often misunderstand and misinterpret various social and production processes when performing them virtually rather than in person.
However, in today's world, where virtual conversation has become the norm, many people do not engage in discussion and end up misunderstanding each other. This is why the increasing use of virtual means of communication is ineffective — in-person communication produces more effective results in terms of professional competencies and their formation. Virtualization of the educational...