Effective Self-Training Author Name Disambiguation in Scholarly Digital Libraries

Question

Anderson A. Ferreira, Adriano Veloso, Marcos André Gonçalves, Alberto H. F. Laender
Departamento de Ciência da Computação, Universidade Federal de Minas Gerais
XXXXXXXXXX Belo Horizonte, Brazil
{ferreira, adrianov, mgoncalv, laender}@dcc.ufmg.br

ABSTRACT
Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples in order to distinguish ambiguous author names are among the most effective solutions for the problem, but they require skilled human annotators in a laborious and continuous process of manually labeling citations in order to provide enough training examples. Thus, such systems urgently need to address the issues of (i) automatic acquisition of examples and (ii) highly effective disambiguation even when only a few examples are available. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that deals with these two issues. The first step eliminates the need for any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. The second step uses a supervised disambiguation method that is able to detect unseen authors not included in any of the given training examples.
Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information Retrieval; I.5.2 [Pattern Recognition]: Classifier design and evaluation

General Terms
Algorithms, Experimentation

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
JCDL’10, June 21–25, 2010, Gold Coast, Queensland, Australia.
Copyright 2010 ACM XXXXXXXXXX/10/06 ...$10.00.

Keywords
Name Disambiguation, Bibliographic Citations

1. INTRODUCTION
Several scholarly digital libraries (DLs), such as DBLP¹, CiteSeer², MEDLINE³ and BDBComp⁴, provide features and services that facilitate literature research and discovery as well as other types of functionality. Such systems may list millions of bibliographic citation records (here understood as a set of bibliographic features such as author and coauthor names, work title and publication venue title, of a particular publication) and have become an important source of information for academic communities since they allow the search and discovery of relevant publications in a centralized way.
Also, studies about the DL content can lead to interesting results such as coverage of topics, tendencies, quality and impact of publications of a specific sub-community or individuals, patterns of collaboration in social networks, etc. These types of analysis and information, which are used, for example, by funding agencies for decisions on grants and for individual's promotions, presuppose high quality content [20, 22].

Citation management within DLs involves a number of tasks. One in particular, author name disambiguation, has required a lot of attention from the DL research community due to its inherent difficulty. Specifically, name ambiguity is a problem which occurs when a set of citation records contains ambiguous author names (the same author may appear under distinct names, or distinct authors may have similar names). This problem may be caused by a number of reasons, including the lack of standards and common practices, and the decentralized generation of content (e.g., by means of automatic harvesting).

The name disambiguation task may be formulated as follows. Let C = {c1, c2, ..., ck} be a set of citation records. For each author in a citation record ci, an authorship record ri is created to represent his/her participation in that citation. The objective is to produce a disambiguation function which is used to partition the set of authorship records into n sets {a1, a2, ..., an}, so that each partition ai contains (all and ideally only all) the authorship records in which the ith author appears.

¹ http://dblp.uni-trier.de
² http://citeseer.ist.psu.edu
³ http://medline.cos.com
⁴ http://www.lbd.dcc.ufmg.br/bdbcomp

To disambiguate the bibliographic citations of a digital library, first we may split the set of authorship records into groups of ambiguous authors, called ambiguous groups (i.e., groups of citations having authors with similar names). The ambiguous groups may be obtained, for instance, by using a blocking method [27].
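As a rough illustration of how ambiguous groups might be formed, the sketch below blocks authorship records on a key built from the first initial and the last name. The key function, the record layout (a dict with an `author` field) and the sample names are assumptions made for this example; they are not the blocking method of [27].

```python
from collections import defaultdict


def blocking_key(author_name):
    """Hypothetical blocking key: first initial plus last name, lowercased.

    Assumes the name is non-empty and written as "First [Middle] Last".
    """
    parts = author_name.replace(".", " ").split()
    return f"{parts[0][0]} {parts[-1]}".lower()


def ambiguous_groups(authorship_records):
    """Group authorship records whose author names share a blocking key."""
    groups = defaultdict(list)
    for record in authorship_records:
        groups[blocking_key(record["author"])].append(record)
    return dict(groups)


# Invented toy records, for illustration only.
records = [
    {"author": "A. Gupta", "title": "Toy title 1"},
    {"author": "Ajay Gupta", "title": "Toy title 2"},
    {"author": "Anoop Gupta", "title": "Toy title 3"},
    {"author": "Marcos André Gonçalves", "title": "Toy title 4"},
]

groups = ambiguous_groups(records)
for key, members in groups.items():
    print(key, [m["author"] for m in members])
```

Only records inside the same group then need to be compared with each other, so the number of pairwise comparisons depends on group sizes rather than on the size of the whole collection.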
Blocking methods address scalability issues, avoiding the need for comparisons among all authorship records.

The challenges of dealing with name ambiguity in bibliographic DLs have led to a myriad of disambiguation methods [4, 5, 8, 9, 14, 15, 16, 17, 18, 19, 23, 25, 26, 27, 28, 33, 34, 35, 41]. However, despite the fact that most of these methods were demonstrated to be relatively effective (in terms of error rate or similar metrics), none of them provides a perfect and final solution for the problem (i.e., they produce errors). Existing disambiguation methods usually follow either an unsupervised or a supervised approach. In the former case, the methods exploit similarities between authorship records in order to place in the same group those records that belong to the same author. In the latter case, the methods exploit a set of training examples, from which a disambiguation function is derived and then used to place authorship records in the corresponding group.

Supervised methods are usually the most effective ones for name disambiguation. In more detail, we are given as input a set of authorship records called the training data (denoted as D) that consists of examples or, more specifically, records for which the correct authorship is known. Each example is composed of a set F of m features {f1, f2, ..., fm} along with a special variable called the author. This author variable draws its value from a discrete set of labels {a1, a2, ..., an}, where each label uniquely identifies an author. The training examples are used to produce a disambiguation function (i.e., the disambiguator) that relates the features in the training examples to the correct author. The test set (denoted as T) for the disambiguation task consists of a set of authorship records for which the features are known while the correct author is unknown. The disambiguator, which is a function from {f1, f2, ..., fm} to {a1, a2, ..., an}, is used to predict the correct author for the records in the test set.
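The formulation above can be made concrete with a deliberately simple sketch: a disambiguator learned from D by counting how often each feature co-occurs with each author label, and applied to a test record by summing those counts. This voting scheme, the feature strings and the labels are assumptions chosen for illustration; it is not the associative method that SAND uses.

```python
from collections import Counter


def train(training_data):
    """Build per-feature label counts from D.

    training_data: list of (features, author_label) pairs,
    where features is a set of strings.
    """
    votes = {}
    for features, author in training_data:
        for feature in features:
            votes.setdefault(feature, Counter())[author] += 1
    return votes


def disambiguate(votes, features):
    """Map a feature set to the author label with the most votes.

    Returns None when none of the features occurred in the training data.
    """
    tally = Counter()
    for feature in features:
        tally.update(votes.get(feature, {}))
    if not tally:
        return None
    return tally.most_common(1)[0][0]


# Invented toy training data D: (features, author label).
D = [
    ({"coauthor:silva", "venue:jcdl"}, "a1"),
    ({"coauthor:costa", "venue:jcdl"}, "a1"),
    ({"coauthor:smith", "venue:sigmod"}, "a2"),
]
model = train(D)

# Invented toy test set T: features known, author unknown.
T = [
    {"coauthor:silva", "venue:jcdl"},
    {"coauthor:smith"},
    {"coauthor:nobody"},
]
predictions = [disambiguate(model, features) for features in T]
print(predictions)
```

The third test record shares no feature with D, so this sketch returns `None` for it instead of forcing one of the known labels.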
In this context, the disambiguator essentially divides the records in T into n sets {a1, a2, ..., an}, where ai contains (ideally all and only all) the authorship records in which the ith author is included. Alternatively, the disambiguator may take as input a pair of authorship records and output a binary decision on whether these records belong to the same author.

Although successful cases of the application of supervised methods have been reported [9, 14, 17, 36, 35, 37], the acquisition of training examples requires skilled human annotators to manually label authorship records. DLs are very dynamic systems, thus manual labeling of large volumes of examples is unfeasible. Further, the disambiguation task presents nuances that impose the need for methods with specific abilities. For instance, since it is not reasonable to assume that all possible authors are included in the training data, disambiguation methods must be able to detect unseen authors, for whom no label was previously assigned (i.e., there are no authorship records for these authors in the training data).

Unsupervised methods, on the other hand, require no manual labeling effort, since they simply group authorship records into clusters by maximizing intra-cluster similarity while minimizing inter-cluster similarity. Obviously, the choice of a proper similarity measure is of paramount importance, and a natural choice is to employ similarity measures based on highly discriminative features, such as coauthor names. In this case, the resulting clusters are very likely to be pure, in the sense that each cluster is likely to contain only authorship records of the same author. The drawback, however, is that some authors are likely to have their authorship records fragmented into several (pure) clusters, compromising the effectiveness of unsupervised methods.

In this paper, we propose a hybrid disambiguation method, which will hereafter be referred to as SAND (standing for Self-training Associative Name Disambiguator).
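The purity and fragmentation trade-off of coauthor-based clustering can be seen in a small sketch. It assumes a very strict similarity rule, namely that two authorship records of the same ambiguous group are linked when they share at least one coauthor name, and merges linked records with a union-find structure. The rule and the sample data are illustrative assumptions, not the clustering procedure used by SAND.

```python
def cluster_by_coauthors(records):
    """Cluster authorship records that share at least one coauthor name.

    records: list of sets of coauthor names, one set per authorship record.
    Returns a sorted list of clusters, each a list of record indices.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # coauthor name -> index of first record containing it
    for idx, coauthors in enumerate(records):
        for name in coauthors:
            if name in first_seen:
                parent[find(idx)] = find(first_seen[name])
            else:
                first_seen[name] = idx

    clusters = {}
    for idx in range(len(records)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Invented toy data: coauthor sets for four records of one ambiguous group.
coauthor_sets = [
    {"b. lee", "c. kim"},  # record 0
    {"c. kim"},            # record 1 shares "c. kim" with record 0
    {"d. park"},           # record 2 shares no coauthor with the others
    set(),                 # record 3 is a single-author paper
]
clusters = cluster_by_coauthors(coauthor_sets)
print(clusters)
```

Records 0 and 1 end up together, which is likely a pure cluster. Records 2 and 3 stay in singleton clusters even if they belong to the same author as records 0 and 1, which is the fragmentation effect described above.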
SAND exploits the strengths of both unsupervised and supervised methods. Specifically, it works in two steps. In the unsupervised step, recurring patterns in the coauthorship graph are exploited in order to produce pure clusters of authorship records. Then, in the supervised step, a subset of the extracted clusters is provided as training data, from which a disambiguation function is derived. The final result is a highly effective and extremely practical disambiguator, as will be shown in a set of experiments using citation records extracted from the DBLP and BDBComp collections. The results show that SAND outperforms unsupervised methods by more than 27% on the DBLP collection and 4% on the BDBComp collection. Improvements when compared against supervised methods are also reported.

The rest of this paper is organized as follows. In Section 2 we discuss related work. In Section 3 we describe the proposed hybrid method, SAND. In Section 4 we present the evaluation of SAND and compare its effectiveness with the effectiveness provided by other representative methods. Finally, in Section 5 we conclude the paper.

2. RELATED WORK
The name disambiguation methods proposed in the literature adopt a wide spectrum of solutions [32] that include approaches based on manual assignment by librarians [31], collaborative efforts⁵, unsupervised techniques [4, 5, 8, 15, 16, 18, 19, 23, 25, 26, 28, 33, 34, 41] and supervised techniques [9, 14, 17, 36, 35, 37].

The unsupervised methods, i.e.,

Priyang Shaileshbhai · Accepted Answer

LITERATURE REVIEWS 
WRITEUP LIBRARY USER’S TUTORIAL SYSTEM 
In their study, Vlasenko and Ivanova found that the Internet and its information channels are more than just a repository for knowledge; they are also popular modes of communication that can assist interlocutors in gaining beneficial experience. Students increasingly use digital technology to virtualize their education and work, which provides various advantages such as easy access, faster information transmission, and more opportunities for learning.	
Training of this kind, however, deprives education of the procedures and working principles typical of a professional environment, including those that building facilities might require.
Although the benefits of multimedia for language teaching are undeniable, there are many disadvantages as well. Many of these are related to issues rooted in language learning or the educational process itself. For example, it has been shown that students often misunderstand and misinterpret various social and production processes when performing them virtually rather than in person.
However, in today's world, where virtual conversation has become the norm, many people do not engage in real discussion and end up misunderstanding each other. For this reason, the increasing use of virtual means of communication is less effective: in-person communication produces better results in forming professional competencies.
