
Data sets for author name disambiguation: an empirical analysis and a new resource. (English) Zbl 1378.62150

Summary: Data sets of publication metadata with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far, and compare their respective advantages and shortcomings. From the results of this review, we derive a set of general requirements for future AND data sets. These include both trivial requirements, such as the absence of errors and the preservation of author order, and more substantial ones, such as full disambiguation and adequate representation of publications with a small number of authors and highly variable author names. On the basis of these requirements, we create and make publicly available a new AND data set, SCAD-zbMATH. Both the quantitative analysis of this data set and the results of our initial AND experiments with a naive baseline algorithm show the SCAD-zbMATH data set to be considerably different from existing ones. We consider it a useful new resource that will challenge the state of the art in AND and benefit the AND research community.
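The summary does not say which naive baseline algorithm was used. As an illustration only, the sketch below shows one common kind of naive AND baseline: authorship records are merged whenever they share a surname and first initial, and the result is scored against a manually disambiguated gold standard using pairwise precision, recall, and F1. The function names, the "Surname, Given" name format, and the toy records are all assumptions made for this example; none of them come from the paper or from SCAD-zbMATH.

```python
from collections import defaultdict
from itertools import combinations


def name_key(name):
    """Reduce 'Surname, Given Names' to a lowercase 'surname f' key."""
    surname, _, given = name.partition(",")
    initial = given.strip()[:1]
    return f"{surname.strip()} {initial}".lower()


def naive_baseline(records):
    """Cluster record ids by surname plus first initial (hypothetical baseline).

    `records` maps a record id to an author name string.
    """
    clusters = defaultdict(set)
    for record_id, name in records.items():
        clusters[name_key(name)].add(record_id)
    return list(clusters.values())


def same_author_pairs(clusters):
    """Return all unordered pairs of record ids that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs


def pairwise_f1(predicted, gold):
    """Pairwise precision, recall, and F1 of predicted vs. gold clusters."""
    pred_pairs = same_author_pairs(predicted)
    gold_pairs = same_author_pairs(gold)
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(gold_pairs) if gold_pairs else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented toy data: four authorship records and a hand-made gold standard.
records = {
    "r1": "Mueller, Anna",
    "r2": "Mueller, A.",
    "r3": "Mueller, Andreas",
    "r4": "Schmidt, Bernd",
}
gold = [{"r1", "r2"}, {"r3"}, {"r4"}]

predicted = naive_baseline(records)
precision, recall, f1 = pairwise_f1(predicted, gold)
```

On this toy input the baseline merges all three "Mueller, A…" records into one author, so recall is perfect but precision suffers. This is the typical failure mode of initials-based merging when distinct authors share a surname and initial.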

MSC:

62P99 Applications of statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68U35 Computing methodologies for information systems (hypertext navigation, interfaces, decision support, etc.)
91D30 Social networks; opinion dynamics
