Publication:
An improved method of locality-sensitive hashing for scalable instance matching

dc.contributor.authorAydar, Mehmet
dc.contributor.authorAyvaz, Serkan
dc.contributor.institutionAydar, Mehmet, Department of Computer Science, Kent State University, Kent, United States
dc.contributor.institutionAyvaz, Serkan, Department of Software Engineering, Bahçeşehir Üniversitesi, Istanbul, Turkey
dc.date.accessioned2025-10-05T16:01:21Z
dc.date.issued2019
dc.description.abstractIn this study, we propose a scalable approach for automatically identifying similar candidate instance pairs in very large datasets. Efficient candidate pair generation is essential to many computational problems that involve calculating instance similarities. Calculating similarities of instances with a large number of properties and efficiently matching a large number of similar instances in a scalable way are two significant bottlenecks of candidate instance pair generation. In our approach, we utilize the locality-sensitive hashing (LSH) technique to greatly improve the scalability of candidate instance pair generation. Based on the candidate similarity threshold, our algorithm automatically discovers the optimum number of hash functions in each band in LSH. Moreover, we evaluated the scalability of our approach and its effectiveness in the instance matching task using very large real-world datasets. © 2021 Elsevier B.V., All rights reserved.
dc.identifier.doi10.1007/s10115-018-1199-5
dc.identifier.endpage294
dc.identifier.issn02193116
dc.identifier.issn02191377
dc.identifier.issue2
dc.identifier.scopus2-s2.0-85046024215
dc.identifier.startpage275
dc.identifier.urihttps://doi.org/10.1007/s10115-018-1199-5
dc.identifier.urihttps://hdl.handle.net/20.500.14719/11220
dc.identifier.volume58
dc.language.isoen
dc.publisherSpringer London
dc.relation.sourceKnowledge and Information Systems
dc.subject.authorkeywordsCandidate Pairs Generation
dc.subject.authorkeywordsInstance Matching
dc.subject.authorkeywordsInstance Similarity
dc.subject.authorkeywordsLocality-sensitive Hashing
dc.subject.authorkeywordsScalability
dc.subject.authorkeywordsHash Functions
dc.subject.authorkeywordsCalculating Similarities
dc.subject.authorkeywordsComputational Problem
dc.subject.authorkeywordsLocality Sensitive Hashing
dc.subject.authorkeywordsScalable Approach
dc.subject.authorkeywordsSimilarity Threshold
dc.subject.authorkeywordsLarge Dataset
dc.subject.indexkeywordsHash functions
dc.subject.indexkeywordsScalability
dc.subject.indexkeywordsCalculating similarities
dc.subject.indexkeywordsCandidate Pairs Generation
dc.subject.indexkeywordsComputational problem
dc.subject.indexkeywordsInstance matching
dc.subject.indexkeywordsInstance Similarity
dc.subject.indexkeywordsLocality sensitive hashing
dc.subject.indexkeywordsScalable approach
dc.subject.indexkeywordsSimilarity threshold
dc.subject.indexkeywordsLarge dataset
dc.titleAn improved method of locality-sensitive hashing for scalable instance matching
dc.typeArticle
dspace.entity.typePublication
local.indexed.atScopus
person.identifier.scopus-author-id57063196900
person.identifier.scopus-author-id56676074300
