Papers I've Read
Canonicalization of Database Records using Adaptive Similarity Measures
In Proc. Knowledge Discovery and Data Mining (KDD), 2007
It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is very little existing work on canonicalization. In this paper, we explore the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore, because the user may prefer different styles of canonicalization, we show how different edit distance costs can result in different forms of canonicalization. For example, reducing the cost of character deletions can result in representations that favor abbreviated forms over expanded forms (e.g. KDD versus Conference on Knowledge Discovery and Data Mining). We describe how to learn these costs from a small amount of manually annotated data using stochastic hill-climbing. Additionally, we investigate feature-based methods to learn ranking preferences over canonicalizations. We empirically evaluate our approach on a real-world publications database and show that our learning method results in a canonicalization solution that is robust to errors and easily customizable to user preferences.
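The idea of a "central" record under adjustable edit costs can be illustrated with a small sketch. This is not the paper's implementation: the function names, the cost parameters, the choice to measure the cost of editing each record into the candidate, and the toy records are all assumptions made for illustration, and the costs here are set by hand rather than learned.

```python
from typing import Sequence


def weighted_edit_distance(source: str, target: str,
                           ins_cost: float = 1.0,
                           del_cost: float = 1.0,
                           sub_cost: float = 1.0) -> float:
    """Cost of editing `source` into `target` with per-operation costs.

    Deletion removes a character from `source`; insertion adds a
    character of `target`. (Hypothetical helper, not from the paper.)
    """
    # prev[j] = cost of turning the processed prefix of source into target[:j]
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        curr = [i * del_cost]
        for j, t_ch in enumerate(target, start=1):
            match_or_sub = prev[j - 1] + (0.0 if s_ch == t_ch else sub_cost)
            delete = prev[j] + del_cost
            insert = curr[j - 1] + ins_cost
            curr.append(min(match_or_sub, delete, insert))
        prev = curr
    return prev[-1]


def canonical_record(records: Sequence[str], **costs: float) -> str:
    """Pick the record that is cheapest to reach from all records combined."""
    def total_cost(candidate: str) -> float:
        return sum(weighted_edit_distance(r, candidate, **costs)
                   for r in records)
    return min(records, key=total_cost)


# Toy records (assumed, not from the paper's data set).
records = ["KDD", "KDD Conference", "KDD Conf"]
uniform = canonical_record(records)
abbreviated = canonical_record(records, del_cost=0.1)
expanded = canonical_record(records, ins_cost=0.1)
```

With uniform costs the middle form "KDD Conf" is selected; lowering the deletion cost shifts the choice to "KDD", and lowering the insertion cost shifts it to "KDD Conference", which mirrors the abbreviated-versus-expanded preference described in the abstract.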
FACTORIE: Efficient Probabilistic Programming for Relational Factor Graphs via Imperative Declarations of Structure, Inference and Learning
Neural Information Processing Systems (NIPS) workshop on Probabilistic Programming, Vancouver, 2008
A Unified Approach for Schema Matching, Coreference, and Canonicalization
In the 14th ACM SIGKDD international conference on Knowledge Discovery and Data Mining (KDD), Las Vegas, Nevada, 2008
The automatic consolidation of database records from many heterogeneous sources into a single repository requires solving several information integration tasks. Although tasks such as coreference, schema matching, and canonicalization are closely related, they are most commonly studied in isolation. Systems that do tackle multiple integration problems traditionally solve each independently, allowing errors to propagate from one task to another. In this paper, we describe a discriminatively-trained model that reasons about schema matching, coreference, and canonicalization jointly. We evaluate our model on a real-world data set of people and demonstrate that simultaneously solving these tasks reduces errors over a cascaded or isolated approach. Our experiments show that a joint model is able to improve substantially over systems that either solve each task in isolation or with the conventional cascade. We demonstrate nearly a 50% error reduction for coreference and a 40% error reduction for schema matching.