Welcome to Xin (Luna) Dong's Homepage!

Data Integration

Michael Franklin, Alon Y. Halevy, and David Maier. From databases to dataspaces: A new abstraction for information management. Sigmod Record, Dec. 2005.
Alon Y. Halevy, Naveen Ashish, Dina Bitton, Michael Carey, Denise Draper, Jeff Pollock, Arnon Rosenthal, and Vishal Sikka. Enterprise information integration: successes, challenges and controversies. Sigmod 2005.

Information Integration Using Logical Views

Data Exchange

Ronald Fagin and Phokion G. Kolaitis and Renee J. Miller and Lucian Popa. Data Exchange: Semantics and Query Answering. ICDT, 2003. (First paper on data exchange)
Ronald Fagin and Phokion G. Kolaitis and Lucian Popa. Data Exchange: Getting to the Core. ACM Transactions on Database Systems, 30(1):174-201, 2005. (Must-read)
Ariel Fuxman and Phokion G. Kolaitis and Renee J. Miller and Wang Chiew Tan. Peer data exchange. PODS, 2005.
Phokion G. Kolaitis and Jonathan Panttaja and Wang Chiew Tan. The complexity of data exchange. PODS, 2006.
Georg Gottlob and Alan Nash. Data Exchange: Computing Cores in Polynomial Time. PODS, 2006.
Leonid Libkin. Data exchange and incomplete information. PODS, 2006.

Schema Matching

Survey
- E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal,
  10(4):334-350, 2001. (Must-read)
- Pavel Shvaiko. A classification of schema-based matching approaches. Unpublished.
Element-level Matching
- Schema name & description
  - P. Mitra, G. Wiederhold, and J. Jannink. Semi-automatic integration of knowledge sources. Proc. of Fusion, 1999.
  - L. Palopoli, D. Sacca, and D. Ursino. Semi-automatic, semantic discovery of properties from database schemas. IDEAS, 244-253, 1998.
  - C. Clifton, E. Housman, and A. Rosenthal. Experience with a combined approach to attribute-matching across heterogeneous databases. Proc. 7, IFIP 2.6 Working Conf. Database Semantics, 1997.
  - D. W. Embley. Multifaceted exploitation of metadata for attribute match discovery in information
    integration. In WIIW, 2001.
- Instance
  - A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to map between ontologies on the
    semantic web. In Proc. of the Int. WWW Conf., 2002.
- Constraint
  - P. Mitra, G. Wiederhold, and M. Kersten. A graph-oriented model for articulation of ontology interdependencies. In Proc. of Extending Database Technology, 2000.
  - S. Bergamaschi, S. Castano, M. Vincini, and D. Beneventano. Semantic integration of heterogeneous
    information sources. Data & Knowledge Engineering, 36(3), 2001.
  - J. Kang and J. Naughton. On schema matching with opaque column names and data values. In Proc. of SIGMOD, 2003.
Structure-level Matching
- T. Milo and S. Zohar. Using schema matching to simplify heterogeneous data translation. In Proceedings
  of the International Conference on Very Large Databases (VLDB), 1998.
- L. Palopoli, D. Sacca, D. Ursino. An automatic technique for detecting type conflicts in database schemas. CIKM, 306-313, 1998.
- J. Madhavan, P. Bernstein, and E. Rahm. Generic schema matching with Cupid. In Proceedings of the
  International Conference on Very Large Databases (VLDB), 2001.
- S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm.
  In Proc. of ICDE, 2002.
- B. S. Lerner. A model for compound type changes encountered in schema evolution. ACM TODS 25(1):83-127, 2000.
- K. Zhang and D. Shasha. Approximate tree pattern matching. Pattern matching in strings, trees, and arrays, 341-371, 1997.
- D. Calvanese, S. Castano, F. Guerra, D. Lembo, M. Melchiorri, G. Terracina, D. Ursino, and M. Vincini.
  Towards a Comprehensive Framework for Semantic Integration of Highly Heterogeneous Data Sources.
  In Proc. of the 8th Int. Workshop on Knowledge Representation meets Databases (KRDB2001), 2001.
- L. Xu and D. Embley. Discovering Direct and Indirect Matches for Schema Elements. In DASFAA,
  2003.
Combining Matchers
- A. Doan, P. Domingos, and A. Halevy. Reconciling schemas of disparate data sources: a machine
  learning approach. In Proc. of SIGMOD, 2001.
- H.-H. Do and E. Rahm. COMA - A System for Flexible Combination of Schema Matching Approaches.
  In Proc. of VLDB, 2002.
Cluster-based Matching
- W. Wu, C. Yu, A. Doan, and W. Meng. An interactive clustering-based approach to integrating source
  query interfaces on the deep web. In Proc. of SIGMOD, 2004.
- B. He and K. C.-C. Chang. Statistical schema integration across the deep web. In Proc. of SIGMOD,
  2003.
- W. Li and C. Clifton. SemInt: a tool for identifying attribute correspondences in heterogeneous
  databases using neural networks. Data & Knowledge Engineering, 33(1), 2000.
- S. Castano, V. De Antonellis, and S. De Capitani di Vimercati. Global viewing of heterogeneous data sources. IEEE Trans Data Knowl Eng 13(2):277-297, 2001.
Learn from Previous Matching
- J. Madhavan, P. Bernstein, A. Doan, and A. Halevy. Corpus-based schema matching. In Proc. of ICDE,
  2005.
- Jayant Madhavan, Philip A. Bernstein, Kuang Chen, Alon Halevy, and Pradeep Shenoy. Corpus-based Schema Matching. In Workshop on Information Integration on the Web at IJCAI, 2003.
Query Discovery
- R. J. Miller, L. M. Haas, and M. A. Hernandez. Schema mapping as query discovery. In VLDB, 2000.

Meta Data Management

Meta Data Applications
- Lucian Popa, Yannis Velegrakis, Renee J. Miller, Mauricio A. Hernández, Ronald Fagin: Translating Web Data, VLDB 2002.
- Stefano Spaccapietra, Christine Parent: View Integration: A Step Forward in Solving Structural Conflicts. TKDE 6(2): 258-274 (1994)
Data Models
- Natalya F. Noy, Mark A. Musen, Jose L.V. Mejino, and Cornelius Rosse: Pushing the Envelope: Challenges in a Frame-Based Representation of Human Anatomy. SMI Report Number: SMI-2002-0925, http://smi-web.stanford.edu/pubs/SMI_Abstracts/SMI-2002-0925.html.
- Richard Hull: Relative Information Capacity of Simple Relational Database Schemata. PODS 1984: 97-109
Mechanisms
- Paolo Atzeni, Riccardo Torlone: Management of Multiple Models in an Extensible Database Design Tool. EDBT 1996: 79-95.
- Philip A. Bernstein: Applying Model Management to Classical Meta Data Problems. Submitted for publication.
- Peter Buneman, Susan B. Davidson, Anthony Kosky: Theoretical Aspects of Schema Merging. EDBT, 152-167, 1992.


Object Matching (a.k.a Record Linkage)

History and overview:
- Origination: H. Newcombe, J. Kennedy, S. Axford, and A. James. Automatic linkage of vital records. Science, 130(3381):954-959, 1959.
- First formalization: Ivan Fellegi and Alan Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183-1210, 1969.
- Survey:
  - William Winkler. Overview of record linkage and current research directions. Technical Report, Statistical Research Division, U.S. Bureau of the Census, 2006.
  - Lifang Gu, Rohan Baxter, Deanne Vickers, and Chris Rainsford. Record Linkage: Current Practice and Future Directions. Unpublished, 2004.
  - M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems Special Issue on Information Integration on the Web, September 2003. (Must-read)
  - Mohamed G. Elfeky, Vassilios S. Verykios and Ahmed K. Elmagarmid. TAILOR: A record linkage Toolbox.
Field-wise Matching (String comparison)
- Survey: William Cohen, Pradeep Ravikumar and Stephen Fienberg. A Comparison of String Distance Metrics for Name-Matching Tasks. In Workshop on Information Integration on the Web (IIW), at IJCAI 2003. (Must-read)
- William Winkler and Edward Porter. Approximate String Comparison and its effect on an Advanced Record Linkage System. Technical report, Statistical Research Division, U.S. Bureau of the Census, 1997.
- Adaptive string matching
  - Mikhail Bilenko and Raymond Mooney. Adaptive Duplicate Detection Using Learnable String Similarity Measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp.39-48, Washington, DC, August 2003.
  - S. Tejada, C. Knoblock, and S. Minton. Learning domain-independent string transformation weights for high accuracy object identification. In SIGKDD, 2002.
Record-wise Matching
- Rule-based:
  - H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: language, model, and algorithms. In VLDB, pages 371-380, 2001.
  - L. Jin, C. Li, and S. Mehrotra. Efficient Record Linkage in Large Data Sets. In DASFAA, 2003.
  - M. L. Lee, T. W. Ling, and W. L. Low. Intelliclean: a knowledge-based intelligent data cleaner. In SIGKDD, pages 290-294, 2000.
- EM Method:
  - William Winkler. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. Technical Report RR2000/05, Statistical Research Division, Bureau of Census, 2000.
  - William Winkler: Advanced methods for record linkage. Technical Report, 1994.
- Learning:
  - Jose C. Pinheiro and Don X. Sun. Methods for linking and mining massive heterogeneous databases. AAAI, 1998.
  - W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration, 2002.
  - Sunita Sarawagi and Anuradha Bhamidipaty. Interactive Deduplication using Active Learning. In Proceedings of the ACM SIGKDD, 2002.
  - Decision Tree: S. Tejada, C. Knoblock, and S. Minton: Learning domain-independent string transformation weights for high accuracy object identification. In SIGKDD, 2002.
  - Bayes and SVM: S. Sarawagi and A. Bhamidipaty: Interactive deduplication using active learning. In SIGKDD, 2002.
- Secondary Knowledge
  - A. Doan, Y. Lu, Y. Lee, and J. Han. Object matching for information integration: a profiler-based approach. In IIWeb, 2003.
  - X. Dong and A. Halevy. A Platform for Personal Information Management and Integration. In Proc. of CIDR, 2005.
  - M. Michalowski, S. Thakkar, and C. A. Knoblock. Exploiting secondary sources for unsupervised record linkage. In IIWeb, 2004.
Collective Model
- William Cohen, David McAllester, and Henry Kautz. Hardening Soft Information Sources. In Proceedings of ACM SIGKDD, 2000, 255-259.
- Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, and Ilya Shpitser. Identity Uncertainty and Citation Matching. In Proceedings of the International Conference on Advances in Neural Information Processing Systems (NIPS) 15, 2003.
- Andrew McCallum and Ben Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. IJCAI 2003.
- Parag and P. Domingos. Multi-relational record linkage. In MRDM, 2004.
- R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. In Proc. of VLDB, 2002.
- I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In DMKD, 2004.
- D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain-independent data cleaning. In SIAM Data Mining (SDM), 2005.
- Xin Dong, Alon Halevy, Jayant Madhavan. Reference reconciliation in complex data spaces. In Sigmod, 2005.
Efficiency and Scalability
- Mauricio Hernandez and Salvatore Stolfo. The Merge/Purge Problem for Large Databases. In Proceedings of the ACM SIGMOD Conference, 1995.
- Andrew McCallum, Kamal Nigam and Lyle Ungar. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In Proceedings of the ACM SIGKDD, 2000. (Must-read: the classic paper on canopies)
- Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and Efficient Fuzzy Match for Online Data cleaning. In Proceedings of the ACM SIGMOD, 2003.
- Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating Fuzzy Duplicates in Data Warehouses. VLDB 2002.

Data Fusion

Survey & Tutorial
- J. Bleiholder and F. Naumann. Conflict handling strategies in an integrated information system. In WWW'06.
- J. Bleiholder and F. Naumann. Data fusion. ACM Computing Surveys, 41(1):1–41, 2008.
- Xin Luna Dong and Felix Naumann. Data fusion--Resolving data conflicts for integration. Tutorial in VLDB, 2009.
Truth Discovery
- M. Wu and A. Marian. Corroborating answers from multiple web sources. In WebDB'07. (Early ideas on conflict resolution)
- X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the web. In SIGKDD'07. (First Bayesian model considering source accuracy)
- Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Integrating conflicting data: the role of source dependence. In VLDB, 2009. (Refines Yin et al.'s Bayesian model and considers source copying)
- A. Galland, S. Abiteboul, A. Marian, and P. Senellart. Corroborating information from disagreeing views. In WSDM, 2010. (Cosine model and other models; also considers accuracy on each data item)
- A. Marian and M. Wu. Corroborating information from Web sources. IEEE Data Eng. Bull, 34(3): 11-17 (2011).
- J. Pasternack and D. Roth. Knowing what to believe (when you already know something). In COLING, pages 877–885, 2010. (Other models and a-priori knowledge of truth for a subset of data items)
- J. Pasternack and D. Roth. Making better informed trust decisions with generalized fact-finding. In IJCAI, pages 2324–2329, 2011. (Considers probabilistic claims, value similarities, group beliefs, etc.)
- X. Yin and W. Tan. Semi-supervised truth discovery. In WWW, 2011. (a-priori knowledge of truth for a subset of data items)
- Xuan Liu, Xin Luna Dong, Beng Chin Ooi, and Divesh Srivastava: Online data fusion. In VLDB'11. (First work on online fusion)
Copy Detection
- Xin Luna Dong and Divesh Srivastava. Large-Scale Copying Detection. Tutorial in Sigmod, 2011.
- Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Integrating conflicting data: the role of source dependence. In VLDB, 2009. (First work on copying detection for structured data)
- Xin Luna Dong, Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Global detection of complex copying relationships between sources. In VLDB, 2010. (Extension for global detection)
- Anish Das Sarma, Xin Luna Dong, Alon Halevy. Data integration with dependent sources. In EDBT, 2011.
Fusion and Copying in a Dynamic World
- Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Truth discovery and copying detection in a dynamic world. In VLDB, 2009.


Data Integration with Uncertainty

Probabilistic Schema Matching
- Xin Dong, Alon Halevy, and Cong Yu. Data integration with uncertainty. VLDB'07.
- Carmel Domshlak, Avigdor Gal, and Haggai Roitman. Rank aggregation for automatic schema matching. TKDE 19(4), 2007.
- Avigdor Gal. Why is schema matching tough and what can we do about it. Sigmod Record, 35(4), 2006.
- Avigdor Gal, Ateret Anaby-Tavor, Alberto Trombetta, Danilo Montesi. A framework for modeling and evaluating automatic semantic reconciliation. VLDB Journal, 2003.
- Henrik Nottelmann and Umberto Straccia. A probabilistic, logic-based framework for automated web directory alignment.
- Henrik Nottelmann and Umberto Straccia. Information retrieval and machine learning for probabilistic schema matching. Information Processing and Management 43:552-576, 2007.
Generating Probabilistic Mediated Schemas
- Anish Das Sarma, Xin Dong, and Alon Halevy. Bootstrapping pay-as-you-go data integration systems. Sigmod'08.
- M. Magnani, N. Rizopoulos, P. McBrien, and D. Montesi. Schema integration based on uncertain semantic mappings. Lecture Notes in Computer Science, 2007.
