The amount of useful information available on the Web has been growing at a dramatic pace in recent years, and people rely more and more on the Web to fulfill their information needs. However, the data provided by different Web sources often conflict, and some Web sources have quite low accuracy. Data fusion aims to resolve these conflicts and find the true values that reflect the real world. Below we list several data sets that we used in experiments on data fusion techniques.
I. Stock (Contributors: Xian Li, Kenneth B. Lyons)
We collected trading data for 1000 stock symbols from 55 sources on every work day in July 2011. A detailed description of the data can be found in:
Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. Truth Finding on the Deep Web: Is the Problem Solved? In VLDB, 2013. [PDF][Full paper]
Raw data: Data set with all provided attributes and values as extracted from the web pages, before schema mapping. [download]
Clean data: Data set after manual schema mapping. [download] The resulting schema is:
Source | Symbol | Change % | Last trading price | Open price | Change $ | Volume | Today's high | Today's low | Previous close | 52wk High | 52wk Low | Shares Outstanding | P/E | Market cap | Yield | Dividend | EPS |
Gold standards: We provide three gold standards:
Information on NASDAQ100 stocks collected from nasdaq.com. [download]
Information on NASDAQ100 stocks, obtained by taking the majority values provided by five stock data providers: nasdaq.com, Yahoo! Finance, Google Finance, Bloomberg, and MSN Finance. [download]
Information on NASDAQ100 stocks and another 100 randomly selected stocks, obtained by taking the majority values provided by the same five stock data providers: nasdaq.com, Yahoo! Finance, Google Finance, Bloomberg, and MSN Finance. [download]
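The majority-value construction used for the second and third gold standards can be sketched as follows. This is only an illustration under stated assumptions: the in-memory layout (a dictionary keyed by symbol and attribute), the function names, and the handling of ties and missing values are hypothetical and not taken from the data set or the paper; in particular, "majority" is read here as the single most frequent value, with ties left unresolved.

```python
from collections import Counter


def majority_value(values):
    """Return the most frequent non-missing value, or None on a tie or no data.

    Assumption: "majority" means the single most frequent value (plurality);
    the original data set may have used a different rule.
    """
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two values: leave unresolved
    return ranked[0][0]


def build_majority_standard(claims):
    """Map each (symbol, attribute) item to its majority value.

    claims: dict mapping (symbol, attribute) -> dict of source -> value
    (a hypothetical layout, not the format of the downloadable files).
    """
    return {
        item: majority_value(by_source.values())
        for item, by_source in claims.items()
    }
```

Items for which `majority_value` returns `None` would need a separate decision, for example manual inspection.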
II. Flight (Contributors: Xian Li, Kenneth B. Lyons)
We collected information on over 1200 flights from 38 sources over a one-month period (December 2011). A detailed description of the data can be found in:
Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. Truth Finding on the Deep Web: Is the Problem Solved? In VLDB, 2013. [PDF][Full paper]
Other papers using this data set:
Xin Luna Dong, Barna Saha, and Divesh Srivastava. Less is More: Selecting Sources Wisely for Integration. In VLDB, 2013. [PDF][Full paper]
Data: Flight data set after manual schema mapping. [download]
Source | Flight# | Scheduled departure | Actual departure | Departure gate | Scheduled arrival | Actual arrival | Arrival gate |
Gold standard: The gold standard contains departure/arrival information for 100 randomly selected flights, as provided by the corresponding airline websites. [download]
III. Book (Contributors: Xiaoxin Yin, Luna Dong)
Information on computer science books was collected from the online bookstore aggregator AbeBooks.com in 2007. In total there are 1263 books and 894 data sources (bookstores). A detailed description of the data can be found in:
Xiaoxin Yin, Jiawei Han, and Philip S. Yu. Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng., 20:796–808, 2008. [PDF]
Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Integrating conflicting data: the role of source dependence. In VLDB, 2009. [PDF][Presentation][Full paper]
Other papers using this data set:
Xin Luna Dong, Barna Saha, and Divesh Srivastava. Less is More: Selecting Sources Wisely for Integration. In VLDB, 2013. [PDF][Full paper]
Xin Luna Dong, Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Global detection of complex copying relationships between sources. In VLDB, 2010. [PDF][Full paper][Presentation]
Xin Luna Dong, Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Solomon: Seeking the truth via copying detection. Demo in VLDB, 2010. [PDF][Poster][Demo]
Data: Book data set. [download]
Source | ISBN | Title | Author list |
Gold standards: We provide a gold standard and a silver standard:
Gold standard: Precise author lists manually obtained from the book covers of 100 randomly selected books. [download]
Silver standard: Author lists for all books. When different fusion methods agreed, we used their result; otherwise, we manually obtained the author lists from the book covers (for more than 500 books). [download]
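The silver-standard procedure (accept a value when the fusion methods agree, fall back to a manual label otherwise) might look like the sketch below. The function names, the input layout (one ISBN-to-author-list dictionary per fusion method), and the normalization rule are assumptions made for illustration; how author lists were actually compared is not stated here.

```python
def normalize(authors):
    """Normalize an author list for comparison.

    Assumption: names match after trimming whitespace and lowercasing,
    and author order is significant.
    """
    return tuple(a.strip().lower() for a in authors)


def build_silver_standard(method_outputs, manual_labels):
    """Combine fusion results into a silver standard.

    method_outputs: list of dicts, one per fusion method, ISBN -> author list.
    manual_labels: dict ISBN -> author list read manually from book covers.
    Returns (silver, needs_manual): the resolved author lists and the ISBNs
    that still require a manual label.
    """
    silver, needs_manual = {}, []
    all_isbns = set().union(*(m.keys() for m in method_outputs))
    for isbn in sorted(all_isbns):
        # A method with no answer for this ISBN counts as disagreement.
        answers = {
            normalize(m[isbn]) if isbn in m else None for m in method_outputs
        }
        if len(answers) == 1 and None not in answers:
            silver[isbn] = next(iter(answers))  # all methods agree
        elif isbn in manual_labels:
            silver[isbn] = normalize(manual_labels[isbn])
        else:
            needs_manual.append(isbn)
    return silver, needs_manual
```

Returning the unresolved ISBNs separately makes it easy to see how many books still need a manual check.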
IV. Restaurant (Contributors: Laure Berti-Equille, Luna Dong)
We collected listings of Manhattan restaurants from 12 websites weekly from January to March 2009. A detailed description of the data can be found in:
Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Truth discovery and copying detection in a dynamic world. In VLDB, 2009. [PDF][Presentation][Full paper]
Data: 12 snapshots of restaurants. [download]
Source | Restaurant name | Address |
Gold standard: We called the 467 restaurants that were removed by some websites to determine whether each business was still open. For each of these restaurants, the gold standard records whether the business is still open ("Y"). [download]
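One way to find the restaurants that were removed by some website is to compare each source's listing across the weekly snapshots. The sketch below assumes the snapshots have already been loaded into memory as a chronological list of dictionaries mapping a source to the set of restaurant names it lists; this layout and the function name are hypothetical, and matching restaurants by exact name is a simplification.

```python
def removed_restaurants(snapshots):
    """Return restaurants that some source listed earlier and later dropped.

    snapshots: chronological list of dicts, source -> set of restaurant names
    (a hypothetical in-memory layout, not the format of the data files).
    A source that is absent from a snapshot is skipped rather than treated
    as having removed all of its restaurants.
    """
    removed = set()
    seen = {}  # source -> every restaurant that source has listed so far
    for snapshot in snapshots:
        for source, listed in snapshot.items():
            previously = seen.setdefault(source, set())
            # Listed by this source before, but missing from it now.
            removed |= previously - listed
            previously |= listed
    return removed
```

A restaurant that a source drops and later lists again is still reported, since it was removed at some point.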
V. Weather (Contributors: Laure Berti-Equille, Luna Dong)
We collected weather data for 30 major US cities from 18 websites every 45 minutes over one day in March 2010. A detailed description of the data can be found in:
Xin Luna Dong, Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Global detection of complex copying relationships between sources. In VLDB, 2010. [PDF][Full paper][Presentation]
Data: Data from 18 sources. The schema is provided at the beginning of each data file. [download]
Gold standards: A gold standard and a silver standard on source copying relationships are provided in the paper.