Data Sets for Data Fusion Experiments

The amount of useful information available on the Web has grown dramatically in recent years, and people rely more and more on the Web to fulfill their information needs. However, data provided by different Web sources often conflict, and some Web sources have quite low accuracy. Data fusion aims to resolve these conflicts and find the truth that reflects the real world. Below we list several data sets that we used for experiments on data fusion techniques.

I. Stock (Contributors: Xian Li, Kenneth B. Lyons)

We collected trading data for 1000 stock symbols from 55 sources on every work day in July 2011. A detailed description of the data can be found in:

Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. Truth Finding on the Deep Web: Is the Problem Solved? In VLDB, 2013. [PDF][Full paper]

Attributes: Source, Symbol, Change %, Last trading price, Open price, Change $, Volume, Today's high, Today's low, Previous close, 52wk High, 52wk Low, Shares Outstanding, P/E, Market cap, Yield, Dividend, EPS
  1. Information on NASDAQ100 stocks collected from nasdaq.com. [download]

  2. Information on NASDAQ100 stocks collected by taking the majority values provided by five stock data providers: nasdaq.com, Yahoo Finance, Google Finance, Bloomberg, and MSN Finance. [download]

  3. Information on NASDAQ100 stocks and another 100 randomly selected stocks collected by taking the majority values provided by five stock data providers: nasdaq.com, Yahoo Finance, Google Finance, Bloomberg, and MSN Finance. [download]
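Items 2 and 3 above rely on taking the majority value across five providers. A minimal sketch of that idea is shown here. The dictionary layout, the function name `majority_value`, the handling of ties and missing values, and the sample prices are all assumptions made for illustration; they are not taken from the data set files or from the paper.

```python
from collections import Counter

def majority_value(claims):
    """Return the value reported by the most sources.

    `claims` maps a source name to the value it reports for one
    attribute of one stock (hypothetical layout). Missing values are
    given as None and ignored. Returns None when there are no values
    or when the top two values are tied (an assumed tie policy).
    """
    counts = Counter(v for v in claims.values() if v is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no single majority value
    return ranked[0][0]

# Made-up last trading prices for a single symbol, for illustration only.
claims = {
    "nasdaq.com": "25.10",
    "yahoo finance": "25.10",
    "google finance": "25.10",
    "bloomberg": "25.11",
    "MSN finance": "25.10",
}
print(majority_value(claims))
```

Values are compared as exact strings here; real stock data would probably need some normalization (rounding, unit handling) before voting, which this sketch leaves out.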

 

II. Flight (Contributors: Xian Li, Kenneth B. Lyons)

We collected information on over 1200 flights from 38 sources over a one-month period (December 2011). A detailed description of the data can be found in:

Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. Truth Finding on the Deep Web: Is the Problem Solved? In VLDB, 2013. [PDF][Full paper]

Other papers using this data set:

Xin Luna Dong, Barna Saha, and Divesh Srivastava. Less is More: Selecting Sources Wisely for Integration. In VLDB, 2013. [PDF][Full paper]

Attributes: Source, Flight#, Scheduled departure, Actual departure, Departure gate, Scheduled arrival, Actual arrival, Arrival gate

 

III. Book (Contributors: Xiaoxin Yin, Luna Dong)

Information on Computer Science books was collected from the online bookstore aggregator AbeBooks.com in 2007. There are 1263 books and 894 data sources (bookstores) in total. A detailed description of the data can be found in:

Xiaoxin Yin, Jiawei Han, and Philip S. Yu. Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng., 20:796–808, 2008. [PDF]
Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Integrating conflicting data: the role of source dependence. In VLDB, 2009. [PDF][Presentation][Full paper]

Other papers using this data set:

Xin Luna Dong, Barna Saha, and Divesh Srivastava. Less is More: Selecting Sources Wisely for Integration. In VLDB, 2013. [PDF][Full paper]
Xin Luna Dong, Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Global detection of complex copying relationships between sources. In VLDB, 2010. [PDF][Full paper][Presentation]
Xin Luna Dong, Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Solomon: Seeking the truth via copying detection. Demo in VLDB, 2010. [PDF][Poster][Demo]

Attributes: Source, ISBN, Title, Author list
  1. Gold standard: Precise author lists manually obtained from book covers for 100 randomly selected books. [download]

  2. Silver standard: Author lists for all books. When different fusion methods reached agreement, we used their result; otherwise, we manually obtained the author lists from book covers (for more than 500 books). [download]
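The silver-standard rule above (accept the result when the fusion methods agree, otherwise check by hand) can be sketched roughly as follows. The function name `silver_label`, the method names, the tuple representation of author lists, and the sample authors are hypothetical; the sketch also assumes that agreement means exact equality across all methods, which the description does not spell out.

```python
def silver_label(method_outputs):
    """Return the agreed author list, or None if manual checking is needed.

    `method_outputs` maps a fusion method name to the author list it
    produced for one book, given as a tuple of names (hypothetical
    layout). Agreement is taken to mean that every method returned
    exactly the same tuple.
    """
    distinct = set(method_outputs.values())
    if len(distinct) == 1:
        return next(iter(distinct))
    return None  # disagreement (or no output): look at the book cover

# Made-up outputs for two books, for illustration only.
agreed = {
    "method_a": ("Author One", "Author Two"),
    "method_b": ("Author One", "Author Two"),
}
conflicting = {
    "method_a": ("Author One", "Author Two"),
    "method_b": ("Author One",),
}
print(silver_label(agreed))
print(silver_label(conflicting))
```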

 

IV. Restaurant (Contributors: Laure Berti-Equille, Luna Dong)

We collected listings of Manhattan restaurants from 12 websites weekly from January to March 2009. A detailed description of the data can be found in:

Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Truth discovery and copying detection in a dynamic world. In VLDB, 2009. [PDF][Presentation][Full paper]

Attributes: Source, Restaurant name, Address

 

V. Weather (Contributors: Laure Berti-Equille, Luna Dong)

We collected weather data for 30 major U.S. cities from 18 websites every 45 minutes on one day in March 2010. A detailed description of the data can be found in:

Xin Luna Dong, Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Global detection of complex copying relationships between sources. In VLDB, 2010. [PDF][Full paper][Presentation]