Data Sets for Data Fusion Experiments

The amount of useful information available on the Web has grown dramatically in recent years, and people increasingly rely on the Web to meet their information needs. However, the data provided by different Web sources often conflict, and some sources have quite low accuracy. Data fusion aims to resolve these conflicts and find the truth that reflects the real world. Below we list several data sets that we used for experiments on data fusion techniques.

I. Stock (Contributors: Xian Li, Kenneth B. Lyons)

We collected trading data for 1000 stock symbols from 55 sources on every work day in July 2011. A detailed description of the data can be found in:

Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. Truth Finding on the Deep Web: Is the Problem Solved? In VLDB, 2013. [PDF][Full paper]

Raw data: Data set with all provided attributes and values as extracted from the web pages, before schema mapping. [download]
Clean data: Data set after manual schema mapping. [download] The schema after manual schema mapping is:

Source | Symbol | Change % | Last trading price | Open price | Change $ | Volume | Today's high | Today's low | Previous close | 52wk High | 52wk Low | Shares Outstanding | P/E | Market cap | Yield | Dividend | EPS

Gold standards: We provide three gold standards:

Information on NASDAQ100 stocks collected from nasdaq.com. [download]

Information on NASDAQ100 stocks collected by taking the majority values provided by five stock data providers: nasdaq.com, Yahoo Finance, Google Finance, Bloomberg, and MSN Finance. [download]

Information on NASDAQ100 stocks and another 100 randomly selected stocks collected by taking the majority values provided by the same five stock data providers: nasdaq.com, Yahoo Finance, Google Finance, Bloomberg, and MSN Finance. [download]
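The majority-value rule behind the second and third gold standards can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used to build the downloadable files: the (source, item, value) triple layout, the function name `majority_vote`, the strict-majority threshold, and the example values are all hypothetical, and this page does not say how ties or missing values were handled.

```python
from collections import Counter, defaultdict


def majority_vote(claims):
    """Return {item: value} for items on which a strict majority of sources agree.

    claims: iterable of (source, item, value) triples, assumed to hold at most
    one claim per source per item. Items without a strict majority are omitted
    (an assumption; the tie-breaking rule is not given on this page).
    """
    values_by_item = defaultdict(list)
    for _source, item, value in claims:
        values_by_item[item].append(value)

    gold = {}
    for item, values in values_by_item.items():
        value, count = Counter(values).most_common(1)[0]
        if count * 2 > len(values):  # strict majority of the reporting sources
            gold[item] = value
    return gold


# Hypothetical claims: three sources report the volume of a made-up symbol.
example_claims = [
    ("s1", ("XYZ", "Volume"), "1000"),
    ("s2", ("XYZ", "Volume"), "1000"),
    ("s3", ("XYZ", "Volume"), "1200"),
]
example_gold = majority_vote(example_claims)
```

Values are compared as exact strings here. Real stock data would likely need normalization first (for example rounding, or unifying "1.2M" and "1,200,000"), which this sketch leaves out.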

II. Flight (Contributors: Xian Li, Kenneth B. Lyons)

We collected information on over 1200 flights from 38 sources over a one-month period (December 2011). A detailed description of the data can be found in:

Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. Truth Finding on the Deep Web: Is the Problem Solved? In VLDB, 2013. [PDF][Full paper]

Other papers using this data set:

Xin Luna Dong, Barna Saha, and Divesh Srivastava. Less is More: Selecting Sources Wisely for Integration. In VLDB, 2013. [PDF][Full paper]

Data: Flight data set after manual schema mapping. [download]

Source | Flight# | Scheduled departure | Actual departure | Departure gate | Scheduled arrival | Actual arrival | Arrival gate

Gold standard: The gold standard contains departure/arrival information on 100 randomly selected flights, as provided by the corresponding airline websites. [download]

III. Book (Contributors: Xiaoxin Yin, Luna Dong)

Information on computer science books was collected from the online bookstore aggregator AbeBooks.com in 2007. There are 1263 books and 894 data sources (bookstores) in total. A detailed description of the data can be found in:

Xiaoxin Yin, Jiawei Han, and Philip S. Yu. Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng., 20:796–808, 2008. [PDF]
Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Integrating conflicting data: the role of source dependence. In VLDB, 2009. [PDF][Presentation][Full paper]

Other papers using this data set:

Xin Luna Dong, Barna Saha, and Divesh Srivastava. Less is More: Selecting Sources Wisely for Integration. In VLDB, 2013. [PDF][Full paper]
Xin Luna Dong, Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Global detection of complex copying relationships between sources. In VLDB, 2010. [PDF][Full paper][Presentation]
Xin Luna Dong, Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Solomon: Seeking the truth via copying detection. Demo in VLDB, 2010. [PDF][Poster][Demo]

Data: Book data set. [download]

Source | ISBN | Title | Author list

Gold standards: We provide a gold standard and a silver standard:

Gold standard: Precise author lists manually obtained from the book covers of 100 randomly selected books. [download]

Silver standard: Author lists for all books. Where the different fusion methods agreed, we used their result; otherwise, we manually obtained the author lists from the book covers (for more than 500 books). [download]
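The silver-standard rule (accept a result when the fusion methods agree, otherwise check the book cover by hand) can be sketched as below. This is a minimal illustration under assumptions: the function name `split_by_agreement`, the dictionary layout, the method names, and the ISBN and author values are hypothetical, and "agreement" is taken to mean that every method returns exactly the same author list, which this page does not spell out.

```python
def split_by_agreement(method_results):
    """Split books into auto-accepted answers and ISBNs needing manual checks.

    method_results: {method_name: {isbn: author_tuple}}; author lists are
    tuples so they can be compared and stored in a set.
    Returns (agreed, manual): agreed maps ISBN -> author tuple on which all
    methods agree; manual is a sorted list of the remaining ISBNs.
    """
    isbns = set()
    for result in method_results.values():
        isbns.update(result)

    agreed, manual = {}, []
    for isbn in sorted(isbns):
        answers = {result.get(isbn) for result in method_results.values()}
        # Exactly one distinct answer, and no method left the book unanswered.
        if len(answers) == 1 and None not in answers:
            agreed[isbn] = answers.pop()
        else:
            manual.append(isbn)
    return agreed, manual


# Hypothetical outputs of two fusion methods on two made-up ISBNs.
example_results = {
    "method_a": {"isbn-1": ("A. Author",), "isbn-2": ("B. Writer", "C. Editor")},
    "method_b": {"isbn-1": ("A. Author",), "isbn-2": ("B. Writer",)},
}
example_agreed, example_manual = split_by_agreement(example_results)
```

In this example "isbn-1" is accepted automatically and "isbn-2" is queued for manual checking, mirroring the two branches described for the silver standard.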

IV. Restaurant (Contributors: Laure Berti-Equille, Luna Dong)

We collected listings of Manhattan restaurants from 12 websites weekly from January to March 2009. A detailed description of the data can be found in:

Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. Truth discovery and copying detection in a dynamic world. In VLDB, 2009. [PDF][Presentation][Full paper]

Data: 12 snapshots of restaurants. [download]

Source | Restaurant name | Address

Gold standard: We called the 467 restaurants that were removed by some websites to determine whether each business was still open. For each restaurant, the gold standard records whether the business is still open ("Y"). [download]

V. Weather (Contributors: Laure Berti-Equille, Luna Dong)

We collected weather data for 30 major US cities from 18 websites every 45 minutes on one day in March 2010. A detailed description of the data can be found in:

Xin Luna Dong, Laure Berti-Equille, Yifan Hu, and Divesh Srivastava. Global detection of complex copying relationships between sources. In VLDB, 2010. [PDF][Full paper][Presentation]

Data: Data from 18 sources. The schema is provided at the beginning of each data file. [download]

Gold standards: A gold standard and a silver standard on source copying relationships are provided in the paper.