Guangyuan's Research and Development Blog: Dataset - 2nd Linked Open Data-enabled Recommender Systems Challenge

The data were collected from Facebook profiles and describe personal preferences (“likes”) for items in three domains: movies, books and music. After user anonymization, the items in the dataset were mapped to their corresponding DBpedia URIs. These mappings can be used to extract semantic features from DBpedia or other LOD repositories, which the recommendation approaches proposed in the challenge can then exploit. The dataset is split into a training set and an evaluation set.

The training set, the test set and the mapping files are available at eswc2015-lod-recsys-challenge-v1.0.zip

For each domain (movies, books and music) this archive contains:

the training set file, with each line composed of: userID \t itemID.
the mapping file, with each line composed of: itemID \t type \t DBpediaURI.
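
The two line formats above can be read with a few lines of Python. This is only a minimal sketch: it assumes the files are plain tab-separated text without a header row, and the function names and sample values are made up for illustration, not taken from the archive.

```python
import io
from collections import defaultdict


def read_training(lines):
    """Parse 'userID \\t itemID' lines into a dict: userID -> set of itemIDs."""
    likes = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        likes[fields[0].strip()].add(fields[1].strip())
    return dict(likes)


def read_mapping(lines):
    """Parse 'itemID \\t type \\t DBpediaURI' lines into itemID -> (type, URI)."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        item_id, item_type, uri = (f.strip() for f in fields[:3])
        mapping[item_id] = (item_type, uri)
    return mapping


# Illustrative in-memory samples (hypothetical IDs, type and URI).
training_sample = io.StringIO("u1\t10\nu1\t11\nu2\t10\n")
mapping_sample = io.StringIO("10\tmusic\thttp://dbpedia.org/resource/Example_Band\n")

likes = read_training(training_sample)
mapping = read_mapping(mapping_sample)
```

Both functions accept any iterable of lines, so an open file handle can be passed in place of the in-memory samples.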

The dataset contains 3225 items for the book domain, 6372 items for the music domain and 5389 items for the movie domain.

The training set contains 11600 ratings for the book domain, 854016 ratings for the music domain and 638268 ratings for the movie domain.

Further investigation of the dataset

A closer look at the music domain shows that it contains 1,093,851 likes (liked music artists or bands).
These come from 52,072 users, with 21 liked items per user on average.
From these, 5 test cases can be constructed, with the remaining likes used as training cases.
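
The average above follows from 1,093,851 / 52,072 ≈ 21. One possible reading of the "5 test cases" note is a per-user hold-out: 5 liked items go to the test side and the rest stay in training. The sketch below assumes exactly that reading; the function names, the fixed seed, and the rule for users with too few likes are my own choices, not something the dataset prescribes.

```python
import random


def average_likes(likes):
    """Mean number of liked items per user for a userID -> set(itemIDs) dict."""
    if not likes:
        return 0.0
    return sum(len(items) for items in likes.values()) / len(likes)


def hold_out(likes, n_test=5, seed=0):
    """Split each user's likes into (train, test), holding out n_test items.

    Users with n_test or fewer likes keep everything in training
    (an assumption made here so that no user ends up with an empty profile).
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = {}, {}
    for user, items in likes.items():
        ordered = sorted(items)  # stable order before sampling
        if len(ordered) <= n_test:
            train[user], test[user] = set(ordered), set()
            continue
        held = set(rng.sample(ordered, n_test))
        test[user] = held
        train[user] = set(ordered) - held
    return train, test
```

Sorting before sampling keeps the split independent of set iteration order, so the same seed gives the same split across runs.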
Some URIs contain bad characters, for example http://dbpedia.org/resource/S�bastien_Tellier. The intended resource can still be found with a SPARQL query like the one below, which matches any characters in place of the bad one:

 # Find music artists or bands whose URI matches the damaged one,
 # with ".*" standing in for the bad character(s).
 select distinct ?s where {
 ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o1 .
 VALUES (?o1) {(<http://dbpedia.org/ontology/MusicalArtist>) (<http://dbpedia.org/ontology/Band>)}
 FILTER regex(str(?s), "^http://dbpedia.org/resource/S.*bastien_Tellier$")
 } LIMIT 100
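
When many URIs are affected, the query text can be generated rather than written by hand. The following is a rough sketch of that idea: it assumes the bad characters show up as the Unicode replacement character (U+FFFD) once the mapping file is decoded, and the helper names `uri_to_pattern` and `build_query` are hypothetical. It only builds the query string; sending it to a SPARQL endpoint is left out.

```python
import re

REPLACEMENT_CHAR = "\ufffd"  # what a badly decoded byte usually turns into
REGEX_SPECIAL = set(".^$*+?()[]{}|\\")


def escape_literal(text):
    """Backslash-escape regex metacharacters so the text matches literally."""
    return "".join("\\" + ch if ch in REGEX_SPECIAL else ch for ch in text)


def uri_to_pattern(damaged_uri):
    """Turn a damaged URI into an anchored regex; each run of bad characters becomes '.+'."""
    parts = re.split(REPLACEMENT_CHAR + "+", damaged_uri)
    return "^" + ".+".join(escape_literal(part) for part in parts) + "$"


def build_query(damaged_uri, limit=100):
    """Build a SPARQL query that looks for artists or bands matching the damaged URI."""
    # Backslashes and quotes must be escaped again inside a SPARQL string literal.
    pattern = uri_to_pattern(damaged_uri).replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT DISTINCT ?s WHERE {\n"
        "  ?s a ?type .\n"
        "  VALUES ?type { <http://dbpedia.org/ontology/MusicalArtist>"
        " <http://dbpedia.org/ontology/Band> }\n"
        f'  FILTER regex(str(?s), "{pattern}")\n'
        f"}} LIMIT {limit}"
    )


query = build_query("http://dbpedia.org/resource/S\ufffdbastien_Tellier")
```

Unlike the hand-written query, this version escapes the dots in the host name and uses `.+` so that at least one character must stand in for the damaged one.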
