Showing posts with label JIST2015.

RESIM (REsource SIMilarity) for Linked Data


DESCRIPTION

The resim.jar file is an implementation of the RESIM (REsource SIMilarity) measure. The measure calculates the semantic similarity between two resources in a Knowledge Graph (e.g., DBpedia, zhishi.me) that exposes a SPARQL Endpoint. RESIM was presented in [1], then improved in [2], and a summary of the measure is presented in [3]. The implementation extends the measure further by allowing the two properties of an indirect link to differ: e.g., the incoming indirect link between dbpedia:Steve_Jobs and dbpedia:Apple_Inc. can be the connection dbpedia:Steve_Jobs<-dbpedia-owl:keyPerson<-dbpedia:NeXT->dbpedia-owl:successor->dbpedia:Apple_Inc., while in the paper we restrict both properties to be the same.
When writing a paper or producing a software application, tool, or interface based on this library, please kindly cite [2].

REQUIREMENTS

  • Java 1.7
  • JENA 2.11.2

EXAMPLE 1

import java.util.Arrays;
import java.util.List;

public static void main(String[] args) {

    // similarity measure settings
    List<String> additionalPropertyList = Arrays.asList(
            "<http://purl.org/dc/terms/subject>", 
            "<http://www.w3.org/2000/01/rdf-schema#subClassOf>", 
            "<http://www.w3.org/2004/02/skos/core#narrower>", 
            "<http://www.w3.org/2004/02/skos/core#broader>",
            "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
            );
    List<String> includePropertyList = Arrays.asList(
            "<http://purl.org/dc/terms/subject>", 
            "<http://www.w3.org/2000/01/rdf-schema#subClassOf>", 
            "<http://www.w3.org/2004/02/skos/core#narrower>", 
            "<http://www.w3.org/2004/02/skos/core#broader>",
            "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
            );
    List<String> excludePropertyList = Arrays.asList(
            "<http://dbpedia.org/ontology/wikiPageWikiLink>" 
            );

    // initialize similarity measure
    // param1: SPARQL Endpoint URL
    // param2: property URI pattern - default: "http://dbpedia.org/ontology/"
    // param3: include property list
    // param4: additional property list to pattern
    // param5: exclude property list
    // NOTE: includePropertyList defined above is not used here (param3 is null),
    // so the default pattern plus additionalPropertyList applies
    ResourceSimilarityMeasure rsm = new ResourceSimilarityMeasure("http://dbpedia.org/sparql", null, null, additionalPropertyList, excludePropertyList);    

    System.out.println(rsm.getSimilarity("<http://dbpedia.org/resource/Drink>", "<http://dbpedia.org/resource/Mouth>", 2));

}

EXAMPLE 2

public static void main(String[] args) {

    List<String> excludePropertyList = Arrays.asList("<http://dbpedia.org/ontology/wikiPageWikiLink>");
    ResourceSimilarityMeasure rsm = new ResourceSimilarityMeasure("http://dbpedia.org/sparql", null, null, null, excludePropertyList);  

    System.out.println(rsm.getSimilarity("<http://dbpedia.org/resource/Apple_Inc.>", "<http://dbpedia.org/resource/Steve_Jobs>", 2));
    System.out.println(rsm.getSimilarity("<http://dbpedia.org/resource/Apple_Inc.>", "<http://dbpedia.org/resource/Steve_Wozniak>", 2));
    System.out.println(rsm.getSimilarity("<http://dbpedia.org/resource/Apple_Inc.>", "<http://dbpedia.org/resource/Jonathan_Ive>", 2));
    System.out.println(rsm.getSimilarity("<http://dbpedia.org/resource/Apple_Inc.>", "<http://dbpedia.org/resource/Microsoft>", 2));
    System.out.println(rsm.getSimilarity("<http://dbpedia.org/resource/Apple_Inc.>", "<http://dbpedia.org/resource/IPad>", 2));

    // Printed results:
    // 0.7107697604926099
    // 0.26084710667467736
    // 0.13425729687979637
    // 0.6239711506085717
    // 0.7341358492281069
    // Started: 18.1.2016 21:54:42 Finished: 18.1.2016 21:55:12

}
  • The ResourceSimilarityMeasure constructor requires 5 parameters. The first is a SPARQL Endpoint (e.g., the DBpedia SPARQL Endpoint) and the others control the property list for this measure.
  • The 2nd parameter (pattern) is used together with the 4th and 5th parameters. For example, the default property pattern for this measure is "http://dbpedia.org/ontology/" (i.e., DBpedia Ontology properties), and the measure will additionally consider/exclude properties if an additional property list or an exclude property list is given.
  • The 3rd parameter is an include property list that controls the property list for this measure in a strict way. That is, if you define this list, the measure will only consider the properties in it.

REFERENCES

  1. Guangyuan Piao, Safina Showkat Ara, John G. Breslin. Computing the Semantic Similarity of Resources in DBpedia for Recommendation Purposes. 5th Joint International Semantic Technology Conference (JIST), Yichang, China, 2015.
  2. Guangyuan Piao, John G. Breslin. Measuring Semantic Similarity for Linked Open Data-enabled Recommender Systems. The 31st ACM/SIGAPP Symposium on Applied Computing (SAC), Pisa, Italy, 2016.
  3. Guangyuan Piao. Exploiting the Semantic Similarity of Interests in a Semantic Interest Graph for Social Recommendations. The 31st ACM/SIGAPP Symposium on Applied Computing (SAC), Pisa, Italy, 2016.

Review of 2015

Research

The year 2015 is over, and I have been in Ireland and at Insight for 1 year and 6 months. Doing research here has given me a quite different experience from my previous research, and it poses good opportunities as well as challenges for me.

Independence:  

You have to grow up and be able to do your research (not projects) with your own ideas and opinions, and conduct experiments on your own. I remember a seminar I attended at the beginning of my PhD journey, where the speaker described our academic supervisor as an "advisor", since that is more appropriate. That means our advisor is someone who gives advice on your research, not someone who tells you every step you should take, and usually our advisors are too busy to do so anyway.

At first, I could not start my own research and conduct experiments by myself; there was always uncertainty about myself, and I realized that the way I had been trained was always "supervised" by others. It reminds me of my time in South Korea, when I was a master's student as well as an employee at a company, where I received a lot of to-dos every day from senior members. In contrast, I did not receive any calls here, and all communication has been done through email, which is still surprising to me. Thanks to God, even though I still have a lot of room for improvement, I have started my research with advice from my supervisor.

I started to recognize the statement (from "So you want to do a PhD" by the Open University) that a PhD confirms your "research independence", i.e., you have to demonstrate:

  1. Ability to do research by yourself, rather than simply doing what your supervisor tells you
  2. Awareness of where your work fits in relation to the discipline, and what it contributes to the discipline
  3. Mature overview of the discipline




Insight Centre: 


There have been many changes at Insight@Galway, which was formerly well known as DERI. Our former director Prof. Stefan Decker moved to Germany, and we have a new director, Prof. Dietrich Rebholz-Schuhmann. Interestingly, many researchers, including PhD students and postdocs, moved to Germany as well. There are many career paths for graduates from here, including academic positions as well as industry ones; some have even started running their own startups.


Conference

After several attempts at conferences, I've published two full papers, at JIST2015 and SAC2016, and I found that it is really important to publish, or try to publish, your results at any conference or in any journal to get started, and to get feedback from experts. During my previous studies, I was advised not to read and present a conference paper for a seminar. However, one thing I love here is that top conferences carry the same weight as top journals. There is an interesting article to read if you have the same question: https://homes.cs.washington.edu/~mernst/advice/conferences-vs-journals.html
At the end of the year, I submitted a paper to ESWC2016, which has very interesting tutorials for me (http://2016.eswc-conferences.org/program/workshops-tutorials), and I hope I will have an opportunity to attend it :). Another conference I'd like to attend is UMAP2016, which is also highly related to my research. So... fingers crossed for the upcoming new year.





JIST2015 Travel Report

This is my first travel report, for my first participation at a conference to present my paper. The Joint International Semantic Technology Conference is a merger of ASWC (Asian Semantic Web Conference) and CSWC (Chinese Semantic Web Conference). This year, the conference consisted of the main conference as well as an industrial forum and a data challenge. The majority of participants came from China and Japan, and the committee expects more participants from the USA and Europe. Overall, there were around 100 participants at the conference and the industrial forum.

Keynote speakers:

There were two keynote speakers and many invited talks during the conference and the industrial forum. Speakers included Prof. Frank van Harmelen, Hong-gee Kim from SNU Korea, Ming Zhou from Microsoft Asia, Prof. Jeff Pan, and many industrial participants from giant companies such as Baidu, Alibaba, China Mobile, etc.

Prof. Frank van Harmelen talked about research efforts on the Semantic Web up to now and in the near future, and about several projects they are currently working on for understanding the Semantic Web:

  • 2000-2010: how to build it
  • 2010-2015: how to use it
  • next decades: how to understand it 

To this end, he talked about how to build a WoD (Web of Data) observatory and their solution, the LOD Laundromat ("clean your dirty triples"): it crawls from registries (CKAN) and consists of 38 billion triples and counting!

Looking back at the research papers in the Semantic Web community, the majority of papers over the last 5 years were optimized for DBpedia. This is an especially bad idea, since the Semantic Web != DBpedia... They are trying to use analytical tools on top of the LOD Laundromat (LODLab, LOTUS) to re-run experiments on a much larger number of datasets (100+) collected by the WoD observatory.

Even at this early stage, there are some interesting findings, such as hotspots: most of a dataset is not used by many queries, so if we can identify the hotspots, we can optimize for those queries. Also, a local sub-graph works well for answering queries (no theoretical explanation yet).

So the agenda should change: not building things, but understanding why those things work so well!


Industry forum: 
It was very impressive that around 10 companies participated in the forum to present their work on using Semantic Technologies. The most impressive talk for me was given by Jie Bao from Memect Inc., who previously worked at RPI and MIT. He pointed out differences between research in Europe and the USA:
  • based on projects vs. business
  • conferences: ISWC vs. SEMANTiCS
He stressed that even though the Semantic Web community often points to schema.org as a good example of the SW, it is a project and not a business. Few startups have survived to the present, even though we expected a new semantic search engine, semantic agents, etc.

And other slides on start-up companies with Semantic Web...



Data Challenge:
I think this session was one of the most beneficial for me. I personally participated in the data challenge on entity type prediction and placed 4th among 13 teams. The other top-5 participants came from Peking University, the Chinese Academy of Sciences, Fujitsu R&D, and Dongnan University. I learnt a lot about NLP and Machine Learning, and I need to improve my NLP and ML skills in the near future.

JIST2015 data challenge from GUANGYUAN PIAO




Editors' session:
Since my paper "Computing the Semantic Similarity of Resources in DBpedia for Recommendation Purposes" for the main conference was one of the Best Paper candidates, I had the chance to meet journal editors and get some feedback about the work. I received many useful comments from them and have a lot of work to do to improve/strengthen it.



It was a good experience to attend the conference, and JIST2016 will be in Singapore :)

---------------------------------------------------------------------------------------------------------
Update after the conference:

The proceedings have been available since March 2016. This is a very slow post-proceedings publication compared to other conferences such as UMAP: the UMAP proceedings were available during the conference, while the JIST2015 proceedings were not available until March 2016. Both were published by Springer, so what's the difference??

Previous JIST conferences had external proceedings for posters and workshops in CEUR, but not this time. Maybe it depends on the organizing committee, so it is better to check before you submit any workshop or poster papers to a conference.

Although the conference had a line-up of great speakers and programs, it would be better to disseminate the activities of the conference on a global channel like Twitter (instead of the localized social network), as it is an "international" conference.







JIST2015 Data Challenge



  • Apply Filters for Attributes
    • Change all properties to nominal using a regex pattern
      • String regex = "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
      • Replace all {0} or {1} with {0,1} in the train.arff and test.arff files to make them compatible.
    • Normalize all numeric attributes into the [0,1] interval
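The two preprocessing steps above can be sketched in plain Java. This is a minimal sketch under the assumption that URL-valued attributes are detected with the regex from the notes and numeric attributes are min-max scaled; class and method names are hypothetical, and in practice these steps are done with Weka filters.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PreprocessSketch {
    // Regex from the notes: detects http/https/ftp/file URLs, used to decide
    // which attribute values are resource URIs (and thus nominal).
    static final Pattern URL = Pattern.compile(
        "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // find() rather than matches(): a trailing '.' (as in "...Apple_Inc.")
    // would otherwise make the anchored full-string match fail.
    static boolean looksLikeUrl(String value) {
        return URL.matcher(value).find();
    }

    // Min-max normalization of one numeric attribute into the [0,1] interval.
    static double[] normalize(double[] values) {
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUrl("http://dbpedia.org/resource/Apple_Inc."));
        System.out.println(Arrays.toString(normalize(new double[]{2, 4, 6})));
    }
}
```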


  • Save Predictions to a File
    • [Visualize classifier errors] -> save as an .arff file
    • Remove all headers and import it into a .csv file
    • The final two columns show the "predicted" and "actual" classes





  • Parameter Selection
    • Use CVParameterSelection as the classifier
    • Choose a classifier whose parameters should be optimized
    • Add each parameter that needs to be tuned in CVParameters
      • e.g., I 10 250 25 for Random Forest will test 10, 20, ..., 250 (25 steps) for the [number of trees] parameter
      • It will show the best option in the output: "Classifier Options: -I 90 -K 0 -S 1"
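As a sanity check on the spec format above, here is a small sketch (a hypothetical helper, not Weka code) that expands a CVParameters triple such as "I 10 250 25" into the candidate values the cross-validated search will try:

```java
public class CvParamSketch {
    // Expand a CVParameters spec (min, max, steps) into candidate values,
    // e.g. min=10, max=250, steps=25 yields 10, 20, ..., 250.
    static double[] expand(double min, double max, int steps) {
        double[] values = new double[steps];
        double stepSize = (max - min) / (steps - 1); // (250-10)/24 = 10 here
        for (int i = 0; i < steps; i++) {
            values[i] = min + i * stepSize;
        }
        return values;
    }

    public static void main(String[] args) {
        double[] candidates = expand(10, 250, 25);
        System.out.println(candidates[0] + " .. " + candidates[candidates.length - 1]);
    }
}
```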

  • Feature Selection
    • Fast feature (attribute) selection using ranking (searching is not feasible because the number of attributes is large)
      • GainRatioAttributeEval with Ranker 
      • Set the threshold to 0.1, 0.2, 0.3, 0.4 and run experiments to see which threshold works best -> 0.2
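The thresholding step can be sketched as follows. This is a minimal sketch with hypothetical names: in Weka the scores come from GainRatioAttributeEval and the cut-off is the Ranker's threshold option; here the scores are just an illustrative array.

```java
import java.util.ArrayList;
import java.util.List;

public class RankerSketch {
    // Keep the indices of attributes whose ranking score (e.g., gain ratio)
    // is at least the given threshold.
    static List<Integer> selectByThreshold(double[] scores, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] >= threshold) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative scores for 4 attributes; threshold 0.2 keeps two of them.
        double[] scores = {0.05, 0.25, 0.4, 0.15};
        System.out.println(selectByThreshold(scores, 0.2));
    }
}
```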

  • Consider Index (can it improve the performance?)
    • Used the formula 1 - index/size
    • Answer:
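The positional feature mentioned above can be written out directly. This is a sketch under the assumption that "index" is an item's position in a ranked list and "size" is the list length, so earlier positions score closer to 1.

```java
public class IndexFeatureSketch {
    // score = 1 - index/size: position 0 scores 1.0, the last position
    // scores close to 0.
    static double indexFeature(int index, int size) {
        return 1.0 - (double) index / size;
    }

    public static void main(String[] args) {
        System.out.println(indexFeature(0, 10));
        System.out.println(indexFeature(5, 10));
    }
}
```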