CIKM2020 Analyticup Workshop Report



Today, I had time to attend the CIKM Analyticup Workshop on the COVID-19 Retweet Prediction Challenge, held online at CIKM2020 and organized by Dimitar Dimitrov (GESIS – Leibniz Institute for the Social Sciences) and Xiaofei Zhu (Chongqing University of Technology). The workshop started with a keynote by Philipp Singer, a Kaggle grandmaster currently ranked 3rd on the platform, followed by talks from the finalists of the competition.

Philipp Singer gave a keynote talk on "How to Kaggle?", where he walked through his journey from his first participation to his current status as a grandmaster. More specifically, he used two typical competitions to illustrate why Kaggle is important, especially in terms of learning by participating in such competitions. One of the interesting examples is the NFL Big Data Bowl challenge, where his team took first place; more importantly, that solution has since been deployed to rank player performance. This is a great example of how Kaggle competitions reflect real-world problems and challenges to be solved.

Regarding his experience in designing solutions for these competitions, a few points stood out to me:
  • The best-performing solutions are not necessarily the most complicated ones, but well-designed and carefully tuned solutions backed by intuitions about the problem
  • Exploratory data analysis is also important
  • Domain knowledge is not necessary to design a solution; the NFL challenge is a good example

He also offered advice on how to get started on competition platforms such as Kaggle, and why we should:
  • Just start participating in competitions that match your interests
  • You don't need your own computing resources; many Kaggle competitions provide them for you (interesting to hear that both of his examples were run on Kaggle resources)
  • The most important thing is learning by doing your best
  • These platforms are where academic ideas get tested on a wide range of real-world problems

After the keynote, the winning teams of the Analyticup presented their solutions. It is interesting to see that the winning teams all come from data science backgrounds in industry; the top-3 teams came from Rakuten, AIDA, and Aidemy Inc., respectively.

Interestingly, the winning solution from Vinayaka Raj seems simpler than those of the other teams, and the model is taken from the paper "NPA: Neural News Recommendation with Personalized Attention". The personalized attention seems to have played a key role in lowering the MSLE score of the first-place solution.
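For context, the core idea of NPA is that an embedding of the user ID is projected into a query vector, which then re-weights the word representations of a tweet before they are pooled into a single tweet vector. Below is a minimal PyTorch sketch of that kind of personalized attention; it is my own illustration based on the NPA paper rather than the winner's code, and the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersonalizedAttention(nn.Module):
    """NPA-style personalized attention (illustrative sketch, not the winner's code).

    A user-ID embedding acts as a query that scores each word representation of
    the tweet; the tweet vector is the attention-weighted sum of the words.
    """

    def __init__(self, n_users, word_dim=100, query_dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, query_dim)
        self.word_proj = nn.Linear(word_dim, query_dim)

    def forward(self, word_vecs, user_ids):
        # word_vecs: (batch, seq_len, word_dim), user_ids: (batch,)
        query = self.user_emb(user_ids)                               # (batch, query_dim)
        keys = torch.tanh(self.word_proj(word_vecs))                  # (batch, seq_len, query_dim)
        scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)     # (batch, seq_len)
        weights = F.softmax(scores, dim=-1)                           # per-user attention over words
        return torch.bmm(weights.unsqueeze(1), word_vecs).squeeze(1)  # (batch, word_dim)

# Toy usage: 8 tweets of 20 tokens, 100-d word vectors, 1000 known users.
attn = PersonalizedAttention(n_users=1000)
tweet_vec = attn(torch.randn(8, 20, 100), torch.randint(0, 1000, (8,)))
print(tweet_vec.shape)  # torch.Size([8, 100])
```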



In the end, an ensemble of models trained on different word embeddings (fastText, fastText Wiki, GloVe 840B, GloVe Twitter, and LexVec), combined with pseudo-labelling, yielded the best performance in the competition.
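To make the recipe concrete, here is a rough, self-contained sketch of the ensemble-plus-pseudo-labelling loop; toy Ridge regressors and random features stand in for the actual neural models and embedding-based features, so treat it as an outline of the procedure rather than the winning pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_train, n_test, dim = 200, 50, 16
y_train = rng.poisson(2.0, size=n_train).astype(float)  # stand-in for log1p(retweet count)

# One feature "view" per pretrained embedding (random toy data here; the real
# solution builds these from fastText, fastText Wiki, GloVe 840B, GloVe Twitter, LexVec).
embeddings = ["fasttext", "fasttext_wiki", "glove_840b", "glove_twitter", "lexvec"]
train_views = {e: rng.normal(size=(n_train, dim)) for e in embeddings}
test_views = {e: rng.normal(size=(n_test, dim)) for e in embeddings}

# 1) Train one model per embedding view and average their test predictions.
models = {e: Ridge().fit(train_views[e], y_train) for e in embeddings}
ensemble_pred = np.mean([models[e].predict(test_views[e]) for e in embeddings], axis=0)

# 2) Pseudo-labelling: reuse the ensemble's test predictions as soft labels and
#    retrain each model on the enlarged pool before a final ensembling pass.
for e in embeddings:
    X_aug = np.vstack([train_views[e], test_views[e]])
    y_aug = np.concatenate([y_train, ensemble_pred])
    models[e].fit(X_aug, y_aug)

final_pred = np.mean([models[e].predict(test_views[e]) for e in embeddings], axis=0)
print(final_pred[:5])
```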




The second-place solution ensembles various types of models, and one interesting feature is the graph-embedded entities (in the figure below), which the authors said contributed a lot to the performance.
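The talk did not spell out exactly how those graph embeddings are built, so the following is only a guess at one common recipe: build a graph linking users to the entities/hashtags in their tweets, run DeepWalk-style random walks over it, and train a skip-gram model on the walks so that each node gets a dense vector usable as a feature. The toy graph, walk parameters, and the use of gensim's Word2Vec are all my assumptions.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

# Toy bipartite graph linking users to entities/hashtags they mention
# (hypothetical edges; the actual graph construction was not described in detail).
G = nx.Graph()
G.add_edges_from([
    ("user_1", "#covid19"), ("user_1", "WHO"),
    ("user_2", "#covid19"), ("user_2", "#lockdown"),
    ("user_3", "WHO"), ("user_3", "#vaccine"),
])

def random_walks(graph, num_walks=20, walk_length=8):
    """DeepWalk-style uniform random walks used as 'sentences' for Word2Vec."""
    walks = []
    for _ in range(num_walks):
        for start in graph.nodes:
            walk = [start]
            while len(walk) < walk_length:
                walk.append(random.choice(list(graph.neighbors(walk[-1]))))
            walks.append(walk)
    return walks

# Skip-gram embeddings over the walks; the node vectors then serve as model features.
model = Word2Vec(random_walks(G), vector_size=32, window=3, min_count=1, sg=1, epochs=10)
print(model.wv["user_1"][:5])  # 32-d graph embedding for user_1
```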



The intuition behind the common-user model is quite similar to the one I had in our solution (which ranked 4th in the end): use the set of users that appear in both the training and test sets to train separate model(s). A rough sketch of that split is shown below.
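This is a minimal pandas sketch of the common-user split, with toy data and hypothetical column names; the feature engineering and models built on top of each subset are omitted.

```python
import pandas as pd

# Toy train/test frames; in the challenge each row is a tweet with its user id.
train = pd.DataFrame({"user_id": [1, 1, 2, 3], "text": ["a", "b", "c", "d"], "retweets": [5, 3, 0, 7]})
test = pd.DataFrame({"user_id": [1, 3, 9], "text": ["e", "f", "g"]})

# Users occurring in BOTH splits: their history is available at test time,
# so a dedicated "common-user" model can exploit it.
common_users = set(train["user_id"]) & set(test["user_id"])

train_common = train[train["user_id"].isin(common_users)]   # -> common-user model
train_rest = train[~train["user_id"].isin(common_users)]    # -> generic / cold-start model
test_common = test[test["user_id"].isin(common_users)]
test_rest = test[~test["user_id"].isin(common_users)]

print(len(train_common), len(train_rest), len(test_common), len(test_rest))
```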



The third-place solution uses a DNN that combines tweet- and user-related features; a rough sketch of that kind of two-branch network follows.
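This is only an illustration of the general shape, not their exact architecture: the branch sizes, feature dimensions, and the log1p target are assumptions on my side.

```python
import torch
import torch.nn as nn

class RetweetDNN(nn.Module):
    """Toy two-branch DNN: one branch for tweet features (e.g., text embedding,
    hashtags, sentiment) and one for user features (e.g., follower count,
    verified flag), merged into an MLP that predicts log1p(retweet count)."""

    def __init__(self, tweet_dim=128, user_dim=16):
        super().__init__()
        self.tweet_branch = nn.Sequential(nn.Linear(tweet_dim, 64), nn.ReLU())
        self.user_branch = nn.Sequential(nn.Linear(user_dim, 16), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(64 + 16, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, tweet_feats, user_feats):
        merged = torch.cat([self.tweet_branch(tweet_feats), self.user_branch(user_feats)], dim=-1)
        return self.head(merged).squeeze(-1)

model = RetweetDNN()
pred = model(torch.randn(4, 128), torch.randn(4, 16))  # 4 example tweets
print(pred.shape)  # torch.Size([4])
```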



Overall, the challenge was quite interesting, and the workshop gave me a lot of insight into the points my solution was missing. In particular, although the other solutions, including mine, somehow incorporated a "manually designed attention", e.g., focusing on the set of common users separately, the attention mechanism incorporated into the winning solution might provide this attention to users automatically and boost performance substantially.