Sunday, February 9, 2014

Test your data science skills - Kaggle competitions

I recently started directing my interests towards Data Science. I started reading related books and taking related MOOCs. It is a very interesting area of science with particularly good applications in business. But dry learning from books and MOOCs can feel unrelated to real-life problems, and it can also get boring at some point. For programming, the solution is easy: coding competitions. Here is a list of many pages dedicated more or less to coding competitions. But how about data science?

Luckily, there is a website that hosts data science competitions: kaggle.com. Kaggle organizes several categories of contests:
  • Featured: often complex problems, but heavily sponsored in terms of prizes. Organized by big companies.
  • Masters: limited-access competitions. You have to earn access to Kaggle's Master Tier by achieving great results in previous competitions.
  • Recruiting: competitions dedicated to recruitment.
  • Prospect: competitions without leaderboards. Their goal is usually to explore various data sets and discuss the results with other Kagglers.
  • Research: problems related to strict research areas.
  • Playground: interesting problems which are solved for fun.
  • Getting Started: tutorials.
At the time of writing this post, there are 14 competitions (4 of them are long-lasting tutorial competitions) with a total prize pool of $334K.

How do those competitions work? When you register for a competition, you receive access to the training and test data and some additional information. The training data includes the known value that you have to predict. The test data is the actual data on which you will test your model. After running your model on the test data, you submit your results to an automated web application and receive an accuracy score. Those scores build the public leaderboard. It is called public because only part of your results is used to calculate your score. The other part is taken into account after the competition closes, and the final leaderboard is built from it. This split makes it harder to overfit a model by examining the results of repeated submissions.
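To make the public/final split more concrete, here is a minimal sketch in Python. It is only an illustration, not Kaggle's actual scoring code: the function names, the 25% public fraction, the random split, and the toy labels are all my assumptions.

```python
import random


def accuracy(predictions, truth, ids):
    """Fraction of the given ids whose prediction matches the true label."""
    correct = sum(1 for i in ids if predictions[i] == truth[i])
    return correct / len(ids)


def split_test_ids(ids, public_fraction=0.25, seed=0):
    """Randomly split test ids into a public part and a hidden part.

    The fraction and the random split are assumptions for illustration;
    each real competition defines its own split.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * public_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical hidden labels, known only to the organizer.
truth = {i: i % 2 for i in range(20)}

# A naive submission that predicts 1 for every test row.
submission = {i: 1 for i in range(20)}

public_ids, private_ids = split_test_ids(truth.keys())

# Shown on the public leaderboard while the competition is running.
public_score = accuracy(submission, truth, public_ids)

# Used for the final leaderboard once the competition is closed.
private_score = accuracy(submission, truth, private_ids)

print("public score:", public_score)
print("final score:", private_score)
```

The two scores usually differ a bit, which is exactly why tuning a model only against the public score can backfire on the final leaderboard.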

So how do you start with data science competitions? I recommend starting with the Titanic competition. It has nice tutorials in Excel, Python and R, and it looks quite easy, at least at the beginning.

Anyway, I wish you GL & HF. See you on the leaderboards!
