Saturday, April 15, 2017

TPOT - your Python Data Science Assistant

When dealing with supervised learning problems in Python, such as regression and classification, we can easily pick from many algorithms and use whichever we like. Each of those algorithms has parameters that we can use to adjust its behavior to our needs. Everything is fine and we are free to tinker with our model. But this works only when we know what to do and have experience with the data we are playing with. So how do you start when you have only a little knowledge of your data? Here comes TPOT.

TPOT is a Python module that can be used as a standalone application or imported into a Python script and used there. Its purpose is to test data processing and modeling pipelines built from various algorithms with various parameters.

So how do you start using TPOT? You start the way you usually start building a machine learning model. I decided to use the Heart Disease Data Set for my example. First, loading and preparing the data:
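
The loading step could look roughly like the sketch below. It assumes the processed Cleveland file from the UCI Heart Disease Data Set, which has no header row and marks missing values with `?`. The function name `load_heart_data`, the collapse of the label to 0/1 and the two sample rows are illustrative assumptions, not taken from the original script.

```python
import io

import pandas as pd

# Column names assumed from the data set's documentation.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]


def load_heart_data(source):
    """Read a header-less CSV (path or file object) and return (features, target)."""
    # '?' marks a missing value in the raw file, so parse it as NaN.
    df = pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
    # 'num' is 0 for no disease and 1-4 otherwise; collapse it to a 0/1 label.
    target = (df["num"] > 0).astype(int)
    features = df.drop(columns="num")
    return features, target


# Two illustrative rows in the same layout, only to show the call.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,?,3.0,2\n"
)
X, y = load_heart_data(sample)
```

With the real file, the call would be `load_heart_data("processed.cleveland.data")`, where the path is whatever you saved the download as.
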
Then, we need to decide how to deal with missing data. I decided to fill it with the mean of the column:
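
A minimal sketch of mean imputation with pandas follows. The helper name `fill_with_column_mean` and the tiny demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def fill_with_column_mean(df):
    """Return a copy of df with each NaN replaced by its column's mean."""
    # mean() skips NaN by default, so only observed values are averaged.
    return df.fillna(df.mean(numeric_only=True))


# Tiny made-up frame to show the effect.
demo = pd.DataFrame({"ca": [0.0, np.nan, 2.0], "thal": [3.0, 7.0, np.nan]})
filled = fill_with_column_mean(demo)
```

Here the gap in `ca` becomes 1.0 and the gap in `thal` becomes 5.0. In a stricter setup the means would be computed on the training split only, so that nothing from the test rows leaks into the model.
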
Finally, we can run TPOT:
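
A run could be set up along these lines. This is a sketch that assumes the classic `TPOTClassifier` interface (`generations`, `population_size`, `verbosity`, `fit`, `score`, `export`); newer TPOT releases may name some arguments differently. The wrapper `run_tpot`, the 75/25 split, the seed and the output file name are assumptions for illustration.

```python
def run_tpot(features, target, generations=5, population_size=50):
    """Search for a pipeline with TPOT and export the best one found."""
    # Heavy imports are kept local so the function can be defined anywhere.
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    # Hold part of the data back to check the winning pipeline afterwards.
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, train_size=0.75, random_state=42
    )
    tpot = TPOTClassifier(
        generations=generations,
        population_size=population_size,
        verbosity=2,       # print progress and the best pipeline
        random_state=42,   # assumed seed, for repeatable runs
    )
    tpot.fit(X_train, y_train)
    print("Hold-out score:", tpot.score(X_test, y_test))
    # Write the winning pipeline out as a standalone Python script.
    tpot.export("tpot_heart_pipeline.py")
    return tpot
```

Calling `run_tpot(X, y)` with the prepared features and labels starts the search. Expect it to take a while, since every candidate pipeline is cross-validated.
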
As you can see, to start the TPOT "evolution", you need to provide values for generations and population_size. The generations value says how many generations will be created, and population_size determines the size of each generation. This means that with generations = 5 and population_size = 50 there will be 5*50 + 50 = 300 pipelines built and tested. In my case, the best pipeline was:
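
The arithmetic behind that count can be written down as a small helper. It follows the rule from the TPOT documentation that the total is population_size + generations × offspring_size, with offspring_size defaulting to population_size; the function name is made up for illustration.

```python
def pipelines_evaluated(generations, population_size, offspring_size=None):
    """Count the pipelines TPOT evaluates: initial population plus offspring."""
    # By default each generation produces as many offspring as the population.
    if offspring_size is None:
        offspring_size = population_size
    return population_size + generations * offspring_size


total = pipelines_evaluated(generations=5, population_size=50)
print(total)  # 300
```
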

Best pipeline: XGBClassifier(input_matrix, XGBClassifier__learning_rate=0.01, XGBClassifier__max_depth=6, XGBClassifier__min_child_weight=2, XGBClassifier__n_estimators=100, XGBClassifier__nthread=1, XGBClassifier__subsample=0.55)

It says that I should use XGBClassifier on the input data with the listed parameters, and that I should get a CV score of 0.823039215686. To see how to actually use such a pipeline, we can examine the generated Python file:
The only action needed to use this generated file is to fill in the missing input in line 7. And voilà, we have a nice (and cheap) starting point for further analysis.
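
As a rough illustration of what the reported pipeline amounts to, the parameters from the "Best pipeline" line can also be passed to `XGBClassifier` directly. This is a sketch, not the file TPOT generated; it assumes the xgboost package is installed, and the names `BEST_PARAMS` and `build_best_pipeline` are made up.

```python
# Parameters copied from the "Best pipeline" line reported by TPOT.
BEST_PARAMS = {
    "learning_rate": 0.01,
    "max_depth": 6,
    "min_child_weight": 2,
    "n_estimators": 100,
    "nthread": 1,  # newer XGBoost releases prefer n_jobs
    "subsample": 0.55,
}


def build_best_pipeline(params=None):
    """Create an unfitted XGBClassifier with the reported parameters."""
    # Imported here so the parameters can be inspected without xgboost.
    from xgboost import XGBClassifier

    return XGBClassifier(**(params or BEST_PARAMS))
```

From there, `build_best_pipeline().fit(X_train, y_train)` followed by `.score(X_test, y_test)` would train and evaluate it, where `X_train`, `y_train`, `X_test` and `y_test` stand for your own split of the data.
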
