Wednesday, May 10, 2017

Not enough RAM for data frame? Maybe Dask will help?

What environment do we like?

When we are doing data analysis we would like to do it in as much interactive manner as possible. We would like to get results as soon as possible and test ideas without unnecessary waiting. We like to work in that manner because great proportion of our work leads to dead ends. As soon as we get there we can start thinking about other approach to problems. When you were working with R, Python or Julia backed by Jupyter Notebook, you are familiar with this workflow. And everything works well when your workstation has enough RAM to handle all data, its processing, cleaning, aggregation and transformations. Even Python, which is considered as rather slow language performs nicely here. Especially when you are using strongly optimized modules like NumPy. 

But what with case, where data which we want to process is larger than available RAM? Well, there are three simple outcomes of your attempts: 1) your operating system will start to swap and possibly will be able to complete ordered task, but it will take way more time than expected; 2) you will receive warning or error message from functions which can estimate needed resources - rare case; 3) your script will use all available RAM and swap and will hang your operating system. But are there maybe any other options?

What can we do?

According to this (Rob Story: Python Data Bikeshed) and this (Peader Coyle - The PyData map: Presenting a map of the landscape of PyData tools) presentations, there are many possibilities to build tailored data processing solution. From all of those modules, Dask looks especially interesting to me, because it mimics Pandas naming convention and functions behavior.

Dask offers three data structures - array, data frame and bag. Array structure is designed to implement methods known from NumPy array. So everyone experienced with NumPy will be able to use it without much problems. The same approach was used to implement data frame which is inspired by Pandas data frame structure. Bag on the other hand is designed to be equivalent of json dictionaries or other Python data structures.

What is key difference between Dask data structures and their archetypes? Dask structures are divided into small chunks and every operation on them is evaluated when its needed. It means that when we have data frame and series of transformations, aggregations, searches and similar operations Dask will calculate what and when to do with each chunk and will take care of executions of those operations and do garbage collection immediately. If we would like to do that in original Pandas data frame it will have to fit into RAM entirely, and every of steps in pipeline will also have to store its results in RAM even if some operations could be executed with inplace=True parameter.

How can we use it? As I mentioned, Dask data structures were designed to be "compatible" with NumPy and Pandas data structures. So, if we check data frame API reference we will see that many Pandas methods was re-implemented with the same arguments and results.

But the problem is that not all original methods from NumPy and Pandas are implemented in Dask. So it is not possible to blindly substitute Pandas with Dask and expect that code will work. On the other hand, in cases where you are unable to read your data at all, it might be worth to spend some time to rework your flow and adjust it to Dask.
Second problem with Dask it that despite it tries to execute various operations on chunks in parallel, it may take more time to produce final results than in simple all in RAM data frame. But if you know time execution characteristic of your scripts you can try to substitute some heavy time parts of it and compare with Dask execution. Maybe it will make sense.

The future?

What is the future of such modules? It depends. One my ask "Why to bother when we have PySpark?". And this is valid question. I would say, that Dask and similar solutions fits nicely in niche where data is to big for RAM but on the other hand it fits on directly attached hard drives. If data fits nicely in RAM I wouldn't bother to work with Dask and similar modules - I would just stick to plain old and good NumPy and Pandas. Also if I had to deal with such amount of data that wouldn't fit on available attached to workstation disks, I would consider going into big data solutions which also gives me redundancy over hardware failures. But Dask is still very cool, and at least worth to test.

I hope that Dask developers will port more of NumPy and Pandas methods into. I also saw some works towards integrating it with Scikit-Learn, XGBoost and TensorFlow. You should definitely check this out when you will be considering buying more RAM next time ;).

No comments:

Post a Comment