Things we learned integrating Data Exploration with D-Tale


In the last project I worked on, we needed to provide the client with different types of analyses to detect anomalies in an existing application. Those analyses did their job just fine, but what if the data evolved and new types of anomalies emerged? What if the client just got curious and wanted to look at the data themselves, maybe to find a new type of anomaly without having to rely on an external team?

We decided the best solution to this problem was to provide a data exploration tool. Our requirements were:

The client must be able to visualize data coming from different tables of our database
The client should be able to display it in multiple ways (tables, charts, statistics… the more the better!)
It should integrate easily with our Django backend
It must not take a lot of development time

We looked at the classics (pandas, matplotlib, seaborn, etc.), but these solutions required writing code. The process would have been too cumbersome to be useful: write code, rebuild the Docker image, deploy to the Kubernetes cluster, explore your data, conclude that this is not exactly what you wanted to see, rinse and repeat…

We obviously needed something more dynamic. Perhaps a JS charting library on our frontend, with custom API endpoints for querying the data? Well, if you can support custom chart types and build different endpoints with different queries for different tables, all in a few days of work, contact us and we’ll probably hire you ;). The project was almost at its end, and that was too much work for the time we had left.

Finally, we found D-Tale.


For more information: https://towardsdatascience.com/introduction-to-d-tale-5eddd81abe3f

From the project’s GitHub:

D-Tale is the combination of a Flask back-end and a React front-end to bring you an easy way to view and analyze Pandas data structures. It integrates seamlessly with ipython notebooks and python/ipython terminals. Currently, this tool supports Pandas objects such as DataFrame, Series, MultiIndex, DatetimeIndex & RangeIndex

It provides all the features we needed and much more. This article will not list or present D-Tale’s features; there are already many articles doing that.

I was tasked with integrating D-Tale into our solution. I developed a page where the user goes through a short process: select the data to be displayed, pick an algorithm that structures the data in a certain way, and finally spawn a D-Tale server.

[Figure: Select the data you want to explore]

I was surprised by how simple it was. Once I had the dataframe ready, it only took me one line of Python to start D-Tale, and a few more to keep track of all the running instances.

[Figure: The actual line of code used]
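The original screenshot is not available in this text, so what follows is only a sketch of how such a call and the bookkeeping around it could look, not the project's actual code. `dtale.show` is D-Tale's documented entry point and returns an object with a `kill()` method; the `InstanceRegistry` class, its method names and the injectable `starter` argument are hypothetical names introduced for illustration.

```python
class InstanceRegistry:
    """Keeps track of started exploration sessions by label (hypothetical helper)."""

    def __init__(self, starter=None):
        # `starter` is any callable that takes a dataframe and returns an
        # object with a kill() method. It defaults to D-Tale.
        self._starter = starter or _start_with_dtale
        self._instances = {}

    def start(self, label, dataframe):
        """Start a session for `dataframe` and remember it under `label`."""
        instance = self._starter(dataframe)
        self._instances[label] = instance
        return instance

    def labels(self):
        """Return the labels of all sessions this registry knows about."""
        return sorted(self._instances)

    def stop(self, label):
        """Shut down the session stored under `label` and forget it."""
        self._instances.pop(label).kill()


def _start_with_dtale(dataframe):
    # Imported lazily so the registry itself can be used without D-Tale installed.
    import dtale

    # The "one line": start a D-Tale session for a pandas DataFrame.
    return dtale.show(dataframe)


# Usage, assuming pandas and dtale are installed and `df` is a DataFrame:
#   registry = InstanceRegistry()
#   session = registry.start("orders", df)
#   registry.labels()   # labels of the running sessions
#   registry.stop("orders")
```

Passing the starter in as an argument is a design choice made here so the bookkeeping can be exercised with a stand-in object instead of a real D-Tale server.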

Everything was working fine. The next step was to merge my branch and go through the CI pipeline. I was done, in less time than I had planned… or at least I thought I was. We noticed that sometimes the instances did not appear in our list. When we refreshed the page, they would show up or vanish, seemingly at random… Weird.

We figured out there was a difference between the local environment (where we start Django directly) and the staging environment (where we serve our Django app through gunicorn and nginx). We had multiple gunicorn processes in staging and each D-Tale instance was bound to a randomly chosen one (so we didn’t see it if the next request went through a different process).
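The effect can be reproduced with a small simulation. This sketch involves neither gunicorn nor D-Tale: `Worker` and `RoundRobinBalancer` are hypothetical stand-ins, and the strict round-robin dispatch is a simplifying assumption (real worker selection is not that predictable). It only illustrates why state held in one process's memory is invisible to requests served by another.

```python
import itertools


class Worker:
    """Stands in for one worker process with its own private memory."""

    def __init__(self, name):
        self.name = name
        self.instances = {}  # process-local state, invisible to other workers

    def start_instance(self, instance_id):
        self.instances[instance_id] = f"started by {self.name}"

    def list_instances(self):
        return sorted(self.instances)


class RoundRobinBalancer:
    """Hands each incoming request to the next worker in turn."""

    def __init__(self, workers):
        self._cycle = itertools.cycle(workers)

    def next_worker(self):
        return next(self._cycle)


workers = [Worker("worker-1"), Worker("worker-2")]
balancer = RoundRobinBalancer(workers)

# Request 1: "start an instance" lands on worker-1.
balancer.next_worker().start_instance("exploration-a")

# Request 2: "list instances" lands on worker-2, which knows nothing about it.
seen_by_second_request = balancer.next_worker().list_instances()

# Request 3: the same listing lands on worker-1 again, and the instance is back.
seen_by_third_request = balancer.next_worker().list_instances()

print(seen_by_second_request)  # []
print(seen_by_third_request)   # ['exploration-a']
```

With a single process, as in the local environment, every request reaches the same `Worker`, so the list is always complete and the problem never shows up.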

So I read everything I could in the project’s documentation. It said specifically that we needed to set up a cache to share the data between the processes. Ah! That must be it. After a few more hours spent setting it up, it still didn’t work. I felt totally hopeless: the feature was not usable, and I had no lead on how to fix the problem.

I decided to open an issue on GitHub and ask directly, even though I didn’t really expect a result. But it turned out that Andrew Schonfeld (the author and main contributor) read my issue, took the time to understand what I wrote, and figured out the solution. Fifteen minutes later, everything worked as expected.

That’s how I felt. Thanks, Andrew!

I think there are two important things to take from this story.

First, it is very important to have a staging environment. Had we not set up this environment to mirror production, the client would only have seen a broken feature. Here’s an article going more in depth about this subject.

Second, I’m glad we chose a well-maintained open-source project. The solution to our problem was simple, but on my own it could have taken me a week to find out where the problem was coming from.

In the end, we’re glad we chose D-Tale. It handles our data exploration needs perfectly and was easy to integrate.

Want to know more? ✏ Contact us (in French, or not)

Lucas Bergognon is a fullstack developer at Scalian. Working in the AI team, he typically works on frontend, backend and devops.