Recently, we were asked to perform a short data analysis for NMBS, the Belgian railway company.
Over the course of a week, they wanted us to investigate two related datasets to help them understand their data and identify missing observations.
In essence, this seems like a simple, straightforward task. However, we only had one week to understand and analyse quite a lot of data (a couple of million rows per dataset), while also ensuring that the analysis could be reproduced and presented by the client in the future, so the challenge ahead was not to be underestimated.
In this blog post, we’d like to offer some tips and tricks for investigating large datasets in a short amount of time. Hopefully, these will help you make sense of your own datasets when time is of the essence.
Getting value from large datasets…
Take time to set up your workflow correctly from the beginning
It’s always a good idea to set up your permanent workflow from the start, even before exploring the data. This is especially the case when handling larger datasets, since small changes in the data source can make you rerun your entire analysis, which could take ages if you didn’t configure your workflow correctly.
Instead of diving head-first into the exploration phase, ask yourself a couple of questions first:
- How much data do I have?
- How will I present my results?
- Does my analysis need to be reproducible?
If you have a lot of data and your analysis needs to be reproduced by other people in the future, then it doesn’t make a lot of sense to start exploring the data in notebooks on your local machine. Since you’ll need to move to a cloud solution to handle the full dataset later anyway, why not do so right away? This will save you from future headaches when that ‘one-time’ analysis you performed turns out to be really interesting to the business and you need to redo it multiple times on a new data source.
For the NMBS case, we set up a pipeline on day one that read the data from Azure Blob Storage into Databricks and wrote its results back to the Blob Storage. Azure Data Factory would run this pipeline each time a new file with a specific name (e.g., ‘input’) was added to the Blob Storage. We hooked a Power BI dashboard up to our output files, which helped us visualize our first findings during the exploration phase and was also used at the end to present our final conclusions.
Setting up this simple pipeline before doing any analysis meant that we would not have to redo our analyses manually, boosted our productivity, and ensured that the client would be able to run the full analysis themselves on new data by simply uploading new ‘input’ files and inspecting the resulting graphs in Power BI.
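As a minimal sketch of what such a pipeline step might look like: in the real setup, the input and output locations point at mounted Azure Blob Storage containers and Azure Data Factory triggers the job when a new ‘input’ file lands. The column names and the aggregation below are made up for illustration; here the file contents are kept in memory so the sketch is self-contained.

```python
import pandas as pd
from io import StringIO

def run_pipeline(input_csv: str) -> pd.DataFrame:
    """Read a raw 'input' file, aggregate it, and return the summary that
    would be written back to Blob Storage for the Power BI dashboard.
    (Hypothetical columns: 'station' and 'train_id'.)"""
    raw = pd.read_csv(StringIO(input_csv))
    # Example transform: count observed trains per station.
    summary = raw.groupby("station", as_index=False).agg(
        observations=("train_id", "count")
    )
    return summary

# A tiny stand-in for a new 'input' file dropped into Blob Storage.
sample = "station,train_id\nBrussels,1\nBrussels,2\nGhent,3\n"
print(run_pipeline(sample))
```

Because the transform lives in one function, rerunning the whole analysis on a new data source is just a matter of pointing the pipeline at the new file.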
Even though it might seem like a good idea to show some interesting data tidbits as soon as possible, taking time out of the first day of the project to correctly set up your workflow is never a bad idea.
Focus on specific examples
When faced with large volumes of data, it’s often easiest to make sense of it by focussing on specific examples. If we’re investigating missing observations between related datasets, simply summarizing the amount of missing information (e.g., ‘15% of the data is missing’) is usually not very helpful. To really understand why some data is missing, we investigate examples of both problematic and non-problematic observations.
In our case, we initially thought we had found quite a lot of missing observations between our two datasets. However, by focussing on specific examples and discussing them with the business, we concluded that many of these could be considered ‘missing but non-problematic’, and they were excluded from the final analysis.
Narrowing the analysis down to genuinely problematic cases could not have been done without investigating specific examples.
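The approach above can be sketched as an anti-join: flag the rows of one dataset that have no counterpart in the other, then pull out those concrete rows to discuss with the business. The two datasets and their columns below are hypothetical placeholders, not the actual NMBS data.

```python
import pandas as pd

# Hypothetical related datasets: planned train runs vs. recorded observations.
planned = pd.DataFrame({"train_id": [1, 2, 3, 4],
                        "route": ["A", "B", "C", "D"]})
observed = pd.DataFrame({"train_id": [1, 3],
                         "delay_min": [0, 5]})

# Left join with an indicator column: rows marked 'left_only' are planned
# runs with no matching observation.
merged = planned.merge(observed, on="train_id", how="left", indicator=True)
missing = merged[merged["_merge"] == "left_only"]

# The summary figure alone ("50% missing") says little on its own...
missing_rate = len(missing) / len(planned)

# ...the specific examples are what you bring to the business.
print(missing[["train_id", "route"]])
```

From here, each flagged row can be discussed individually and, where appropriate, reclassified as ‘missing but non-problematic’ before the final analysis.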
…In a short amount of time
Specific questions lead to achievable goals
Usually, we prefer to keep the project questions and goals open-ended in the beginning. Getting pinned down to a specific pathway before you even know your data could close off interesting avenues right away. However, when time is a limited resource, setting precise goals for the week will speed up your project considerably. This does mean that you might not be able to explore some other data insights further, but you can always document these unexplored pathways for future analyses. Data analyses that only take a couple of days are often meant as groundwork for other projects anyway, so it’s OK if you aren’t able to squeeze every interesting piece of knowledge out of them right away. Focussing on a few key insights from the start will provide more value to the business in this case.
For NMBS, we decided together to concentrate on six key questions for which we presented the conclusions at the end of the week.
Communication is (even more) key
You should communicate consistently with the client to ensure that your goals stay aligned. This is true for larger projects, but even more so for small ones. If you don’t have a lot of time, you might feel tempted to skip daily discussions, since you’d rather spend the little time you have on the analysis itself. However, if your communication is not clear, you could be wasting time on analyses that are of little interest to the business. That is precious time that could have been better spent elsewhere, such as on a quick call with the client to ensure that your next steps are the right ones for the project.
In our case, we decided to have a quick call at least once a day and kept in close contact with the client throughout the day as well. This ensured that we did not waste valuable hours of our time on parts of the analysis that were of lesser importance.
Documentation is (still) key
Documentation is key in every project but is often the first thing to go out the window under time pressure. This is always a mistake: in large projects it becomes difficult to collaborate with other partners and to keep track of previous developments, while in smaller projects the methodology you used will be forgotten as soon as the project is finished. Not documenting your methodology and results in a smaller project means your work will have been for nothing if the business decides that the analysis was interesting enough to merit related follow-up projects. In that case, taking the time to document everything you did will be much more valuable to the business than using that time for some extra analyses.
In the NMBS case, we used the last half-day of the week to document everything we did during the preceding four and a half days. Even though this means spending ten percent of our total worktime simply on writing out the methodology, the entire analysis would have been practically worthless after the final presentation if we had not done this. This way, the client or other analysts can easily continue the project where we left off, using the methodology and findings we already gathered for them.
In need of a data analysis yourself?