
Getting value from large datasets in a short amount of time

June 1st, 2022
DATA ANALYSIS

One week, millions of rows, six key questions. We were asked to perform a short data analysis for NMBS, the Belgian railway company. Over one week, we investigated two related datasets (several million rows each) to help them understand their data and identify missing observations. The client also had to be able to reproduce and present the analysis afterward. Here are our tips for investigating large datasets when time is limited.

Set up your workflow correctly from the start

Before exploring data, ask: how much data do I have? How will I present results? Does the analysis need to be reproducible? For NMBS, we set up a pipeline on day one: data flowed from Azure Blob Storage into Databricks, results were written back to Blob Storage, and Azure Data Factory triggered a run whenever new files arrived. A Power BI dashboard sat on top for both exploration and final presentation. This setup made the analysis reproducible, boosted our productivity, and let the client run future analyses by simply uploading new files.
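
To make the "upload a file, get a result" idea concrete, here is a minimal sketch in plain Python. It is not the pipeline we built for NMBS: local folders stand in for Blob Storage, a simple per-column count of empty cells stands in for the real analysis, and the names `summarize` and `run_pipeline` are invented for illustration.

```python
import csv
import json
from pathlib import Path


def summarize(csv_path: Path) -> dict:
    """Count rows and empty cells per column in one CSV file."""
    with csv_path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        missing = {name: 0 for name in columns}
        rows = 0
        for row in reader:
            rows += 1
            for name in columns:
                # Treat absent, empty, and whitespace-only cells as missing.
                if not (row.get(name) or "").strip():
                    missing[name] += 1
    return {"file": csv_path.name, "rows": rows, "missing_per_column": missing}


def run_pipeline(input_dir: Path, output_dir: Path) -> list[Path]:
    """Summarize every new CSV in input_dir; return the summaries written."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(input_dir.glob("*.csv")):
        target = output_dir / f"{csv_path.stem}_summary.json"
        if target.exists():  # already processed on an earlier run
            continue
        target.write_text(json.dumps(summarize(csv_path), indent=2))
        written.append(target)
    return written
```

Because files that already have a summary are skipped, running the function again after a new upload only processes the new file, which is roughly what a file-arrival trigger gives you in a hosted setup.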

Focus on specific examples

When faced with large volumes, focus on specific examples rather than just summaries. Simply saying '15% of data is missing' isn't very helpful. We investigated examples of both problematic and non-problematic observations and discussed them with the business. That is how we discovered that many initially flagged cases were 'missing but non-problematic', so we excluded them from the final analysis.
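
One way to pull such examples is to draw a small, reproducible sample from both the flagged and the unflagged records and bring those rows to the business. The sketch below assumes each record is a dictionary and that you supply your own rule for "missing"; the function name `sample_examples` and the field names in the usage lines are hypothetical.

```python
import random
from collections import defaultdict


def sample_examples(rows, is_missing, per_group=3, seed=0):
    """Return up to per_group example rows for flagged and unflagged records."""
    groups = defaultdict(list)
    for row in rows:
        groups["missing" if is_missing(row) else "complete"].append(row)
    rng = random.Random(seed)  # fixed seed keeps the examples reproducible
    return {
        label: rng.sample(members, min(per_group, len(members)))
        for label, members in groups.items()
    }


# Hypothetical records: every fourth one has no recorded delay.
records = [{"id": i, "delay": None if i % 4 == 0 else i} for i in range(20)]
examples = sample_examples(records, lambda row: row["delay"] is None)
```

Looking at a handful of concrete rows from each group makes it much easier to ask the business whether a missing value is a real problem or expected behaviour.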

Communication, specific goals, and documentation

With limited time, daily calls and close contact with the client prevent wasted hours on low-priority work. We agreed on six specific questions upfront and presented conclusions at the end of the week. We spent our last half-day on documentation alone. That was ten percent of our total working time, but without it the entire analysis would have been practically worthless for future use.
