Data exploration is the iterative process of uncovering hidden insights in data. Good data exploration unfolds almost like an interview between the analyst and her data. The analyst sets out to answer a broad question of interest, asking successively more specific questions as she goes. Here’s an example “conversation” in which an analyst uses data to try to answer this question: “Why is our churn from last month so much higher than anticipated, and what should we do about it?”
It turns out that the CEO of this company had asked that push notifications be turned off because one of the company’s investors found them annoying. He asked the product team to remove the feature without studying the impact that decision would have on churn rate and ended up causing the company to miss its quarterly churn goal as a result!
This is a very typical data exploration process. The analyst is asked a broad question about the business and needs to both come up with theories and test those theories against the data. This process requires a significant amount of creativity, context, and rigor on the part of the analyst.
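To make the "form a theory, then test it against the data" loop concrete, here is a minimal Python sketch of one step in the churn conversation: comparing churn rates for users with and without push notifications. The records, field layout, and function name are all hypothetical and invented for illustration; in practice the rows would come from a query against your own warehouse.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only:
# (user_id, push_enabled, churned_last_month)
USERS = [
    (1, True, False),
    (2, True, False),
    (3, True, True),
    (4, True, False),
    (5, False, True),
    (6, False, True),
    (7, False, False),
    (8, False, True),
]


def churn_rate_by_segment(users):
    """Return {push_enabled: churn_rate} for the given user records."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for _user_id, push_enabled, did_churn in users:
        totals[push_enabled] += 1
        if did_churn:
            churned[push_enabled] += 1
    # Each segment's churn rate is churned users divided by all users in it.
    return {segment: churned[segment] / totals[segment] for segment in totals}


if __name__ == "__main__":
    for segment, rate in churn_rate_by_segment(USERS).items():
        label = "push on" if segment else "push off"
        print(f"{label}: {rate:.0%} churn")
```

A gap between the two segments does not prove the theory on its own, but it tells the analyst which more specific question to ask next.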
Data exploration is different from KPI reporting. KPI reporting is a broad look into the health of a business’s KPIs and how they are trending over time. There's more to know about KPI reporting, but for now, let’s dig deeper into the tools used for data exploration.
If you’ve made it this far, congratulations: you have the keys to the city. Your data has been aggregated into a single central repository, updates automatically as your business evolves, and is housed in a platform that can be queried using SQL, the dominant language of data.
Amazingly, however, all of your labor thus far has just been table stakes. You haven’t actually derived any real value from your data yet; you’ve just been doing the plumbing! Now it’s time to reap the fruits of your labor. In this section, we explore the many types of tools that you can attach to your data warehouse in order to run a more data-driven business. If used correctly, these tools can influence product development, customer acquisition, cost management, hiring, fundraising, and much, much more. Let’s dig in.
Statistical programming is where analytics all began. SQL queries are great, but when you’re getting your hands dirty with a serious analysis, there is no shortage of scenarios where they fall short:
Statistical scripting languages like R, SAS, and MATLAB provide all these advantages and more. No self-respecting data scientist made it through school without using them extensively, and for many they are still the fastest, most flexible, and most powerful way to answer data-driven questions.
That said, these tools are for serious analysts only. You need to learn the syntax, understand how to correctly interpret the statistical models at work, and follow good programming practices if you want others to collaborate on or interpret your work. And don’t count on the output winning any design awards: these tools can provide functional visualizations, but the appeal of the output stops at functional.
These tools are powerful weapons for powerful people, but they don’t solve everyone’s problems when it comes to data. That’s where the rest of the field comes in.
These tools take data analysis a step further by providing a more user-friendly interface for exploring, querying, and visualizing data:
Some of these tools also include basic dashboarding functionality so that you can present multiple analyses within a single view.
Tools like these are the most common complement to a robust data pipeline like the one described in earlier sections of this document. However, unlike full-stack business intelligence tools (described next), these querying and visualization tools are only useful if you’ve already solved the problem of data consolidation and optimization. As a result, they tend to be better suited for analysts at organizations with a mature analytics function.
In today’s leading companies, data isn’t just for analysts anymore. A truly data-driven organization will find ways to make data accessible and explorable for every member of its team. This extends beyond reporting: BI should empower business users to ask and answer their own questions, even if they don’t have experience with data modeling or writing SQL. The analytical and organizational benefits of such a system include:
Most of the tools we’ve explored so far fit into a discrete workflow: the user has a question, the tool helps the user answer that question, new questions emerge, repeat. The tool in such a framework delivers value by minimizing the delay between asking and answering. But what if there were a tool that told you what question you should be asking in the first place? Enter machine learning.
Machine learning tools base their offerings on statistical methods like cluster analysis, correlation analysis, and time series analysis to identify noteworthy anomalies in your data, classify entities into related groups, recommend actions to take, and more.
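As a rough illustration of the "identify noteworthy anomalies" idea, the sketch below flags points in a time series that sit far from the series mean, measured in standard deviations (a z-score test). This is a deliberately simple stand-in chosen for illustration, not a description of how any particular product works; the function name, the threshold of 2.0, and the sample numbers are all assumptions.

```python
from statistics import mean, pstdev


def flag_anomalies(values, threshold=2.0):
    """Return indexes of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every value is identical, so nothing stands out.
        return []
    return [
        index
        for index, value in enumerate(values)
        if abs(value - mu) / sigma > threshold
    ]


# Hypothetical daily signup counts, invented for illustration only.
DAILY_SIGNUPS = [100, 102, 98, 101, 99, 180, 100]

if __name__ == "__main__":
    for index in flag_anomalies(DAILY_SIGNUPS):
        print(f"Day {index}: {DAILY_SIGNUPS[index]} signups looks unusual")
```

The value of a tool built on this kind of check is that it surfaces the unusual day without anyone having to ask about it first.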
For the most part, these methods aren’t new; they’ve been available for decades. But modern SaaS machine learning tools like IBM Watson, Mintigo, and Spinnakr are productizing these techniques to make them available to companies without teams of data science PhDs.
Here’s what makes these tools stand out:
People have a profusion of preferences when it comes to analytical tools. Some people prefer command line interfaces like R and Python. Some people prefer to lay data out visually in Excel. Some people like manipulating data in a GUI customized for the type of analysis they do most.
Sometimes these choices are optimal, but sometimes they’re just based on what a user happens to know. It’s important to respect these choices in either case, because the most important thing is that people use data to make decisions at all. Limiting their flexibility in terms of how exactly they go about doing that will only serve to prevent adoption—the worst possible outcome.
The good news is that these tools aren’t mutually exclusive: each one is a different way of exploring and leveraging your company’s data to learn new things and make decisions. Which tool is best for you is simply a function of the problems you want to solve, the people you have available, and the extent to which you want to get your hands dirty.
Today, visualization tools and full-stack BI solutions lead the way from an adoption perspective, but that is largely driven by the existing knowledge base and the products that exist in the market. In a world where the data scientist is the highest-growth job description on the planet, it wouldn’t be surprising to see existing and new tools built for this group gain market share.