By Sean Kandel. This article originally appeared in Harvard Business Review, April 1, 2014.
As organizations collect increasingly large and diverse data sets, the demand for skilled data scientists will continue to rise. In fact, Harvard Business Review dubbed data scientist “The Sexiest Job of the 21st Century.”
Unfortunately, the day-to-day reality of the role doesn’t quite match the romanticized version.
In 2012, my colleagues and I began taking a closer look at the hands-on experience of data scientists. At Stanford, I conducted 35 interviews with data analysts from 25 organizations across a variety of sectors, including health care, retail, marketing, and finance. Since then I’ve spoken with another 200 to 300 analysts. What we found was that the bulk of their time was spent manipulating data: a mix of data discovery, data structuring, and creating context.
In other words, most of their time was spent turning data into a usable form rather than looking for insights.
Granted, this stems from a positive shift in analytics. Whereas companies once maintained tight control over data warehouses, the drive for data-driven decision-making is now pushing them toward more agile analytic environments, and that shift demands a different type of work. Data quality is no longer about a single central truth; it depends on the goal of the analytic task at hand. Exploratory analysis and visualization require that analysts be able to fluidly access disparate sources of data in various formats.
The problem is that most organizations aren’t set up to do this. In traditional data warehousing environments, IT teams structure data and design schemas when the data is loaded into the warehouse, and they are then largely responsible for enforcing strict data quality rules. While this upfront design and structuring is costly, it worked fairly well for years. Now that companies are dealing with larger and more complex data sets, however, this old way of managing data is impractical.
To keep pace, most organizations are now storing raw data and structuring it on demand. Schemas and relationships between data sets are derived at the time of use instead of at the time of load. This shift gives data analysts more flexibility to find unexpected insights, but it also places the time-consuming onus of discovery, structuring, and cleaning solely on them.
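In practice, structuring on demand often looks something like the rough sketch below: raw, newline-delimited JSON logs are read as-is, and the fields that make up the schema are discovered only when an analyst actually pulls the data. This is a minimal illustration in Python, not a prescription; the file name and log format are hypothetical.

```python
# Rough sketch of schema-on-read: structure is imposed when the data is used,
# not when it is stored. The file name and log format are hypothetical.
import json

def load_events(path):
    """Read newline-delimited JSON logs and derive a schema at time of use."""
    events, fields = [], set()
    with open(path) as f:
        for line in f:
            record = json.loads(line)       # raw records, no upfront schema
            fields.update(record.keys())    # the schema emerges from the data itself
            events.append(record)
    return events, sorted(fields)

events, schema = load_events("clickstream.log")
print("Fields discovered at read time:", schema)
```

The flexibility is real, but so is the cost: every analyst who touches the raw logs repeats some version of this discovery work.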
Indeed, in our 2012 study of data analysts, we characterized the process of data science as five high-level tasks: discovery, wrangling, profiling, modeling, and reporting. Most analytic and visualization tools focus on the last two phases of this workflow. Unfortunately, most of a data scientist’s time is spent on the first three stages.
These three involve finding data relevant to a given analysis task, formatting and validating data to make it palatable for databases and visualization tools, diagnosing data for quality issues, and understanding features across fields in the data. In these phases, data scientists encounter numerous challenges, including data sets that contain missing, erroneous, or extreme values. The work often requires writing idiosyncratic scripts in programming languages such as Python and Perl, or extensive manual editing in tools like Microsoft Excel. If quality issues go uncaught, the assumptions built into an analysis can turn out to be wrong or misleading: poor data quality is the primary reason for 40% of all business initiatives failing to achieve their targeted benefits.
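To give a flavor of this work, the sketch below shows what one of those idiosyncratic scripts might look like: a quick profiling and cleaning pass in Python with pandas. The file, column names, and outlier thresholds are all hypothetical, and real scripts are rarely this tidy.

```python
# Minimal sketch of a one-off profiling/cleaning script (pandas assumed available).
# The file name, column names, and thresholds are hypothetical.
import pandas as pd

df = pd.read_csv("transactions.csv")          # raw export, no enforced schema

# Profile: how much of each column is missing?
print(df.isna().mean().sort_values(ascending=False))

# Validate: coerce fields to expected types, turning bad values into NaN
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Diagnose extreme values: flag amounts far outside the interquartile range
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 3 * iqr) | (df["amount"] > q3 + 3 * iqr)]
print(f"{len(outliers)} suspicious rows out of {len(df)}")

# Structure: drop rows that failed validation before handing off to analysis tools
clean = df.dropna(subset=["amount", "order_date"])
clean.to_csv("transactions_clean.csv", index=False)
```

Multiply scripts like this across every new data source and every new question, and the time sink becomes obvious.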
Because of this, the skills of talented data scientists are often wasted as they become bogged down in low-level data cleansing tasks or held up when they cannot quickly access the data they need. This creates a huge bottleneck, stalling the flow of data from stores like Hadoop to the analytic tools that yield insight. Data cleansing and preparation can consume 50% to 80% of the development time and cost in data warehousing and analytics projects.
Instead of solving these problems, organizations are often adding to the amount of data that requires a data scientist’s attention. Through activity and system logs, third-party APIs and vendors, and other publicly available sources, companies have access to an increasingly large and diverse set of data. But without the right systems in place, the prohibitive cost of data manipulation leaves much of this data dormant in “data lakes.”
And as data analysis becomes a core business function for many departments, skilled analysts and members of IT spend large chunks of time helping others access the data they need via low-level programming instead of doing any analysis themselves.
According to Gartner, 64% of large enterprises plan to implement a big data project in 2014, but 85% of the Fortune 500 will be unsuccessful in doing so. These time-consuming data preparation tasks are largely to blame. Not only do they throttle individual data scientists, but they greatly decrease the probability of success for big data initiatives.
If we ever hope to take full advantage of big data, data preparation will need to be elevated out of the manual, cumbersome tasks that currently make up the process. Data scientists must be able to transform data with agility, not just manually prepare it for analysis. Domain experts will need to explore deeper relationships between data sets without prolonged hand-offs to IT programmers or data analysts diluting the data along the way.
Ultimately, the goal of data analysis is not simply insight but improved business process. Successful analytics can lead to product and operational advancements that drive value for organizations, but not if the people charged with working with data aren’t able to spend more of their time finding insights. If data analysis ever hopes to scale at the rate of technologies for storing and processing data, the lives of data scientists are going to need to get a lot more interesting.