Automating The Pain Out Of Big Data Transformation

By Alex Woodie. This article originally appeared on Datanami, January 24, 2014.

Having a big data set is a bit like owning a big house. There are many more activities available to you and your friends, but at the end of the day, keeping the house clean and tidy is a major expense. What the world really needs right now is a way to automate the process of cleaning, normalizing, and preparing the big data set (the house) for the next analysis (party). Luckily, the big data community has responded with new offerings that take data transformations to new automated heights.

Data preparation is the bane of the average data scientist. “If you go talk to any data scientist and say ‘How much time do you spend in data prep?’ they’ll say, ‘Uh, that’s the worst part,’” says Prakash Nanduri, CEO and co-founder of Paxata and a veteran of the data transformation, ETL, and master data management (MDM) rackets. “The biggest challenge in any analytics exercise is not really doing the analytics, but in preparing the data for the analytics.”

Paxata is one of a handful of companies helping to push the envelope of data transformation in our new big data world. The old ETL-based approach, created to push data from transactional ERP systems into data warehouses and data marts, is becoming a major impediment to productivity.

While big data platforms like Hadoop and NoSQL databases have given us amazing new capabilities for storing and making sense of data, those big data platforms have done little to solve the elementary data transformation issues facing us today. In some ways, Hadoop and NoSQL databases may have even exacerbated the problem because of the perception that big data has been tamed, so why not collect more of it? If anything, data scientists and data analysts are struggling even more under the weight of big data and the demands to crunch it, crack it, and get something useful out of it.

The foundational problem is that data transformation is still largely a manual process. “Today when you go out and talk to people whose job is to work with data every day, they’ll tell you they spend 50, 60, 70, often 80 percent of their time doing tasks such as data munging or data wrangling,” says Joe Hellerstein, CEO and co-founder of Trifacta. “These are the tasks we’re trying to make much more efficient.”
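To make the “data munging” Hellerstein describes concrete, here is a minimal, hypothetical sketch of the kind of manual cleanup that eats those hours: three records for the same company arrive with inconsistent names, currency formats, and date formats, and each quirk must be hand-coded away before any analysis can begin. The field names and formats are invented for illustration.

```python
import re

# Hypothetical raw records: same company, three inconsistent encodings.
raw_records = [
    {"name": "  Acme Corp ", "revenue": "1,200", "date": "2014-01-24"},
    {"name": "acme corp",    "revenue": "$950",  "date": "01/24/2014"},
    {"name": "ACME CORP.",   "revenue": None,    "date": "2014/01/24"},
]

def clean(record):
    # Canonicalize the company name: drop punctuation, trim, lowercase.
    name = re.sub(r"[^\w\s]", "", record["name"]).strip().lower()
    # Strip currency symbols and thousands separators; keep missing as None.
    rev = record["revenue"]
    revenue = float(re.sub(r"[$,]", "", rev)) if rev else None
    # Normalize the date formats above to ISO 8601 (YYYY-MM-DD).
    date = record["date"].replace("/", "-")
    if re.match(r"\d{2}-\d{2}-\d{4}$", date):  # MM-DD-YYYY -> YYYY-MM-DD
        m, d, y = date.split("-")
        date = f"{y}-{m}-{d}"
    return {"name": name, "revenue": revenue, "date": date}

cleaned = [clean(r) for r in raw_records]
```

Every new data source adds more one-off rules like these, which is exactly the hand-coding that Paxata and Trifacta aim to automate.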

To the Algorithms!

Companies like Paxata and Trifacta are on the cutting edge of data transformation. The two Bay Area outfits were founded with the principle that there has to be a better way to clean, normalize, and otherwise prepare data for analytics. And both companies are exploring innovative use of visual displays and machine learning algorithms to help automate the data preparation process.

“We’re leveraging powerful algorithms and distributed computing techniques to allow the system to automatically detect relationships across these big data sets, and actually merge these data sets automatically in an interactive fashion with the business analysts, rather than having an IT person do it,” Paxata’s Nanduri says.

Paxata’s cloud-based offering uses machine learning algorithms from consumer search and social media, including indexing, textual pattern recognition, and statistical graph analysis technologies. The technique allows Paxata to build a data model that is able to deduce semantic and syntactic connections among different pieces of structured and semi-structured data. As more data is added to the model, the transformations get better, Nanduri says, eliminating much of the manual coding and stitching together of data sets that so often precedes the analytics session.
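The article doesn’t disclose Paxata’s actual algorithms, but one simple way a system could “detect relationships across data sets” is to score every pair of columns from two tables by the overlap of their values (Jaccard similarity) and propose the best-scoring pair as a join key. The sketch below illustrates that idea only; the table and column names are hypothetical.

```python
def jaccard(a, b):
    # Jaccard similarity: |intersection| / |union| of two value sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def propose_join_keys(table1, table2):
    """Rank column pairs from two {column: values} tables by value overlap."""
    scores = [
        (jaccard(v1, v2), c1, c2)
        for c1, v1 in table1.items()
        for c2, v2 in table2.items()
    ]
    return sorted(scores, reverse=True)

# Two toy tables that share customer IDs under different column names.
customers = {"cust_id": ["C1", "C2", "C3"], "region": ["east", "west", "east"]}
orders    = {"customer": ["C2", "C3", "C4"], "amount": [10, 25, 40]}

# The top-ranked pair is the proposed join key.
best_score, left, right = propose_join_keys(customers, orders)[0]
```

Here `cust_id` and `customer` share two of four distinct values, so that pair scores highest and would be suggested to the analyst for confirmation, matching the interactive, analyst-in-the-loop workflow Nanduri describes.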

Paxata conducted a “soft launch” at last fall’s Strata + Hadoop World event in New York, and counts blue chip names like Dannon and Pabst Brewing Company as early customers. The goal is to allow data scientists and others who work with data to spend 20 percent of their time cleaning, pruning, and otherwise massaging the data, and 80 percent actually analyzing it from their BI tool.

“These are the people who have been fundamentally underserved,” Nanduri says. “They’ve been struggling with products like Excel or maybe they’ve started to use products like Tableau and they’re really loving it. But they’re having a huge problem with data preparation and that’s where we come in. For those types of individuals, we significantly increase their productivity. They don’t need to worry about munging data. That’s what we’re solving for them.”

A Transformational Experience

Trifacta is also tackling the problem of data cleansing and transformation with the power of advanced machine learning algorithms. But with its as-yet unannounced product, the company will also be delivering a highly visual approach to the problem.

“We really believe you have to have three branches of technology to solve this problem,” Hellerstein says. These include a scalable data infrastructure, a good human-computer interface, and finally machine learning. Bringing these three elements together on a big data platform such as Hadoop will allow Trifacta’s software to automate much of the grunt work currently involved in data transformation.

“It gets people out of the bits and bytes of writing code at the low level and raises them up into a visual domain,” Hellerstein says. “The approach is to look at where the user bottlenecks are and change the way they interact with data. So we’re trying to change that interface, change that interaction experience when working with data.”

The graphical experience will be absolutely central to Trifacta’s approach. Considering that some of the co-founders of Trifacta came out of the Stanford University analytics project that yielded the VizQL technology behind Tableau Software’s incredibly successful product, that shouldn’t be surprising.