A software solution that helps data analysts and data scientists clean, combine, format, map, and enrich data for analysis or machine learning.
The three primary functions of data preparation tools are discovery, to understand data content; transformation, to reorganize or restructure data; and governance, to keep data useful and trusted. Data discovery processes find meaning, patterns, and specific items in a collection of data. Data transformation processes improve, enrich, combine, format, restructure, or otherwise shape and organize data for a specific use. Governance processes perform data validation and data protection functions and manage the metadata needed to know and trace data lineage.
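The three functions described above can be illustrated with a minimal Python sketch. It is not how any particular data preparation tool works; the sample records, the field names, and the `discover`, `transform`, and `govern` helpers are all hypothetical and chosen only to show one small instance of each function.

```python
# Hypothetical sample records; the field names are illustrative assumptions.
records = [
    {"id": 1, "country": "us", "amount": "10.5"},
    {"id": 2, "country": "US", "amount": "7"},
    {"id": 3, "country": None, "amount": "abc"},
]


def discover(rows):
    """Discovery: profile each column (missing count, distinct non-missing values)."""
    profile = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        profile[col] = {
            "missing": sum(v is None for v in values),
            "distinct": len({v for v in values if v is not None}),
        }
    return profile


def transform(rows):
    """Transformation: normalize country codes and parse amounts to float."""
    out = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except (TypeError, ValueError):
            amount = None  # unparseable amounts become missing values
        country = r["country"].upper() if r["country"] else None
        out.append({"id": r["id"], "country": country, "amount": amount})
    return out


def govern(rows, source):
    """Governance: keep only valid rows and record simple lineage metadata."""
    valid = [
        r for r in rows
        if r["country"] is not None and r["amount"] is not None
    ]
    lineage = {
        "source": source,
        "steps": ["discover", "transform", "validate"],
        "rows_in": len(rows),
        "rows_out": len(valid),
    }
    return valid, lineage


profile = discover(records)
clean = transform(records)
valid, lineage = govern(clean, source="example_records")
```

Here `profile` shows that `country` has one missing value and two distinct spellings, which is the kind of finding that motivates the normalization in `transform`. The `lineage` dictionary stands in for the metadata a real tool would manage to trace where the data came from and what was done to it.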
Data preparation is, of course, at the heart of this great industry of data (re-)generation. From its original role in populating data warehouses and marts, data preparation is now spawning tools to fill data lakes and distribute their data goodness ever further. However, the volumes of data involved today raise serious questions about the validity of an architecture based firmly on creating ever more copies of data, especially of social media and IoT data. This simple observation was one driver of the pillared data architecture I proposed in Business unIntelligence, where these voluminous data types are stored only once—if at all—within the enterprise. In addition, for reasons of timeliness and agility, the same architecture posits the need to reduce the number of copies of traditional business data as well.
Data Preparation. Although a data catalog makes it easy to find relevant data, it doesn’t help data analysts clean, format, combine, derive, or aggregate data. That’s the job of the data preparation tool, the last component of the data analyst workbench. Data preparation tools go well beyond what Excel offers; most are designed to handle big data, meaning large volumes of multi-structured data, and to support many concurrent users. Most importantly, data preparation tools keep a visual audit trail of data manipulations, allowing data analysts to edit or reuse an existing analysis that they or a colleague created.
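The idea of an audit trail that can be edited or reused can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PrepRecipe` class and its methods are hypothetical, and real tools present the trail visually rather than as a list of dictionaries.

```python
class PrepRecipe:
    """Hypothetical recipe that records each manipulation so it can be replayed."""

    def __init__(self):
        self.steps = []  # ordered list of (description, function) pairs

    def add(self, description, func):
        """Append a named step; returns self so calls can be chained."""
        self.steps.append((description, func))
        return self

    def run(self, rows):
        """Apply every step in order and log row counts before and after each."""
        log = []
        for description, func in self.steps:
            before = len(rows)
            rows = func(rows)
            log.append(
                {"step": description, "rows_in": before, "rows_out": len(rows)}
            )
        return rows, log


recipe = (
    PrepRecipe()
    .add("drop rows without a region",
         lambda rows: [r for r in rows if r.get("region")])
    .add("uppercase region",
         lambda rows: [{**r, "region": r["region"].upper()} for r in rows])
)

data = [
    {"region": "east", "sales": 3},
    {"region": "", "sales": 5},
    {"region": "west", "sales": 2},
]
result, audit_log = recipe.run(data)

# A colleague can reuse the same recipe, unchanged, on a different dataset.
reused, _ = recipe.run([{"region": "north", "sales": 1}])
```

Because the steps are stored as data rather than applied once and forgotten, an analyst can inspect `audit_log`, remove or reorder entries in `recipe.steps`, and rerun the recipe, which is the editing and reuse behavior the paragraph describes.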