Data Engineering for GenAI: Three Criteria to Evaluate Pipeline Tools
ABSTRACT: This blog explores three criteria to evaluate tools that manage unstructured data pipelines for GenAI.
Read time: 6 mins.
Sponsored by Datavolo
Generative AI initiatives push data engineers outside their comfort zone. Long accustomed to SQL tables and BI projects, they must learn to transform unstructured content—emails, documents, pictures, and the like—into trustworthy inputs for language models. Not easy! Fortunately, an emerging class of data pipeline tools can help.
This blog, the third and final in a series, explores three criteria to evaluate tools that manage unstructured data pipelines for GenAI. It presages a report and webinar that expand on these criteria, offering an “ultimate guide” to evaluating tools in this critical category. The first blog in our series defined why and how data engineers enable GenAI initiatives by building unstructured data pipelines. The second blog explored how data engineers mitigate the risks of GenAI by optimizing the data, optimizing the pipelines, and governing their environments.
Asking the hard questions
Data engineering leaders should evaluate unstructured data pipeline tools by their functional breadth, ease of use, and governance capabilities. Let’s explore each of these criteria and the primary questions to ask vendors.
Tool Evaluation Criteria
Criterion 1. Functional breadth
Data engineering leaders should evaluate tools’ functional breadth by asking about their ability to manage the full pipeline lifecycle; manage all the necessary ETL steps; and support their sources, targets, and data types.
Does this tool enable users to manage the full lifecycle of an unstructured data pipeline?
Data engineers and their colleagues design, test, deploy, observe, and orchestrate the pipelines that deliver data to GenAI applications. While the steps themselves are familiar, they have new requirements when it comes to unstructured data. Tools that consolidate most or all of these steps in a single interface can reduce complexity, training requirements, and risk.
Does this tool manage all the necessary ETL steps for unstructured data?
As described in our last blog, unstructured data pipelines perform the following steps. Assess whether pipeline tools support all these steps or force your team to purchase an additional tool.
Extract. First, the pipeline parses and extracts relevant text and metadata from applications and files, including complex documents with embedded figures and tables.
Transform. It divides the content into semantic “chunks” and uses an embedding model to create vectors that describe the meaning and interrelationships of those chunks. It might also enrich these documents with chunked content from additional sources.
Load. Finally, it loads these vectors, along with their chunks and metadata, into the vector database or vector-enabled data platform, where they are ready to support GenAI (a minimal end-to-end sketch follows below).
Architecture of an Unstructured Data Pipeline
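To make these three steps concrete, here is a minimal, illustrative sketch in Python. It is not any vendor's implementation: it assumes the pypdf and sentence-transformers libraries, uses naive fixed-size chunking rather than true semantic chunking, and stands in a plain Python list for the vector database.

```python
# Illustrative only: a bare-bones unstructured data pipeline.
# Assumes `pip install pypdf sentence-transformers`; the "store" below is a
# stand-in for whatever vector database client you actually use.
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

def extract(path: str) -> str:
    """Extract: parse raw text from a PDF document."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def transform(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Transform: split text into overlapping, fixed-size word chunks
    (a real tool would chunk on semantic boundaries instead)."""
    words = text.split()
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

def load(chunks: list[str], model: SentenceTransformer, store: list) -> None:
    """Load: embed each chunk and upsert it into the vector store."""
    vectors = model.encode(chunks)  # one embedding vector per chunk
    for chunk, vector in zip(chunks, vectors):
        store.append({"text": chunk, "embedding": vector.tolist()})

if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used model
    store: list = []                                  # stand-in for a vector DB
    load(transform(extract("contract.pdf")), model, store)
    print(f"Loaded {len(store)} chunks")
```

Everything a commercial tool adds (connectors, metadata capture, error handling, orchestration) layers on top of this skeleton.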
What sources, targets, and data types does it support?
GenAI thrives on a wide variety of data. Your pipeline tool should support all major sources of text, imagery, audio, and video, including popular applications such as Microsoft 365, Salesforce, GitHub, Slack, and Zoom, using open APIs and simple connectivity options.
It also should support common file formats such as the following:
Text: PDF, DOC and DOCX, HTML and HTM, and XLS and XLSX
Image: BMP, JPEG, GIF, PNG, and EPS
Audio: MP3, WAV, ALAC, and WMA
Video: MP4, MOV, WMV, MPEG-4, and AVI
In addition, it should support all major targets for GenAI environments. This includes vector databases such as Pinecone, Weaviate, and Vespa, as well as vector-enabled data platforms such as Databricks, MongoDB, and SingleStore. Probe vendors about their ability to integrate all the current and future sources, formats, and targets in your environment.
Criterion 2. Ease of use
Data engineering leaders should evaluate tools’ ease of use by asking about skill and training requirements, degree of automation, and level of effort.
What skills and how much training does this product require?
GenAI forces data engineers to learn how to handle unfamiliar elements such as unstructured data, chunking techniques, and embedding models. Evaluate pipeline tools based on their ability to make data engineers proficient in these areas without significant training.
What level of automation does it offer?
As always, the more you automate routine tasks, the more productive your team becomes. Look for pipeline tools that offer a graphical interface to minimize scripting and, better yet, a GenAI chatbot to minimize typing. Remember, of course, that expert users still must inspect and often revise tool outputs for quality.
What level of effort does it require to manage pipelines?
User productivity boils down to output per hour. Devise some simple, measurable output metrics—perhaps the number of sources connected, volume of data ingested, or pipelines created—and compare these metrics across homegrown and commercial tools. Test how these metrics vary by use case to ensure your team stays productive when requirements change.
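As a rough illustration of that comparison, the sketch below computes output-per-hour metrics for two hypothetical tools; the tool names and figures are invented purely to show the arithmetic.

```python
# Hypothetical trial results; every number here is made up for illustration.
trials = {
    "homegrown scripts": {"pipelines_built": 3, "sources_connected": 4, "hours": 40},
    "commercial tool":   {"pipelines_built": 9, "sources_connected": 7, "hours": 40},
}

for tool, t in trials.items():
    print(f"{tool}: {t['pipelines_built'] / t['hours']:.2f} pipelines/hour, "
          f"{t['sources_connected']} sources connected in {t['hours']} hours")
```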
Criterion 3. Governance
Finally, data engineering leaders should ask vendors about their tools’ governance capabilities, ability to make GenAI trustworthy, and ability to make pipelines explainable.
How does this tool support data and AI governance programs?
The risks of GenAI range from hallucinations to privacy breaches, biased outputs, and compromised intellectual property. A pipeline tool must reduce these risks by helping govern GenAI inputs. Look for a tool that organizes metadata, controls user access, masks sensitive data, and audits activities. For more detail on these governance measures, check out our earlier blog.
Metadata. Your tool should organize metadata that describes elements such as files, pipelines, and workflows, make it searchable for users, and integrate it with catalogs. It also should map file hierarchies and other relationships between datasets.
Access controls. Administrators should be able to enforce role-based access controls that ensure authenticated users perform only authorized actions on permitted datasets.
Masking. Your tool should selectively mask files, tables, or columns from certain parties to protect personally identifiable information (PII) such as Social Security numbers (a minimal masking sketch follows this list).
Auditing. It should maintain an audit log of all pipeline activities. Data engineers, data stewards, and compliance officers need to search, review, and export these records to support both internal and external reporting requirements.
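As a minimal illustration of masking, the sketch below redacts U.S. Social Security numbers from text before it moves downstream. It is a hypothetical transform, not a vendor feature: a real pipeline tool would drive rules like this from access policies and record each redaction in the audit log.

```python
import re

# Illustrative only: redact U.S. Social Security numbers (formatted 123-45-6789).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_ssn(text: str) -> str:
    """Replace any SSN-shaped value with a redaction token."""
    return SSN_PATTERN.sub("[REDACTED-SSN]", text)

record = "Employee Jane Doe, SSN 123-45-6789, joined in 2021."
print(mask_ssn(record))  # -> Employee Jane Doe, SSN [REDACTED-SSN], joined in 2021.
```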
How does the tool help users make GenAI inputs more trustworthy?
As described in our last blog, data engineers and their colleagues optimize GenAI inputs with curation, transformation, profiling, and validation. Your pipeline tool should assist these steps in the following ways.
Curate. The tool should help curate source objects by enabling users to track and edit their labels, annotations, and other metadata.
Transform. It should give users visibility into transformation tasks such as chunking and enrichment.
Profile. By profiling GenAI inputs at each stage in the pipeline, the tool can help users track lineage end to end. This boosts confidence in the outputs and speeds the remediation of issues.
Validate. The tool should validate GenAI inputs, for example by checking data against schema registries and enabling users to insert custom logic. Capabilities like these reduce pipeline errors that undermine data integrity (a simple validation sketch follows this list).
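For a rough sense of what validation can look like, the sketch below checks each chunk record against a simple schema plus one custom rule. The schema and the dimension check are hypothetical; a real tool would pull the schema from a registry and let users plug in their own rules.

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

# Hypothetical schema for a chunk record; in practice this would come from a
# schema registry rather than being hard-coded.
CHUNK_SCHEMA = {
    "type": "object",
    "required": ["doc_id", "text", "embedding"],
    "properties": {
        "doc_id": {"type": "string"},
        "text": {"type": "string", "minLength": 1},
        "embedding": {"type": "array", "items": {"type": "number"}},
    },
}

def validate_chunk(record: dict, expected_dim: int = 384) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    try:
        validate(instance=record, schema=CHUNK_SCHEMA)  # schema check
    except ValidationError as exc:
        errors.append(f"schema: {exc.message}")
    # Custom logic: embeddings must match the embedding model's dimensionality.
    if len(record.get("embedding", [])) != expected_dim:
        errors.append(f"embedding dimension != {expected_dim}")
    return errors

bad = {"doc_id": "contract-7", "text": "", "embedding": [0.1, 0.2]}
print(validate_chunk(bad))  # ['schema: ...', 'embedding dimension != 384']
```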
Does the tool make pipelines explainable?
Automation and abstraction, while desirable, should not hide technical details. Data engineers need the option to inspect pipeline inputs, transformation techniques, and outputs. For example, if they ask the chatbot why it recommends a certain chunking method or embedding model, they need an intelligible explanation to share with data scientists, business experts, and auditors.
Summary
Data engineers face significant challenges as they prepare and deliver unstructured data to GenAI applications. The more they can consolidate functions into a single tool that streamlines effort and assists governance programs, the better they can overcome the challenges and get back into their comfort zone. To dig more into this topic, join my webinar with Luke Roquet, COO of Datavolo, on May 28.