Patching Data Pipeline Leaks: Meeting the Challenge of Data Quality in the Cloud

This collaborative article was originally published on FirstEigen.com.

The cloud has revolutionized the ability of small companies to process large amounts of data. In data-intensive industries like financial services, it helps boutique firms go toe-to-toe with the traditional goliaths and their enormous data centers. At the same time, it has created new challenges for teams trying to maintain data quality.

Leaky Pipelines: The Top Data Quality Issue for Financial Services Firms

At small financial services firms, almost all data comes from external sources—credit bureaus, data vendors, governments, and other service providers. It arrives in massive quantities, often in near real-time. Operational or transactional databases capture this data in the short term before companies can move it to their cloud-based analytics environment. In between, data might pass through a data lake or other intermediate repositories. Handling these data flows requires running hundreds or even thousands of extract, transform, load (ETL) jobs every day. 

Every move represents an opportunity for systemic data quality issues to arise. These errors stem from problems with data and technology infrastructure and are independent of any particular line of business. Miscommunications between applications can result in duplication, corruption, or even omission of data in systems down the line. Of these, the most common, and therefore the most impactful, for financial services firms is missing data.


Pipelines drop data when they get out of sync. For instance, one data leader I spoke with described the specific challenge of moving Salesforce data into Snowflake. His Salesforce system can only process five jobs at a time and locks tables while it does so. While the tables are locked, ETL processes cannot access the data, so they drop those rows. Even with the new official Salesforce-Snowflake connector, he still loses millions of rows, which creates a headache for downstream analytics.


Pipelines also desync when network links go down or degrade. In the cloud, organizations have no control over remediation: when something goes wrong, they must wait for the provider to fix it. Multiply these issues across every external provider feeding data into a firm’s environment and you begin to realize the scale of the problem.

The Cost of Poor Data Quality

These pipeline leaks represent a real threat to organizations’ bottom lines. Depending on the use case, firms sometimes require more than 99.999% accuracy to meet downstream service-level agreements (SLAs) or financial reporting requirements. Even for less critical use cases like internal marketing analytics, missing more than 0.25% of the data can impact the validity of analyses.
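To make these thresholds concrete, here is a minimal sketch of a completeness check for a single load. It assumes that "accuracy" can be approximated by the share of source rows that arrive in the target, which is a simplification. The function names, the constants, and the row counts are hypothetical and chosen only for illustration.

```python
def completeness(source_rows: int, loaded_rows: int) -> float:
    """Return the fraction of source rows that arrived in the target."""
    if source_rows <= 0:
        raise ValueError("source_rows must be positive")
    return loaded_rows / source_rows


def meets_threshold(source_rows: int, loaded_rows: int, minimum: float) -> bool:
    """Check a load against a minimum completeness requirement."""
    return completeness(source_rows, loaded_rows) >= minimum


# Hypothetical thresholds based on the figures quoted above.
REPORTING_MIN = 0.99999  # 99.999% for SLA or financial reporting use cases
MARKETING_MIN = 0.9975   # at most 0.25% missing for marketing analytics

# Hypothetical load: 1,000 of 10 million rows were dropped in transit.
source_rows = 10_000_000
loaded_rows = 9_999_000

print(meets_threshold(source_rows, loaded_rows, MARKETING_MIN))  # True
print(meets_threshold(source_rows, loaded_rows, REPORTING_MIN))  # False
```

In this example a loss of 0.01% of rows is tolerable for marketing analytics but fails the stricter reporting requirement, which is why the same leak can be harmless in one pipeline and costly in another.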

Unless they’re able to identify and control pipeline leaks, organizations can face regulatory penalties, miss business opportunities, and lose their competitive edge. What seems like a mundane issue can lead to hundreds of thousands of dollars in fines. Data quality is a bit like eating vegetables: not necessarily pleasant, but critical to maintaining the health of the enterprise.


Identifying Leaks at Scale

Headcount has nothing to do with data scale; even small firms handle enormous quantities of data. As a result, catching pipeline leaks becomes a significant challenge. It often requires row-by-row reconciliation of millions of rows of data and the application of hundreds of data quality rules. Larger companies might have the personnel to take a more manual approach, but small firms don’t have that luxury. For them, catching data quality issues requires automation.
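As a rough illustration of what row-level reconciliation involves, the sketch below compares a source extract with its loaded copy by primary key. It is a toy, in-memory version: the `reconcile` and `row_fingerprint` helpers, the `id` key, and the sample rows are all hypothetical, and it assumes keys are unique within each dataset. A production job would typically push this comparison into the database or process the data in batches.

```python
import hashlib


def row_fingerprint(row: dict) -> str:
    """Hash a row's contents so rows can be compared cheaply."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: list, target: list, key: str) -> dict:
    """Compare two datasets by key and report dropped, extra, and altered rows."""
    src = {row[key]: row_fingerprint(row) for row in source}
    tgt = {row[key]: row_fingerprint(row) for row in target}
    return {
        "missing": sorted(src.keys() - tgt.keys()),     # dropped in transit
        "unexpected": sorted(tgt.keys() - src.keys()),  # present only in target
        "changed": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


# Hypothetical sample data: row 2 was dropped and row 3 was altered.
source = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 250},
    {"id": 3, "amount": 75},
]
target = [
    {"id": 1, "amount": 100},
    {"id": 3, "amount": 80},
]

report = reconcile(source, target, key="id")
print(report)  # {'missing': [2], 'unexpected': [], 'changed': [3]}
```

Even this simple version shows why manual checking does not scale: every row has to be compared on every move, for every pipeline.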


Thankfully, new tools have emerged to reduce the burden of writing data quality rules. Instead of relying on human teams to craft rules, these automated data quality platforms, such as FirstEigen’s DataBuck, use machine learning to analyze data flows and generate rules. All humans need to do is review and refine the rules produced by the software. This technique frees up data teams to focus on fixing data quality issues instead of flagging them. 
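The general idea of generating rules from observed data, rather than writing them by hand, can be sketched in a few lines. The example below is not how DataBuck or any other product works; it swaps machine learning for a much simpler statistical stand-in that derives an acceptable range for each metric from past batches (mean plus or minus three standard deviations). The metric names, the history, and the `sigmas` parameter are assumptions made for illustration.

```python
from statistics import mean, pstdev


def infer_rules(history: list, sigmas: float = 3.0) -> dict:
    """Derive a (low, high) range for each metric from past batch profiles."""
    if not history:
        raise ValueError("history must contain at least one batch")
    rules = {}
    for metric in history[0]:
        values = [batch[metric] for batch in history]
        center, spread = mean(values), pstdev(values)
        rules[metric] = (center - sigmas * spread, center + sigmas * spread)
    return rules


def violations(batch: dict, rules: dict) -> list:
    """Return the metrics in a new batch that fall outside their inferred range."""
    return [m for m, (low, high) in rules.items() if not low <= batch[m] <= high]


# Hypothetical profiles of four earlier loads of the same feed.
history = [
    {"row_count": 1000, "null_rate": 0.010},
    {"row_count": 1020, "null_rate": 0.012},
    {"row_count": 980, "null_rate": 0.011},
    {"row_count": 1010, "null_rate": 0.009},
]
rules = infer_rules(history)

# A batch that lost roughly 40% of its rows is flagged on row_count.
print(violations({"row_count": 600, "null_rate": 0.010}, rules))  # ['row_count']
```

The human role here matches the one described above: reviewing the inferred ranges and tightening or loosening them where the defaults are wrong.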

By necessity, these modern tools also tend to be more cloud-oriented than previous generations of data quality platforms. Vendors have designed them from the ground up to work with modern cloud-based data stacks. As a result, they integrate with cloud data sources more smoothly than traditional data quality tools built for on-prem deployments.

Takeaway

The cloud is a game-changer for smaller businesses that handle large quantities of data. But the same features that make it appealing, namely outsourced infrastructure and maintenance, can lead to systemic data quality risks when pipelines get out of sync and drop data. To detect these errors at scale and prevent negative downstream consequences, businesses need automated data quality tools that integrate with their cloud environments.

Joe Hilleary

Joe Hilleary is a writer, researcher, and data enthusiast. He believes that we are living through a pivotal moment in the evolution of data technology and is dedicated to...
