The field of software engineering has benefited immensely from new techniques and technologies, such as version control via Git and continuous integration/continuous deployment (CI/CD) via tools like Jenkins. Now a company called Recce is hoping to bring the same sort of benefits to the field of data engineering with an open source product of the same name, as well as a commercial product.
The goal of the Recce (short for “reconnaissance”) project is to bring best practices for data validation–such as data diffing, validation checklists, and query result comparison–directly into data transformation workflows. The software does this by integrating directly with tools like dbt, thereby enabling data engineers and other data professionals to ensure that the cleanest and best data is being used for downstream analytics use cases in data warehouses, data lakes, and lakehouses.
Data engineers and other practitioners (dbt Labs likes to call them “analytics engineers”) are already doing checks, such as looking for null values and ensuring that value ranges and referential integrity are maintained. Recce helps automate those checks and provides a basis for additional verification, says Chia-liang “CL” Kao, the creator of Recce and the CEO of the company of the same name.
“In other words, they’re doing a lot of spot checks, like running this specific query for the production database and your development branch, kind of staging data, and then eyeballing the results,” Kao tells BigDATAwire. “Oftentimes, it’s very manual. So we are automating that process, allowing the practitioner to bring in the business stakeholders earlier to look at the data.”
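The manual spot check Kao describes can be sketched in a few lines: run the same query against a production and a development environment and diff the results instead of eyeballing them. This is a conceptual illustration only, not Recce's actual API; the table and column names are made up, and an in-memory SQLite database stands in for the warehouse.

```python
# Conceptual sketch of an automated prod-vs-dev spot check.
# Schema/table names (prod_orders, dev_orders) are hypothetical.
import sqlite3

def run_query(conn: sqlite3.Connection, schema: str) -> dict:
    # Same aggregate query, pointed at one environment's table
    rows = conn.execute(
        f"SELECT region, SUM(amount) FROM {schema}_orders GROUP BY region"
    ).fetchall()
    return dict(rows)

def diff_results(prod: dict, dev: dict) -> dict:
    # Report every key whose value drifted between environments
    keys = set(prod) | set(dev)
    return {k: (prod.get(k), dev.get(k)) for k in keys
            if prod.get(k) != dev.get(k)}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod_orders (region TEXT, amount REAL)")
conn.execute("CREATE TABLE dev_orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO prod_orders VALUES (?, ?)",
                 [("us", 100.0), ("eu", 80.0)])
conn.executemany("INSERT INTO dev_orders VALUES (?, ?)",
                 [("us", 100.0), ("eu", 95.0)])  # dev logic changed eu revenue

drift = diff_results(run_query(conn, "prod"), run_query(conn, "dev"))
print(drift)  # {'eu': (80.0, 95.0)}
```

The output is exactly the kind of drift summary a reviewer then has to interpret, which is where, per Kao, the human stays in the loop.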
By automating the checks that dbt is already doing and making the results easier to consume via a graphical user interface (GUI), the results will be consumable by a broader range of personas and therefore have a wider impact on the business, says Kao, the former Apple engineer who developed SVK, a precursor to Git.
It’s all about helping the data quality checks make sense for the users’ particular environment, Kao says.
“So by reading the output of the comparison, like the differences or the aggregation of the differences, they’re able to create a checklist to say, ‘Hey, I’ve looked at this query. I intended this to be X and it is indeed X,’” he says. “This is how they currently go about making the verification themselves, but it’s done manually. So we’re helping them to automate that process into a reliable way, so that when you add more commits to your pull request, these checks can be automatically rerun and reverified, so that they’re not lost in the void.”
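The checklist idea in that quote can be sketched as data: each check pairs a query with the value the author intended (“I intended this to be X”), so the whole list can be rerun mechanically on every new commit to the pull request. The `Check` structure and function names below are illustrative assumptions, not Recce's actual interface.

```python
# Minimal sketch of a rerunnable verification checklist.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    run: Callable[[], object]   # produces the observed value
    expected: object            # the value the author intended

def run_checklist(checks: list[Check]) -> dict[str, bool]:
    # Rerun every recorded check and report pass/fail, so prior
    # verifications are not lost when new commits land.
    return {c.name: c.run() == c.expected for c in checks}

checks = [
    Check("row_count_matches", lambda: 1000, expected=1000),
    Check("no_null_customer_ids", lambda: 0, expected=0),
]
print(run_checklist(checks))
# {'row_count_matches': True, 'no_null_customer_ids': True}
```

In practice the `run` callables would execute real queries; the point is that the checks become durable artifacts a CI system can replay rather than one-off manual inspections.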
Kao targeted dbt for the first release of Recce because dbt is so widely used by data engineers and other data professionals. The plan calls for Recce to eventually support other popular data tools, such as SQLMesh and Dagster, he says.
The goal is to ensure the quality and integrity of data as far up the data supply chain as possible, Kao says. The field of data observability is solving a similar problem, but it’s mostly looking at data after it has been loaded into an analytics database or warehouse and has undergone the all-important transformations–the “T” in ETL and ELT–which is where many errors are introduced.
The introduction of AI, both as an application and as a data engineering tool, makes it all the more critical to solve data quality issues as early as possible in the data lifecycle, Kao says. As data becomes more critical for software development, the data review will become as important as–if not more important than–the code review for Python, SQL, or other code.
“Now the prompt or the underlying model is a building block that you’re using as part of the pipeline. Now you’re changing the logic of the pipeline. You have this kind of unexpected impact to your downstream. How do you verify that?” Kao says. “We’re relying on certain eval or something for our applications. But ultimately I think the future is like code review. As we do in software, when we are doing this new type of LLM-driven code [development], it’s going to be data review.”
However, software can only take us so far. Humans are a critical link in the data review process, because computers can’t validate whether the ultimate values are correct or not, Kao says. Context is critical for determining the correctness of data, he says. That’s why Recce is seeking to streamline as much of the process as possible and remove impediments to getting this information in front of human eyes.
“The major difference from software CI/CD is that the correctness depends on the interpretation of the drift, like compared to the production system,” Kao says. “And that wasn’t usually done because it was very involving. But when we talked to more mature teams, they would have to spend time on that to ensure the output for the data is correct. So what Recce brings is really simplifying that workflow and then also integrating it into the CI/CD system.”
During a demo of a dbt pull request in Recce, Kao showed how a user is able to visually determine how changes to a certain database field will impact downstream tables. It’s a real-time cross-referencing capability that will let users, for instance, see how a coupon change will impact how customer lifetime value is calculated, Kao says.
“You can see after I change that coupon definition, how is my customer lifetime value across the customer changing?” he says. “Is the distribution change something I expected?”
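The downstream-impact view from the demo boils down to a walk over a lineage graph: starting from the changed model, collect everything that depends on it. The tiny graph below is invented for illustration (Recce derives the real lineage from dbt's metadata), but it shows how a coupon change surfaces customer lifetime value as affected.

```python
# Hedged sketch: breadth-first search over a made-up model lineage graph
# to find everything downstream of a changed model.
from collections import deque

lineage = {
    "coupons": ["order_discounts"],
    "order_discounts": ["customer_lifetime_value"],
    "customer_lifetime_value": [],
}

def downstream(graph: dict, changed: str) -> set:
    # Walk from the changed model to all transitive dependents
    seen, queue = set(), deque(graph.get(changed, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen

print(downstream(lineage, "coupons"))
# a change to the coupon definition flags both downstream models for review
```

Pairing this traversal with the query diffing shown earlier is what lets a reviewer see not just that something changed, but which downstream metrics moved and by how much.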
The first release of Recce came out about a year ago, and today it’s being downloaded about 3,000 times per week, Kao says. Anyone can download Recce and run a local Recce server.
Yesterday, Recce announced the version 1.0 release of the product, which adds a host of new features, including support for column-level lineage; breaking change analysis; profile, value, and top-k diffs at the column level; interactive custom queries; and structured checklists with evidence collection.
The company also announced the launch of Recce Cloud. Currently in beta, the service provides more collaboration functionality for teams than the open source product offers, including full data-validation context sharing (lineage diffs, custom query results, and structured checklists), automated sync checks across environments, and blocked merging until all checks are approved.
Lastly, the San Francisco-based company announced that it has raised $4 million in venture capital to fuel its growth. The round was led by Heavybit, with participation from Vertex Ventures US, Hive Ventures, and angels Visionary, SVT Angels, Brighter Capital, Ventek Ventures, Scott Breitenother and Tim Chen of Essence VC.
“Data pipelines are the New Secret Sauce for every company building with AI, enabling teams to create and improve high-quality training data from their own IP,” said Heavybit General Partner Jesse Robbins, who is joining Recce’s board. “Recce provides the essential toolkit for unlocking the full value of their data with iteration, refinement, and monitoring, while mitigating the risk of errors and corruption. Heavybit is thrilled to support them as they grow the ecosystem for data pipeline validation in the age of AI as part of our ongoing mission of 10+ years: Bringing critical enterprise infrastructure to market.”
Related Items:
Data Quality Getting Worse, Report Says
Data Quality Top Obstacle to GenAI, Informatica Survey Says
Data Quality Got You Down? Thank GenAI
The post Recce Aims to Become the CI/CD for Data Engineering appeared first on BigDATAwire.