Estuary

Streaming ETL with Change Data Capture (CDC): A Comprehensive Guide

Learn how Streaming ETL with Change Data Capture (CDC) enhances real-time data integration. Discover Estuary Flow's powerful features for efficient and scalable data pipelines.

Streaming ETL with Change Data Capture (CDC): A Comprehensive Guide
Share this article

In the world of big data, traditional Extract, Transform, and Load (ETL) processes are commonly used for moving and transforming data from various sources into a target system. However, these processes can often be time-consuming, expensive, and prone to errors, which can cause delays and increased operational costs. 

Did you know? According to industry reports, organizations spend up to 60% of their data management budget on ETL processes alone. This is where Change Data Capture (CDC) comes in as a game-changer.

Change Data Capture (CDC) is an alternative approach to ETL that tracks and captures data changes in real time. When combined with Streaming ETL, it can transform your data integration strategy, making it faster, more efficient, and cost-effective. Estuary Flow leverages these technologies to empowers organizations to streamline their processes, resulting in faster, more efficient, and cost-effective real-time data integration and analytics. 

In this guide, we will delve into how Estuary can be used to build CDC-based Streaming ETL pipelines, starting with clear definitions and moving through detailed implementation steps.

What is Streaming ETL?

Streaming ETL is the process of Extracting, Transforming, and Loading data from sources to target systems using streaming data, typically in near-real-time.

Unlike traditional batch ETL, which involves periodic and scheduled data extraction and transformation, Streaming ETL retrieves and modifies data as soon as it's received. For example, with databases this may involve streaming data using log-based change data capture (CDC) as a source. 

Other sources might include Kafka or another messaging system. Streaming ETL is most commonly used to support real-time analytics or high volume systems where there is has become impossible to move data in batch within a required batch window.

Estuary Flow is an excellent example of a real-time CDC and streaming ETL.

What is Change Data Capture (CDC)?

etl processes cdc - CDC

Image Source

Change Data Capture (CDC) is a technique used in data integration processes that identifies and captures changes made to data in a database or data source. It captures and tracks changes in near real-time updates to the target system for up-to-date data access.

CDC works by creating a separate log that contains information about the changes made to the data, including the type of change (insert, update, delete), the time of the change, and the affected data. The log is then used to update the target system with the changed data, enabling real-time data integration.

Change Data Capture Methods

There are three main methods of Change Data Capture: query-based, trigger-based, and log-based. Among these methods, log-based CDC, particularly in real-time settings, aligns seamlessly with the concept of Streaming ETL. 

  • Query-based Change Data Capture - This method involves querying the source database for changes and capturing them based on the results of the queries. For example, you could query the database for any rows that have been updated since a certain timestamp and then capture those changes for further processing. This method is easy to set up and can be used with any database that supports SQL queries. However, it can be less efficient than other methods as it requires constantly querying the source database.
  • Trigger-based Change Data Capture - This method involves setting up triggers on the source database that are fired whenever a change is made to the data. These triggers capture the data changes and send them to the CDC system for further processing. This method is efficient and can capture all changes made to the data, but it can be complex to set up and maintain. Additionally, not all databases support triggers, so this method may not be an option in all cases.
  • Log-based Change Data Capture (Real-Time) - This method involves monitoring the transaction logs of the source database to capture changes in real time. This method is efficient and can capture all changes made to the data, including insertions, updates, and deletions. However, it can be complex to set up and maintain as it requires access to the database logs and the ability to parse them. (This is the method Estuary Flow supports with pre-engineered connectors).

Regardless of the CDC method used, the captured changes are transformed and loaded into a target system using ETL processes. This enables you to analyze the data and gain insights into your business operations. 

If you want to learn more about these Change Data Capture (CDC) methods, check out this article!

How does Change Data Capture (CDC) work?

etl processes cdc - how cdc works

Image Credits

Today, when we talk about CDC we’re usually talking about the log-based method. But it can be the most complex to understand. 

In simple terms, when you make changes to a database, like adding new data, updating existing data, or deleting data, these changes are recorded in a transaction log. This log is used to track all the changes made to the database in real-time. Change Data Capture (CDC) is a process that reads these changes from the transaction log and sends them to a destination in the form of events.

For example, let's say you have a database with an Orders table. When a new order is placed, the information about the order is written in the transaction log. CDC reads this information and sends it to a destination, where it is stored in the Orders table. This way, any interested parties can be informed of the change and react accordingly, like refreshing their version of the data or triggering business processes.

Different databases use different methods to track changes, but the idea is always the same: changes are tracked and emitted as they happen in real-time, ensuring that the stakeholders are informed of the changes and can react accordingly.

Why Streamline ETL Processes with CDC?

Streaming ETL (Extract, Transform, Load) Processes with Change Data Capture (CDC) can offer significant benefits for your organization's data integration and analysis needs. Here are some reasons:

  • Real-time Data Processing: CDC enables real-time data processing, which means that data changes can be captured and processed as soon as they occur. This allows you to make more informed decisions and take action based on the most up-to-date information.
  • Reduced Data Latency: By capturing data changes as they occur, CDC reduces the latency between data updates and their availability for analysis. This helps ensure that the data being analyzed is as accurate as possible.
  • Increased Efficiency: CDC streamlines the ETL process by reducing the need to extract and load entire datasets each time they are updated. Instead, only the changes to the data are captured and processed, reducing the amount of data that needs to be moved and transformed.
  • Improved Data Accuracy: CDC helps improve data accuracy by capturing and processing only the changes made to the data, reducing the risk of errors that can occur when updating entire datasets.
  • Better Scalability: CDC is designed to handle high volume and high velocity data streams, making it an ideal choice for organizations that need to process large amounts of data quickly and efficiently. It can also be used with a variety of different data sources and target systems, making it a flexible and scalable solution for data integration.

By streamlining ETL processes with CDC, organizations can integrate and analyze data more efficiently, accurately, and promptly, leading to better business outcomes and a competitive advantage in the market..

Streaming ETL Processes with CDC Using Estuary

Estuary Flow is a powerful data streaming platform that offers built-in support for Change Data Capture (CDC). Flow allows you to build CDC-based ETL pipelines with a variety of CDC connectors, making it an ideal choice for organizations looking to streamline their data integration processes. 

Estuary Flow includes CDC connectors for:

After capturing real-time changes to data, you can transform and load it into the target system efficiently. This helps you make more informed decisions and gain insights faster, leading to improved business outcomes.

Here is a step-by-step guide to Streaming ETL Process with CDC Using Estuary:

Step 1: Capture the data from your source

etl process cdc estuary home page
  • Click on Capture. Select New capture.
etl processes cdc - new capture
  • On the Captures page, you can search for your source database and create a new Capture. Give the Capture a name. Fill in the details of your source database, which include properties like the server address, database name, username, and password. 
etl process cdc - new capture
  • Click Next, then click Save and Publish:
etl process cdc - Publish

You’ve now set up change data capture from the source system! Now, it’s time to push those real-time updates to a destination. 

Step 2: Set up your destination

You can do this either through the pop-up by clicking “Materialize Collections”:

etl processes cdc - materialize collections

Or, from the Materializations page > New Materialization:

etl processes cdc - new materialization

The steps involved in ingesting data may differ slightly depending on the destination you choose. Estuary provides a range of out-of-the-box connectors for popular databases and data warehouses like Snowflake, BigQuery, and Postgres. 

Setting up a destination is as straightforward as setting up the source, as demonstrated in the preceding steps, and can be completed in just a few clicks.

etl processes cdc - estuary connectors

For more help, see the Estuary docs for:

Conclusion

Change Data Capture (CDC) is an effective method of capturing and tracking data changes in real time, enabling organizations to streamline their ETL processes. CDC can be implemented using different methods, such as query-based, trigger-based, and log-based methods. Among these, the log-based method is the most efficient, capturing all changes made to the data in real time. Incorporating log-based CDC into Streaming ETL processes enables organizations to update the target system with the most recent data from the source system, resulting in real-time analytics and informed decision-making.

Estuary Flow is a powerful data streaming platform that can be used to achieve this. By streamlining ETL processes with CDC, organizations can achieve numerous benefits, including real-time data processing, reduced data latency, increased efficiency, improved data accuracy, and better scalability. These benefits can ultimately result in better business outcomes and competitive advantage.

If you are ready to reap the benefits of the CDC and elevate your data strategy, sign up to try Estuary Flow for free and start exploring its extensive features.

Start streaming your data for free

Build a Pipeline
Share this article

Table of Contents

Build a Pipeline

Start streaming your data for free

Build a Pipeline

About the author

Picture of Jeffrey Richman
Jeffrey Richman

With over 15 years in data engineering, a seasoned expert in driving growth for early-stage data companies, focusing on strategies that attract customers and users. Extensive writing provides insights to help companies scale efficiently and effectively in an evolving data landscape.

Popular Articles

Streaming Pipelines.
Simple to Deploy.
Simply Priced.
$0.50/GB of data moved + $.14/connector/hour;
50% less than competing ETL/ELT solutions;
<100ms latency on streaming sinks/sources.