Data is Only Useful Outside of the Database


Database or Datajail?

Dani Palma


Despite the tremendous investment in database infrastructure over the last few decades, data alone doesn’t drive value — applications, analytics, and machine learning models do. Extracting raw data from databases and leveraging it elsewhere transforms it into business value, actionable insights, and, thus, competitive advantage.

Imagine, for instance, a company’s customer information database. That data is essentially inert without integration into a CRM system — it can’t drive marketing campaigns, personalize experiences, or inform sales decisions.

Among the many ways to extract these records and propagate them elsewhere, one has been gaining significant traction lately. Use cases like this are exactly where Change Data Capture (CDC) shines: it provides a way to move data from databases to external applications without disrupting operations.

In this article, we’ll take a look at why replicating data outside the database is critical, dive into some of the mechanics of log-based CDC, and discuss a few use cases in more detail.

The Case for Data Beyond the Database

While databases store vast amounts of structured information, that data alone doesn’t make the wheels of modern business turn.

For data to be genuinely helpful, it must flow beyond storage, enabling (real-time) interactions and intelligence.

Data for Analytics

One of the biggest data-liberation use cases is analytics. Analytics tools are often designed to work with real-time or near-real-time data to drive timely insights.

For example, a retail company’s transactional data becomes insightful when it feeds into a BI tool or dashboard, helping track daily sales, customer preferences, and seasonal trends.

Often, you want to analyze not just the latest state of the records but the change history itself. This is easily enabled by CDC, as it can provide you with the raw deltas.
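To make the distinction concrete, here is a minimal Python sketch (with made-up event shapes, not a real CDC feed) contrasting the latest state a plain query returns with the change history that CDC deltas preserve:

```python
from collections import defaultdict

# Hypothetical CDC events for one order row (field names are illustrative).
events = [
    {"op": "INSERT", "id": 1, "data": {"status": "placed"}},
    {"op": "UPDATE", "id": 1, "data": {"status": "shipped"}},
    {"op": "UPDATE", "id": 1, "data": {"status": "delivered"}},
]

def latest_state(events):
    """Replay deltas to recover the current state per key (what a query sees)."""
    state = {}
    for e in events:
        if e["op"] == "DELETE":
            state.pop(e["id"], None)
        else:
            state[e["id"]] = e["data"]
    return state

def change_history(events):
    """Keep every delta per key -- the history a plain snapshot query loses."""
    history = defaultdict(list)
    for e in events:
        history[e["id"]].append((e["op"], e["data"]))
    return dict(history)

print(latest_state(events))            # {1: {'status': 'delivered'}}
print(len(change_history(events)[1]))  # 3
```

A snapshot query would only ever show "delivered"; the deltas retain the two intermediate transitions as well.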

Operational Efficiency through Data Syncing

Applications built on a distributed microservices architecture need data consistency across all components. In such architectures, replicating data into caches and search indices is critical: they keep operations smooth without constantly querying the primary database.

Implementing search functionality over a highly normalized operational database can be a pain. Hence the popularity of systems such as Elasticsearch, which hold an aggregated, denormalized view of the application’s data.

Keeping Elasticsearch in sync with the source database is a perfect use case for a real-time CDC pipeline.

Machine Learning and AI


Training ML models or deploying AI applications is impossible without feeding them data. Often, this data must be a consistent, accurate snapshot of business operations — delivered in near real-time. For instance, ML fraud detection or recommendation engine models need frequent data updates to remain relevant and practical.

Pre-trained LLMs also need contextual knowledge to minimize hallucinations and ensure they include the latest world state in their answers. Re-training models every day is currently not feasible, so the best way is what is commonly known as “RAG” — retrieval augmented generation.

This can be done in weekly or daily batches, but wouldn’t you want the chatbot you’re conversing with to know what happened just a few minutes ago?

Moving data from an operational database to another service enables all these things and more.

Let’s examine why CDC is the ultimate approach to solving this.

Log-Based Change Data Capture: The King of Data Replication

In simple terms, CDC is a method that identifies and records changes made to a database and then propagates these changes to downstream systems. Log-based CDC, a specific approach to CDC, monitors the database’s transaction log, which records every change to the database in the order it occurred.

This non-intrusive method captures data updates in near real-time, making it optimal for maintaining data consistency.

Why Log-Based CDC is Better Than Batch Processing

Historically, batch processing was the primary method for transferring data from databases to anywhere else. But these approaches suffer from many issues: they often introduce delays, limit real-time visibility, and consume significant system resources.


(Image source: estuary.dev)

Log-based CDC is better. Here’s why:

  1. Minimal Latency: Changes are reflected in downstream systems in near real-time.
  2. Reduced System Load: Unlike batch processes, which can strain databases heavily during extraction, log-based CDC continuously replicates data changes with minimal performance impact.
  3. Data Consistency and Accuracy: Event-by-event consistency can be guaranteed because every change in the transaction log is captured.

But it is not always the answer. Here are a few cases where you probably don’t need log-based CDC:

  • You only need the data once: Don’t bother setting up CDC if it’s a one-time query.
  • You don’t need a lot of data: If the volume is small and you don’t care about changes such as deletes, sometimes a simple query works.
  • You can’t get access to the WAL: There are good reasons for this, mostly around security and data access.
  • You can’t get the right events: This may seem weird, but it does happen. The data doesn’t always represent the business logic. Sometimes, you have to recreate a higher-level event as a series of data changes. In this case, you may need to capture that event another way. But for the rest of the data, CDC still makes sense.

As you can see, these are fairly situational, which means that generally, CDC is the best approach.

How Log-Based CDC Works

Now that we all agree CDC is great, let’s see how it works in a bit more detail. It can be summarized in three major phases:

  1. WAL Monitoring: The first step is to monitor the DB’s transaction log. We must capture each insert, update, and delete operation as it occurs. This is usually done by subscribing to a relay provided by the database system instead of polling the log or the tables.
  2. Change Extraction: These changes are extracted and converted into a standard, easy-to-parse format for easy integration with other applications or databases.
  3. Data Replication to Target Systems: The captured changes are then replicated to designated target systems, such as data warehouses, search indexes, or external APIs, enabling their immediate use in operational or analytical processes.
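The three phases above can be sketched in a few lines of Python, with an in-memory list standing in for the transaction log. All names here are hypothetical, not a real CDC API:

```python
# Illustrative end-to-end sketch of the three phases, with an in-memory list
# standing in for the transaction log.

raw_wal = [
    ("INSERT", "customers", {"customer_id": 1, "name": "Alice"}),
    ("UPDATE", "customers", {"customer_id": 1, "name": "Alice B."}),
]

def monitor(wal):
    """Phase 1: subscribe to the stream of log entries as they occur."""
    yield from wal

def extract(entry):
    """Phase 2: convert a raw entry into a standard, easy-to-parse format."""
    op, table, data = entry
    return {"operation": op, "table": table, "data": data}

def replicate(change, targets):
    """Phase 3: deliver the change to each downstream system."""
    for target in targets:
        target.append(change)

warehouse, search_index = [], []
for entry in monitor(raw_wal):
    replicate(extract(entry), [warehouse, search_index])

print(len(warehouse), len(search_index))  # 2 2
```

In a real pipeline, `monitor` would be a subscription to the database’s replication stream and the targets would be a warehouse loader or search indexer, but the shape of the flow is the same.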

The Evolution of CDC

CDC has existed since the early days of relational databases, but the need for real-time data replication and integration across disparate systems drove its evolution. In its earliest forms, CDC was handled through batch processing, where a system would periodically extract snapshots of database tables and identify changes by comparing these snapshots.

While this might have been effective for batch reporting, it was too slow and resource-intensive for real-time needs.
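For illustration, the snapshot-comparison approach boils down to a full diff between two table snapshots keyed by primary key; here is a minimal sketch:

```python
# Minimal sketch of the early snapshot-comparison approach: diff two full
# table snapshots to recover inserts, updates, and deletes.

def diff_snapshots(old, new):
    changes = []
    for pk, row in new.items():
        if pk not in old:
            changes.append(("INSERT", pk, row))
        elif old[pk] != row:
            changes.append(("UPDATE", pk, row))
    for pk, row in old.items():
        if pk not in new:
            changes.append(("DELETE", pk, row))
    return changes

yesterday = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
today = {1: {"name": "Alice B."}, 3: {"name": "Carol"}}

for change in diff_snapshots(yesterday, today):
    print(change)
```

Every run rescans every row of both snapshots, which is exactly why this approach could not keep up with real-time needs.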

With the rise of event-driven architectures, CDC methods evolved to support near-real-time replication through trigger-based approaches, where database triggers would detect and log table changes. However, these triggers added significant write overhead and maintenance burden, especially in high-throughput environments.

Then software engineers realized that the only tool they’d need had been there all along: the Write-Ahead Log. This data structure and its related operations were originally used strictly to ensure high availability as a crash recovery mechanism. But because it contained every operation that changed the database, it proved to be a perfect vehicle for replication.

Thus, log-based CDC was born: it enabled efficient change capture directly from the database’s logs, eliminating the need for frequent table scans or triggers.

Today, log-based CDC is a widely adopted method for capturing changes with minimal performance impact, and it plays a critical role in real-time analytics, data warehousing, and event-driven applications.

How Log-Based CDC Works in PostgreSQL

Log-based CDC reads the database’s transaction log (a write-ahead log, or WAL in PostgreSQL). Transaction logs contain a record of all changes made to the database, which the database uses for recovery and consistency.


Log-based CDC in PostgreSQL

  1. Write-Ahead Logging (WAL): PostgreSQL uses a write-ahead logging (WAL) approach to ensure durability and consistency. Each time a change (INSERT, UPDATE, DELETE) is made to a database table, PostgreSQL writes a corresponding log entry to its WAL file before applying the change. This ensures that the database can replay these logs to restore data integrity if there’s a crash.
  2. Logical Replication Slots: To enable CDC, PostgreSQL allows the creation of “logical replication slots” on the WAL. A logical replication slot is a mechanism that allows external processes to “subscribe” to the stream of changes from the WAL. This lets downstream systems (like a CDC tool or ETL platform) receive real-time changes in the form of logical changes (such as row-level inserts, updates, and deletes) instead of raw binary logs. Here’s how a logical replication slot can be created in PostgreSQL:

SELECT * FROM pg_create_logical_replication_slot('replication_slot', 'pgoutput');

This command creates a logical replication slot named replication_slot, which uses the pgoutput plugin to capture and format data changes in a way that external systems can consume.
  3. Decoding WAL Entries: Once a logical replication slot is in place, PostgreSQL starts decoding changes in the WAL into structured transactions that capture the operation, the table affected, and the data involved. These decoded changes are usually streamed to a replication tool (like Debezium, PeerDB, or Estuary Flow) through the replication slot. To illustrate, an INSERT operation in the source database might translate into a message with the following structure:

{
  "operation": "INSERT",
  "table": "customers",
  "data": {
    "customer_id": 12345,
    "name": "Alice",
    "email": "alice@example.com"
  }
}

This format allows CDC tools to interpret changes without needing direct access to the database schema, making data integration even more efficient.
  4. Streaming Changes: PostgreSQL streams these changes out in near real-time after decoding. A CDC tool like Debezium can connect to the logical replication slot and consume changes continuously. This enables downstream systems to stay in sync with the source database with minimal latency.
  5. Handling Snapshotting and Initial Loads: One challenge with CDC is that a newly created replication slot only captures changes from the point of creation onward. Most CDC systems therefore perform an initial snapshot of the database to handle initial loads or to synchronize targets with a significant amount of existing data. In PostgreSQL, this might involve querying the database tables directly before consuming changes from the WAL.
  6. Idempotency and Exactly-Once Processing: CDC systems often need to ensure idempotency (so the same change is not applied multiple times) and, ideally, exactly-once processing. Kafka, for example, supports exactly-once semantics by tracking messages at the consumer level, while CDC tools like Debezium manage offsets and message sequencing to help achieve this in distributed environments. Other tools, like Estuary Flow, guarantee exactly-once processing out of the box.
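As a rough illustration of the idempotency point, a consumer might track the last applied log position (LSN) and skip anything at or below it. This is a simplified sketch, not a real CDC client:

```python
# Simplified sketch of idempotent apply: track the last applied log position
# (LSN) and skip anything at or below it. Illustrative, not a real CDC client.

class IdempotentApplier:
    def __init__(self):
        self.last_lsn = -1   # persisted durably in real systems; in memory here
        self.rows = {}

    def apply(self, change):
        if change["lsn"] <= self.last_lsn:
            return False               # duplicate or replayed delivery: skip
        if change["operation"] == "DELETE":
            self.rows.pop(change["id"], None)
        else:
            self.rows[change["id"]] = change["data"]
        self.last_lsn = change["lsn"]  # advance the offset after the write
        return True

applier = IdempotentApplier()
change = {"lsn": 7, "operation": "INSERT", "id": 1, "data": {"name": "Alice"}}
print(applier.apply(change))  # True: applied
print(applier.apply(change))  # False: the retry is safely ignored
```

Note that this only works when deliveries arrive in log order; handling out-of-order or concurrent deliveries is part of what makes exactly-once processing genuinely hard in distributed environments.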

While powerful, log-based CDC is not without its complexities:

  • Storage and Retention: Transaction logs can snowball, especially in high-change environments. Retention policies and disk usage should be managed carefully: in PostgreSQL, for instance, a replication slot whose consumer falls behind prevents old WAL segments from being recycled, and an overloaded WAL can impact database performance.
  • Schema Evolution: When source schema changes (e.g., column addition, type change), CDC consumers need to handle these updates gracefully. Some CDC tools offer built-in support for schema evolution, but complex schema changes can still lead to downstream issues.
  • Network and Latency: CDC systems must balance reading logs frequently enough to reduce latency against the network and processing load this places on the source database.
  • Data Consistency: Achieving exactly-once processing guarantees is challenging in distributed environments. For example, PostgreSQL doesn’t inherently track which messages have been sent to a consumer, so CDC tools must manage offsets and retries to avoid duplicating or missing data.
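On the schema-evolution point, a defensive consumer can at least tolerate additive changes by mapping incoming rows onto the columns it knows about and carrying unknown ones along. A minimal, hypothetical sketch:

```python
# Sketch of a consumer tolerating additive schema changes: unknown columns are
# carried through instead of crashing the pipeline. Names are illustrative.

EXPECTED = ["customer_id", "name"]

def normalize(row):
    """Map a CDC row onto the expected columns, preserving any new ones."""
    known = {col: row.get(col) for col in EXPECTED}   # missing columns -> None
    extra = {k: v for k, v in row.items() if k not in EXPECTED}
    return {**known, "_extra": extra}

old_row = {"customer_id": 1, "name": "Alice"}
new_row = {"customer_id": 2, "name": "Bob",
           "email": "bob@example.com"}   # a column was added upstream

print(normalize(old_row)["_extra"])  # {}
print(normalize(new_row)["_extra"])  # {'email': 'bob@example.com'}
```

Destructive changes (dropped or retyped columns) still need coordinated handling, which is why some teams lean on CDC tools with built-in schema evolution support.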

More Use Cases

As mentioned above, there are many more use cases for CDC. It turns out that getting data out of a database opens the door to a lot of business value.


Data Replication and Syncing

Log-based CDC enables companies to replicate data in near real-time to various systems and applications, creating a single highly available and reliable source of truth.

  • Search Indexing: By syncing changes to a search index, users can instantly search updated records — essential for e-commerce, where product availability and pricing must be accurate and current.
  • Cache Invalidations: Database changes often need to trigger updates in application caches. Cache invalidation is famously one of the hardest problems in software development, and keeping cached data consistent with the underlying database is critical for many user-facing applications.
  • Denormalized Views: Lastly, log-based CDC helps populate denormalized tables in analytics environments.
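As a toy example of CDC-driven cache invalidation, with a plain dict standing in for Redis or Memcached and a made-up key scheme:

```python
# Toy example of CDC-driven cache invalidation: a dict stands in for Redis or
# Memcached, and the "customer:<id>" key scheme is made up for illustration.

cache = {"customer:1": {"name": "Alice"}}

def on_change(change):
    """Evict the cached entry for the row a CDC event touched."""
    key = f"customer:{change['data']['customer_id']}"
    cache.pop(key, None)   # the next read repopulates from the database

on_change({"table": "customers", "operation": "UPDATE",
           "data": {"customer_id": 1, "name": "Alice B."}})
print("customer:1" in cache)  # False
```

Evict-on-change like this sidesteps stale reads without the application having to remember every write path that touches the cached row.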

Real-Time Side Effects

Beyond pure data replication, CDC also works as an architectural pattern: each data change can trigger downstream events or actions in other systems. A few examples:

  • Email Notifications: For instance, inserting a new record in a customer database can trigger a welcome email to the new customer.
  • Workflow Automation: Imagine adding a new order to a database. This can trigger a series of actions: inventory updates, payment verification, or even shipment scheduling. CDC events can conveniently trigger all of these.
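A simple way to wire up such side effects is a dispatcher keyed on (table, operation). Everything below is hypothetical; the handlers just record what they would do:

```python
# Hypothetical side-effect dispatcher: route each CDC event to a handler keyed
# by (table, operation). The handlers here just record what they would do.

actions = []

HANDLERS = {
    ("customers", "INSERT"):
        lambda d: actions.append(f"welcome email to {d['email']}"),
    ("orders", "INSERT"):
        lambda d: actions.append(f"reserve inventory for order {d['order_id']}"),
}

def dispatch(event):
    handler = HANDLERS.get((event["table"], event["operation"]))
    if handler:
        handler(event["data"])

dispatch({"table": "customers", "operation": "INSERT",
          "data": {"email": "alice@example.com"}})
dispatch({"table": "orders", "operation": "INSERT",
          "data": {"order_id": 42}})

print(actions)
```

Because the triggers live downstream of the database, the application code that performs the writes never needs to know these workflows exist.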

Decoupling Systems

Tangentially related to the previous use case, CDC can facilitate event broadcasting. This pattern allows multiple systems to act independently while staying informed of changes from upstream.

  • Inventory Management: In an e-commerce setup, a new order can update inventory levels, trigger a pick-and-pack operation, and notify logistics for shipment — all enabled through real-time CDC events.
  • Decoupled Service Architecture: CDC is invaluable for microservices architectures, where services must be independently scalable and decoupled. By broadcasting events, CDC ensures services remain in sync without depending on synchronous communication.

Compliance

Last, but not least, CDC can also be a super useful tool for data governance, compliance, and auditing. For sectors bound by data privacy regulations (GDPR, HIPAA, etc.), CDC can help track data access and changes, ensuring compliance with audit trails.

Backups

It’s also worth mentioning that log-based CDC can replicate changes to backup systems, providing a near-real-time mirror of critical databases in a secondary environment, such as a hot standby system.

Conclusion: Make Data Work by Setting It Free

Data locked inside a database is just the beginning. To truly make data valuable, you will always need to replicate it to external systems that can convert it into something useful.

Log-based CDC can offer an efficient approach to moving data beyond storage, enabling real-time applications, advanced analytics, and system decoupling.

Adopting log-based CDC is a strategic move that leverages the best of data architecture and operational design for businesses seeking to drive immediate, actionable outcomes from their data.

Where does your data need to be to drive real value?