The ability to respond rapidly to customer behavior and market changes is transforming the landscape of “business as usual.” From industrial process optimization in the manufacturing industry to fraud detection in finance and ad personalization in retail, companies across industries are increasingly looking to identify and take action on opportunities in their data in real-time.
But supplying business leaders with insights for real-time decision-making is no easy feat. In most enterprises, key operational data is spread across hundreds of different systems, so a common practice is to extract and move all relevant data to a central location for analysis (for example, an enterprise data warehouse/data lake/enterprise data hub). But since most operational systems write all their data to a relational database, one of the main challenges to enabling real-time analysis is - how to capture changes from the relational databases in real-time?
This is where CDC comes into play. No, not the CDC sends groups of doctors in orange suits and space helmets to investigate a breaking epidemic... CDC stands for Change Data Capture, which means a set of techniques to capture the data changes that an app issues.
In this post, I’ll focus on relational database CDC, though the term also applies to other repositories, such as NoSQL databases, storage systems, cloud services, etc. The goal is to survey and score the range of CDC options available – and to guide technology leaders in thinking through which approaches are best-suited to their business’ needs.
Some common CDC techniques
Dual writes in the application
- How it works: You can go ahead and change the application itself to write each change both to its database and to a log file or a message queue, so you’ll have the list of changes ready somewhere.
- Advantages: Few; in practice, it is rarely desirable or even possible, for reasons outlined below.
- Disadvantages: To start with, in many cases you do not have access to the application source code (for example, for packaged applications), or a good understanding of all its write paths, or an ability to test such changes thoroughly. Also, that approach is labor-intensive - you will likely need to start from scratch for each application. And the final straw - it is very hard to make both writes (to the application database and the custom logs) atomic - that either both succeed or that both fail.
- How it works: Some tools provide a way to capture the database inbound traffic and reverse engineer the application requests.
- Advantages: Low overhead.
- Disadvantages: Generally not used due to a couple of critical flaws. First, it does not capture changes from bulk operations (a single statement that changes a lot of rows based on a query) or from calling a stored procedure. Second, whenever the network sniffing is down, all uncaptured changes will be lost.
- How it works: A database trigger is an optional piece of user code, that the database could be configured to run as part of any change to the rows of a table. A trigger could be used to log any change to a side table, so the list of changes could be queried from it later.
- Advantages: Don’t require application changes and can capture all standard changes.
- They run as part of the operational transaction, slowing it down. Even worse, it makes them disruptive - if they run into an unexpected error and throw an exception, the user transaction will fail, breaking the operational system.
- Using a database table to track changes consumes database storage, and requires an additional step to remove old data periodically.
- Triggers tend to become disabled or invalid over time, for many reasons ranging from table schema changes to various admin operations. Any change that happens until they are fixed will be lost.
- How it works: This group of techniques involves running a periodic SQL query to identify some types of changes. It relies on having some table property to identify changes efficiently - for example, a “last updated” timestamp column or an ascending integer primary column, etc.
- Advantages: Periodic queries have low overhead if the source table is properly indexed, and while they miss intermediate changes, they do catch up nicely after an outage and can support schema changes.
- Disadvantages: Depending on the specific technique, it will likely only identify some types of changes (for example, only INSERT or only INSERT/UPDATE), and will always provide just a delta between the periodic queries, so it will miss some intermediate change states in a sequence of changes.
Transaction log processing
- How it works: Relational databases write any change to their data to a transaction log - an internal mechanism that allows them to correctly recover from failure or to be restored to any point in time, if needed. The transaction log could be parsed (sometimes with the help of a built-in database infrastructure) to extract the changes stored within them.
- Advantages: This method allows capturing all changes of all types in an asynchronous fashion. It does not require changes in the database schema or the application.
- Disadvantages: While it typically has minimal overhead, It is, however, harder to implement.
As we saw, there are many possible ways directions to implement a database CDC solution. While each might have its niche, our experience is that transaction log processing is generally the most powerful solution, leading to the lowest latency and overhead while being able to capture every single change.
However, picking a CDC technique is only the starting point. In order to achieve a performant and reliable solution, there are many other considerations. For example - how to correctly handle disconnects and processing errors? How to make sure no change is lost or duplicated in all failure scenarios (exactly-once guarantees)? How to sync the initial data capture and the starting time of the CDC? How to minimize the overhead on the source database? How to minimize CDC latency and maximize its throughput?
Equalum’s Data Beaming technology is purpose-built to harness the power of open-source frameworks Spark and Kafka in an end-to-end solution. And we built the most robust and powerful CDC approach on the market to address the performance and application impact challenges companies face as they scale.