Top Design & Implementation Challenges with Change Data Capture (CDC)

Strategic Considerations for Successful Streaming Ingestion


The push toward real-time data and streaming-first architectures has never been more pervasive than in today's big data and analytics landscape. The volume and velocity of data are ever increasing, straining legacy architectures as they attempt to process it effectively. Yet ingesting data from operational sources in near real time, transformed and optimized, does not come without complexity.



In this comprehensive guide, we will walk you through best practices for deploying modern Change Data Capture.

Prefer to Download the eBook?

DOWNLOAD




INTRODUCTION | Why Change Data Capture Matters Now More Than Ever

Compared to traditional batch processes, Change Data Capture is a low-overhead, low-latency method of extracting data: it limits intrusion into the source and continuously ingests and replicates data by tracking changes as they occur. When designed and implemented effectively, CDC is the most efficient method to meet today's scalability, efficiency, real-time and low-overhead requirements.

It is imperative that your business has the right architecture in place to handle high data throughput, easily replicate subsets of data and ever-changing schemas, and capture the data and its changes exactly once, then replicate or ETL the CDC data to your data warehouse or data lake for analytics purposes.




CHAPTER 1 | Where Should My Organization Start When Implementing A Streaming Architecture?

Let’s begin by considering the “why” behind your move to a streaming-first architecture. Are you exploring new use cases to drive business growth with streaming data ingestion? Or are you watching existing use cases hit performance challenges with legacy tools and seeking new solutions?

You might consider a combined batch-and-streaming approach, maintaining the batch processing that is functioning well and transitioning other workloads to streaming where business gains or system efficiencies are clear. In either case, modern change data capture will be a vital component of your new architecture.


CDC is the gateway to unlocking two main ingestion use cases:

  1. CDC Replication - Replicating data from operational systems into data warehouses or data lakes. Replication is the key to powering ELT (Extract, Load, Transform) data ingestion into analytics environments.
  2. CDC Streaming Ingestion - Performing a real-time ETL (Extract, Transform, Load) of data from operational systems into data warehouses or data lakes.

The main difference between the strategies is where the data manipulation is performed - in the ingestion pipe or in the destination analytics engine (data warehouse or data lake). Most organizations use a combination of these ingestion strategies, based on organizational needs, use cases and infrastructure.
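To make the distinction concrete, here is a minimal, self-contained Python sketch (the sample data and transform are hypothetical) showing the same change set landing via ETL, where the manipulation happens in the pipe, versus ELT, where raw changes land first and are transformed by the destination's own compute.

```python
# Hypothetical change events arriving from a CDC source.
changes = [
    {"op": "insert", "id": 1, "amount": "19.99", "currency": "USD"},
    {"op": "update", "id": 1, "amount": "24.99", "currency": "USD"},
]

def transform(row):
    # Example manipulation: cast amounts to float.
    return {**row, "amount": float(row["amount"])}

# ETL / CDC Streaming Ingestion: manipulate in the pipe, then load curated data.
etl_target = [transform(c) for c in changes]

# ELT / CDC Replication: load the raw changes first...
elt_raw_zone = list(changes)
# ...then transform later using the analytics engine's own compute.
elt_curated = [transform(c) for c in elt_raw_zone]

print(etl_target == elt_curated)  # same result, different place of work
```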


Avoid Too Many CDC Tools

Traditionally, however, organizations were forced to use different ingestion products for each use case, because many legacy CDC Replication tools lack real-time transformation, data manipulation, aggregation and correlation capabilities within the pipe. Conversely, CDC Streaming Ingestion tools were not built to natively support CDC Replication use cases and are missing key capabilities for them.

With the addition of the Batch ETL use case, which every organization also uses in some capacity, organizations often find themselves with three different tools to achieve a full implementation of their data ingestion strategy.

Using three different tools to support all three use cases is not only extremely cost inefficient, but also leaves the organization managing and maintaining three tools, different processes and different skill sets, with no single point of ownership for its data ingestion approach.

It is vital to use the right tool for the right job, but it's also key to consolidate ingestion processes where possible. Centralize system monitoring. Lower costs across platform licenses. Streamline your architecture for more transparency into data processing, greater scalability and ease of use.





Chapter 2 | Beware of Source Overhead when Extracting and Transferring Data Using CDC

Growing data ingestion requirements from analytics initiatives create additional workloads on an organization's operational systems. Batch jobs are scheduled for certain windows of the day to avoid source overhead during business hours, but batch loads often push that boundary, negatively affecting day-to-day operations. While organizations will continue to use batch ETL for predictable workloads such as post-mortem analysis and weekly reporting, batch-only architectures cannot keep up with the data velocity and volume that modern enterprises require. However, streaming data from the operational systems can also significantly impact the ability to guarantee SLAs if not implemented properly. Change Data Capture is the single most effective way to extract data from an operational system without making application changes.


CDC techniques avoid reading entire data sets and instead focus only on the changes to the data, extracting incremental data. This eliminates the inherently expensive, traditional batch data extraction on source systems. Several CDC techniques are available in most popular data sources, each offering different levels of performance, intrusiveness and, perhaps most importantly, data source overhead. For example, in a database we can implement CDC in several ways, such as triggers, JDBC queries based on a date pseudo-column, or the transaction log. The first two are extremely inefficient, both from a performance perspective and, more importantly, in terms of database overhead.


Database triggers are extremely expensive, synchronous operations that delay every single database transaction in order to execute the trigger. JDBC query-based extraction involves expensive range scans on the data tables that can create significant overhead even if the table is indexed and partitioned correctly. CDC using transaction log extraction is the most efficient way to extract changes asynchronously, without slowing down operations and without additional I/O on the data tables. The impact on both performance and source overhead will differ significantly based on the chosen CDC strategy, in ways that can "make or break" the project, due to the level of risk introduced to the operational system.
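As an illustration of why query-based extraction is costly, here is a minimal, runnable sketch of the date pseudo-column polling approach, using Python's built-in sqlite3 purely as a stand-in source (table and column names are hypothetical).

```python
# Query-based CDC: poll a date pseudo-column for rows changed since the last run.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 19.99, '2024-01-01T10:00:00')")
conn.execute("INSERT INTO orders VALUES (2, 24.50, '2024-01-01T10:05:00')")

last_seen = "2024-01-01T10:00:00"  # high-water mark from the previous poll

# Each poll range-scans the source table - extra I/O on the operational system,
# and deletes or multiple updates between polls are missed entirely.
rows = conn.execute(
    "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
    (last_seen,),
).fetchall()

for row in rows:
    print("captured change:", row)

# Log-based CDC avoids this query load by reading changes asynchronously from
# the database's transaction log instead of from the data tables themselves.
```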





Chapter 3 | Optimize Initial Data Capture as You Begin CDC Powered Replication

Retrieving the current image of the data set before applying CDC changes is vital to a fully functional replication process, and it is done relatively frequently over the life of a replication. Therefore, optimizing the initial capture phase, along with its synchronization with the CDC changes, is a key goal.

Here we will highlight a few, critical aspects of Initial Data Capture that ensure successful CDC replication:

  • Initial Capture Speed
    Initial capture, the process of extracting the current state of the data set once, should be robust and extract the data set using bulk, parallel processing methods. There is usually more than one way of extracting data in bulk from a source - through JDBC or APIs, with different levels of parallelism. Picking and optimizing the best approach is key to ensuring that multi-object replication maintains acceptable service levels. Each source has a unique, optimized way to perform bulk extraction. Restreaming is very common for many reasons - nightly loads, out-of-sync objects, failures and more - so initial capture is not a one-off process.
  • Synchronizing between CDC and Initial Capture processes
    A replication process requires combining the existing state of the data set with the capture of all changes from that point in time onward. That point in time is typically referred to as the synchronization offset, tracked by a timestamp or, more accurately, by an internal change sequence number. The synchronization offset is required to achieve successful, exactly-once data replication for any source type that stores data at rest. Trying to manually sync the initial capture with the CDC changes is a tedious task that is prone to failures, due to the high frequency of changes in the data set. It's important to ensure there are automatic, reliable ways to guarantee proper synchronization.
  • Decoupling Initial Capture and CDC Process Dependency
    Initial capture, even if optimized, can still take a long time to complete due to the size of the data set. It's important to eliminate the dependency between the completion of the two processes by allowing both to run in parallel. This lets the overall replication reach a "synced" state much quicker, and in most sources the initial capture data and the CDC changes are not extracted from the same logical and physical file anyway. Leveraging parallel processing by decoupling initial capture and CDC to improve overall performance and "time to synced" is recommended (see the sketch after this list).

    This becomes especially critical when handling many objects in a queueing fashion, as the "time to synced" can grow for every consecutive object in the queue. Decoupling helps both initial capture and CDC execute at top speed.
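Below is a simplified, self-contained Python sketch of these ideas: capture a synchronization offset, run initial capture and CDC in parallel, and reconcile the changes above the offset on top of the initial image. The in-memory source and change log are stand-ins for a real source's state and transaction log.

```python
import threading, queue

source_rows = {1: "alice", 2: "bob"}          # current state of the data set
change_log = [(101, "update", 2, "bobby"),    # (sequence, op, key, value)
              (102, "insert", 3, "carol")]

sync_offset = 100          # capture the offset BEFORE the bulk extract begins
target = {}
cdc_queue = queue.Queue()

def initial_capture():
    for key, value in source_rows.items():    # bulk, parallelizable extract
        target[key] = value

def cdc_stream():
    for seq, op, key, value in change_log:
        if seq > sync_offset:                 # only changes after the offset
            cdc_queue.put((seq, op, key, value))

# Run both processes in parallel - no dependency between their completion.
threads = [threading.Thread(target=initial_capture), threading.Thread(target=cdc_stream)]
for t in threads: t.start()
for t in threads: t.join()

# Apply CDC changes on top of the initial image, in sequence order.
while not cdc_queue.empty():
    seq, op, key, value = cdc_queue.get()
    if op == "delete":
        target.pop(key, None)
    else:
        target[key] = value

print(target)   # {1: 'alice', 2: 'bobby', 3: 'carol'} - "synced" state reached
```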












Chapter 4 | CDC (Extraction) Bottlenecks from your Sources Cause Negative Ripple Effects

A CDC ingestion process is only as quick as its weakest point. It does not matter how scalable your ingestion pipeline is if the CDC extraction method used on the source is slow. The source will almost always become the bottleneck of any CDC ingestion process. It’s also important to remember that not all CDC extraction methods and APIs are created equal. You must consider performance, latency and source overhead tradeoffs before choosing a CDC approach.

For example, in an Oracle database we can implement CDC in several ways, such as triggers or JDBC queries based on a date pseudo-column. We can also use LogMiner, which is considered "true" CDC because it tracks changes through the redo log. That said, even LogMiner, the preferred method of the three, is limited to a single core in the database, which roughly translates into up to 10,000 changes per second. Beyond that, LogMiner starts accumulating lag, and the lag keeps growing until a restream occurs, after which it begins building up again.
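For reference, here is a hedged sketch of what polling LogMiner can look like from Python using the python-oracledb driver. The connection details, privileges, starting SCN and table filter are assumptions, and the exact LogMiner options (for example, how log files are registered) vary by Oracle version - treat this as illustrative rather than a production reader.

```python
import oracledb

# Hypothetical connection details - replace with real credentials and DSN.
conn = oracledb.connect(user="cdc_user", password="***", dsn="db-host/ORCLPDB1")
cur = conn.cursor()

start_scn = 1_000_000  # last processed system change number (the sync offset)

# Start a LogMiner session from that SCN using the online catalog dictionary.
# Depending on the Oracle version, log files may need to be added explicitly.
cur.execute("""
    BEGIN
        DBMS_LOGMNR.START_LOGMNR(
            STARTSCN => :scn,
            OPTIONS  => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
    END;""", scn=start_scn)

# Pull redo records for the tables of interest. This query runs inside the
# database, which is why LogMiner throughput is bounded by the source itself.
cur.execute("""
    SELECT SCN, OPERATION, SEG_OWNER, TABLE_NAME, SQL_REDO
      FROM V$LOGMNR_CONTENTS
     WHERE SEG_OWNER = 'SALES' AND OPERATION IN ('INSERT','UPDATE','DELETE')""")

for scn, op, owner, table, sql_redo in cur:
    print(scn, op, f"{owner}.{table}", sql_redo)

cur.execute("BEGIN DBMS_LOGMNR.END_LOGMNR; END;")
```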

There is, however, a fourth method available in this instance (tenfold more complicated than any of the other three approaches) - direct redo log parsing. Direct redo log parsing reads the redo log's binary data and natively parses the bytes based on offset positions, essentially reverse engineering the binary format. While only a handful of solutions have this capability, it offers speeds of up to 100,000 records per second (depending on record size and storage speed), with even lower overhead on the database.

Bottlenecks can also appear temporarily throughout the replication pipeline due to momentary network latency, source or target service delays and more. Therefore, the pipeline should also have backpressure mechanisms to avoid hitting memory limits in the pipe and to protect an already pressured network or disk. Applying backpressure simultaneously at various points in the data flow also alleviates strain on the system and ensures data accuracy.
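A minimal, self-contained illustration of backpressure with a bounded buffer between two pipeline stages: when the simulated target slows down, the producer blocks instead of exhausting memory. The stage names and sizes are arbitrary.

```python
import queue, threading, time

buffer = queue.Queue(maxsize=100)   # bounded buffer between pipeline stages

def extract():
    for change in range(1_000):
        buffer.put(change)          # blocks when the buffer is full -> backpressure
    buffer.put(None)                # end-of-stream marker

def load():
    while (change := buffer.get()) is not None:
        time.sleep(0.001)           # simulate a momentarily slow target or network
    print("all changes delivered")

threading.Thread(target=extract).start()
load()
```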





Chapter 5 | Handling High Volume Transactions

In databases that support transactions (where more than one change to the data is committed or rolled back together), another layer of complexity is added to the replication process - reading uncommitted changes.

There are a few reasons to read uncommitted changes:

  1. High volume transactions can hit memory limits if changes are only read after the commit has occurred.
  2. Persisting uncommitted changes helps maintain the replication process's latency SLAs, since there's no need to wait for the commit before extraction starts. Persisting these uncommitted changes also helps reduce the time to recover from a CDC process failure that is not source related.

The challenge, then, is managing the uncommitted changes. In a rollback scenario, for example, you want to remove those changes from the persistence layer so they are not replicated and do not waste disk space. The second important aspect is that persisting uncommitted data creates another I/O process that can negatively affect performance, throughput and latency. It's vital to choose the right persistence layer and the right level of asynchronous processing.
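As a simplified illustration, the sketch below buffers each transaction's changes as they are read, delivers them on commit and discards them on rollback; a real implementation would also spill large transactions to a persistence layer asynchronously, which is exactly the I/O trade-off described above. All names and events here are hypothetical.

```python
from collections import defaultdict

open_transactions = defaultdict(list)

def on_change(txn_id, change):
    open_transactions[txn_id].append(change)       # read before the commit

def on_commit(txn_id):
    for change in open_transactions.pop(txn_id, []):
        replicate(change)                           # now safe to deliver downstream

def on_rollback(txn_id):
    open_transactions.pop(txn_id, None)             # drop the changes, reclaim space

def replicate(change):
    print("replicated:", change)

# Example: one committed and one rolled-back transaction.
on_change("txn-1", {"op": "insert", "id": 1})
on_change("txn-2", {"op": "update", "id": 7})
on_commit("txn-1")      # replicated
on_rollback("txn-2")    # never leaves the buffer
```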





Chapter 6 | "Exactly Once - End To End" is Vital with Change Data Capture Powered Ingestion

Achieving a true, exactly-once guarantee in a cluster computing pipeline is a BIG challenge. Exactly-once semantics mean ensuring that data arrives from the source at the target exactly once - no more and no less. While this is a key requirement for data replication use cases, it's also a key requirement for many data ingestion and ETL processes, and it is much harder to achieve in streaming and data replication scenarios than in traditional batch processing. When dealing with a multi-modal pipeline, plus a source and a target, there are many potential risks to the end-to-end exactly-once guarantee. Even if each component guarantees exactly-once delivery to the next, only a recovery component governing the process end to end can guarantee exactly-once semantics.

Take Kafka, for example: the Kafka consumer guarantees exactly-once delivery from the source read into Kafka, but it only covers issues that can occur while data is being delivered to the consumer. It does not cover source failures, target failures (wherever Kafka sends the data after it receives it), or any synchronization between what was sent from the source and what arrived at the target. When something like Spark is added for processing in between, there are now four or five potential points of failure in the chain. Tracking checkpoints of data integrity throughout the pipeline is the key to ensuring an exactly-once guarantee.
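One common way to approach this (a simplified sketch, not a description of any specific product) is to store the last applied change sequence number as a checkpoint alongside the target data and make the apply idempotent, so that changes redelivered after a failure are skipped rather than applied twice.

```python
target_state = {"rows": {}, "checkpoint": 0}   # stand-in for a target table + metadata

def apply_exactly_once(change):
    seq, key, value = change["seq"], change["key"], change["value"]
    if seq <= target_state["checkpoint"]:
        return                                  # duplicate redelivery - ignore it
    # In a real target, the row write and the checkpoint update would be one
    # atomic transaction, so a crash cannot separate them.
    target_state["rows"][key] = value
    target_state["checkpoint"] = seq

changes = [{"seq": 1, "key": "a", "value": 10},
           {"seq": 2, "key": "b", "value": 20},
           {"seq": 2, "key": "b", "value": 20}]   # redelivered after a retry

for c in changes:
    apply_exactly_once(c)

print(target_state)   # 'b' written once, checkpoint = 2
```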





Chapter 7 | The Challenges of Managing CDC Replication Objects at Scale

Managing CDC replication at scale can be extremely challenging. Replication use cases often require handling dozens, hundreds or even thousands of objects. Consider replicating a database schema, for example, which can contain thousands of tables. Each table needs to be validated for replication conflicts, unsupported data types, access permissions and much more. Without the ability to handle multiple objects using a bulk management instrument, defining the replication process and maintaining it day to day can become an unwieldy, often insurmountable task.

In many cases, Do-It-Yourself developments or new-age commercial solutions were not built with replication use cases in mind. As a result, bulk management instruments for replication are simply not available. What might seem minor at the development or POC stage becomes a true challenge down the road when going into real-life production with actual use cases. The complexity only increases with DIY and new-age commercial solutions when initial data capture and initial capture-CDC synchronization have to be done manually. Thinking back to the thousands of objects to be replicated, consider the challenge of having to manually sync and load initial capture processes for each table schema.

In a word...punishing.

Replication groups solve this problem, driving both operational and business efficiency with one-click bulk replication. Replication groups can facilitate large data migrations, cross-platform data warehousing (replicating to a data lake or data warehouse) and the management of many thousands of objects in simple ways. Performance and usability are the two most critical aspects of every successful replication.
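Conceptually, a replication group reduces to a single bulk definition over many objects plus automated per-object validation. The sketch below is purely illustrative - the configuration keys, connection strings and validation checks are hypothetical, not any particular product's API.

```python
replication_group = {
    "name": "sales_schema_to_lake",
    "source": "oracle://prod/SALES",
    "target": "s3://lake/raw/sales",
    "include": ["SALES.*"],          # wildcard over thousands of tables
    "exclude": ["SALES.TMP_%"],
}

UNSUPPORTED_TYPES = {"LONG RAW", "BFILE"}   # example check, not an exhaustive list

def validate(table):
    issues = []
    if not table["readable"]:
        issues.append("missing read permission")
    issues += [f"unsupported type: {c}" for c in table["column_types"]
               if c in UNSUPPORTED_TYPES]
    return issues

tables = [{"name": "SALES.ORDERS", "readable": True, "column_types": ["NUMBER", "DATE"]},
          {"name": "SALES.DOCS", "readable": True, "column_types": ["BFILE"]}]

for t in tables:
    print(t["name"], validate(t) or "ok")
```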





Chapter 8 | Data Drift Can Break Streaming Ingestion Pipelines

Data drift (also referred to as schema evolution) can create serious technical challenges as seemingly never-ending mutations cause structural and semantic drift. As fields are added, modified or deleted at the source, or as the meaning of the data evolves, data drift can break the data pipeline and create downtime for analytics.

Data drift is not a new challenge; it has been present in batch ingestion processes for decades. When transitioning to streaming ingestion, however, data drift becomes tenfold more challenging for two main reasons:

  • Streaming pipelines are near real-time and continuous by definition, which means that human intervention when something unexpected (like data drift) occurs is extremely disruptive, as opposed to batch processes, which are pre-scheduled and run within a certain window or time frame.
  • Because of the continuous nature of streaming ingestion, when data drift occurs - unless the change is handled automatically, or at least both data streams (before and after the drift) are maintained properly - it can lead to serious data integrity issues and potentially expensive recovery processes.

In streaming pipelines, data drift should mostly be handled automatically, by a predefined set of rules that outline how each type of change is treated. These rules can be simple and generic, or fine-grained by type of data, type of change, type of data flow and more. Without a predefined set of rules, human intervention will be required and SLA disruptions will occur.
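A minimal sketch of what such rule-driven handling can look like: each drift type maps to a predefined action so the pipeline reacts without human intervention. The rule names, actions and schema here are illustrative assumptions.

```python
DRIFT_RULES = {
    "column_added":   "add_nullable_column_to_target",
    "column_dropped": "keep_target_column_and_null_fill",
    "type_changed":   "route_to_quarantine_for_review",
}

expected_schema = {"id": "int", "amount": "float"}

def detect_drift(record):
    drifts = []
    for field in record:
        if field not in expected_schema:
            drifts.append(("column_added", field))
    for field in expected_schema:
        if field not in record:
            drifts.append(("column_dropped", field))
    return drifts

record = {"id": 1, "amount": 9.5, "coupon_code": "SPRING"}   # new field appeared
for drift_type, field in detect_drift(record):
    print(field, "->", DRIFT_RULES[drift_type])
```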


Even after setting the rules, you need to expect the unexpected: maintain both streams (before and after the drift), maintain watermarks (to order the events properly and reconcile the data in the target), and have an easy way to apply the right manipulation to adapt the data to the new structure and re-stream the old-structured data to the target.

The last layer of data drift handling needs to include proper monitoring and alerting of the data drift incident for auditing and administration purposes.





Chapter 9 | A Future Proof Way to Deploy a Streaming Ingestion Solution

Enabling multi-modal (ETL/ELT) CDC streaming ingestion requires getting more than just one component right. Looking at open-source frameworks, each one delivers part of that promise; as standalone frameworks, however, they are not a complete solution.

  • Spark provides unparalleled data processing speeds, but does not collect CDC changes from the source.
  • Kafka enables highly scalable data buffering and pub/sub capabilities, but relies on open-source and commercial CDC solutions for CDC extraction and relies on frameworks like Spark for significant computation workloads.
  • Even open-source CDC solutions have limitations in capabilities, performance and enterprise grade support.

The key is to have all of these technologies work together coherently, in a fully orchestrated environment. Achieving this coherence with open-source frameworks alone, however, is extremely challenging with DIY solutions. Best case scenario...it takes a few years to design, develop and build. The system ends up working, but requires endless maintenance, administration and development. Over 90% of these DIY projects end up failing, and over 60% of data engineers' valuable time is spent on patches and fixes to complex, custom-coded systems.


On top of all these risks, there's the ultimate risk of getting locked into a framework that ends up deprecated a few years down the road, similar to how Hadoop MapReduce has been widely replaced by frameworks such as Spark. The right orchestration solution takes all these risks into account: it leverages the best open-source frameworks, simplifies the management, coding and administration hassle, and accounts for the fact that open-source projects often become deprecated when the new cool kid comes to town. Moreover, it's important to consider the inevitability that data volume and velocity will increase in the years to come. Today's workloads are just a fraction of tomorrow's, so you need to design an architecture that can scale with growing demands.
















Optimizing Streaming Architectures with Speed, Scalability, Simplicity and Modern CDC

Remember, CDC should be thought of as a means to an end - a supplement to your data architecture, acting in service of your downstream data rather than a standalone effort.


Step One:
Identify your primary streaming use case.

Step Two:
See if you will have overlap in other use cases that could be accommodated by one tool versus many.

Step Three:
Aim for data architecture simplification to optimize operations, management, monitoring and system sustainability.



Regardless of the use case, some final considerations:

  • Simplify the streams of data - Complex data streams can lead to endless patch and fix corrections. Are there places where you could make your data architecture more efficient and easier to monitor? More performant? Safeguard your data and your time with simplicity and visibility.
  • Combine operations with one platform versus many - Seek to reduce the cost and complexity of multiple licenses and subscriptions in exchange for one platform that can perform all necessary functions under a single licensing fee. That said, don't migrate everything on day one; start with greenfield use cases and gradually migrate the other use cases that can benefit from the new platform.
  • Take advantage of a drag and drop UI versus complex custom coding in multiple tools - Diversify the talent that can engage with your data architecture by eliminating the need for heavy coding. Drag and drop interfaces streamline the process of onboarding, rollout, administration, maintenance and monitoring.
  • Explore platforms that make CDC ingestion fast and easy - Look for solutions with low overhead on source systems and the flexibility to change targets or sources with the click of a button. Replication groups to handle complex and large replications? Cloud agnostic? Even better.
  • Find a scalable multi-modal solution - Be ready to grow as your business expands and data volume and velocity increase. Need to maintain some batch processes as well as streaming capability? Look for a solution that can accommodate both.

Modern Change Data Capture is a vital component of a streaming-first data architecture. These guidelines and best practices will help you achieve an optimized streaming pipeline architecture that ingests data seamlessly and efficiently from any data source, including legacy sources, enabling analytics teams to provide the most enriched, up-to-date business insights to the organization's decision makers. As streaming-first ingestion becomes the standard, preferred data architecture to support business operations and visibility, make sure you have the right strategy in place to succeed.





Interested in learning more about Equalum?



GET A DEMO



Ready to Get Started?

Experience Enterprise-Grade Data Ingestion at Infinite Speed.