Four Considerations when Evaluating a Streaming-First Data Platform

by Cesar Rojas

May 13, 2020 11:30am




Real-time access to data delivers great business value, but your systems, applications, and machine sensors constantly produce operational data that cannot be used in its original form to innovate the business through analytics or AI processing.

The need to consume usable data in real time is radically transforming how organizations build their data ingestion systems. Streaming-first data architectures are emerging as the preferred environment, where data is captured and processed before it is replicated to data warehouses, data lakes, or other analytical or operational platforms.



There are four main considerations to examine when deploying streaming-first data architectures.



#1 - Continuously Collect Data with Modern Change Data Capture


Before the explosion of real-time data, organizations primarily ran batch processes at scheduled times against “data at rest,” that is, data stored in persistent storage. This has changed significantly in the last decade. Streaming-first data architectures must collect data at the time of creation and, in many cases, before it hits a persistent layer. Users who want non-stop collection and movement of data should consider a technique called Change Data Capture (CDC), which immediately recognizes when an insert, update, or delete has occurred and acts on it quickly.
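To ground this, here is a minimal sketch, assuming a log-based CDC reader that emits one event per row change, of how such events might be dispatched the moment they occur. The event structure and handler are hypothetical illustrations, not Equalum's API.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class ChangeEvent:
    """A simplified change record, as a log-based CDC reader might emit it."""
    table: str
    op: str                           # "insert", "update", or "delete"
    before: Optional[Dict[str, Any]]  # row image before the change (None for inserts)
    after: Optional[Dict[str, Any]]   # row image after the change (None for deletes)

def apply_event(event: ChangeEvent) -> None:
    # React to each operation as soon as it is decoded from the transaction log,
    # instead of waiting for a scheduled batch window.
    if event.op == "insert":
        print(f"INSERT into {event.table}: {event.after}")
    elif event.op == "update":
        print(f"UPDATE {event.table}: {event.before} -> {event.after}")
    elif event.op == "delete":
        print(f"DELETE from {event.table}: {event.before}")
    else:
        raise ValueError(f"unknown operation: {event.op}")

# Example events, as they might be decoded from a database transaction log
apply_event(ChangeEvent("orders", "insert", None, {"id": 1, "total": 99.5}))
apply_event(ChangeEvent("orders", "update", {"id": 1, "total": 99.5},
                        {"id": 1, "total": 120.0}))
```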

Legacy CDC technologies typically provide out-of-the-box integrations to major relational databases but may offer limited support for non-database sources and targets. Streaming-first data platforms must be able to directly access data from machine sensors, application layers, message queues, and more. Technical leaders should ensure that a solution for data in constant motion has complete and proven support for data from any source to any target.

At the very minimum, the following types of data sources must be supported:

  • Relational databases, via transaction log processing and other forms of CDC
  • File repositories, including locally mounted file systems, HDFS clusters, Amazon S3, and other object storage platforms
  • NoSQL databases and distributed key-value stores, which in many cases lack a standard change-capture API and instead expose changes through a JDBC interface
  • Application data, via REST API connections to endpoints at the application level rather than the database level (see the sketch after this list)
  • Message queues, carrying many different types of message formats
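For the application-level case, capturing changes through a REST endpoint might look like the following minimal sketch; the URL, parameters, and response shape are purely illustrative assumptions.

```python
import time
import requests

API_URL = "https://example.com/api/orders"  # hypothetical application endpoint

def poll_changes(since: str) -> list:
    """Fetch records changed after the given timestamp via the application's
    REST API, rather than reading the underlying database."""
    resp = requests.get(API_URL, params={"updated_after": since}, timeout=10)
    resp.raise_for_status()
    return resp.json()

cursor = "2020-01-01T00:00:00Z"
while True:
    for record in poll_changes(cursor):
        print("change:", record)
        cursor = max(cursor, record["updated_at"])  # advance the high-water mark
    time.sleep(5)  # polling interval; log-based CDC avoids this delay entirely
```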

Once data is collected from these different sources, the resulting data streams are processed in data pipelines. At Equalum we call these pipelines “data flows.”

Below is an image of the Equalum Dashboard, with sources on the left, targets on the right, and flows and flow executions in the middle.

Want to see the Equalum Data Ingestion Platform in action?

SEE EQUALUM




#2 - Operationalize the Processing of Changes on Multiple Tables with Replication Groups and Schema Evolution


Streaming-first data platforms should process nonstop changes to groups of tables in one go, so that if multiple tables are updated in one transaction, the changes are captured together. This capability is known in our industry as Replication Groups. In a replication group, the tables to be replicated can be selected by name or by name patterns. Without support for replication groups, each table needs its own data flow, and each flow requires an independent flow execution. And because each table becomes a separate task, the integrity of transactions spanning tables may be compromised.

Schema Evolution is a common feature of replication groups: it provides full support for database schema changes in an automated manner, with options for customers to determine how they want schema changes propagated. Users should be able to automate the propagation of table creation and the addition or removal of table columns. When a new table appears at the source and its name matches the replication pattern, the platform creates a matching table in the target database and starts replicating it immediately. When considering streaming-first data platforms, replication groups and schema evolution should radically simplify massive data streaming processes.
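As an illustration, here is a minimal sketch of the pattern matching and schema propagation described above; the table metadata and function names are made up for the example and do not reflect Equalum's implementation.

```python
import fnmatch
from typing import Dict, List

# Tables currently present at the source, with their column definitions (illustrative)
source_schema: Dict[str, Dict[str, str]] = {
    "sales_orders":   {"id": "INT", "total": "DECIMAL(10,2)"},
    "sales_invoices": {"id": "INT", "amount": "DECIMAL(10,2)"},
    "hr_employees":   {"id": "INT", "name": "VARCHAR(100)"},
}

# A replication group selects tables by name or by name pattern
REPLICATION_PATTERN = "sales_*"

def tables_in_group(schema: Dict[str, Dict[str, str]], pattern: str) -> List[str]:
    """Return the source tables whose names match the replication group pattern."""
    return [t for t in schema if fnmatch.fnmatch(t, pattern)]

def ddl_for_new_table(table: str, columns: Dict[str, str]) -> str:
    """Generate DDL to create a matching target table when a new source table
    appears and matches the pattern (the schema-evolution case)."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
    return f"CREATE TABLE {table} ({cols});"

for table in tables_in_group(source_schema, REPLICATION_PATTERN):
    print(ddl_for_new_table(table, source_schema[table]))
# CREATE TABLE sales_orders (id INT, total DECIMAL(10,2));
# CREATE TABLE sales_invoices (id INT, amount DECIMAL(10,2));
```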

Here is a video of Equalum’s Replication Group functionality.


#3 - Ensure High-Performance Processing of Streaming Data

Streaming-first data platforms are usually built to guarantee high-speed flow executions. Low latency, high throughput, and linear scalability are must-have capabilities when moving large data volumes or executing flows with multiple or complex steps. Streaming-first data platforms should harness the scalability of modern open-source data frameworks (e.g. Apache Spark, Apache Kafka, and others) to dramatically improve processing performance. The goal is the capacity to scale linearly as data volumes grow while maintaining performance and minimizing system impact.
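To show the kind of Spark-and-Kafka pipeline such platforms build on, here is a minimal PySpark Structured Streaming sketch; the topic name, servers, and payload schema are placeholders, and this is a generic example rather than Equalum's internals.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-first-demo").getOrCreate()

# Read a stream of change events from Kafka (placeholder topic and servers)
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "change-events")
          .load())

# Parse the JSON payload and apply an in-flight transformation
schema = StructType([
    StructField("table", StringType()),
    StructField("op", StringType()),
    StructField("total", DoubleType()),
])
parsed = (events
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*")
          .filter(col("op") != "delete"))

# Continuously write the transformed stream to a sink (console for the demo)
query = (parsed.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()
```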

A Fortune 100 manufacturer re-architected its legacy ETL system with Equalum. Its data team converted to a new data pipeline to feed the data warehouse for continuous insights. They deployed Equalum to efficiently transform event data in flight, improving performance 15x. They now manage larger data flows than ever, consuming fewer CPU cycles at a lower cost.




#4 - Simplify Building of Data Flows/Pipelines with a UI-Based, Non-Programmatic Approach

In a streaming data flow, users should be able to easily define the processing performed on the data before it is moved to the target. This processing can include a wide variety of data manipulations, transformations, and enrichment operations. This is a departure from the mindset of traditional architectures, where two or more solutions (ETL, CDC, and other tools) were required, adding complexity and management costs.

A modern streaming-first platform should offer extreme simplicity when moving data from sources to targets. The ability to configure data processing with visual components (e.g. drag-and-drop flow charts) and non-programmatic data operators (logical, mathematical, comparison, and others) radically reduces development and production time.

Certain operators and functions are commonly used when processing real-time data. Make sure your streaming-first data platform supports those common operators and functions, as well as the ability to add new ones; the sketch below illustrates the idea.
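As a rough illustration of the kind of operator chain a visual flow builder generates under the hood, here is a hypothetical sketch; the operator names and record shapes are invented for the example.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Operator = Callable[[Record], List[Record]]

# Common streaming operators: a comparison filter and a mathematical transform
def filter_op(field: str, minimum: float) -> Operator:
    """Keep only records whose field meets the threshold (comparison operator)."""
    return lambda r: [r] if float(r[field]) >= minimum else []

def math_op(field: str, out: str, factor: float) -> Operator:
    """Enrich each record with a computed field (mathematical operator)."""
    def op(r: Record) -> List[Record]:
        r = dict(r)
        r[out] = float(r[field]) * factor
        return [r]
    return op

def run_flow(records: Iterable[Record], ops: List[Operator]) -> List[Record]:
    """Apply each operator in sequence, as a visual flow chart would chain them."""
    out = list(records)
    for op in ops:
        out = [result for r in out for result in op(r)]
    return out

flow = [filter_op("amount", 100.0), math_op("amount", "amount_scaled", 1.5)]
print(run_flow([{"amount": 250}, {"amount": 40}], flow))
# -> [{'amount': 250, 'amount_scaled': 375.0}]
```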



Equalum's flow screen is displayed below, where users can create simple or very complex flows with all the required operators for stream processing.





These four considerations are key when evaluating streaming-first data ingestion platforms. The Equalum data ingestion platform has been designed with those considerations in mind. Please contact Equalum for more information about how we can help you deploy your streaming-first data ingestion strategy.

WANT A DEEPER LOOK?

Download our e-book

"Modern Change Data Capture: How to Acquire the Data Your Organization Needs"


DOWNLOAD

Ready to Get Started?

Experience Enterprise-Grade Data Ingestion at Infinite Speed.