Real-time data streaming has grown exponentially, and more than 80% of organizations report that real-time data streams are critical to building responsive business processes and improved experiences for their customers. Data streaming helps companies gain actionable business insights, migrate and sync data to the cloud, run effective online advertising campaigns, and create innovative next-gen applications and services. But to act on events and data as soon as they happen, you need a data infrastructure built for real-time streaming.
The need for real-time data
When a business runs in real time, the need for real-time data becomes increasingly apparent. Use cases in security and threat management, customer activity tracking, and real-time financial data are all excellent examples.
Health care organizations increasingly rely on real-time data when making decisions about patient care. IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains depend on it as well. This data must be analyzed immediately and is often transformed before reaching the target stores (i.e., real-time ETL). Real-time data streaming is therefore an integral part of modern data stacks.
Common Streaming ETL Use Cases
360-degree customer view
A common use case for streaming ETL (also called real-time ETL) is building a "360-degree customer view," particularly one that enhances real-time interactions between a business and its customers. Consider a customer who uses the business's services (such as a cell phone plan or a streaming video service) and then searches its website for support. This activity is sent to the ETL engine as a stream, where it is processed and transformed into an analyzable format. Raw interaction data alone may not reveal insights that stream processing can surface: for example, the interactions might suggest that the customer is comparison shopping and may be about to churn. Should the customer call in for help, the agent has immediate access to up-to-date information on what the customer was trying to do. The agent can not only provide effective assistance but also offer up-sell/cross-sell products and services that benefit the customer.
Credit Card Fraud Detection
A credit card fraud detection application is another example of streaming ETL in action. When you swipe your credit card, the transaction data is sent to or extracted by the fraud detection application. In a transform step, the application joins the transaction with additional data about you and your purchase history. Fraud detection algorithms then analyze the combined record for suspicious activity. Relevant signals include the time of your most recent transaction, whether you've recently purchased from this store, and how the purchase compares to your normal spending habits.
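The join-then-analyze pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not a real fraud system: the profile store, field names, and the "10x average spend" rule are all hypothetical assumptions.

```python
# Hypothetical in-memory profile store; a production system would query
# a low-latency lookup table or cache keyed by customer ID instead.
PROFILES = {
    "cust-42": {"home_city": "Austin", "avg_purchase": 35.00},
}

def enrich_transaction(txn: dict) -> dict:
    """Transform step: join the raw card swipe with the customer's profile."""
    profile = PROFILES.get(txn["customer_id"], {})
    return {**txn, **profile}

def looks_suspicious(txn: dict) -> bool:
    """Toy rule: flag purchases far above the customer's typical spend."""
    avg = txn.get("avg_purchase")
    return avg is not None and txn["amount"] > 10 * avg

swipe = {"customer_id": "cust-42", "amount": 499.99, "city": "Berlin"}
enriched = enrich_transaction(swipe)
print(looks_suspicious(enriched))  # → True: $499.99 is over 10x the $35 average
```

In a streaming ETL pipeline, `enrich_transaction` would run continuously over the transaction stream so that each record arrives at the fraud algorithms already joined with the customer's history.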
Streaming Architecture and key components
Streaming ETL can filter, aggregate, and otherwise transform your data in-flight before it reaches the data warehouse. Numerous data sources are readily available to you, including log files, SQL databases, applications, message queues, CRMs, and more that could provide valuable business and customer insights.
Stream processing engines use in-memory computation to reduce data latency and improve speed and performance. A stream processor can run multiple data pipelines at once, each comprising multiple transformations chained together, with the output of each transformation serving as the input to the next. Producers can vary widely, such as Change Data Capture (CDC), a technology that captures changes from data sources in real time; so can consumers, such as real-time analytics apps or dashboards.
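A chain of transformations like this can be sketched with Python generators, where each stage consumes the previous stage's output lazily, record by record. The event format and stage names here are invented for illustration; real engines distribute this work across workers.

```python
def parse(lines):
    """Stage 1: turn raw CSV lines into structured records."""
    for line in lines:
        user, action, ms = line.split(",")
        yield {"user": user, "action": action, "latency_ms": int(ms)}

def drop_heartbeats(events):
    """Stage 2: filter out records the downstream stages don't need."""
    for event in events:
        if event["action"] != "heartbeat":
            yield event

def tag_slow(events):
    """Stage 3: enrich each record with a derived field."""
    for event in events:
        event["slow"] = event["latency_ms"] > 500
        yield event

raw = ["alice,click,120", "bob,heartbeat,5", "carol,search,900"]
# Output of each stage feeds the next; nothing runs until results are consumed.
pipeline = tag_slow(drop_heartbeats(parse(raw)))
results = list(pipeline)
```

Because generators pull one record at a time, the whole chain processes data "in flight" without materializing intermediate result sets, which is the same principle a stream processor applies at scale.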
The goal is a streaming latency of one second or less while handling more than 20,000 data changes per second from each data source.
Data transformation during stream processing
The aim of streaming ETL or stream processing is to provide low-latency access to streams of records and enable complex processing over them, such as aggregation, joining, and modeling.
Data transformation is a key component of ETL. The transformation includes such activities as:
- Filtering only the data needed from the source
- Calculating new values
- Splitting fields into multiple fields
- Joining fields from multiple sources
- Normalizing data, such as DateTime in 24-hour format
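Most of these activities can be combined in a single record-level transform function. The sketch below is illustrative only: the field names and the source timestamp format are assumptions, and joining fields from multiple sources is omitted for brevity.

```python
from datetime import datetime

def transform(raw: dict) -> dict:
    """Apply filter, calculate, split, and normalize steps to one record."""
    # Filter: forward only the fields the target store needs
    record = {k: raw[k] for k in ("full_name", "qty", "unit_price", "ordered_at")}
    # Calculate a new value from existing fields
    record["total"] = record["qty"] * record["unit_price"]
    # Split one field into multiple fields
    first, last = record.pop("full_name").split(" ", 1)
    record["first_name"], record["last_name"] = first, last
    # Normalize: render the timestamp in 24-hour format
    ts = datetime.strptime(record.pop("ordered_at"), "%m/%d/%Y %I:%M %p")
    record["ordered_at"] = ts.strftime("%Y-%m-%d %H:%M")
    return record

row = {"full_name": "Ada Lovelace", "qty": 3, "unit_price": 9.5,
       "ordered_at": "01/15/2024 02:30 PM", "internal_note": "not forwarded"}
clean = transform(row)
```

In streaming ETL this function would be applied to every record as it flows through the pipeline, rather than in a nightly batch.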
When working with streaming data, real-time transformation is often needed to prepare the data for further processing. The high volume and velocity of streaming data make this challenging, but a number of techniques can accomplish it.
Data filtering limits which data is forwarded to the next stage of a stream processing pipeline. You may want to filter out sensitive data that must be handled carefully or that has a limited audience. Filtering is also commonly used to enforce data quality and schema matching. Finally, filtering is a special case of routing a raw stream into multiple streams for further analysis.
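Both uses of filtering, redacting sensitive fields and routing one raw stream into several, can be shown in a small sketch. The event shape, the `"ssn"` field, and the routing rule are all hypothetical.

```python
def route(events):
    """Redact sensitive fields, then split one stream into two by level."""
    public, security = [], []
    for event in events:
        # Filter out sensitive data with a limited audience
        sanitized = {k: v for k, v in event.items() if k != "ssn"}
        # Route: security events go to their own stream for separate analysis
        target = security if event.get("level") == "security" else public
        target.append(sanitized)
    return public, security

events = [
    {"level": "info", "msg": "page view", "ssn": "redact-me"},
    {"level": "security", "msg": "failed login"},
]
public, security = route(events)
```

A real pipeline would publish each output stream to its own topic or table instead of collecting lists, but the filter-then-route logic is the same.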
Even after a stream has been converted into structured records, it may still need restructuring through projection or flattening operations. These transformations are most commonly used to map records from one schema to another.
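Flattening and projection can each be expressed as a small record-level function. This is a minimal sketch; the dotted-key naming convention and sample schema are assumptions, not a standard.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys: {"user": {"id": 1}} -> {"user.id": 1}."""
    flat = {}
    for key, value in record.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, full_key + "."))
        else:
            flat[full_key] = value
    return flat

def project(record: dict, fields: list) -> dict:
    """Projection: keep only the columns the target schema expects."""
    return {k: record[k] for k in fields if k in record}

nested = {"user": {"id": 1, "geo": {"city": "Austin"}}, "event": "click"}
flat = flatten(nested)
narrow = project(flat, ["user.id", "event"])
```

Applied per record in a stream, these two steps convert arbitrarily nested source events into the flat, fixed-column shape a warehouse table expects.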
Streaming ETL has emerged as the most efficient, effective method of real-time data integration when transformations are required. It supports critical business use cases by integrating with business intelligence products, AI, machine learning, and intelligent process automation (IPA) workflows.
Learn more about streaming ETL by downloading our whitepaper here.