When building a scalable, high-performing data integration approach for a modern data architecture, companies face endless possibilities. From DIY solutions built by an army of developers to out-of-the-box solutions covering one or more use cases, it’s hard to navigate the myriad choices and the decision tree that follows.
Many questions arise in the process.
As you look for a Data Integration approach to facilitate your move to a streaming-first, modern data architecture, or assist in adopting Cloud Platforms, here are the Top 10 components you should be looking for in any solution.
Capturing data exactly once, neither fewer nor more times, is an underappreciated but very important component of a data integration solution. Exactly-once delivery is difficult to achieve and is often overlooked by organizations that don’t need it at the moment. Say you are tracking website views arriving at 1,000,000 views per second. If you lose 1% of those views because your pipeline lacks exactly-once functionality, it may not be crisis-causing. However, if you are a bank looking to catch malicious transactions and you are only catching them 99.9% of the time, you will inevitably face the consequences from unhappy customers. When data pipelines break and you have to go back in time to find when they broke and what you need to recapture, exactly once can be an incredible asset for ensuring data accuracy and reliability. Not every solution can guarantee exactly once, especially end to end, so look for one that provides an exactly-once guarantee to future-proof your architecture. Implemented correctly, it ensures that your data and analytics teams are looking at reliable, accurate data and making decisions based on the full picture rather than speculation.
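One common way to get effectively-exactly-once behavior on top of an at-least-once source is to make the target write idempotent. This is a minimal sketch of that idea, not any particular vendor's implementation; the event shape and names (`apply_exactly_once`, `seen_ids`) are illustrative assumptions.

```python
# Sketch: effectively-exactly-once processing via idempotent application.
# An at-least-once source may redeliver events; deduplicating on a unique
# event id means each event affects the target at most once.

def apply_exactly_once(events, store, seen_ids):
    """Apply each event exactly once, even if the source redelivers it."""
    for event in events:
        if event["id"] in seen_ids:       # duplicate from a retry or replay
            continue
        account = event["account"]
        store[account] = store.get(account, 0) + event["amount"]
        seen_ids.add(event["id"])         # record only after the write succeeds

store, seen = {}, set()
# The same event arrives twice, simulating an at-least-once redelivery:
batch = [
    {"id": "e1", "account": "A", "amount": 100},
    {"id": "e1", "account": "A", "amount": 100},
    {"id": "e2", "account": "A", "amount": -30},
]
apply_exactly_once(batch, store, seen)
# store["A"] is 70: the duplicate was detected and skipped
```

In a real pipeline the `seen_ids` set and the target write would need to be committed atomically (or the write itself made a keyed upsert) so the guarantee holds across failures.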
A streaming architecture is not complete without Change Data Capture (CDC), a methodology rather than a technology: a low-overhead, low-latency method of extracting only the changes to the data, limiting intrusion into the source while continuously ingesting and replicating. There are many ways to perform CDC effectively depending on the use case, such as log parsing, logical decoding, triggers and more, so you want to ensure that your solution can perform CDC in various ways from various sources to ensure successful data capture - also known as a multi-modal CDC approach.
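Whatever capture mode is used, the output of CDC is an ordered stream of change events that a target replays. This is a minimal sketch of that replay step under assumed event names (`op`, `key`, `row`); it is illustrative, not a specific product's format.

```python
# Sketch: applying a CDC change stream to a replica. Only the deltas
# (insert/update/delete) are shipped, never full table scans.

def apply_change(replica, change):
    op, key = change["op"], change["key"]
    if op in ("insert", "update"):
        replica[key] = change["row"]      # upsert the changed row
    elif op == "delete":
        replica.pop(key, None)            # remove the deleted row

replica = {}
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "balance": 10}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "balance": 25}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "balance": 5}},
    {"op": "delete", "key": 2},
]
for change in change_log:
    apply_change(replica, change)
# replica now holds only the net effect of the changes
```

The multi-modal point above is about where `change_log` comes from: the same replay logic works whether the events were harvested from transaction logs, logical decoding, or triggers.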
If your workflow requires not only simple replication but also joins, aggregations, lookups, and transformations, you should be able to drag and drop using an ETL designer to drive scalability and flexibility. Build pipelines quickly, apply the appropriate functions, change them as needed, and easily replicate your work in other areas of your architecture. A well-built ETL designer will also afford your team faster onboarding and execution.
You should have a very intuitive user interface under a single pane of glass where you can achieve multiple use cases. If you start today with Oracle to SQL replication, and your next use case is DB2 to Snowflake, you should be able to leverage the same UI repeatedly without having to train multiple people. Additionally, with multiple capabilities under one platform (e.g. streaming ETL and ELT, Change Data Capture, and batch ETL and ELT), you protect yourself as new use cases arise.
Streaming platforms like Kafka, Kinesis and Event Hubs should be considered viable sources within your architecture and be easily accessed by your data integration solution. You will want the ability to take data from streaming sources and move it to your eventual targets.
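The pattern is simply "stream in, target out": continuously drain records from the streaming source and deliver them downstream. This sketch uses an in-memory queue as a stand-in for a Kafka/Kinesis topic; `relay` and the record shape are illustrative assumptions, not a real client API.

```python
# Sketch: treating a stream (stand-in for a Kafka topic or Kinesis shard)
# as a first-class source and relaying its records to a downstream target.
from queue import Queue

def relay(source: Queue, target: list):
    """Drain the streaming source and deliver each record to the target."""
    while not source.empty():
        record = source.get()
        target.append(record)   # in practice: write to a warehouse or lake

topic = Queue()
for view in ({"page": "/home"}, {"page": "/pricing"}):
    topic.put(view)

sink = []
relay(topic, sink)
# sink now holds both records, in arrival order
```

A production connector would additionally track offsets/sequence numbers so it can resume after failure; the delivery loop itself looks much like this.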
If your source or target changes as your use cases evolve over time, you should be able to build on top of your current solution using the same platform, same UI and same team of people with ease and scalability. If you were to face this challenge with a DIY approach from the start, you would be looking at new code, new configurations, potentially new developer skill sets...and the list goes on.
If you have a JSON or XML data type embedded in your database, you should be able to flatten that data structure and pull out the required column values so that your data is easily consumed by downstream data applications.
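Flattening turns nested fields into top-level columns, typically by joining key paths with a separator. A minimal sketch for the JSON case (the function name and `_` separator are illustrative choices):

```python
# Sketch: flatten a nested JSON document stored in a database column
# into flat column names that downstream consumers can query directly.

def flatten(doc, parent="", sep="_"):
    cols = {}
    for key, value in doc.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            cols.update(flatten(value, name, sep))  # recurse into nesting
        else:
            cols[name] = value
    return cols

row = {"id": 7, "customer": {"name": "Ada", "address": {"city": "Austin"}}}
flatten(row)
# {"id": 7, "customer_name": "Ada", "customer_address_city": "Austin"}
```

Arrays and XML attributes need extra policy decisions (explode into rows vs. index into column names), which is why this capability is worth checking in a solution rather than assuming.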
You should be able to scale linearly, simply adding a node to accommodate increased workloads. The critical components of the solution should have no single point of failure: there should be multiple instances of those components so that if one instance goes down, the system can recover and heal itself. This is vital from an enterprise, operational point of view.
The choice of deployment should be up to you, not your vendor.
Using the same resource pool of your cluster, you should be able to logically separate sources and targets for those that require it. Often with sensitive data, not all members of an organization should have access to the data in its full form. You should be able to create job-based data silos so that data secrecy is maintained; if there is any PCI-type data, only those who truly need it should see it. Some solutions force the user to spin up multiple instances to create multiple tenants, resulting in duplicated environment management and added resources. Look for a solution that allows the system admin to create tenancy for various lines of business and users by exploiting the underlying resources of the cluster, rather than multiple instances.
In the midst of COVID, a health insurance provider struggled to confirm the status of subscribers when they called into the customer service center. Without access to this important, near-real-time data, the provider wasn’t able to fill prescriptions when asked.
As a result, they explored various solutions that met the Top 10 Components for a Modern Data Architecture listed above, and successfully implemented a near-real-time solution with Kafka as a target to deliver fresh data to their customer service team. Data on subscribers was immediately updated in the system, allowing prescriptions to be filled on demand, and leaving customers much happier.
A large US retailer with hundreds of stores wanted to better manage its inventory with near-real-time data capturing which items were sold and which were returned. It was looking for near-real-time streaming of POS databases into its centralized inventory system, allowing more efficient product ordering and better inventory management.
Interested in seeing how Equalum’s end to end Data Ingestion platform can transform your business?