In this lively discussion, Equalum CEO Nir Livneh and Eckerson Group President Wayne Eckerson tackled the evolution of data ingestion and the current landscape. As analytics moves from reactive to proactive to predictive, and from business intelligence to self-service to artificial intelligence, the impact on data ingestion and the pressure to satisfy an ever-increasing thirst for insights are growing exponentially.
Wayne: There have been three major eras in our recent data world - Business Intelligence, Self-Service Intelligence and Artificial Intelligence. People are still trying to figure out how to do Business Intelligence properly and to crack the self-service puzzle (which has a lot of companies tangled), where users are both empowered and creating chaos at the same time. As we have evolved through these eras of intelligence, our focus from a consumption perspective has moved from looking at the past (historical data), to a present-day focus, to using historical data to predict what will happen in the future. We are now applying machine learning models for proactive approaches. Impact has gone from neutral in the business intelligence world, where we provided some basic operational information, to positive in the self-service world, where we can become more nimble and generate insights on demand, reacting to issues and opportunities. With AI, we have seen lots of great examples of companies running on algorithms and data, changing the world and disrupting both their own industries and others at the same time.
Today, we are here to focus on data ingestion and data management. Ingestion has evolved from reactive to proactive to predictive. In the 90s, we were trying to get historical data into a data warehouse, and we thought we were doing well if we had a megabyte in our data warehouse. Of course, in hindsight that was very low volume. Fast forward 15 years into the world of self-service intelligence, and we are moving so much more data that we have to load it in mini-batches just to get it in, delivering in real time instead of nightly or weekly. We are loading both historical and real-time data, and volumes are increasing.
Now we are talking about hundreds of gigabytes (even terabytes at the high end). Our warehouse has gone from being a historical artifact to one that is operational. Things are starting to move to the cloud, which shifts architectures immensely and has deep impacts on ingestion.
Today, as we move to the world of AI, data volumes keep increasing as we add data from devices, machines and IoT, and even mini-batches aren't sufficient. We have to analyze data as it streams in, without storing it first, because it is too voluminous. We have to deal with hybrid clouds that span on-premise and cloud platforms, multi-cloud and more. We have to ingest and move data across all of these systems and platforms at speed to meet an ever-increasing thirst for data across organizations.
Nir has a huge amount of experience, focused on ingestion and that space. When you look at this framework, what speaks to you most from the ingestion perspective?
Nir: The focus here, as much as I want to dive into all of the details and technical pieces, starts with the value and business needs. We are talking a lot now about business SLAs, which, in the 90s, weren't the story. You used to look at whatever you did in the last year and project what you were going to do next year with post-mortem analytics. You had all the time in the world to think about how you wanted to build your data warehouse and what processes would bring data in, and all of that was pretty simple.
As we move forward, we are moving from post-mortem analytics into reactive, operational and predictive analytics, which changes the SLA completely. Because of that, it changes the requirements for the architecture and also the operational side of handling the data - things like data ops become very critical in that scenario as well. Everything changes towards supporting more, and more proactive, use cases. They naturally include data lakes, AI, and sometimes subscription services as well. Very advanced companies also move into micro-services architectures in the same way, where everyone in the organization can subscribe to the data that they want.
The biggest story is that the use cases are expanding on the same, existing architecture. But the reality is that when the volumes change, the SLA changes, and those parameters deeply impact how you perform analytics and ingestion, and likely how you use the warehouse and data lake as well. Those are the key pieces that force that shift. When the business requires more, there is a need for more of a multi-modal approach to handle multiple use cases in the analytics space. If you are able to achieve it, it can bring a lot of value to the organization.
Wayne: Until about two years ago, no one used the term data pipelines - now it's one of the most popular terms out there. I am an old data warehousing guy and we had, essentially, one pipeline - our ETL tool, which I suppose managed a number of pipelines extracting from a number of sources, but the target was always the same: the data warehouse, with sub-targets in data marts. With the advent of big data, data lakes and data scientists (for whom data lakes were built), people are pulling all kinds of data out of these to populate various models, which is where I think the "data pipelines" terminology came from. Now we are hearing about companies who don't just have a dozen data pipelines, they have hundreds or thousands. The requirements for bringing data in (ingesting) and moving it through a number of different programs to populate hundreds of thousands of different targets have grown exponentially.
As a result, we need all-new tools and platforms that can support this at scale, with reliability and with speed. The implications for the data ingestion area are also huge. I think the cloud is changing things quite a bit - I would love to hear your perspective on this. We are pulling data from all kinds of sources that we never had before and moving it to all kinds of different targets (multiple clouds).
What do you think has been the major impact of the cloud on these architectures?
Nir: At the end of the day, the cloud is a vendor, and they sell you some sort of platform that enables you to do some things, but there is also a little lock-in, which you have to take into account. A lot of companies, as they begin thinking about ingesting and performing analytics in the cloud, quickly realize that using services provided by the cloud vendor to do that for them is great. BUT, if you want to have Azure, AWS and GCP running at the same time and have them co-exist in a multi-cloud fashion, it's suddenly not a great idea to rely on services that are cloud specific. Those tools will always be tailored to that specific cloud. Long term, that may not be what you want. I think most companies, after they chose a specific cloud, started to realize it wasn't the best play and that they should diversify.
This ties back into the multi-modal story. You see it in the ingestion, in the lake and the warehouse, you see it in the cloud.
The reality is that there are so many use cases and so many architectures you need to support that you have to think about this globally, not just patch after patch. You can't just choose a cloud and then make a decision to add Azure and see what happens.
Wayne: I think a lot of people are doing that. I heard recently that most of the migrations to the cloud are being done using lift & shift, the brute force method, which may not give you any advantage over what you have now and just adds a lot of complexity. Here's something I've been wanting to ask you, because a lot of vendors and users have been coming to me saying "the hardest thing I see people going through right now is moving their data from on-premise to the cloud." It seems to be a major stumbling block. Maybe it's that they can't do it all at once, since that would be the lift & shift approach, so figuring out what to put up, what to keep, and how to keep it all in sync is likely a good part of the complexity. I don't see this as that hard, but I know I am missing something. Do you have any insights on that?
Nir: Yes, I think there is a misconception. If you look at data migration into the cloud vs data integration into the cloud, you are looking at two very different things. Migrating data into the cloud is a relatively simple problem to handle - yes, you have to worry about formats and downtime, but at the end of the day it's a migration. There's nothing very complicated there.
The more complicated story is about continuous integration, or some type of integration, between on-prem systems that cannot be migrated to the cloud and the processes, applications or analytics that sit in the cloud. That's where you get the problem. There are security concerns, you are paying your cloud vendor for network traffic all the time, there are the volumes, etc. When people talk about migration to the cloud, what they likely mean is that they want to integrate on-prem systems with cloud applications or analytic environments.
Wayne: Ok, so you are the ingestion and integration expert. If I've got stuff that I have to keep on-premise, I have to keep it in sync with my new cloud data warehouse...in fact I have a client - a university - with this exact situation. They have a SQL Server data warehouse on premise. They don't want to change it. It is running all of their operational reports in a certain BI tool. Now they want a self-service environment because the other environment is completely locked down. It's locked down so much that people can't even use the BI tool to query the data warehouse. They can only use the reports. So now they want to create a self-service environment, and they are going to have to move some if not all of this data warehouse data into the cloud and merge it / integrate with new sources of data like their online learning data which is clickstream data - people taking courses. How engaged are people? Are they learning anything? What are they clicking on? They really want to analyze that. How do you architect when you have your data in two places? Do you have any insights on this?
Nir: It's funny that you mention this because we too have a customer with a very similar situation - clickstream data alongside SQL Server, on-prem, at one of the Fortune 100 companies. The idea is that there are some decision points. First, you need to understand your security limitations. Can you push to the cloud? Can you pull from the cloud? There are many things that impact your architecture in terms of how you would push data in, and whether you need specific security, masking, lineage, auditing, etc.
Assuming you don't have those constraints, the next question becomes: do you ELT that data or ETL it - extract, load and transform the data, or extract, transform and load? In other words, do you actually transform the data before it makes it to the cloud using an ETL tool? Or do you just replicate the data so the raw data is available to consumers in the cloud, meaning it's likely in some lake or warehouse, with maybe the warehouse doing some heavy lifting on transformations? You need to be careful about how you make this decision. The vendors that sell you data warehouses and lakes will always want you to ELT - to replicate data - because they get paid more for having more workload.
The question is really, do you want to transform the data before it reaches the cloud - saving network and doing many things you couldn't do on the cloud, but then you wouldn't have the raw data at all. This means the number of use cases you can then consume and start using might narrow a bit. If you have a good understanding of how those use cases look, not just now but in two years or five years, you might be in a good spot to ETL the data.
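To make the ETL versus ELT trade-off concrete, here is a minimal sketch in Python. It is an illustration only, not how either speaker's tooling works: the clickstream records, the field names, the `transform` aggregation and the plain dictionaries standing in for a cloud target are all assumptions made for the example.

```python
# Hypothetical clickstream records; the field names are assumptions for illustration.
RAW_EVENTS = [
    {"user": "a", "course": "math", "clicks": 3},
    {"user": "a", "course": "math", "clicks": 2},
    {"user": "b", "course": "art", "clicks": 5},
]


def transform(events):
    """Aggregate total clicks per (user, course) pair."""
    totals = {}
    for event in events:
        key = (event["user"], event["course"])
        totals[key] = totals.get(key, 0) + event["clicks"]
    return [
        {"user": user, "course": course, "clicks": clicks}
        for (user, course), clicks in sorted(totals.items())
    ]


def etl(events, cloud):
    """ETL: transform on-prem, then load only the curated result."""
    cloud["curated"] = transform(events)
    return len(cloud["curated"])  # rows that cross the network


def elt(events, cloud):
    """ELT: load the raw rows first, then transform inside the target."""
    cloud["raw"] = list(events)
    cloud["curated"] = transform(cloud["raw"])
    return len(cloud["raw"])  # rows that cross the network


# A dict stands in for the cloud lake or warehouse in each scenario.
etl_cloud, elt_cloud = {}, {}
etl_rows = etl(RAW_EVENTS, etl_cloud)
elt_rows = elt(RAW_EVENTS, elt_cloud)

print("ETL rows sent:", etl_rows, "| raw kept in cloud:", "raw" in etl_cloud)
print("ELT rows sent:", elt_rows, "| raw kept in cloud:", "raw" in elt_cloud)
```

In this toy setup both paths produce the same curated table, but ETL ships fewer rows and leaves no raw data in the cloud, while ELT ships everything and keeps the raw rows available for use cases that have not been thought of yet.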
Another question to ask: can the on-prem environment handle the workload, or do you need the cloud's resources to process the data? Obviously, you also have to ask how this data is consumed afterwards. I would encourage people to think, before you start architecting, about your users/consumers. How will this data be used afterwards? This will imply specific SLAs, a specific architecture and limitations on on-prem and cloud, and it tells the story of whether you need a data lake, a data warehouse, or a multi-modal lake and warehouse that sit together. Maybe you want to reuse the data a few times for various use cases? Then there is no point in having both a lake and a warehouse if one can support all use cases.
Always look at your consumer, then look at your reality and try to fill those gaps.
Wayne: Yes, we always recommend the same. Work backwards. Work from the business case and business personas, and then have that affect your design of the architecture.
To hear the full session, check out the video above.