It’s a nightmare scenario familiar to any CIO or analytics leader: according to a McKinsey study, the average large IT project runs 45 percent over budget and 7 percent over schedule while delivering 56 percent less value than predicted.
For enterprise big data initiatives involving ETL – still the most common approach to data ingestion – the leading cause of project failure is a lack of upfront visibility into the true long-term cost of ownership and operation. What are some of the hidden costs of ETL?
- Maintenance: In addition to the upfront cost of building an ETL pipeline or purchasing an ETL tool, enterprises must factor in the ongoing expenses associated with administration and maintenance – a cost that in many cases dwarfs the initial investment. Common sources of maintenance work include the time cost of adding new data sources and repairing broken data connections. For example, a retail bank that wants to bring user engagement data from its mobile app into a data warehouse will have to architect, test, and productionize new connections to the source data – a process that can take months and demand multiple engineering FTEs.
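To make the maintenance burden concrete, the sketch below shows just the validation step of a hypothetical connector for mobile-app engagement events (the field names and `EngagementEvent` type are illustrative, not from any real system). Even this small slice of a connector must be kept in sync with source-side schema changes – multiply it across extraction, transformation, and loading, and the months-long timeline becomes easier to see.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical schema for mobile-app engagement events; the field
# names are illustrative, not taken from any real source system.
REQUIRED_FIELDS = {"user_id", "event_type", "timestamp"}

@dataclass
class EngagementEvent:
    user_id: str
    event_type: str
    timestamp: str

def extract(raw_records: list[dict[str, Any]]) -> list[EngagementEvent]:
    """Validate raw source records and reject anything malformed.

    In a production connector, this step alone needs tests and updates
    for every source-side schema change -- one reason new connections
    take so long to build and keep working.
    """
    events = []
    for record in raw_records:
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"record missing fields: {sorted(missing)}")
        events.append(EngagementEvent(
            user_id=str(record["user_id"]),
            event_type=str(record["event_type"]),
            timestamp=str(record["timestamp"]),
        ))
    return events
```

Every new source repeats this pattern with its own schema, quirks, and failure modes, which is where the ongoing engineering cost accumulates.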
- Training and onboarding: Any software deployment requires knowledgeable users, but the learning curve is especially steep for complex technical solutions that cannot be operated in a self-serve fashion by business users. In-house ETL solutions, in particular, require the engineering team to create and maintain documentation. Key risks of inadequate user training include wasted time and missed opportunities. For example, a healthcare company looking to use digitized health information for predictive diagnostics will need to ensure that key ETL personnel are aware of business needs, deeply familiar with data sources and formats, and conversant with the existing ETL architecture.
- Changing business needs: An enterprise looking to harness the power of big data to drive business insights will see its reporting and analytics needs evolve over time. Key business requirements likely to change include the addition of new data elements and derivatives – and an ETL solution optimized for today’s needs won’t necessarily meet the critical business requirements of a future state. For example, a media company looking to optimize ad spend across different marketing channels might shift from reporting on direct response metrics (e.g., ad click-through rates) to prioritizing signs of upper-funnel engagement (e.g., social media and app usage). Even if these data elements exist in the company’s data warehouse, cleaning, transforming, and piping them to an analytics environment may require significant ETL development time.
- Evolving data landscape: The explosion of data, particularly machine data, means that business leaders are looking to ingest greater data volumes than ever before – and to capture critical business advantage by transforming this data into real-time insights. Changing data formats (e.g., a move to unstructured and semi-structured data) create further challenges. An ETL process optimized for today’s data volume and velocity is unlikely to scale effectively as needs evolve. For example, an industrial manufacturer whose ETL process is optimized for daily batch reporting is likely to encounter performance issues and critical pipeline failure as it tries to leverage sensor data to drive predictive maintenance in real time.
- Source impact: As enterprises work to meet business owners’ demand for real-time, data-driven decisions, traditional ETL processes are likely not only to encounter performance challenges, but also to place increasing strain on other systems and applications. For example, a retailer looking to optimize the website experience in real time based on customer behavior will likely see a negative impact on its production systems (including its eCommerce database, CRM, and POS) as ETL routines extract data from those sources more frequently.
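One common way to reduce that strain is incremental extraction: instead of re-scanning entire source tables on each cycle, the pipeline tracks a watermark (e.g., an `updated_at` column) and pulls only rows changed since the last run. The sketch below is a minimal illustration under that assumption; `fetch_since` stands in for whatever change-query the real source supports, and is not a real library API.

```python
from datetime import datetime, timezone
from typing import Callable

class IncrementalExtractor:
    """Pulls only rows changed since the last run, using an `updated_at`
    watermark, rather than re-scanning whole source tables each cycle.

    `fetch_since` is a placeholder for the source's change query --
    e.g., SELECT ... WHERE updated_at > :watermark on a database.
    """
    def __init__(self, fetch_since: Callable[[datetime], list[dict]]) -> None:
        self.fetch_since = fetch_since
        # Start from the epoch floor so the first run is a full extract.
        self.watermark = datetime.min.replace(tzinfo=timezone.utc)

    def run(self) -> list[dict]:
        rows = self.fetch_since(self.watermark)
        if rows:
            # Advance the watermark so the next run skips unchanged rows.
            self.watermark = max(r["updated_at"] for r in rows)
        return rows
```

Because each run touches only changed rows, the load placed on the eCommerce database, CRM, or POS stays roughly proportional to the rate of change rather than to total table size – though retrofitting this pattern onto a full-reload pipeline is itself nontrivial ETL work.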
Ultimately, the success of any big data project in the enterprise requires an accurate upfront assessment of the true long-term cost of technology ownership. Legacy ETL, long the default for data ingestion, represents a particular source of potential pitfalls that technical leaders should be mindful of as they build their technology roadmap to meet critical business needs.