Open Source Framework Power
Few technological innovations have contributed as much to unlocking the power of big data as open-source frameworks.
Since its 1.0 release in 2011, Apache Hadoop has gained rapid adoption as a way of performing operations on data in a distributed environment. It has since been joined by a variety of open-source frameworks – including the well-known Apache Spark and Apache Kafka – that have collectively transformed the way data is streamed, stored, and accessed.
In contrast to traditional relational databases and computing methodologies, big data frameworks are engineered for massive scale and data complexity. And for enterprises looking to take advantage of their data in motion, the appeal of open-source frameworks is clear: they help companies avoid the risks of vendor lock-in while benefiting from continual development and innovation – all with a much lower price tag (at least on the surface).
But less well-understood are the common challenges that technical leaders face in implementing open-source frameworks to solve their data ingestion challenges. In fact, many leaders report that between cost and time overruns on initial configuration and ongoing maintenance needs, open-source implementations cost organizations orders of magnitude more than originally scoped.
So what are some of the most commonly overlooked challenges of working with open-source frameworks?
Usability/Ease of Deployment
- Why it Matters: Enterprises need the ability to quickly configure their data pipelines with any number of sources and targets in order to support business users in making decisions based on data in motion.
- Open source challenges: Working directly with open-source frameworks like Apache Spark and Apache Kafka requires specialized, in-depth coding knowledge of the relevant languages (Java, Scala) and of the frameworks themselves. For example, engineering teams might encounter issues when bundling dependencies – leading to a Spark application that works in standalone mode but throws exceptions when run in cluster mode. It can be time-consuming for engineering teams to gain the experience required to deploy these frameworks in production.
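One common source of the standalone-versus-cluster discrepancy is bundling Spark itself into the application JAR, where it collides with the version the cluster already provides. A minimal `build.sbt` sketch using the sbt-assembly plugin illustrates the usual fix (versions and library names here are illustrative assumptions, not from the original text):

```scala
// build.sbt -- illustrative sketch for packaging a Spark app with sbt-assembly
ThisBuild / scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // "provided": the cluster supplies Spark at runtime, so don't bundle it
  "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided",
  // application-only dependencies DO get bundled into the fat JAR
  "com.typesafe" % "config" % "1.4.3"
)

// Resolve duplicate files that would otherwise break fat-JAR assembly
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}
```

Marking Spark dependencies as `provided` keeps the application JAR small and avoids shipping a second copy of Spark that can conflict with the cluster's own classpath.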
Performance
- Why it Matters: Business users need access to data as soon as it is created.
- Open source challenges: Open-source frameworks tend to have many configuration options, and often require tuning to achieve optimal performance. For example, Apache Kafka is optimized for small messages. According to benchmarks, the best performance occurs with 1 KB messages. Larger messages (for example, 10 MB to 100 MB) can decrease throughput and significantly impact operations. Achieving high throughput and low latency requires in-depth knowledge of the behavior of partitions and memory usage, large message handling, and more.
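As a sketch of what that tuning involves, the Kafka producer settings below show typical knobs for throughput with small messages (the specific values are illustrative starting points, not recommendations from the original text):

```properties
# producer.properties -- illustrative throughput-oriented settings
# Batch many small records into a single request
batch.size=65536
# Wait briefly so batches can fill before sending
linger.ms=10
# Compress batches to reduce bytes on the wire and on disk
compression.type=lz4
# Raise only if larger messages are unavoidable (broker-side limits must match)
max.request.size=10485760
# Memory available for buffering records while batches accumulate
buffer.memory=67108864
```

Each of these interacts with broker- and topic-level settings, which is exactly why tuning Kafka well requires understanding the whole pipeline rather than adjusting one option in isolation.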
Maintenance and Continuity
- Why It Matters: Organizations must be able to ensure that new technology releases won't interfere with critical workflows.
- Open source challenges: Open-source frameworks tend to have frequent releases, which can entail significant changes "under the hood." (E.g., Apache Spark followed a roughly three-month release cycle for 1.x releases and a three-to-four-month cycle for 2.x releases.) New innovations can be valuable for organizations, but can also create significant problems if they're not anticipated, particularly changes that break existing APIs. This problem is especially difficult in cases where documentation is incomplete or inadequate, and distributed systems are notorious for their subtle corner cases and for problems that are hard to track down and reproduce.
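One simple defense against surprise upgrades is pinning exact framework versions in the build so that releases are adopted deliberately rather than implicitly. A minimal sketch (version numbers are illustrative):

```scala
// build.sbt -- pin the framework version explicitly; upgrade only after testing
val sparkVersion = "3.5.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)
```

Pinning doesn't remove the maintenance burden – security fixes and API migrations still have to be scheduled – but it turns upgrades into a planned event instead of a surprise.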
Open-source frameworks offer the raw power and potential to transform how organizations use data in motion. But too often, open-source implementations fail or run significantly over budget because organizations overlook the complexity involved in their installation and maintenance. To deliver their full value, these frameworks require technology that can make them easy to configure for business users, maximize their performance, and ensure that they're being utilized in the most stable and scalable manner possible.