“We started wondering, can we offload all of the management of Kafka—and still get all of the benefits of Kafka? That’s when Confluent came into the picture.”
Mahendra Kumar
VP of Data and Software Engineering, BigCommerce
When it comes to smooth e-commerce in the modern era, BigCommerce provides the technology behind the technology: the platform that serves as the backbone of buying for tens of thousands of merchants, including Solo Stove, Skullcandy, Yeti, and many other popular brands.
For a company that supports merchants with widely varying levels of demand in the digital age, timeliness and accuracy of data are crucial. Inventory has to be accurate to the millisecond, and when data is delayed by a day, or even a minute, the result is lost revenue for merchants and damage to the organization’s reputation.
That’s only one of the reasons BigCommerce pursued Apache Kafka for its data streaming use case. The modern, open SaaS e-commerce platform works with tens of thousands of merchants who serve millions of customers around the world. Its digital commerce solutions are available 24/7, all year long, for B2B, B2C, multi-storefront, international, and omnichannel customers.
As Mahendra Kumar, VP of data and software engineering at BigCommerce, puts it, “Real-time data is critical for current business. The value of data drops almost exponentially as time goes by. The sooner you can act on data, the more valuable it is.”
Recognizing the need for real-time data while understanding the burden of self-managing Kafka on their own led BigCommerce to choose Confluent—allowing them to tap into data streaming without having to manage and maintain the data infrastructure.
“We started wondering, can we offload all of the management of Kafka—and still get all of the benefits of Kafka? That’s when Confluent came into the picture.” ~ Mahendra Kumar, VP of Data and Software Engineering at BigCommerce
The challenges of self-managed Kafka when instant access to data insights is critical to business
Before Confluent, BigCommerce was managing its own Kafka cluster, which kept growing and demanded more and more maintenance. BigCommerce started out with six broker nodes, but as traffic and use cases increased, the team ended up managing 22 broker nodes and five ZooKeeper nodes. Keeping that infrastructure maintained and running took more than 20 hours a week, about half a full-time engineer’s time.
They chose Kafka because it is a robust, resilient system designed for high throughput, which would allow the company’s data teams to retain data for a period of time and to rewind offsets and reprocess data as needed. However, open source (OSS) Kafka presented significant limitations. “It was fine initially,” says Kumar, “but as the volume of data started getting bigger and bigger, we got to a point where managing Kafka became almost like a full-time job.”
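For readers less familiar with that capability, the sketch below shows how a plain Java Kafka consumer can rewind its offsets and replay retained events. It is a minimal illustration under assumed names; the topic (“orders”), the consumer group, and the 24-hour lookback are hypothetical placeholders, not BigCommerce’s actual configuration.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class ReprocessOrders {
    public static void main(String[] args) {
        // Placeholder connection settings; swap in real bootstrap servers and serializers.
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-reprocessor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign all partitions of the hypothetical "orders" topic manually
            // so this consumer controls its own offsets.
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo info : consumer.partitionsFor("orders")) {
                partitions.add(new TopicPartition("orders", info.partition()));
            }
            consumer.assign(partitions);

            // Find the offset in each partition that corresponds to 24 hours ago...
            long since = System.currentTimeMillis() - Duration.ofHours(24).toMillis();
            Map<TopicPartition, Long> timestamps = new HashMap<>();
            partitions.forEach(tp -> timestamps.put(tp, since));
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(timestamps);

            // ...and rewind to it, so retained events are read again from that point.
            offsets.forEach((tp, offset) -> {
                if (offset != null) {
                    consumer.seek(tp, offset.offset());
                }
            });

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            records.forEach(r ->
                    System.out.printf("reprocessing offset %d: %s%n", r.offset(), r.value()));
        }
    }
}
```

Because the broker retains events for the configured retention period, reprocessing is just a matter of moving the consumer’s position; nothing has to be re-ingested from the source systems.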
Rather than being able to focus on delivering new services that improve user experiences, team members were bogged down by software patches, blind spots in data-related infrastructure, and system updates.
Challenge #1: Maintaining and managing Kafka
“We had to constantly patch software, and from time to time we would see traffic surging and have to constantly manage the scaling of resources. It was a painful journey.” ~ Mahendra Kumar, VP of Data and Software Engineering at BigCommerce
Having the latest and greatest improvements to Kafka is important, and BigCommerce wanted to get the most out of the infrastructure that was running Kafka. However, managing upgrade priorities had to be balanced with other roadmap features that the team needed to deliver. “If we have to spend time on Kafka upgrades and patching, then it will take time away from our other roadmap priorities,” said Kumar.
Once the company got up to 22 Kafka broker nodes, management became unwieldy. The engineering team consisted of nine people responsible for the data platform, data applications, data APIs, analytics, and data infrastructure, yet three of those engineers were spending half their time on Kafka-related maintenance and upgrades.
Before big shopping events like CyberWeek, the entire team needed a month to get the technology ready to scale and had to overprovision their clusters by 10-15% to account for the uncertain traffic volume. This exercise was expensive and time-consuming, and valuable resources were being squandered because of the unknowns.
Challenge #2: Inability to tap into real-time data creates burden for customers
“The next evolution of our use case in Kafka: taking each critical type of e-commerce event and building a data model which could load data in real time using a stream processor.” ~ Mahendra Kumar, VP of Data and Software Engineering at BigCommerce
Kafka was originally introduced to help BigCommerce capture critical retail events such as orders and cart updates, which were stored in a NoSQL database as they came in. But with that model, the company was not able to tap into streaming data, which meant it couldn’t provide real-time analysis for merchants who wanted to extract quick insights from their data.
From time to time, the system would go down, causing a backlog of events to build up and degrading the merchant and end-user experience. For analytics and insights, BigCommerce relied on a batch-based ETL system of more than 30 MapReduce jobs, which meant merchants waited eight hours for their analytics reports. On occasion, jobs would fail and need manual intervention, causing even further delays.
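To make the “next evolution” Kumar describes more concrete, here is a minimal Kafka Streams sketch of that pattern: consuming one type of e-commerce event and maintaining a continuously updated model from it rather than waiting on batch jobs. The topic names and the per-merchant order count are illustrative stand-ins, not BigCommerce’s actual data model.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OrderEventModel {
    public static void main(String[] args) {
        // Placeholder configuration; application ID and bootstrap servers are illustrative.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-event-model");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Each record on the hypothetical "order-events" topic is an order event
        // keyed by merchant ID, with the raw event payload as the value.
        KStream<String, String> orders = builder.stream("order-events");

        // Maintain a continuously updated order count per merchant, instead of
        // waiting for a nightly batch job to compute it.
        KTable<String, Long> ordersPerMerchant = orders
                .groupByKey()
                .count();

        // Publish every update downstream so dashboards and reports see it in real time.
        ordersPerMerchant.toStream()
                .to("orders-per-merchant-updates", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The point of the pattern is that the model is updated as each event arrives, so an eight-hour batch window is no longer the floor on how fresh merchant analytics can be.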
