How DoorDash Cut Feature Store Expenses by 75% with CockroachDB
Chapter 1: Introduction to DoorDash's Challenge
Recently, I came across an insightful article on how DoorDash cut its feature store costs by adopting CockroachDB. DoorDash, a leading food delivery service, had to manage a rapidly growing machine-learning infrastructure. The company found that pairing Redis with a second database, rather than relying on Redis alone, lowered costs and eased operations.
Section 1.1: The Initial Approach with Redis
Initially, DoorDash relied on Redis for its online machine-learning storage. However, as the demand for machine learning features escalated, it became evident that Redis was not a sustainable solution due to its high costs and maintenance challenges.
Subsection 1.1.1: Transitioning to CockroachDB
To address these issues, DoorDash chose to run CockroachDB alongside Redis. This move cut cloud spending per stored value by 75% on average, with only a minimal increase in latency.
Section 1.2: Overcoming Maintenance Challenges
DoorDash faced significant maintenance overheads when managing large Redis clusters, especially those exceeding 100 nodes. Scaling up via AWS ElastiCache led to excessive CPU usage, increased latencies, and unpredictable run times. To mitigate this, DoorDash developed a method to upscale Redis clusters without downtime, which involved creating a new cluster from the latest backup, replaying daily writes, redirecting traffic, and then decommissioning the old cluster.
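The four steps of that upscaling method can be sketched in a few lines. This is only a toy model under stated assumptions: the `Cluster` class, the `upscale` function, and the key names are hypothetical, plain dictionaries stand in for Redis, and the catch-up check before cutover is an added safety step that the source does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy in-memory stand-in for a Redis cluster (hypothetical)."""
    name: str
    nodes: int
    data: dict = field(default_factory=dict)


def upscale(old, backup, writes_since_backup, new_nodes):
    """Build a larger cluster offline and return it as the new traffic target."""
    # 1. Create the new cluster from the latest backup.
    new = Cluster(name=old.name + "-v2", nodes=new_nodes, data=dict(backup))
    # 2. Replay the writes made after the backup, in their original order.
    for key, value in writes_since_backup:
        new.data[key] = value
    # 3. Assumed safety check: cut over only once the new cluster has caught up.
    if new.data != old.data:
        raise RuntimeError("new cluster is not caught up; keep serving from old")
    # 4. The caller redirects traffic to `new` and then decommissions `old`.
    return new


backup = {"user:1:avg_order": "23.5"}
writes = [("user:2:avg_order", "11.0"), ("user:1:avg_order", "24.1")]

# The old cluster kept serving traffic, so it already holds the replayed writes.
old = Cluster("features", nodes=100, data=dict(backup))
for k, v in writes:
    old.data[k] = v

active = upscale(old, backup, writes, new_nodes=150)
print(active.name, active.nodes, len(active.data))
```

The point of the design is that the old cluster is never resized in place: it serves traffic untouched until the replacement is ready, which avoids the CPU and latency spikes described above.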
Chapter 2: The Advantages of CockroachDB
Despite CockroachDB presenting slightly higher latencies for certain read/write operations, DoorDash recognized its potential as a viable alternative for applications that do not demand ultra-low latency and high throughput.
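One way to picture this split is a routing rule that sends each feature to a store based on how strict its latency requirement is. The sketch below is purely illustrative: the `FeatureSpec` type, the `p99_budget_ms` field, the 5 ms cutoff, and the feature names are all assumptions, not DoorDash's actual criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical description of a feature and its read-latency budget."""
    name: str
    p99_budget_ms: float


# Assumed cutoff: features that need faster reads than this stay on Redis.
LOW_LATENCY_CUTOFF_MS = 5.0


def choose_store(spec):
    """Return the name of the store that should hold this feature."""
    if spec.p99_budget_ms < LOW_LATENCY_CUTOFF_MS:
        return "redis"
    return "cockroachdb"


specs = [
    FeatureSpec("realtime_eta_signal", p99_budget_ms=2.0),
    FeatureSpec("weekly_order_count", p99_budget_ms=50.0),
]
placement = {spec.name: choose_store(spec) for spec in specs}
print(placement)
```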
The video titled "How DoorDash manages 1.9 PB of data & 1.2M QPS on CockroachDB" delves deeper into how DoorDash efficiently handles massive amounts of data and requests using CockroachDB's unique architecture.
Section 2.1: Unique Features of CockroachDB
CockroachDB's architecture is what sets it apart from traditional databases. At the top sits a PostgreSQL-compatible SQL layer that operates across multiple availability zones. Beneath this SQL layer lies a strongly consistent distributed key-value store. Unlike Cassandra, which places keys on a hash ring, CockroachDB keeps keys sorted and divides them into ordered segments known as "ranges." Each range holds the primary keys that fall within a defined interval. As a range grows, it splits automatically, and the resulting ranges can be spread across nodes, which keeps data balanced and helps absorb traffic spikes.
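The idea of ordered ranges that split as they grow can be illustrated with a small in-memory model. This is a conceptual sketch, not CockroachDB's implementation: the `RangeMap` class and its methods are hypothetical, and it splits on a key count, whereas a real system would split on data size and other signals.

```python
import bisect


class RangeMap:
    """Sorted keys partitioned into contiguous ranges that split when too large."""

    def __init__(self, max_keys_per_range=4):
        self.max_keys = max_keys_per_range
        self.ranges = [[]]  # ordered list of ranges; each range is a sorted key list

    def _find_range(self, key):
        # The first key of every range after the first acts as its start boundary.
        starts = [r[0] for r in self.ranges[1:]]
        return bisect.bisect_right(starts, key)

    def insert(self, key):
        i = self._find_range(key)
        r = self.ranges[i]
        pos = bisect.bisect_left(r, key)
        if pos < len(r) and r[pos] == key:
            return  # key already present
        r.insert(pos, key)
        if len(r) > self.max_keys:
            # Split the oversized range at its midpoint into two ordered halves.
            mid = len(r) // 2
            self.ranges[i:i + 1] = [r[:mid], r[mid:]]

    def scan(self, start, end):
        """Return all keys in [start, end) in sorted order."""
        return [k for r in self.ranges for k in r if start <= k < end]


rm = RangeMap(max_keys_per_range=4)
keys = [f"user:{n:02d}" for n in range(1, 11)]
for k in keys:
    rm.insert(k)

for r in rm.ranges:
    print(r[0], "..", r[-1], f"({len(r)} keys)")
```

Because neighbouring keys stay in the same or adjacent ranges, a scan over a key interval touches only a few ranges, which a hash-based placement cannot offer.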
Section 2.2: Design Optimizations in CockroachDB
DoorDash also faced challenges during the initial design phase. The original feature store design utilized the entity key and feature name as the primary key, aligning with its upload service. However, DoorDash identified the opportunity to enhance efficiency by denormalizing data into a single table with a compound primary key.
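The article does not give the actual schema, so the following is only a sketch of what a single table with a compound primary key can look like. SQLite (from Python's standard library) stands in for CockroachDB's SQL layer here, and the table name, column names, and sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table for all features; the primary key combines entity and feature name.
conn.execute("""
    CREATE TABLE feature_values (
        entity_key   TEXT NOT NULL,
        feature_name TEXT NOT NULL,
        value        TEXT NOT NULL,
        PRIMARY KEY (entity_key, feature_name)
    )
""")

rows = [
    ("consumer:42", "avg_order_value", "23.5"),
    ("consumer:42", "orders_last_7d", "3"),
    ("consumer:77", "avg_order_value", "11.0"),
]
conn.executemany("INSERT INTO feature_values VALUES (?, ?, ?)", rows)

# Writing the same (entity_key, feature_name) again replaces the old value.
conn.execute(
    "INSERT OR REPLACE INTO feature_values VALUES (?, ?, ?)",
    ("consumer:42", "orders_last_7d", "4"),
)


def features_for(entity_key):
    """Fetch every feature of one entity with a single primary-key prefix lookup."""
    cur = conn.execute(
        "SELECT feature_name, value FROM feature_values "
        "WHERE entity_key = ? ORDER BY feature_name",
        (entity_key,),
    )
    return dict(cur.fetchall())


print(features_for("consumer:42"))
```

With the entity key as the leading column, all features of one entity sit next to each other in key order, so a store that keeps keys sorted can read them together.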
In summary, DoorDash's experience illustrates the benefits of employing a diverse database strategy. Although Redis proved to be a maintenance and cost burden for online machine-learning storage, CockroachDB emerged as a strong contender for scenarios that do not require the utmost performance. Its distinctive storage architecture and operational capabilities facilitated zero downtime during upgrades and scaling, paving the way for a more efficient system overall.