Federico Ramallo
Aug 2, 2024
Shepherd: Stripe's Next-Generation ML Feature Engineering Platform
Stripe has made significant strides in developing and deploying machine learning (ML) models, most recently through Shepherd, its next-generation ML feature engineering platform. The company relies on ML across its operations, from backend payment processing to user-facing products. Feature engineering, the process of defining the inputs (features) that ML models consume, is complex at Stripe's scale because of the sheer volume of raw data involved.
To streamline this process, Stripe partnered with Airbnb in 2022 to adapt Airbnb's feature platform, Chronon, as the foundation of Shepherd. The collaboration focused on extending Chronon to meet Stripe's requirements around scale, latency, and feature freshness. Shepherd has since been used to build a new fraud detection model with over 200 features that significantly outperforms previous models at blocking fraudulent transactions.
Feature engineering at Stripe involves balancing two metrics: latency and freshness. Latency is the time required to retrieve feature values during model inference, which directly affects payment processing speed and customer experience. Feature freshness is the time it takes for new events to be reflected in feature values, which is crucial for adapting to rapidly changing fraud patterns. Stripe's strict targets for low latency and near-real-time freshness pose unique challenges.
Chronon was chosen for its intuitive Python- and SQL-based API and its support for both online and offline computation. However, its offline, online, and streaming components all had to be adapted and scaled to handle Stripe's data volumes.
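To give a sense of that API style, the sketch below shows a windowed aggregation defined with Chronon's Python API. It is an illustrative example, not one of Stripe's feature definitions: the table, column, and key names (`data.charges`, `amount`, `card_id`) are assumptions, and exact module paths and parameter names may differ across Chronon versions.

```python
# Illustrative Chronon feature definition (not a Stripe feature); table and
# column names are hypothetical, and module paths may vary by Chronon version.
from ai.chronon.api.ttypes import Source, EventSource
from ai.chronon.query import Query, select
from ai.chronon.group_by import GroupBy, Aggregation, Operation, Window, TimeUnit

# Raw event source: a table of charge events with an event-time column.
charges = Source(
    events=EventSource(
        table="data.charges",  # hypothetical table name
        query=Query(
            selects=select("card_id", "amount"),
            time_column="created_at",
        ),
    )
)

# GroupBy: per-card sum and count of charge amounts over sliding windows.
charge_aggregates_v1 = GroupBy(
    sources=[charges],
    keys=["card_id"],
    aggregations=[
        Aggregation(
            input_column="amount",
            operation=Operation.SUM,
            windows=[Window(length=1, timeUnit=TimeUnit.DAYS),
                     Window(length=7, timeUnit=TimeUnit.DAYS)],
        ),
        Aggregation(
            input_column="amount",
            operation=Operation.COUNT,
            windows=[Window(length=1, timeUnit=TimeUnit.DAYS)],
        ),
    ],
)
```

A single definition like this can drive both offline backfills and online serving, which is what makes consistent online/offline feature values possible.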
To efficiently scale their key-value (KV) store, Stripe implemented a dual system: a lower-cost store optimized for bulk uploads and a higher-cost distributed memcache-based store for frequent reads and writes. This dual KV store implementation helped lower the cost of storing and serving data while meeting latency and feature freshness requirements.
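The post does not describe the exact storage interface, but the idea can be sketched as a write path that routes batch snapshots and streaming updates to different stores, with reads consulting both. Everything below (class and method names) is a hypothetical illustration, not Stripe's or Chronon's actual storage layer.

```python
# Hypothetical sketch of a dual KV store; names are illustrative only.
from typing import Optional, Protocol


class KVStore(Protocol):
    def get(self, key: bytes) -> Optional[bytes]: ...
    def put(self, key: bytes, value: bytes) -> None: ...


class DualKVStore:
    """Routes traffic across a low-cost bulk store and a fast online store.

    - bulk_store:   optimized for cheap, large batch uploads (e.g. daily snapshots).
    - online_store: memcache-style store for frequent low-latency reads and writes.
    """

    def __init__(self, bulk_store: KVStore, online_store: KVStore) -> None:
        self.bulk_store = bulk_store
        self.online_store = online_store

    def write_streaming(self, key: bytes, value: bytes) -> None:
        # Fresh, frequently updated values go to the fast store.
        self.online_store.put(key, value)

    def write_bulk(self, key: bytes, value: bytes) -> None:
        # Batch-computed snapshots go to the cheaper store.
        self.bulk_store.put(key, value)

    def get(self, key: bytes) -> tuple[Optional[bytes], Optional[bytes]]:
        # Feature computation typically needs both: the batch snapshot plus
        # any streaming updates written since the snapshot was uploaded.
        return self.bulk_store.get(key), self.online_store.get(key)
```

Splitting storage this way keeps the bulk of the data on cheap storage while reserving the expensive, low-latency store for the values that actually change frequently.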
For streaming jobs, Stripe chose Flink as the streaming platform due to its low-latency stateful processing capabilities. This allowed Stripe to implement a scalable write pattern that could achieve low latency updates. By integrating Flink with Chronon, Stripe managed to achieve a p99 feature freshness of 150 milliseconds.
To further reduce latency, Stripe introduced the concept of "tiling," which involves maintaining the state of preaggregated feature values in the Flink application and periodically flushing these values to the KV store. This method significantly decreases latency by reducing the amount of data retrieved and aggregated during feature computation.
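As a rough illustration of the tiling idea (not Stripe's Flink code), the sketch below keeps a pre-aggregated tile per key in memory, updates it as events arrive, and periodically flushes tiles to the KV store; at read time, only a handful of tiles need to be fetched and merged rather than every raw event. All names are hypothetical, and a plain-Python loop stands in for the stateful Flink operator described above.

```python
# Conceptual sketch of tiling: pre-aggregate events into per-key tiles and
# flush them to the KV store periodically. Names are hypothetical.
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Tile:
    """Pre-aggregated state for one key within one time slice."""
    count: int = 0
    total: float = 0.0

    def add(self, amount: float) -> None:
        self.count += 1
        self.total += amount


class TilingAggregator:
    def __init__(self, kv_store, flush_interval_s: float = 1.0) -> None:
        self.kv_store = kv_store
        self.flush_interval_s = flush_interval_s
        self.tiles: dict[str, Tile] = defaultdict(Tile)
        self._last_flush = time.monotonic()

    def on_event(self, key: str, amount: float) -> None:
        # Update the in-memory tile instead of writing every raw event.
        self.tiles[key].add(amount)
        if time.monotonic() - self._last_flush >= self.flush_interval_s:
            self.flush()

    def flush(self) -> None:
        # Flush pre-aggregated tiles; readers merge a few tiles instead of
        # re-aggregating every raw event at feature-computation time.
        for key, tile in self.tiles.items():
            self.kv_store.put(f"tile:{key}".encode(),
                              f"{tile.count},{tile.total}".encode())
        self.tiles.clear()
        self._last_flush = time.monotonic()
```

The payoff is at read time: the serving path aggregates a few pre-computed tiles per key rather than scanning and aggregating the full event history.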
In the offline context, Chronon generates training data for models and batch-only use cases. Stripe verified Chronon’s scalability with benchmarks and integrated its offline jobs with Stripe’s data orchestration system. They built a custom integration for scheduling and running jobs with their highly customized Airflow setup, providing flexibility for defining features using a variety of data sources and consuming features in downstream batch jobs.
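The post does not detail the orchestration integration, but a minimal sketch of the pattern, scheduling a daily offline feature job from Airflow, might look like the following. The DAG id, command, and config path are hypothetical, and Stripe's actual setup is a custom integration with their highly customized Airflow deployment rather than a stock BashOperator.

```python
# Hypothetical Airflow DAG that schedules a daily offline feature backfill.
# The command, config path, and DAG id are illustrative only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="chronon_offline_features_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    # Runs a (hypothetical) Chronon offline job that materializes feature
    # values for the previous day's partition.
    backfill_features = BashOperator(
        task_id="backfill_charge_aggregates",
        bash_command=(
            "chronon-run --mode backfill "  # hypothetical CLI wrapper
            "--conf features/charge_aggregates_v1.py "
            "--ds {{ ds }}"
        ),
    )
```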
Shepherd’s first major use case was developing an updated ML model for detecting SEPA fraud in partnership with Stripe’s Local Payment Methods (LPM) team. The new SEPA fraud model, consisting of over 200 features, was developed entirely on Shepherd and successfully blocks tens of millions of dollars of additional fraud annually.
Stripe’s contribution to the Chronon community includes generalizing their Stripe-specific implementations and optimizations for broader use. As co-maintainers of Chronon with Airbnb, Stripe is committed to expanding the project’s capabilities and supporting its open-source community.
Overall, Stripe’s development of Shepherd and its integration with Chronon illustrate the company’s commitment to advancing ML feature engineering. By overcoming challenges related to scale, latency, and feature freshness, Stripe has created a robust platform that significantly enhances their fraud detection capabilities and overall operational efficiency.
Guadalajara
Werkshop - Av. Acueducto 6050, Lomas del bosque, Plaza Acueducto. 45116,
Zapopan, Jalisco. México.
Texas
5700 Granite Parkway, Suite 200, Plano, Texas 75024.
© Density Labs. All rights reserved. Privacy policy and Terms of Use.