Apr 25, 2023 | 12 minute read

Performance Engineering at Elastic Path

Disclaimer: This post was originally published on July 25, 2022 and was updated on April 25, 2023 for accuracy.

At Elastic Path, we take the performance and scalability of our products seriously. Our passion for performance has enabled us to support e-commerce applications for some of the world's best-known companies, while pushing our products to the very limit of what our cloud hosting providers can handle. In this blog post, I'll outline the principles that form the foundation of our performance engineering practice and introduce you to our team of performance experts. We'll begin with the history of Performance Engineering at Elastic Path and a brief story.

Our History

Performance engineering is not a new phenomenon at Elastic Path. We've had a dedicated performance engineering team for well over a decade. Our commitment to performance has allowed us to attract and support large enterprise customers such as Intuit, T-Mobile, and Swisscom. During peak loads, these customers process hundreds of thousands of orders an hour through our Elastic Path Commerce platform. We've built and tested the platform to support hundreds of orders per second, and we've been there in the trenches with our customers every step of the way: participating in support calls, helping design and run tests, and delivering hotfixes on tight timelines.

Not only have we been in the trenches with our customers, but we've also been our own customer. Our current Senior Director of Cloud Operations, Security, and Performance was a performance engineer back in 2010 when Elastic Path hosted the 2010 Olympic Winter Games merchandising store. While traffic and sales were brisk leading up to and during the Winter Games, nothing prepared us for the day Oprah Winfrey gifted each member of her audience a pair of the red Olympic Winter Games mittens. As that episode aired, sales soared. The hard work we had put into performance testing our solution paid off, as it withstood the load. We’ve experienced an Oprah sales surge, so it means something when we tell you that our products have been refined to handle the most stringent performance and scalability requirements.

Throughout much of our history, our focus has been on the Elastic Path Commerce platform. However, over the last three years, we've also brought our performance expertise to bear on our multi-tenant SaaS offering, Elastic Path Commerce Cloud. The product came to us through our acquisition of Moltin, and we began putting it through its paces even before the deal closed. Performance testing and evaluation were key aspects of our acquisition-related due diligence process. Since then, we've spent hundreds of hours subjecting the Elastic Path Commerce Cloud platform to performance tests, documenting limitations, and addressing bottlenecks to enhance its performance and scalability. We've logged bugs, added autoscaling, tuned and rewritten services, and added new services, all culminating in a tenfold increase in the scalability of Elastic Path Commerce Cloud so far.

With all that said, we are just getting started. One thing that's true about our performance team is that we are never satisfied with our performance and scalability. We are always looking for ways to make our APIs faster, support larger datasets, scale more quickly, and accommodate ever-increasing volumes of traffic and orders. When performance is a first-class feature, there's no time to rest.

Our Team

Elastic Path has employed a seasoned team of full-time performance engineers for over a decade. In the four years since I joined, I've seen us push our systems harder and farther than at any other place I've worked in my 20-year career, which includes employers such as Microsoft and Realtor.com. It's been a pleasant surprise to see just how close to the cutting edge of performance we operate.

For instance, a few years back, one of our performance engineers completely rewrote the data access layer of our Elastic Path Commerce platform to support read-write splitting for our databases. This was no easy task. To obtain optimum performance, we didn't use commercial libraries but instead created our own solution, applying special knowledge about our read-write patterns to avoid race conditions caused by replication lag. Testing this new High Data Scale (HDS) feature led us to break AWS: we hit the AWS API Gateway limit of 10,000 API calls per second while using five R5.24XL database instances with a combined total of nearly 500 cores and 3.75 TB of RAM.
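To make the replication-lag problem concrete, here is a minimal sketch of one common way to avoid stale reads after a write: keep reads for a recently written key on the primary for a short window. This is not the HDS implementation, which is not described in this post; the class name, the `pin_seconds` window, and the key format are all assumptions made for illustration.

```python
import time


class ReadWriteRouter:
    """Route reads to a replica unless the same key was written very recently.

    Hypothetical illustration only; pin_seconds is an assumed upper bound
    on replication lag, not a value taken from any real deployment.
    """

    def __init__(self, pin_seconds=2.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write = {}  # key -> time of the most recent write

    def record_write(self, key):
        """Remember when a key was written; writes always go to the primary."""
        self._last_write[key] = self.clock()
        return "primary"

    def route_read(self, key):
        """Return 'primary' while a replica might still be stale for this key."""
        written_at = self._last_write.get(key)
        if written_at is not None and self.clock() - written_at < self.pin_seconds:
            return "primary"
        return "replica"
```

With this approach, a shopper who has just updated a cart reads it back from the primary, while the much larger volume of catalog browsing can still be served by replicas. A production version would also need to expire old entries and share state across application nodes.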

We can push this hard because our hiring practices are rigorous. The recruitment of our most recent performance engineer involved six months of screening over 100 candidates through a six-step hiring process, with only about a dozen candidates advancing to step four, a two-hour interview with senior team members. Only one candidate successfully completed all six steps.

In short, Elastic Path has a robust Performance Engineering Team with staff who have dedicated the majority of their careers to working in performance. With over 75 years of combined experience, there is little this team hasn't seen or dealt with when it comes to making enterprise software perform and scale.

Our Principles

There are several key tenets that underpin our performance engineering efforts. These principles are embedded in our processes, and we're continually striving to refine and enhance them.

Test Early and Often

At many companies, performance testing is an afterthought. It's often conducted on a compressed timeframe a day or two before a critical project goes live, only to end in disaster as some architectural decision made months earlier prevents the solution from performing or scaling. This is not how we do things at Elastic Path.

We firmly believe that performance must be integrated from the start. For us, this begins as developers deploy new code into our test environments. Each deployed code change undergoes hundreds of end-to-end tests to validate functionality. Collaborating with the development teams, the performance team has transformed each of these end-to-end tests into performance tests by measuring and tracking response times. As such, the end-to-end tests provide comprehensive coverage of all our API calls and serve as an early warning system for any performance regressions that might have been introduced into the codebase.
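As a rough illustration of how a functional test can double as a performance test, the sketch below times a call and compares recent response times against a stored baseline. The function names and the 20% tolerance are assumptions for this example, not a description of our actual tooling.

```python
import statistics
import time


def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def is_regression(baseline_ms, current_ms, tolerance=0.20):
    """Flag a regression when the current median exceeds the baseline
    median by more than the given tolerance (20% by default, an assumed value)."""
    baseline = statistics.median(baseline_ms)
    current = statistics.median(current_ms)
    return current > baseline * (1.0 + tolerance)
```

Comparing medians rather than single samples keeps one slow request from raising a false alarm, at the cost of needing several runs per test before a regression shows up.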

While the end-to-end tests focus on early warning and broad coverage, we also subject every change to a series of load and stress tests in our staging environment. Managed by the performance team, these tests emphasize realism and are calibrated to mimic workloads seen in production. The staging environment mirrors our production cluster, ensuring an accurate representation of our product's performance in the real world. Moreover, the catalogs used for load testing are large and complex, much like our customers' catalogs. Long-running versions of these tests run overnight, while shorter versions are executed on every change to the staging environment, ensuring nothing falls through the cracks. Results are published to a company-wide Slack channel for maximum visibility and engagement, not only from our performance and development teams but also from our executives.

Lastly, in addition to development-focused performance tests and realistic staging-focused load tests, we also maintain a dedicated performance test lab. This environment is used for massive scale testing, system-wide config change performance testing, thorough testing of any new services, and any other type of exploratory performance testing. Testing in this environment is not designed to run on a schedule but rather is conducted as needed.

In summary, our multi-tenant solution, like all our e-commerce solutions, undergoes near-constant and thorough performance testing.

Test Realistic Workloads

When running tests to validate the performance of our e-commerce solution, it's critical to accurately simulate the API calls that our customers make against our software. Our performance team has over a dozen different workloads to choose from. Most of them have been created based on our expertise in how users utilize our platform and then cross-checked with production logs and analytics to validate the ratio and complexity of the API calls in the workload compared to what our systems experience in production.

The scripts encompass a variety of scenarios, adjusting key parameters such as conversion rate, the variety and complexity of products added to cart, whether or not promotions are added to cart, whether checkout is performed as guest or registered users, cache hit/miss ratios, and many other variations.

A simplified example of a typical low-conversion Business-to-Consumer (B2C) guest checkout workload looks something like this:

Browse – 85%
- Load Root Catalog
- Load Navigation Hierarchy
- Randomly Navigate Hierarchy
- Randomly view a Product Detail Page
Add to Cart – 10%
- Browse
- Load Promotions
- Retrieve Random Promotion
- Add to cart
- Add Promotion to Cart
- View Cart
Checkout – 5%
- Browse
- Add to cart
- Proceed to checkout
- Complete Purchase

This workload has a 5% conversion rate, with 10% of sessions ending in abandoned carts and 85% of sessions just browsing. This type of flow might be typical of e-commerce stores outside of sales events. During sales events, conversion rates often rise, especially during blowout sales such as Black Friday or Cyber Monday. We have workloads to test those scenarios as well.
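The mix in the example workload above can be expressed as weighted random selection over scenarios. The sketch below is a simplified, hypothetical driver: the step names are abbreviations of the list above, and a real load tool would execute API calls for each step rather than just pick scenario names.

```python
import random
from collections import Counter

# Session mix from the example workload: 85% browse, 10% add to cart, 5% checkout.
WORKLOAD_MIX = {"browse": 0.85, "add_to_cart": 0.10, "checkout": 0.05}

# Steps per scenario, abbreviated from the list above (names are illustrative).
SCENARIO_STEPS = {
    "browse": ["load_root_catalog", "load_navigation_hierarchy",
               "navigate_hierarchy", "view_product_detail_page"],
    "add_to_cart": ["browse", "load_promotions", "retrieve_random_promotion",
                    "add_to_cart", "add_promotion_to_cart", "view_cart"],
    "checkout": ["browse", "add_to_cart", "proceed_to_checkout",
                 "complete_purchase"],
}


def pick_scenarios(count, mix=WORKLOAD_MIX, seed=None):
    """Choose a scenario for each simulated session according to the mix weights."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=count)


def observed_mix(sessions, mix=WORKLOAD_MIX):
    """Return the fraction of sessions that landed in each scenario."""
    counts = Counter(sessions)
    return {name: counts[name] / len(sessions) for name in mix}
```

Calling `observed_mix(pick_scenarios(20000, seed=1))` should return fractions close to the configured weights. Changing the weights, for example raising the checkout share, is one way to model a sales-event workload.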

A workload such as the above is run across multiple stores for multiple clients simultaneously. We then layer this B2C shopper workload with an Admin workload simulating catalog editing and publishing operations.

Test Realistic Datasets

Much like our workloads, we strive to have our test datasets match the complexity of our customers' production deployments. One key component of this is data size. While others test on small sample catalogs, our catalogs vary in size from a few thousand products to over a million.

We are constantly pushing the limits of our stores. So far, we've tested performance and scalability in stores with up to:

1.3 million products
20,000 hierarchy containers
100,000 promotions
250 custom fields per product
30,000 accounts
250,000 registered users
60 million orders

In addition to testing large catalog sizes, we also ensure our test catalogs contain the complexity of customer catalogs. This includes varying factors such as node depth, node breadth, products per node, number of catalogs, number of pricebooks, number of attributes, number of variations, number of files, number of hierarchies, number of bundles, size of bundles, and many more.

That said, we are never truly satisfied with the size or complexity of our datasets, and we're always looking to test even larger sizes in increasingly complex ways that better mimic how our customers use our products.

Test for Reliability

What is performance without reliability? Reliability is revenue, and as such, it is critical to the functioning of any e-commerce business. As you can imagine, we take reliability very seriously at Elastic Path. In fact, our entire engineering process is built around ensuring our products are reliable.

It all starts with our approach to software development. Our software development processes are built on the principles of Agile Development, Continuous Delivery, and DevOps. These industry-leading practices result in the use of a software assembly line that sees all our code undergo rigorous and increasingly complex testing as it moves from developer workstations out towards our production environments. No change can reach our production infrastructure without first passing thousands of code quality checks, unit tests, end-to-end tests, and security tests, as well as performance, scalability, and reliability tests. Because any failing test stops the code change from progressing down the assembly line towards production, it's difficult for a breaking change to make its way into our production environments.

The reliability tests that we run for a dozen hours or more not only demonstrate product reliability under sustained load but also show very low error rates for our API calls. It's not uncommon to see error rates below 0.01%. Combined with our performance and end-to-end tests, our pre-production environment sees over 15 million API calls every day.

These rigorous test processes allow us to offer a 99.99% uptime guarantee on our production infrastructure. Our uptime is always publicly available and can be seen on our Status Page.
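To put a 99.99% uptime figure in perspective, the short sketch below converts an uptime percentage into an allowed-downtime budget. The function name and the choice of measurement period are assumptions for illustration; the actual terms of any guarantee are defined by the SLA itself.

```python
def downtime_budget_minutes(uptime_percent, period_days):
    """Return the minutes of downtime permitted by an uptime target
    over a period of the given length in days."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0
```

For a 99.99% target, this works out to roughly 52.6 minutes over a 365-day year, or about 4.3 minutes over a 30-day month.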

Constant Improvement

One of the principles of Agile Development practices is a culture of constant improvement. Elastic Path embodies this principle. We are always asking how we can improve. From sprint and release retrospectives to diagnosing the efficiency of our processes to triaging how bugs could have been found sooner, we constantly strive for improvement.

For performance, this manifests as a relentless drive to have Elastic Path Commerce Cloud scale to support ever-larger catalogs with ever-larger amounts of load. Over the last two years, we've optimized our code and scaled our production infrastructure from only being able to support 130,000 orders/hour on catalogs with 5,000 products to now being able to support over 300,000 orders/hour on catalogs of over 1 million products. This year, we plan on building out to support catalogs with up to 2 million products with even higher order rates.

In addition, over the last year, we've improved our auto-scale capabilities to allow us to scale a cluster from its smallest to largest configuration in 5 minutes, down from 7 minutes the year prior. This enables our infrastructure to quickly and automatically expand its capacity by 10x in response to any large traffic increases seen by our customers, guaranteeing consistent performance under these stress conditions.

Our relentless pursuit of performance and scale is never-ending.

Production Performance Monitoring

Understanding what's happening in our production environment is critical to understanding how our customers are experiencing our service. Because meeting our 100 ms response-time and 99.99% uptime SLAs is crucial, we've invested heavily in production monitoring infrastructure.

We use a series of best-in-class monitoring solutions to not only track adherence to SLAs but also to monitor a wide range of metrics, track system logs, and diagnose performance issues via tracing and profiling of every API call.

Using alerts, we can spot issues before they impact our customers. These alerts are triggered not only by exceeding set thresholds but also by artificial intelligence, which can find anomalous behavior across the thousands of metrics and data points we collect. This type of deep visibility allows us to quickly find and pinpoint the source of issues long before they impact our customers. Our operations staff is on call 24 hours a day should an alert require attention. Oftentimes, human intervention is not even required, as our systems have the capability to self-correct anomalous behavior via health checks and automated cycling of any services that appear to be in error.
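The difference between the two kinds of alert can be sketched in a few lines. The example below uses a plain z-score as a deliberately simple stand-in for the anomaly detection in our monitoring tools, which is far more sophisticated; the function names, limits, and sample values are assumptions made for illustration.

```python
import statistics


def threshold_alert(value, limit):
    """Fire when a metric exceeds a fixed limit."""
    return value > limit


def anomaly_alert(history, value, z_limit=3.0):
    """Fire when a value is more than z_limit standard deviations away
    from the mean of recent history (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_limit
```

If recent response times hover around 40 ms, a jump to 75 ms stays under a 100 ms threshold yet is flagged as anomalous, which is how this kind of check can surface a problem before a fixed limit is ever crossed.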

The net effect of all this monitoring and alerting technology is a highly available e-commerce platform.