Performance Engineering at Elastic Path
At Elastic Path, we take the performance and scalability of our products seriously. Our passion for performance has allowed us to support e-commerce applications for some of the world’s best-known companies while pushing our products to the very limit of what our cloud hosting providers can support. In this blog post I'll outline the principles that form the foundation of our performance engineering practice and introduce you to our team of performance experts. We'll start with the history of Performance Engineering at Elastic Path and a quick story.
Performance engineering is not some new phenomenon at Elastic Path. We've had a dedicated performance engineering team for well over a decade. Our commitment to performance has allowed us to land and support large enterprise customers such as Intuit, T-Mobile, and Swisscom. During peak loads these customers push hundreds of thousands of orders an hour through our Elastic Path Commerce platform. We've built and tested that platform to support hundreds of orders/sec, and we've been there in the trenches with our customers all along the way: on support calls, helping design and run tests, and delivering hotfixes on tight timelines.
Not only have we been in the trenches with our customers, but we've been our own customer. Our current Senior Director of Cloud Operations, Security, and Performance was a performance engineer back in 2010, when Elastic Path ran the 2010 Olympic Winter Games merchandising store. While traffic and sales were brisk leading up to and during the Winter Games, nothing prepared us for the day that Oprah Winfrey gifted each member of her audience a pair of the red Olympic Winter Games mittens. As that episode aired, sales soared. The hard work we had done performance testing our solution paid off as it withstood the load. With over a decade of experiences like this, our products have been hardened to handle the most stringent performance and scalability requirements.
Throughout much of our history our focus has been on our Elastic Path Commerce platform. However, over the last two years we’ve also brought our performance expertise to bear on our new multi-tenant SaaS offering, Elastic Path Commerce Cloud. The product came to us through our acquisition of Moltin, and we began putting it through its paces before the deal even closed. Performance testing and evaluation were a key part of our acquisition due diligence. From there, we’ve gone on to spend hundreds of hours drilling the Elastic Path Commerce Cloud platform with performance tests, documenting limitations, and then opening bottlenecks to get our performance and scalability to the next level. We’ve logged bugs, added autoscaling, and tuned and rewritten services, all culminating in us quintupling the scalability of Elastic Path Commerce Cloud thus far.
All that said, we are just getting started. One thing that’s true of the performance team is that we are never satisfied with our performance and scalability. We are always looking for ways to make our APIs faster, support larger datasets, scale quicker, and support ever higher volumes of traffic and orders. When performance is a first-class feature, you don’t rest.
Elastic Path has a seasoned team of full-time performance engineers and has had one for over a decade. In the three years since I joined, I’ve watched us push our systems harder and farther than at any other place I’ve worked in my 20-year career, which includes employers such as Microsoft and Realtor.com. It’s been a pleasant surprise to see just how far out on the bleeding edge of performance we operate.
For instance, a few years back one of our performance engineers completely rewrote the data access layer of our Elastic Path Commerce platform to support read-write splitting for our databases. This was no easy task: to obtain optimum performance we didn't use commercial libraries but wrote our own solution, which applied special knowledge of our read-write patterns to avoid race conditions induced by replication lag. Testing of this new High Data Scale (HDS) feature actually saw us break AWS when we hit the 10,000 API calls/sec limit of the AWS API Gateway using 5 R5.24XL DBs constituting nearly 500 cores and 3.75 TB of RAM.
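To make the replication-lag problem concrete, here is a minimal Python sketch of one generic way a read-write splitting layer can avoid stale reads: reads for a key go to the primary for a short window after that key is written, and to a replica otherwise. This is not Elastic Path's implementation, which the post does not describe. The class name `ReadWriteRouter`, the `max_lag_seconds` parameter, and the `"primary"`/`"replica"` labels are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict


class ReadWriteRouter:
    """Route reads to a replica unless the key was written recently.

    Hypothetical illustration of "read-your-writes" pinning; not the
    actual HDS data access layer.
    """

    def __init__(self, max_lag_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_lag = max_lag_seconds      # assumed upper bound on replica lag
        self._clock = clock                  # injectable so tests can fake time
        self._last_write: Dict[str, float] = {}

    def record_write(self, key: str) -> str:
        """Remember when the key was written; writes always use the primary."""
        self._last_write[key] = self._clock()
        return "primary"

    def route_read(self, key: str) -> str:
        """Use the primary while a replica might still be behind for this key."""
        written_at = self._last_write.get(key)
        if written_at is not None and self._clock() - written_at < self._max_lag:
            return "primary"
        return "replica"
```

The trade-off in a scheme like this is that a larger lag window is safer but sends more reads to the primary, which reduces the benefit of splitting.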
We are able to push this hard because our hiring practices are rigorous. The hiring of our most recent performance engineer involved 6 months of screening over 100 candidates in a 6-step hiring process, with only about a dozen candidates getting to step 4, a 2-hour interview with senior team members. Only one candidate got through all 6 steps successfully.
In short, Elastic Path has a very strong Performance Engineering Team with staff who have spent the majority of their careers working in performance. With over 75 years of combined experience, there is little this team hasn't seen or dealt with when it comes to making enterprise software perform and scale.
There are a handful of key tenets that underpin our performance engineering efforts. These are baked into our processes and we’re constantly looking to improve upon them.
Test Early and Often
At many companies, performance testing is an afterthought. It's often done on a compressed timeframe a day or two before some critical project goes live, only to end in disaster because some architectural decision made months earlier prevents the solution from performing or scaling. This is not how we do things at Elastic Path.
We believe strongly that performance must be baked in from the start. For us, this starts when developers deploy new code into our test environments. Each deployed code change is run through hundreds of end-to-end tests to validate functionality. Working with the development teams, the performance team has turned each of these end-to-end tests into a performance test by measuring and tracking response times. As such, the end-to-end tests provide broad coverage of all our API calls and act as an early warning system for any performance regressions that might have been introduced into the codebase.
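As a rough illustration of the idea, the Python sketch below times a call and flags it when it runs noticeably slower than a stored baseline. The helper names `measure_ms` and `is_regression`, the 20% tolerance, and the stand-in workload are assumptions for the example, not details of our actual test harness.

```python
import statistics
import time
from typing import Callable


def measure_ms(call: Callable[[], object], runs: int = 5) -> float:
    """Return the median wall-clock duration of `call` in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def is_regression(current_ms: float, baseline_ms: float,
                  tolerance: float = 0.20) -> bool:
    """Flag a regression when current exceeds baseline by more than `tolerance`."""
    return current_ms > baseline_ms * (1.0 + tolerance)


# Stand-in for a real end-to-end API call (hypothetical workload).
elapsed = measure_ms(lambda: sum(range(10_000)))
print(f"median {elapsed:.3f} ms, regression: {is_regression(elapsed, baseline_ms=50.0)}")
```

Using the median of several runs rather than a single timing keeps one slow outlier from raising a false alarm.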
While the end-to-end tests are focused on early warning and breadth of coverage, we also run every change through a series of load and stress tests in our staging environment. Owned by the performance team, these tests focus on realism and are calibrated to mimic workloads seen in production. The staging environment is a mirror of our production cluster, ensuring an accurate picture of how our product performs in the real world. In addition, the catalogs used for load testing are large and complex, much like our customers’ catalogs. These tests are run six times a day with results published to a company-wide Slack channel. The results get a lot of scrutiny not only from our performance and dev teams but from our executives as well.
Lastly, in addition to development-focused performance tests and realistic staging-focused load tests, we also have a dedicated performance test lab. This environment is used for massive scale testing, system-wide config change performance testing, thorough testing of any new services, and any other type of exploratory performance testing. Testing in this environment does not run on a schedule but is run as needed.
Overall, our multi-tenant solution, much like all our e-commerce solutions, is under near constant and thorough performance testing.
Test Realistic Workloads
When running tests to validate the performance of our e-commerce solution, it's critical that we accurately simulate the API calls that our customers make against our software. Our performance team has over a dozen different workloads to choose from. Most of them were created from our knowledge of how users use our platform and then cross-checked against production logs and analytics to validate the ratio and complexity of the API calls in the workload against what our systems experience in production.
The scripts encompass a variety of scenarios, changing key parameters such as conversion rate, the variety and complexity of products added to cart, whether or not promotions are added to cart, whether checkout is performed as a guest or a registered user, and many other variations.
A simplified example of a typical low-conversion Business to Consumer (B2C) guest checkout workload looks something like this:
- Browse – 85%
  - Load Root Catalog
  - Load Navigation Hierarchy
  - Randomly Navigate Hierarchy
  - Randomly view a Product Detail Page
- Add to Cart – 10%
  - Load Promotions
  - Retrieve Random Promotion
  - Add to cart
  - Add Promotion to Cart
  - View Cart
- Checkout – 5%
  - Add to cart
  - Proceed to checkout
  - Complete Purchase
This workload has a 5% conversion ratio, with 10% of sessions ending in abandoned carts and 85% of user sessions just browsing. This type of flow might be typical of e-commerce stores outside of a sales event. Conversion rates often rise during sales events, especially blowout events such as Black Friday or Cyber Monday. We have workloads to test those scenarios as well.
A workload such as the above would be run across multiple stores for multiple clients at once. We’d then layer this B2C shopper workload with an Admin workload simulating catalog editing and publishing operations.
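The 85/10/5 mix above can be expressed as a weighted random choice per simulated session. The Python sketch below shows the idea; the scenario keys, the `simulate_sessions` helper, and the fixed seed are illustrative assumptions and do not reflect the load-testing tool or scripts we actually use.

```python
import random
from collections import Counter

# Scenario weights taken from the simplified B2C guest checkout mix above.
WORKLOAD_MIX = {"browse": 85, "add_to_cart": 10, "checkout": 5}


def pick_scenario(rng: random.Random) -> str:
    """Choose one scenario for a session, weighted by the workload mix."""
    names = list(WORKLOAD_MIX)
    weights = list(WORKLOAD_MIX.values())
    return rng.choices(names, weights=weights, k=1)[0]


def simulate_sessions(count: int, seed: int = 0) -> Counter:
    """Assign a scenario to each of `count` sessions and tally the results."""
    rng = random.Random(seed)  # seeded so a test run is repeatable
    return Counter(pick_scenario(rng) for _ in range(count))


tally = simulate_sessions(10_000)
for name, sessions in tally.most_common():
    print(f"{name}: {sessions / 10_000:.1%} of sessions")
```

Changing the weights is all it takes to model a higher-conversion sales event with the same scenario scripts.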
Test Realistic Datasets
Much like our workloads, we strive to have our test datasets match the complexity of our customers’ production deployments. One key component of this is data size. While others test on small sample catalogs, our catalogs vary in size from a few thousand products to over a million.
We are constantly pushing the limits of our stores. So far, we’ve tested performance and scalability in stores with up to:
- 1.3 million products
- 10,000 hierarchy containers
- 100,000 promotions
- 100 custom fields per product
- 30,000 accounts
- 250,000 registered users
- 22 million orders
In addition to testing large catalog sizes, we also ensure our test catalogs contain the complexity of customer catalogs. This includes varying things such as node depth, node breadth, products per node, number of catalogs, number of pricebooks, number of attributes, number of variations, number of files, and many more.
That said, we are never really happy with the size or complexity of our datasets and we’re always looking at testing ever bigger sizes in ever more complex ways that better mimic how our customers use our products.
Test for Reliability
What is performance without reliability? Reliability is revenue and as such is critical to the functioning of any e-commerce business. As you can imagine, we take reliability very seriously at Elastic Path. In fact, our entire engineering process is built around ensuring our products are reliable.
It all starts with our approach to software development. Our software development processes are built on the principles of Agile Development, Continuous Delivery, and DevOps. These industry-leading practices result in a software assembly line that sees all our code undergo rigorous and ever more complex testing as it moves from developer workstations out towards our production environments. No change can be made to our production infrastructure without first passing thousands of code quality checks, unit tests, end-to-end tests, security tests, and performance, scalability, and reliability tests. Because any failing test stops the code change from progressing down the assembly line towards production, it’s difficult for a breaking change to make its way into our production environments.
The reliability tests that we run for a dozen hours or more not only demonstrate product reliability under load over long periods but also show very low error rates for our API calls. It’s not uncommon to see error rates below 0.01%. Combined with our performance and end-to-end tests, our pre-production environment sees over 15 million API calls every day.
These rigorous test processes allow us to offer a 99.99% uptime guarantee on our production infrastructure. Our uptime is always publicly available and can be seen on our Status Page.
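To put a 99.99% uptime figure in perspective, the short Python sketch below converts an availability percentage into the downtime it allows over a period. The function name `downtime_budget_minutes` is made up for this example, and the calculation is plain arithmetic rather than a statement of how the SLA is measured contractually.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime allowed at a given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


# 99.99% allows roughly 52.6 minutes per 365-day year, or about 4.3 minutes
# per 30-day month.
print(f"per year:  {downtime_budget_minutes(99.99, 365):.2f} minutes")
print(f"per month: {downtime_budget_minutes(99.99, 30):.2f} minutes")
```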
One of the principles of Agile Development practices is a culture of constant improvement. Elastic Path embodies this principle. We are always asking how we can improve. From sprint and release retrospectives to diagnosing the efficiency of our processes to triaging how bugs could have been found sooner, we constantly strive for improvement.
For performance, this manifests as a relentless drive to have Elastic Path Commerce Cloud scale to support ever larger catalogs under ever larger amounts of load. Over the last year we’ve optimized our code and scaled our production infrastructure from supporting only 130,000 orders/hour on catalogs with 5,000 products to supporting over 300,000 orders/hour on catalogs of over 1 million products. This year we plan on building out to support catalogs with up to 2 million products with even higher order rates.
In addition, over the last year we’ve improved our auto-scale capabilities to allow us to scale a cluster from its smallest to largest configuration in under 7 minutes. This allows our infrastructure to quickly and automatically expand its capacity by 10x in response to any large traffic increases seen by our customers, guaranteeing consistent performance under these stress conditions.
Our relentless pursuit of performance and scale is never-ending.
Production Performance Monitoring
Understanding what’s happening in our production environment is critical to understanding how our customers experience our service. Because meeting our SLA of 100 ms response times and 99.99% uptime is critical, we’ve invested heavily in production monitoring infrastructure.
We use a series of best-in-class monitoring solutions not only to track adherence to SLAs but also to monitor a wide range of metrics, track system logs, and diagnose performance issues via tracing and profiling of every API call.
Using alerts, we can spot issues before they impact our customers. These alerts are triggered not only when set thresholds are exceeded but also by artificial intelligence that can find anomalous behavior across the thousands of metrics and data points we collect. This deep visibility allows us to quickly pinpoint the source of an issue. Our operations staff is on call 24 hours a day should an alert require attention. Oftentimes human intervention is not even required, as our systems can self-correct anomalous behavior via health checks and automated cycling of any services that appear to be in error.
The net effect of all these monitoring systems and alerting technology is a highly available e-commerce platform.
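The difference between the two kinds of alert can be shown with a small Python sketch. A fixed threshold only fires when a value crosses a set limit, while an anomaly check fires when a value is unusual compared with recent history. The z-score test below is a deliberately simple statistical stand-in, not the AI-based detection our monitoring tools provide, and the function names and the limit of 3 standard deviations are assumptions.

```python
import statistics
from typing import Sequence


def threshold_alert(value_ms: float, limit_ms: float) -> bool:
    """Fire when the value exceeds a fixed limit."""
    return value_ms > limit_ms


def anomaly_alert(history: Sequence[float], value: float,
                  z_limit: float = 3.0) -> bool:
    """Fire when the value is more than `z_limit` standard deviations
    away from the mean of recent history."""
    if len(history) < 2:
        return False  # not enough data to judge
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_limit


recent_ms = [40, 42, 41, 39, 43, 40, 41]  # hypothetical response times
latest_ms = 80
print("threshold alert:", threshold_alert(latest_ms, limit_ms=100))
print("anomaly alert:  ", anomaly_alert(recent_ms, latest_ms))
```

In this made-up data the latest value is still under the 100 ms limit, so only the anomaly check would draw attention to the sudden doubling.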
Through the hard work of our experienced team in applying our core performance engineering principles, we’ve been able to:
- Achieve 100 ms response time for 95% of all calls made to our API
- Meet a 99.99% uptime SLA
- Support catalogs with 1M+ products
- Support processing of 300,000+ orders per hour per production deployment
- Support processing of 7M+ API calls per hour per production deployment
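A target such as "100 ms for 95% of calls" is a 95th-percentile check. As a minimal Python sketch, assuming response times are available as a list of millisecond samples, the check could look like the following. The nearest-rank percentile method and the function names are choices made for this example, not a description of how our monitoring tools compute the figure.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of the samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def meets_p95_target(samples_ms: Sequence[float], target_ms: float = 100.0) -> bool:
    """True when 95% of the samples are at or below the target."""
    return percentile(samples_ms, 95) <= target_ms


# Hypothetical samples: 95 fast calls and 5 slow ones still meet the target.
samples = [50.0] * 95 + [500.0] * 5
print("meets target:", meets_p95_target(samples))
```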
Care to Chat?
Should these performance and scalability capabilities be of interest to you, please feel free to reach out to our sales team at firstname.lastname@example.org.