Is Continuous Performance Testing reducing time spent on analysis?
This has been a passionate topic for me, especially over the last couple of months, during which I have spent a good amount of time arguing that continuous performance testing cannot replace the time we need to spend on analysis. At times we have ended up learning this the hard way.
Late last year I started working for a team that was responsible for transforming an on-premises monolith middle layer into cloud-based microservices, with the biggest motivations being cost reduction and delivering a resilient, highly available and reliable architecture. Who doesn’t love the sound of that, until you understand the mammoth task it really is. A 21-gun salute to the architect(s)/designers who took up this task. I stand by you.
Most people think: well, it’s just a microservice; once we’ve identified the smallest unit to break it into, then in the world of Continuous Delivery we just add a black-box test and a performance test and we are sorted. I wish it were that simple.
The core effort in writing the black-box tests is identifying good test cases (I’m sure every good tester would agree) that set out to break the system, not just happy-path scenarios.
Similarly, in the world of performance, the task isn’t simply to run component tests and evaluate whether the service can handle the expected throughput. Every inexperienced tester will tell you it is, until you ask them what considerations went into deciding the scope of the test.
This blog is not a rant but an honest attempt at sharing with fellow performance engineers the considerations that I believe should have been made. These considerations ultimately shape how much time needs to be allocated for performance testing.
Considerations while testing Microservices
Firstly, there isn’t a one-size-fits-all. These are a few considerations that, based on my experience, have been critical while testing microservices:
- Number of hops in between microservices or API gateway to microservice:
Latency is a big factor. While the biggest advantage of a monolith is that everything is on one system, in the world of microservices everything is distributed, and requests often traverse multiple hops under high-volume traffic, each of which can add to the overall latency.
Ensure your tests capture, and can differentiate between, the response time at each layer and the network time.
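To make that concrete, here is a minimal sketch of the kind of breakdown worth producing. The layer names and timings are illustrative assumptions; in practice the per-hop figures would come from your tracing/APM tool (spans per hop), with network time being whatever is left of the end-to-end response time.

```python
# Sketch: separating per-layer service time from network time for one request.
# Layer names and all numbers are illustrative, not from a real system.
total_response_ms = 180.0
layer_ms = {
    "api_gateway": 8.0,
    "auth": 12.0,
    "service_a": 65.0,
    "service_b": 40.0,
    "database": 30.0,
}

# Whatever the layers don't account for was spent on the wire (and in queues).
network_ms = total_response_ms - sum(layer_ms.values())

for layer, ms in layer_ms.items():
    print(f"{layer:12s} {ms:6.1f} ms ({ms / total_response_ms:5.1%})")
print(f"{'network':12s} {network_ms:6.1f} ms")
```

A breakdown like this makes it obvious when a “slow service” is actually a slow hop.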
- Database volume:
I hate it when tests are run against an empty, or even just an undersized, DB. It is critical to test against a production-sized DB with enough data to be representative of production. You will soon realise that data volume makes a big difference to how your queries perform, whether the optimiser selects the right execution plan, and whether you need indexes and/or partitions.
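Even a toy in-memory database shows why this matters: the execution plan for the same query changes depending on what structures exist over the data. The table and column names below are illustrative.

```python
import sqlite3

# Sketch: the same query, before and after an index exists.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.executemany(
    "INSERT INTO orders (customer) VALUES (?)",
    ((f"cust-{i % 1000}",) for i in range(100_000)),
)

query = "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust-42'"

plan_before = conn.execute(query).fetchall()  # planner falls back to a scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan_after = conn.execute(query).fetchall()   # planner now uses the index

print(plan_before[-1][-1])
print(plan_after[-1][-1])
```

Against an empty table both plans look equally cheap; only a realistic volume exposes the full-table-scan cost.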
- Test the right Throughput:
We all rely on the golden number that we have to meet as part of the NFR – “Service needs to support a peak traffic of 250 TPS” or “Ensure service is able to handle 20 million requests in 1 day”.
That TPS number usually means considerable effort went into deriving it, which is good; still, it’s best to ask how it was derived (formula and source), when (date/time), and whether it reflects the true peak periods of the business scenario. This is a subject in itself and I talk about it at length here.
But what if you’re given something like 20 million in 24 hours? This isn’t simply 20,000,000 / 24 / 60 / 60 = ~231 TPS. The traffic could instead look like 1000 TPS over a peak period of about 2 hours (7.2 million requests) followed by a lull or low-traffic period. Testing at ~231 TPS would seriously fail you.
Get that throughput expectation right.
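The arithmetic from the example above can be sketched in a few lines. The 2-hour / 1000 TPS peak profile is the hypothetical one described earlier, not a real workload.

```python
# Sketch: flat-average TPS vs. a peaked traffic profile (numbers from the
# 20-million-in-24-hours example above; the peak shape is hypothetical).
total_requests = 20_000_000
seconds_per_day = 24 * 60 * 60

flat_average_tps = total_requests / seconds_per_day  # the misleading number

peak_tps = 1000
peak_seconds = 2 * 60 * 60
peak_requests = peak_tps * peak_seconds              # 7.2 million at peak
offpeak_tps = (total_requests - peak_requests) / (seconds_per_day - peak_seconds)

print(f"flat average: {flat_average_tps:.0f} TPS")
print(f"peak period:  {peak_tps} TPS for 2 hours")
print(f"off-peak:     {offpeak_tps:.0f} TPS")
```

A test sized to the flat average would run at less than a quarter of the real peak load.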
- Can your downstream handle your traffic:
In cases where your service is a middle layer expected to handle certain traffic, it is critical to ensure the downstream systems can handle the traffic you pass on to them. This is where you start thinking about timeouts, API limits, caching (if any), and so on.
At times, running an integrated test helps you understand the overall picture, including the time spent at each layer: gateway, auth layer, load balancer, etc. These should (ideally) be minimal, but I’ve learnt they can hide issues that aren’t caught until you take a closer look, or until a real production incident happens.
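One cheap check in this area is a timeout-budget sanity check across layers. The layer names and timeout values below are illustrative assumptions; the idea is simply that a caller which gives up before its downstream does can orphan work, and retries on top of that can multiply downstream traffic.

```python
# Sketch: sanity-checking client timeouts across layers. All values are
# illustrative, not recommendations.
timeouts_ms = {
    "gateway -> middle": 2000,      # gateway gives up after 2s
    "middle -> downstream": 3000,   # but the middle layer waits up to 3s
}

def budget_ok(caller_timeout_ms: int, downstream_timeout_ms: int) -> bool:
    # If the caller times out sooner than its downstream, the downstream keeps
    # working on a request nobody is waiting for (and may then absorb retries).
    return caller_timeout_ms >= downstream_timeout_ms

ok = budget_ok(timeouts_ms["gateway -> middle"],
               timeouts_ms["middle -> downstream"])
print("timeout budget consistent:", ok)  # False: 2000 ms < 3000 ms
```

Inconsistencies like this are exactly the kind of configuration issue that only surfaces under load or during a production incident.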
- Can your service handle Chaos:
Chaos Engineering is another hot topic in the performance and SRE space, and ensuring your service is prepared to handle different chaos or failover scenarios speaks volumes about the effort that has gone into testing it.
- Cover every end-point if possible:
Performance engineers love a risk-based approach, as we can’t realistically test everything. We like to focus on calls with high throughput while at times skipping low-throughput calls (because of timelines and judgement).
Imagine running a test where one endpoint carries 98% of the traffic, and not spending much effort on triggering the endpoint with 2% of the traffic (and setting up all the test data it requires), only to find out that the 2% endpoint had an untuned query doing a full table scan, pushing the DB CPU to around 95%, and voila: everything else goes for a toss.
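Including the low-volume endpoint doesn’t have to mean a separate test: a weighted traffic mix exercises it in proportion. The endpoint names and weights below are illustrative.

```python
import random

# Sketch: driving a 98/2 traffic mix so the rare endpoint still gets hit.
# Endpoint names and weights are illustrative.
mix = {"/search": 0.98, "/bulk-export": 0.02}

random.seed(42)  # fixed seed so the run is reproducible
counts = {endpoint: 0 for endpoint in mix}
for _ in range(10_000):
    endpoint = random.choices(list(mix), weights=list(mix.values()), k=1)[0]
    counts[endpoint] += 1

# Even a 2% weight gives the rare endpoint on the order of 200 hits out of
# 10,000 iterations: enough to surface an untuned query.
print(counts)
```

Most load tools (e.g. Locust task weights, JMeter throughput controllers) express the same idea natively.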
- Look at every response code:
We all love a clean test, who doesn’t? But at times an HTTP 204 or HTTP 206 might look like a success while truly being a false positive. The same goes for HTTP 5xx status codes.
While designing tests it is important to read the API spec and understand the failure scenarios especially during Chaos tests to ensure you are getting what you expect to get.
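A simple way to enforce this is to validate observed status codes against what the API spec says each endpoint may legitimately return, rather than treating any 2xx as a pass. The endpoints and expected codes below are illustrative.

```python
# Sketch: per-endpoint expected status codes, taken (in practice) from the
# API spec. Endpoint names and codes here are illustrative.
expected = {
    "/orders": {200},
    "/orders/archive": {200, 204},  # 204 is legitimate only when empty
}

def verify(endpoint: str, status: int) -> bool:
    # Anything outside the spec-defined set is flagged, even if it "looks" OK.
    return status in expected.get(endpoint, set())

observed = [("/orders", 200), ("/orders", 204), ("/orders/archive", 204)]
for endpoint, status in observed:
    verdict = "PASS" if verify(endpoint, status) else "FLAG for review"
    print(endpoint, status, verdict)
```

Here a 204 from /orders gets flagged: it may mean the test data setup silently failed and the service did no real work.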
I don’t think I’ve covered every consideration, but the point is this: just because we are testing a smaller unit of code doesn’t mean we shouldn’t think about everything around it and consider the impact.
Once you have your thinking hat on and have covered (almost) all the avenues of your microservice architecture, it’s time to run the test. Running it as part of your pipeline and expecting everything to go green on the first attempt is like writing any automation job: only a few 10x engineers get it right on the first go.
So you had a beautiful run and the 95th percentile response times look great. Done deal? A closer look at the report shows a spike, but that’s about it. Test results are captured and published, and the service goes live.
BOOM – response times start to get impacted in prod. What happened?
Well, that spike could have meant something. Further analysis, after a good amount of effort, reveals a slow query. It turns out the query results were being cached because test data was reused, and that initial spike was the first (and only) time the query was actually executed.
Don’t skip the time spent on analysis. Keep an eye out for outliers and observe patterns. These can make all the difference.
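A cheap habit that would have caught the spike above: don’t trust the percentile summary alone, but explicitly flag samples far from the median. The latency sample and threshold below are illustrative.

```python
import statistics

# Sketch: flagging outliers in a latency sample (values in ms, illustrative).
# One cold-cache first hit hides among otherwise healthy samples.
samples = [120, 118, 125, 119, 122, 121, 117, 2400, 123, 120]

median = statistics.median(samples)
# Crude threshold: anything more than 5x the median. Tune per service.
outliers = [s for s in samples if s > 5 * median]

print(f"median:   {median} ms")
print(f"outliers: {outliers}")
```

The 95th percentile of this sample barely registers the 2400 ms hit, yet that single sample is exactly the first-execution (cold cache, uncached query) behaviour production users will feel.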
Analysis in CT World
In the continuous testing world it is best to allocate some time post-test to analyse and document your findings. Ensuring the right metric collectors are in place, gathering enough samples for you to make sense of the data, is key.
Dig into those metrics for anomalies in infrastructure metrics or behaviour. Always own the configurations that go into production; by that I mean review every configuration setting that could impact performance and ensure it is fine-tuned.
Not all analysis needs to be documented as evidence, but it is good to know you have tested it and covered your risks.
A good analogy: a good doctor looks at all the reports (blood tests, blood pressure, X-ray, MRI) but also listens to what the patient is feeling before giving the go-ahead. The reports do not always indicate a problem, but if things don’t look right, they might not be.