I’ve been trying to push Tyk to its limits. I have Tyk running containerized on an ECS cluster, and I’m using an ElastiCache Redis instance (non-clustered, but with a read replica for failover). The problem is that once I scale to 4 Tyk containers, the Redis instance starts timing out. In ElastiCache I can see the engine CPU utilization go to 100%. Since Redis is single-threaded, throwing larger instance sizes at it doesn’t make any difference. I tried a clustered setup as well, but that didn’t help. Any pointers or documentation on how this should be set up, and how we can actually scale Redis?
Have you had a chance to review and/or apply any of the settings recommended in our documentation around this? There’s a section in Planning for Production about optimisation settings, some of which are specifically about reducing Redis lookups; this may help.
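For reference, here’s a rough sketch of the kind of gateway settings that section covers (values are illustrative and written from memory, not copied from the docs, so please treat the Planning for Production page as authoritative):

```json
{
  "enforce_org_data_age": true,
  "close_connections": false,
  "max_idle_connections_per_host": 500,
  "optimisations_use_async_session_write": true,
  "enable_non_transactional_rate_limiter": true,
  "enable_sentinel_rate_limiter": false,
  "drl_notification_frequency": 10,
  "local_session_cache": {
    "disable_cached_session_state": false
  }
}
```

The non-transactional rate limiter and the local session cache in particular are intended to cut down on Redis round-trips per request.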
Otherwise, could you share your gateway configuration file? It would be good to see which settings you are running with that may be contributing to heavier Redis load. Also, is it safe to assume you’re using the 3.2.2 gateway release?
I’ve tried the “optimisation settings”, but I do not see any improvement. At this point, Tyk is slowing down throughput by a LOT! Any suggestions on how to debug the high number of requests to Redis are appreciated.
I’ve just now tried v3.2.2 as well. No change in Redis CPU Utilization. From what I posted above, I cannot find anything that should affect Redis performance.
Thanks for sticking with it - Tyk is blazing fast, so there is definitely something amiss here.
We do recommend that you turn off health checks, as they can be expensive (set TYK_GW_HEALTHCHECK_ENABLEHEALTHCHECKS to false).
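If you prefer to set it in tyk.conf rather than via the environment, the equivalent setting is:

```json
{
  "health_check": {
    "enable_health_checks": false
  }
}
```

As far as I recall, the health-check recorder stores per-request samples in Redis, which is why it gets expensive under load.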
Do you have any rate limiting set up? That could also be contributing to this. It would be great if you could post your API definition and security policy YAML so we can take a look.
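To illustrate what we’re looking for, the rate-limit and quota fields on a key or policy look roughly like this (a sketch using the Tyk session object field names as I recall them; values purely illustrative):

```json
{
  "rate": 1000,
  "per": 60,
  "quota_max": 10000,
  "quota_renewal_rate": 3600
}
```

Rate-limit and quota counters are both tracked in Redis, so aggressive values across many keys can add noticeable load.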
Also, can you disable OAuth (just for now) and run everything again, so we can eliminate that from the equation?
Would you try load testing against this API definition? It’s the one from this load-testing blog, which was also run on AWS, except with Redis in a container alongside the Gateway on the same EC2 node; that shouldn’t matter for a simple reverse proxy test.
You would only have to change the UPSTREAM URL in that API Definition to the local container you’re using.
That API definition does NOT require any Redis checks, since it turns off:
- rate limiting
- auth
- analytics
- quotas

so you’ll be able to see the true latency of Tyk as a reverse proxy! (A minimal sketch of such a definition follows below.)
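For anyone following along, here’s a hedged sketch of what such a keyless, untracked definition can look like (field names from memory of the Tyk API definition schema; the listen path and target_url are placeholders to replace with your own):

```json
{
  "name": "perf-test-keyless",
  "api_id": "perf-test-keyless",
  "org_id": "default",
  "use_keyless": true,
  "active": true,
  "do_not_track": true,
  "version_data": {
    "not_versioned": true,
    "versions": {
      "Default": {
        "name": "Default"
      }
    }
  },
  "proxy": {
    "listen_path": "/perf-test/",
    "target_url": "http://your-upstream:8080/",
    "strip_listen_path": true
  }
}
```

With use_keyless set there are no key lookups or rate-limit counters, and do_not_track keeps analytics records for that API out of Redis.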
As you’re sending load, you’ll want to monitor the CPU of Redis, Tyk, and your upstream to find out which is the bottleneck.
Please share further numbers, including reports from the load test generators as well as resources and metrics on your machines; I’d personally love to see you wow’d!
Hello @sedky, thanks for your reply. I’m primarily using cdapi. I’ll try disabling Quota and health checks. Btw, because I was performance testing, I set both Rate and API to the same values (100000 requests per 10 seconds).
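For context, that limit corresponds to something like this on the key/policy (a sketch using the Tyk session fields as I understand them; the quota values are my assumption of what “the same values” means):

```json
{
  "rate": 100000,
  "per": 10,
  "quota_max": 100000,
  "quota_renewal_rate": 10
}
```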
Okay, so after disabling health checks and quotas, my Redis CPU utilization has dropped below 1%! I’ll now play around a bit more and see which of the two made the most difference. Will report back with my findings. Thanks a lot to everyone who helped troubleshoot this with me.
Okay, so I have narrowed it down to the health check. Once I disable the health check, I can still use quotas without a CPU or connection spike on Redis. But I’m still getting only about half of the throughput with Tyk vs. without Tyk. I’ll investigate more, and in case I’m not able to figure things out, I’ll post more details here.