I’ve been trying to push Tyk to its limits. I have Tyk running containerized on an ECS cluster, and I’m using an ElastiCache Redis instance (non-clustered, but with a read replica for failover). The problem is that once I scale to 4 Tyk containers, the Redis instance starts timing out. In ElastiCache I can see the engine CPU utilization go to 100%. Since Redis is single-threaded, throwing larger instance sizes at it doesn’t make any difference. I tried a clustered setup as well, but that didn’t help. Any pointers or documentation on how this should be set up, and how we can actually scale Redis?
Have you had a chance to review and/or apply any of the settings recommended in our documentation around this? There’s a section in Planning for Production about optimisation settings, some of which are specifically about reducing Redis lookups; this may help.
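For reference, here’s a rough sketch of the kind of gateway settings that section covers (values are illustrative and written from memory, not copied from the docs, so please treat the Planning for Production page as authoritative):

```json
{
  "enforce_org_data_age": true,
  "close_connections": false,
  "max_idle_connections_per_host": 500,
  "optimisations_use_async_session_write": true,
  "enable_non_transactional_rate_limiter": true,
  "enable_sentinel_rate_limiter": false,
  "drl_notification_frequency": 10,
  "local_session_cache": {
    "disable_cached_session_state": false
  }
}
```

The non-transactional rate limiter and the local session cache in particular are intended to cut down on Redis round-trips per request.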
Otherwise, could you share your gateway configuration file? It would be good to see which settings you are running with that may be contributing to heavier Redis load. Also, is it safe to assume you’re using the 3.2.2 gateway release?
I’ve tried the “optimisation settings”, but I do not see any improvement. At this point, Tyk is slowing down throughput by a LOT! Any suggestions on how to debug the high number of requests to Redis are appreciated.
I’ve just now tried v3.2.2 as well. No change in Redis CPU Utilization. From what I posted above, I cannot find anything that should affect Redis performance.
Thanks for sticking with it - Tyk is blazing fast, so there is definitely something amiss here.
We do recommend that you turn off health checks, as they can be expensive (set TYK_GW_HEALTHCHECK_ENABLEHEALTHCHECKS to false).
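If you prefer to set it in tyk.conf rather than via the environment, the equivalent setting is:

```json
{
  "health_check": {
    "enable_health_checks": false
  }
}
```

As far as I recall, the health-check recorder stores per-request samples in Redis, which is why it gets expensive under load.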
Do you have any rate limiting set up? That could also be contributing to this. It would be great if you could post your API definition and security policy YAML so we can take a look.
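To illustrate what we’re looking for, the rate-limit and quota fields on a key or policy look roughly like this (a sketch using the Tyk session object field names as I recall them; values purely illustrative):

```json
{
  "rate": 1000,
  "per": 60,
  "quota_max": 10000,
  "quota_renewal_rate": 3600
}
```

Rate-limit and quota counters are both tracked in Redis, so aggressive values across many keys can add noticeable load.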
Also, can you disable OAuth (just for now) and run everything again, so we can eliminate that from the equation?
Would you try load testing against this API definition? It’s the one from this load-testing blog, which was also run on AWS, except with Redis in a container alongside the Gateway on the same EC2 node; that shouldn’t matter for a simple reverse proxy test.
You would only have to change the UPSTREAM URL in that API Definition to the local container you’re using.
That API definition does NOT require any Redis checks, since it turns off:
- rate limiting
- auth
- analytics
- quotas

so you’ll be able to see the true latency of Tyk as a reverse proxy! (A minimal sketch of such a definition follows below.)
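For anyone following along, here’s a hedged sketch of what such a keyless, untracked definition can look like (field names from memory of the Tyk API definition schema; the listen path and target_url are placeholders to replace with your own):

```json
{
  "name": "perf-test-keyless",
  "api_id": "perf-test-keyless",
  "org_id": "default",
  "use_keyless": true,
  "active": true,
  "do_not_track": true,
  "version_data": {
    "not_versioned": true,
    "versions": {
      "Default": {
        "name": "Default"
      }
    }
  },
  "proxy": {
    "listen_path": "/perf-test/",
    "target_url": "http://your-upstream:8080/",
    "strip_listen_path": true
  }
}
```

With use_keyless set there are no key lookups or rate-limit counters, and do_not_track keeps analytics records for that API out of Redis.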
As you’re sending load, you’ll want to monitor the CPU of Redis, Tyk, and your upstream to find out which is the bottleneck.
Please share further numbers, including reports from the load test generators as well as resources and metrics on your machines; I’d personally love to see you wow’d!
Hello @sedky, thanks for your reply. I’m primarily using cdapi. I’ll try disabling Quota and health checks. Btw, because I was performance testing, I set both Rate and API to the same values (100000 requests per 10 seconds).
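For context, that limit corresponds to something like this on the key/policy (a sketch using the Tyk session fields as I understand them; the quota values are my assumption of what “the same values” means):

```json
{
  "rate": 100000,
  "per": 10,
  "quota_max": 100000,
  "quota_renewal_rate": 10
}
```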
Okay, so after disabling health checks and quotas, my Redis CPU utilization has dropped below 1%! I’ll now play around a bit more and see which of the two made the most difference. Will report back with my findings. Thanks a lot to everyone who helped troubleshoot this with me.
Okay, so I have narrowed it down to the health check. Once I disable the health check, I can still use quotas without a CPU or connection spike on Redis. But I’m still getting only about half of the throughput with Tyk vs. without Tyk. I’ll investigate more, and in case I’m not able to figure things out, I’ll post more details here.