Load testing fails with Tyk 5.1

Hi there!

We are conducting extensive load testing of the open-source Tyk Gateway 5.1.
In short, the setup is the following:

  • three containerized Tyk instances in an ECS cluster
  • an ElastiCache Redis instance (node type cache.m5.2xlarge, running with 1 shard and 2 nodes).
  • several sample upstream service containers (based on Mockbin) that live in the same cluster. They mock real upstream services and keep response-time delay minimal.
  • The test runs with a k6 client that ramps up the load in roughly 100 req/sec increments, ending the test at 2000 req/sec (see the sketch after this list).
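
For context, a ramp like this can be driven from the k6 CLI alone. The stage durations and targets below are illustrative, and script.js stands for any minimal script that issues GET requests against the gateway:

k6 run --stage 1m:100 --stage 5m:2000 script.js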

We found that Tyk slows down after a while because the ElastiCache network allowance is exceeded. After a closer look at what happens, we found that a very high volume of data is being read from ElastiCache by Tyk. More precisely:

The screenshot shows the following:

  • “Response timing -95th”: the 95th-percentile response time increases significantly once the ElastiCache allowance-exceeded throttling starts
  • “Elasticache Network”: there are about 1500 connections open to ElastiCache
  • “Received (Tyk)”: 14.31 GiB of data is received from ElastiCache every second

We monitored the communication between Tyk and Redis and found that these commands are executed every second:

1693410667.617965 [0 172.18.0.2:47430] "set" "redis-test-b8917b54-5c8f-4746-ab7a-413accbe07f7" "test" "ex" "1"
1693410667.618609 [0 172.18.0.2:47430] "get" "redis-test-b8917b54-5c8f-4746-ab7a-413accbe07f7"
1693410667.619361 [0 172.18.0.2:37174] "set" "redis-test-242f7d15-0f92-4b90-bde0-5c2991c1e60d" "test" "ex" "1"
1693410667.619833 [0 172.18.0.2:37174] "get" "redis-test-242f7d15-0f92-4b90-bde0-5c2991c1e60d"
1693410667.620414 [0 172.18.0.2:37184] "set" "redis-test-6d1a27d4-b262-4a12-9be4-64c2312e584d" "test" "ex" "1"
1693410667.620867 [0 172.18.0.2:37184] "get" "redis-test-6d1a27d4-b262-4a12-9be4-64c2312e584d"
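
For reference, traffic like the above can be captured with Redis's built-in MONITOR command (MONITOR adds noticeable load itself, so it is best run briefly and only against a test instance; the endpoint below is a placeholder):

redis-cli -h <elasticache-endpoint> -p 6379 --tls monitor | grep "redis-test-"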

Our Tyk config looks like this:

{
  "listen_port": 8080,
  "secret": "<redacted>",
  "template_path": "/opt/tyk-gateway/templates",
  "tyk_js_path": "/opt/tyk-gateway/js/tyk.js",
  "middleware_path": "/opt/tyk-gateway/middleware",
  "use_db_app_configs": false,
  "app_path": "/opt/tyk-gateway/apps/",
  "storage": {
    "type": "redis",
    "enable_cluster": true,
    "addrs": [ "clustercfg.<redacted>.cache.amazonaws.com:6379" ],
    "port": 6379,
    "username": "appservices-user-testing",
    "password": "<redacted>%",
    "use_ssl": true,
    "database": 0,
    "optimisation_max_idle": 2000,
    "optimisation_max_active": 4000
  },
  "enable_analytics": false,
  "analytics_config": {
    "type": "redis",
    "csv_dir": "/tmp",
    "mongo_url": "",
    "mongo_db_name": "",
    "mongo_collection": "",
    "purge_delay": -1,
    "ignored_ips": []
  },
  "health_check": {
    "enable_health_checks": false,
    "health_check_value_timeouts": 60
  },
  "optimisations_use_async_session_write": false,
  "enable_non_transactional_rate_limiter": true,
  "enable_sentinel_rate_limiter": false,
  "enable_redis_rolling_limiter": false,
  "allow_master_keys": false,
  "policies": {
    "policy_source": "file",
    "policy_record_name": "/opt/tyk-gateway/policies/policies.json"
  },
  "hash_keys": true,
  "enable_hashed_keys_listing": true,
  "close_connections": false,
  "http_server_options": {
    "enable_websockets": true
  },
  "allow_insecure_configs": true,
  "coprocess_options": {
    "enable_coprocess": false,
    "coprocess_grpc_server": ""
  },
  "enable_bundle_downloader": true,
  "bundle_base_url": "",
  "global_session_lifetime": 100,
  "force_global_session_lifetime": false,
  "max_idle_connections_per_host": 500,
  "enable_jsvm": true
}

Even though we disabled the Tyk health check through config, Redis is still bombarded with health-check traffic, and the /hello endpoint is available despite the API docs stating it would be disabled.
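
For reference, the endpoint can be probed with a plain GET on the gateway's listen port (the host is a placeholder):

curl http://<gateway-host>:8080/hello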

Can you please give us some hints on what to check and how to optimize/minimize the communication between ElastiCache and Tyk? With the current setup we reached only 1000 rps with a reasonable 50 ms response time; I am sure Tyk is capable of performing much better than this.

Thank you in advance,
Jozsef Kercso

After testing several different Tyk config combinations, we found the root cause of the issue: Tyk sends analytics to Redis even if "enable_analytics": false is set. To really disable analytics collection, we need to clear the analytics_config.type configuration value as well. We managed to disable analytics with the following settings:

"enable_analytics": false,
  "analytics_config": {
    "type": "",
    "csv_dir": "/tmp",
    "mongo_url": "",
    "mongo_db_name": "",
    "mongo_collection": "",
    "purge_delay": -1,
    "ignored_ips": []
  }

After this, Tyk produces a much better request-per-second rate.
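
As a quick sanity check, the Redis list that buffers analytics records should stop growing once analytics is really off. (The key name below comes from the Tyk Pump docs and may differ by version, so treat it as an assumption.)

redis-cli llen tyk-system-analytics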

Hey Jozsef, great find! I’m curious:

  • what’s the size of the Tyk instances?
  • what was the final RPS you achieved?
  • what was your target RPS initially?

Hey @munkiat,

here are the answers:

  • the size of the Tyk instances: m5n.2xlarge
  • the final RPS we achieved: we haven’t finished the tests yet, so I would not call the numbers final. This is what we have achieved so far with a single Tyk instance:
    • 7500 rps with 10 ms response time
    • 10000 rps with 20 ms response time
    • 11400 rps with 50-60 ms response time
      Note that this is with a bare-minimum Tyk: we switched off everything we could in the config.
  • target RPS: we would like to stay close to 10k rps per single Tyk instance with a reasonably low response time while we enable different plugins. Obviously this depends on the performance of the plugins as well, not just Tyk itself.
    Note: the 10k rps target is for single REST and GraphQL requests. We will also test the stitching performance of the Universal Data Graph; I assume the numbers will be quite different there.

Best Regards,
Jozsef

We continued digging into how Tyk collects analytics data, and we found the following:

We can see these Redis commands per request:

"zremrangebyscore" "myendpoint.Request" "-inf" "1693472809611289234"
"zrange" "myendpoint.Request" "0" "-1"
"zadd" "myendpoint.Request" "1693472869611289300" "1693472869611242643.3"
"pexpireat" "myendpoint.Request" "60"

The problem appears when Tyk receives a few thousand requests per second:

In that case a single ZRANGE command runs for seconds and retrieves a large amount of data, which impacts both response times and network bandwidth:

$ redis-cli zrange myendpoint.Request 0 -1 | wc -c
1 
$ redis-cli zrange myendpoint.Request 0 -1 | wc -c
4004 
$ redis-cli zrange myendpoint.Request 0 -1 | wc -c
26092
$ redis-cli zrange myendpoint.Request 0 -1 | wc -c
49984 
$ redis-cli zrange myendpoint.Request 0 -1 | wc -c
79530
$ redis-cli zrange myendpoint.Request 0 -1 | wc -c
510138 
$ redis-cli zrange myendpoint.Request 0 -1 | wc -c
680453

and so on…
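
Incidentally, a cheaper way to watch the set grow (without pulling every member over the network) is ZCARD, which returns only the element count:

redis-cli zcard myendpoint.Request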

This pattern also consumes a lot of connections.

We hope the results of this investigation can help make Tyk analytics better.


@Jozsef_Kercso, I really like your attention to detail. I’m keen to see your UDG perf tests when you give it a spin.

Took a quick look (public holiday here in Singapore today). Some quick notes on my end:

  • Disabling analytics will definitely save some overhead in the testing.
  • From this page, it seems that analytics_config.type=redis isn’t a valid option, so it’s a good call to clear it.
  • Other configs like csv_dir, mongo_url etc. are Pump-related configs, so I don’t expect them to have an impact.
  • The ZRANGE commands are definitely coming from health checks, and you’re right, they are “expensive”.

I’ve managed to turn off the health checks, and zremrangebyscore, zrange, and zadd no longer appear. I see that you’ve also set it to false; is there a chance that your config is not being picked up?

Hello @Jozsef_Kercso and welcome to the community :partying_face:

“Received (Tyk)”: 14.31 GiB of data is received from ElastiCache every second

This seems very large but I am no performance expert, so I am unsure what exactly is going on.

What I’d like to focus on is this:

Even though we disabled the Tyk health check through config, Redis is still bombarded with health-check traffic, and the /hello endpoint is available despite the API docs stating it would be disabled

The hello endpoint is the liveness check endpoint on the gateway, not the actual health check endpoint mentioned in the docs. The health checks are definitely more expensive to process than the liveness check. In our docs we use the terms interchangeably, so it may sound confusing.

Tyk depends on Redis to be fully operational, but some parts may still work without access to Redis. The communication you noticed consists of Redis liveness checks that the gateway sends to ensure Redis is still active. You can see it in the gateway source linked below.

If the result of this check is false, you would observe error logs from the gateway warning you about an issue with your Redis connection. I don’t think this behaviour can be modified or dynamically changed without building a custom version of the gateway.
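
Based on the set/get pairs visible in the MONITOR output earlier in the thread, the probe amounts to something like the following (a minimal sketch assuming the go-redis and google/uuid packages, not the gateway’s literal source):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// livenessCheck mirrors the set/get pair from the MONITOR output:
// write a probe key with a 1-second TTL, then read it back.
// A failed round trip means Redis is unreachable.
// Sketch only, not the gateway's literal code.
func livenessCheck(ctx context.Context, rdb *redis.Client) bool {
	key := "redis-test-" + uuid.NewString()
	if err := rdb.Set(ctx, key, "test", time.Second).Err(); err != nil {
		return false
	}
	val, err := rdb.Get(ctx, key).Result()
	return err == nil && val == "test"
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	fmt.Println("redis alive:", livenessCheck(context.Background(), rdb))
}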

Can you please give us some hints on what to check and how to optimize/minimize the communication between ElastiCache and Tyk?

This is where our planning-for-production documentation comes in handy. @munkiat has also given a couple of good suggestions, but the ones to really keep in mind if analytics is necessary are disabling detailed recording or enabling a separate analytics store (enable_separate_analytics_store).
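
For reference, those settings live in the gateway config and look roughly like this (the hosts are placeholders, and the field names should be double-checked against the config reference for your gateway version):

"analytics_config": {
  "enable_detailed_recording": false
},
"enable_separate_analytics_store": true,
"analytics_storage": {
  "type": "redis",
  "addrs": [ "<separate-analytics-redis>:6379" ],
  "use_ssl": true
}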

Regarding the analytics_config.type change, we are unable to reproduce the issue. The only allowed values are the empty string "" or rpc, so it’s a bit odd that modifying this value worked for you. Can you reconfirm by reverting the changes?

Dear @Olu and @munkiat,

thank you for your answers. We double-checked, and here is my revised answer:

  • you are correct, analytics_config.type cannot have the value redis. That was a copy/paste issue on our side.
  • it is also correct that both the enable_analytics and enable_health_checks configuration values work as expected: setting them to false switches off the corresponding feature.
  • the root cause of the misunderstanding was that we toggled them on and off in parallel, and we mistakenly concluded that the zrange command (which, as we measured, takes a lot of time) was executed by the analytics module.

So, to summarize: after we set enable_health_checks to false, there was a considerable speed increase in Tyk. We basically came to the same conclusion as this thread: Scaling Redis for Tyk in AWS - #5 by kalpik

Thank you for your input and for steering me in the right direction.

Jozsef
