Tyk gateway memory consumption issue

Hi,
We have been observing strange behavior in our gateway environment for a few weeks now. The gateway container steadily consumes more and more memory over time, and usage keeps increasing until the host machine finally falls over. Initially we could not narrow it down, since we run a lot of monitoring and log-tracking tools on the same host. However, we recently added Docker monitoring, and the container's memory consumption pattern is in line with what we observe for the gateway environment as a whole. Does this look like an issue in the Docker release of the gateway? Please see the screenshots below.

Instance memory usage -

Gateway container memory usage -

Environment hosts' memory consumption pattern. The points at which the usage curve collapses are where a host is killed and replaced by autoscaling.

Hi, can you provide more details about your setup? Are you using the latest Docker image tag? Is it CE or Pro edition?

Your Tyk configuration files might be useful for us too.

Hi, we are using the Pro licensed edition. The Tyk configuration is pretty standard (see below). We initially had a low memory limit set on the gateway Docker container, so we increased it to about 80% of the host machine's available memory. The issue persisted after that change.

We are using v2.3.1 and v2.3.3 in two environments and see the issue in both.

It had all the earmarks of a memory leak in the gateway container, so we decided to seek input from the Tyk community.

{
    "listen_port": 8080,
    "secret": "xxxxxxxxxxxxxxxxxxx",
    "node_secret": "xxxxxxxxxxxxxxxxxxx",
    "template_path": "/opt/tyk-gateway/templates",
    "tyk_js_path": "/opt/tyk-gateway/js/tyk.js",
    "middleware_path": "/opt/tyk-gateway/middleware",
    "use_db_app_configs": true,
    "db_app_conf_options": {
        "connection_string": "xxxxxxxxxxxxxxxxxxx",
        "node_is_segmented": false,
        "tags": ["External Tyk Prod"]
    },
    "app_path": "/opt/tyk-gateway/apps/",
    "storage": {
        "type": "redis",
        "host": "xxxxxxxxxxxxxxxxxxx",
        "port": 6379,
        "username": "",
        "password": "",
        "database": xxxxxxxxxxxxxxxxxxx,
        "optimisation_max_idle": 100
    },
    "enable_analytics": true,
    "health_check": {
        "enable_health_checks": true,
        "health_check_value_timeouts": 60
    },
    "optimisations_use_async_session_write": true,
    "enable_non_transactional_rate_limiter": true,
    "enable_sentinel_rate_limiter": false,
    "allow_master_keys": false,
    "policies": {
        "policy_source": "service",
        "policy_connection_string": "xxxxxxxxxxxxxxxxxxx"
    },
    "hash_keys": true,
    "close_connections": true,
    "allow_insecure_configs": true,
    "coprocess_options": {
        "enable_coprocess": false,
        "coprocess_grpc_server": ""
    },
    "enable_bundle_downloader": true,
    "bundle_base_url": "",
    "global_session_lifetime": 100,
    "force_global_session_lifetime": false,
    "max_idle_connections_per_host": 100,
    "http_server_options": {
        "enable_websockets": true,
        "flush_interval": 1
    },
    "enable_custom_domains": true
}

Are you running uptime tests?

No, we are not running uptime tests.

Are you reloading the gateway periodically?

No, the gateway is deployed once and is only killed when it is replaced during autoscaling. There are no reloads in between.

Ok, what middleware are you running? Does this happen on a gateway that does not see live traffic?

No, this is seen on a gateway that is heavily used; it sees constant traffic 24x7. And if by middleware you mean plugins for the APIs, we don't have any of those either.

Ok, that’s interesting - for now, I can only suggest some config tweaks that might improve things:

Set enable_health_checks to false: it can drag down your performance, and it forces a Redis write on every connection. You probably do not need it; if you want health checks, look at our StatsD integration.
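For example, based on the health_check block in the config you posted (keeping the same timeout), the change would look something like this:

"health_check": {
    "enable_health_checks": false,
    "health_check_value_timeouts": 60
}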

Then add this:

"uptime_tests": {
    "disable": true
}

It will stop the uptime host checker from running altogether, so the loop will not even run.

Increase optimisation_max_idle to 1000 or more, and set a ceiling with:

"optimisation_max_active": 4000

This ensures the connection pool isn't unbounded, recycles connections, and will also give you better performance.
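Putting the two pool settings together, the storage section of your config would then look something like this (a sketch assuming optimisation_max_active sits alongside optimisation_max_idle, with your redacted values left as-is):

"storage": {
    "type": "redis",
    "host": "xxxxxxxxxxxxxxxxxxx",
    "port": 6379,
    "username": "",
    "password": "",
    "database": xxxxxxxxxxxxxxxxxxx,
    "optimisation_max_idle": 1000,
    "optimisation_max_active": 4000
}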

Are you using websockets at all?

We have enabled websockets with -
"http_server_options": {
    "enable_websockets": true,
    "flush_interval": 1
}

But I don’t think we have any APIs using websockets currently.

I will try the changes you suggested in a test environment, Martin, and will get back to you in a few hours with my observations.

Hi,
We ran tests with these changes and things look much better now. Memory usage no longer goes off the rails, and it levels off when the load on the gateway decreases, unlike what we saw before.

We would like a few clarifications about the Redis connection pool usage. On our Redis cluster we see, on average, a maximum of about 5 connections in use, so how does a value of 100 for optimisation_max_idle not suffice? We would appreciate some insight into how the gateway uses this field and why we don't see more connections being used on the cluster.

Thanks

The idle pool basically sets a floor for the pool once it has spun up: if you go from 1 connection to 120 because of a traffic spike, the app will keep up to 100 of those connections open rather than closing them all.

Because you had the health check enabled - it actually kicks off a Redis command on each request to track metrics - your average connection count will have increased as traffic increased.

Because there was no ceiling value, the app will have held on to those connections, which can cause memory usage that tracks with traffic and doesn't necessarily let go. It's a very old feature, and that is also why we are deprecating it.

I think the health checks were to blame, not the connection pool.

Also, newer versions of Tyk use a distributed rate limiter, which reduces Redis usage even further - the high pool count is just a recommended setting because it handles high-volume traffic better.