Memory leak after upgrading Tyk from version 4.3.4 to 5.8.5

Hi Team

We recently upgraded Tyk from v4.3.4 to v5.8.5. Initially, memory usage was at 20% on each of our 3 Tyk nodes, each with a 4-core CPU and 8 GB of memory.

However, memory usage kept increasing gradually, and after about 2 months it had reached 80% on all nodes. We did not have pprof enabled at the time, so we don’t have any profiling data for memory usage.

Can anybody point us in the right direction to find out what could have caused this leak?

Hi Mohit

Are you using JWT with JWKS URL caching on your APIs?

I ask because the JWKS cache mechanism in 5.8.5 doesn’t release memory correctly, leading to a gradual increase in consumption over time. That would match the symptoms you are seeing.

I can see that this issue was resolved in version 5.8.6, so if this is the cause, an upgrade would resolve it.

James

Hey James

I'm not sure whether we use JWKS URL caching. How can I tell if it's enabled?

Hi Mohit,

JWKS URL caching is enabled by default in Tyk Gateway whenever a JWKS URL is configured for JWT authentication. There is no explicit toggle to turn it on or off; if a JWKS URL is provided, Tyk will automatically cache the response to avoid making an HTTP request for every incoming JWT.

To check whether you are using JWKS URLs (and therefore the cache), you can look at your API definitions or global configuration as described below. Are you using Tyk Dashboard, or only the Open Source gateway?

1. In the Tyk Dashboard UI

  1. Go to the API Designer for your specific APIs.
  2. Navigate to the Authentication section (or Core Settings for Classic APIs).
  3. If JWT is selected as the authentication mode and a JWKS URL is provided, caching is automatically active.

2. In your API Definitions (JSON/YAML)
For OAS APIs:

  1. Open your API definition.
  2. Navigate to x-tyk-api-gateway.server.authentication.securitySchemes.[your_jwt_scheme].jwksURIs.
  3. If there are entries here with a url, JWKS caching is active. You might also see a cacheTimeout field (e.g., "cacheTimeout": "60s"). If cacheTimeout is missing, the global default of 240 seconds is used.

For Classic APIs:

  1. Open your API definition JSON.
  2. Look for the jwt_jwks_uris array.
  3. If there are entries here with a url, JWKS caching is active. You might also see a cache_timeout field. If missing or set to 0, the global default of 240 seconds is used.

3. Global Configuration
You can also check if the default cache timeout has been modified globally:
• Look in your tyk.conf file for the jwks_config.cache.timeout setting.
• Alternatively, check if the TYK_GW_JWKS_CACHE_TIMEOUT environment variable is set.

If you are using JWKS URLs, this matches the symptoms of the memory leak issue resolved in version 5.8.6, where the cache mechanism didn’t release memory correctly.
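For file-based API definitions, the check in step 2 can be automated. The following is only a minimal Python sketch, assuming the definitions are stored as JSON files in one directory (definitions managed through the Dashboard or written in YAML are not covered). The function names find_jwks_entries and scan_api_definitions are made up for illustration; only the jwt_jwks_uris and jwksURIs field names come from the steps above.

```python
import json
from pathlib import Path

# Field names from the steps above: Classic uses jwt_jwks_uris, OAS uses jwksURIs.
JWKS_KEYS = ("jwt_jwks_uris", "jwksURIs")


def find_jwks_entries(node):
    """Recursively collect every non-empty JWKS list found in a parsed definition."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key in JWKS_KEYS and value:
                found.extend(value if isinstance(value, list) else [value])
            else:
                found.extend(find_jwks_entries(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_jwks_entries(item))
    return found


def scan_api_definitions(directory):
    """Map each JSON file name in `directory` to the JWKS entries it configures."""
    report = {}
    for path in sorted(Path(directory).glob("*.json")):
        try:
            definition = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or non-JSON files
        entries = find_jwks_entries(definition)
        if entries:
            report[path.name] = entries
    return report
```

Calling scan_api_definitions on the directory that holds your API definitions returns an empty dictionary when no definition configures a JWKS URL, which would mean JWKS caching is not in use.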

We are using Classic API definitions and did not find jwt_jwks_uris in any of our APIs.
There is also no jwks_config.cache.timeout setting and no environment variable for the timeout.

It seems that no API is using this caching, so the memory leak must be coming from somewhere else.

Hi Mohit,

Thanks for checking that. Since you are not using JWKS caching, the memory leak is likely stemming from a different area.

To help us identify exactly what is consuming the memory, the best next step is to capture a heap profile using Go’s built-in pprof tool when the memory usage is high.

Here is how you can do that:

1. Enable the HTTP Profiler
You will need to enable the profiler on your Tyk Gateway nodes. You can do this by:

  • Setting "enable_http_profiler": true in your tyk.conf file.
  • OR setting the environment variable TYK_GW_ENABLE_HTTP_PROFILER=true.

(Note: Please be aware that this exposes the /debug/pprof/ endpoint on your gateway port, so it’s recommended to only enable this temporarily for troubleshooting or ensure it’s not publicly accessible).

2. Capture the Heap Profile
Once enabled, wait for the memory usage to climb to a high level (e.g., >60-70%). Then, run the following command from the gateway node to capture the heap profile:

curl -s http://localhost:8080/debug/pprof/heap > heap.out

(Replace localhost:8080 with your actual gateway host and port if different).
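If you want several snapshots over the course of a test rather than a single one, the same request can be scripted. This is a minimal Python sketch, assuming the profiler is enabled and reachable over plain HTTP; the function names heap_profile_url and capture_heap_profile and the heap-<timestamp>.out file naming are illustrative choices, not part of Tyk.

```python
import time
import urllib.request
from pathlib import Path


def heap_profile_url(host="localhost", port=8080, scheme="http"):
    """Build the heap endpoint URL exposed when enable_http_profiler is on."""
    return f"{scheme}://{host}:{port}/debug/pprof/heap"


def capture_heap_profile(url, out_dir=".", timeout=30):
    """Download one heap profile and save it under a timestamped file name."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    out_path = Path(out_dir) / f"heap-{stamp}.out"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        out_path.write_bytes(response.read())
    return out_path
```

Running capture_heap_profile(heap_profile_url(port=8081)) once an hour, for example, gives a series of files. An early and a late snapshot can then be compared with go tool pprof using its -base flag to see which allocations grew.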

3. Share the Profile
Once you have the heap.out file, please share it with us here (or via a support ticket if you have a commercial license) so we can analyze it and pinpoint the exact cause of the leak.

Additionally, a quick question: Are you using any custom plugins (Python, JavaScript, gRPC, etc.) or plugin bundles? There have been a few memory-related fixes around plugin bundle verification and CoProcess in the 5.8.x release series that might be relevant depending on your setup.

Let us know once you have the profile or if you need any help capturing it!

Hi, we tried to replicate the behaviour with a load test in our test environment. The test details are as follows:

Load: 300 TPS
Duration: 10 hours
VM: 4 Core 8 GB

From the start of the test, Tyk memory usage kept increasing by ~5-6% per hour, and after around 7 hours it reached 65-66%.

At this point we captured the heap data and shared the heap file with you in a direct message.
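As a rough sanity check on these figures, the time left before a node reaches a given level can be estimated. The sketch below assumes growth stays linear, which is an assumption and not something measured; hours_until is a hypothetical helper.

```python
def hours_until(threshold_pct, current_pct, rate_pct_per_hour):
    """Hours until usage reaches threshold_pct, assuming linear growth."""
    if rate_pct_per_hour <= 0:
        raise ValueError("rate must be positive")
    return max(0.0, (threshold_pct - current_pct) / rate_pct_per_hour)


# Figures from the load test: 65-66% used after ~7 hours, growing 5-6% per hour.
fastest = hours_until(80, 66, 6)  # highest level, fastest growth
slowest = hours_until(80, 65, 5)  # lowest level, slowest growth
print(f"80% reached in roughly {fastest:.1f} to {slowest:.1f} more hours")
# prints: 80% reached in roughly 2.3 to 3.0 more hours
```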

Regarding plugins, we are using custom gRPC plugins written in Go, mainly for rate limiting by client IP and some header logging.

Our tyk.conf is as follows:

{
  "enable_http_profiler": true,
  "log_level": "info",
  "listen_port": 8081,
  "node_secret": "*****",
  "secret": "*****",
  "template_path": "/opt/tyk-gateway/templates",
  "tyk_js_path": "/opt/tyk-gateway/js/tyk.js",
  "use_logstash": false,
  "use_db_app_configs": false,
  "db_app_conf_options": {
    "connection_string": "",
    "node_is_segmented": false,
    "tags": []
  },
  "disable_dashboard_zeroconf": true,
  "app_path": "/opt/tyk-gateway/apps",
  "middleware_path": "/opt/tyk-gateway/middleware",
  "storage": {
    "type": "redis",
    "enable_cluster": true,
    "host" : "localhost",
    "hosts": {"redis-member-1.int": "6379", "redis-member-2.int": "6379", "redis-member-3.int": "6379", "redis-member-4.int": "6379", "redis-member-5.int": "6379", "redis-member-6.int": "6379"},
    "port": 6379,
    "username": "",
    "password": "",
    "database": 0,
    "optimisation_max_idle": 2000,
    "optimisation_max_active": 4000,
    "use_ssl": false,
    "ssl_insecure_skip_verify": true
  },
  "enable_analytics": true,
  "analytics_config": {
    "type": "mongo",
    "pool_size": 100,
    "csv_dir": "/tmp",
    "mongo_url": "",
    "mongo_db_name": "",
    "mongo_collection": "",
    "purge_delay": 100,
    "ignored_ips": [],
    "enable_detailed_recording": false,
    "enable_geo_ip": false,
    "geo_ip_db_path": "",
    "storage_expiration_time": 60,
    "normalise_urls": {
      "enabled": true,
      "normalise_uuids": true,
      "normalise_numbers": true,
      "custom_patterns": []
    }
  },
  "health_check": {
    "enable_health_checks": false,
    "health_check_value_timeouts": 60
  },
  "allow_master_keys": true,
  "policies": {
    "policy_source": "file",
    "policy_connection_string": "",
    "policy_record_name": "tyk_policies",
    "allow_explicit_policy_id": true,
    "policy_path": "/opt/tyk-gateway/policies"
  },
  "hash_keys": true,
  "suppress_redis_signal_reload": false,
  "enable_redis_rolling_limiter": false,
  "use_redis_log": false,
  "close_connections": true,
  "enable_non_transactional_rate_limiter": true,
  "enable_sentinel_rate_limiter": false,
  "experimental_process_org_off_thread": false,
  "enforce_org_quotas": false,
  "enforce_org_data_detail_logging": false,
  "local_session_cache": {
    "disable_cached_session_state": false
  },
  "http_server_options": {
    "use_ssl": true,
    "enable_strict_routes": false,
    "min_version": 771,
    "max_version": 772,
    "enable_websockets": true,
    "flush_interval": 1,
    "read_timeout": 2000,
    "write_timeout": 2000,
    "enable_path_suffix_matching": false,
    "enable_path_prefix_matching": false,
    "certificates": [
      {
        "domain_name": "*.int",
        "cert_file": "/tyk-ssl/int.crt",
        "key_file": "/tyk-ssl/int.pem"
      }
    ],
    "ssl_insecure_skip_verify": true
  },
  "streaming": {
    "enabled": true,
    "allow_unsafe": []
  },
  "uptime_tests": {
    "disable": true,
    "config": {
      "enable_uptime_analytics": false,
      "failure_trigger_sample_size": 3,
      "time_wait": 1,
      "checker_pool_size": 50
    }
  },
  "hostname": "",
  "enable_custom_domains": true,
  "enable_jsvm": false,
  "oauth_redirect_uri_separator": ";",

  "coprocess_options": {
    "enable_coprocess": true,
    "coprocess_grpc_server": "unix:///tmp/grpc-go.sock",
    "grpc_recv_max_size": 1073741824,
    "grpc_send_max_size": 1073741824
  },
  "enable_bundle_downloader": false,
  "bundle_base_url": "",
  "pid_file_location": "./tyk-gateway.pid",

  "allow_insecure_configs": true,
  "public_key_path": "",
  "close_idle_connections": false,
  "allow_remote_config": false,


  "global_session_lifetime": 100,
  "force_global_session_lifetime": false,
  "max_idle_connections_per_host": 100,

  "proxy_default_timeout": 1200,
  "health_check_endpoint_name": "health",
  "oas_config": {
    "validate_examples": false,
    "validate_schema_defaults": false
  },
  "proxy_ssl_insecure_skip_verify": true
}

Hi there,

Thank you for providing the heap profile and the details of your load test.

Based on our analysis of the heap data and your tyk.conf, this appears to be related to memory fragmentation rather than a classical memory leak. Specifically, the frequent allocation and deallocation of Coprocess objects (which are used for every gRPC request) can cause severe memory fragmentation over time. The Go garbage collector struggles to reclaim this memory efficiently under high throughput, leading to the steady increase in RAM usage you are observing.

Additionally, we noticed in your configuration that the gRPC message size limits are set extremely high (1 GiB each):

"coprocess_options": {
  "grpc_recv_max_size": 1073741824,
  "grpc_send_max_size": 1073741824
}

These massive limits allow individual gRPC requests to consume huge amounts of memory for buffering, which significantly exacerbates the memory consumption and fragmentation issues during a load test.

Recommendations & Mitigations:

  1. Reduce gRPC Max Sizes: We highly recommend lowering grpc_recv_max_size and grpc_send_max_size to a more reasonable value (e.g., 4194304 for 4 MiB, or 33554432 for 32 MiB) unless you are actually passing gigabyte-sized payloads.
  2. Go GC Tuning: As a temporary mitigation, you can set the GOMEMLIMIT environment variable to about 80-90% of your VM’s memory limit (e.g., 6500MiB for an 8GB VM). This forces more aggressive garbage collection before the memory grows too high.
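The byte values and the GOMEMLIMIT range above can be checked with a few lines of arithmetic. This is a small Python sketch, assuming the full 8 GiB (8192 MiB) of the VM is the relevant total; gomemlimit_value is a hypothetical helper that only formats the number with the MiB suffix that GOMEMLIMIT accepts.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def gomemlimit_value(total_mib, fraction):
    """Format a GOMEMLIMIT value as a whole number of MiB."""
    return f"{int(total_mib * fraction)}MiB"


# The configured gRPC limits are exactly 1 GiB.
assert 1073741824 == 1 * GIB
# The suggested replacements are 4 MiB and 32 MiB.
assert 4194304 == 4 * MIB and 33554432 == 32 * MIB

# 80% and 90% of 8192 MiB.
for fraction in (0.80, 0.90):
    print(f"GOMEMLIMIT={gomemlimit_value(8192, fraction)}")
# prints: GOMEMLIMIT=6553MiB
# prints: GOMEMLIMIT=7372MiB
```

The 6500MiB example sits just under the 80% mark. If other processes share the VM, a lower fraction may be more appropriate.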

Permanent Fix Status:
We are actively working on a permanent fix for this issue by implementing a sync.Pool to reuse these Coprocess objects instead of creating new ones for every request. You can track the progress in PR #8124. Please note that this PR has not yet been merged, does not have a specific release associated with it yet, and still needs to go through our full QA process.

Let us know if adjusting the gRPC max sizes and applying the GOMEMLIMIT helps stabilize the memory in your test environment!

Hey Oel

Thanks for the analysis.

We reran the test with both recommended settings: grpc_recv_max_size and grpc_send_max_size set to 32 MB, and GOMEMLIMIT set to 6GiB (75% of the VM's memory).

The memory trend is similar to the earlier run: usage still increases gradually (~5-6% per hour).

We also ran the same load on the earlier version (4.3.4) for 2 hours, and there it behaves normally (no increase).

We analysed the heap differences with the help of an AI assistant. The findings are below.

What the 4.3.4 heap shows:

  • Total live heap is only about 60 MB
  • Top live retainers are mostly ordinary buffers and short-lived request/session work:
    • bufio.NewWriterSize
    • bufio.NewReaderSize
    • bytes.makeSlice
    • gateway.BaseMiddleware.UpdateRequestSession
    • github.com/pmylund/go-cache.(*cache).Set
    • gateway.(*APISpec).Version
    • logrus.(*Entry).WithFields
    • gateway.(*CoProcessor).BuildObject
    • gateway.(*DefaultSessionManager).UpdateSession

What the 5.8.5 heap shows:

  • Total live heap is about 1.9 GB
  • The retained memory is dominated by:
    • internal/memorycache.(*Cache).Set
    • internal/memorycache.(*BucketStorage).Create
    • gateway.(*SessionLimiter).ForwardMessage
    • internal/memorycache.(*Item).touch

It also flagged a potential bug in the current code:
Bug: internal/memorycache.Cache.startCleanupTimer exits after one cleanup due to an unconditional break inside the for/select loop (internal/memorycache/cache.go:95). This causes BucketStorage entries used by SessionLimiter.limitDRL to accumulate indefinitely.
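To illustrate why a cleanup loop that stops after its first run would produce this growth pattern, here is a small Python model. It is not Tyk's Go code: the TTLCache class, the simulate function and all parameters are hypothetical, and the model only assumes that new keys keep arriving with a short lifetime while a periodic cleanup is supposed to remove expired ones.

```python
class TTLCache:
    """Toy cache that stores an expiry tick per key."""

    def __init__(self):
        self.items = {}

    def set(self, key, expires_at):
        self.items[key] = expires_at

    def cleanup(self, now):
        """Remove every entry whose expiry tick has passed."""
        expired = [k for k, exp in self.items.items() if exp <= now]
        for k in expired:
            del self.items[k]


def simulate(ticks, keys_per_tick, ttl, break_after_first_cleanup):
    """Return how many entries remain after `ticks` rounds of inserts."""
    cache = TTLCache()
    timer_running = True
    for now in range(ticks):
        for i in range(keys_per_tick):
            cache.set(f"client-{now}-{i}", now + ttl)
        if timer_running:
            cache.cleanup(now)
            if break_after_first_cleanup:
                timer_running = False  # models the loop exiting after one pass
    return len(cache.items)


print(simulate(100, 10, 2, break_after_first_cleanup=True))   # prints 1000
print(simulate(100, 10, 2, break_after_first_cleanup=False))  # prints 20
```

With the cleanup running on every tick, the number of entries stays bounded by the TTL. Once the cleanup stops, every key ever inserted is retained, so memory grows in proportion to traffic, which is consistent with the steady increase seen in the load test.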

Could you please check this and suggest next steps, as the mitigations mentioned above did not help in our case?

Hi there,

Thank you so much for the detailed analysis and for sharing the AI findings. You really helped us track this down!

Our early analysis confirms that this is exactly the problem. The unconditional break in the memory cache cleanup timer is indeed causing the buckets to accumulate indefinitely under your load.

While we work on the fix, you can work around this issue by bypassing the in-memory leaky bucket entirely. Since you are using rate limiting, you can switch to a Redis-backed rate limiter by adding one of the following to your tyk.conf (assuming you have Redis configured):

  • "enable_redis_rolling_limiter": true
  • OR "enable_fixed_window_rate_limiter": true

This will shift the rate limiting state to Redis, completely avoiding the affected memory cache and stopping the leak.

We have already opened a pull request with a fix, which you can track here: "fix: remove unconditional break in memorycache cleanup timer" by probelabs[bot], Pull Request #8180, TykTechnologies/tyk on GitHub. We will need a bit more time to run it through proper QA and testing before it is officially released, so we don’t have an exact release date just yet.

Thanks again for your incredible help in finding this issue!