Increasing memory usage after 5.2 upgrade

Hello!

After upgrading to 5.2, the gateway pods show steadily increasing memory usage (they are slowly approaching our limits). Is this expected behaviour?

Thanks!

We’re using Tyk API Gateway 5.2.0 (tykio/tyk-gateway:5.2.0). This is happening in a production environment where we have tens of API definitions. We do have Golang middlewares running, but we had no memory issues when we were running 4.0.1.
This is our Tyk conf file:

       {
          "listen_port": 8080,
          "template_path": "/opt/tyk-gateway/templates",
          "tyk_js_path": "/opt/tyk-gateway/js/tyk.js",
          "use_db_app_configs": false,
          "app_path": "/opt/tyk-gateway/api-definitions/",
          "log_level": "info",
          "storage": {
            "type": "redis",
            "host": "xxx",
            "port": 6379,
            "username": "tyk-thirdparty",
            "password": "<%= data.redis_password %>",
            "use_ssl": true,
            "database": 0,
            "optimisation_max_idle": 2000,
            "optimisation_max_active": 4000
          },
          "enable_analytics": true,
          "analytics_config": {
            "purge_delay": -1,
            "ignored_ips": []
          },
          "health_check": {
            "enable_health_checks": true,
            "health_check_value_timeouts": 60
          },
          "optimisations_use_async_session_write": true,
          "enable_non_transactional_rate_limiter": true,
          "enable_sentinel_rate_limiter": false,
          "enable_redis_rolling_limiter": false,
          "allow_master_keys": false,
          "policies": {
            "policy_source": "file",
            "policy_record_name": "/opt/tyk-gateway/policies/policies.json"
          },
          "hash_keys": true,
          "close_connections": false,
          "http_server_options": {
            "enable_websockets": true
          },
          "allow_insecure_configs": true,
          "coprocess_options": {
            "enable_coprocess": true,
            "coprocess_grpc_server": "tcp://:9111",
            "grpc_recv_max_size": 33554432,
            "grpc_send_max_size": 33554432
          },
          "dns_cache": {
            "enabled": true,
            "ttl": 30,
            "multiple_ips_handle_strategy": "random"
          },
          "enable_bundle_downloader": true,
          "bundle_base_url": "",
          "global_session_lifetime": 100,
          "force_global_session_lifetime": true,
          "max_idle_connections_per_host": 500,
          "enable_jsvm": false,
          "opentelemetry": {
            "enabled": false,
            "endpoint": "opentelemetry-collector.monitoring:4317",
            "resource_name": "tyk thirdparty (production)"
          },
          "newrelic": {
            "app_name": "---",
            "license_key": "<%= data.newrelic_license_key %>"
          }
        }

Hello again,
we noticed that when running Tyk Gateway 5.2 with OpenTelemetry disabled, we don’t see the increasing memory usage. Should we tune the OpenTelemetry integration by configuring sampling for tracing?

Hello @scelentano, this behaviour looks consistent with the OTel collector not processing traces fast enough. The gateway keeps traces in memory until the collector has processed them. Can you take a look at the memory and CPU consumption of the collector? What does that look like?

Please check out this section on OTel sampling. By default we sample 50% of the traces. Depending on the resources available and what you need, you might have to adjust that.
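
Something along these lines in the opentelemetry block, for example. Please double-check the exact sampling field names (type, rate, parent_based) against the docs for your version, as I’m quoting them from memory; the endpoint and resource_name values are taken from your config above:

    "opentelemetry": {
      "enabled": true,
      "endpoint": "opentelemetry-collector.monitoring:4317",
      "resource_name": "tyk thirdparty (production)",
      "sampling": {
        "type": "TraceIDRatioBased",
        "rate": 0.5,
        "parent_based": true
      }
    }

With TraceIDRatioBased and a rate of 0.5, roughly half of the traces are exported; lowering the rate reduces the export volume (and the memory held for pending spans) accordingly.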

Hello,
we tried to apply a ratio-based sampling configuration, but the memory growth is still there.
We checked the OTel collector metrics and see no CPU or memory issues, and there are no peaks in spans refused by the collector. We also noticed that the memory growth does not stop even during the night, when traffic is much lower and there should be no latency issues sending traces to the collector. This suggests a memory leak in the Tyk Gateway. We will try to run pprof to get a memory profile of the gateway.
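
For anyone following along, the plan is to enable the gateway’s built-in Go profiler and pull a heap profile from it. As far as I understand, this is done with the enable_http_profiler option (name to be verified against the Tyk docs for your version), which exposes the standard /debug/pprof/ endpoints on the gateway:

    {
      "enable_http_profiler": true
    }

The heap profile can then be fetched with go tool pprof against /debug/pprof/heap.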

Thank you for the information @scelentano. I reported this internally and will be waiting for the pprof data. Thank you :slight_smile:

This is the heap profile:

File: tyk
Build ID: 57d2558493288f76f519ab48387a9d7fa1ce503b
Type: inuse_objects
Time: Sep 29, 2023 at 12:29pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 174093, 98.06% of 177537 total
Dropped 78 nodes (cum <= 887)
Showing top 10 nodes out of 87
      flat  flat%   sum%        cum   cum%
     72095 40.61% 40.61%      72095 40.61%  go.opentelemetry.io/otel/internal/global.(*meter).Int64Counter
     45878 25.84% 66.45%      45878 25.84%  go.opentelemetry.io/otel/internal/global.(*meter).Float64Histogram
     35468 19.98% 86.43%      35468 19.98%  regexp.(*Regexp).FindAllStringSubmatch.func1
      6554  3.69% 90.12%       6554  3.69%  runtime.(*scavengerState).init
      4612  2.60% 92.72%       4612  2.60%  runtime.allocm
      3782  2.13% 94.85%       3782  2.13%  runtime.malg
      3277  1.85% 96.69%       5184  2.92%  regexp.compile
      1517  0.85% 97.55%       1517  0.85%  regexp.onePassCopy
       910  0.51% 98.06%        910  0.51%  github.com/sirupsen/logrus.(*Entry).WithFields
         0     0% 98.06%       1256  0.71%  crypto/tls.(*Conn).HandshakeContext
(pprof)

This seems to show that the meter package used by otelhttp, and more specifically the maps holding the metric counters, keep accumulating on the heap. Any idea?

Could the reason the service is leaking memory be that an otelhttp handler (along with its counters) is created for each request?

Not sure. Would you be able to share the pprof files? Also, are you using 5.2.0? ARM or AMD?

If not, could you share the in-use memory profiles? Our engineers are trying to replicate this, and that info would be very helpful.

Does decreasing the sampling ratio slow down the rate of memory consumption?

Hello,
yes, we’re using Tyk API Gateway 5.2.0 (tykio/tyk-gateway:5.2.0).
The arch is amd64. I will share the profiles ASAP.
Yes, decreasing the sampling ratio slows down the memory growth, but it still increases (more slowly) until the pods reach their limit, at which point they crash and restart.

Perfect, thank you for all the information. Once we have the profiles we should be able to identify the culprit.

Hello,
go tool pprof --inuse_objects https://tyk.xyz.com/debug/pprof/heap
Unfortunately I could not attach the file, so here is a link where you can download it: Filebin | l0ahb6rrj7f1vzwo
Thank you for the support!

Thank you @scelentano for your help! We got the issue and we are looking to release the fix in 5.2.1. Stay tuned :slight_smile:

Thank you so much for the support! Do you know when the release is planned?
Thank you!

Hi @scelentano, we actually delayed the 5.2.1 release to include this fix; it was supposed to go out last week. The fix just passed QA. My guess would be that it goes out next week at the latest. I will keep the thread updated :slight_smile:

@scelentano the fix is available on 5.2.1-rc6 if you want to test it. :slight_smile:

@scelentano 5.2.1 was released yesterday. :slight_smile:
