Increasing memory usage after 5.2 upgrade

Hello!

After upgrading to 5.2, the gateway pods show steadily increasing memory usage (they are slowly approaching our limits). Is this expected behaviour?

Thanks!

We’re using Tyk API Gateway 5.2.0 (tykio/tyk-gateway:5.2.0). This is happening in a production environment where we have tens of API definitions. We do have Golang middlewares running, but we had no memory issues when we were running 4.0.1.
This is our Tyk conf file:

       {
          "listen_port": 8080,
          "template_path": "/opt/tyk-gateway/templates",
          "tyk_js_path": "/opt/tyk-gateway/js/tyk.js",
          "use_db_app_configs": false,
          "app_path": "/opt/tyk-gateway/api-definitions/",
          "log_level": "info",
          "storage": {
            "type": "redis",
            "host": "xxx",
            "port": 6379,
            "username": "tyk-thirdparty",
            "password": "<%= data.redis_password %>",
            "use_ssl": true,
            "database": 0,
            "optimisation_max_idle": 2000,
            "optimisation_max_active": 4000
          },
          "enable_analytics": true,
          "analytics_config": {
            "purge_delay": -1,
            "ignored_ips": []
          },
          "health_check": {
            "enable_health_checks": true,
            "health_check_value_timeouts": 60
          },
          "optimisations_use_async_session_write": true,
          "enable_non_transactional_rate_limiter": true,
          "enable_sentinel_rate_limiter": false,
          "enable_redis_rolling_limiter": false,
          "allow_master_keys": false,
          "policies": {
            "policy_source": "file",
            "policy_record_name": "/opt/tyk-gateway/policies/policies.json"
          },
          "hash_keys": true,
          "close_connections": false,
          "http_server_options": {
            "enable_websockets": true
          },
          "allow_insecure_configs": true,
          "coprocess_options": {
            "enable_coprocess": true,
            "coprocess_grpc_server": "tcp://:9111",
            "grpc_recv_max_size": 33554432,
            "grpc_send_max_size": 33554432
          },
          "dns_cache": {
            "enabled": true,
            "ttl": 30,
            "multiple_ips_handle_strategy": "random"
          },
          "enable_bundle_downloader": true,
          "bundle_base_url": "",
          "global_session_lifetime": 100,
          "force_global_session_lifetime": true,
          "max_idle_connections_per_host": 500,
          "enable_jsvm": false,
          "opentelemetry": {
            "enabled": false,
            "endpoint": "opentelemetry-collector.monitoring:4317",
            "resource_name": "tyk thirdparty (production)"
          },
          "newrelic": {
            "app_name": "---",
            "license_key": "<%= data.newrelic_license_key %>"
          }
        }

Hello again,
we noticed that when running Tyk Gateway 5.2 with OpenTelemetry disabled, we don’t see the increasing memory usage. Should we tune the OpenTelemetry integration by configuring sampling for tracing?

Hello @scelentano, this behaviour looks consistent with the OTel collector not processing traces fast enough. The gateway keeps traces in memory until the collector has processed them. Can you take a look at the memory and CPU consumption of the collector? What does that look like?

Please check out this section on OTel sampling. By default we sample 50% of the traces. Depending on the resources available and what you need, you might have to adjust that.
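
Something along these lines in the opentelemetry block, for example. Please double-check the exact sampling field names (type, rate, parent_based) against the docs for your version, as I’m quoting them from memory; the endpoint and resource_name values are taken from your config above:

    "opentelemetry": {
      "enabled": true,
      "endpoint": "opentelemetry-collector.monitoring:4317",
      "resource_name": "tyk thirdparty (production)",
      "sampling": {
        "type": "TraceIDRatioBased",
        "rate": 0.5,
        "parent_based": true
      }
    }

With TraceIDRatioBased and a rate of 0.5, roughly half of the traces are exported; lowering the rate reduces the export volume (and the memory held for pending spans) accordingly.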

Hello,
we tried to apply a ratio-based sampling configuration, but the memory growth is still there.
We checked the OTel collector metrics and see no CPU or memory issues, and there are no peaks in spans refused by the collector. We also noticed that the memory growth does not stop even during the night, when traffic is much lower and there should be no latency issues sending traces to the collector. This suggests a memory leak in the Tyk Gateway. We will try to run pprof to get a memory profile of the gateway.
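
For anyone following along, the plan is to enable the gateway’s built-in Go profiler and pull a heap profile from it. As far as I understand, this is done with the enable_http_profiler option (name to be verified against the Tyk docs for your version), which exposes the standard /debug/pprof/ endpoints on the gateway:

    {
      "enable_http_profiler": true
    }

The heap profile can then be fetched with go tool pprof against /debug/pprof/heap.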

Thank you for the information @scelentano. I reported this internally and will be waiting for the pprof data. Thank you :slight_smile:

This is the heap profile:

File: tyk
Build ID: 57d2558493288f76f519ab48387a9d7fa1ce503b
Type: inuse_objects
Time: Sep 29, 2023 at 12:29pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 174093, 98.06% of 177537 total
Dropped 78 nodes (cum <= 887)
Showing top 10 nodes out of 87
      flat  flat%   sum%        cum   cum%
     72095 40.61% 40.61%      72095 40.61%  go.opentelemetry.io/otel/internal/global.(*meter).Int64Counter
     45878 25.84% 66.45%      45878 25.84%  go.opentelemetry.io/otel/internal/global.(*meter).Float64Histogram
     35468 19.98% 86.43%      35468 19.98%  regexp.(*Regexp).FindAllStringSubmatch.func1
      6554  3.69% 90.12%       6554  3.69%  runtime.(*scavengerState).init
      4612  2.60% 92.72%       4612  2.60%  runtime.allocm
      3782  2.13% 94.85%       3782  2.13%  runtime.malg
      3277  1.85% 96.69%       5184  2.92%  regexp.compile
      1517  0.85% 97.55%       1517  0.85%  regexp.onePassCopy
       910  0.51% 98.06%        910  0.51%  github.com/sirupsen/logrus.(*Entry).WithFields
         0     0% 98.06%       1256  0.71%  crypto/tls.(*Conn).HandshakeContext
(pprof)

This seems to show that the meter package used by otelhttp, and more specifically the maps holding the metric counters, keep accumulating on the heap. Any idea?

Could the reason the service is leaking memory be that an otelhttp handler (along with its counters) is created for each request?

Not sure. Would you be able to share the pprof files? Also, are you using 5.2.0? ARM or AMD?

If not, could you share the in-use memory profiles? Our engineers are trying to replicate this, and that info would be very helpful.

Does decreasing the sampling ratio slow down the rate of memory consumption?

Hello,
yes, we’re using Tyk API Gateway 5.2.0 (tykio/tyk-gateway:5.2.0).
The arch is amd64. I will share the profiles ASAP.
Yes, decreasing the sampling ratio slows down the memory growth, but it still increases (more slowly) until the pods reach their limit, at which point they crash and restart.

Perfect, thank you for all the information. Once we have the profiles we should be able to identify the culprit.

Hello,
go tool pprof --inuse_objects https://tyk.xyz.com/debug/pprof/heap
Unfortunately I could not attach the file, so here is a link where you can download it: Filebin | l0ahb6rrj7f1vzwo
Thank you for the support!

Thank you @scelentano for your help! We got the issue and we are looking to release the fix in 5.2.1. Stay tuned :slight_smile:

Thank you so much for the support! Do you know when the release is planned?
Thank you!

Hi @scelentano, we actually delayed the 5.2.1 release to include this fix; it was supposed to go out last week. The fix just passed QA. My guess would be that it goes out next week at the latest. I will keep the thread updated :slight_smile:

@scelentano the fix is available on 5.2.1-rc6 if you want to test it. :slight_smile:

@scelentano 5.2.1 was released yesterday. :slight_smile:
