Hello!
After upgrading to 5.2 we are seeing increasing memory usage on the gateway pods (they are slowly reaching our limits). Is this expected behaviour?
Thanks!
We’re using Tyk API Gateway 5.2.0 (tykio/tyk-gateway:5.2.0). This is happening in a production scenario where we have tens of API definitions. We do have Golang middlewares running, but we had no memory issues when we were running 4.0.1.
This is our Tyk conf file:
{
  "listen_port": 8080,
  "template_path": "/opt/tyk-gateway/templates",
  "tyk_js_path": "/opt/tyk-gateway/js/tyk.js",
  "use_db_app_configs": false,
  "app_path": "/opt/tyk-gateway/api-definitions/",
  "log_level": "info",
  "storage": {
    "type": "redis",
    "host": "xxx",
    "port": 6379,
    "username": "tyk-thirdparty",
    "password": "<%= data.redis_password %>",
    "use_ssl": true,
    "database": 0,
    "optimisation_max_idle": 2000,
    "optimisation_max_active": 4000
  },
  "enable_analytics": true,
  "analytics_config": {
    "purge_delay": -1,
    "ignored_ips": []
  },
  "health_check": {
    "enable_health_checks": true,
    "health_check_value_timeouts": 60
  },
  "optimisations_use_async_session_write": true,
  "enable_non_transactional_rate_limiter": true,
  "enable_sentinel_rate_limiter": false,
  "enable_redis_rolling_limiter": false,
  "allow_master_keys": false,
  "policies": {
    "policy_source": "file",
    "policy_record_name": "/opt/tyk-gateway/policies/policies.json"
  },
  "hash_keys": true,
  "close_connections": false,
  "http_server_options": {
    "enable_websockets": true
  },
  "allow_insecure_configs": true,
  "coprocess_options": {
    "enable_coprocess": true,
    "coprocess_grpc_server": "tcp://:9111",
    "grpc_recv_max_size": 33554432,
    "grpc_send_max_size": 33554432
  },
  "dns_cache": {
    "enabled": true,
    "ttl": 30,
    "multiple_ips_handle_strategy": "random"
  },
  "enable_bundle_downloader": true,
  "bundle_base_url": "",
  "global_session_lifetime": 100,
  "force_global_session_lifetime": true,
  "max_idle_connections_per_host": 500,
  "enable_jsvm": false,
  "opentelemetry": {
    "enabled": false,
    "endpoint": "opentelemetry-collector.monitoring:4317",
    "resource_name": "tyk thirdparty (production)"
  },
  "newrelic": {
    "app_name": "---",
    "license_key": "<%= data.newrelic_license_key %>"
  }
}
Hello again,
we noticed that when we run Tyk Gateway 5.2 with opentelemetry disabled, we don’t have the increasing memory issue. Should we tune the opentelemetry integration by configuring sampling for tracing?
Hello @scelentano, this behaviour seems consistent with the OTel collector not processing traces fast enough. The gateway will hold traces in memory until they are processed by the collector. Can you take a look at the memory and CPU consumption of the collector? What does it look like?
Please check out this section on OTel sampling. By default we sample 50% of traces; depending on the resources available and what you need, you may have to adjust that.
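For reference, this is roughly what ratio-based sampling means in OpenTelemetry Go SDK terms. It is a minimal sketch only, not the gateway’s actual internals, and the 0.1 ratio is just an example value:

// Sketch only: illustrates ratio-based sampling with the OTel Go SDK.
// No exporter is configured here; the ratio is a placeholder.
package main

import (
    "go.opentelemetry.io/otel"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    // Keep roughly 10% of root traces. ParentBased makes child spans follow
    // the root span's sampling decision, so traces are kept or dropped whole.
    sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))

    tp := sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
    otel.SetTracerProvider(tp)
}

Lowering the ratio reduces how many whole traces the gateway keeps in memory while waiting for the collector to export them.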
Hello,
we tried applying a ratio-based sampling configuration, but the memory growth is still there.
We checked the OTel collector metrics and see no CPU or memory issues, and no peaks in spans refused by the collector. We also noticed that the memory growth does not stop even during the night, when traffic is much lower and there should be no latency issues sending traces to the collector. This suggests a memory leak in Tyk’s Gateway. We will run pprof to get a memory profile of the gateway.
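For anyone following along, this is roughly how a Go service exposes the /debug/pprof endpoints we intend to query. A minimal sketch, assuming the gateway wires up the standard net/http/pprof handlers when profiling is enabled; the port is arbitrary:

// Minimal sketch: importing net/http/pprof registers the /debug/pprof/*
// handlers on http.DefaultServeMux, including /debug/pprof/heap.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // side-effect import: registers the profiling routes
)

func main() {
    log.Fatal(http.ListenAndServe(":6061", nil))
}

The in-use heap can then be pulled with go tool pprof -inuse_objects http://<host>:6061/debug/pprof/heap.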
Thank you for the information @scelentano. I reported this internally and will be waiting for the pprof data. Thank you
This is the heap profile:
File: tyk
Build ID: 57d2558493288f76f519ab48387a9d7fa1ce503b
Type: inuse_objects
Time: Sep 29, 2023 at 12:29pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 174093, 98.06% of 177537 total
Dropped 78 nodes (cum <= 887)
Showing top 10 nodes out of 87
      flat  flat%   sum%        cum   cum%
     72095 40.61% 40.61%      72095 40.61%  go.opentelemetry.io/otel/internal/global.(*meter).Int64Counter
     45878 25.84% 66.45%      45878 25.84%  go.opentelemetry.io/otel/internal/global.(*meter).Float64Histogram
     35468 19.98% 86.43%      35468 19.98%  regexp.(*Regexp).FindAllStringSubmatch.func1
      6554  3.69% 90.12%       6554  3.69%  runtime.(*scavengerState).init
      4612  2.60% 92.72%       4612  2.60%  runtime.allocm
      3782  2.13% 94.85%       3782  2.13%  runtime.malg
      3277  1.85% 96.69%       5184  2.92%  regexp.compile
      1517  0.85% 97.55%       1517  0.85%  regexp.onePassCopy
       910  0.51% 98.06%        910  0.51%  github.com/sirupsen/logrus.(*Entry).WithFields
         0     0% 98.06%       1256  0.71%  crypto/tls.(*Conn).HandshakeContext
(pprof)
This somehow shows that the meter package from otelhttp, and more specifically the maps holding the metric counters, are piling up on the heap. Any idea?
Could the reason the service is leaking resources be that an otelhttp handler (and its counters) is created for each request?
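To illustrate what I mean, here is a hypothetical sketch (not Tyk’s actual code) contrasting wrapping a handler once with wrapping it on every request via otelhttp. As I understand it, the instruments are created when the handler is constructed; the route names and port below are placeholders:

// Hypothetical sketch of the suspected pattern, not Tyk's actual code.
// otelhttp.NewHandler sets up metric instruments (request counters, duration
// histograms) when it is constructed, so constructing it per request keeps
// allocating new instruments instead of reusing one set.
package main

import (
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

var backend = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("ok"))
})

// Suspected leaky pattern: a fresh instrumented handler for every request.
func perRequest(w http.ResponseWriter, r *http.Request) {
    otelhttp.NewHandler(backend, "proxy").ServeHTTP(w, r)
}

// Expected pattern: wrap once at startup and reuse the same handler.
var wrappedOnce = otelhttp.NewHandler(backend, "proxy")

func main() {
    http.Handle("/per-request", http.HandlerFunc(perRequest))
    http.Handle("/wrapped-once", wrappedOnce)
    http.ListenAndServe(":8081", nil)
}

If the gateway followed the first pattern, new Int64Counter and Float64Histogram instruments would be created on every request, which would line up with what the heap profile shows.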
Not sure. Would you be able to share the pprof files? Also, are you using 5.2.0? ARM or AMD?
If not, could you share the inuse_memory profiles? Our engineers are trying to replicate this and this info would be very helpful.
Does decreasing the sampling ratio slow down the rate of memory consumption?
Hello,
yes, we’re using Tyk API Gateway 5.2.0 (tykio/tyk-gateway:5.2.0).
The arch is amd64. I will share the profiles as soon as possible.
Yes, decreasing the sampling ratio slows down the consumption, but memory still grows (more slowly) until it reaches the pods’ limit, at which point they crash and restart.
Perfect, thank you for all the information. Once we have the profiles we should be able to identify the culprit.
Hello,
go tool pprof --inuse_objects https://tyk.xyz.com/debug/pprof/heap
Unfortunately I could not attach the file, so here is a link where you can download it: Filebin | l0ahb6rrj7f1vzwo
Thank you for the support!
Thank you @scelentano for your help! We found the issue and are looking to release the fix in 5.2.1. Stay tuned!
Thank you so much for the support! Do you know when the release is planned to happen?
Thank you!
Hi @scelentano, we actually delayed the 5.2.1 release to include this fix; it was supposed to go out last week. The fix just passed QA, so my guess is that it will be released next week at the latest. I will keep the thread updated.