High Latency in Tyk

Hi, I did a basic Tyk setup and am currently experiencing really high latency, with caching enabled for the endpoint.

~/Repos/wrk2 (master) $ ./wrk -t1 -c100 -d60s -R3000 --latency http://tyk01.us.ad.mydomain.com:8080/v1/deals?api_key=58b76d8c5e321209fcc2ab79b6d59fc6ae6d4ed3700d76ed116c54cf
Running 1m test @ http://tyk01.us.ad.mydomain.com:8080/v1/deals?api_key=58b76d8c5e321209fcc2ab79b6d59fc6ae6d4ed3700d76ed116c54cf
  1 threads and 100 connections
  Thread calibration: mean lat.: 145.773ms, rate sampling interval: 1119ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.47s    14.42s   46.99s    83.04%
    Req/Sec   173.93    458.42     2.71k    93.18%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%  933.89ms
 75.000%    3.04s
 90.000%   32.77s
 99.000%   46.53s
 99.900%   46.89s
 99.990%   47.02s
 99.999%   47.02s
100.000%   47.02s

It is on the local network, and the other gateway I am testing does not have this issue, so it is definitely not related to the way the network is set up. The box has 2 CPUs and 4 GB of RAM.

The configuration file is: 
{
  "listen_port": 8080,
  "secret": "352d20ee67be67f6340b4c0605b044b7",
  "template_path": "/opt/tyk-gateway/templates",
  "tyk_js_path": "/opt/tyk-gateway/js/tyk.js",
  "use_db_app_configs": true,
  "app_path": "/opt/tyk-gateway/apps",
  "middleware_path": "/opt/tyk-gateway/middleware",
  "storage": {
    "type": "redis",
    "host": "10.17.127.84",
    "port": 6379,
    "username": "",
    "password": "",
    "database": 0,
    "optimisation_max_idle": 2000,
    "optimisation_max_active": 4000
  },
  "enable_analytics": false,
  "analytics_config": {
    "type": "csv",
    "pool_size": 100,
    "csv_dir": "/tmp",
    "mongo_url": "mongodb://mongo02.us.ad.mydomain.com:27017/tyk_analytics",
  ...
    "enable_websockets": true
  },
  "hostname": "",
  "enable_custom_domains": true,
  "enable_jsvm": true,
  "oauth_redirect_uri_separator": ";",
  "coprocess_options": {
    "enable_coprocess": false,
    "coprocess_grpc_server": ""
  },
  "pid_file_location": "./tyk-gateway.pid",
  "allow_insecure_configs": true,
  "public_key_path": "",
  "close_idle_connections": false,
  "allow_remote_config": false,
  "enable_bundle_downloader": true,
  "bundle_base_url": "",
  "global_session_lifetime": 100,
  "force_global_session_lifetime": false,
  "max_idle_connections_per_host": 100,
  "use_syslog": true,
  "syslog_transport": "udp",
  "syslog_network_address": "syslog.mynet.net:514"
}

Is there anything I can do to decrease this at the config level? At the system level?
As I mentioned before, the network interface and the network itself do have the capacity for this throughput, as the other gateway runs on an identical machine on the same network.

Also, I did follow this guide. https://tyk.io/tyk-documentation/deploy-tyk-premise-production

Hmmm, that’s odd, but there are definitely things you can do:

  • Are you getting any log output from the gateway itself? It may indicate a bottleneck.
  • The config seems to be using a DB-based config (dashboard), but the sections to configure that are missing - where is your API definition being loaded from, or was that redacted?
  • Have you tried disabling syslog?
  • One thing that usually catches people out is the ulimit setting - have you increased the file handles for Tyk (see the resource limits section in the guide you linked to)?
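On the ulimit point, a quick way to verify is to check the file-descriptor limit as the user the gateway runs under (a minimal sketch; the `tyk` user name and the 80000 value are assumptions based on the production guide):

```shell
# Show the current soft limit on open file descriptors
# for this shell's user (run it as the gateway user).
ulimit -n

# To raise it persistently, add lines like these to
# /etc/security/limits.conf, then re-login / restart the gateway:
#   tyk  soft  nofile  80000
#   tyk  hard  nofile  80000
```

If the gateway runs under systemd, the service's `LimitNOFILE` setting overrides limits.conf, so check that too.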

Most importantly, you are also missing the major optimisation settings from that guide; you should add the following to your tyk.conf:

"close_connections": true,
"experimental_process_org_off_thread": true,
"enable_non_transactional_rate_limiter": true,

I think you are using the old rate limiter, which does a hard sync with Redis on every request; that is expensive and will add latency.

Thanks for your reply.

The log did not indicate anything suspicious. The ulimit on open file handles was properly set to 80000, as was `/etc/security/limits.conf`.

However, I have tried your suggestions, namely adding the lines you specified to the configuration file as well as turning off syslog completely, and I can now see a substantial improvement in a 3-minute load test:

Running 3m test @ http://tyk01.us.ad.mydomain.com:8080/v1/deals?api_key=58b76d8c5e321209fcc2ab79b6d59fc6ae6d4ed3700d76ed116c54cf
  1 threads and 100 connections
  Thread calibration: mean lat.: 8.230ms, rate sampling interval: 45ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   249.05ms  865.12ms   5.28s    92.72%
    Req/Sec     3.04k     1.45k    7.93k    67.67%
  Latency Distribution (HdrHistogram - Recorded Latency)
 50.000%    2.93ms
 75.000%   27.63ms
 90.000%   96.38ms
 99.000%    4.57s
 99.900%    5.07s
 99.990%    5.17s
 99.999%    5.24s
100.000%    5.29s

The API is defined via a JSON file in the GUI, with caching turned on in the options there:

{
    "id": "58a4fe196404310a035c7174",
    "name": "Deal Service",
    "slug": "deal-service",
    "api_id": "c26c389462a9435375602711e50794f5",
    "org_id": "58990a186404311bde66ab64",
    "use_keyless": false,
    "use_oauth2": false,
    "use_openid": false,
    "openid_options": {
        "providers": [],
        "segregate_by_client": false
    },
    "oauth_meta": {
        "allowed_access_types": [],
        "allowed_authorize_types": [],
        "auth_login_redirect": ""
    },
    "auth": {
        "use_param": true,
        "param_name": "",
        "use_cookie": false,
        "cookie_name": "",
        "auth_header_name": "api_key"
    },
    "use_basic_auth": false,
    "enable_jwt": false,
    "use_standard_auth": true,
    "enable_coprocess_auth": false,
...and so on... 
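For reference, the caching toggled in the GUI corresponds to the `cache_options` block of the API definition. A minimal sketch of what that block looks like (the values here are assumptions for illustration; check your exported definition for the actual ones):

```json
"cache_options": {
    "cache_timeout": 60,
    "enable_cache": true,
    "cache_all_safe_requests": true,
    "cache_response_codes": []
},
```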

So the result looks something like:

[latency distribution graphs did not upload]

So something happens around the 60th percentile, after which there is significant growth in latency. It becomes more apparent when I do 10-minute tests: the latency reaches around 45 ms at the 60th percentile. Any ideas what could be the source of it? It does not correspond to CPU or memory load on the server.

You could try increasing this value:

"max_idle_connections_per_host": 100

It means that only 100 idle upstream connections will be kept around for reuse; beyond that, new connections have to be created for each request, which adds latency.
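For example, in tyk.conf (the value 500 here is just an illustrative number to experiment with, not a recommendation for your workload):

```json
"max_idle_connections_per_host": 500,
```

A reasonable starting point is to set it at or above the number of concurrent connections your load test opens, then re-run the test and watch the upper percentiles.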