Faulty rate limit tracking

Hi,
We have been observing that rate limiting is not being tracked according to the policies set for our API keys in a staging deployment of Tyk. We only see this in one of our two staging Tyk environments, both of which run the same gateway and dashboard versions. Please find below the JSON for a sample policy and the corresponding key for an API where we are seeing this.

Policy -
{
    "_id" : ObjectId("58983b666971d10001e8c51d"),
    "org_id" : "xxxxxxxxxxxxxx",
    "rate" : 20,
    "per" : 60,
    "quota_max" : NumberLong(-1),
    "quota_renewal_rate" : NumberLong(60),
    "access_rights" : {
        "xxxxxxxxxxxxxx" : {
            "apiname" : "httpbin",
            "apiid" : "xxxxxxxxxxxxxx",
            "versions" : [
                "Default"
            ],
            "allowed_urls" : []
        }
    },
    "hmac_enabled" : false,
    "active" : true,
    "name" : "TestPolicy",
    "is_inactive" : false,
    "date_created" : Date(-62135596800000),
    "tags" : [
        "testpolicy"
    ],
    "key_expires_in" : NumberLong(0),
    "partitions" : {
        "quota" : true,
        "rate_limit" : true,
        "acl" : true
    },
    "last_updated" : "1497848209"
}

API key value in Redis -
{
    "last_check": 0,
    "allowance": 20,
    "rate": 20,
    "per": 60,
    "expires": 0,
    "quota_max": -1,
    "quota_renews": 1497848223,
    "quota_remaining": -1,
    "quota_renewal_rate": 60,
    "access_rights": {
        "xxxxxxxxxxxxx": {
            "api_name": "httpbin",
            "api_id": "xxxxxxxxxxxxx",
            "versions": [
                "Default"
            ],
            "allowed_urls": []
        }
    },
    "org_id": "xxxxxxxxxxxxx",
    "oauth_client_id": "",
    "oauth_keys": null,
    "basic_auth_data": {
        "password": "",
        "hash_type": ""
    },
    "jwt_data": {
        "secret": ""
    },
    "hmac_enabled": false,
    "hmac_string": "",
    "is_inactive": false,
    "apply_policy_id": "58983b666971d10001e8c51d",
    "data_expires": 0,
    "monitor": {
        "trigger_limits": null
    },
    "enable_detail_recording": false,
    "meta_data": {
        "Contact": "[email protected]"
    },
    "tags": [
        "testpolicy"
    ],
    "alias": "[email protected]",
    "last_updated": "1497848209",
    "id_extractor_deadline": 0,
    "session_lifetime": 0
}

In this example the key/policy rate limit is set to 20 requests per 60 seconds, yet the 10th request already returns a 429 (Rate limit exceeded) error. Similarly, with the rate set to 5, every request from the 3rd onwards returns 429. In other words, only one fewer than half the number of requests configured in the key's rate limit is allowed. We see this for all APIs in that Tyk environment. Has anyone observed this before? If so, please suggest a solution. Thanks in advance.
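
For illustration, a minimal check along these lines is enough to see the pattern (this is only a sketch; the URL and key below are placeholders, not our real values):

import requests  # assumes the requests library is available

URL = "https://gateway.example.com/httpbin/get"   # placeholder proxy URL for the httpbin API
HEADERS = {"Authorization": "REDACTED-API-KEY"}   # placeholder for the key shown above
LIMIT = 20                                        # the policy allows 20 requests per 60-second window

ok = limited = 0
for _ in range(LIMIT):
    status = requests.get(URL, headers=HEADERS, timeout=5).status_code
    if status == 429:
        limited += 1
    else:
        ok += 1

print(f"{ok} succeeded, {limited} returned 429 out of {LIMIT} requests in one window")
# With the behaviour described above, roughly the second half of these requests come back as 429.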

Are you sending requests to both servers via a load balancer?

Tyk detects other nodes in the cluster and will increase the value of a token in its leaky bucket depending on how many servers there are and the relative load on each instance.

This is why you see this “halving” behaviour - but this could only really happen if you are sending requests to the gateways unevenly?
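
As a very rough illustration of where those numbers come from (this is a simplification for illustration only, not the actual distributed rate limiter algorithm, which is dynamic and load-aware):

def effective_per_node_limit(rate, active_nodes):
    # Approximate share of the token's allowance each gateway ends up enforcing
    # if the traffic is split evenly across the cluster.
    return rate / active_nodes

print(effective_per_node_limit(20, 2))  # ~10  - matches the 429 seen on the 10th request
print(effective_per_node_limit(5, 2))   # ~2.5 - matches 429 from the 3rd request onwards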

Yes, our gateway environment is load balanced with 2 nodes. However, only one of the gateway nodes should handle the proxying of any given request, right? Also, we are seeing this in a load test with a fairly even load on the API, and manual testing of the API yields the same results.

To add to that, analytics shows only one entry per request sent, but the rate limit appears to be counted twice for each.

I see - have you tried with a per-second rate limit? This might only affect higher time intervals; we're investigating anyway to try and replicate.

Yes, we set the rate to 100 per second for an API and ran a load test. The team running the test reports that they never exceeded 5-6 requests per second at any point, yet they still observed failures due to 429 errors.
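
In policy terms, a 100-per-second limit just corresponds to:

"rate" : 100,
"per" : 1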

Ok that’s really odd - can you share your tyk.conf please?

tyk.conf -

{
    "listen_port": 8080,
    "secret": "xxxxxxxxxxxxxxxxxxx",
    "node_secret": "xxxxxxxxxxxxxxxxxxx",
    "template_path": "/opt/tyk-gateway/templates",
    "tyk_js_path": "/opt/tyk-gateway/js/tyk.js",
    "middleware_path": "/opt/tyk-gateway/middleware",
    "use_db_app_configs": true,
    "db_app_conf_options": {
        "connection_string": "xxxxxxxxxxxxxxxxxxx",
        "node_is_segmented": false,
        "tags": ["tyk int preprod env"]
    },
    "app_path": "/opt/tyk-gateway/apps/",
    "storage": {
        "type": "redis",
        "host": "xxxxxxxxxxxxxxxxxxx",
        "port": 6379,
        "username": "",
        "password": "",
        "database": xxxxxxxxxxxxxxxxxxx,
        "optimisation_max_idle": 100
    },
    "enable_analytics": true,
    "health_check": {
        "enable_health_checks": true,
        "health_check_value_timeouts": 60
    },
    "optimisations_use_async_session_write": true,
    "enable_non_transactional_rate_limiter": true,
    "enable_sentinel_rate_limiter": false,
    "allow_master_keys": false,
    "policies": {
        "policy_source": "service",
        "policy_connection_string": "xxxxxxxxxxxxxxxxxxx"
    },
    "hash_keys": false,
    "close_connections": true,
    "allow_insecure_configs": true,
    "coprocess_options": {
        "enable_coprocess": false,
        "coprocess_grpc_server": ""
    },
    "enable_bundle_downloader": true,
    "bundle_base_url": "",
    "global_session_lifetime": 100,
    "force_global_session_lifetime": false,
    "max_idle_connections_per_host": 100,
    "enable_custom_domains": true,
    "http_server_options": {
        "enable_websockets": true,
        "flush_interval": 1
    }
}

Your conf looks OK (though you really should disable the health check API, it’s a performance drag)
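
To turn it off, the health_check block in your tyk.conf just needs:

"health_check": {
    "enable_health_checks": false
}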

We are still trying to replicate on the release branch, but I set up a little straw-man test against a cluster I am running from master, using the distributed limiter behind an L3 (DNS) round-robin to 3 gateways. The test token had a limit of 10 requests per second (the blue line in the chart is the successful request rate, which is the one we want to pay attention to):

I know you can't see it properly there, but the rate hovers around 10 requests per second, as expected (the test request rate was a flat 50 requests per second for 1 minute).

If I lift the rate limit for the key to 100 per second and re-run the test, we see this:

Basically we see far fewer errors and a higher success rate (the blip at the start is the level re-adjusting as we re-used the same token).

One thing worth checking, are there any other tyk gateways using the same redis DB outside of the two mentioned here?
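
If you want a quick way to check, something like this (a sketch using the redis-py client; the host, port and DB index are placeholders for your environment) will list the hosts currently connected to that Redis database:

import redis  # assumes the redis-py client is available

r = redis.Redis(host="redis.example.com", port=6379, db=0, decode_responses=True)  # placeholders
hosts = sorted({client["addr"].split(":")[0] for client in r.client_list()})
print(hosts)  # any address that isn't one of your two gateways (or the dashboard/pump) is worth a look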

Thanks Martin. I will run these observations by my team and get back on this shortly. In the meantime I'll try to redeploy the environment and replace the underlying gateway nodes (thereby registering new nodes as the active nodes) to see if that helps. Also, no other gateway uses the same Redis cluster, let alone the same Redis DB within it; each of our Tyk deployments has its own Redis cluster.

We’re going to try and replicate on our side to make sure we’re not making faulty assumptions.

Hi Martin, should we look at any configuration changes that will resolve this? We haven’t had any success so far.

We haven’t been able to replicate this yet - we have our QA looking at it, but as I said - running my own test showed expected results.

Is there anything interesting in the gateway process logs?

One useful diagnostic would be the clean logs of a freshly started gateway that exhibits this error - they might tell us more about your setup.