Faulty rate limit tracking

Hi,
We have been observing that rate limiting is not being tracked according to the policies set for our API keys in a staging deployment of Tyk. We only see this in one of our two staging Tyk environments, both of which run the same gateway and dashboard versions. Please find below the JSON for a sample policy and the corresponding key for an API where we are seeing this.

Policy -
{
    "_id" : ObjectId("58983b666971d10001e8c51d"),
    "org_id" : "xxxxxxxxxxxxxx",
    "rate" : 20,
    "per" : 60,
    "quota_max" : NumberLong(-1),
    "quota_renewal_rate" : NumberLong(60),
    "access_rights" : {
        "xxxxxxxxxxxxxx" : {
            "apiname" : "httpbin",
            "apiid" : "xxxxxxxxxxxxxx",
            "versions" : [
                "Default"
            ],
            "allowed_urls" : []
        }
    },
    "hmac_enabled" : false,
    "active" : true,
    "name" : "TestPolicy",
    "is_inactive" : false,
    "date_created" : Date(-62135596800000),
    "tags" : [
        "testpolicy"
    ],
    "key_expires_in" : NumberLong(0),
    "partitions" : {
        "quota" : true,
        "rate_limit" : true,
        "acl" : true
    },
    "last_updated" : "1497848209"
}

API key value in Redis -
{
    "last_check": 0,
    "allowance": 20,
    "rate": 20,
    "per": 60,
    "expires": 0,
    "quota_max": -1,
    "quota_renews": 1497848223,
    "quota_remaining": -1,
    "quota_renewal_rate": 60,
    "access_rights": {
        "xxxxxxxxxxxxx": {
            "api_name": "httpbin",
            "api_id": "xxxxxxxxxxxxx",
            "versions": [
                "Default"
            ],
            "allowed_urls": []
        }
    },
    "org_id": "xxxxxxxxxxxxx",
    "oauth_client_id": "",
    "oauth_keys": null,
    "basic_auth_data": {
        "password": "",
        "hash_type": ""
    },
    "jwt_data": {
        "secret": ""
    },
    "hmac_enabled": false,
    "hmac_string": "",
    "is_inactive": false,
    "apply_policy_id": "58983b666971d10001e8c51d",
    "data_expires": 0,
    "monitor": {
        "trigger_limits": null
    },
    "enable_detail_recording": false,
    "meta_data": {
        "Contact": "[email protected]"
    },
    "tags": [
        "testpolicy"
    ],
    "alias": "[email protected]",
    "last_updated": "1497848209",
    "id_extractor_deadline": 0,
    "session_lifetime": 0
}

In this example the key/policy rate limit is set to 20 requests per 60 seconds, yet the 10th request already returns a 429 (Rate limit exceeded) error. Similarly, with the rate set to 5, every request from the 3rd onwards returns 429. In other words, only one fewer than half the number of requests configured in the key's rate limit is allowed. We see this for all APIs in that Tyk environment. Has anyone observed this before? If so, please suggest a solution. Thanks in advance.
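
For illustration, a minimal check along these lines is enough to see the pattern (this is only a sketch; the URL and key below are placeholders, not our real values):

import requests  # assumes the requests library is available

URL = "https://gateway.example.com/httpbin/get"   # placeholder proxy URL for the httpbin API
HEADERS = {"Authorization": "REDACTED-API-KEY"}   # placeholder for the key shown above
LIMIT = 20                                        # the policy allows 20 requests per 60-second window

ok = limited = 0
for _ in range(LIMIT):
    status = requests.get(URL, headers=HEADERS, timeout=5).status_code
    if status == 429:
        limited += 1
    else:
        ok += 1

print(f"{ok} succeeded, {limited} returned 429 out of {LIMIT} requests in one window")
# With the behaviour described above, roughly the second half of these requests come back as 429.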

Are you sending requests to both servers via a load balancer?

Tyk detects other nodes in the cluster and will increase the value of a token in its leaky bucket depending on how many servers there are and the relative load on each instance.

This is why you see this “halving” behaviour - but this could only really happen if you are sending requests to the gateways unevenly?
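
As a very rough illustration of where those numbers come from (this is a simplification for illustration only, not the actual distributed rate limiter algorithm, which is dynamic and load-aware):

def effective_per_node_limit(rate, active_nodes):
    # Approximate share of the token's allowance each gateway ends up enforcing
    # if the traffic is split evenly across the cluster.
    return rate / active_nodes

print(effective_per_node_limit(20, 2))  # ~10  - matches the 429 seen on the 10th request
print(effective_per_node_limit(5, 2))   # ~2.5 - matches 429 from the 3rd request onwards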

Yes, our gateway environment is load balanced with 2 nodes. However, only one of the gateway nodes should handle the proxying of any given request, right? Also, we are seeing this in a load test with a fairly even load on the API, and manual testing of the API yields the same results.

To add to that, analytics shows only one entry per request sent, but the rate limit appears to be counted twice for each.

I see - have you tried with a per-second rate limit? This might only affect higher time intervals; we're investigating anyway to try and replicate.

Yes, we set the rate to 100 per second for an API and ran a load test. The team running the test reports that they never exceeded 5-6 requests per second at any point, yet they still observed failures due to 429 errors.
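
In policy terms, a 100-per-second limit just corresponds to:

"rate" : 100,
"per" : 1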

Ok that’s really odd - can you share your tyk.conf please?

tyk.conf -

{
    "listen_port": 8080,
    "secret": "xxxxxxxxxxxxxxxxxxx",
    "node_secret": "xxxxxxxxxxxxxxxxxxx",
    "template_path": "/opt/tyk-gateway/templates",
    "tyk_js_path": "/opt/tyk-gateway/js/tyk.js",
    "middleware_path": "/opt/tyk-gateway/middleware",
    "use_db_app_configs": true,
    "db_app_conf_options": {
        "connection_string": "xxxxxxxxxxxxxxxxxxx",
        "node_is_segmented": false,
        "tags": ["tyk int preprod env"]
    },
    "app_path": "/opt/tyk-gateway/apps/",
    "storage": {
        "type": "redis",
        "host": "xxxxxxxxxxxxxxxxxxx",
        "port": 6379,
        "username": "",
        "password": "",
        "database": xxxxxxxxxxxxxxxxxxx,
        "optimisation_max_idle": 100
    },
    "enable_analytics": true,
    "health_check": {
        "enable_health_checks": true,
        "health_check_value_timeouts": 60
    },
    "optimisations_use_async_session_write": true,
    "enable_non_transactional_rate_limiter": true,
    "enable_sentinel_rate_limiter": false,
    "allow_master_keys": false,
    "policies": {
        "policy_source": "service",
        "policy_connection_string": "xxxxxxxxxxxxxxxxxxx"
    },
    "hash_keys": false,
    "close_connections": true,
    "allow_insecure_configs": true,
    "coprocess_options": {
        "enable_coprocess": false,
        "coprocess_grpc_server": ""
    },
    "enable_bundle_downloader": true,
    "bundle_base_url": "",
    "global_session_lifetime": 100,
    "force_global_session_lifetime": false,
    "max_idle_connections_per_host": 100,
    "enable_custom_domains": true,
    "http_server_options": {
        "enable_websockets": true,
        "flush_interval": 1
    }
}

Your conf looks OK (though you really should disable the health check API, it’s a performance drag)
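
To turn it off, the health_check block in your tyk.conf just needs:

"health_check": {
    "enable_health_checks": false
}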

We are still trying to replicate on the release branch, but I set up a little straw-man test against a cluster I am running from master, using the distributed limiter behind an L3 (DNS) round-robin to 3 gateways. The test token had a limit of 10 requests per second (the blue line in the chart is the successful request rate, which is the one we want to pay attention to):

I know you can't see it properly there, but the rate hovers around 10 requests per second, as expected (the test request rate was a flat 50 requests per second for 1 minute).

If I lift the rate limit for the key to 100 per second and re-run the test, we see this:

Basically we see far fewer errors and a higher success rate (the blip at the start is the level re-adjusting as we re-used the same token).

One thing worth checking, are there any other tyk gateways using the same redis DB outside of the two mentioned here?
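
If you want a quick way to check, something like this (a sketch using the redis-py client; the host, port and DB index are placeholders for your environment) will list the hosts currently connected to that Redis database:

import redis  # assumes the redis-py client is available

r = redis.Redis(host="redis.example.com", port=6379, db=0, decode_responses=True)  # placeholders
hosts = sorted({client["addr"].split(":")[0] for client in r.client_list()})
print(hosts)  # any address that isn't one of your two gateways (or the dashboard/pump) is worth a look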

Thanks Martin. I will run these observations by my team and get back on this shortly. In the meantime I'll try to redeploy the environment and replace the underlying gateway nodes (thereby registering new nodes as the active nodes) to see if that helps. Also, no other gateway uses the same Redis cluster, let alone the same Redis DB within it; each of our Tyk deployments has its own Redis cluster.

We’re going to try and replicate on our side to make sure we’re not making faulty assumptions.

Hi Martin, should we look at any configuration changes that will resolve this? We haven’t had any success so far.

We haven’t been able to replicate this yet - we have our QA looking at it, but as I said - running my own test showed expected results.

Is there anything interesting in the gateway process logs?

One useful diagnostic would be the clean logs of a freshly started gateway that exhibits this error - they might tell us more about your setup.