MDCB vs distributed gateways and a single redis/dashboard

Assume we have 4 data centers:

  1. Canada North
  2. Canada Center
  3. Canada West
  4. Canada East

What advantages would the MDCB setup have over the simple setup?

MDCB setup

  1. Canada North (Master Datacenter)
  2. Canada Center (Slave Datacenter)
  3. Canada West (Slave Datacenter)
  4. Canada East (Slave Datacenter)

For this setup, I basically followed this guide: https://tyk.io/docs/manage-multiple-environments/with-on-premise/multi-data-center-bridge/

Disadvantages:

  • Additional cost?
  • Delayed checks for key, rate checking, since these are “cached” locally

But the significant advantage would be avoiding cross-data-center round trips for key and rate limit checking.

Simple setup

  1. Canada North (Dashboard, Gateways)
  2. Canada Center (Gateways)
  3. Canada West (Gateways)
  4. Canada East (Gateways)

In this setup, the gateways from the different regions connect to the dashboard instance in Canada North.

Disadvantages:

  • Request round trips across data centers for key and rate checking

There are 3 main advantages:

1. Better uptime during crisis:

Imagine a scenario where your master DC burns down, so no MongoDB, no Redis, no MDCB component, and no master Dashboard or Gateway. At the same time, you are experiencing increasing load in your other DCs.

With the dashboard-only setup (where you have a master gateway and dashboard with the increased round trip time):

  1. Gateways in other DCs would not be able to scale, since the gateways cannot bootstrap from the master dashboard
  2. Gateways in other DCs will no longer be able to validate most requests because Redis has failed
  3. Traffic would grind to a halt until the DB and Redis can be brought back on-line

With MDCB-enabled gateways:

  1. Gateways “stash” an encrypted version of their API and Policy configuration in the local redis
  2. Gateways that are coming online during a scaling event can detect master MDCB downtime and will use the “last good” configuration found in redis
  3. Since running gateways have already been caching the tokens in the active traffic flow from MDCB up until the downtime event, all gateways can continue to service existing traffic; only new tokens will be rejected (and this can be mitigated by injecting those directly into the gateways using the local slaved gateway API - see the sketch after this list)
  4. Once master is restored, the gateways will all hot-reload to fetch new configurations and resume normal operations
  5. Gateways will only record a buffered window of analytics so as not to overwhelm redis or flood MDCB when it comes back online
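
On point 3, a rough sketch of that mitigation: a known token can be pushed straight into a slaved gateway over its local Gateway API. The host, gateway secret, key ID and session fields below are all placeholders, and the exact session object will depend on your policies:

    # placeholder host, secret, key ID and policy ID - adapt to your own deployment
    curl -X POST http://slaved-gateway:8080/tyk/keys/my-existing-token \
      -H "x-tyk-authorization: <gateway-secret-from-tyk.conf>" \
      -H "Content-Type: application/json" \
      -d '{
            "org_id": "{ORGID}",
            "apply_policy_id": "<policy-id>",
            "allowance": 1000,
            "rate": 1000,
            "per": 60,
            "expires": -1
          }'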

Overall, the MDCB based setup is more resilient. It also means that redis and mongo in master can be configured for DR (hot backup or replica to another DC), and use DNS switching to swap over to a hot standby.

Admittedly the above DR can be handled with the dashboard-only setup, but it does involve additional licensing and you still have the problem of immediate downtime.

2. Latency reduction

Because the gateways cache keys and all operations locally, all operations can be geographically localised - this means that traffic from Canada to Canada will have rate limiting and checks applied within the same DC, and round trip time is massively reduced.

Also, the lookup to MDCB is via a resilient, compressed RPC channel that is designed to handle ongoing and unreliable connectivity. It is also encrypted, and so safer to use over the open internet or inter-DC links.

This can be done with Redis using new TLS features, but (as far as I am aware) the gateways do not support this yet.

3. Organisational Benefits

MDCB-slaved gateways are tied to a single organisation in the dashboard - this means that you can set up different teams as organisations in the dashboard, and each team can run its own set of gateways that are logically isolated.

This can be achieved with a dashboard-only setup, but it requires gateway sharding (tagging) and behavioural policy on the user's side to ensure that all APIs are tagged correctly; otherwise they do not load.

With an MDCB setup you get the ability to do both - segment out teams with their own gateway clusters, and also sub-segment those gateways with tagging.
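
For comparison, the tagging route in a dashboard-only setup is driven by segment tags in each gateway's tyk.conf. A minimal sketch (the hostname and tag names are just examples) - only APIs tagged with one of these values in the Dashboard will load on that node:

    "db_app_conf_options": {
        "connection_string": "http://dashboard-canada-north:3000",
        "node_is_segmented": true,
        "tags": ["team-a", "canada-west"]
    }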

Potentially:

  • Additional cost: Yes MDCB has a separate price point

  • Delayed checks for key, rate checking, since these are “cached” locally: This only happens on the first request that any single gateway in a DC cluster sees, which makes one round-trip to the master DC to fetch and cache the data locally. After this, all operations happen locally, so the setup behaves the same way as a fully DC-local gateway deployment would (see above). If using “generative” tokens, such as JWT or the OIDC protocol, then there is no round-trip at all.

This is only partially correct: the gateways will register with the dashboard in Canada North, but they will also need to connect to the Redis DB there, which may be undesirable, as it is usually recommended to lock Redis down into its own VPC and only allow trusted, firewalled traffic.

Hope that helps 🙂


This is super helpful. I will be sure to use this post to sell it to the higher-up business folks with the credit card!

Additional cost: Yes MDCB has a separate price point

Roughly, how is MDCB component priced? Or is this a bit more complicated and would require a custom quote?

Also, what level of difficulty would be involved in getting a demo key for the MDCB component?

Thanks Martin!


One of our account team will get in touch with pricing and quotes; getting a demo key can definitely be arranged.

Cheers,
M.

I assume the MDCB component is a separate binary, which I would need to get (from an account rep?) and install on premises?

With regard to Tyk Pump, I assume each master/slave node would have an instance of Tyk Pump to aggregate the data from the local Redis back to the master Mongo instance for analytics?

The below is taken from: https://tyk.io/docs/manage-multiple-environments/with-on-premise/multi-data-center-bridge/mdcb-setup/

    "slave_options": {
        "use_rpc": true,
        "rpc_key": "{ORGID}",
        "api_key": "{APIKEY}",
        "connection_string": "{your-mdcb-instance-domain:9090}",
        "enable_rpc_cache": true,
        "bind_to_slugs": true,
        "group_id": "ny",
        "use_ssl" : true,
        "ssl_insecure_skip_verify", true
    },
    
    "auth_override": {
        "force_auth_provider": true,
        "auth_provider": {
            "name": "",
            "storage_engine": "rpc",
            "meta": {}
        }
    }

I am curious: what are the ORGID and APIKEY?

ORGID (55780af69b23c30001000049) I am assuming to be the org that we just prepared for MDCB access? (What if I have multiple orgs?)

And where would I be able to find the APIKEY here?

    GET /admin/organisations/{org-id}
    
    {
        "_id" : "55780af69b23c30001000049",
        "owner_slug" : "portal-test",
        "developer_quota" : 500,
        "hybrid_enabled" : false,
        "ui" : {
            "uptime" : {},
            "portal_section" : {},
            "designer" : {},
            "dont_show_admin_sockets" : false,
            "dont_allow_license_management" : false,
            "dont_allow_license_management_view" : false,
            "login_page" : {},
            "nav" : {}
        },
        "owner_name" : "Portal Test",
        "cname_enabled" : true,
        "cname" : "api.test.com",
        "apis" : [ 
            {
                "api_human_name" : "HttpBin (again)",
                "api_id" : "2fdd8512a856434a61f080da67a88851"
            }
        ],
        "developer_count" : 1,
        "event_options" : {}
    }

Yes you will - I’ll send you a link over.

The slave nodes actually pump to MDCB. You can then have MDCB either push the raw data to Redis for a Pump at the master level to fetch, or have MDCB write the analytics itself. If you want to siphon data into other BI tools or time-series DBs, then pushing the data from MDCB to Pump in master is the way to go, since you can then use multiple pumps.
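
To illustrate the Redis-plus-master-Pump option: MDCB can be told to forward the raw analytics records into Redis rather than writing them itself (forward_analytics_to_pump in the MDCB config, if I recall the option name correctly), and a Pump in the master DC then drains that Redis into Mongo for the Dashboard. A minimal pump.conf sketch with placeholder hostnames - the exact option names are worth verifying against the Pump docs for your version:

    {
        "analytics_storage_type": "redis",
        "analytics_storage_config": {
            "host": "redis-master",
            "port": 6379
        },
        "purge_delay": 10,
        "pumps": {
            "mongo-aggregate": {
                "type": "mongo-pump-aggregate",
                "meta": {
                    "mongo_url": "mongodb://mongo-master:27017/tyk_analytics",
                    "use_mixed_collection": true
                }
            }
        }
    }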

ORG ID is the ID of the organisation you created in your master dashboard. You can find both the API Key and the Org ID in the Users section of the dashboard (either create or select an existing user; both values will show in the profile) - ideally you would create a dedicated user for the MDCB slaves to log in as.

Martin,

I think I can speak for the entire community in stating that your detailed and thoughtful responses are invaluable.

I have a similar expected scenario, but instead of redundancy among different installations, I have what we can probably call “NOC mode.” We are rolling Tyk into a larger install that will be placed in a client’s DC or AWS account (they hold the uptime responsibility). However, we want to be able to monitor traffic, errors and response time in a centralized location. Each client install does not need the Dashboard UI; only we need it centrally to perform this “NOC” operation.

I already have a conversation going with the sales team, but technically is this possible and what would it look like? They have suggested MDCB, but from the description above, this is not exactly what I am after.

Thanks for your help.

@cajund It depends on where the APIs are being configured from. If the configuration is happening with your team in your NOC, and the client gateways in AWS / their own DC are remotely controlled/configured from there, then MDCB is indeed recommended.

One of the main design ideas behind MDCB was to enable multi-DC gateway management and fragmentation from a central location; the DR and availability aspects are additional benefits that come with such a distributed setup.

If the gateways are actually independent and configured locally at the client’s end (perhaps by your team on site), then you could use multiple Pro installs and look into a custom Tyk Pump module that can shuffle analytics data back up to a data sink you control (this could be almost anything - e.g. Elasticsearch / Influx). We provide a few out of the box, but you can roll your own quite easily. This would fulfil the remote monitoring requirement but would make management more arduous.
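
As a sketch of that Pump-based approach: each client install runs its own Pump against its local Redis and ships records to a sink you host in the NOC - for example the built-in Elasticsearch pump. The URL and index name are placeholders, and the meta fields should be checked against the Pump docs for the version you deploy:

    {
        "analytics_storage_type": "redis",
        "analytics_storage_config": {
            "host": "client-local-redis",
            "port": 6379
        },
        "pumps": {
            "elasticsearch": {
                "type": "elasticsearch",
                "meta": {
                    "elasticsearch_url": "https://noc-analytics.example.com:9200",
                    "index_name": "tyk_analytics",
                    "enable_sniffing": false
                }
            }
        }
    }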

@billxinli (I briefly saw your post before it was withdrawn - it could be that the API key you were using was not properly set; this can happen sometimes, and just resetting the user API key can fix a login problem)

Thanks again Martin. This latter configuration may be what we are after. I will let Andrew know.

Cheers!

Thanks! Yes, a cached Docker image, which contained cached keys. Since the config (org ID and auth key) needs to be part of tyk.conf, it makes the bootstrap process a little bit more complicated.
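
One way to keep a baked image generic (assuming the gateway version in use supports environment-variable overrides of tyk.conf via the TYK_GW_ prefix) is to inject the slave options at container start instead of hard-coding them. The variable names below follow that convention but are worth double-checking against the docs, and the connection string and group ID are placeholders:

    # variable names assume the TYK_GW_ env-override convention - verify against the docs
    docker run -d \
      -e TYK_GW_SLAVEOPTIONS_USERPC=true \
      -e TYK_GW_SLAVEOPTIONS_RPCKEY="{ORGID}" \
      -e TYK_GW_SLAVEOPTIONS_APIKEY="{APIKEY}" \
      -e TYK_GW_SLAVEOPTIONS_CONNECTIONSTRING="mdcb.example.com:9090" \
      -e TYK_GW_SLAVEOPTIONS_GROUPID="canada-west" \
      tykio/tyk-gateway:latest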

Also, one test that I ran was:

  • Start running curl on loop, hitting one of the API endpoints.
  • Shutting down the main gateway… everything is running
  • Shutting down the main dashboard… everything is running
  • Shutting down the sink… everything is running, BUT requests from curl will essentially be blocked because the gateway is trying to contact the sink, which fails since the sink is down. Once the request times out, the curl process proceeds again, until the gateway tries to contact the sink again and blocks. Is this the desired behaviour? (This makes the sink feel like a single point of failure.)
gateway6_1                     | time="Mar  6 16:01:19" level=error msg="Can't purge cache, failed to ping RPC: gorpc.Client: [sink:9090]. Cannot obtain response during timeout=20s"
gateway5_1                     | time="Mar  6 16:01:23" level=error msg="Can't purge cache, failed to ping RPC: gorpc.Client: [sink:9090]. Cannot obtain response during timeout=20s"
gateway4_1                     | time="Mar  6 16:01:23" level=error msg="Can't purge cache, failed to ping RPC: gorpc.Client: [sink:9090]. Cannot obtain response during timeout=20s"
gateway5_1                     | time="Mar  6 16:01:27" level=info msg="Can't connect to RPC layer"
gateway4_1                     | time="Mar  6 16:01:27" level=info msg="Can't connect to RPC layer"
gateway6_1                     | time="Mar  6 16:01:27" level=info msg="Can't connect to RPC layer"
-> time out here
router_1                       | [06/Mar/2018:16:01:30 +0000] 172.28.0.1 - - -   to: 172.28.0.12:8080: GET /customer-management-api/customer HTTP/1.0 upstream_response_time - msec 1520352090.334 request_time 30.025
gateway5_1                     | time="Mar  6 16:01:37" level=warning msg="Keysapce warning: gorpc.Client: [sink:9090]. Cannot obtain response during timeout=30s"
gateway4_1                     | time="Mar  6 16:01:37" level=warning msg="Keysapce warning: gorpc.Client: [sink:9090]. Cannot obtain response during timeout=30s"
gateway6_1                     | time="Mar  6 16:01:37" level=warning msg="Keysapce warning: gorpc.Client: [sink:9090]. Cannot obtain response during timeout=30s"
gateway6_1                     | time="Mar  6 16:01:39" level=error msg="Can't purge cache, failed to ping RPC: gorpc.Client: [sink:9090]. Cannot obtain response during timeout=20s"
gateway5_1                     | time="Mar  6 16:01:43" level=error msg="Can't purge cache, failed to ping RPC: gorpc.Client: [sink:9090]. Cannot obtain response during timeout=20s"
gateway4_1                     | time="Mar  6 16:01:43" level=error msg="Can't purge cache, failed to ping RPC: gorpc.Client: [sink:9090]. Cannot obtain response during timeout=20s"
gateway5_1                     | time="Mar  6 16:01:57" level=info msg="Can't connect to RPC layer"
gateway4_1                     | time="Mar  6 16:01:57" level=info msg="Can't connect to RPC layer"
gateway6_1                     | time="Mar  6 16:01:57" level=info msg="Can't connect to RPC layer"

Also, I am not sure if this is a feature or bug, but with MDCB, you only see the master gateway under the “Active Tyk Nodes” in the dashboard, and the slave gateways are not displayed in this list.

We also see that only 1 gateway license is used. (Which I assume is why MDCB has its own cost?)

Well, it is definitely not a feature, more like a bug. Not really related to licensing, more like a technical issue. But it is in our backlog.