Impact:
At 23:20 on 23 February 2016, Tyk Cloud experienced severe downtime lasting 7 hours.
Reason:
The downtime was caused by a capacity issue in our primary gateway cluster, which made the containers fail while the instances remained up. Fallback reporting procedures that should have initiated a failover failed to come online, so the downtime continued until manual intervention brought the services back up.
Actions:
The team are taking twofold action. In the short term, we are ensuring that our automated action/response infrastructure is properly distributed with full redundancy, and implementing interim fixes in our cluster configuration and monitoring services. In the long term, we are re-architecting our overall container infrastructure to better isolate processes that reach capacity problems and to implement custom strategies for handling each failure case.
I’d like to personally apologise to our users for the failure and the delay in handling it. We are doing everything we can to ensure Tyk Cloud is the best cloud-based API management platform available.
Martin Buhr
Founder @Tyk.io