Hi @Olu! I hope you don't mind me asking a few more questions related to this topic: I'm running into the same issue as described in this thread. I have the Open Source Tyk Gateway installed using the Tyk Helm Charts. What I don't understand is the following:
- While experimenting with this issue, I mounted a non-persistent volume (`emptyDir`) to the Gateway just for testing. It worked, but when I killed the pod to see how it behaves on a Gateway pod restart, things went wrong:
  - The Operator had to be restarted to reconcile settings with the Gateway. From this thread I understand that this is by design. Has anything changed since that 2020 post? More to the point, are there any recommendations on how to deal with this: e.g. if the pod gets restarted in production for some reason (like K8s scaling), how do I ensure the Operator updates the Gateway as soon as possible?
  - After forcefully restarting the Operator to trigger a reconciliation, the Operator logs showed errors that it wasn't able to apply the missing policies, and the Gateway showed errors that it couldn't find the requested policies (as the volume was wiped on restart). While I understand that this shouldn't happen with a persistent volume, I was still a bit surprised that the Operator/Gateway couldn't get out of this error state. I had to delete all `SecurityPolicies` and reinstall Operator + Gateway to get it working again. Is there a more graceful way to get the Operator and Gateway to recover from such error states?
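For context, my test setup was just a plain `emptyDir` mount on the Gateway pod spec, roughly like the sketch below (the mount path is what I used for the policies directory and may differ from your image layout):

```yaml
# Sketch of my test setup: a non-persistent volume for the policies
# the Operator writes. emptyDir is wiped when the pod is recreated,
# which is what triggered the error state described above.
apiVersion: v1
kind: Pod
metadata:
  name: tyk-gateway-test
spec:
  containers:
    - name: gateway
      image: tykio/tyk-gateway:latest
      volumeMounts:
        - name: policies
          mountPath: /opt/tyk-gateway/policies  # assumed path, adjust to your image
  volumes:
    - name: policies
      emptyDir: {}  # non-persistent: contents are lost on pod restart
```

With a PersistentVolumeClaim instead of `emptyDir` the files would survive a pod restart, but my question above still stands for the `emptyDir`/scaling case.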
- What if I would like to have multiple Gateway pods for load distribution/redundancy: does the Tyk Operator make sure that all Gateway pods get updated with policy changes, given that these are stored as files inside the containers? I noticed the following note in the installation guide:
> Please note that by default, Gateway runs as a `Deployment` with `replicaCount` of 1. You should not update this part because multiple instances of OSS gateways won't sync the API Definition.

Does this mean that the OSS Gateway on K8s can't be used with multiple pods at all? That would be a challenge for our production workloads, which require some redundancy to spread load and reduce the risk of downtime.