Gateway loses all APIDefinitions and Policies if the gateway pod is restarted

Describe the bug
If the gateway (headless, installed with Helm) is restarted, all APIDefinitions and Policies are deleted even though the APIDefinition and Policy resources still exist. I understand that the gateway keeps the APIDefinitions and Policies at ‘/mnt/tyk-gateway/apps’ and ‘/mnt/tyk-gateway/policies’, which are obviously deleted once the pod is restarted. But it is natural to assume that the operator controller would recreate the objects in the gateway if it gets a 404 from the gateway.

Reproduction steps
Steps to reproduce the behaviour:

  1. Create an APIDefinition
  2. Restart gateway pods
  3. You will get a 404 for that API even though the APIDefinition still exists
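For step 1, a minimal ApiDefinition custom resource might look like the following sketch, based on the Tyk Operator examples; the `httpbin` name, listen path, and upstream target are illustrative assumptions, not taken from the report:

```yaml
# Sketch of a minimal ApiDefinition CR (names and URLs are assumptions).
apiVersion: tyk.tyk.io/v1alpha1
kind: ApiDefinition
metadata:
  name: httpbin
spec:
  name: httpbin
  use_keyless: true
  protocol: http
  active: true
  proxy:
    # Upstream the gateway proxies to, and the path it listens on.
    target_url: http://httpbin.org
    listen_path: /httpbin
    strip_listen_path: true
```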

Actual behaviour
The Operator does not recreate the objects in the gateway. The APIDefinition and Policy resources have to be deleted from Kubernetes and then recreated.

Expected behaviour
APIDefinition, Policy, and other objects should be re-pushed to the gateway when the Operator gets a 404 but the custom resource still exists.

The Tyk Operator has no way to detect that the gateway has restarted, because its latest known status is that those APIs/Policies were loaded. I believe the operator will eventually reconcile based on a configured timeout value, but you will have to wait for that timeout.

I had a similar use case before where I would trigger a change in gateway settings that caused it to restart. I worked around it by adding those settings as annotations on the operator manager. That forced a restart of the operator manager, which ultimately allowed the definitions to reconcile.
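The idea can be sketched as a pod-template annotation on the operator manager Deployment: changing the annotation value triggers a rolling restart, which re-runs reconciliation. The Deployment name and annotation key below are made up for illustration:

```yaml
# Sketch: bump an arbitrary pod-template annotation to force the
# operator manager pods to roll. Name and annotation key are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tyk-operator-controller-manager
spec:
  template:
    metadata:
      annotations:
        # Change this value whenever gateway settings change
        # to trigger a restart of the operator manager.
        tyk.io/config-checksum: "v2"
```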

Thanks @zaid for your reply. I am not sure I understand how this annotation is going to help. Is there any documentation regarding this?

When the annotations are updated, the operator is forced to restart, and therefore it is forced to reconcile the API definitions and add them to the gateway. This is, however, a workaround. I don’t know if we have any docs around this specifically. Let me reach out to the operator product manager and get some clarity on this. I believe it is something we wanted to improve.

The Operator presently relies on cache expiry on the Kubernetes side to trigger reconciliation, as explained by Zaid here and by Burak on GitHub. The cache expiry timeout is 10 hours, which is too long for this use case.

The annotations option mentioned above can be used as a solution. You may also manually restart the Operator controller, or use a Persistent Volume for the APIs.
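As a rough illustration of the Persistent Volume option, the gateway's apps directory could be backed by a PVC so that loaded API definitions survive pod restarts. The claim name, size, and mount wiring below are assumptions, not tested Helm values:

```yaml
# Sketch: back /mnt/tyk-gateway/apps with a PVC so loaded API
# definitions survive gateway pod restarts. Names/sizes are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tyk-gateway-apps
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
# In the gateway pod spec, the claim would then be mounted like:
# volumes:
#   - name: apps
#     persistentVolumeClaim:
#       claimName: tyk-gateway-apps
# volumeMounts:
#   - name: apps
#     mountPath: /mnt/tyk-gateway/apps
```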

We are looking at better ways to detect out-of-sync events. Allowing the Operator to manage Gateway instances using the Gateway API is one option. This is something we are working on right now.
