Pipeline takes 5 minutes to fail

Hello,

We have noticed that in many cases, when there is any sort of error in an object, the pipeline takes five minutes to fail on that object. Is there a reason for this, please?

For example:
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 10s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 20s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 30s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 40s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 50s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 1m0s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 1m10s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 1m20s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 1m30s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 1m40s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 1m50s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 2m0s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 2m10s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 2m20s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 2m30s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 2m40s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 2m50s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 3m0s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 3m10s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 3m20s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 3m30s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 3m40s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 3m50s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 4m0s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 4m10s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 4m20s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 4m30s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 4m40s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 4m50s elapsed]
module.routing_queue.genesyscloud_routing_queue.cicd_test: Still modifying... [id=..., 5m0s elapsed]

... Failure ...

Is it possible to get a quicker failure, please? We re-run pipelines for various reasons and with many developers involved, so we would like to avoid this delay wherever possible and move code through the pipeline faster, especially in Dev environments.

Thank you.

Hi Jmakacek,

Most of the retry logic you are seeing there exists because we often have to wait for asynchronous behavior before we can confirm that something is done. So we re-read an object until it reaches the state we expect. Otherwise, we would have situations where Terraform calls our APIs, the APIs return a 200, but because of eventual consistency or async delays the object is not actually ready to be processed.
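To give a rough idea of what that read loop looks like inside a provider, here is a generic sketch using the Terraform plugin SDK retry helper. This is not the actual CX as Code source; the queue type, readQueue function, and the name check are placeholders for illustration only.

package example

import (
	"context"
	"fmt"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/retry"
)

// queue and readQueue are placeholders standing in for the provider's
// real call to the Genesys Cloud API.
type queue struct{ Name string }

func readQueue(ctx context.Context, id string) (*queue, error) {
	return nil, fmt.Errorf("readQueue: not implemented in this sketch")
}

// waitForQueueUpdate keeps re-reading the queue until it reflects the value
// we just wrote, or until the timeout expires. The "Still modifying..." lines
// in the Terraform output are printed while a loop like this is running.
func waitForQueueUpdate(ctx context.Context, id, wantName string, timeout time.Duration) error {
	return retry.RetryContext(ctx, timeout, func() *retry.RetryError {
		q, err := readQueue(ctx, id)
		if err != nil {
			// A hard failure stops the loop immediately.
			return retry.NonRetryableError(err)
		}
		if q.Name != wantName {
			// Not consistent yet: back off and re-read.
			return retry.RetryableError(fmt.Errorf("queue %s has not picked up the change yet", id))
		}
		return nil
	})
}

If an error is classified as retryable (for example, something that looks like an eventual-consistency lag), a loop like this keeps polling until the timeout expires, which is why a failure can take the full window to surface.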

The queue APIs had some issues a while ago with the DynamoDB tables backing them, and we had to tune the retry window up high (it was not a solution, but a workaround). I can check whether the routing team has pushed a fix; if so, we could bring this retry back down.

The real question is why you are getting the error in the first place. Have you set TF_LOG=debug for additional information? Or is this post more about how long it takes for the timeout to occur? I am just asking to make sure I am solving the right problem.

Thanks,
John Carnell
Director, Developer Engagement

Hello,

This seems to happen with almost every error, whether it is a synchronization issue or an error in our CX as Code implementation. In fact, the only time I remember this happening because of a sync/lock issue was when updating DIDs in IVR Routing configurations.

I will have a look and see if I can find another example, but it often fails on an error that we would expect to be caught at the plan stage, rather than only surfacing during deployment (apply).

Thank you.

Hello again,

I just remembered that the original failure in this topic was this one: Failure for Queue - media_settings_callback - enable_auto_answer

The purpose of this topic was to find out why it took so long to fail: it did not fail during the Plan stage, and it took a long time to fail during deployment.

I will still try to find other examples of this.

Thank you.

Hi Jmakacek,

It would not fail, or take extra long, during a plan, as Terraform only executes reads during a plan; it never executes writes. When we write something with CX as Code, we immediately go into a read loop and wait until we get confirmation that our changes have been made. In the case of queues, a significant hot spot was uncovered in the DynamoDB tables backing the queue definitions. Because of this hot spot, we had to implement a much longer-than-normal timeout period to wait for the backlog of writes to be processed by the API.
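For context, that timeout window is something the resource declares up front. Below is a minimal sketch of how a provider sets per-operation timeouts with the Terraform plugin SDK; the five-minute values simply mirror the log above and are not necessarily what the actual CX as Code resource uses.

package example

import (
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// resourceRoutingQueueSketch shows where a provider declares how long an
// apply-time operation may poll before giving up. CRUD functions and the
// attribute schema are omitted for brevity.
func resourceRoutingQueueSketch() *schema.Resource {
	return &schema.Resource{
		Timeouts: &schema.ResourceTimeout{
			Create: schema.DefaultTimeout(5 * time.Minute),
			Update: schema.DefaultTimeout(5 * time.Minute),
			Delete: schema.DefaultTimeout(5 * time.Minute),
		},
	}
}

Until that window is brought back down (or the underlying hot spot is fixed), an update that never reaches the expected state will keep polling for the full duration before reporting the failure.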

Let me know if you have some specific examples, as we can try to work our way back from those.

Thanks,
John Carnell
Director, Developer Engagement
