Mass queue creation failing in Terraform

We are using the Genesys Cloud Terraform provider 1.30.2 and have started building a large number of queues, because the platform won't support more than 50 skills per agent and our environment requires far more than that, so we can't rely on skills alone. First, we use the setproduct function to create a list of unique queue names. Then we use a for_each loop to build a queue for each name in that list. The terraform plan shows the correct number of resulting queues and queue names, but during apply it starts complaining that the queues already exist (even though they didn't). 138 of the 510 queues that should have been created failed. Reviewing the logs, we can find no attempt to create the same queue name twice.

This is an example of the error it shows when it 'fails' to create a queue:
error while trying to create queue: IFP_Medicare_ThisYear_CT. Err API Error: 400 - Queue name 'IFP_Medicare_ThisYear_CT' is currently in use by queue

When the apply finishes, we noticed that an object was in fact created matching the name of each queue it threw an error about, but it doesn't function properly. For some reason the skill group assignment doesn't pull members into the queue (even though you can clearly see members in the skill group itself). The queue object also can't be deleted: if you delete it so you can re-run the job to create it, you get a message that the object was deleted successfully, yet the object is still in the GUI. It seems like these queues were created partially, but are orphaned in some way and can't be managed properly.

We have terraform logs that we can provide. We attempted a second run to see if it would somehow overwrite or fix them, but it also failed. Logs cover both attempts.

Here is the code that creates the objects, which seems to produce exactly what we're expecting:

locals {
  # Build every combination of product, plan year, and state, then join the
  # three parts into a single queue name per combination (510 names in total).
  IFPqueueNames     = setproduct(["HealthCare_", "Medicare_"], ["ThisYear_", "NextYear_"], ["AL","AK","AZ","AR","CA","CO","CT","DE","DC","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"])
  IFPfinalQueueList = [for q in local.IFPqueueNames : "${q[0]}${q[1]}${q[2]}"]
}

resource "genesyscloud_routing_queue" "IFP_LicensedQueues" {
  # One queue per generated name.
  for_each                = toset(local.IFPfinalQueueList)
  name                    = "IFP_${each.key}"
  skill_evaluation_method = "BEST"
  skill_groups            = [data.genesyscloud_routing_skill_group.IFP.id]
  division_id             = data.genesyscloud_auth_division.Home.id

  media_settings_email {
    alerting_timeout_sec      = 300
    service_level_duration_ms = 86400000
    service_level_percentage  = 0.8
  }

  media_settings_call {
    alerting_timeout_sec      = 8
    service_level_duration_ms = 20000
    service_level_percentage  = 0.8
  }

  media_settings_callback {
    alerting_timeout_sec      = 30
    service_level_duration_ms = 20000
    service_level_percentage  = 0.8
    enable_auto_answer        = true
  }

  whisper_prompt_id        = data.genesyscloud_architect_user_prompt.whisperIfp.id
  auto_answer_only         = false
  enable_manual_assignment = false

  media_settings_chat {
    alerting_timeout_sec      = 30
    service_level_duration_ms = 20000
    service_level_percentage  = 0.8
  }

  media_settings_message {
    alerting_timeout_sec      = 30
    service_level_duration_ms = 20000
    service_level_percentage  = 0.8
  }

  acw_wrapup_prompt    = "OPTIONAL"
  calling_party_name   = ""
  enable_transcription = true
  queue_flow_id        = data.genesyscloud_flow.InQueueFlow.id

  default_script_ids = {
    CALL = data.genesyscloud_script.AgentUI.id
  }
}

Hi Mike,

Thanks for reaching out to us. Can you please open a ticket with Care, because this potentially looks like an issue with the underlying Queue API? Please make sure you:

  1. Reference this developer forum post in the Care ticket. Without it, Care will usually route you back to the developer forum.

  2. Post the Care ticket number here so I can chat with the Care managers and make sure it does not get closed.

A few months ago we had some issues with asynchronous processing and eventual consistency in the routing queue APIs, and we ran into a similar situation: the queue was created, but the API reported an error back, so the queue never got added to the underlying backing state. Then, when you re-run your Terraform job, CX as Code complains that the queue already exists.

Thanks and I will have someone take a look at it once we get the Care ticket #.

Thanks,
John Carnell
Director, Developer Engagement

Also Mike,

Do you have the original error that was in the logs when the queue "failed" to create?

  • John

Yes, it's the one I posted already. We have the original logs from the issue. Can we share them? Would that help? We reviewed them to confirm that we really didn't attempt to create the same queue name twice in some way (from the setproduct/for loop), and it seems like we're clear on that.

Here's the full log line for you:

[INFO] provider.terraform-provider-genesyscloud_v1.30.2.exe: 2024/02/27 11:08:40 error while trying to create queue: IFP_Medicare_ThisYear_CT. Err API Error: 400 - Queue name 'IFP_Medicare_ThisYear_CT' is currently in use by queue 73df56ea-dccc-4e78-9ff9-58c7ba16fd6a. (a597f1e7-1155-4fe2-9a0a-c77d0c33c67d): timestamp=2024-02-27T11:08:40.656-0700

Here is the ticket number we opened before we contacted you. Care said they couldn't help us.
0003453868

Was anything changed? We are noticing that we can finally delete these queue objects now.

Hi Mike,

Nothing has changed as far as I am aware. I have escalated your ticket internally to make sure it gets reopened and assigned to our engineering team. Sorry for the confusion, but CX as Code is an open source project and support for it always starts in the Developer Forum. If we ascertain it is a bug somewhere else in the actual product, then we work with the Care team to make sure the case gets re-opened and we get a ticket to track it.

Thanks,
John Carnell
Director, Developer Engagement

Thanks for the quick response. So, we're now able to delete all the queues that had errors, though we're not sure why. However, we're hesitant to try re-adding them until we know more about what's happening. We thought about adding the queues slowly over time by shrinking the lists in the setproduct to keep the queue count low, then gradually expanding them back up to the level we need (a rough sketch of what we mean follows below). Do you think that is worth trying? Or should we wait?
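
For reference, this is roughly what we mean (just a sketch; the shortened state list and the state_batch_size variable are illustrative, not our real config):

variable "state_batch_size" {
  # How many states to include on this run; increased gradually per apply.
  type    = number
  default = 10
}

locals {
  # Full 51-entry list in the real config; truncated here for brevity.
  allStates = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL"]

  # Only use the first N states for this apply; later applies raise N.
  statesThisRun = slice(local.allStates, 0, min(var.state_batch_size, length(local.allStates)))

  IFPqueueNames     = setproduct(["HealthCare_", "Medicare_"], ["ThisYear_", "NextYear_"], local.statesThisRun)
  IFPfinalQueueList = [for q in local.IFPqueueNames : "${q[0]}${q[1]}${q[2]}"]
}

Because for_each is keyed by queue name, expanding the list later should only add the new queues and leave the existing ones untouched.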

Hi Mike,

You can try adding the queues slowly. I have had your Care case re-opened. If you attach your logs (if you have them) and your Terraform file, we will begin an investigation on our end. If you include the date and time this occurred in the ticket, we can start digging through our logs as well. I might need to pull in our routing team too.

Thanks,
John Carnell
Director, Developer Engagement

The Terraform logs are definitely available; they are attached to the Care ticket. We had debug mode turned on. Do you need them here as well? The Terraform config file is also attached. Let us know if you still don't have what you need and we'll make sure you get it. It looks like the upload function on this forum won't take our zipped-up log file.

Hi Mike,

No, if they are on the Care ticket, I can get to them.

Thanks,
John

Hey Mike,

Just to confirm: did you get any kind of queue errors in a previous run, before you saw the 400? We have seen problems in the past where the routing API returns an error (e.g. a 500) on queue creation, but still creates the queue on the backend.

When Terraform/CX as Code is rerun, we then see the 400 error: because the queue was never added to the CX as Code backing state (the underlying queue creation call appeared to fail), the next run fails because the queue already exists.

I have our routing team engaged, and I am seeing some 500 errors being returned from queue creation API calls. I just want to confirm whether this is the sequence of events that occurred.

Thanks,
John Carnell
Director, Developer Engagement

Hi Mike,

I just talked to the routing queue team and they confirmed my suspicion. Because you are creating so many queues at one time and then assigning them to the same skill group, you are causing a hot spot in the routing queues table(s). They have a PR up to resolve this, but they could not give me a good ETA as they have to go through testing, our deployment pipeline, and then a table migration.

There are a couple of ways you can deal with this:

  1. You can roll out the queues in small numbers and small batch sizes. This will limit the opportunity for the hot spot to be hit.

  2. You could roll out all of the queues without the skill group first and then roll out the skill group assignment in small batches. This would probably require you to restructure your queue rollouts so that, rather than using a single queue definition with one large for_each, you use multiple resources, each with a much smaller for_each. Create the queues and then add the skill group incrementally (see the sketch after this list).
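
As a rough sketch of option 2 (the chunklist split, the IFP_Batch0 resource name, and the attach_skill_groups variable are illustrative, not anything from your actual configuration):

variable "attach_skill_groups" {
  # First pass: false, so the queues are created without the skill group.
  # Later passes: true, so the skill group is attached in a smaller wave.
  type    = bool
  default = false
}

locals {
  # Break the full 510-name list into batches of roughly 100 queues each.
  IFPqueueBatches = chunklist(local.IFPfinalQueueList, 100)
}

resource "genesyscloud_routing_queue" "IFP_Batch0" {
  for_each = toset(local.IFPqueueBatches[0])
  name     = "IFP_${each.key}"

  skill_evaluation_method = "BEST"
  division_id             = data.genesyscloud_auth_division.Home.id

  # Empty on the first apply; the skill group is added on a later apply.
  skill_groups = var.attach_skill_groups ? [data.genesyscloud_routing_skill_group.IFP.id] : []

  # ... remaining media settings and flow/script references as in the original resource ...
}

You would repeat the resource block (IFP_Batch1, IFP_Batch2, and so on) for the other batches, applying each batch first and then flipping attach_skill_groups once the queues exist.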

Sorry it's not as simple as "we found the bug and we can roll out a patch to CX as Code." It is going to involve the API team making modifications, and that has a longer tail. The overall Care case will remain open so you can track its progress.

Thanks,
John Carnell
Director, Developer Engagement

John,

Thanks for hitting this so quickly and giving us all our options. We figured we might just need to do that, but didn't want to start down that path until we knew it was going to help us avoid the issue. The fact is, we got through most of the queues before the issues started happening. We'll discuss internally and figure out an approach to address it.

Thanks!

Mike

Hi Mike,

Sounds good. Appreciate the patience.

Thanks,
John Carnell

John,

Here's what happened trying out different combinations:

Loop of 102 queues with skill groups. Failed with errors.

Loop of 51 queues, no skill groups - add skill groups to the queues after. Seems to work OK (Terraform doesn't show any errors).

Loop of 102 queues, no skill groups - add skill groups to the queues after. Seems to work OK (Terraform doesn't show any errors).

Loop of 204 queues, no skill groups - add skill groups to the queues after. Seems to work OK (Terraform doesn't show any errors).

Add 2 loops of 102 and 1 loop of 204 - add skill groups to the queues after. This reported no errors, but looking through the queues on the platform we noticed a number where the skill group association was present but not working again. However, it did allow us to delete and re-add the skill group manually without an issue. We're not sure whether some of the smaller combinations above were producing similar issues or not, but out of 400+ queues there were only a small handful to fix, which isn't too much hassle.

My guess is that fixing these associations goes under the radar for Terraform, since it's only an association between existing objects and not a new ID or anything being created. I ran a terraform plan after making a few of these changes manually and it made no complaints.

Our target is 510 queues, which we hope to be able to do tomorrow; we have to wait for the platform to allow us to delete the queues again, which in some cases seems to take a day or so (who knows why). If you find out that the fix has been released, please let us know; otherwise we'll continue on our present course.
