How to send a transcribed message to an API during a call

Let me start off by saying I am very new to Genesys and I am trying to understand the options for something we are trying to accomplish, but I've searched high and low for answers and I just can't get a handle on it.

We would like employees to be able to call in, enter their employee Id with the phone keypad and basically leave a message about a problem they are experiencing (e.g., faulty keyboard, need a replacement device, etc). That message would be transcribed to convert it to text, and the text would be sent to an Azure-hosted API that would log a ticket in a helpdesk system. I was hoping to do this in real-time, where the text of the transcription would be sent to the API while the employee is still on the phone, allowing us to return the ticket number.

I've found a few options:

  • Call transcripts - We could turn on transcriptions and use the conversation Id to retrieve the transcript from the AWS S3 bucket via the URL returned by the Speech and Text Analytics API. This would not be real-time, so the best we could do is tell the employee they will receive an email (based on their employee Id) when the ticket is created. I think we'd have to poll the API to get the AWS S3 URL, as the transcription can take some time before the URL becomes available...?

  • Notification API - We could somehow use the Notification API to get near real-time transcription data, but I think that would be a separate web socket process, so we'd need to know that the transcription was done (employee presses pound to indicate they've completed their message, or something), then correlate the full transcription with a separate API call at the end of the flow to create the ticket.

  • AudioHook - We could use AudioHook to stream the call to...somewhere, maybe Azure's Speech service for transcription? This would incur more costs and I'm not sure how it would all work...

Is there a better way to do this, or is one of these options the way to go? I'm really just looking for guidance as to where to start...

Thanks,
Adam

I think you've uncovered the options; I can't think of anything else off the top of my head.

Using flow transcripts is likely the most straightforward option, but as identified, you can't get the transcript until after the call is over. This could potentially be mitigated in a couple ways. One way might be to go ahead and create the case during the call with boilerplate text and then have an asynchronous process that gets the transcript and goes back to update the case description once it's available. If your time to assign those cases is longer than the time to process the transcript, it should be ok. Another could be to disconnect the call and create the case after the transcript is available, then use an agentless campaign to call the person back and read them the case number at that time.
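
As a rough illustration of that first mitigation, here's a sketch of the "create now, enrich later" pattern. The helpdesk endpoint, field names, and ticketNumber response are all hypothetical, just to show the shape of the two calls:

```python
import requests

HELPDESK_API = "https://helpdesk.example.com/api/tickets"  # hypothetical endpoint

def create_placeholder_ticket(employee_id: str, conversation_id: str) -> str:
    """During the call: create the ticket with boilerplate text and return its number."""
    resp = requests.post(HELPDESK_API, json={
        "employeeId": employee_id,
        "description": f"Transcript pending for conversation {conversation_id}.",
    })
    resp.raise_for_status()
    return resp.json()["ticketNumber"]  # field name assumed

def update_ticket_description(ticket_number: str, transcript: str) -> None:
    """Later, from the asynchronous process: replace the boilerplate with the real transcript."""
    resp = requests.patch(f"{HELPDESK_API}/{ticket_number}", json={"description": transcript})
    resp.raise_for_status()
```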

I haven't used the transcriptions notification topic before, so I can't vouch for it containing exactly the data you're looking for; I'll assume that it does since you're suggesting it. Using WebSockets for this isn't recommended because you will require guaranteed delivery. WebSocket notifications use at-least-once-attempted, unordered delivery, meaning each message may be received zero or more times and not in FIFO order. Since you're reconstructing words to make a coherent transcription, order and completeness matter very much. An AWS EventBridge integration has guaranteed delivery of messages, though the order may still be a concern; you should be able to process them using the emitted timestamps to ensure ordering. I would definitely recommend doing a PoC if you go down this route to confirm you can get complete and correct transcripts; it seems like it has the potential to be flaky or require some convoluted logic.
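
If you do go down that route, the reassembly step might look roughly like this sketch. The field names ("eventTime", "text") are assumptions, not the actual topic schema, so verify them against the real payload:

```python
from typing import Iterable

def assemble_transcript(events: Iterable[dict]) -> str:
    """Dedupe and order transcription notification events, then join the text."""
    # Duplicates are possible (at-least-once-attempted delivery), so key on the
    # emitted timestamp plus text; both field names are assumptions.
    unique = {(e["eventTime"], e["text"]): e for e in events}
    ordered = sorted(unique.values(), key=lambda e: e["eventTime"])
    return " ".join(e["text"] for e in ordered)
```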

AudioHook will definitely get the audio in your hands in as near real time as is possible. However, you then need to integrate with a 3rd party service to do the transcription. It's definitely the most complicated option and will require a lot more development effort. But you also get a whole lot more control over how everything is processed (since you have to build it all yourself!).

Both notifications and AudioHook have the challenge of signaling when to start and stop monitoring the caller's speech. This signaling can be accomplished via data actions from the flow. Your service that's handling the transcriptions would need to have public-facing REST API endpoints for the flow to invoke to tell it when to start and stop for a conversation. I don't do much Architect work, but I think you can accomplish this using a menu with a long timeout. So it would be: play prompt "record after the beep", data action to start recording, play "beep", enter menu with long timeout, press key to trigger data action to stop.
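
Just to make the start/stop signaling concrete, here's a minimal sketch of what those endpoints could look like. Flask is only an example; the paths and the in-memory tracking are illustrative, not a real service:

```python
from flask import Flask

app = Flask(__name__)
active_conversations: set[str] = set()  # conversations we're currently monitoring

@app.post("/transcriptions/<conversation_id>/start")
def start(conversation_id: str):
    active_conversations.add(conversation_id)      # begin consuming audio/events
    return {"monitoring": True}

@app.post("/transcriptions/<conversation_id>/stop")
def stop(conversation_id: str):
    active_conversations.discard(conversation_id)  # finalize and hand off the transcript
    return {"monitoring": False}
```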

Here's a blueprint using process automation triggers to do something like the first option: https://developer.genesys.cloud/blueprints/send-message-transcript-by-email/

This is very helpful information! I appreciate the response. One(ish) followup question, if you don't mind: with flow transcripts, how do we know the transcript is available? I mentioned potentially needing to poll the Speech and Analytics API (specifically, /api/v2/speechandtextanalytics/conversations/{conversationId}/communications/{communicationId}/transcripturl) until we get a 200 response, or is there some other mechanism that we can use? And what happens if we never get that 200 response? I can't imagine the transcription is "guaranteed", and it goes away after 10 min, so there is the real potential to miss something here. I guess we may have to do the boilerplate text you mentioned, and then potentially update it to "Transcription Failed, contact employee directly" or something.

I don't believe there's a notification for when the transcription download is available, so you must poll. You can make some inferences about when it's definitely not available though, which would be during the call and including a short period after the call disconnects. That after-call duration is not a static value and will be variable based on many factors. But you can get a benchmark from testing when you implement this; I would guess ~5-10 seconds. Start your polling once you reasonably expect it might be available.

You're correct that a transcription isn't guaranteed, so make sure you stop polling after a while so you don't get hung processes. 10+ minutes seems like a definite error threshold, but under normal circumstances it's likely available within a minute or two.
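
A bounded polling loop might look roughly like the sketch below. The endpoint path is the one you mentioned; the API host, token handling, timings, and the response field name are assumptions you'd tune and verify from your own testing:

```python
import time
from typing import Optional

import requests

BASE_URL = "https://api.mypurecloud.com"  # assumption: use your region's API host

def wait_for_transcript_url(conversation_id: str, communication_id: str,
                            access_token: str, timeout_s: int = 600) -> Optional[str]:
    """Poll the transcript URL endpoint until it returns 200 or we give up."""
    url = (f"{BASE_URL}/api/v2/speechandtextanalytics/conversations/"
           f"{conversation_id}/communications/{communication_id}/transcripturl")
    headers = {"Authorization": f"Bearer {access_token}"}
    deadline = time.monotonic() + timeout_s
    delay = 10  # start ~5-10 seconds after disconnect, per the benchmark above
    while time.monotonic() < deadline:
        time.sleep(delay)
        resp = requests.get(url, headers=headers)
        if resp.status_code == 200:
            return resp.json().get("url")  # field name assumed; verify in the API docs
        delay = min(delay * 2, 60)         # back off, but keep checking at least once a minute
    return None  # no transcript: fall back to the "transcription failed" path
```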

You could use a data action in your flow just before you disconnect to invoke your service with the conversation ID so it knows to look for that conversation in a few seconds. You could subscribe to a topic like v2.detail.events.conversation.{id}.customer.end or v2.detail.events.conversation.{id}.flow.outcome using event bridge to get notified when calls end, but you'll have to filter the invocations for only the conversations you want.

For the EventBridge approach, you are required to use AWS, but there's a blueprint demonstrating how to build a passthrough to another cloud service: https://developer.genesys.cloud/blueprints/genesyscloud-eventbridge-eventgrid-blueprint/

Thanks, Tim, I definitely feel better having two concrete ways we could go, even if one of them is more of an asynchronous solution and one is more Rube Goldberg-esque.

Hi @adam.conley,
We have recently done a POC with Amazon EventBridge, as @tim.smith suggested. It certainly gave us a great experience with minimal effort.

Regards


@Vineet_Kakroo , any chance you could provide some high-level details about your POC? I'd be interested in how you collect all of the various transcription events and put them in the correct order to reconstruct the full transcript. We are an Azure shop, but I'm sure there are equivalents, so any details you could provide would be greatly appreciated.

Hi @adam.conley, I won't be able to help you much with the Azure side of things.
For us the POC was simple: we set up Amazon EventBridge, set up the specific event we wanted to listen to, and, on the AWS end, set up JSON parsing to retrieve the transcript part of the event.

Reading about your use case got me thinking, and I want to suggest another way you may achieve your required outcome. We do this in a few of our flows:

  • Set up a new inbound flow and, within it, call a bot
  • In the bot flow, do not set up any intents; just add the step to ask a question like "why are you calling us today", using the "Ask for Slot" step.
  • Once your employees say something, you can capture it using the built-in variable Session.LastCollectedUtterance and pass it back to the inbound flow. This variable holds the transcribed utterance of what was said (or the best interpretation the bot understood)
  • In the inbound flow, you can call your API, create the ticket with the utterance from the bot, then play back the ticket details or email them to the employee

Hope this gives you some additional options to work on.

Regards


@Vineet_Kakroo , I have to say, people like you are why communities like this flourish. Thanks for the information, and the alternative approach. I think you are talking about the Amazon Lex bot integration, right? That may be more AWS investment than we want, but I will definitely investigate that.

Just a few questions on the AWS EventBridge POC:

  1. Did you have to store the text of each transcription event somewhere so you could put them back together?
  2. Did you assume that events would come in the proper order, or did you handle cases where events might come out of order? For example, if the employee says "This is" and then pauses, as I understand it, that would be one transcription event, and if they continue saying "a test", that is a separate event. As @tim.smith mentioned, these events may arrive out of order, so the "a test" event could come before the "This is" event. I was just wondering if and how you accommodated that.
  3. What about determining that the caller has stopped speaking? Is that a separate event that you handle to continue processing the full transcription?

Thanks again for your help!

Hi @adam.conley, no, I am not talking about an Amazon Lex bot; I am talking about a native Genesys bot. You can create a simple bot natively, ask the question, get a response, have the bot convert what it understood to text, and then use it however you want.
For EventBridge,

  • The POC required that we get the transcript, store it for 24 hours, do whatever we needed to do with it within those 24 hours, and then discard it.
  • The additional processes that work on the transcripts within those 24 hours perform a variety of activities, from analytics to reporting, etc.
  • As mentioned before, EventBridge sends the whole transcribed conversation as part of the event notification payload; you just need to retrieve it from the JSON (rough sketch below). Each individual transcribed sentence comes with its own timestamp and internal/external markers to help you order them correctly. I think these are already ordered correctly, the same way you see them on the conversation transcription screen.
  • So your concerns about out-of-order events or determining that the caller has stopped speaking don't arise, as EventBridge will only send the notification for a conversation once the conversation has ended and its transcript has been generated, not before.
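
Purely for illustration, the parsing and ordering could look something like this. The payload shape shown ("transcripts" list with "text", "startTimeMs", and "participant" fields) is an assumption for the example, not the documented schema, so map it to the actual JSON you receive:

```python
def extract_transcript(event: dict) -> str:
    """Pull phrases out of the EventBridge event and order them by timestamp."""
    detail = event.get("detail", {})             # Genesys payload arrives under "detail"
    phrases = detail.get("transcripts", [])      # field names here are assumptions
    ordered = sorted(phrases, key=lambda p: p.get("startTimeMs", 0))
    return "\n".join(f'{p.get("participant", "?")}: {p.get("text", "")}' for p in ordered)

# Hand-built example, just to show the shape being assumed:
sample = {"detail": {"transcripts": [
    {"participant": "external", "text": "My keyboard is broken.", "startTimeMs": 0},
]}}
print(extract_transcript(sample))
```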

Hope this helps.

Regards

Thanks again @Vineet_Kakroo. I definitely have some more digging to do and you have been very helpful!

Thanks a lot for the useful topic. I was going to start a new topic for the same issue.

Anyway, thanks a lot!
