We are working on ingesting Speech & Text Analytics transcript data and are running into issues with parallel API queries. Details below.
The endpoint we are using is /api/v2/speechandtextanalytics/conversations/{conversationId}/communications/{communicationId}/transcripturl
The ingestion pipeline queries this endpoint from 50 parallel threads. Each call returns a transcript download URL, which we then fetch to copy the transcript file.
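For illustration, here is a minimal sketch of the request pattern (not our production code: the region host, access token, `pairs` list, and the `url` response field are placeholders/assumptions, and we use Python's requests library here just to show the shape of it):

```python
# Minimal sketch of the two-step fetch, with placeholder identifiers.
import concurrent.futures
import requests

BASE = "https://api.mypurecloud.com"                 # placeholder region host
HEADERS = {"Authorization": "Bearer ACCESS_TOKEN"}   # placeholder token

def fetch_transcript(pair):
    conversation_id, communication_id = pair
    # Step 1: request the transcript download URL for this communication.
    r = requests.get(
        f"{BASE}/api/v2/speechandtextanalytics/conversations/"
        f"{conversation_id}/communications/{communication_id}/transcripturl",
        headers=HEADERS,
        timeout=30,
    )
    r.raise_for_status()
    url = r.json()["url"]  # assumed response field name
    # Step 2: download the transcript payload from the returned URL.
    payload = requests.get(url, timeout=30)
    payload.raise_for_status()
    # File name embeds the requested communication id for later comparison.
    with open(f"transcript_{communication_id}.json", "wb") as f:
        f.write(payload.content)

pairs = [("conv-id-1", "comm-id-1")]  # placeholder conversation/communication pairs
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(fetch_transcript, pairs))
```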
However, the file downloaded from that URL is very often for a different communication id than the one we requested.
Attached please find some example files from a single ingestion batch. We ran the API queries across 50 parallel threads, saved each file under a name that includes the requested communication id, and then compared that id against the communication id inside the file payload, roughly as in the sketch further below.
Out of 117 files in the batch, 36 have a requested communication id that does not match the communication id inside the payload, and many of the files also appear to be duplicates of one another.
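The check itself looks roughly like this (a sketch only: the `batch` directory and file-naming scheme follow the pipeline above, and the payload field name `communicationId` is an assumption about the transcript schema):

```python
# Compare the communication id embedded in each file name against the id
# inside the payload, and count payload ids that appear under several names.
import json
import pathlib

mismatched, seen_payload_ids = [], {}
for path in pathlib.Path("batch").glob("transcript_*.json"):
    requested_id = path.stem.removeprefix("transcript_")
    payload_id = json.loads(path.read_text())["communicationId"]  # assumed field
    if payload_id != requested_id:
        mismatched.append((requested_id, payload_id))
    seen_payload_ids.setdefault(payload_id, []).append(path.name)

print(f"{len(mismatched)} files with mismatched communication ids")
dups = {k: v for k, v in seen_payload_ids.items() if len(v) > 1}
print(f"{len(dups)} payload ids appear in more than one file")
```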
We ran a second test with the requests issued sequentially and could not reproduce the issue.
The problem with the sequential approach is throughput: it takes 4+ hours to process 800 conversation id / communication id pairs, and we have over 300,000 queries to make to cover data since 2024-06-01. At that rate the full backfill would take roughly 1,500 hours, per the back-of-envelope below.
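Projecting from the sequential rate we observed (the figures are the ones above):

```python
# 800 pairs in ~4 hours sequentially; 300,000+ pairs to backfill.
observed_pairs, observed_hours, total_pairs = 800, 4.0, 300_000

hours_per_pair = observed_hours / observed_pairs   # ~0.005 h (~18 s) per pair
total_hours = total_pairs * hours_per_pair         # ~1,500 hours
print(f"~{total_hours:,.0f} hours, i.e. ~{total_hours / 24:.0f} days")  # ~62 days
```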
How do you think we should best approach this ingestion task?