We are working on ingesting Speech & Text Analytics transcript data and are running into issues with parallel API queries. Details below.
The endpoint we are using is /api/v2/speechandtextanalytics/conversations/{conversationId}/communications/{communicationId}/transcripturl
The ingestion pipeline queries this endpoint from 50 parallel threads. Each call returns a transcript download URL, which we then fetch to copy the transcript file.
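For illustration, here is a minimal sketch of the request pattern (not our production code: the region host, access token, `pairs` list, and the `url` response field are placeholders/assumptions, and we use Python's requests library here just to show the shape of it):

```python
# Minimal sketch of the two-step fetch, with placeholder identifiers.
import concurrent.futures
import requests

BASE = "https://api.mypurecloud.com"                 # placeholder region host
HEADERS = {"Authorization": "Bearer ACCESS_TOKEN"}   # placeholder token

def fetch_transcript(pair):
    conversation_id, communication_id = pair
    # Step 1: request the transcript download URL for this communication.
    r = requests.get(
        f"{BASE}/api/v2/speechandtextanalytics/conversations/"
        f"{conversation_id}/communications/{communication_id}/transcripturl",
        headers=HEADERS,
        timeout=30,
    )
    r.raise_for_status()
    url = r.json()["url"]  # assumed response field name
    # Step 2: download the transcript payload from the returned URL.
    payload = requests.get(url, timeout=30)
    payload.raise_for_status()
    # File name embeds the requested communication id for later comparison.
    with open(f"transcript_{communication_id}.json", "wb") as f:
        f.write(payload.content)

pairs = [("conv-id-1", "comm-id-1")]  # placeholder conversation/communication pairs
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(fetch_transcript, pairs))
```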
However, the file downloaded from that URL is very often for a different communication id than the one we requested.
Attached please find some example files from a single ingestion batch. We ran the API queries across 50 parallel threads, saved each file under a name that includes the requested communication id, and then compared that id against the communication id inside the file payload, roughly as in the sketch further below.
Out of 117 files in the batch, 36 have a requested communication id that does not match the communication id inside the payload, and many of the files also appear to be duplicates of one another.
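The check itself looks roughly like this (a sketch only: the `batch` directory and file-naming scheme follow the pipeline above, and the payload field name `communicationId` is an assumption about the transcript schema):

```python
# Compare the communication id embedded in each file name against the id
# inside the payload, and count payload ids that appear under several names.
import json
import pathlib

mismatched, seen_payload_ids = [], {}
for path in pathlib.Path("batch").glob("transcript_*.json"):
    requested_id = path.stem.removeprefix("transcript_")
    payload_id = json.loads(path.read_text())["communicationId"]  # assumed field
    if payload_id != requested_id:
        mismatched.append((requested_id, payload_id))
    seen_payload_ids.setdefault(payload_id, []).append(path.name)

print(f"{len(mismatched)} files with mismatched communication ids")
dups = {k: v for k, v in seen_payload_ids.items() if len(v) > 1}
print(f"{len(dups)} payload ids appear in more than one file")
```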
We ran a second test with the requests issued sequentially and could not reproduce the issue.
The problem with the sequential approach is throughput: it takes 4+ hours to process 800 conversation id / communication id pairs, and we have over 300,000 queries to make to cover data since 2024-06-01. At that rate the full backfill would take roughly 1,500 hours, per the back-of-envelope below.
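Projecting from the sequential rate we observed (the figures are the ones above):

```python
# 800 pairs in ~4 hours sequentially; 300,000+ pairs to backfill.
observed_pairs, observed_hours, total_pairs = 800, 4.0, 300_000

hours_per_pair = observed_hours / observed_pairs   # ~0.005 h (~18 s) per pair
total_hours = total_pairs * hours_per_pair         # ~1,500 hours
print(f"~{total_hours:,.0f} hours, i.e. ~{total_hours / 24:.0f} days")  # ~62 days
```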
How do you think we should best approach this ingestion task?