JSON data is corrupt in AWS after doing S3 bulk export

Good day

I am using the AWS S3 recording bulk actions integration.

I am able to successfully export all the recordings and metadata that the client requires.

The first problem I encountered is that after the Glue crawler finishes and creates the table, the table is automatically partitioned, and the partition schema does not match the table schema. This causes a Hive error when the table is queried: "HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced". I was able to resolve this by selecting the checkbox in the Glue crawler under Updates and scheduling > Advanced options called "Update all new and existing partitions with metadata from the table". After dropping the table and running the crawler again to rebuild it, the error is resolved. I highly recommend adding this step to your Athena + Glue example document.
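For anyone applying this fix from a script rather than the console, the same setting can (as far as I can tell) be applied through the Glue API by updating the crawler's configuration JSON; the crawler name below is a placeholder, and the actual AWS call is shown commented out:

```python
import json

# Crawler configuration equivalent to the console checkbox
# "Update all new and existing partitions with metadata from the table":
# partitions inherit their schema metadata from the table.
crawler_configuration = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
    },
}


def build_update_crawler_args(crawler_name: str) -> dict:
    """Arguments for glue.update_crawler() that apply the setting."""
    return {
        "Name": crawler_name,
        "Configuration": json.dumps(crawler_configuration),
    }


# "recording-export-crawler" is a made-up name for illustration.
args = build_update_crawler_args("recording-export-crawler")
print(args["Configuration"])

# With AWS credentials configured, this would be applied with:
#   import boto3
#   boto3.client("glue").update_crawler(**args)
```

After updating the crawler, remember to drop the table and re-run the crawler so the partitions are rebuilt with the inherited schema.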

The next issue, which I cannot resolve, is that even though I have set the "ignore.malformed.json" key to true for the database, queries occasionally generate the following error: "HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 4 [character 5 line 1]". Why would corrupted data be exported from Genesys Cloud?

When running a query without a partition specified, the query scans all the data, and a single corrupt JSON object stops the query from resolving successfully. The only way to get queries to resolve is to specify partitions to narrow down the result set and hopefully skip over the corrupt JSON.

This leaves the client unable to use the export to find the recordings they need. They are understandably very upset that we advised them this was the best way to export, and they now carry the costs of this unworkable solution with only a few days before their Genesys Cloud org is cancelled.

Any assistance with this would be greatly appreciated.

I have inspected 6 JSON files from where a query returned an error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Expected a ',' or ']' at 11 [character 12 line 1]

The JSON files all start with:
{"mediaType":"Call",

Character 12 is the end of the mediaType key, and none of the files showed any problems when I checked them. This is starting to look like an AWS Athena/Glue error, and I will open a ticket with them now.

Solved: when dropping and rebuilding the table, I had forgotten to go back and set the table property "ignore.malformed.json" = true.

Now the queries complete successfully, and there are just a number of null rows included, which is acceptable.

Leaving this here in case someone else benefits from it.

Can anyone from Genesys comment on what is going wrong here in the first place? Is this an error with the AWS Glue crawler, as I suspect? Bulk export of recordings is hardly a rare event, so surely there must be some advice or explanation of how to overcome missing data in the AWS table?

The queries are completing, but the number of missing fields in the data is causing concern for our client.

It seems that this problem is related to missing metadata for interactions older than 1.5 years. If this integration does not use the conversation details job endpoint, then it will not have access to this older metadata. Can I please have some assistance with best practices for completing this export?
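For reference, this is the shape of the asynchronous conversation details job request I believe would be needed to reach the older data (the endpoint path is from the public API docs; the base URL, dates, and crawler of results shown in comments are placeholders for illustration):

```python
import json
from datetime import datetime, timedelta, timezone

# Region-specific base URL; adjust for your org's region.
API_BASE = "https://api.mypurecloud.com"


def build_details_job_body(start: datetime, end: datetime) -> dict:
    """Request body for POST /api/v2/analytics/conversations/details/jobs,
    the async job endpoint that can query conversation detail data older
    than the synchronous query endpoint retains."""
    interval = f"{start:%Y-%m-%dT%H:%M:%S.000Z}/{end:%Y-%m-%dT%H:%M:%S.000Z}"
    return {"interval": interval}


# Example: cover roughly the last three years so interactions older than
# 1.5 years are included. The dates are placeholders.
end = datetime(2024, 1, 1, tzinfo=timezone.utc)
start = end - timedelta(days=3 * 365)
body = build_details_job_body(start, end)
print(json.dumps(body))

# With an OAuth token, the job would be submitted with e.g. requests:
#   requests.post(f"{API_BASE}/api/v2/analytics/conversations/details/jobs",
#                 headers={"Authorization": f"Bearer {token}"}, json=body)
# then polled at .../jobs/{jobId} until complete, and results paged from
# .../jobs/{jobId}/results.
```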

Please open a case with Genesys Cloud Care to investigate the missing and corrupted data; we do not have access to your data via the forum.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.