Extracting JSON Message Data in BigQuery

The data structure from which I am extracting the JSON data is a BigQuery table called messages with three columns: conversation_id, timestamp, and history.

Here is an example of the table structure:

conversation_id | timestamp | history
'1qqq4323r23-t22' | '2017-04-24 14:23:12 UTC' | '[{JSON1}, {JSON2}, {JSON3}]'

The history column contains a list of JSON messages from a conversation. Like so many other processes, the data in the history column needs some processing before it can be usable, namely stripping away the enclosing brackets of the list and splitting each individual JSON message that belongs to a convesation into an array.

The JSON_EXTRACT() function is used to parse the JSON in the messages_array and obtain specific data according to the structure and key of the JSON. Specifically the message timestamp and message text are being pulled out of the JSON. Note that the JSON_ARRAY() function must operate on valid JSON. The function will not throw any sort of error however; it will just return a NULL resultset in the case of invalid or malformed JSON, which is not a useful error message.

Now that we have an array of message timestamps and message text, we use the UNNEST() function to essentially create individual records for the message data. The result of the CROSS JOIN is to have a result set with the conversation_id, timestamp of the messages that make up this conversation, and the message text of the conversation. Our use case calls for returning only the first message of the conversation so we use a window or analytics function called ROW_NUMBER() to get the first message of each conversation, according to their timestamp.

Below is the finished query, pulling all conversations that are within a specific date range that have a non-null message.

SELECT  
  results.conversation_id,
  results.message_time,
  results.msg
FROM (  
  WITH message_data as
    (
      SELECT
        message_id,
        SPLIT(trim(history,'[]'), '},' ) as messages_array
      FROM
        dataset.messages
      WHERE 1=1
        AND EXTRACT(DATE FROM `timestamp`) BETWEEN '2017-01-01' and '2017-09-30'
        AND history is not null
    )
  SELECT
    conversation_id,
    JSON_EXTRACT(CONCAT(messages_array, '}'), "$.timestamp") as message_time,
    JSON_EXTRACT(CONCAT(messages_array, '}'), "$.msg") as msg,
    ROW_NUMBER() OVER
      (PARTITION BY message_id
      ORDER BY JSON_EXTRACT(CONCAT(messages_array, '}'), "$.timestamp") asc
      ) as seqnum
  FROM message_data
    CROSS JOIN UNNEST(messages_array) as message
  WHERE 1=1
    AND JSON_EXTRACT(CONCAT(message, '}'), "$.msg") is not null
  ) as results
WHERE  
  results.seqnum = 1

Pitt Fagan

Greetings! I'm passionate about data; specifically the big data and data science ecosystems! It's such an exciting time to be working in these spaces. I run the BigDataMadison meetup where I live.