Challenge 3: Creating a Dataflow Job using the Datastream to BigQuery Template


Introduction

Now that you have a Datastream stream configured to capture changes from the source and send them to GCS, it’s time to create a Dataflow job that reads from GCS and updates BigQuery.

There are a number of pre-built templates available for Dataflow, including one for Datastream to BigQuery.

You can also extend the functionality of this template by including User Defined Functions (UDFs). UDFs are functions written in JavaScript that are applied to each incoming record and can perform operations such as enriching, filtering and transforming data.
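To give you an idea of the shape of such a function, here is a minimal sketch of a masking UDF. The template hands each record to the function as a JSON string and expects a JSON string back; the PAYMENT_METHOD field and the function name process are illustrative placeholders only, not the actual logic provided in the Description below.

```javascript
/**
 * Minimal illustrative UDF (not the file provided with this challenge).
 * The Dataflow template calls this function once per record, passing the
 * record as a JSON string, and writes whatever JSON string it returns.
 */
function process(inJson) {
  var row = JSON.parse(inJson);

  // Mask a hypothetical sensitive column while keeping the field in the schema.
  if (row.PAYMENT_METHOD) {
    row.PAYMENT_METHOD = 'XXX';
  }

  return JSON.stringify(row);
}
```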

Description

We’ve already prepared some transformation logic that masks a column in the data; you can download it from here.

Configure a Dataflow job using the pre-built Datastream to BigQuery template that reads from the Pub/Sub notifications and writes to the BigQuery dataset created in the very first challenge. Use the provided UDF to process the data before it’s stored in BigQuery.

Use the sub-retail subnet in the vpc-retail network and limit the maximum number of workers to 5.

Start the Dataflow job and, once it’s running, start the Datastream stream.

Warning The template needs Cloud Storage buckets for a few other purposes; these must be different from the bucket that contains the staging data.

Note It will take ~10 minutes to complete the replication.
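If you prefer to script the job creation rather than click through the Console, the sketch below shows one way to launch the pre-built Flex Template with the Node.js googleapis client. All project, bucket, subscription and dataset names are placeholders, and the template parameter names (inputFilePattern, gcsPubSubSubscription, deadLetterQueueDirectory, javascriptTextTransformGcsPath, and so on) should be double-checked against the current Datastream to BigQuery template documentation before you rely on them.

```javascript
// Sketch: launch the Datastream to BigQuery Flex Template from Node.js.
// The same job can be created from the Console or with "gcloud dataflow flex-template run".
const {google} = require('googleapis');

async function launchDatastreamToBigQuery() {
  const auth = new google.auth.GoogleAuth({
    scopes: ['https://www.googleapis.com/auth/cloud-platform'],
  });
  const dataflow = google.dataflow({version: 'v1b3', auth});

  const projectId = 'my-project';   // placeholder
  const region = 'us-central1';     // placeholder

  await dataflow.projects.locations.flexTemplates.launch({
    projectId,
    location: region,
    requestBody: {
      launchParameter: {
        jobName: 'datastream-to-bq',
        // Google-published image spec for the pre-built template.
        containerSpecGcsPath:
          `gs://dataflow-templates-${region}/latest/flex/Cloud_Datastream_to_BigQuery`,
        parameters: {
          // Datastream output location in GCS and the Pub/Sub notifications on it.
          inputFilePattern: 'gs://my-staging-bucket/data/',                    // placeholder
          gcsPubSubSubscription:
            'projects/my-project/subscriptions/my-gcs-notifications',          // placeholder
          // Target BigQuery datasets (staging and final replica).
          outputStagingDatasetTemplate: 'retail_staging',                      // placeholder
          outputDatasetTemplate: 'retail',                                     // placeholder
          // Must live in a bucket different from the staging-data bucket (see Warning above).
          deadLetterQueueDirectory: 'gs://my-dataflow-bucket/dlq/',            // placeholder
          // The masking UDF and the name of its entry function.
          javascriptTextTransformGcsPath: 'gs://my-dataflow-bucket/udf/transform.js',
          javascriptTextTransformFunctionName: 'process',
        },
        environment: {
          maxWorkers: 5,
          // sub-retail lives in the vpc-retail network; specifying the subnetwork is enough.
          subnetwork: `regions/${region}/subnetworks/sub-retail`,
        },
      },
    },
  });
}

launchDatastreamToBigQuery().catch(console.error);
```

The same values map onto the fields you’ll find in the Console, including the Optional Parameters section mentioned in the tips below.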

Tips

  • Create a new bucket to hold the transformation logic and other data
  • Some of the settings needed to complete this challenge (such as the UDF, the maximum number of workers and the networking-related ones) are in the Optional Parameters section
  • Pay attention to how you specify the subnet

Success Criteria

  1. All the data from Oracle table ORDERS has been replicated to the new BigQuery dataset using Datastream & Dataflow
  2. The ORDERS table in the BigQuery dataset should have a row count of 520217 (a quick verification sketch follows this list)
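You can check the row count in the BigQuery console, or with a small script like the one below. It assumes the target dataset is named retail; substitute the dataset you created in the first challenge.

```javascript
// Sketch: verify the replicated row count with the BigQuery Node.js client.
const {BigQuery} = require('@google-cloud/bigquery');

async function checkOrdersRowCount() {
  const bigquery = new BigQuery();
  const [rows] = await bigquery.query(
    'SELECT COUNT(*) AS row_count FROM `retail.ORDERS`'  // "retail" is a placeholder dataset name
  );
  console.log('ORDERS row count:', rows[0].row_count);    // expect 520217
}

checkOrdersRowCount().catch(console.error);
```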

Learning Resources
