Challenge 3: Creating a Dataflow Job using the Datastream to BigQuery Template


Introduction

Now that you have a Datastream stream configured to capture changes from the source and send them to GCS, it’s time to create a Dataflow job that reads from GCS and updates BigQuery.

There are a number of pre-built templates available for Dataflow, including one for Datastream to BigQuery.

You can also extend the functionality of this template by including User Defined Functions (UDFs). UDFs are functions written in JavaScript that are applied to each incoming record and can perform operations such as enriching, filtering and transforming data.
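To give you an idea of the shape of such a function, here is a minimal sketch of a masking UDF. The template hands each record to the function as a JSON string and expects a JSON string back; the PAYMENT_METHOD field and the function name process are illustrative placeholders only, not the actual logic provided in the Description below.

```javascript
/**
 * Minimal illustrative UDF (not the file provided with this challenge).
 * The Dataflow template calls this function once per record, passing the
 * record as a JSON string, and writes whatever JSON string it returns.
 */
function process(inJson) {
  var row = JSON.parse(inJson);

  // Mask a hypothetical sensitive column while keeping the field in the schema.
  if (row.PAYMENT_METHOD) {
    row.PAYMENT_METHOD = 'XXX';
  }

  return JSON.stringify(row);
}
```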

Description

We’ve already prepared some transformation logic that masks a column in the data; you can download it from here.

Configure a Dataflow job using the pre-built Datastream to BigQuery template that reads from the Pub/Sub notifications and writes to the BigQuery dataset created in the very first challenge. Use the provided UDF to process the data before it’s stored in BigQuery.

Use the sub-retail subnet in the vpc-retail network and limit the maximum number of workers to 5.

Start the Dataflow job and, once it’s running, start the Datastream stream.

Warning The template needs Cloud Storage buckets for a few other purposes; these must be different from the bucket that contains the staging data.

Note It will take ~10 minutes to complete the replication.
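If you prefer to script the job creation rather than click through the Console, the sketch below shows one way to launch the pre-built Flex Template with the Node.js googleapis client. All project, bucket, subscription and dataset names are placeholders, and the template parameter names (inputFilePattern, gcsPubSubSubscription, deadLetterQueueDirectory, javascriptTextTransformGcsPath, and so on) should be double-checked against the current Datastream to BigQuery template documentation before you rely on them.

```javascript
// Sketch: launch the Datastream to BigQuery Flex Template from Node.js.
// The same job can be created from the Console or with "gcloud dataflow flex-template run".
const {google} = require('googleapis');

async function launchDatastreamToBigQuery() {
  const auth = new google.auth.GoogleAuth({
    scopes: ['https://www.googleapis.com/auth/cloud-platform'],
  });
  const dataflow = google.dataflow({version: 'v1b3', auth});

  const projectId = 'my-project';   // placeholder
  const region = 'us-central1';     // placeholder

  await dataflow.projects.locations.flexTemplates.launch({
    projectId,
    location: region,
    requestBody: {
      launchParameter: {
        jobName: 'datastream-to-bq',
        // Google-published image spec for the pre-built template.
        containerSpecGcsPath:
          `gs://dataflow-templates-${region}/latest/flex/Cloud_Datastream_to_BigQuery`,
        parameters: {
          // Datastream output location in GCS and the Pub/Sub notifications on it.
          inputFilePattern: 'gs://my-staging-bucket/data/',                    // placeholder
          gcsPubSubSubscription:
            'projects/my-project/subscriptions/my-gcs-notifications',          // placeholder
          // Target BigQuery datasets (staging and final replica).
          outputStagingDatasetTemplate: 'retail_staging',                      // placeholder
          outputDatasetTemplate: 'retail',                                     // placeholder
          // Must live in a bucket different from the staging-data bucket (see Warning above).
          deadLetterQueueDirectory: 'gs://my-dataflow-bucket/dlq/',            // placeholder
          // The masking UDF and the name of its entry function.
          javascriptTextTransformGcsPath: 'gs://my-dataflow-bucket/udf/transform.js',
          javascriptTextTransformFunctionName: 'process',
        },
        environment: {
          maxWorkers: 5,
          // sub-retail lives in the vpc-retail network; specifying the subnetwork is enough.
          subnetwork: `regions/${region}/subnetworks/sub-retail`,
        },
      },
    },
  });
}

launchDatastreamToBigQuery().catch(console.error);
```

The same values map onto the fields you’ll find in the Console, including the Optional Parameters section mentioned in the tips below.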

Tips

  • Create a new bucket to hold the transformation logic and other data
  • Some of the settings needed to complete this challenge (such as the UDF, the maximum number of workers and the networking-related ones) are in the Optional Parameters section
  • Pay attention to how you specify the subnet

Success Criteria

  1. All the data from Oracle table ORDERS has been replicated to the new BigQuery dataset using Datastream & Dataflow
  2. The ORDERS table in the BigQuery dataset should have a row count of 520217 (a quick verification sketch follows this list)
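You can check the row count in the BigQuery console, or with a small script like the one below. It assumes the target dataset is named retail; substitute the dataset you created in the first challenge.

```javascript
// Sketch: verify the replicated row count with the BigQuery Node.js client.
const {BigQuery} = require('@google-cloud/bigquery');

async function checkOrdersRowCount() {
  const bigquery = new BigQuery();
  const [rows] = await bigquery.query(
    'SELECT COUNT(*) AS row_count FROM `retail.ORDERS`'  // "retail" is a placeholder dataset name
  );
  console.log('ORDERS row count:', rows[0].row_count);    // expect 520217
}

checkOrdersRowCount().catch(console.error);
```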

Learning Resources
