Challenge 1: Loading the source data


Introduction

This first step is all about getting started with the source data. Typically, data is copied periodically from operational data stores, such as OLTP databases or CRM systems, to an analytics data platform. Many different methods exist for getting that data there, either through pushes (change data capture streams, files being generated and forwarded, etc.) or pulls (periodically running a query against a database, copying from a file system, etc.). For now we'll ignore all of that and assume the data has already been collected from the source systems and put into a Google Cloud Storage bucket.
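If you want to peek at what landed in the bucket programmatically rather than through the console, here is a minimal sketch using the google-cloud-storage Python client. The bucket name is a placeholder; finding the real one is part of the challenge.

```python
# Minimal sketch: list the objects in the landing bucket.
# "my-landing-bucket" is a placeholder; substitute the bucket you find.
from google.cloud import storage

client = storage.Client()  # uses your default credentials and project
for blob in client.list_blobs("my-landing-bucket"):
    print(f"{blob.name} ({blob.size} bytes)")
```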

Note: For the sake of simplicity, we'll implement full loads. In real-world applications with larger datasets you might want to consider incremental loads.

Description

We have already copied the data from the underlying database to a specific Cloud Storage bucket. Go ahead and find that bucket, and have a look at its contents. Create a new BigQuery dataset called raw in the same region as that storage bucket, and create BigLake tables for the following entities: person, sales_order_header and sales_order_detail. You can ignore the other files for now.
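One possible approach is sketched below with the google-cloud-bigquery Python client. The project, region, bucket name, connection ID, file layout, and file format (CSV here) are all assumptions; adjust them to what you actually find in the bucket. BigLake tables also require a Cloud resource connection that has been granted access to the bucket, which this sketch assumes already exists.

```python
from google.cloud import bigquery

PROJECT = "my-project"        # placeholder
REGION = "europe-west1"       # placeholder: must match the bucket's region
BUCKET = "my-landing-bucket"  # placeholder
CONNECTION = f"{PROJECT}.{REGION}.biglake-conn"  # placeholder resource connection

client = bigquery.Client(project=PROJECT)

# Create the raw dataset in the same region as the landing bucket.
dataset = bigquery.Dataset(f"{PROJECT}.raw")
dataset.location = REGION
client.create_dataset(dataset, exists_ok=True)

# Create one BigLake table per entity via DDL. The path pattern and
# format are assumptions; inspect the bucket contents first.
for entity in ["person", "sales_order_header", "sales_order_detail"]:
    ddl = f"""
        CREATE EXTERNAL TABLE `{PROJECT}.raw.{entity}`
        WITH CONNECTION `{CONNECTION}`
        OPTIONS (
          format = 'CSV',
          uris = ['gs://{BUCKET}/{entity}/*.csv']
        )
    """
    client.query(ddl).result()
```

The same DDL can be run directly in the BigQuery console or with the bq command-line tool if you prefer not to script it.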

Success Criteria

  • There is a new BigQuery dataset raw in the same region as the landing bucket.
  • There are 3 BigLake tables with content in the raw dataset: person, sales_order_header and sales_order_detail (see the verification sketch below).
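To check these criteria yourself, a quick verification sketch, using the same placeholder project name as above:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Confirm the dataset exists in the expected region.
print(client.get_dataset("my-project.raw").location)

# Confirm each BigLake table actually returns rows.
for entity in ["person", "sales_order_header", "sales_order_detail"]:
    rows = client.query(
        f"SELECT COUNT(*) AS n FROM `my-project.raw.{entity}`"
    ).result()
    print(entity, next(iter(rows)).n)
```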

Learning Resources
