Challenge 1: Loading the source data
Introduction
This first step is all about getting started with the source data. Typically, data is copied periodically from operational data stores, such as OLTP databases and CRM systems, to an analytics data platform. Many different methods exist for getting that data, either through pushes (change data capture streams, files being generated and forwarded, etc.) or pulls (periodically running a query on a database, copying from a file system, etc.). For now we’ll ignore all that and assume that the data has somehow been collected from the source systems and put into a Google Cloud Storage bucket.
Note: For the sake of simplicity, we’ll implement full loads. In real-world applications with larger datasets, you might want to consider incremental loads.
Description
We have already copied the data from the underlying database to a specific Cloud Storage bucket. Go ahead and find that bucket, and have a look at its contents. Create a new BigQuery dataset called raw in the same region as that storage bucket, and create BigLake tables for the following entities: person, sales_order_header and sales_order_detail. You can ignore the other files for now. Make sure to name the Cloud Resource connection conn and to create it in the same region as the storage bucket.
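As a rough illustration of the pieces involved, the following Python sketch only builds the BigQuery DDL statements for the dataset and the three BigLake tables; it does not run them. The project id, region, bucket name, file layout (one CSV file per entity) and file format are all hypothetical placeholders, so check the actual bucket contents before relying on any of them. The Cloud Resource connection is not created through SQL; one option is the bq command-line tool (bq mk --connection with --connection_type=CLOUD_RESOURCE), after which the connection's service account needs read access to the bucket.

```python
# Sketch only: builds DDL strings, nothing is sent to BigQuery.
# Project, region, bucket, file layout and format below are assumptions.
ENTITIES = ["person", "sales_order_header", "sales_order_detail"]


def dataset_ddl(project: str, region: str, dataset: str = "raw") -> str:
    """Return DDL that creates the dataset in the given region."""
    return (
        f"CREATE SCHEMA IF NOT EXISTS `{project}.{dataset}`\n"
        f"OPTIONS (location = '{region}');"
    )


def biglake_table_ddl(
    project: str,
    region: str,
    entity: str,
    uri: str,
    dataset: str = "raw",
    connection: str = "conn",
    file_format: str = "CSV",  # assumption: the source does not state the format
) -> str:
    """Return DDL for one BigLake table that reads `uri` via the connection."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{project}.{dataset}.{entity}`\n"
        f"WITH CONNECTION `{project}.{region}.{connection}`\n"
        f"OPTIONS (format = '{file_format}', uris = ['{uri}']);"
    )


def all_statements(project: str, region: str, bucket: str) -> list[str]:
    """Return the dataset DDL followed by one table DDL per entity."""
    statements = [dataset_ddl(project, region)]
    for entity in ENTITIES:
        # Hypothetical layout: gs://<bucket>/<entity>.csv
        uri = f"gs://{bucket}/{entity}.csv"
        statements.append(biglake_table_ddl(project, region, entity, uri))
    return statements


if __name__ == "__main__":
    # Placeholder values; replace with the real project, region and bucket.
    for stmt in all_statements("my-project", "us-central1", "my-landing-bucket"):
        print(stmt, end="\n\n")
```

The printed statements can be pasted into the BigQuery console or passed to a client library. Keeping the region as a single parameter makes it harder to accidentally place the dataset and the connection in different locations.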
Success Criteria
- There is a new BigQuery dataset raw in the same region as the landing bucket.
- There is a new Cloud Resource connection with the id conn in the same region as the landing bucket.
- There are 3 BigLake tables with content in the raw dataset: person, sales_order_header and sales_order_detail.
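One informal way to check the "with content" criterion is to count the rows in each table. The sketch below is only an illustration: it builds the count queries and evaluates row counts that you have collected yourself, and the project id is a hypothetical placeholder.

```python
# Sketch only: builds count queries and checks counts gathered elsewhere.
ENTITIES = ["person", "sales_order_header", "sales_order_detail"]


def row_count_query(project: str, entity: str, dataset: str = "raw") -> str:
    """Return a query that counts the rows of one table."""
    return f"SELECT COUNT(*) AS row_count FROM `{project}.{dataset}.{entity}`;"


def failing_tables(row_counts: dict[str, int]) -> list[str]:
    """Return the expected entities that are missing or have no rows."""
    return [e for e in ENTITIES if row_counts.get(e, 0) <= 0]


if __name__ == "__main__":
    # "my-project" is a placeholder project id.
    for entity in ENTITIES:
        print(row_count_query("my-project", entity))
```

An empty list from failing_tables means every expected table returned at least one row.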