Challenge 2: Staging tables
Introduction
Before we create our dimensional model we'll first do some cleanup. There's a plethora of different approaches and modeling techniques here (Data Vault, normalized persistent/ephemeral staging tables, etc.), but we'll keep things simple again. Since our source data is already relational and in the correct structure (3NF), we'll stick to that data model and only do some minimal cleansing.
Description
Some of the tables have duplicate records and problematic columns that we'd like to remove. Create a new BigQuery dataset called `curated` and create a new table for each BigLake table from the previous challenge. Name the new tables by prefixing them with `stg_`, and remove any duplicate records as well as any columns containing only `null` values. Make sure that the columns `order_date`, `due_date`, `ship_date`, `birth_date` and `date_first_purchase` have the data type `DATE` in the new tables.
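A minimal sketch of this pattern in BigQuery SQL. The source table `raw.person`, the region, the all-`null` column `additional_contact_info`, and the string-typed `birth_date` are all assumptions for illustration; adjust them to match your actual BigLake tables:

```sql
-- Create the curated dataset (region is an assumption: match your other datasets).
CREATE SCHEMA IF NOT EXISTS curated OPTIONS (location = 'EU');

-- Deduplicate, drop an all-NULL column, and cast the date column.
-- Source table and column names here are hypothetical.
CREATE OR REPLACE TABLE curated.stg_person AS
SELECT DISTINCT
  * EXCEPT (additional_contact_info, birth_date),  -- drop the all-NULL column
  CAST(birth_date AS DATE) AS birth_date           -- use SAFE_CAST if some values may not parse
FROM raw.person;
```

The same pattern applies to the other staging tables, each with its own set of date columns and all-`null` columns to handle.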
Success Criteria
- There is a new BigQuery dataset `curated` in the same region as the other datasets.
- There are 3 BigQuery tables with content in the `curated` dataset: `stg_person`, `stg_sales_order_header` and `stg_sales_order_detail`, with no duplicate records and no columns with only `null` values.
- The columns `order_date`, `due_date`, `ship_date`, `birth_date` and `date_first_purchase` in the new tables have the data type `DATE`.
Learning Resources
Tips
- Data Profile (with 100% sampling!) can help you find `null`-only columns.
- `EXCEPT` is useful when you want to discard a few columns when selecting all columns from a table.
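As an alternative to Data Profile, a quick way to spot all-`null` columns is to count non-`NULL` values per column. Table and column names below are hypothetical:

```sql
-- A count of 0 means the column contains only NULLs and can be dropped.
-- Table and column names are illustrative.
SELECT
  COUNTIF(title IS NOT NULL)  AS title_non_null,
  COUNTIF(suffix IS NOT NULL) AS suffix_non_null,
  COUNT(*)                    AS total_rows
FROM raw.person;
```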