Challenge 2: Staging tables
Introduction
Before we create our dimensional model, we'll first do some cleanup. There are plenty of possible approaches and modeling techniques here (Data Vault, normalized persistent/ephemeral staging tables, etc.), but we'll keep things simple again. Our source data is already relational and correctly structured (3NF), so we'll stick to that data model and only do some minimal cleansing.
Description
Some of the tables have duplicate records and problematic columns that we'd like to remove. Create a new BigQuery dataset called `curated`, and in it create a new table for each BigLake table from the previous challenge. Name the new tables by prefixing them with `stg_`, and remove any duplicate records as well as any columns containing only `null` values. Make sure that the columns `order_date`, `due_date`, `ship_date`, `birth_date` and `date_first_purchase` have the data type `DATE` in the new tables.
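One way to combine all three cleanup steps in a single statement is `CREATE TABLE ... AS SELECT DISTINCT * EXCEPT (...)` with explicit casts. The sketch below is illustrative only: the source dataset name (`raw`), the source table name, the dataset location, and the columns listed in `EXCEPT` are placeholders you'll need to replace with what you actually find (for example via a Data Profile scan).

```sql
-- Sketch, with placeholder names -- adjust to your project.
CREATE SCHEMA IF NOT EXISTS curated
OPTIONS (location = 'EU');  -- placeholder: use the same region as your other datasets

CREATE OR REPLACE TABLE curated.stg_sales_order_header AS
SELECT DISTINCT                               -- removes exact duplicate records
  * EXCEPT (some_all_null_column,             -- placeholder: the all-null columns you found
            order_date, due_date, ship_date), -- excluded here so they can be re-added as DATE
  CAST(order_date AS DATE) AS order_date,
  CAST(due_date   AS DATE) AS due_date,
  CAST(ship_date  AS DATE) AS ship_date
FROM raw.sales_order_header;                  -- placeholder: your BigLake dataset/table
```

Repeat the same pattern for `stg_person` and `stg_sales_order_detail`, casting `birth_date` and `date_first_purchase` where they occur.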
Success Criteria
- There is a new BigQuery dataset `curated` in the same region as the other datasets.
- There are 3 BigQuery tables with content in the `curated` dataset: `stg_person`, `stg_sales_order_header` and `stg_sales_order_detail`, with no duplicate records and no columns with only `null` values.
- The columns `order_date`, `due_date`, `ship_date`, `birth_date` and `date_first_purchase` in the new tables have the data type `DATE`.
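A quick way to verify the deduplication criterion is to compare the total row count with the distinct row count; the two should match. This is a sketch assuming the table names above:

```sql
-- If total_rows = distinct_rows, no duplicate records remain.
SELECT
  (SELECT COUNT(*) FROM curated.stg_person) AS total_rows,
  (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM curated.stg_person)) AS distinct_rows;
```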
Learning Resources
Tips
- Data Profile (with 100% sampling!) can help you find `null` columns.
- `EXCEPT` is useful when you want to discard a few columns while selecting all columns from a table.
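As a minimal illustration of the second tip, `SELECT * EXCEPT (...)` returns every column apart from the ones listed; the column and table names here are hypothetical:

```sql
-- Keeps all columns except the two listed (hypothetical names).
SELECT * EXCEPT (additional_contact_info, demographics)
FROM raw.person;
```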