Challenge 2: Staging tables


Introduction

Before we create our dimensional model, we’ll first do some cleanup. There is a plethora of approaches and modeling techniques here (Data Vault, normalized persistent/ephemeral staging tables, etc.), but we’ll keep things simple again. Our source data is already relational and has the correct structure (3NF), so we’ll stick to that data model and only do some minimal cleansing.

Description

Some of the tables contain duplicate records and problematic columns that we’d like to remove. Create a new BigQuery dataset called curated and, in it, create a new table for each BigLake table from the previous challenge. Name the new tables by prefixing them with stg_, remove any duplicate records, and drop any columns that contain only null values. Make sure that the columns order_date, due_date, ship_date, birth_date and date_first_purchase have the data type DATE in the new tables.
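A minimal sketch of one way to do this is shown below. It assumes the BigLake tables live in a dataset named raw and that the datasets are in the EU region; both names are assumptions, as is the set of columns in the EXCEPT list, so extend it with whatever null-only columns your profiling finds.

```sql
-- Create the curated dataset; the location is an assumption, use the
-- same region as your existing datasets.
CREATE SCHEMA IF NOT EXISTS curated OPTIONS (location = 'EU');

-- One staging table per BigLake table (stg_sales_order_header shown;
-- repeat the pattern for stg_person and stg_sales_order_detail).
CREATE OR REPLACE TABLE curated.stg_sales_order_header AS
SELECT DISTINCT                                -- drops duplicate records
  * EXCEPT (order_date, due_date, ship_date),  -- extend with any null-only columns
  CAST(order_date AS DATE) AS order_date,      -- re-add the date columns as DATE
  CAST(due_date AS DATE) AS due_date,
  CAST(ship_date AS DATE) AS ship_date
FROM raw.sales_order_header;                   -- 'raw' dataset name is an assumption
```

If the source date columns turn out to be strings in a non-canonical format, CAST will fail and PARSE_DATE with an explicit format string is the way to go instead.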

Success Criteria

  • There is a new BigQuery dataset curated in the same region as the other datasets.
  • There are 3 BigQuery tables with content in the curated dataset: stg_person, stg_sales_order_header and stg_sales_order_detail, with no duplicate records and no columns containing only null values (a verification sketch follows this list).
  • The columns order_date, due_date, ship_date, birth_date and date_first_purchase in the new tables have the data type DATE.
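As referenced above, a quick way to sanity-check the duplicate criterion is to compare total and distinct row counts per staging table; this is a sketch using stg_person:

```sql
-- total_rows and distinct_rows should be equal if no duplicates remain;
-- TO_JSON_STRING serializes the whole row so that all columns are compared.
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT TO_JSON_STRING(t)) AS distinct_rows
FROM curated.stg_person AS t;
```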

Learning Resources

Tips

  • Data Profile (with 100% sampling!) can help you find columns that contain only null values.
  • EXCEPT is useful when you want to discard a few columns while selecting all columns from a table, as sketched below.
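A minimal illustration of the EXCEPT modifier (the excluded column names here are purely hypothetical):

```sql
-- Select every column except the two listed, without enumerating the rest.
SELECT * EXCEPT (column_a, column_b)
FROM curated.stg_person;
```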
