Challenge 3: Schema evolution and time travel
Previous Challenge Next Challenge
Introduction
In traditional data lakes, renaming a column or updating a specific row often requires a “rewrite-and-replace” of the entire dataset. Apache Iceberg eliminates this overhead by managing metadata at the file level and supporting full ACID transactions. Using standard BigQuery SQL, you can perform granular updates without disrupting concurrent readers.
Beyond simple updates, Iceberg provides a safety net through Time Travel. Because every change creates a new immutable snapshot, you can query the table exactly as it existed at a previous specific point in time.
Description
1. Clean up the product category
The products table contains inconsistent naming conventions in the category column. Standardize the records by merging overlapping categories:
- The category ‘Socks & Hosiery’ and ‘Socks’ are overlapping. Update ‘Socks & Hosiery’ to simply ‘Socks’.
- The category ‘Pants & Capris’ and ‘Pants’ are overlapping. Update ‘Pants & Capris’ to simply ‘Pants’.
2. Add a new column and set its value
Business requires a new flag to track order verification.
- Evolve the schema by adding a new boolean column named is_verified to the orders table.
- Set this value to FALSE for all records where an order was returned, and TRUE for all other records.
3. Time travel with Iceberg travel
Accidents happen. Data gets deleted or updated incorrectly. Use Time Travel to run a SQL query which calculates the average retail price of products with category = ‘Pants & Capris’ (which you have just updated to “Pants”).
Success Criteria
- A SELECT DISTINCT on the category column shows no instances of “Socks & Hosiery” or “Pants & Capris.”
- The orders table contains the is_verified column with the correct boolean logic applied.
- The Time Travel query successfully returns a correct average retail price for the now-deleted “Pants & Capris” category.
- Additional metadata files have been created in the GCS folders of those tables that have been updated.