Challenge 5: Multi-engine polyglot
Introduction
A true Data Lakehouse isn’t tied to a single engine. To demonstrate that the architecture is open, we will modify the data using a completely different compute engine: Apache Spark.
Description
We are going to use Dataproc Serverless via a BigQuery Notebook to easily combine Spark with BigQuery.
The first step is to create a new notebook in BigQuery so you can start writing Spark SQL. Currently, you also need to run an additional statement that explicitly exports the metadata for our Iceberg tables to GCS; in the future this will no longer be necessary, as it will happen automatically. Add these statements early in your notebook.
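The metadata export is a plain BigQuery DDL statement, run once per Iceberg table before Spark reads it. A minimal sketch, assuming a dataset named `lakehouse` containing the `order_items` and `products` tables (substitute your own dataset and table names):

```sql
-- Export the current Iceberg metadata snapshot to the table's GCS location
-- so external engines such as Spark can read a consistent view.
EXPORT TABLE METADATA FROM lakehouse.order_items;
EXPORT TABLE METADATA FROM lakehouse.products;
```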
Next, we create an interactive Spark session to use for our querying.
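Creating the session follows the Spark Connect pattern for Dataproc Serverless in BigQuery notebooks. A minimal sketch, assuming the `google-cloud-dataproc` and Spark Connect client libraries available in the notebook runtime; the project, region, and subnetwork path are placeholders you must replace with your own values:

```python
from google.cloud.dataproc_v1 import Session
from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Configure the serverless session: runtime version 2.3 and the
# subnet made available to you (both mentioned in the Tips below).
session = Session()
session.runtime_config.version = "2.3"
session.environment_config.execution_config.subnetwork_uri = (
    "projects/YOUR_PROJECT/regions/YOUR_REGION/subnetworks/YOUR_SUBNET"  # placeholder
)

# Start (or reuse) the interactive Spark session.
spark = DataprocSparkSession.builder.dataprocSessionConfig(session).getOrCreate()
```

Once `spark` is available, subsequent cells can run queries with `spark.sql(...)`.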
Now that we have the setup done, we can query our BigQuery managed Iceberg tables using Spark. For each product brand, calculate the total revenue, the total profit, and the “Return Rate.” This will help identify which brands are the most profitable and which might have quality issues causing a high rate of returns.
Requirements:
- Only include items where the order status is not ‘Cancelled’.
- Revenue: Sum of sale_price from order_items.
- Profit: Sum of (sale_price from order_items minus cost from products).
- Return Rate: The percentage of items where returned_at is not null.
- Filter the final results to only show brands that have generated at least $500 in total revenue.
- Order the results by Profit in descending order.
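The requirements above translate fairly directly into a single aggregation query. A sketch in Spark SQL, assuming the tables are registered as `order_items` and `products`, that the order status lives in an `order_items.status` column, and that the join key is `order_items.product_id = products.id` (verify these against your schema):

```sql
SELECT
  p.brand,
  SUM(oi.sale_price)                                   AS total_revenue,
  SUM(oi.sale_price - p.cost)                          AS total_profit,
  100.0 * count_if(oi.returned_at IS NOT NULL) / COUNT(*) AS return_rate_pct
FROM order_items AS oi
JOIN products AS p
  ON oi.product_id = p.id
WHERE oi.status <> 'Cancelled'          -- exclude cancelled items
GROUP BY p.brand
HAVING SUM(oi.sale_price) >= 500        -- only brands with at least $500 revenue
ORDER BY total_profit DESC;
```

Run it from the notebook with `spark.sql(query).show()` and check that the return-rate column lands in the range given in the Success Criteria.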
Success Criteria
- A Dataproc Serverless session is started in a BigQuery Notebook
- The total revenue, total profit, and return rate percentage are calculated for each brand. The return rates should fall between 7% and 15%.
Learning Resources
Tips
- Create the Spark session with runtime version 2.3 (set via runtime_config) and the available subnet (set via environment_config.execution_config)
- A Notebook runtime template is available to you. Make sure you create your runtime based on this template.