Challenge 1: Setting up AlloyDB and replicating data to BigQuery
| Target Persona: Data Engineer / DBA | Estimated Duration: 45 minutes |
Introduction
AlloyDB is our operational data store, and BigQuery is the workhorse for analytical workloads. In this challenge, you will initialize the operational database in AlloyDB, load the base data from Cloud Storage, generate embeddings at the source using AlloyDB AI, and set up real-time replication to BigQuery using Datastream. Finally, you will explore the replicated data in BigQuery Studio using the new Data Canvas visual interface.
Description
Task 1.1: Ingest Data into AlloyDB
First, let’s connect to your AlloyDB for PostgreSQL cluster.
Tip AlloyDB Postgres Credentials:
- Database:
disney(Note: Connect to the defaultpostgresdatabase first to create thedisneydatabase, then reconnect todisneyfor the rest of the hackathon)- Username:
postgres- Password: Use the one provided by your coach
Connect using AlloyDB Studio or psql and perform the following tasks:
-
Create a Dedicated Database: Connect to the default
postgresdatabase and run the following SQL command to create a new dedicated database:CREATE DATABASE disney;Once created, disconnect and reconnect to your new
disneydatabase. All subsequent tables, extensions, and queries in this hackathon must be run within thedisneydatabase. - Create the Tables:
disneyland_reviews: Represents visitor reviews.- Columns:
review_id(integer, Primary Key),rating(integer),year_month(text),reviewer_location(text),review_text(text),branch(text).
- Columns:
disneyland_attractions: Represents the park’s attractions.- Columns:
attraction_id(integer, Primary Key),branch(text),name(text),description(text).
- Columns:
visitor_movements: Represents visitor movements in the park.- Columns:
visitor_id(integer),from_attraction_id(integer),to_attraction_id(integer),timestamp(timestamp).
- Columns:
-
Add a Full-Text Search Vector: To enable high-performance hybrid search later, add a generated
tsvectorcolumn to the attractions table:ALTER TABLE disneyland_attractions ADD COLUMN description_tsvector tsvector GENERATED ALWAYS AS (to_tsvector('english', description)) STORED; - Import Data from Cloud Storage:
Use the AlloyDB Import API (via UI or
gcloud) or the tool of your choice to import the data from these public CSVs:gs://ghacks-disneyland-on-gcp/reviews.csvintodisneyland_reviewsgs://ghacks-disneyland-on-gcp/attractions.csvintodisneyland_attractionsgs://ghacks-disneyland-on-gcp/visitor_movements.csvintovisitor_movements
Task 1.2: Generate Vector Embeddings at the Source
To support semantic searches on our attractions, we need to generate and store vector embeddings directly inside AlloyDB using AlloyDB AI.
-
Enable the Extensions:
CREATE EXTENSION IF NOT EXISTS vector CASCADE; CREATE EXTENSION IF NOT EXISTS google_ml_integration CASCADE; ALTER DATABASE disney SET google_ml_integration.enable_preview_ai_functions = 'on'; -
Add the Embedding Column: Add a vector column of size 3072 (the output dimension of
gemini-embedding-001) to the attractions table:ALTER TABLE disneyland_attractions ADD COLUMN embedding vector(3072); -
Generate Embeddings: Populate the
embeddingcolumn by calling Gemini Enterprise Agent Platform’s embedding model natively from SQL:-- TODO: Write an UPDATE query that populates the `embedding` column by calling -- Gemini Enterprise Agent Platform's embedding model natively from SQL using the 'gemini-embedding-001' model.
Task 1.3: Set Up Real-Time Replication with One-Click Datastream
To stream our data from AlloyDB to BigQuery in near real-time, we will use Google Datastream via the One-Click Datastream Integration.
Create the Stream (One-Click Setup):
- In the Google Cloud Console, navigate to AlloyDB > Clusters.
- Select your cluster, and look for the Replicate data to BigQuery.
- Follow the guided wizard and customize the stream.
- Ensure to
- Create the stream in the same region as your AlloyDB cluster
- Use Built-in database authentication
- Select the three tables you just created
- Configure the write mode select Merge, staleness limit to 0 seconds and a single datatset for all schemas named disney.
The stream will be created and started automatically, do not wait until its finished. You can start challenge 2 and come back to this step later.
Task 1.4: Prove the Flow with Data Canvas
Once the stream is running, the data will begin replicating to BigQuery.
- Navigate to BigQuery Studio.
- Open Data Canvas (the new visual interface for exploring data).
- Load the replicated
disneyland_reviewstable. - Create a quick visualization (e.g., a bar chart showing the average rating per branch) to verify the data is flowing correctly from AlloyDB.
Success Criteria
To validate this challenge, you must demonstrate the following:
- Verify that
disneyland_reviewshas ~20,000 rows anddisneyland_attractionshas ~73 rows in AlloyDB. - Provide the SQL query you used to generate the embeddings natively in AlloyDB.
- Run a similarity search query in AlloyDB demonstrating the top 5 attractions similar to
'thrilling dark ride in space'. - Show a visualization of your BigQuery Data Canvas showing the replicated reviews.