Challenge 1: Setting up AlloyDB and replicating data to BigQuery

Next Challenge

Target Persona: Data Engineer / DBA Estimated Duration: 45 minutes

Introduction

AlloyDB is our operational data store, and BigQuery is the workhorse for analytical workloads. In this challenge, you will initialize the operational database in AlloyDB, load the base data from Cloud Storage, generate embeddings at the source using AlloyDB AI, and set up real-time replication to BigQuery using Datastream. Finally, you will explore the replicated data in BigQuery Studio using the new Data Canvas visual interface.

Description

Task 1.1: Ingest Data into AlloyDB

First, let’s connect to your AlloyDB for PostgreSQL cluster.

 Tip AlloyDB Postgres Credentials:

  • Database: disney (Note: Connect to the default postgres database first to create the disney database, then reconnect to disney for the rest of the hackathon)
  • Username: postgres
  • Password: Use the one provided by your coach

Connect using AlloyDB Studio or psql and perform the following tasks:

  1. Create a Dedicated Database: Connect to the default postgres database and run the following SQL command to create a new dedicated database:

     CREATE DATABASE disney;
    

    Once created, disconnect and reconnect to your new disney database. All subsequent tables, extensions, and queries in this hackathon must be run within the disney database.

  2. Create the Tables:
    • disneyland_reviews: Represents visitor reviews.
      • Columns: review_id (integer, Primary Key), rating (integer), year_month (text), reviewer_location (text), review_text (text), branch (text).
    • disneyland_attractions: Represents the park’s attractions.
      • Columns: attraction_id (integer, Primary Key), branch (text), name (text), description (text).
    • visitor_movements: Represents visitor movements in the park.
      • Columns: visitor_id (integer), from_attraction_id (integer), to_attraction_id (integer), timestamp (timestamp).
  3. Add a Full-Text Search Vector: To enable high-performance hybrid search later, add a generated tsvector column to the attractions table:

     ALTER TABLE disneyland_attractions 
     ADD COLUMN description_tsvector tsvector 
     GENERATED ALWAYS AS (to_tsvector('english', description)) STORED;
    
  4. Import Data from Cloud Storage: Use the AlloyDB Import API (via UI or gcloud) or the tool of your choice to import the data from these public CSVs:
    • gs://ghacks-disneyland-on-gcp/reviews.csv into disneyland_reviews
    • gs://ghacks-disneyland-on-gcp/attractions.csv into disneyland_attractions
    • gs://ghacks-disneyland-on-gcp/visitor_movements.csv into visitor_movements

Task 1.2: Generate Vector Embeddings at the Source

To support semantic searches on our attractions, we need to generate and store vector embeddings directly inside AlloyDB using AlloyDB AI.

  1. Enable the Extensions:

     CREATE EXTENSION IF NOT EXISTS vector CASCADE;
     CREATE EXTENSION IF NOT EXISTS google_ml_integration CASCADE;
     ALTER DATABASE disney SET google_ml_integration.enable_preview_ai_functions = 'on';
    
  2. Add the Embedding Column: Add a vector column of size 3072 (the output dimension of gemini-embedding-001) to the attractions table:

     ALTER TABLE disneyland_attractions ADD COLUMN embedding vector(3072);
    
  3. Generate Embeddings: Populate the embedding column by calling Gemini Enterprise Agent Platform’s embedding model natively from SQL:

     -- TODO: Write an UPDATE query that populates the `embedding` column by calling 
     -- Gemini Enterprise Agent Platform's embedding model natively from SQL using the 'gemini-embedding-001' model.
    

Task 1.3: Set Up Real-Time Replication with One-Click Datastream

To stream our data from AlloyDB to BigQuery in near real-time, we will use Google Datastream via the One-Click Datastream Integration.

Create the Stream (One-Click Setup):

  1. In the Google Cloud Console, navigate to AlloyDB > Clusters.
  2. Select your cluster, and look for the Replicate data to BigQuery.
  3. Follow the guided wizard and customize the stream.
  4. Ensure to
    • Create the stream in the same region as your AlloyDB cluster
    • Use Built-in database authentication
    • Select the three tables you just created
    • Configure the write mode select Merge, staleness limit to 0 seconds and a single datatset for all schemas named disney.

The stream will be created and started automatically, do not wait until its finished. You can start challenge 2 and come back to this step later.

Task 1.4: Prove the Flow with Data Canvas

Once the stream is running, the data will begin replicating to BigQuery.

  1. Navigate to BigQuery Studio.
  2. Open Data Canvas (the new visual interface for exploring data).
  3. Load the replicated disneyland_reviews table.
  4. Create a quick visualization (e.g., a bar chart showing the average rating per branch) to verify the data is flowing correctly from AlloyDB.

Success Criteria

To validate this challenge, you must demonstrate the following:

  • Verify that disneyland_reviews has ~20,000 rows and disneyland_attractions has ~73 rows in AlloyDB.
  • Provide the SQL query you used to generate the embeddings natively in AlloyDB.
  • Run a similarity search query in AlloyDB demonstrating the top 5 attractions similar to 'thrilling dark ride in space'.
  • Show a visualization of your BigQuery Data Canvas showing the replicated reviews.

Next Challenge