Challenge 5: Simple semantic search


Introduction

Embeddings are a way of representing data as points in a vector space where the locations of those points are semantically meaningful. The data could be a word, a piece of text, an image, a video, etc. The idea is that once these entities are converted to embedding vectors, entities that are similar (for instance, in meaning) end up closer to each other in that vector space.

The objective of this challenge is to build a search system that goes beyond keyword search. We'll convert our summaries to text embeddings and then run a query, expressed as a natural language sentence, against those embeddings to find the paper that comes closest in meaning. All of that is possible within BigQuery.

Description

As in the previous challenge, create a remote model in BigQuery, this time for text embeddings. Run that model on the summaries table and store the results in a new table with the following columns: uri, title, summary, text_embedding.
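A minimal sketch of these two steps is shown below. The connection name, model name, dataset, table names, and the embedding endpoint are all assumptions; substitute your own. Note that `ML.GENERATE_EMBEDDING` expects the text to embed in a column named `content` and returns the vector in `ml_generate_embedding_result`.

```sql
-- Sketch only: connection, dataset, and endpoint names are assumptions.
CREATE OR REPLACE MODEL `my_dataset.embedding_model`
  REMOTE WITH CONNECTION `us.my_connection`
  OPTIONS (ENDPOINT = 'text-embedding-004');

-- Embed every summary and store the results in a new table.
CREATE OR REPLACE TABLE `my_dataset.summary_embeddings` AS
SELECT
  uri,
  title,
  content AS summary,
  ml_generate_embedding_result AS text_embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `my_dataset.embedding_model`,
  -- The input subquery must expose the text as a column named `content`.
  (SELECT uri, title, summary AS content FROM `my_dataset.summaries`)
);
```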

Once the table is there, run a SQL query that computes the COSINE distance between the embedding of every row of the newly generated table and the embedding of the query Which paper is about characteristics of living organisms in alien worlds?, and return only the row with the smallest distance.

Note BigQuery has recently introduced vector search and vector indexes to make these types of searches more efficient. We'll keep to the naive approach for this challenge (the next challenge introduces vector search and indexes), so do not create vector indexes and stick to ML.DISTANCE for the search.
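The naive search could look like the sketch below: embed the question with the same model, then compare it against every stored embedding with `ML.DISTANCE` and keep the closest row. Table and model names match the assumptions above and are placeholders.

```sql
-- Sketch only: table and model names are assumptions.
SELECT
  base.uri,
  base.title,
  ML.DISTANCE(
    base.text_embedding,
    query.ml_generate_embedding_result,
    'COSINE'
  ) AS distance
FROM
  `my_dataset.summary_embeddings` AS base,
  -- Embed the natural-language question with the same remote model.
  ML.GENERATE_EMBEDDING(
    MODEL `my_dataset.embedding_model`,
    (SELECT 'Which paper is about characteristics of living organisms in alien worlds?' AS content)
  ) AS query
ORDER BY distance ASC
LIMIT 1;
```

Because there is no vector index, this is a brute-force scan: every row's embedding is compared against the query embedding, which is fine at this scale and is exactly what the next challenge improves on.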

Success Criteria

  • Running the SQL query for the provided query returns the following paper: Solvent constraints for biopolymer folding and evolution in extraterrestrial environments

Learning Resources
