Challenge 5: Simple semantic search
Introduction
Embeddings are a way of representing data as points in space where the locations of those points are semantically meaningful. The data could be a word, a piece of text, an image, a video, etc. The idea is that once these entities are converted to embedding vectors, entities that are similar (for instance, in meaning) end up closer to each other in that vector space.
The objective of this challenge is to build a search system that goes beyond keyword search. We'll convert our summaries to text embeddings and then run a query (a natural-language sentence) against the summaries to find the paper that is the closest match. All of that is possible within BigQuery.
Description
Similarly to the previous challenge, create a remote model in BigQuery for text embeddings. Run that model on the summaries table and store the results in a new table with the following columns: `uri`, `title`, `summary`, `text_embedding`.
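A minimal sketch of these two steps follows, assuming a dataset named `cymbal`, a BigQuery connection named `us.embedding_conn`, a `summaries` table with `uri`, `title`, and `summary` columns, and the `text-embedding-004` endpoint; adjust all of these names to your own setup.

```sql
-- Create a remote model that points at a Vertex AI text embedding endpoint.
-- Dataset, connection, and endpoint names below are placeholders.
CREATE OR REPLACE MODEL `cymbal.text_embedding_model`
  REMOTE WITH CONNECTION `us.embedding_conn`
  OPTIONS (ENDPOINT = 'text-embedding-004');

-- Embed every summary and store the results in a new table.
-- ML.GENERATE_EMBEDDING expects the text column to be named `content`.
CREATE OR REPLACE TABLE `cymbal.summary_embeddings` AS
SELECT
  uri,
  title,
  content AS summary,
  ml_generate_embedding_result AS text_embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `cymbal.text_embedding_model`,
  (SELECT uri, title, summary AS content FROM `cymbal.summaries`),
  STRUCT(TRUE AS flatten_json_output)
);
```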
Once the table is there, do a SQL search by `COSINE` distance between every row of the newly generated table and the query "Which paper is about characteristics of living organisms in alien worlds?", and show only the row with the smallest distance.
Note: BigQuery has recently introduced vector search and vector indexes to make this type of search more efficient. We'll keep to the naive approach for this challenge (the next challenge will introduce vector search and indexes), so do not create vector indexes and stick to `ML.DISTANCE` for the search, as in the sketch below.
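One possible shape of that search, reusing the placeholder names from the earlier sketch: embed the question with the same model, then order the rows by `ML.DISTANCE` with the `COSINE` distance type and keep the top result.

```sql
-- Embed the question once, then compare it against every stored embedding.
WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `cymbal.text_embedding_model`,
    (SELECT 'Which paper is about characteristics of living organisms in alien worlds?' AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT
  s.uri,
  s.title,
  ML.DISTANCE(s.text_embedding, q.embedding, 'COSINE') AS distance
FROM `cymbal.summary_embeddings` AS s
CROSS JOIN query_embedding AS q
ORDER BY distance
LIMIT 1;
```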
Success Criteria
- Running the SQL query for the provided query returns the following paper: Solvent constraints for biopolymer folding and evolution in extraterrestrial environments
Learning Resources
- BigQuery text embedding support
- BigQuery documentation on ML.GENERATE_EMBEDDING and ML.DISTANCE