Challenge 6: Vector Search for scale

Previous Challenge

Introduction

The previous challenge used a naive method for searching through embeddings. As mentioned in that challenge, scanning every row for every query does not scale well. The better alternative is to index the embeddings so that approximate nearest neighbor lookups become possible. This process typically involves 3 steps:

  1. Creating the embeddings; we’ve already done that through BigQuery.
  2. Importing the embeddings and creating an index for efficient lookup.
  3. Deploying that index to an endpoint to serve requests.

This challenge is all about implementing the second and third steps of this process to build a scalable and fast semantic search system.

Description

Create a new Cloud Storage bucket and export the embeddings created in the previous challenge into that bucket in JSON Lines format.
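To make the target file layout concrete, here is a minimal Python sketch of what an exported file could look like. It assumes each record carries an `id` and an `embedding` key, which is the JSON layout the Vector Search documentation describes for input data; the `uri` column name and the sample rows are hypothetical placeholders, so check them against your own table from the previous challenge.

```python
import json
from typing import Iterable, Mapping


def to_json_lines(rows: Iterable[Mapping[str, object]]) -> str:
    """Serialize rows as JSON Lines: one JSON object per line."""
    lines = []
    for row in rows:
        record = {
            # Assumption: the previous challenge stored the document path in a `uri` column.
            "id": str(row["uri"]),
            # Assumption: the embedding is a plain list of numbers.
            "embedding": [float(x) for x in row["embedding"]],
        }
        # Every record, including the last one, ends with a newline.
        lines.append(json.dumps(record) + "\n")
    return "".join(lines)


# Hypothetical sample rows, only to show the shape of the output.
sample_rows = [
    {"uri": "gs://example-bucket/a.pdf", "embedding": [0.1, 0.2, 0.3]},
    {"uri": "gs://example-bucket/b.pdf", "embedding": [0.4, 0.5, 0.6]},
]
print(to_json_lines(sample_rows), end="")
```

In the challenge itself the export would normally be done from BigQuery; the sketch is only meant to show the shape of each line.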

Note You’ll need to pick a single region for the bucket, since the Vector Search index needs to be co-located with it and doesn’t work with multi-region buckets.

Once the embeddings have been exported, create a new Vector Search index. Choose small as the Shard size and 5 as the Approximate neighbours count, work out the right number of Dimensions for your embeddings, and stick to the defaults for the rest of the parameters.

Note JSON Lines is a text format that stores JSON objects, one per line, with each line terminated by a newline character. Typically the .jsonl extension is used to denote these files, but both BigQuery and Vector Search use and expect the .json extension.
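One way to work out the Dimensions value is to inspect the exported file directly. The sketch below reads JSON Lines text and reports the embedding length, failing if the records disagree. The `embedding` field name and the sample data are assumptions; adjust them to match your export.

```python
import json


def embedding_dimensions(jsonl_text: str, field: str = "embedding") -> int:
    """Return the common embedding length across all JSON Lines records."""
    lengths = set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines such as a trailing newline
        lengths.add(len(json.loads(line)[field]))
    if len(lengths) != 1:
        raise ValueError(f"expected one embedding length, found {sorted(lengths)}")
    return lengths.pop()


# Hypothetical two-record file with 3-dimensional embeddings.
sample = (
    '{"id": "a", "embedding": [0.1, 0.2, 0.3]}\n'
    '{"id": "b", "embedding": [0.4, 0.5, 0.6]}\n'
)
print(embedding_dimensions(sample))
```

The same check doubles as a sanity test before creating the index, since records with differing lengths would not fit a single Dimensions setting.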

Once the index is ready (this should take less than a minute; refresh the page if Status is not Ready yet), create a new endpoint and deploy the index to that endpoint (use a machine type with 2 vCPUs and stick to the defaults for the rest). Deploying the index to the endpoint will take about 15 minutes, so start working on how to use the endpoint while the index is being deployed.

Now run the same query as in the previous challenge, "Which paper is about characteristics of living organisms in alien worlds?", through the REST API. You should get the uri of the corresponding paper.
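As a starting point for the REST call, the following Python sketch builds a `findNeighbors` request and extracts the `datapointId` values from a response. The URL pattern and the camelCase field names follow the Vertex AI `findNeighbors` method as best understood here and should be checked against the current API reference; every identifier passed in (endpoint domain, project, region, IDs) and the sample response are hypothetical placeholders. No network call is made.

```python
from typing import Dict, List, Sequence, Tuple


def build_find_neighbors_request(
    endpoint_domain: str,
    project: str,
    region: str,
    index_endpoint_id: str,
    deployed_index_id: str,
    query_embedding: Sequence[float],
    neighbor_count: int = 5,
) -> Tuple[str, Dict[str, object]]:
    """Return the URL and JSON body for a findNeighbors call (assumed layout)."""
    url = (
        f"https://{endpoint_domain}/v1/projects/{project}/locations/{region}"
        f"/indexEndpoints/{index_endpoint_id}:findNeighbors"
    )
    body = {
        "deployedIndexId": deployed_index_id,
        "queries": [
            {
                "datapoint": {"featureVector": list(query_embedding)},
                "neighborCount": neighbor_count,
            }
        ],
    }
    return url, body


def nearest_datapoint_ids(response: Dict[str, object]) -> List[str]:
    """Collect datapointId values from a findNeighbors-style response."""
    return [
        neighbor["datapoint"]["datapointId"]
        for result in response.get("nearestNeighbors", [])
        for neighbor in result.get("neighbors", [])
    ]


# Hypothetical response, shaped like the assumed API output.
sample_response = {
    "nearestNeighbors": [
        {"neighbors": [{"distance": 0.9, "datapoint": {"datapointId": "gs://example-bucket/doc.pdf"}}]}
    ]
}
print(nearest_datapoint_ids(sample_response))
```

Sending the request also requires an `Authorization: Bearer` header with an access token, for example one obtained from `gcloud auth print-access-token`, and the query embedding must come from the same embedding model used for the documents.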

Success Criteria

  • Running the query returns the uri of the paper with the title Solvent constraints for biopolymer folding and evolution in extraterrestrial environments (the document name should be 2310.00067.pdf) as the datapointId.

Learning Resources

Tips

  • Just as in the previous challenge, you’ll need to convert the query to text embeddings before you can query the endpoint. You can use the same methods as in the previous challenge to do that.
