Retrievers - Concepts

Retrievers can be used for semantic search or for RAG (retrieval augmented generation). A configured retriever can automatically compute embeddings (vectors) for our input data. Once this is done, it offers a few retrieve() interfaces to return data based on a query.

Concepts

Data sources and data types

A Data source is where the data to be embedded/retrieved comes from.

  • Table: a column in a table in the PG database
  • Volume: a PGFS "volume" which is a wrapper for accessing e.g. an S3 object store

Data types are

  • text
  • image

You can combine any data source with any data type:

  • Table & Text: The text, in this case, would be just a column with TEXT/VARCHAR type in a Postgres table. This could be the description for a list of products.
  • Table & Image: The image, in this case, would be stored in a BYTEA column in our table which would hold the bytes of the image. We do not, currently, support large objects (also known as BLOB) types at the moment.
  • Volume & Text: For example, this could be text files in an S3 bucket.
  • Volume & Image: The images could be stored as objects in an S3 bucket.

Embedding

Retrievers always need a "vector table" to store the vector embeddings that belong to the source data. We use a model with "encoder" capabilities to generate the vectors for the source data.

Each model has a different "dimensionality" i.e. number of dimensions of the vector (the vector [1, 2, 2] has 3 dimensions). We use PGvector "vector" column types to store the embeddings. And those have a fixed number of dimensions. For this reason, the encoder interface on the models returns how many dimensions the vectors will have.

We create the vector table, or support using an existing one. The vector column dimensionality must fit the model.

Consistency with source data

Embeddings must be in-sync with the source data in order to retrieve correct and consistent data. In the case of the table data source, we use triggers in postgres to be notified about changes on the source table.

If the source table already has data/rows at the time where the retriever is created, then a manual "bulk embedding" call must be made.

Auto embedding can be activated to keep the embeddings in sync going forward.

Bulk & Auto embedding

Bulk embedding generates the embeddings for all the existing data in a source table. Use this to initialize the retriever if the source table is not empty.

Note

Bulk embedding does not "sync" the data; it’s best to run this on an empty vector table.

Auto embedding can be manually enabled (automatically too?? TODO). It creates triggers in PG to keep the embeddings up to data if source data is added, updated or removed.

Note

Auto embedding only works with the "table" data source today; not the volume source.

Synchronous vs. Asynchronous embedding computation

Today, all embedding computation is synchronous:

  • Bulk-embedding is a blocking call that can’t be paused
  • Auto-embedding runs "on top of" other queries during insert/update while waiting on the embedding computation. This means insert/update queries on the source table will be slowed down.

Could this page be better? Report a problem or suggest an addition!