Pipelines - Retrievers - Usage

Creating a new retriever configuration

The options for the different data sources (table and volume) are so different that we decided to offer distinct "register" functions for those. If we had a single register function, it would either just accept an opaque JSON arg, or have a lot of different args where some are exclusive. Both are not great UX

Retriever for a table data source

The register_retriever_for_table function is used to create a retriever for a table data source. This is the function signature, you can see many of those are optional and have defaults

register_retriever_for_table(
-------------------------------------------------------------------------------
    p_name                              TEXT,
    p_model_name,                       TEXT,
    p_source_table_name                 regclass,
    p_source_table_data_column          TEXT,
    p_source_table_data_column_type     aidb.RetrieverSourceDataFormat,
    p_source_table_key_column           TEXT DEFAULT 'id',
    p_vector_table_name                 TEXT DEFAULT NULL,
    p_vector_table_vector_column        TEXT DEFAULT 'embeddings',
    p_vector_table_key_column           TEXT DEFAULT 'id',
    p_topk                              INTEGER DEFAULT 1,
    p_distance_operator                 aidb.distanceoperator DEFAULT 'L2',
    p_options                           JSONB DEFAULT '{}'::JSONB
)

Example: Registering a retriever

SELECT aidb.register_retriever_for_table(
               p_name => 'test_retriever',
               p_model_name => 'simple_model',
               p_source_table_name => 'test_source_table',
               p_source_table_data_column => 'content',
               p_source_table_data_column_type => 'Text'
       );

In this example, we use all the defaults.

Creating the Embeddings

Bulk embedding if there is existing data in the source table:

SELECT aidb.bulk_embedding('test_retriever'); 

enable auto-embedding for any future changes:

SELECT aidb.enable_auto_embedding_for_table('test_retriever');

auto-embedding can be disabled as well:

SELECT aidb.disable_auto_embedding_for_table('test_retriever');

Retrieving

A basic key retriever is available that does not look up the source data, but just returns the ID/key of the matching embeddings:

Retrieving the key

aidb.retrieve_key(<retriever name>, <query string>, <optional number of results>);

Example: Retrieving the key

SELECT * FROM aidb.retrieve_key('test_retriever', 'shoes', 2);
Output
key  |      distance
-------+--------------------
 43941 | 0.2938963414490189
 19337 | 0.3023805122617119
(2 rows)

This can be used if you want to do a join/lookup yourself based on the key. For retrievers with external (volume) data sources, this is especially useful. Usually the application itself wants to do the retrieval from the external data source. Or you might want to push-down the actual retrieval to a client application.

The retrieve_text function joins the embeddings with the source data and directly returns the results:

Retrieving the text

The retrieve_text function joins the embeddings with the source data and directly returns the results:

aidb.retrieve_text(<retriever name>, <query string>, <optional number of results>);

Example

SELECT * FROM aidb.retrieve_text('test_retriever', 'jacket', 2);
Output
key  |                       value                        |      distance
-------+----------------------------------------------------+--------------------
 19337 | United Colors of Benetton Men Stripes Black Jacket | 0.2994317672742334
 55018 | Lakme 3 in 1 Orchid  Aqua Shine Lip Color          | 0.3804609668507203
(2 rows)

Listing the retrievers

A view is available that lists all the retrievers. aidb.retrievers also includes some of the retrievers configuration:

SELECT * FROM aidb.retrievers;
Output
 id |        name         |     vector_table_name      | vector_table_key_column | vector_table_vector_column |  model_name  | topk | distance_operator | options | source_table_name | source_table_data_column | source_table_data_column_type | source_table_key_column | source_volume_name
----+---------------------+----------------------------+-------------------------+----------------------------+--------------+------+-------------------+---------+-------------------+--------------------------+-------------------------------+-------------------------+--------------------
  2 | test_retriever      | test_retriever_vector      | id                      | embeddings                 | simple_model |    5 | InnerProduct      | {}      | test_source_table | content                  | Text                          | id                      |
  5 | test_retriever_cosa | test_retriever_cosa_vector | id                      | embeddings                 | simple_model |    1 | L2                | {}      | test_source_table | content                  | Text                          | id                      |
  3 | test_retriever_cos  | test_retriever_cos_vector  | id                      | embeddings                 | simple_model |    5 | Cosine            | {}      | test_source_table | content                  | Text                          | id                      |
(3 rows)

It is recommended that you just select the columns you are interested in.

SELECT name, source_table_name FROM aidb.retrievers;
Output
        name         | source_table_name
---------------------+-------------------
 test_retriever      | test_source_table
 test_retriever_cos  | test_source_table
 test_retriever_cosa | test_source_table
(3 rows)

Deleting a retriever

This will not delete the vector table or anything else, just the configuration:

SELECT aidb.delete_retriever(<name>);

End to end example

You can find an end-to-end example for a table/text retriever at on the Retrievers example page.


Could this page be better? Report a problem or suggest an addition!