Working with AI data stored in Postgres tables
We will first look at working with AI data stored in the columns of a Postgres table.
To see how to use AI data stored in S3-compatible object storage, skip to the next section.
First let's create a Postgres table for some test AI data:
Now let's create a retriever with the newly created products table as the source. We specify product_id as the unique key column, and we define product_name and description as the columns the retriever uses for the similarity search. We use the all-MiniLM-L6-v2 open encoder model from HuggingFace. We set auto_embedding to True so that any future insert, update, or delete on the source table automatically generates, updates, or deletes the corresponding embedding. We also give the retriever a name so that we can identify and reference it in subsequent operations:
Now let's insert some AI data records into the products table. Since we have set auto_embedding to True, the retriever automatically generates the embedding for each inserted record in real time:
Now we can directly use the retriever (specifying the retriever name) for a similarity retrieval of the top K most relevant (most similar) AI data items:
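To make the idea of a top K similarity retrieval concrete, here is a minimal conceptual sketch in Python. It is not the extension's API: the embed() function is a toy bag-of-words stand-in for an encoder model such as all-MiniLM-L6-v2, and the sample rows, function names, and the use of cosine similarity are assumptions made for illustration only.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an encoder model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, embeddings, top_k):
    """Return the keys of the top_k stored embeddings most similar to the query."""
    q = embed(query)
    ranked = sorted(embeddings, key=lambda k: cosine(q, embeddings[k]), reverse=True)
    return ranked[:top_k]


# Hypothetical sample rows: product_id -> product_name plus description
rows = {
    1: "hiking boots waterproof leather boots for mountain trails",
    2: "espresso machine compact machine for home coffee",
    3: "trail running shoes lightweight shoes for mountain trails",
}
embeddings = {pid: embed(text) for pid, text in rows.items()}

top = retrieve("shoes for mountain trails", embeddings, top_k=2)
print(top)  # product ids, most similar first
```

In this toy data set the query shares the most terms with row 3, then row 1, so those two keys come back in that order. A real encoder model compares meaning rather than exact word overlap, but the ranking step works on the same principle.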
Now let's try a retriever without auto embedding. In this mode, the application controls when the embeddings are computed, and they are computed in bulk. For demonstration, we can simply create a second retriever for the same products table that we created above:
We created this second retriever on the products table after we had inserted the AI data records. If we run a retrieve operation now, we do not get back any results:
That's why we first need to run a bulk generation of embeddings. This is achieved via the refresh_retriever() function:
Now we can run the same retrieve operation with the second retriever as above:
Now let's see what happens if we add additional AI data records:
The new data is automatically picked up in the retrieval from the first retriever with auto embeddings:
At the same time the second retriever without auto embedding does not reflect the new data until there is another explicit refresh_retriever() run:
If we now call refresh_retriever() again, the new data is picked up:
It is then returned when we run the retrieve operation again:
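The behavior of the two retrievers in this walkthrough can be summarized with a small in-memory model. This is only a sketch of the semantics described above, not the extension's implementation: the ToyRetriever class, its methods, and the sample product texts are hypothetical, and the "search" is reduced to a simple word match.

```python
class ToyRetriever:
    """Toy in-memory model of a retriever; not the extension's implementation."""

    def __init__(self, source, auto_embedding):
        self.source = source                  # shared dict: key -> text (the "table")
        self.auto_embedding = auto_embedding
        self.embedded = {}                    # key -> embedding stand-in

    def on_insert(self, key):
        # Mimics the trigger: only auto-embedding retrievers react to inserts.
        if self.auto_embedding:
            self.embedded[key] = self._embed(self.source[key])

    def refresh(self):
        # Bulk generation: rebuild embeddings for every current source row.
        self.embedded = {k: self._embed(v) for k, v in self.source.items()}

    def retrieve(self, word):
        # Simplified search: keys whose embedding contains the query word.
        return sorted(k for k, e in self.embedded.items() if word in e)

    @staticmethod
    def _embed(text):
        return set(text.lower().split())


def insert(table, retrievers, key, text):
    """Insert a row and notify every retriever attached to the table."""
    table[key] = text
    for r in retrievers:
        r.on_insert(key)


products = {}
auto = ToyRetriever(products, auto_embedding=True)
insert(products, [auto], 1, "red running shoes")

manual = ToyRetriever(products, auto_embedding=False)  # created after the insert
before_refresh = manual.retrieve("shoes")   # nothing embedded yet
manual.refresh()
after_refresh = manual.retrieve("shoes")

insert(products, [auto, manual], 2, "blue hiking shoes")
auto_sees = auto.retrieve("shoes")          # picks up the new row immediately
manual_sees = manual.retrieve("shoes")      # stale until the next refresh
manual.refresh()
manual_after = manual.retrieve("shoes")
```

The sequence mirrors the steps above: the manual retriever returns nothing until its first refresh, misses the second row until it is refreshed again, and the auto-embedding retriever sees each row as soon as it is inserted.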
We used two different retrievers for the same source data just to demonstrate how auto embedding works compared to an explicit refresh_retriever(). In practice, you may want to combine auto embedding and refresh_retriever() in a single retriever: run refresh_retriever() once to embed the data that existed before you created the retriever, and then rely on auto embedding for any data that is ingested, updated, or deleted afterwards.
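The combined pattern can be sketched as a short lifecycle in Python. Again, this is a conceptual model under stated assumptions, not the extension's code: the table and embeddings dictionaries, the embed() stand-in, and the refresh(), upsert(), and delete() helpers are all hypothetical names chosen for illustration.

```python
def embed(text):
    """Stand-in for the encoder model; a real one returns a vector."""
    return tuple(sorted(set(text.lower().split())))


table = {1: "wool socks", 2: "rain jacket"}  # rows that predate the retriever
embeddings = {}                               # the retriever's stored embeddings


def refresh():
    """One-off bulk pass over all rows currently in the table."""
    embeddings.clear()
    embeddings.update({k: embed(v) for k, v in table.items()})


def upsert(key, text):
    """Insert or update a row; the auto-embedding hook keeps embeddings in sync."""
    table[key] = text
    embeddings[key] = embed(text)


def delete(key):
    """Delete a row; the hook drops its embedding as well."""
    del table[key]
    embeddings.pop(key, None)


empty_at_creation = dict(embeddings)   # pre-existing rows are not embedded yet
refresh()                              # initial embedding of pre-existing data
upsert(3, "canvas tent")               # new row: embedded automatically
upsert(1, "merino wool socks")         # updated row: embedding regenerated
delete(2)                              # deleted row: embedding removed
in_sync = set(embeddings) == set(table) == {1, 3}
```

After the single initial refresh, every later insert, update, and delete keeps the stored embeddings aligned with the table, so no further bulk refresh is needed in this model.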
You should consider relying on refresh_retriever() only, without auto embedding, if you typically ingest a lot of AI data at once in a batched manner.