I’ve lost count of the number of times I’ve seen a question about training GPT on large blocks of text, whether it’s the contents of a book, company policy documents, or a knowledge base.
The issue that we all struggle with is how to convert large blocks of text into fine-tuning sets for GPT.
You may also want to use embedding if you want a closed knowledge base and you don’t want GPT’s other training to be used. In essence, you can use GPT’s language skills to ask questions about blocks of text that you find using a semantic search.
When it’s suggested that embedding is a better idea, most people give up or try to read the online documentation.
An Overview of Embedding
In this section of the course, we will explain what embedding is and walk you through the process of preparing the data and using it for semantic search. The full version of the course also discusses how you can integrate offline embedding engines (not from GPT) with GPT for querying private knowledge sources.
The examples are in Python, but they can usually be converted into other programming languages. Other parts of the main course give examples of how to do this.
What Are Embeddings?
Before you can use embeddings, you need to understand what an embedding vector is. In this video, we will talk about vectors and explain how they work.
ADA 002 Embedding Engine
In this video we talk about the ADA 002 embedding engine. We talk about how it differs from previous versions and how it is an extremely cost-effective way to access GPT’s language processing skills.
OpenAI and GPT Embedding Code Example
Often, the quickest way to learn is to jump straight into the code. In this video we generate an embedding vector by using the OpenAI library.
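As a taste of what that first call looks like, here is a minimal sketch, assuming the openai Python package (version 1.x) and an API key in the OPENAI_API_KEY environment variable. Older versions of the library expose the same endpoint through openai.Embedding.create instead.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask the ada-002 engine for the embedding vector of a single string
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="The quick brown fox jumped over the lazy dog",
)

vector = response.data[0].embedding  # a plain Python list of 1536 floats
print(len(vector), vector[:5])
```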
OpenAI Embedding Code Libraries
Embedding requires the use of several additional libraries. In this video we list the libraries and explain what they are used for. We also explain how you can get the same functionality if you are not using Python.
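For reference, a typical set of imports for the Python examples in this section looks something like the sketch below; the exact list depends on which videos you follow.

```python
import numpy as np          # vector maths: arrays and dot products
import pandas as pd         # loading and manipulating the CSV data
import tiktoken             # counting tokens before embedding
from openai import OpenAI   # calling the embedding and chat endpoints
```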
Similarity Checking with Embedding
The basis of embedding is to compare two (or more) pieces of text with the goal of finding the text that is most similar to your search term. In this video, we use embedding vectors to compare four pieces of text. We will discover how to use the dot product and what the numbers all mean.
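As a rough sketch of the idea, here is a comparison of four short pieces of text against a search term; the texts and the helper function are illustrative, not the exact ones used in the video.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def get_embedding(text, model="text-embedding-ada-002"):
    """Return the embedding vector for one piece of text."""
    return client.embeddings.create(model=model, input=text).data[0].embedding

texts = [
    "The cat sat on the mat",
    "A kitten is resting on the rug",
    "Quarterly profits rose by ten percent",
    "The dog barked at the postman",
]
search_term = "a sleeping cat"

search_vec = np.array(get_embedding(search_term))
for text in texts:
    # ada-002 vectors are normalised, so the dot product is the cosine similarity
    score = np.dot(search_vec, np.array(get_embedding(text)))
    print(f"{score:.4f}  {text}")
```

The highest score should belong to the kitten sentence, even though it shares almost no words with the search term; that is the point of semantic search.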
Preparing the Data for Embedding
Before you can search an embedded database, you need to do some preparation work. Embedding doesn’t involve storing your data on the OpenAI servers. Instead, you will set up a local data source that has been encoded for semantic search, clustering, and classification.
Introduction to the CSV File
The examples use an open source CSV file of Amazon recipe reviews. In this video we do a quick introduction and look at the structure of the CSV file.
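If you want to poke at the file yourself, loading it with pandas looks roughly like this; the filename is a placeholder, so use the one from the course download.

```python
import pandas as pd

# Hypothetical filename: substitute the CSV supplied with the course
df = pd.read_csv("Reviews.csv")

print(df.shape)      # number of rows and columns
print(df.columns)    # the review fields, e.g. a score, a summary, and the review text
print(df.head(3))    # a quick look at the first few records
```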
Creating the "combined" field
We need to set up the data we are going to embed and search. In this video, we use some Python code to combine fields and change the formatting slightly. The field we end up with will be the field that we use to create embedding vectors.
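As an illustration, combining a short summary column with the full review text might look like the sketch below; the column names Summary and Text are assumptions based on the review dataset, so adjust them to match your file.

```python
# Build the "combined" field: the text we will embed and search.
# Column names are assumed; change them to match your CSV.
df["combined"] = (
    "Title: " + df["Summary"].str.strip()
    + "; Content: " + df["Text"].str.strip()
)

print(df["combined"].head(2))
```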
Cleaning the data for Embedding
The embedding engine has a token limit. In this video, we use the tokenizer to filter the dataset and remove any records that would be too long for the embedding engine.
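Here is a sketch of that filtering step using the tiktoken tokenizer; the cut-off below is a judgement call that stays under the ada-002 input limit of a little over 8,000 tokens.

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # the tokenizer used by ada-002
MAX_TOKENS = 8000  # stay safely under the ada-002 input limit

# Count the tokens in each combined record and drop any that are too long
df["n_tokens"] = df["combined"].apply(lambda text: len(encoding.encode(text)))
df = df[df["n_tokens"] <= MAX_TOKENS]
```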
The Get_Embedding Function in Detail
We don’t have to use a function to do embedding, but it can make it easier to wrap everything into one function call. This is especially true if you want to use lambda functions. In this video, we create a function that we can call to get an embedding vector.
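One possible shape for that function, assuming the openai 1.x client; replacing newlines is a common precaution rather than a requirement.

```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text, model="text-embedding-ada-002"):
    """Return the embedding vector for a single piece of text."""
    text = text.replace("\n", " ")  # newlines are often stripped before embedding
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding
```

Wrapping the call like this is what lets you write one-liners such as df["combined"].apply(get_embedding) in the next video.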
Embedding and Saving the Final Data Source
Now it’s time to actually embed our data. We do this by calling the get_embedding function for each row of data. By the time the video has finished, we will have a new CSV file that can be used as a source for future searches: an embedded data source.
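Putting it together, the row-by-row embedding and save step might look like this sketch; the output filename is an arbitrary choice, and the apply call makes one API request per row, so expect it to take time (and cost money) on a large dataset.

```python
# Embed every combined record: one API call per row
df["embedding"] = df["combined"].apply(get_embedding)

# Save the embedded data source for the search videos that follow.
# Note that the vectors are written out as strings, so they will need
# converting back into numeric arrays when the file is reloaded.
df.to_csv("reviews_with_embeddings.csv", index=False)
```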
Doing a Semantic Search with the Embedded Data
Now that we have an embedded data source (hopefully you watched the previous videos in this section), we can do a semantic search.
Loading the Pre-Prepared Data
In this video I show you how to load the CSV file into memory and how to convert the vectors from strings to vector objects. They need to be objects so we can do calculations on them when we do our search.
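That string-to-vector conversion can be done with ast.literal_eval and NumPy, roughly as follows; the filename matches the one used in the earlier saving sketch, so adjust it to your own.

```python
import ast

import numpy as np
import pandas as pd

df = pd.read_csv("reviews_with_embeddings.csv")

# Each saved embedding is a string such as "[0.0123, -0.0456, ...]".
# Parse it back into a numeric array so we can do dot products on it.
df["embedding"] = df["embedding"].apply(lambda s: np.array(ast.literal_eval(s)))
```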
Doing the Actual Semantic Search
Once the data is loaded, we can do a semantic search. In this video we do the actual search and look at the results.
We also explain how you can use the results to ask GPT specific questions about its content.
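A sketch of both steps, reusing the data frame loaded above; the query, the model name, and the prompt wording are all illustrative.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def get_embedding(text, model="text-embedding-ada-002"):
    return client.embeddings.create(model=model, input=text).data[0].embedding

def search_reviews(df, query, n=3):
    """Return the n records most similar to the query."""
    query_vec = np.array(get_embedding(query))
    results = df.copy()
    results["similarity"] = results["embedding"].apply(lambda v: np.dot(v, query_vec))
    return results.sort_values("similarity", ascending=False).head(n)

top = search_reviews(df, "delicious beans", n=3)
print(top[["similarity", "combined"]])

# Feed the best matches to GPT as context and ask a question about them
context = "\n\n".join(top["combined"])
chat = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the supplied reviews."},
        {"role": "user", "content": f"Reviews:\n{context}\n\nQuestion: What do people like about this product?"},
    ],
)
print(chat.choices[0].message.content)
```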