How LLMs like ChatGPT can use plugins and tools

Spoken language is only the first step because anything can be tokenized

Generated by Midjourney

Large Language Models (LLMs) such as ChatGPT have made a significant impact, but they also have some serious limitations.

One of these limitations is that the models are prewired. They are trained on a large set of documents, so they hold a great deal of knowledge, but once training ends they cannot learn anything new (I have a full article about how neural networks are trained). From this perspective, training a neural network resembles the way evolution shapes instincts far more than the way we learn new things in school. But if these models cannot learn new things, how can they use tools like a calculator or a Google search?

To investigate how this works, I built a small Vue.js app with LangChain.js. LangChain.js is a wrapper library for LLMs and related building blocks such as prompts, agents, and vector databases. With LangChain, you can give tools to OpenAI models, a concept very similar to plugins in ChatGPT. Fortunately, LangChain works in the browser, so in a browser app you can simply inspect the API calls in the network tab.

My code is available here. It is simple, and the code itself matters less here than the requests it sends. In it, I created a LangChain agent executor that can use the calculator tool:

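// Import paths differ between LangChain.js releases. The ones below are an
// assumption based on early versions; check the version you have installed.
import { OpenAI } from "langchain/llms";
import { initializeAgentExecutor } from "langchain/agents";
import { Calculator } from "langchain/tools";

// The rest runs inside an async method of the Vue component.
// temperature: 0 keeps the completions as deterministic as possible.
const model = new OpenAI({
  temperature: 0,
  openAIApiKey: process.env.OPENAI_API_KEY,
});

// The agent may only call the tools listed here.
const tools = [new Calculator()];
this.executor = await initializeAgentExecutor(
  tools,
  model,
  "zero-shot-react-description"
);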

I sent the following prompt to see how the tool is used:

How much is 10+12+33+(5*8)?

LangChain sent the following prompt to OpenAI's text-davinci-003 model:

Answer the following questions as best you can. You have access to the 
following tools:

calculator: Useful for getting the result of a math expression.
The input to this tool should be a valid mathematical expression
that could be executed by a simple calculator.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [calculator]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: How much is 10+12+33+(5*8)?
Thought:
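
The prompt above is mostly boilerplate wrapped around the tool descriptions and the question. As a rough sketch of how such a prompt could be assembled (this is not LangChain's actual source; `buildPrompt` and `calculatorTool` are hypothetical names, and the wording is copied from the request captured above):

```javascript
// Hypothetical helper: assembles a ReAct-style prompt from tool descriptions.
// The scratchpad argument holds earlier Thought/Action/Observation text, if any.
function buildPrompt(tools, question, scratchpad = "") {
  const toolLines = tools.map((t) => `${t.name}: ${t.description}`).join("\n");
  const toolNames = tools.map((t) => t.name).join(", ");
  return [
    "Answer the following questions as best you can. You have access to the following tools:",
    "",
    toolLines,
    "",
    "Use the following format:",
    "",
    "Question: the input question you must answer",
    "Thought: you should always think about what to do",
    `Action: the action to take, should be one of [${toolNames}]`,
    "Action Input: the input to the action",
    "Observation: the result of the action",
    "... (this Thought/Action/Action Input/Observation can repeat N times)",
    "Thought: I now know the final answer",
    "Final Answer: the final answer to the original input question",
    "",
    "Begin!",
    "",
    `Question: ${question}`,
    `Thought:${scratchpad}`,
  ].join("\n");
}

// Tool description taken from the captured prompt.
const calculatorTool = {
  name: "calculator",
  description:
    "Useful for getting the result of a math expression. " +
    "The input to this tool should be a valid mathematical expression " +
    "that could be executed by a simple calculator.",
};

const prompt = buildPrompt([calculatorTool], "How much is 10+12+33+(5*8)?");
console.log(prompt);
```

The prompt deliberately ends with an open "Thought:" so that the model continues the pattern from there.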

The response was the following:

I need to calculate the expression
Action: calculator
Action Input: 10+12+33+(5*8)
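
The framework then has to pull the tool name and its input out of this free-text completion. A minimal sketch of that parsing step, assuming the completion follows the format requested in the prompt (`parseModelOutput` is a hypothetical name, and the real LangChain parser may differ):

```javascript
// Hypothetical parser for the "Action / Action Input / Final Answer" format.
function parseModelOutput(text) {
  // A final answer ends the loop.
  const finalMatch = text.match(/Final Answer:\s*(.*)/s);
  if (finalMatch) {
    return { type: "finish", output: finalMatch[1].trim() };
  }
  // Otherwise the model is expected to name a tool and give its input.
  const actionMatch = text.match(/Action:\s*(.*?)\s*\nAction Input:\s*(.*)/s);
  if (!actionMatch) {
    throw new Error(`Could not parse model output: ${text}`);
  }
  return {
    type: "action",
    tool: actionMatch[1].trim(),
    toolInput: actionMatch[2].trim(),
  };
}

const step = parseModelOutput(
  " I need to calculate the expression\nAction: calculator\nAction Input: 10+12+33+(5*8)"
);
console.log(step.tool, step.toolInput);
```

Throwing on unparseable text is a design choice for this sketch; a production framework would more likely retry or report the problem back to the model.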

LangChain then ran the calculator tool itself, appended the model's response and the tool's result (the Observation line) to the original prompt, and sent this merged prompt to OpenAI:

Answer the following questions as best you can. You have access to the 
following tools:

calculator: Useful for getting the result of a math expression.
The input to this tool should be a valid mathematical expression
that could be executed by a simple calculator.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [calculator]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: How much is 10+12+33+(5*8)?
Thought: I need to calculate the expression
Action: calculator
Action Input: 10+12+33+(5*8)
Observation: 95
Thought:

And the final response was:

I now know the final answer
Final Answer: 95

LLMs have no memory and cannot learn new things. The only thing they see is the input sequence, a list of tokens. For GPT-3.5 the context window is 4096 tokens and for GPT-4 it is 8192, and that budget covers both the prompt and the generated reply. This is limited, but not tiny. So if you want the model to use tools, you must write the instructions into the prompt.
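
Because every tool description takes up part of that token limit, it can be useful to estimate how much room a prompt leaves. The sketch below is only a rough approximation: it assumes about four characters per token, a common rule of thumb for English text rather than a real tokenizer, and `estimateTokens` and `remainingBudget` are hypothetical helper names.

```javascript
// Token limit quoted above for GPT-3.5.
const CONTEXT_LIMIT = 4096;

// Rough heuristic (assumption): about 4 characters per token in English text.
// A real tokenizer will give different counts.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Tokens left once the prompt has been accounted for.
function remainingBudget(prompt, limit = CONTEXT_LIMIT) {
  return limit - estimateTokens(prompt);
}

const examplePrompt = "Question: How much is 10+12+33+(5*8)?\nThought:";
console.log(estimateTokens(examplePrompt), remainingBudget(examplePrompt));
```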

When I first saw the ChatGPT plugin API, it seemed weird that the interface was described in natural language. Now the reason is absolutely clear: the description is handed directly to the model. The model understands the description because it understands language, and it generates output in the required format, which the framework parses in order to call the chosen tool.

When the tool returns the result, the framework writes it back into the prompt and sends the prompt to the LLM again, which then gives the final answer. That’s all. A simple but ingenious way to expand the boundaries of LLMs.
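
The whole round trip fits in a short loop. The sketch below is a simplified, synchronous illustration and not LangChain's implementation: `fakeModel` is a scripted stand-in that replays the two completions shown earlier (a real model call would be an asynchronous HTTP request), `calculator` is a stand-in tool, and the instruction header is shortened.

```javascript
// Shortened stand-in for the instruction header shown earlier (assumption).
const HEADER = "You have access to the following tools: calculator\n\n";

// Stand-in tool: accepts only digits, whitespace, and + - * / ( ) . characters.
function calculator(expression) {
  if (!/^[\d\s+\-*\/().]+$/.test(expression)) {
    throw new Error(`Unsupported expression: ${expression}`);
  }
  // Function() is tolerable here only because the input was whitelisted above.
  return String(Function(`"use strict"; return (${expression});`)());
}

// Scripted stand-in for the LLM: replays the two completions from the article.
function fakeModel(prompt) {
  return /Observation: \S+\nThought:$/.test(prompt)
    ? " I now know the final answer\nFinal Answer: 95"
    : " I need to calculate the expression\nAction: calculator\nAction Input: 10+12+33+(5*8)";
}

function runAgent(question, tools, callModel, maxSteps = 5) {
  let scratchpad = ""; // grows with every Thought/Action/Observation round
  for (let i = 0; i < maxSteps; i++) {
    const prompt = `${HEADER}Question: ${question}\nThought:${scratchpad}`;
    const completion = callModel(prompt);

    const final = completion.match(/Final Answer:\s*(.*)/s);
    if (final) return final[1].trim();

    const action = completion.match(/Action:\s*(.*?)\s*\nAction Input:\s*(.*)/s);
    if (!action) throw new Error("Unparseable completion");
    const tool = tools.get(action[1].trim());
    if (!tool) throw new Error(`Unknown tool: ${action[1]}`);

    // Run the tool outside the model and write the result back into the prompt.
    const observation = tool(action[2].trim());
    scratchpad += `${completion}\nObservation: ${observation}\nThought:`;
  }
  throw new Error("No final answer within the step limit");
}

const tools = new Map([["calculator", calculator]]);
const answer = runAgent("How much is 10+12+33+(5*8)?", tools, fakeModel);
console.log(answer); // prints "95"
```

Note that the model never computes anything here: the arithmetic happens in the tool, and the model only sees the result as text in the next prompt.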

Today’s models store a lot of factual knowledge that they do not really need, because the model could look it up on the web instead. If you drop this knowledge, the model's size can be radically reduced. I expect such smaller models (like Alpaca), which can run on your phone or your Raspberry Pi, to deliver roughly comparable performance.

Another advantage of shrinking the model is that you can make its input wider. With a larger number of input tokens, the model can use more tools, take in a wider context, and give more accurate results.

I think the future of LLMs is bright, and using tools is a massive leap toward AGI. These models will be with us on our phones. We can talk with them through our Bluetooth headsets or BCI devices. Spoken language is only the first step because anything can be tokenized.

Vision transformers use the same architecture (the transformer) as LLMs, and these models can be combined, as DeepMind's Gato shows. In the future, such models could see through our smart glasses or hear what we hear, help us as an integrated part of ourselves, and connect us to the internet through tools, as a new layer of our brain. The exocortex…