Merge pull request #26 from huggingface/unit1_joffrey

Modification of Unit 1.1 -> 1.3
This commit is contained in:
Thomas Simonini
2025-02-05 11:14:36 +01:00
committed by GitHub
3 changed files with 464 additions and 0 deletions

View File

@@ -0,0 +1,143 @@
# Definition of an Agent
Since you are interested in learning more about **Agents**, this is the moment to discuss the fundamental question:
## "What is an Agent?"
For now, let's define it as follows:
> An Agent is a system that gives an AI model the ability to interact with its environment in order to fulfill a pre-defined objective.
<INSERT IMAGE>
In essence, Agents are designed to enhance the capabilities of AI Models by incorporating them into a framework that manages things like **Planning**, **Memory**, and **Actions**.
To make things more visual, the AI model can be seen as the **brain of the agent** and the framework as the **rest of the body**. The AI model does the reasoning and then sends an "Action" to execute, and the scope of what is possible depends on what the model has been equipped with by its creator. A human, not having "wings", can't execute the **Action** "fly", but can execute the **Actions** "walk", "run", "jump", "grab", and so on.
## "What type of AI Models ?"
The most commonly AI model found at the core of an Agent is an LLM ( Large Language Model), this is a kind of AI model that takes **Text** as an input and Also output **Text**. It's most known represents are **GPT4** from **OpenAI**, **LLama** from **meta**, **Gemini** from **Google**, etc... Those models have been trained on a very vast amount of text and are able to generalize well. But we will learn more about LLMs in the next section.
LLMs only handling **Text**, if the use-case requiere other modalities (Images, Audio, Video, ...), you will have to use different AI models. For instance, to browse the web, you could use a Vision Language Model (VLM) as the Agent's core that understands both Images and Text to navigate your web page.
A second example, of that could be to use **Whisper**, a very famous **Audio** to **Text** model as a "Tool" to allow your LLM agent to process audio into text in order to understand it.
## "How does an AI take action on it's environment ?"
The general word for this set of possible action that an AI model can use is a "Tool". For instance by default, your LLM can't generate any images. But if you ask some well-known chat application like HuggingChat, ChatGPT or Le Chat, to generate an Image, they can do it !
The model at the core of those application does not natively have the capacity to generate an Image. But the developpers of those applications created some code ( Tools ), that the LLM can call and execute to create an Image.
We will learn more about tools in the Tool section [insert link]
## "What can an Agent do ?"
The Agent has a task to perform the LLM at his core should selectect the best course of **ACTIONS** to fullfill it.
Example : "If I ask my personal assistant on my computer to send an email to my Manager asking to delay today's meeting", I will need to give code some Tool ( in this case a python function ) do such a thing :
```python
def send_message_to(recipient, message):
    """Useful to send an e-mail message to someone."""
    ...
```
And the AI model will need to run that code somehow to fulfill the predefined task:
```python
send_message_to("Manager", "Can we postpone today's meeting?")
```
In Agents, the design of the Tools is very important and greatly impacts the quality of your agent. Some tasks will require very specific tools to be crafted, while others may be solved with a general-purpose tool like "web_search".
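To make that contrast concrete, here is a small sketch of a task-specific tool next to a general-purpose one. Both function names and bodies are hypothetical placeholders (stubbed so the example stays self-contained), not a real API:
```python
# A very specific tool, crafted for a single task (hypothetical example):
def get_order_status(order_id: str) -> str:
    """Useful to look up the shipping status of an order."""
    # A real implementation would query the company's order-tracking system.
    return f"Order {order_id}: shipped"

# A general-purpose tool that many different tasks can reuse (hypothetical example):
def web_search(query: str) -> str:
    """Useful to search the web and return a short summary of the top results."""
    # A real implementation would call a search API (DuckDuckGo, Bing, ...).
    return f"Top results for '{query}': ..."
```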
Here we make a distinction between an Action and a Tool, because in some Agent implementations one Action can involve the use of multiple Tools.
Having an AI interact with its environment opens up a lot of real-life scenarios for companies and individuals.
### Examples
Personal Virtual Assistants:
Virtual assistants like Siri, Alexa, or Google Assistant function as agents when they interact with users and their digital environments. They take user queries, analyze context, retrieve information from databases, and provide responses or initiate actions (like setting reminders, sending messages, or controlling smart devices).
Customer Service Chatbots:
Many companies deploy chatbots as agents that interact with customers in natural language. These agents can answer questions, guide users through troubleshooting steps, or even complete transactions. Their predefined objectives might include improving user satisfaction, reducing wait times, or increasing sales conversion rates. By interacting directly with customers, learning from the dialogues, and adapting their responses over time, they demonstrate the core principles of an agent in action.
AI NPCs (Non-Playable Characters) in video games:
AI agents powered by large language models (LLMs) can make NPCs more dynamic and unpredictable. Instead of following rigid behavior trees, they can respond contextually, adapt to player interactions, and generate more nuanced dialogue. This flexibility helps create more lifelike, engaging characters that evolve alongside the player's actions.
To summarize, an Agent is a system that uses an AI Model (mostly an LLM) as its core reasoning engine, to:
* **Understand natural language:** Interpret and respond to human instructions in a meaningful way.
* **Reason and plan:** Analyze information, make decisions, and devise strategies to solve problems.
* **Interact with its environment:** Gather information, take actions, and observe the results of those actions.
## Benefits of Agents
* **Automation:** Automate complex tasks that require reasoning and decision-making.
* **Personalization:** Provide tailored experiences and solutions based on individual needs.
* **Improved Decision-Making:** Analyze vast amounts of data to make more informed decisions.
* **Increased Efficiency:** Streamline processes and optimize resource allocation.
## Challenges of Agents
| Challenge | Explanation |
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Accuracy and Bias** | Ensuring the agent's reasoning and actions are accurate, unbiased, and aligned with human values. |
| **Safety and Security** | Preventing unintended consequences and ensuring the agent's actions are safe and secure. |
| **Explainability** | Understanding how the agent makes decisions and providing transparency in its reasoning process. |
| **Control and Alignment** | Ensuring the agent's goals and actions remain aligned with human intentions and ethical principles, especially as the agent learns and evolves. |
By addressing these challenges and harnessing the capabilities of LLMs, we can create AI agents that augment human capabilities and solve complex problems in a responsible and beneficial way.
---
**Pop Quiz 🍾**
**Question 1:**
Which phrase best describes an Agent?
a) A static question-and-answer system that cannot adapt
b) A specialized chatbot that only recommends content
c) A digital brain that can gather data, reason, plan, and take actions
d) A random text generator that cannot interact with its environment
**Question 2:**
Which function is NOT mentioned as one of the LLM Agent's core capabilities?
a) Reasoning and planning
b) Generating images
c) Collecting information from various sources
d) Learning and adapting over time
**Question 3:**
Which of the following is a proven benefit of LLM Agents mentioned in this document?
a) They guarantee perfect results without supervision
b) They replace the need for human reasoning entirely
c) They can automate complex tasks requiring decision-making
d) They remain unaffected by real-world actions and constraints
**Question 4:**
Which is an example use case of an Agent?
a) Preparing a cup of coffee
b) Replacing all human healthcare workers
c) Providing personalized customer support
d) Generating random cat memes
e) all of the above
**Question 5:**
What is a major challenge in setting up an LLM Agent?
a) They operate free of any biases
b) They work best when hidden from users
c) Guaranteeing total safety and security of their actions
d) They require no data to produce results
---
<details>
<summary>Answer Key (click to reveal)</summary>
1. c) A digital brain that can gather data, reason, plan, and take actions
2. b) Generating images (available through tools, but not natively)
3. c) They can automate complex tasks requiring decision-making
4. a) c) d) All of the above except "Replacing all human healthcare workers" (yes, with the correct tools you could have an agent make you coffee, but contrary to popular opinion, AI is not here to take your job, especially not from healthcare workers)
5. c) Guaranteeing total safety and security of their actions
</details>

View File

@@ -0,0 +1,139 @@
# Explain Large Language Models
<!-- Explanation of LLMs, including the family tree of models: encoders, seq2seq, decoders. Decoders are autoregressive and continue until EOS. -->
<!-- TODO: @burtenshaw -->
As explained in the previous chapter, each agent needs an AI Model at its core, and the most commonly used AI model for this purpose is the LLM (Large Language Model).
## What is a Large Language Model?
A Large Language Model (LLM) is a type of artificial intelligence model that excels at understanding and generating human language. They are trained on vast amounts of text data, allowing them to learn patterns, nuances, and structure in language. These models typically consist of billions of parameters.
Most LLMs are **Transformers**, a deep learning architecture that has gained a lot of interest since the release of BERT from Google in 2018. There are 3 types of Transformers:
1. **Encoders**
An encoder-based Transformer takes text (or other data) as input and outputs a dense representation (or embedding) of that text.
- **Example**: BERT from Google
- **Use Cases**: Text classification, semantic search, Named Entity Recognition
- **Usual Size**: Millions of parameters
2. **Decoders**
A decoder-based Transformer focuses on _generating_ new tokens to complete a sequence, token by token.
- **Example**: Llama from Meta
- **Use Cases**: Text generation, chatbots, code generation
- **Usual Size**: Billions of parameters
3. **Seq2Seq (Encoder-Decoder)**
A sequence-to-sequence Transformer _combines_ an encoder and a decoder. The encoder first processes the input sequence into a context representation, then the decoder generates an output sequence.
- **Example**: T5, BART
- **Use Cases**: Translation, Summarization, Paraphrasing
- **Usual Size**: Millions of parameters
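To get a feel for these three families, here is a small sketch using the 🤗 `transformers` `pipeline` API. The checkpoints (`bert-base-uncased`, `gpt2`, `t5-small`) are only chosen as small, freely available examples of each family:
```python
from transformers import pipeline

# Encoder: predict a masked word (the encoder's representation feeds a fill-mask head).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Paris is the [MASK] of France.")[0]["token_str"])

# Decoder: generate a continuation token by token.
generator = pipeline("text-generation", model="gpt2")
print(generator("The capital of France is", max_new_tokens=10)[0]["generated_text"])

# Seq2Seq (Encoder-Decoder): map an input sequence to an output sequence.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("How are you?")[0]["translation_text"])
```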
Although Large Language Models come in various forms, in most people's minds "LLM" refers to **decoders** with multiple billions of parameters. Here are some of the most famous ones:
| **Model** | **Provider** |
|-----------------------------------|-------------------------------------------|
| **GPT4** | OpenAI |
| **LLaMA3** | Meta (Facebook AI Research) |
| **Deepseek-R1** | DeepSeek |
| **SmolLM2** | Hugging Face |
| **Gemma** | Google |
| **Mistral** | Mistral |
The principle behind LLMs is simple yet very effective: the objective of a decoder is to **predict the next token**. We talk about tokens and not words because not every token corresponds to a full word. In English, the dictionary contains an estimated 600,000 different words, while the vocabulary size of an LLM is around 32,000 tokens (for Llama 2). This is achieved by working with sub-word tokens.
For example, you can see "interesting" as "interest" + "##ing", and both pieces can be reused to compose new words like "interested" ("interest" + "##ed") or "fasting" ("fast" + "##ing").
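As a quick illustration, here is a minimal sketch using a 🤗 `transformers` tokenizer. The `bert-base-uncased` checkpoint is chosen only because its WordPiece tokenizer uses the `##` notation above; the exact splits depend on the vocabulary each tokenizer has learned:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Split the text into sub-word tokens, then map each token to its integer ID.
tokens = tokenizer.tokenize("Tokenization of uncommonly long words")
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)     # the sub-word pieces (exact splits depend on the learned vocabulary)
print(token_ids)  # the integer IDs the model actually consumes
```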
You can play with different tokenizers in the space below:
<iframe
src="https://xenova-the-tokenizer-playground.static.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>
Furthermore, each LLM has some **special tokens** specific to that model. The most important of these special tokens is the **End of Sequence token** (EOS).
| **Model** | **Provider** | **EOS Token** |
|---------------|-------------------------------|----------------------------|
| **GPT4** | OpenAI | ``<\|endoftext\|>`` |
| **LLaMA3** | Meta (Facebook AI Research) | ``<\|eot_id\|>`` |
| **Deepseek-R1** | DeepSeek | ``<end▁of▁sentence>`` |
| **SmolLM2** | Hugging Face | ``<\|im_end\|>`` |
| **Gemma** | Google | ``<end_of_turn>`` |
| **Mistral** | Mistral | ``</s>`` |
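You don't have to memorize these: each model's tokenizer exposes its own special tokens. A minimal sketch, using the `HuggingFaceTB/SmolLM2-135M-Instruct` checkpoint as an example:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

print(tokenizer.eos_token)     # the text form of the End of Sequence token
print(tokenizer.eos_token_id)  # the integer ID the model actually predicts
```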
## Understanding next-token prediction
LLMs are said to be **autoregressive**, meaning that the output from one pass becomes the input of the next one. This loop continues until the model predicts the next token to be the EOS token, at which point the model can stop.
<img src="https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit1/AutoregressionFinal.gif" alt="Visual Gif of autoregressive decoding" width="60%">
Alright, an LLM will decode until reaching an EOS token. But what happens during a single loop?
The full process is probably a bit too technical for the purpose of learning agents; if you want to know more about decoding, you can take a look at the NLP course.
In short: once we have tokenized our text, we compute a dense representation of the tokens that accounts for both their meaning and their position in the input sequence. This dense representation goes into the model, which outputs logits, and those logits are turned into a probability for each token ID (a unique number for each token) through the softmax function.
<img src="https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit1/DecodingFinal.gif" alt="Visual Gif of decoding" width="60%">
Once out of the model, we have multiple candidate tokens that could continue the sequence. The most naive decoding strategy would be to always take the token with the maximum probability.
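Here is a minimal sketch of that greedy loop with 🤗 `transformers`, using the small `HuggingFaceTB/SmolLM2-135M-Instruct` checkpoint as an example (in practice you would simply call `model.generate`, which implements this loop for you):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(30):  # cap the number of steps so the loop always terminates
        logits = model(input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy: most probable token
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # feed it back in: autoregression
        if next_token.item() == tokenizer.eos_token_id:             # stop at the EOS token
            break

print(tokenizer.decode(input_ids[0]))
```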
You can interact with the decoding process yourself with SmolLM2 in this space (remember, it decodes until reaching an **EOS** token, which is **<|im_end|>** for this model):
<iframe
src="https://jofthomas-decoding-visualizer.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>
But there are also other, more advanced decoding strategies, like beam search: instead of greedily taking the most probable token at each step, we explore several candidate sequences and keep the one with the maximum cumulative probability, even if some of its individual steps are sub-optimal in the short term.
<iframe
src="https://m-ric-beam-search-visualizer.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>
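With 🤗 `transformers`, you don't have to implement beam search yourself: it is one of the options of `model.generate`. A short sketch, reusing the `model` and `tokenizer` from the greedy example above:
```python
# Beam search keeps the `num_beams` most promising sequences at each step
# and returns the one with the highest cumulative probability.
outputs = model.generate(
    input_ids,
    max_new_tokens=30,
    num_beams=5,
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```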
## Attention is all you need
One small detail that we should still mention is **Attention**. When predicting the next token, not all the tokens in the sequence have the same importance. For instance, when decoding "The capital of France is", the attention will be higher on the words "France" and "capital", as they are the ones holding the meaning of the sentence.
<img src="https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit1/AttentionSceneFinal.gif" alt="Visual Gif of Attention" width="60%">
This simple process of finding the most probable token to complete a sequence has proved to be very useful. In fact, the basic principle of LLMs has not changed much since GPT-2, but the size of the neural networks and the way attention is computed have changed drastically.
If you have interacted with LLMs, you have most likely heard the term "context length". This represents the total number of tokens that the LLM can process, and it is what has been most impacted by recent improvements in attention.
In other words, the context length is the maximum number of input tokens plus output tokens that the model can handle.
## Prompting the LLM is important
Considering that the only job of an LLM is to predict the next token by looking at every input token, and to choose which ones are "important" for deciding what the next token should be, the wording of your input is very important.
This input is called a prompt, and careful prompting allows you to guide the generation of the LLM toward the desired output.
## How do LLMs learn?
LLMs are trained on large datasets of text, where they learn to predict the next word in a sequence through a self-supervised or masked language modeling objective. From this unsupervised learning, the model learns the structure of the language and underlying patterns in text, allowing it to generalize to unseen data.
Following this, LLMs can be fine-tuned on a supervised learning objective to perform specific tasks. For example, some are trained for conversational structures or tool usage, while others focus on classification or code generation.
## How can I use LLMs?
You can run LLMs locally on your own laptop (if you have sufficient hardware) or call them through an API. Some models require a lot of memory to run efficiently, so you'll need to factor in these hardware requirements when choosing a model.
During this course, we will initially use models through APIs on the Hugging Face hub. Later, we will explore how to run models locally on your own hardware.
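As a sketch of what the API route can look like, here is a call through the `huggingface_hub` library (the model ID is just an example, and you need a Hugging Face token, e.g. from `huggingface-cli login`):
```python
from huggingface_hub import InferenceClient

# Uses your cached Hugging Face token to authenticate against the Inference API.
client = InferenceClient("HuggingFaceTB/SmolLM2-135M-Instruct")

output = client.text_generation("The capital of France is", max_new_tokens=20)
print(output)
```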
## How are LLMs used in AI Agents?
LLMs are a key component of AI Agents, providing the foundation for understanding and generating human language. They can interpret user instructions, maintain context in conversations, and decide which tools to use. We will explore these steps in more detail in [dedicated sections](./6_agent_steps_and_structure.md).
## Challenges and Limitations
While LLMs are powerful, they can sometimes generate biased or factually incorrect content. They may also require substantial computational resources to train and deploy. Researchers are actively exploring methods to reduce these issues and make LLMs more reliable.

View File

@@ -0,0 +1,182 @@
# Messages and Special Tokens
<!-- Explanation of messages, special tokens, and chat-template usage. Special tokens are different for every model and allow segmentation of generation in messages. Can go from messages to prompt with the chat_template. -->
<!-- TODO: @burtenshaw -->
In this section we will explore how Large Language Models (LLMs) structure their generations through tokenization, special tokens, and chat-templates.
## Tokenization
We already talked a little bit about tokenization, but we only covered the most prominent kind (**Subword Tokenization**), so let's do a deeper dive.
Tokenization is the process of breaking down a text into smaller units, called tokens. These tokens are then used to represent the text in a way that can be processed by the model.
For natural language, tokens are frequently appearing combinations of characters within a language.
Tokenizers are crucial in preparing inputs for models. They convert text into a format that models can understand, typically by splitting text into sub-word units and converting these into numerical IDs. Hugging Face provides two types of tokenizers: a full Python implementation and a "Fast" implementation based on the Rust library 🤗 Tokenizers. The "Fast" tokenizers offer significant speed improvements and additional methods for mapping between the original text and token space.
### Types of Tokenization
1. **Word Tokenization**: This involves splitting text into individual words. It's simple but can be inefficient for languages with complex morphology.
2. **Subword Tokenization**: This breaks down words into smaller units, such as prefixes, suffixes, or even individual characters. This method is more efficient for handling rare words and is commonly used in modern NLP models.
3. **Character Tokenization**: This splits text into individual characters. While it can handle any text, it often results in longer sequences, which can be computationally expensive.
### Importance of Tokenization
- **Efficiency**: Tokenization reduces the complexity of text data, making it easier for models to process.
- **Handling Rare Words**: Subword tokenization helps in managing rare or unseen words by breaking them into known subword units.
- **Language Agnostic**: Tokenization can be adapted to different languages, making it versatile for multilingual models.
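A small sketch of the full round trip with a 🤗 `transformers` tokenizer (`bert-base-uncased` is just one example of a subword tokenizer; the exact splits depend on its learned vocabulary):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers handle rare words gracefully"
ids = tokenizer.encode(text)                 # text -> token IDs (special tokens included)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))  # inspect the subword pieces
print(tokenizer.decode(ids))                 # token IDs -> text
```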
## Chat-Templates
Chat templates are essential for structuring interactions between language models and users. They provide a consistent format for conversations, ensuring that models understand the context and role of each message while maintaining appropriate response patterns.
### Base Models vs Instruct Models
A base model is trained on raw text data to predict the next token, while an instruct model is fine-tuned specifically to follow instructions and engage in conversations. For example, `SmolLM2-135M` is a base model, while `SmolLM2-135M-Instruct` is its instruction-tuned variant.
To make a base model behave like an instruct model, we need to format our prompts in a consistent way that the model can understand. This is where chat templates come in. ChatML is one such template format that structures conversations with clear role indicators (system, user, assistant). If you have interacted with some AI API lately, you know what we're talking about.
It's important to note that a base model could be fine-tuned on different chat templates, so when we're using an instruct model we need to make sure we're using the correct chat template.
Here is an example:
```python
messages = [
{"role": "system", "content": "You are a helpful assistant focused on technical topics."},
{"role": "user", "content": "Can you explain what a chat template is?"},
{"role": "assistant", "content": "A chat template structures conversations between users and AI models..."}
]
```
### Understanding Chat Templates
Since each model uses different special tokens, chat templates are implemented to ensure that the prompt is correctly formatted for each model; the user does not need to handle these special tokens manually.
At their core, chat templates define how conversations should be formatted when communicating with a language model. They include code on how to transform the ChatML list of JSON data presented in the above example into a textual representation of the system-level instructions, user messages and assistant responses that the model can understand.
This structure helps maintain consistency across interactions and ensures the model responds appropriately to different types of inputs. Below is an example of a chat template, the chat_template of `SmolLM2-135M-Instruct`:
```jinja2
{% for message in messages %}
{% if loop.first and messages[0]['role'] != 'system' %}
<|im_start|>system
You are a helpful AI assistant named SmolLM...
<|im_end|>
{% endif %}
<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}
```
As you can see, a chat_template is some code that describes how the list of messages should be formatted into a single prompt:
```sh
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
```
If you remember the last lesson, you will notice that `<|im_end|>` is the End of Sequence (EOS) token of **SmolLM2-135M-Instruct**. The prompt ends right after `<|im_start|>assistant`, meaning that we only ask the model to generate the assistant's part of the conversation.
The `transformers` library will take care of chat templates for you in relation to the model's tokenizer. Read more about how transformers builds chat templates [here](https://huggingface.co/docs/transformers/en/chat_templating#how-do-i-use-chat-templates). All we have to do is structure our messages in the correct way and the tokenizer will take care of the rest.
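A minimal sketch of that workflow (the model ID is just an example; `add_generation_prompt=True` appends the `<|im_start|>assistant` marker so the model knows it is its turn to answer):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
    {"role": "user", "content": "Can you explain what a chat template is?"},
]

# Render the messages into the exact text prompt this model expects.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```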
Or you can experiment with different conversations/models to see how they are formatted for the model in the following space:
<iframe
src="https://jofthomas-chat-template-viewer.hf.space"
frameborder="0"
width="850"
height="450"
></iframe>
Let's break down the above example, and see how it maps to the chat template format.
### System Messages
System messages set the foundation for how the model should behave. They act as persistent instructions that influence all subsequent interactions. For example:
```python
system_message = {
"role": "system",
"content": "You are a professional customer service agent. Always be polite, clear, and helpful."
}
```
### Conversations
A conversation consists of alternating messages between a Human (user) and an LLM (assistant).
Chat templates maintain context through conversation history, storing previous exchanges between users and the assistant. This allows for more coherent multi-turn conversations:
```python
conversation = [
{"role": "user", "content": "I need help with my order"},
{"role": "assistant", "content": "I'd be happy to help. Could you provide your order number?"},
{"role": "user", "content": "It's ORDER-123"},
]
```
Templates can handle complex multi-turn conversations while maintaining context:
```python
messages = [
{"role": "system", "content": "You are a math tutor."},
{"role": "user", "content": "What is calculus?"},
{"role": "assistant", "content": "Calculus is a branch of mathematics..."},
{"role": "user", "content": "Can you give me an example?"},
]
```
### Tool Usage
Tool usage in chat templates allows models to interact with external functions and APIs in a structured way. AI agents rely on tools to perform tasks, such as searching the web, performing calculations, or even controlling physical robots.
When working with tools, chat templates need to handle three specific message types: tool definitions that describe available functions, tool calls that occur when the assistant wants to use a tool, and tool responses that contain results returned from tool execution.
Here's an example of how a tool interaction might look in a chat template:
```sh
<|im_start|>system
You are an AI assistant with access to a calculator tool.<|im_end|>
<|im_start|>user
What is 123 multiplied by 456?<|im_end|>
<|im_start|>tool
Tool Name: calculator
Tool Arguments: {"operation": "multiply", "x": 123, "y": 456}<|im_end|>
<|im_start|>tool-answer
Tool Answer: 56,088<|im_end|>
<|im_start|>assistant
Based on the calculator tool, 123 multiplied by 456 equals 56,088.<|im_end|>
```
This example shows how special tokens (`<|im_start|>` and `<|im_end|>`) are used to segment different parts of the conversation, including system context, user input, tool usage, and the assistant's response. Let's see it in action with an example conversation:
```python
messages = [
{"role": "system", "content": "You are an AI assistant with access to various tools."},
{"role": "user", "content": "What's the weather in Paris?"},
{"role": "tool", "tool_name": "WeatherAPI", "args": {"location": "Paris"}},
{"role": "assistant", "content": "It's currently 20°C and partly cloudy."},
]
```
To use this template with your model, you'll need to ensure your model is trained to work with tool calls in this format. Then, configure your tokenizer with the appropriate chat template. Finally, use the template to format messages before sending to the model:
```python
rendered_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
```
Remember that different models may expect different formatting for tool interactions. Always check your model's documentation for the specific format it expects. The template shown here uses a common format with `<|im_start|>` and `<|im_end|>` tokens, but your model might use different special tokens or formatting.
## Resources
- [Hugging Face Chat Templating Guide](https://huggingface.co/docs/transformers/main/en/chat_templating)
- [Transformers Documentation](https://huggingface.co/docs/transformers)