Merge pull request #4 from eltociear/patch-1

Update README.md
This commit is contained in:
Steve Krenzel
2023-05-16 20:07:01 -07:00
committed by GitHub


@@ -339,7 +339,7 @@ You can experiment with a tokenizer here: [https://platform.openai.com/tokenizer
 Different models will use different tokenizers with different levels of granularity. You could, in theory, just feed a model 0s and 1s but then the model needs to learn the concept of characters from bits, and then the concept of words from characters, and so forth. Similarly, you could feed the model a stream of raw characters, but then the model needs to learn the concept of words, and punctuation, etc… and, in general, the models will perform worse.
-To learn more, [HuggingFace has a wonderful introduction to tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary) and why they need to exist.
+To learn more, [Hugging Face has a wonderful introduction to tokenizers](https://huggingface.co/docs/transformers/tokenizer_summary) and why they need to exist.
 There's a lot of nuance around tokenization, such as vocabulary size or different languages treating sentence structure meaningfully different (e.g. words not being separated by spaces). Fortunately, language model APIs will almost always take raw text as input and tokenize it behind the scenes *so you rarely need to think about tokens*.
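To make the granularity point in the diffed text concrete, here is a minimal sketch (not part of this PR) using OpenAI's tiktoken library; the library choice, the `cl100k_base` encoding, and the sample sentence are all assumptions for illustration, and it presumes `pip install tiktoken` has been run:

```python
# Minimal sketch, not part of this PR's diff: show that a tokenizer splits text
# into subword pieces rather than bits or characters. Assumes `pip install tiktoken`.
import tiktoken

# cl100k_base is one common encoding; other models use other tokenizers.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenizers split text into subword pieces, not characters."
token_ids = enc.encode(text)

print(f"{len(text)} characters -> {len(token_ids)} tokens")

# Print each token id next to the text fragment it stands for.
for tid in token_ids:
    print(tid, repr(enc.decode([tid])))
```

The exact split depends on the tokenizer: a character- or byte-level scheme would produce far more tokens for the same sentence, which is the granularity trade-off described above. And, as the last paragraph in the hunk notes, the model APIs run this step behind the scenes, so the sketch is only for building intuition.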