# Python Tokenize

This repo contains a Jupyter notebook to calculate the number of tokens in text, files, and folders using tokenizers from Hugging Face and OpenAI.

## Installation

```sh
uv sync
```

## Usage

Select the model to use for tokenization in the Jupyter notebook. You can choose either a model from the Hugging Face model hub or an OpenAI model. Set the model's name in the `model_name` variable.

- For Hugging Face models, use the `user/model-name` identifier from the Hugging Face model hub, e.g. `mixedbread-ai/mxbai-embed-large-v1`.
- For OpenAI models, use the model name from the OpenAI API, e.g. `gpt-4o`. [Available models](https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/tiktoken/model.py#L22-L72).

### Calculate tokens in a text

1. Set the `text` variable to your text.
1. Run all cells.

### Calculate tokens in a file

1. Set the `file_path` variable to the path of your file.
1. Run all cells.

### Calculate tokens in files in a folder

1. Set the `folder_path` variable to the path of your folder.
1. Optionally, specify a filter for which files to include.
1. Run all cells.
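## Token counting sketch

The notebook's cells aren't reproduced here, but as a rough sketch of how the two tokenizer families are typically used (assuming the `transformers` and `tiktoken` packages are installed by `uv sync`), a helper like the one below counts tokens in a string. The `count_tokens` function and its `/`-based dispatch are illustrative, not the notebook's actual code:

```python
# Sketch only: not the notebook's actual code.
def count_tokens(text: str, model_name: str) -> int:
    """Count tokens using a Hugging Face or OpenAI tokenizer."""
    if "/" in model_name:
        # Hugging Face model ids look like `user/model-name`.
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained(model_name)
        return len(tokenizer.encode(text))
    # Otherwise assume an OpenAI model name known to tiktoken.
    import tiktoken

    encoding = tiktoken.encoding_for_model(model_name)
    return len(encoding.encode(text))


print(count_tokens("Hello, world!", "gpt-4o"))
```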
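Counting tokens in a file then reduces to reading the file's text and reusing the same helper; `file_path` here mirrors the notebook variable of the same name:

```python
# Sketch only: reuses the illustrative count_tokens() helper above.
from pathlib import Path

file_path = "README.md"  # mirrors the notebook's `file_path` variable
print(count_tokens(Path(file_path).read_text(encoding="utf-8"), "gpt-4o"))
```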
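For a folder, a glob pattern can play the role of the optional file filter mentioned above; `folder_path` and `pattern` are illustrative values, and the notebook's actual filter mechanism may differ:

```python
# Sketch only: `pattern` stands in for the notebook's optional file filter.
from pathlib import Path

folder_path = "docs"  # mirrors the notebook's `folder_path` variable
pattern = "*.md"      # illustrative glob filter

total = 0
for path in sorted(Path(folder_path).rglob(pattern)):
    n = count_tokens(path.read_text(encoding="utf-8"), "gpt-4o")
    print(f"{path}: {n} tokens")
    total += n
print(f"Total: {total} tokens")
```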