mirror of
https://github.com/fchollet/deep-learning-with-python-notebooks.git
synced 2021-07-27 01:28:40 +03:00
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Using TensorFlow backend.\n"
]
},
{
"data": {
"text/plain": [
"'2.0.8'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import keras\n",
"keras.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# One-hot encoding of words or characters\n",
"\n",
"This notebook contains the first code sample found in Chapter 6, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.\n",
"\n",
"----\n",
"\n",
"One-hot encoding is the most common, most basic way to turn a token into a vector. You already saw it in action in our initial IMDB and \n",
"Reuters examples from chapter 3 (done with words, in our case). It consists of associating a unique integer index with every word, then \n",
"turning this integer index i into a binary vector of size N (the size of the vocabulary) that is all zeros except for the i-th \n",
"entry, which is 1.\n",
"\n",
"Of course, one-hot encoding can also be done at the character level. To unambiguously drive home what one-hot encoding is and how to \n",
"implement it, here are two toy examples: one for words, the other for characters.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word-level one-hot encoding (toy example):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# This is our initial data; one entry per \"sample\"\n",
"# (in this toy example, a \"sample\" is just a sentence, but\n",
"# it could be an entire document).\n",
"samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n",
"\n",
"# First, build an index of all tokens in the data.\n",
"token_index = {}\n",
"for sample in samples:\n",
"    # We simply tokenize the samples via the `split` method.\n",
"    # In real life, we would also strip punctuation and special characters\n",
"    # from the samples.\n",
"    for word in sample.split():\n",
"        if word not in token_index:\n",
"            # Assign a unique index to each unique word.\n",
"            token_index[word] = len(token_index) + 1\n",
"            # Note that we don't assign index 0 to anything.\n",
"\n",
"# Next, we vectorize our samples.\n",
"# We will only consider the first `max_length` words in each sample.\n",
"max_length = 10\n",
"\n",
"# This is where we store our results:\n",
"results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))\n",
"for i, sample in enumerate(samples):\n",
"    for j, word in list(enumerate(sample.split()))[:max_length]:\n",
"        index = token_index.get(word)\n",
"        results[i, j, index] = 1."
]
},
},
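{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, here is a minimal sketch (assuming the `results` and `token_index` variables defined in the cell above) that inspects what the encoding produced: the tensor has shape `(number of samples, max_length, largest word index + 1)`, and the position of each word's index is set to 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Shape: (number of samples, max_length, largest word index + 1);\n",
"# index 0 is never used, hence the \"+ 1\".\n",
"print(results.shape)\n",
"\n",
"# 'The' is the first word of the first sample, so the entry at its\n",
"# index in the first timestep of that sample should be 1.\n",
"print(token_index['The'], results[0, 0, token_index['The']])"
]
},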
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Character-level one-hot encoding (toy example):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import string\n",
"\n",
"samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n",
"characters = string.printable  # All printable ASCII characters.\n",
"token_index = dict(zip(characters, range(1, len(characters) + 1)))\n",
"\n",
"max_length = 50\n",
"results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))\n",
"for i, sample in enumerate(samples):\n",
"    for j, character in enumerate(sample[:max_length]):\n",
"        index = token_index.get(character)\n",
"        results[i, j, index] = 1."
]
},
},
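{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the reverse mapping (assuming the `token_index` and `results` variables from the cell above; `reverse_index` is just an illustrative helper): each row of the encoding can be decoded back into a character by finding the position of its 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Invert the character index so positions map back to characters.\n",
"reverse_index = {index: char for char, index in token_index.items()}\n",
"\n",
"# For each timestep of the first sample, find the position of the 1\n",
"# (argmax) and look up the corresponding character; all-zero rows are\n",
"# unused padding and are skipped.\n",
"decoded = ''.join(reverse_index[row.argmax()]\n",
"                  for row in results[0] if row.any())\n",
"print(decoded)"
]
},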
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that Keras has built-in utilities for one-hot encoding text at the word level or character level, starting from raw text data. \n",
"This is what you should actually be using, as it will take care of a number of important features, such as stripping special characters \n",
"from strings, or only taking into account the top N most common words in your dataset (a common restriction to avoid dealing with very large input \n",
"vector spaces)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using Keras for word-level one-hot encoding:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found 9 unique tokens.\n"
]
}
],
"source": [
"from keras.preprocessing.text import Tokenizer\n",
"\n",
"samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n",
"\n",
"# We create a tokenizer, configured to only take\n",
"# into account the top-1000 most common words.\n",
"tokenizer = Tokenizer(num_words=1000)\n",
"# This builds the word index.\n",
"tokenizer.fit_on_texts(samples)\n",
"\n",
"# This turns strings into lists of integer indices.\n",
"sequences = tokenizer.texts_to_sequences(samples)\n",
"\n",
"# You could also directly get the one-hot binary representations.\n",
"# Note that vectorization modes other than one-hot encoding are supported!\n",
"one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')\n",
"\n",
"# This is how you can recover the word index that was computed.\n",
"word_index = tokenizer.word_index\n",
"print('Found %s unique tokens.' % len(word_index))"
]
},
},
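{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick look at what the tokenizer produced, as a minimal sketch assuming the `sequences`, `one_hot_results` and `word_index` variables from the cell above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each sample as a list of integer word indices.\n",
"print(sequences)\n",
"\n",
"# One row per sample, one column per word index (up to `num_words`);\n",
"# a 1 marks the presence of that word in the sample.\n",
"print(one_hot_results.shape)\n",
"\n",
"# The word-to-index mapping the tokenizer built.\n",
"print(word_index)"
]
},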
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"A variant of one-hot encoding is the so-called \"one-hot hashing trick\", which can be used when the number of unique tokens in your \n",
"vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference to these \n",
"indices in a dictionary, one may hash words into vectors of fixed size. This is typically done with a very lightweight hashing function. \n",
"The main advantage of this method is that it does away with maintaining an explicit word index, which \n",
"saves memory and allows online encoding of the data (starting to generate token vectors right away, before having seen all of the available \n",
"data). The one drawback of this method is that it is susceptible to \"hash collisions\": two different words may end up with the same hash, \n",
"and subsequently any machine learning model looking at these hashes won't be able to tell the difference between these words. The likelihood \n",
"of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed."
]
},
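{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what a collision looks like, here is a minimal, self-contained sketch (the `words` list, `tiny_dimensionality` and `buckets` below are purely illustrative, with a deliberately tiny hashing space so that collisions are guaranteed):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# With a hashing space much smaller than the vocabulary, distinct words\n",
"# are forced onto the same index and become indistinguishable downstream.\n",
"words = ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'ate', 'my', 'homework']\n",
"tiny_dimensionality = 4\n",
"\n",
"buckets = {}\n",
"for word in words:\n",
"    index = abs(hash(word)) % tiny_dimensionality\n",
"    buckets.setdefault(index, []).append(word)\n",
"\n",
"# Any bucket holding more than one word is a collision.\n",
"for index, bucket in buckets.items():\n",
"    if len(bucket) > 1:\n",
"        print('index %d collides: %s' % (index, bucket))"
]
},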
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word-level one-hot encoding with hashing trick (toy example):"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"samples = ['The cat sat on the mat.', 'The dog ate my homework.']\n",
"\n",
"# We will store our words as vectors of size 1000.\n",
"# Note that if you have close to 1000 words (or more)\n",
"# you will start seeing many hash collisions, which\n",
"# will decrease the accuracy of this encoding method.\n",
"dimensionality = 1000\n",
"max_length = 10\n",
"\n",
"results = np.zeros((len(samples), max_length, dimensionality))\n",
"for i, sample in enumerate(samples):\n",
"    for j, word in list(enumerate(sample.split()))[:max_length]:\n",
"        # Hash the word into a \"random\" integer index\n",
"        # between 0 and dimensionality - 1 (i.e. 0 to 999).\n",
"        index = abs(hash(word)) % dimensionality\n",
"        results[i, j, index] = 1."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}