Tokenization and Text Data Preparation with TensorFlow & Keras



So far we have looked at a general approach to preprocessing text data, which focused on tokenization, normalization, and noise removal. We then followed that up with an overview of text data preprocessing using Python for NLP tasks, which is essentially a practical implementation of the framework outlined in the former article, and which takes a mainly manual approach to text data preprocessing. We have also looked at what goes into building an elementary text data vocabulary using Python.

There are numerous tools available for automating much of this preprocessing and text data preparation, however. These tools certainly existed prior to the publication of those articles, but they have proliferated rapidly since. Since much NLP work is now accomplished using neural networks, it makes sense that neural network implementation libraries such as TensorFlow (and, alongside it, Keras) would include methods for achieving these preparation tasks.

This article will look at tokenizing and further preparing text data for feeding into a neural network using TensorFlow and Keras preprocessing tools. While the additional concept of creating and padding sequences of encoded data for neural network consumption was not treated in the previous articles, it will be added here. Conversely, while noise removal was covered in the previous articles, it will not be covered here. What constitutes noise in text data is often task-specific, and the previous treatment of this topic is still relevant as it is.

For what we will accomplish today, we will make use of two Keras preprocessing tools: the Tokenizer class and the pad_sequences function.

Instead of using a real dataset, either one bundled with TensorFlow or something from the real world, we use a few toy sentences as stand-ins while we get the coding down. Next time we can extend our code to both use a real dataset and perform some interesting tasks, such as classification or something similar. Once this process is understood, extending it to larger datasets is trivial.

Let’s start with the necessary imports and some “data” for demonstration.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_data = [
  "I enjoy coffee.",
  "I enjoy tea.",
  "I dislike milk.",
  "I am going to the supermarket later this morning for some coffee."
]

test_data = [
  "Enjoy coffee this morning.",
  "I enjoy going to the supermarket.",
  "Want some milk for your coffee?"
]

Next, some hyperparameters for performing tokenization and preparing the standardized data representation, with explanations below.

num_words = 1000
oov_token = '<UNK>'
pad_type = 'post'
trunc_type = 'post'


  • num_words = 1000
    This will be the maximum number of words from our resulting tokenized data vocabulary which are to be used, truncated after the 1000 most common words in our case. The full word index is still built during fitting; the limit is applied when text is encoded into sequences. This will not be an issue in our small dataset, but is being shown for demonstration purposes.
  • oov_token = '<UNK>'
    This is the token which will be used for out of vocabulary tokens encountered during the tokenizing and encoding of test data sequences, created using the word index built during tokenization of our training data.
  • pad_type = 'post'
    When we are encoding our numeric sequence representations of the text data, our sentence (or arbitrary text chunk) lengths will not be uniform, and so we will need to select a maximum length for sentences and pad unused positions in shorter sentences with a padding character. In our case, the maximum sentence length will be determined by searching our sentences for the longest one, and the padding character will be ‘0’. Whether we pre-pad or post-pad sentences is our decision to make, and we have selected ‘post’, meaning that the numeric representations corresponding to word index entries will appear at the left-most positions of our resulting sentence vectors, while the padding characters (‘0’) will appear after our actual data at the right-most positions.
  • trunc_type = 'post'
    Padding handles sentences shorter than the maximum length; truncation handles sentences longer than it. Any encoded sequence that exceeds the maximum length must be cut down to fit, and we can remove values either from the beginning (‘pre’) or from the end (‘post’). We have selected ‘post’, meaning that the start of an over-long sentence is kept and its tail is dropped.
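
To make the two ‘post’ settings concrete, here is a minimal pure-Python sketch of the padding and truncation rules just described. The helper pad_or_truncate is a hypothetical stand-in written for illustration, not the Keras implementation: it handles one sequence at a time and returns a plain list rather than a Numpy array.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="post", value=0):
    """Illustrative stand-in for the per-sequence rule (not library code)."""
    # Truncation: 'post' drops items from the end, 'pre' drops from the start.
    if len(seq) > maxlen:
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    # Padding: 'post' appends the fill value, 'pre' prepends it.
    fill = [value] * (maxlen - len(seq))
    return seq + fill if padding == "post" else fill + seq


short = [2, 3, 4]
longer = [2, 8, 9, 10, 11, 12]

post_padded = pad_or_truncate(short, maxlen=5)
pre_padded = pad_or_truncate(short, maxlen=5, padding="pre")
post_truncated = pad_or_truncate(longer, maxlen=4)
pre_truncated = pad_or_truncate(longer, maxlen=4, truncating="pre")

print(post_padded)     # [2, 3, 4, 0, 0]
print(pre_padded)      # [0, 0, 2, 3, 4]
print(post_truncated)  # [2, 8, 9, 10]
print(pre_truncated)   # [9, 10, 11, 12]
```

With ‘post’ for both settings, the beginning of every sentence always lines up at position zero, which makes the padded rows easy to read by eye.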


Now let’s perform the tokenization, sequence encoding, and sequence padding. We will walk through this code chunk by chunk below.

# Tokenize our training data
tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)
tokenizer.fit_on_texts(train_data)

# Get our training data word index
word_index = tokenizer.word_index

# Encode training data sentences into sequences
train_sequences = tokenizer.texts_to_sequences(train_data)

# Get max training sequence length
maxlen = max([len(x) for x in train_sequences])

# Pad the training sequences
train_padded = pad_sequences(train_sequences, padding=pad_type, truncating=trunc_type, maxlen=maxlen)

# Output the results of our work
print("Word index:\n", word_index)
print("\nTraining sequences:\n", train_sequences)
print("\nPadded training sequences:\n", train_padded)
print("\nPadded training shape:", train_padded.shape)
print("Training sequences data type:", type(train_sequences))
print("Padded Training sequences data type:", type(train_padded))


Here’s what’s happening chunk by chunk:

  • # Tokenize our training data
    This is straightforward; we are using the TensorFlow (Keras) Tokenizer class to automate the tokenization of our training data. First we create the Tokenizer object, providing the maximum number of words to keep in our vocabulary after tokenization, as well as an out of vocabulary token to use for encoding test data words we have not come across in our training data. Without it, these previously-unseen words would simply be dropped from the encoded sequences and mysteriously unaccounted for. To learn more about other arguments for the TensorFlow tokenizer, check out the documentation. After the Tokenizer has been created, we then fit it on the training data (we will use the same fitted tokenizer later to encode the testing data as well).
  • # Get our training data word index
    A byproduct of the tokenization process is the creation of a word index, which maps words in our vocabulary to their numeric representation, a mapping which will be essential for encoding our sequences. Since we will reference this later to print out, we assign it to a variable here to simplify a bit.
  • # Encode training data sentences into sequences
    Now that we have tokenized our data and have a word to numeric representation mapping of our vocabulary, let’s use it to encode our sequences. Here, we are converting our text sentences from something like “My name is Matthew,” to something like “6 8 2 19,” where each of these numbers matches up in the index to the corresponding word. Since neural networks work by performing computation on numbers, passing in a bunch of words won’t work. Hence, sequences. And remember that this is only the training data we are working on right now; testing data is necessarily tokenized and encoded afterwards, below.
  • # Get max training sequence length
    Remember when we said we needed to have a maximum sequence length for padding our encoded sentences? We could set this limit ourselves, but in our case we will simply find the longest encoded sequence and use that as our maximum sequence length. There would certainly be reasons you would not want to do this in practice, but there would also be times it would be appropriate. The maxlen variable is then used below in the actual training sequence padding.
  • # Pad the training sequences
    As mentioned above, we need our encoded sequences to be of the same length. We just found the length of the longest sequence, and will use that to pad all other sequences with extra ‘0’s at the end (‘post’), and will also truncate any sequences longer than the maximum length from the end (‘post’) as well. Because maxlen was taken from the training data itself, no training sequence is actually truncated here; truncation only comes into play for longer sequences encoded later. Here we use the TensorFlow (Keras) pad_sequences function to accomplish this. You can look at the documentation for additional padding options.
  • # Output the results of our work
    Now let’s see what we’ve done. We would expect to note the longest sequence and the padding of those which are shorter. Also note that when padded, our sequences are converted from Python lists to Numpy arrays, which is helpful since that is what we will ultimately feed into our neural network. The shape of our training sequences matrix is the number of sentences (sequences) in our training set (4) by the length of our longest sequence (maxlen, or 12).
Word index:
 {'<UNK>': 1, 'i': 2, 'enjoy': 3, 'coffee': 4, 'tea': 5, 'dislike': 6, 'milk': 7, 'am': 8, 'going': 9, 'to': 10, 'the': 11, 'supermarket': 12, 'later': 13, 'this': 14, 'morning': 15, 'for': 16, 'some': 17}

Training sequences:
 [[2, 3, 4], [2, 3, 5], [2, 6, 7], [2, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 4]]

Padded training sequences:
 [[ 2  3  4  0  0  0  0  0  0  0  0  0]
 [ 2  3  5  0  0  0  0  0  0  0  0  0]
 [ 2  6  7  0  0  0  0  0  0  0  0  0]
 [ 2  8  9 10 11 12 13 14 15 16 17  4]]

Padded training shape: (4, 12)
Training sequences data type: <class 'list'>
Padded Training sequences data type: <class 'numpy.ndarray'>
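
To see where these index numbers come from, the following is a rough pure-Python sketch of the idea: count word frequencies, assign lower numbers to more frequent words, and reserve 1 for the out of vocabulary token (0 is left for padding). The helpers simple_tokenize, build_word_index, and encode are hypothetical names for illustration; the real Tokenizer has its own configurable filtering and options, so treat this as an approximation that happens to reproduce the output above for our toy sentences.

```python
from collections import Counter
import string

# Replace every ASCII punctuation character with a space. This is a
# simplification; the real tokenizer uses its own configurable filter list.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def simple_tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()


def build_word_index(texts, oov_token="<UNK>"):
    """Map words to integers, most frequent first, with the OOV token at 1."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    word_index = {oov_token: 1}  # index 0 is left free for padding
    # most_common() orders by descending count; ties keep first-seen order.
    for i, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = i
    return word_index


def encode(texts, word_index, oov_token="<UNK>"):
    """Turn each text into a list of integers, using the OOV id for unknowns."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in simple_tokenize(t)] for t in texts]


train_data = [
    "I enjoy coffee.",
    "I enjoy tea.",
    "I dislike milk.",
    "I am going to the supermarket later this morning for some coffee.",
]

word_index = build_word_index(train_data)
train_sequences = encode(train_data, word_index)

print(word_index)
print(train_sequences)
```

Note how ‘i’, ‘enjoy’, and ‘coffee’ receive the lowest numbers because they occur most often, while words that appear only once are numbered in the order they are first seen.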


Now let’s use our tokenizer to tokenize the test data, and then similarly encode our sequences. This is all quite similar to the above. Note that we are using the same tokenizer we created for training in order to keep the two datasets consistent, using the same vocabulary. We also pad to the same length and specifications as the training sequences.

test_sequences = tokenizer.texts_to_sequences(test_data)
test_padded = pad_sequences(test_sequences, padding=pad_type, truncating=trunc_type, maxlen=maxlen)

print("Testing sequences:\n", test_sequences)
print("\nPadded testing sequences:\n", test_padded)
print("\nPadded testing shape:", test_padded.shape)
Testing sequences:
 [[3, 4, 14, 15], [2, 3, 9, 10, 11, 12], [1, 17, 7, 16, 1, 4]]

Padded testing sequences:
 [[ 3  4 14 15  0  0  0  0  0  0  0  0]
 [ 2  3  9 10 11 12  0  0  0  0  0  0]
 [ 1 17  7 16  1  4  0  0  0  0  0  0]]

Padded testing shape: (3, 12)


Can you see, for instance, how having different lengths of padded sequences between training and testing sets would cause a problem?
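
One way to picture the problem is to compare row widths. The sketch below pads the test sequences shown above in two ways, using a hypothetical post_pad helper (a plain-Python illustration, not the Keras function): once to the test set's own longest sequence, and once to the training maxlen of 12.

```python
def post_pad(sequences, maxlen, value=0):
    """Post-pad (and post-truncate) each sequence to exactly maxlen items."""
    return [seq[:maxlen] + [value] * max(0, maxlen - len(seq)) for seq in sequences]


test_sequences = [[3, 4, 14, 15], [2, 3, 9, 10, 11, 12], [1, 17, 7, 16, 1, 4]]
train_maxlen = 12  # length of the longest training sequence in this article

# Padding to the test set's own longest sequence gives narrower rows ...
own_width = max(len(s) for s in test_sequences)
mismatched = post_pad(test_sequences, own_width)

# ... while reusing the training maxlen gives rows as wide as the training rows.
matched = post_pad(test_sequences, train_maxlen)

print("Width using test maxlen:", len(mismatched[0]))
print("Width using training maxlen:", len(matched[0]))
```

A network whose input layer was sized for rows of one width cannot be handed rows of another, which is why the training maxlen is reused for the test data.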

Finally, let’s check out our encoded test data.

for x, y in zip(test_data, test_padded):
  print('{} -> {}'.format(x, y))

print("\nWord index (for reference):", word_index)
Enjoy coffee this morning. -> [ 3  4 14 15  0  0  0  0  0  0  0  0]
I enjoy going to the supermarket. -> [ 2  3  9 10 11 12  0  0  0  0  0  0]
Want some milk for your coffee? -> [ 1 17  7 16  1  4  0  0  0  0  0  0]

Word index (for reference): {'<UNK>': 1, 'i': 2, 'enjoy': 3, 'coffee': 4, 'tea': 5, 'dislike': 6, 'milk': 7, 'am': 8, 'going': 9, 'to': 10, 'the': 11, 'supermarket': 12, 'later': 13, 'this': 14, 'morning': 15, 'for': 16, 'some': 17}
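
Rather than matching numbers to words by eye, the word index can be inverted to decode a padded row back into text. This is a small sketch using a plain dictionary copied from the output above; the decode helper and its pad_value parameter are hypothetical names introduced here for illustration.

```python
word_index = {
    '<UNK>': 1, 'i': 2, 'enjoy': 3, 'coffee': 4, 'tea': 5, 'dislike': 6,
    'milk': 7, 'am': 8, 'going': 9, 'to': 10, 'the': 11, 'supermarket': 12,
    'later': 13, 'this': 14, 'morning': 15, 'for': 16, 'some': 17,
}

# Invert the mapping: integer id -> word.
index_word = {i: w for w, i in word_index.items()}


def decode(padded_row, index_word, pad_value=0):
    """Turn a padded row of ids back into a space-separated string."""
    return " ".join(index_word[i] for i in padded_row if i != pad_value)


row = [1, 17, 7, 16, 1, 4, 0, 0, 0, 0, 0, 0]
decoded = decode(row, index_word)
print(decoded)  # <UNK> some milk for <UNK> coffee
```

The decoded text is lowercased and stripped of punctuation, and any word replaced by the out of vocabulary token cannot be recovered.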


Note that, since we are encoding some words in the test data which were not seen in the training data, we now have some out of vocabulary tokens which we encoded as <UNK> (specifically ‘want’ and ‘your’ in the third sentence).
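
The difference the out of vocabulary token makes can be sketched in a few lines of plain Python. The encode_tokens helper below is hypothetical and only mimics the replace-or-drop behaviour described earlier. For simplicity it reuses the same word index in both cases; a tokenizer created without an OOV token would number its words differently.

```python
word_index = {
    '<UNK>': 1, 'i': 2, 'enjoy': 3, 'coffee': 4, 'tea': 5, 'dislike': 6,
    'milk': 7, 'am': 8, 'going': 9, 'to': 10, 'the': 11, 'supermarket': 12,
    'later': 13, 'this': 14, 'morning': 15, 'for': 16, 'some': 17,
}


def encode_tokens(tokens, word_index, oov_token=None):
    """Encode tokens; unseen words map to the OOV id, or are skipped if none is set."""
    ids = []
    for tok in tokens:
        if tok in word_index:
            ids.append(word_index[tok])
        elif oov_token is not None:
            ids.append(word_index[oov_token])
        # Otherwise the unseen word is silently skipped.
    return ids


tokens = ["want", "some", "milk", "for", "your", "coffee"]

with_oov = encode_tokens(tokens, word_index, oov_token="<UNK>")
without_oov = encode_tokens(tokens, word_index)

print(with_oov)     # [1, 17, 7, 16, 1, 4]
print(without_oov)  # [17, 7, 16, 4]
```

With the token in place the encoded sentence keeps its original length and word positions; without it, the sequence silently shrinks and there is no record that anything was lost.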

Now that we have padded sequences, and more importantly know how to get them again with different data, we are ready to do something with them. Next time, we will replace the toy data we were using this time with actual data, and with very little change to our code (save the possible necessity of classification labels for our train and test data), we will move forward with an NLP task of some kind, most likely classification.

