Create a Vocabulary for NLP Tasks in Python, and Why



When performing a natural language processing task, our text data transformation proceeds more or less in this manner:

raw text corpus → processed text → tokenized text → corpus vocabulary → text representation

Keep in mind that this all happens prior to the actual NLP task even beginning.

The corpus vocabulary is a holding area for processed text before it is transformed into some representation for the impending task, be it classification, or language modeling, or something else.
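To make the stages concrete, here is a minimal sketch of the whole chain on a single made-up sentence. The lowercasing and punctuation stripping are just one possible choice of processing, and the plain dict stands in for the vocabulary class built later in this post.

```python
import string

# Hypothetical one-sentence corpus, used only for illustration.
raw_corpus = ["The dog chased the other dog."]

# raw text -> processed text: lowercase and strip punctuation (one possible choice).
strip_punct = str.maketrans("", "", string.punctuation)
processed = [text.lower().translate(strip_punct) for text in raw_corpus]

# processed text -> tokenized text: naive whitespace split.
tokenized = [text.split() for text in processed]

# tokenized text -> corpus vocabulary: map each unique token to an integer index.
vocabulary = {}
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)

# corpus vocabulary -> text representation: each sentence becomes a list of indexes.
representation = [[vocabulary[token] for token in tokens] for tokens in tokenized]

print(vocabulary)       # {'the': 0, 'dog': 1, 'chased': 2, 'other': 3}
print(representation)   # [[0, 1, 2, 0, 3, 1]]
```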

The vocabulary serves a few primary purposes:

  • help in the preprocessing of the corpus text
  • serve as a storage location in memory for the processed text corpus
  • collect and store metadata about the corpus
  • allow for pre-task munging, exploration, and experimentation

The vocabulary serves a few related purposes and can be thought of in a few different ways, but the main takeaway is that, once a corpus has made its way to the vocabulary, the text has been processed and any relevant metadata should be collected and stored.

This post will take a step-by-step look at a Python implementation of a useful vocabulary class, showing what is happening in the code, why we’re doing what we’re doing, and some sample usage. We’ll start with some code from this PyTorch tutorial, and will make a few modifications as we go. Though this won’t be terribly programming heavy, if you are wholly unfamiliar with Python object oriented programming, I recommend you first look here.

The first thing to do is to create values for our start of sentence, end of sentence, and sentence padding special tokens. When we tokenize text (split text into its atomic constituent pieces), we need special tokens to delineate both the beginning and end of a sentence, as well as to pad sentence (or some other text chunk) storage structures when sentences are shorter than the maximum allowable space. More on this later.

PAD_token = 0   # Used for padding short sentences
SOS_token = 1   # Start-of-sentence token
EOS_token = 2   # End-of-sentence token

What the above states is that our start of sentence token (literally ‘SOS’, below) will take index spot ‘1’ in our token lookup table once we make it. Likewise, end of sentence (‘EOS’) will take index spot ‘2’, while the sentence padding token (‘PAD’) will take index spot ‘0’.
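As a rough illustration of what these index spots are for, the sketch below wraps a sequence of token indexes in SOS and EOS markers and pads it to a fixed length. The helper wrap_and_pad and the index values in batch are hypothetical and are not part of the vocabulary class built in this post.

```python
PAD_token = 0   # Used for padding short sentences
SOS_token = 1   # Start-of-sentence token
EOS_token = 2   # End-of-sentence token

def wrap_and_pad(indexes, max_len):
    """Add SOS/EOS markers, then right-pad with PAD_token up to max_len.

    Hypothetical helper for illustration; not part of the Vocabulary class.
    """
    wrapped = [SOS_token] + list(indexes) + [EOS_token]
    if len(wrapped) > max_len:
        raise ValueError("sequence is longer than max_len")
    return wrapped + [PAD_token] * (max_len - len(wrapped))

# Made-up token indexes for two sentences of different lengths.
batch = [[7, 8, 9], [7, 10]]
max_len = max(len(s) for s in batch) + 2   # +2 leaves room for SOS and EOS
padded = [wrap_and_pad(s, max_len) for s in batch]
print(padded)   # [[1, 7, 8, 9, 2], [1, 7, 10, 2, 0]]
```

Every padded sequence ends up the same length, which is what fixed-size storage structures need.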

The next thing we will do is create a constructor for our Vocabulary class:

def __init__(self, name):
  self.name = name
  self.word2index = {}
  self.word2count = {}
  self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
  self.num_words = 3
  self.num_sentences = 0
  self.longest_sentence = 0

The first line is our __init__() declaration, which requires ‘self’ as its first parameter (again, see this link), and takes a Vocabulary ‘name’ as its second.

Line by line, here is what the object variable initializations are doing:

  • self.name = name → this is set to the name passed to the constructor, as something by which to refer to our Vocabulary object
  • self.word2index = {} → a dictionary to hold word token to corresponding word index values, eventually in the form of 'the': 7, for example
  • self.word2count = {} → a dictionary to hold individual word counts (tokens, actually) in the corpus
  • self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"} → a dictionary holding the reverse of word2index (word index keys to word token values); special tokens added right away
  • self.num_words = 3 → this will be a count of the number of unique words (tokens, actually) in the vocabulary; it starts at 3 to account for the special tokens
  • self.num_sentences = 0 → this will be a count of the number of sentences (text chunks of any indiscriminate length, actually) in the corpus
  • self.longest_sentence = 0 → this will be the length of the longest corpus sentence by number of tokens

From the above, you should be able to see what metadata about our corpus we are concerned with at this point. Try to think of some additional corpus-related data you might want to keep track of, which we are not.

Since we have defined the metadata which we are interested in collecting and storing, we can move on to performing the work to do so. A basic unit of work we will need to do to fill up our vocabulary is to add words to it.

def add_word(self, word):
  if word not in self.word2index:
    # First entry of word into vocabulary
    self.word2index[word] = self.num_words
    self.word2count[word] = 1
    self.index2word[self.num_words] = word
    self.num_words += 1
  else:
    # Word exists; increase word count
    self.word2count[word] += 1

As you can see, there are 2 scenarios we can encounter when trying to add a word token to our vocabulary; either it does not already exist in the vocabulary (if word not in self.word2index:) or it does (else:). If the word does not exist in our vocabulary, we want to add it to our word2index dict, set our count of that word to 1, add the index of the word (the next available number in the counter) to the index2word dict, and increment our overall word count by 1. On the other hand, if the word already exists in the vocabulary, we simply increment the counter for that word by 1.

How are we going to add words to the vocabulary? We will do so by feeding sentences in and tokenizing them as we go, processing the resulting tokens one by one. Note, again, that these need not be sentences, and naming these 2 functions add_token and add_chunk may be more appropriate than add_word and add_sentence, respectively. We will leave the renaming for another day.

def add_sentence(self, sentence):
  sentence_len = 0
  for word in sentence.split(' '):
    sentence_len += 1
    self.add_word(word)
  if sentence_len > self.longest_sentence:
    # This is the longest sentence
    self.longest_sentence = sentence_len
  # Count the number of sentences
  self.num_sentences += 1

This function takes a chunk of text, a single string, and splits it on whitespace for tokenization purposes. This is not robust tokenization, and is not good practice, but will suffice for our purposes at the moment. We will revisit this in a follow-up post and build a better approach to tokenization into our vocabulary class. In the meantime, you can read more on text data preprocessing here and here.

After splitting our sentence on whitespace, we then increment our sentence length counter by one for each word we pass to the add_word function for processing and addition to our vocabulary (see above). We then check to see if this sentence is longer than other sentences we have processed; if it is, we make note. We also increment our count of corpus sentences we have added to the vocabulary thus far.
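One detail worth seeing in action: split(' ') splits on single space characters only, so repeated spaces yield empty-string tokens and tabs are left inside tokens, whereas split() with no argument splits on any run of whitespace. The sentence below is made up to show the difference.

```python
sentence = "My  dog is\tnamed Patrick."   # two spaces after "My", a tab before "named"

by_space = sentence.split(' ')    # what add_sentence currently does
by_whitespace = sentence.split()  # splits on any run of whitespace

print(by_space)        # ['My', '', 'dog', 'is\tnamed', 'Patrick.']
print(by_whitespace)   # ['My', 'dog', 'is', 'named', 'Patrick.']
```

With cleanly single-spaced input, as in the corpus used in this post, both calls give the same tokens.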

We will then add a pair of helper functions to help us more easily access 2 of our most important lookup tables:

def to_word(self, index):
  return self.index2word[index]

def to_index(self, word):
  return self.word2index[word]

The first of these functions performs the index to word lookup in the appropriate dictionary for a given index; the other performs the reverse lookup for a given word. This is essential functionality, as once we get our processed text into the vocabulary object, we will want to get it back out at some point, as well as perform lookups and reference metadata. These 2 functions will be useful for much of this.

Putting this all together, we get the following.
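Assembled from the snippets above, the class might look like the sketch below. It assumes the three special-token constants live at module level and that the constructor and methods are combined unchanged; only the indentation and docstring are added.

```python
PAD_token = 0   # Used for padding short sentences
SOS_token = 1   # Start-of-sentence token
EOS_token = 2   # End-of-sentence token


class Vocabulary:
    """Holds token/index lookup tables and simple corpus metadata."""

    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3          # the three special tokens are already present
        self.num_sentences = 0
        self.longest_sentence = 0

    def add_word(self, word):
        if word not in self.word2index:
            # First entry of word into vocabulary
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            # Word exists; increase word count
            self.word2count[word] += 1

    def add_sentence(self, sentence):
        sentence_len = 0
        for word in sentence.split(' '):
            sentence_len += 1
            self.add_word(word)
        if sentence_len > self.longest_sentence:
            # This is the longest sentence
            self.longest_sentence = sentence_len
        # Count the number of sentences
        self.num_sentences += 1

    def to_word(self, index):
        return self.index2word[index]

    def to_index(self, word):
        return self.word2index[word]
```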

Let’s see how this works. First, let’s create an empty vocabulary object:

voc = Vocabulary('test')
print(voc)

<__main__.Vocabulary object at 0x7f80a071c470>

Then we create a simple corpus:

corpus = ['This is the first sentence.',
          'This is the second.',
          'There is no sentence in this corpus longer than this one.',
          'My dog is named Patrick.']
['This is the first sentence.',
 'This is the second.',
 'There is no sentence in this corpus longer than this one.',
 'My dog is named Patrick.']

Let’s loop through the sentences in our corpus and add the words in each to our vocabulary. Remember that add_sentence makes calls to add_word:

for sent in corpus:
  voc.add_sentence(sent)

Now let’s test what we have done:

print('Token 4 corresponds to token:', voc.to_word(4))
print('Token "this" corresponds to index:', voc.to_index('this'))

This is the output, which seems to work well.

Token 4 corresponds to token: is
Token "this" corresponds to index: 13
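As an independent sanity check, the metadata the vocabulary should now hold can be recomputed straight from the corpus. This sketch uses collections.Counter in place of the class and assumes the same single-space split as add_sentence.

```python
from collections import Counter

corpus = ['This is the first sentence.',
          'This is the second.',
          'There is no sentence in this corpus longer than this one.',
          'My dog is named Patrick.']

# Same naive tokenization as add_sentence: split each string on single spaces.
tokenized = [sent.split(' ') for sent in corpus]

# Counter plays the role of word2count here.
token_counts = Counter(tok for sent in tokenized for tok in sent)
num_words = len(token_counts) + 3                 # unique tokens plus PAD, SOS, EOS
num_sentences = len(tokenized)
longest_sentence = max(len(sent) for sent in tokenized)

print(num_words, num_sentences, longest_sentence)                        # 22 4 11
print(token_counts['is'], token_counts['this'], token_counts['This'])    # 4 2 2
```

Note that 'this' and 'This' are counted as separate tokens, a direct consequence of the missing normalization.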

Since our corpus is so small, let’s print out the entire vocabulary of tokens. Note that since we have not yet implemented any kind of useful tokenization beyond splitting on white space, we have some tokens with capitalized first letters, and others with trailing punctuation. Again, we will deal with this more appropriately in a follow-up.

for index in range(voc.num_words):
  print(voc.to_word(index))

Let’s create and print out lists of corresponding tokens and indexes of a particular sentence. Note this time that we have not yet trimmed the vocabulary, nor have we added padding or used the SOS or EOS tokens. We add this to the list of items to take care of next time.

sent_tkns = []
sent_idxs = []
for word in corpus[3].split(' '):
  sent_tkns.append(word)
  sent_idxs.append(voc.to_index(word))
print(sent_tkns)
print(sent_idxs)

['My', 'dog', 'is', 'named', 'Patrick.']
[18, 19, 4, 20, 21]

And there you go. It seems that, even with our numerous noted shortcomings, we have a vocabulary that may end up being useful, given that it exhibits much of the core functionality it will eventually need.

A review of the items we must take care of next time includes:

  • perform normalization of our text data (force all to lowercase, deal with punctuation, etc.)
  • properly tokenize chunks of text
  • make use of SOS, EOS, and PAD tokens
  • trim our vocabulary (minimum number of token occurrences before stored permanently in our vocabulary)

Next time we will implement this functionality, and test our Python vocabulary implementation on a more robust corpus. We will then move data from our vocabulary object into a useful data representation for NLP tasks. Finally, we will get to performing an NLP task on the data we have gone to the trouble of so aptly preparing.
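To give a flavor of the trimming item in the meantime, here is one possible sketch: keep only tokens seen at least a minimum number of times and hand out fresh indexes. The function trim, its parameters, and the counts are all hypothetical, and the follow-up implementation may well differ.

```python
from collections import Counter

def trim(word2count, min_count, num_special=3):
    """Keep tokens seen at least min_count times and assign them fresh indexes.

    Hypothetical helper: indexes start at num_special so that 0, 1 and 2
    stay reserved for PAD, SOS and EOS.
    """
    kept = [word for word, count in word2count.items() if count >= min_count]
    word2index = {word: i + num_special for i, word in enumerate(kept)}
    index2word = {i: word for word, i in word2index.items()}
    return word2index, index2word

# Made-up counts for illustration.
counts = Counter("this is the first this is the second this".split())
word2index, index2word = trim(counts, min_count=2)
print(word2index)   # {'this': 3, 'is': 4, 'the': 5}
```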

