Description
This post walks through a Python implementation of a vocabulary class for storing processed text data and related metadata in a form useful for subsequent NLP tasks.
Summary
- Keep in mind that all of this happens before the actual NLP task even begins.
- What the above states is that our start-of-sentence token (literally 'SOS', below) will take index 1 in our token lookup table once we build it.
- self.num_words = 3 → a count of the number of words (tokens, strictly speaking) in the corpus
- self.num_sentences = 0 → a count of the number of sentences (text chunks of any length, really) in the corpus
- self.longest_sentence = 0 → the length of the longest corpus sentence, in tokens
- From the above, you can see which metadata about our corpus we are concerned with at this point.
- Again, we will deal with this in more detail in a follow-up.
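The pieces summarized above can be sketched as a single class. This is a minimal illustration, not the post's exact code: the method names `add_word` and `add_sentence`, the `name` argument, and the PAD/EOS tokens flanking SOS are assumptions, though they are consistent with SOS sitting at index 1 and `num_words` starting at 3.

```python
# Hypothetical sketch of the vocabulary class described above.
# PAD and EOS are assumed companions of SOS; only SOS's index 1
# is stated in the summary.
PAD_token = 0  # padding token
SOS_token = 1  # start-of-sentence token (index 1, as noted above)
EOS_token = 2  # end-of-sentence token

class Vocabulary:
    def __init__(self, name):
        self.name = name
        self.word2index = {}    # token -> index lookup table
        self.word2count = {}    # token -> occurrence count
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3        # tokens in the corpus (starts at 3 for the specials)
        self.num_sentences = 0    # text chunks of any length
        self.longest_sentence = 0 # length of the longest sentence, in tokens

    def add_word(self, word):
        if word not in self.word2index:
            # First sighting: assign the next free index.
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    def add_sentence(self, sentence):
        sentence_len = 0
        for word in sentence.split(' '):
            self.add_word(word)
            sentence_len += 1
        if sentence_len > self.longest_sentence:
            self.longest_sentence = sentence_len
        self.num_sentences += 1

voc = Vocabulary('test')
voc.add_sentence('this is a test sentence')
print(voc.num_words, voc.num_sentences, voc.longest_sentence)  # → 8 1 5
```

Note how the three counters from the summary are updated: `num_words` grows only on first sight of a token, while `num_sentences` and `longest_sentence` are maintained per call to `add_sentence`.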