How to Create a Vocabulary for NLP Tasks in Python

By KDnuggets - 2021-03-22

Description

This post walks through a Python implementation of a vocabulary class that stores processed text data and related metadata in a form useful for subsequent NLP tasks.

Summary

  • Keep in mind that this all happens prior to the actual NLP task even beginning.
  • What the above states is that our start-of-sentence token (literally 'SOS', below) will take index spot '1' in our token lookup table once we make it.
  • self.num_words = 3 → this will be a count of the number of words (tokens, actually) in the corpus
    self.num_sentences = 0 → this will be a count of the number of sentences (text chunks of arbitrary length, actually) in the corpus
    self.longest_sentence = 0 → this will be the length of the longest corpus sentence by number of tokens
    From the above, you should be able to see what metadata about our corpus we are concerned with at this point.
  • Again, we will deal with this more appropriately in a follow-up.
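
The points above can be sketched as a small class. This is a minimal illustration, not the article's actual code: the summary confirms only that 'SOS' takes index 1 and that num_words starts at 3, so the PAD and EOS tokens at indices 0 and 2, the attribute names word2index, word2count and index2word, the method names add_word and add_sentence, and whitespace tokenization are all assumptions.

```python
class Vocabulary:
    # Reserved token indices. Only SOS = 1 is stated in the summary;
    # PAD = 0 and EOS = 2 are assumptions that explain num_words = 3.
    PAD_TOKEN = 0
    SOS_TOKEN = 1
    EOS_TOKEN = 2

    def __init__(self, name):
        self.name = name
        self.word2index = {}   # token -> index (hypothetical name)
        self.word2count = {}   # token -> frequency (hypothetical name)
        self.index2word = {    # index -> token lookup table
            self.PAD_TOKEN: "PAD",
            self.SOS_TOKEN: "SOS",
            self.EOS_TOKEN: "EOS",
        }
        self.num_words = 3         # the three reserved tokens are counted
        self.num_sentences = 0     # number of text chunks seen so far
        self.longest_sentence = 0  # longest chunk, measured in tokens

    def add_word(self, word):
        if word not in self.word2index:
            # First occurrence: assign the next free index.
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            # Already known: only bump its frequency.
            self.word2count[word] += 1

    def add_sentence(self, sentence):
        # Assumes the text is already normalized; splits on whitespace.
        tokens = sentence.split()
        for token in tokens:
            self.add_word(token)
        self.longest_sentence = max(self.longest_sentence, len(tokens))
        self.num_sentences += 1


vocab = Vocabulary("demo")
vocab.add_sentence("the cat sat")
vocab.add_sentence("the dog sat down quietly")
print(vocab.num_words, vocab.num_sentences, vocab.longest_sentence)
```

Starting the counter at 3 means the first real token receives index 3, so the reserved tokens never collide with corpus words.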

 

Topics

  1. NLP (0.3)
  2. Coding (0.12)
  3. Backend (0.09)

Similar Articles

Sentiment Analysis With Long Sequences

By Medium - 2021-03-10

Sentiment analysis is typically limited by the length of text that can be processed by transformer models like BERT. We will learn how to work around this.

1 line to BioBERT Word Embeddings with NLU in Python

By Medium - 2021-01-17

Including Part of Speech, Named Entity Recognition, Emotion Classification in the same line! With Bonus t-SNE plots! John Snow Labs NLU library gives you 1000+ NLP models and 100+ Word Embeddings in…

Python enumerate(): Simplify Looping With Counters

By realpython - 2020-12-15

Once you learn about for loops in Python, you know that using an index to access items in a sequence isn't very Pythonic. So what do you do when you need that index value? In this tutorial, you'll lea ...