AIM: Write a program to implement Sentence Segmentation & Word Tokenization

THEORY:
Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning.

Sentence Tokenization
Sentence tokenization is the process of splitting text into individual sentences.
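A minimal sketch of sentence segmentation on its own (the sample string is made up for illustration):

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

text = "Mr. Smith went to Washington. He arrived on Monday."
print(sent_tokenize(text))
# ['Mr. Smith went to Washington.', 'He arrived on Monday.']

Note that sent_tokenize does not split blindly on every period; the pretrained punkt model recognizes abbreviations such as "Mr.".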

Word Tokenization
Word tokenization is the most common form of tokenization. It takes natural breaks, like pauses in speech or spaces in text, and splits the data into individual words using delimiters (characters such as spaces, commas, or semicolons). This is the simplest way to separate speech or text into its parts.
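A quick sketch contrasting a naive whitespace split with NLTK's word_tokenize (the sample sentence is made up for illustration):

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

sentence = "Hello, world; this is NLP."
print(sentence.split())
# ['Hello,', 'world;', 'this', 'is', 'NLP.']
print(word_tokenize(sentence))
# ['Hello', ',', 'world', ';', 'this', 'is', 'NLP', '.']

Unlike a plain split, word_tokenize separates punctuation into its own tokens instead of leaving it attached to the words.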

Modules
NLTK contains a module called tokenize, which provides two commonly used functions:
 ✓ Word tokenize: We use the word_tokenize() method to split a sentence into tokens or words.
 ✓ Sentence tokenize: We use the sent_tokenize() method to split a document or paragraph into sentences.

CODE:
#pip install nltk
#py -m pip install --upgrade pip
import nltk
nltk.download('punkt')   # pretrained models used by sent_tokenize/word_tokenize
from nltk.tokenize import sent_tokenize, word_tokenize

# Read the input file, split each line into sentences,
# then split each sentence into word tokens
with open('New Text Document.txt') as f:
    lines = f.readlines()

for content in lines:
    for sentence in sent_tokenize(content):
        print("Sentence is:", sentence)
        print("Tokens are:", word_tokenize(sentence))
        print()
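
OUTPUT (illustrative):
Assuming 'New Text Document.txt' contains the single line "Natural language processing is fun. Tokenization splits text.", the program would print something like:

Sentence is: Natural language processing is fun.
Tokens are: ['Natural', 'language', 'processing', 'is', 'fun', '.']

Sentence is: Tokenization splits text.
Tokens are: ['Tokenization', 'splits', 'text', '.']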