AIM: Write a program to implement Sentence Segmentation & Word Tokenization
THEORY:
Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning.
Sentence Tokenization
Sentence tokenization is the process of splitting text into individual sentences.
Word Tokenization
Word tokenization is the most common form of tokenization. It takes natural breaks, such as pauses in speech or spaces in text, and splits the data into individual words using delimiters (characters like ',' or ';'). It is also the simplest way to separate speech or text into its component parts.
Modules
NLTK provides a module called tokenize, which offers two commonly used methods:
 ✓ Word tokenize: the word_tokenize() method splits a sentence into tokens or words.
 ✓ Sentence tokenize: the sent_tokenize() method splits a document or paragraph into sentences.
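As a quick illustration of the two methods (a minimal sketch; the sample sentence below is made up):

import nltk
nltk.download('punkt')  # tokenizer models used by both methods
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello world. NLTK makes tokenization easy."
print(sent_tokenize(text))  # ['Hello world.', 'NLTK makes tokenization easy.']
print(word_tokenize(text))  # ['Hello', 'world', '.', 'NLTK', 'makes', 'tokenization', 'easy', '.']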
CODE:
# pip install nltk
# py -m pip install --upgrade pip
import nltk
nltk.download('punkt')      # models for sent_tokenize/word_tokenize
# nltk.download('wordnet')  # only needed for the lemmatization practical
from nltk.tokenize import word_tokenize

with open('New Text Document.txt') as f:
    lines = f.readlines()
    for content in lines:
        sentences = nltk.sent_tokenize(content)  # split the line into sentences
        print("Sentences are:", sentences)
        print("Tokens are:", word_tokenize(content))
        print()

AIM: Write a program to implement Stemming & Lemmatization

CODE:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem import LancasterStemmer

words = ['run', 'runner', 'running', 'runs', 'easily', 'fairly']

def portstemming(words):
    ps = PorterStemmer()
    print("Porter Stemmer")
    for word in words:
        print(word, "-->", ps.stem(word))

def snowballstemming(words):
    snowball = SnowballStemmer(language='english')
    print("Snowball Stemmer")
    for word in words:
        print(word, "-->", snowball.stem(word))

def lancasterstemming(words):
    lancaster = LancasterStemmer()
    print("Lancaster Stemmer")
    for word in words:
        print(word, "-->", lancaster.stem(word))

print("Select operation.")
print("1. Porter Stemmer")
print("2. Snowball Stemmer")
print("3. Lancaster Stemmer")
while True:
    choice = input("Enter choice (1/2/3): ")
    if choice in ('1', '2', '3'):
        if choice == '1':
            portstemming(words)
        elif choice == '2':
            snowballstemming(words)
        elif choice == '3':
            lancasterstemming(words)
        next_calculation = input("Do you want to do stemming again? (yes/no): ")
        if next_calculation == "no":
            break
    else:
        print("Invalid input")
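The AIM also mentions lemmatization, which the menu above does not cover. A minimal sketch using NLTK's WordNetLemmatizer (the word list and pos tag are chosen for illustration):

import nltk
nltk.download('wordnet')  # lexical database required by WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ['run', 'runner', 'running', 'runs', 'easily', 'fairly']:
    # pos='v' lemmatizes the word as a verb; the default pos is noun ('n')
    print(word, "-->", lemmatizer.lemmatize(word, pos='v'))

Unlike a stemmer, the lemmatizer only returns dictionary words, so 'easily' stays 'easily' instead of being cut to 'easili' as the Porter stemmer does.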

AIM: Write a program to implement a Tri-Gram Model

CODE:
import nltk
import pandas as pd
from nltk import FreqDist

nltk.download('punkt')

# sample text to tokenize (placeholder; the original input text was not shown)
sample = "The quick brown fox jumps over the lazy dog."

sample_tokens = nltk.word_tokenize(sample)
print('\n Sample Tokens:', sample_tokens)
print('\n Type of Sample Tokens:', type(sample_tokens))
print('\n Length of Sample Tokens:', len(sample_tokens))

sample_freq = FreqDist(sample_tokens)  # frequency of each token
tokens = []
sf = []
for i in sample_freq:
    tokens.append(i)
    sf.append(sample_freq[i])
df = pd.DataFrame({'Tokens': tokens, 'Frequency': sf})
print('\n', df)

print('\n Bigrams:', list(nltk.bigrams(sample_tokens)))
print('\n Trigrams:', list(nltk.trigrams(sample_tokens)))
print('\n N-grams(4):', list(nltk.ngrams(sample_tokens, 4)))
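The listing above enumerates the n-grams but does not yet use them as a model. A count-based sketch (variable names are my own) that predicts the most likely third word given the two preceding words:

from nltk import ConditionalFreqDist

# condition on the first two words of each trigram and count the third
cfd = ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in nltk.trigrams(sample_tokens))

context = ('The', 'quick')  # hypothetical context drawn from the sample sentence
if context in cfd:
    print('Most likely word after', context, ':', cfd[context].max())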