AIM: Write a program to implement Sentence Segmentation & Word Tokenization
THEORY:
Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning.

Sentence Tokenization
Sentence tokenization is the process of splitting text into individual sentences.

Word Tokenization
Word tokenization is the most common form of tokenization. It takes natural breaks, such as pauses in speech or spaces in text, and splits the data into its individual words using delimiters (characters such as ',', ';', or quotation marks). It is the simplest way to separate speech or text into its parts.

Modules
NLTK contains a module called tokenize, which provides two commonly used functions:
✓ Word tokenize: we use the word_tokenize() method to split a sentence into tokens or words.
✓ Sentence tokenize: we use the sent_tokenize() method to split a document or paragraph into sentences.
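For example, on a short two-sentence string the two functions behave as follows (expected output shown in comments, assuming the punkt models are installed):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello world. NLTK makes tokenization easy."
print(sent_tokenize(text))  # ['Hello world.', 'NLTK makes tokenization easy.']
print(word_tokenize(text))  # ['Hello', 'world', '.', 'NLTK', 'makes', 'tokenization', 'easy', '.']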
CODE:
# pip install nltk
# py -m pip install --upgrade pip
# nltk.download('wordnet')  # needed for the lemmatization exercise below
import nltk
nltk.download('punkt')  # Punkt models are required by sent_tokenize/word_tokenize
from nltk.tokenize import sent_tokenize, word_tokenize

with open('New Text Document.txt') as f:
    lines = f.readlines()

for content in lines:
    # Segment each line into sentences, then split it into word tokens
    print("Sentences are:", sent_tokenize(content))
    print("Tokens are:", word_tokenize(content))
    print()
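Note that readlines() feeds sent_tokenize one line at a time, so a sentence that wraps across two lines of the file would be split in half. A minimal variant (reading the same placeholder file in one pass) segments the whole text first:

from nltk.tokenize import sent_tokenize, word_tokenize

with open('New Text Document.txt') as f:
    text = f.read()

# Segment the full text, then tokenize sentence by sentence
for sentence in sent_tokenize(text):
    print("Sentence is:", sentence)
    print("Tokens are:", word_tokenize(sentence))
    print()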
AIM: Write a program to implement Stemming & Lemmatization.
CODE:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem import LancasterStemmer

words = ['run', 'runner', 'running', 'runs', 'easily', 'fairly']

def portstemming(words):
    ps = PorterStemmer()
    print("Porter Stemmer")
    for word in words:
        print(word, "-->", ps.stem(word))

def snowballstemming(words):
    snowball = SnowballStemmer(language='english')
    print("Snowball Stemmer")
    for word in words:
        print(word, "-->", snowball.stem(word))

def lancasterstemming(words):
    lancaster = LancasterStemmer()
    print("Lancaster Stemmer")
    for word in words:
        print(word, "-->", lancaster.stem(word))

print("Select operation.")
print("1. Porter Stemmer")
print("2. Snowball Stemmer")
print("3. Lancaster Stemmer")

while True:
    choice = input("Enter choice (1/2/3): ")
    if choice in ('1', '2', '3'):
        if choice == '1':
            portstemming(words)
        elif choice == '2':
            snowballstemming(words)
        elif choice == '3':
            lancasterstemming(words)
        next_calculation = input("Do you want to do stemming again? (yes/no): ")
        if next_calculation == "no":
            break
    else:
        print("Invalid input")
AIM: Write a program to implement a Tri-Gram Model
CODE:
import nltk
import pandas as pd
from nltk import FreqDist

# Illustrative sample text; any paragraph of text will do
sample = "I love NLP. I love programming. NLP is fun and programming is fun."

sample_tokens = nltk.word_tokenize(sample)
print('\n Sample Tokens:', sample_tokens)
print('\n Type of Sample Tokens:', type(sample_tokens))
print('\n Length of Sample Tokens:', len(sample_tokens))

# Frequency of each distinct token
sample_freq = FreqDist(sample_tokens)
tokens = []
sf = []
for i in sample_freq:
    tokens.append(i)
    sf.append(sample_freq[i])

df = pd.DataFrame({'Tokens': tokens, 'Frequency': sf})
print('\n', df)

print('\n Bigrams:', list(nltk.bigrams(sample_tokens)))
print('\n Trigrams:', list(nltk.trigrams(sample_tokens)))
print('\n N-grams(4):', list(nltk.ngrams(sample_tokens, 4)))
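Listing the trigrams does not by itself make a model. A minimal sketch of one possible next step (an assumption, not part of the original program): condition each trigram on its first two words with a ConditionalFreqDist, so the most frequent third word can be read off as a prediction:

from nltk import ConditionalFreqDist, trigrams

# Count, for every bigram (w1, w2), how often each w3 follows it
cfd = ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in trigrams(sample_tokens))

# Most likely continuation of the first few observed bigrams
for condition in list(cfd.conditions())[:5]:
    print(condition, "-->", cfd[condition].max())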