AIM: Write a program to implement Sentence Segmentation & Word Tokenization
THEORY:
Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning.

Sentence Tokenization
Sentence tokenization is the process of splitting text into individual sentences.

Word Tokenization
Word tokenization is the most common form of tokenization. It takes natural breaks, such as pauses in speech or spaces in text, and splits the data into its individual words using delimiters (characters such as ',', ';', or quotation marks). It is the simplest way to separate speech or text into its parts.

Modules
NLTK contains a module called tokenize, which provides two commonly used functions:
✓ Word tokenize: we use the word_tokenize() method to split a sentence into tokens or words.
✓ Sentence tokenize: we use the sent_tokenize() method to split a document or paragraph into sentences.
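For example, on a short two-sentence string the two functions behave as follows (expected output shown in comments, assuming the punkt models are installed):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello world. NLTK makes tokenization easy."
print(sent_tokenize(text))  # ['Hello world.', 'NLTK makes tokenization easy.']
print(word_tokenize(text))  # ['Hello', 'world', '.', 'NLTK', 'makes', 'tokenization', 'easy', '.']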
CODE:
# pip install nltk
# py -m pip install --upgrade pip
# nltk.download('wordnet')  # needed for the lemmatization exercise below
import nltk
nltk.download('punkt')  # Punkt models are required by sent_tokenize/word_tokenize
from nltk.tokenize import sent_tokenize, word_tokenize

with open('New Text Document.txt') as f:
    lines = f.readlines()

for content in lines:
    # Segment each line into sentences, then split it into word tokens
    print("Sentences are:", sent_tokenize(content))
    print("Tokens are:", word_tokenize(content))
    print()
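Note that readlines() feeds sent_tokenize one line at a time, so a sentence that wraps across two lines of the file would be split in half. A minimal variant (reading the same placeholder file in one pass) segments the whole text first:

from nltk.tokenize import sent_tokenize, word_tokenize

with open('New Text Document.txt') as f:
    text = f.read()

# Segment the full text, then tokenize sentence by sentence
for sentence in sent_tokenize(text):
    print("Sentence is:", sentence)
    print("Tokens are:", word_tokenize(sentence))
    print()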
AIM: Write a program to implement Stemming & Lemmatization.
CODE:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem import LancasterStemmer

words = ['run', 'runner', 'running', 'runs', 'easily', 'fairly']

def portstemming(words):
    ps = PorterStemmer()
    print("Porter Stemmer")
    for word in words:
        print(word, "-->", ps.stem(word))

def snowballstemming(words):
    snowball = SnowballStemmer(language='english')
    print("Snowball Stemmer")
    for word in words:
        print(word, "-->", snowball.stem(word))

def lancasterstemming(words):
    lancaster = LancasterStemmer()
    print("Lancaster Stemmer")
    for word in words:
        print(word, "-->", lancaster.stem(word))

print("Select operation.")
print("1. Porter Stemmer")
print("2. Snowball Stemmer")
print("3. Lancaster Stemmer")

while True:
    choice = input("Enter choice (1/2/3): ")
    if choice in ('1', '2', '3'):
        if choice == '1':
            portstemming(words)
        elif choice == '2':
            snowballstemming(words)
        elif choice == '3':
            lancasterstemming(words)
        next_calculation = input("Do you want to do stemming again? (yes/no): ")
        if next_calculation == "no":
            break
    else:
        print("Invalid input")
AIM: Write a program to implement a Tri-Gram Model
CODE:
import nltk
import pandas as pd
from nltk import FreqDist

# Illustrative sample text; any paragraph of text will do
sample = "I love NLP. I love programming. NLP is fun and programming is fun."

sample_tokens = nltk.word_tokenize(sample)
print('\n Sample Tokens:', sample_tokens)
print('\n Type of Sample Tokens:', type(sample_tokens))
print('\n Length of Sample Tokens:', len(sample_tokens))

# Frequency of each distinct token
sample_freq = FreqDist(sample_tokens)
tokens = []
sf = []
for i in sample_freq:
    tokens.append(i)
    sf.append(sample_freq[i])

df = pd.DataFrame({'Tokens': tokens, 'Frequency': sf})
print('\n', df)

print('\n Bigrams:', list(nltk.bigrams(sample_tokens)))
print('\n Trigrams:', list(nltk.trigrams(sample_tokens)))
print('\n N-grams(4):', list(nltk.ngrams(sample_tokens, 4)))
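Listing the trigrams does not by itself make a model. A minimal sketch of one possible next step (an assumption, not part of the original program): condition each trigram on its first two words with a ConditionalFreqDist, so the most frequent third word can be read off as a prediction:

from nltk import ConditionalFreqDist, trigrams

# Count, for every bigram (w1, w2), how often each w3 follows it
cfd = ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in trigrams(sample_tokens))

# Most likely continuation of the first few observed bigrams
for condition in list(cfd.conditions())[:5]:
    print(condition, "-->", cfd[condition].max())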