AIM: Write a program to implement Sentence Segmentation & Word Tokenization
THEORY:
Tokenization is used in natural language processing to split paragraphs and sentences into
smaller units that can be more easily assigned meaning.
Sentence Tokenization
Sentence tokenization is the process of splitting text into individual sentences.
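To make the idea concrete, here is a minimal sketch of sentence splitting using only Python's standard library. It assumes a sentence ends with '.', '!' or '?' followed by whitespace; the function name split_sentences is hypothetical and chosen for illustration. This is not how NLTK's sent_tokenize() works internally, and the simple rule will wrongly split after abbreviations such as "Dr.".

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the end-of-sentence punctuation attached to its sentence.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty strings that can appear for empty input.
    return [p for p in parts if p]

print(split_sentences("Hello world. How are you? Fine!"))
# ['Hello world.', 'How are you?', 'Fine!']
```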
Word Tokenization
Word tokenization is the most common form of tokenization. It takes natural breaks, like
pauses in speech or spaces in text, and splits the data into its respective words using delimiters
(characters such as ‘,’ or ‘;’ or ‘"’). This is the simplest way to separate speech or text into its
parts.
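The delimiter-based splitting described above can be sketched with a single regular expression. This is an illustrative approximation under simple assumptions (runs of word characters form one token, each punctuation mark is its own token, whitespace is discarded); the function name simple_word_tokenize is hypothetical, and NLTK's word_tokenize() applies more elaborate rules.

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+       matches a run of letters, digits or underscores (a word)
    # [^\w\s]   matches one character that is neither word nor whitespace (punctuation)
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Hello, world; bye."))
# ['Hello', ',', 'world', ';', 'bye', '.']
```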
Modules
NLTK contains a package called nltk.tokenize, which provides the two functions used here:
✓ Word tokenize: We use the word_tokenize() function to split a sentence into tokens or words.
✓ Sentence tokenize: We use the sent_tokenize() function to split a document or paragraph into
sentences.
CODE:
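# Setup (run once in a terminal):
#   pip install nltk
#   py -m pip install --upgrade pip
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the Punkt sentence tokenizer data (needed once).
# Newer NLTK releases may also require nltk.download('punkt_tab').
nltk.download('punkt')

# Read the input file line by line.
with open('New Text Document.txt') as f:
    lines = f.readlines()

for content in lines:
    # Split each line into sentences, then each sentence into word tokens.
    for sentence in sent_tokenize(content):
        print("Sentence is:", sentence)
        print("Tokens are:", word_tokenize(sentence))
        print()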
{"html5":"htmlmixed","css":"css","javascript":"javascript","php":"php","python":"python","ruby":"ruby","lua":"text\/x-lua","bash":"text\/x-sh","go":"go","c":"text\/x-csrc","cpp":"text\/x-c++src","diff":"diff","latex":"stex","sql":"sql","xml":"xml","apl":"apl","asterisk":"asterisk","c_loadrunner":"text\/x-csrc","c_mac":"text\/x-csrc","coffeescript":"text\/x-coffeescript","csharp":"text\/x-csharp","d":"d","ecmascript":"javascript","erlang":"erlang","groovy":"text\/x-groovy","haskell":"text\/x-haskell","haxe":"text\/x-haxe","html4strict":"htmlmixed","java":"text\/x-java","java5":"text\/x-java","jquery":"javascript","mirc":"mirc","mysql":"sql","ocaml":"text\/x-ocaml","pascal":"text\/x-pascal","perl":"perl","perl6":"perl","plsql":"sql","properties":"text\/x-properties","q":"text\/x-q","scala":"scala","scheme":"text\/x-scheme","tcl":"text\/x-tcl","vb":"text\/x-vb","verilog":"text\/x-verilog","yaml":"text\/x-yaml","z80":"text\/x-z80"}