Bag-of-Words
“Language is a wonderful medium of communication”.
We communicate in different languages; however, machines cannot process text in its raw form. We need a way to convert text into a numeric format that machines can process easily.
So what we are looking for is a way to convert text data into a vector of numbers. The underlying notion is that texts with similar meanings should be converted into similar vectors.
For example, say we have collected reviews from 3 people about a recently released movie. Let these reviews be R1, R2, R3, with corresponding vectors V1, V2, V3. Let R1 and R2 be positive reviews while R3 is a negative review. Then vectors V1 and V2 should be similar or close, i.e. the distance between V1 and V2 should be small compared to the distance between V1 and V3, which represent opposite reviews.
Bag-of-Words: Bag-of-Words converts text into fixed-length vectors.
It is called a ‘bag’ of words because, just as no order is maintained after we put things in a bag (we don’t keep track of which thing went in first), the bag-of-words model is mainly concerned with the occurrence of words in a document rather than the order in which they appear.
Steps to follow:
Collect Data: Say we have reviews from 3 people about a movie, and we want to convert these reviews into a vectorized format, i.e. we want to convert the text reviews into vectors.
R1: Good action and good songs
R2: Good story and songs
R3: Boring Story
Each of these reviews is called a text document, and the collection of these documents is called a corpus.
Create Vocabulary:
It is basically a list of all the unique words that appear in the corpus.
So, based on the above 3 reviews, we have the following unique words:
{good, action, and, songs, story, boring}
Since we have 6 unique words, we will have fixed vectors of 6 dimensions, where each word is a dimension.
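As a minimal sketch of this step (plain Python; lowercasing and whitespace tokenization are simplifying assumptions, and the variable names are just for illustration):

```python
# Build the vocabulary: the set of unique (lowercased) words in the corpus.
corpus = [
    "Good action and good songs",  # R1
    "Good story and songs",        # R2
    "Boring story",                # R3
]

vocabulary = sorted({word.lower() for doc in corpus for word in doc.split()})
print(vocabulary)
# ['action', 'and', 'boring', 'good', 'songs', 'story']
```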
Create Document Vectors:
Approach 1:
Count the frequency of each word in a document.
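A minimal sketch of this approach, reusing the same toy corpus (tokenization by whitespace and lowercasing are simplifying assumptions):

```python
corpus = ["Good action and good songs", "Good story and songs", "Boring story"]
vocabulary = sorted({w.lower() for doc in corpus for w in doc.split()})
# vocabulary: ['action', 'and', 'boring', 'good', 'songs', 'story']

# Approach 1: each dimension holds the raw count of that word in the document.
def count_vector(document):
    words = document.lower().split()
    return [words.count(term) for term in vocabulary]

for review in corpus:
    print(count_vector(review))
# R1 -> [1, 1, 0, 2, 1, 0]   ('good' appears twice)
# R2 -> [0, 1, 0, 1, 1, 1]
# R3 -> [0, 0, 1, 0, 0, 1]
```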
Approach 2:
Binary Count: Look for the presence of a word in a document.
So instead of the frequency, we put 1 if the word appears at least once in the document, else we put 0.
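A minimal sketch of the binary variant under the same assumptions:

```python
corpus = ["Good action and good songs", "Good story and songs", "Boring story"]
vocabulary = sorted({w.lower() for doc in corpus for w in doc.split()})

# Approach 2: 1 if the word occurs at least once in the document, else 0.
def binary_vector(document):
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

for review in corpus:
    print(binary_vector(review))
# R1 -> [1, 1, 0, 1, 1, 0]   (the repeated 'good' still contributes only 1)
# R2 -> [0, 1, 0, 1, 1, 1]
# R3 -> [0, 0, 1, 0, 0, 1]
```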
Approach 3:
We can find the frequency of a word in a document and then normalize it. Since documents differ in length, a term is likely to appear more often in long documents than in short ones, so we divide the word frequency by the document length, i.e. the total number of words in the document.
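After normalization, the vectors for the toy corpus can be computed as in this sketch (same assumptions as above):

```python
corpus = ["Good action and good songs", "Good story and songs", "Boring story"]
vocabulary = sorted({w.lower() for doc in corpus for w in doc.split()})

# Approach 3: divide each word's count by the total number of words
# in the document, so long and short documents become comparable.
def normalized_vector(document):
    words = document.lower().split()
    return [words.count(term) / len(words) for term in vocabulary]

for review in corpus:
    print([round(x, 2) for x in normalized_vector(review)])
# R1 -> [0.2, 0.2, 0.0, 0.4, 0.2, 0.0]   (5 words; 'good' twice -> 2/5)
# R2 -> [0.0, 0.25, 0.0, 0.25, 0.25, 0.25]
# R3 -> [0.0, 0.0, 0.5, 0.0, 0.0, 0.5]
```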
Drawbacks:
As the number of terms, i.e. unique words, increases, the vector length increases, so the vectors become more difficult to manage in terms of space.
It creates sparse vectors, i.e. vectors in which most elements are zero.
It does not maintain word order, which is very important in many cases.
For example, consider two statements:
Statement 1: Turn left then right
Statement 2: Turn right then left
Both would have the same vector representation, even though the statements mean totally opposite things, as the sketch below shows.
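A small sketch that demonstrates the problem (binary vectors over the words of the two statements; names are illustrative):

```python
# Both statements contain exactly the same words, so any bag-of-words
# vector (count, binary, or normalized) is identical for the two,
# even though their meanings are opposite.
vocabulary = sorted({"turn", "left", "then", "right"})

def binary_vector(sentence):
    words = set(sentence.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

print(binary_vector("Turn left then right"))  # [1, 1, 1, 1]
print(binary_vector("Turn right then left"))  # [1, 1, 1, 1]
```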
Nevertheless, it is a simple and straightforward technique to convert text into vectors.