Understanding Word Embeddings from scratch | Example LSTM/GRU model

The final destination to intuitively understand word embeddings… finally

Yo reader! I am Manik. What’s up?

Hope you’re doing great and working hard for your goals. If not, it’s never too late. Start now, at this very moment.

With this piece of writing, you’ll walk away with a clear explanation of sequence and text processing for deep neural networks, which includes:

  1. What’s one-hot encoding?
  2. One-hot encoding with Keras.
  3. What are word embeddings, and what is their advantage over one-hot encoding?
  4. What are word embeddings trying to say?
  5. A complete example of converting raw text to word embeddings in Keras with an LSTM and a GRU layer.

If you want to learn about LSTMs, you can go here.

Let’s get started.

“Yours and mine ancestors had run after a mastodons or wild boar, like an olympic sprinter, with a spear in hand covering themselves with leaves and tiger skin, for their breakfast” — History

The above sentence is in textual form, and for neural networks to understand and ingest it, we need to convert it into some numeric form. Two ways of doing that are one-hot encoding and word embeddings.

One-Hot

This is a way of representing each word by an array of 0s and a single 1. In the array, only one index has a ‘1’ present; the rest are all 0s.

Example: the following vector represents a single word in a sentence with 6 unique words.

One-hot vector example (Image by Author)

One-Hot with numpy

Let’s find all the unique words in our sentence.
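There is no single right way to do this; a minimal sketch with NumPy, using the sentence above, could look like:

import numpy as np

sentence = ("Yours and mine ancestors had run after a mastodons or wild boar, "
            "like an olympic sprinter, with a spear in hand covering themselves "
            "with leaves and tiger skin, for their breakfast")

# Split on whitespace and keep only the unique tokens (np.unique also sorts them)
unique_words = np.unique(sentence.split())
print(unique_words)
print("shape:", unique_words.shape)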

array(['Yours', 'a', 'after', 'an', 'ancestors', 'and', 'boar,',
'breakfast', 'covering', 'for', 'had', 'hand', 'in', 'leaves',
'like', 'mastodons', 'mine', 'olympic', 'or', 'run', 'skin,',
'spear', 'sprinter,', 'their', 'themselves', 'tiger', 'wild',
'with'], dtype='<U10')
shape: (28,)

Now, give each of them an index, i.e. create a word_index: a dictionary where each word has an index attached to it.
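A quick sketch, continuing from the unique_words array above and leaving index 0 unused:

# Map each word to an integer index, starting from 1 (0 stays reserved)
word_index = {word: i for i, word in enumerate(unique_words, start=1)}
print(word_index)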

You might have observed in the code above that 0 is not assigned to any word. It’s a reserved index in Keras (we’ll get to that later).

{'Yours': 1, 'a': 2, 'after': 3, 'an': 4, 'ancestors': 5, 'and': 6, 'boar,': 7, 'breakfast': 8, 'covering': 9, 'for': 10, 'had': 11, 'hand': 12, 'in': 13, 'leaves': 14, 'like': 15, 'mastodons': 16, 'mine': 17, 'olympic': 18, 'or': 19, 'run': 20, 'skin,': 21, 'spear': 22, 'sprinter,': 23, 'their': 24, 'themselves': 25, 'tiger': 26, 'wild': 27, 'with': 28}

Now, let’s create one-hot encodings for them.
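A simple way to do this with the word_index dictionary from above:

# One vector per word: all zeros except a single 1 at that word's index
vocab_size = len(word_index) + 1          # +1 because index 0 is reserved
for word, index in word_index.items():
    vector = np.zeros(vocab_size)
    vector[index] = 1.0
    print(word, vector)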

Example output: This is how “yours” is represented.

Yours [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

One-hot keras example

texts_to_matrix is the Tokenizer method used to return one-hot encodings.
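A minimal sketch, reusing the sentence string from the NumPy example above (note that the Tokenizer lowercases and strips punctuation by default, so its word index differs slightly from ours):

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts([sentence])     # builds tokenizer.word_index, starting at 1
# Treating every word as its own "text" gives one one-hot row per word
one_hot = tokenizer.texts_to_matrix(sentence.split(), mode='binary')
print(one_hot.shape)                   # (number_of_words, vocabulary_size + 1)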

You can see that, to represent a word, we are actually wasting a lot of memory just to store 0s (a sparse matrix). These one-hot encodings also don’t reflect any relation between similar words; they are just representations of a word with a ‘1’. Two similar words such as “accurate” and “exact” might be at very different positions in one-hot encodings.

What if we could represent a word with less space, using a representation that carries meaning we can learn from?

Word Embeddings

Word vectors example
  • Word embeddings also represent words as arrays, but as continuous vectors rather than 0s and 1s.
  • They can represent any word in a few dimensions, usually chosen based on the number of unique words in our text.
  • They are dense, low-dimensional vectors.
  • They are not hardcoded but are “learned” from data.

What are word embeddings trying to say?

  • The geometric relationships between words in a word-embedding space can represent the semantic relationships between them. Words closer to each other have a stronger relation than words farther apart.
  • Vectors/words being closer to each other means the cosine distance or geometric distance between them is smaller than between other pairs.
  • There could be a “male to female” vector that represents the relation between a word and its feminine counterpart. That vector may help us predict “king” when “he” is used and “queen” when “she” is used in a sentence.
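To make that idea concrete, here is a toy sketch with made-up 3-dimensional vectors (not real learned embeddings); real embeddings behave similarly in far more dimensions:

import numpy as np

def cosine_similarity(a, b):
    # Closer to 1 means the vectors point in a similar direction, i.e. "closer" words
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors, invented purely for illustration
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.9, 0.1, 0.8])
man   = np.array([0.5, 0.9, 0.0])
woman = np.array([0.5, 0.0, 0.9])

# The "male to female" direction applied to "king" lands near "queen"
print(cosine_similarity(king - man + woman, queen))   # ~0.98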

What do word embeddings look like?

Below is a single row of an embedding matrix representing the word ‘the’ in 100 dimensions, from a text with 100K unique words.

A row from the embedding matrix, representing a single word in 100 dimensions

Such matrices are learned from data and can represent any text with millions of words in 100, 200, 1,000 or more dimensions (the same would require around a million dimensions if one-hot encoding were used).

Let’s see how to create embeddings of our text in keras with a recurrent neural network.

Steps to follow to convert raw data to embeddings:

Flow of embeddings with a Keras recurrent neural network
  1. Load text data in array.
  2. Process the data.
  3. Convert the text to sequences using the tokenizer and pad them with the keras.preprocessing.sequence.pad_sequences method.
  4. Initialise a model with an Embedding layer of dimensions (max_words, representation_dimensions, input_size).
  • max_words: the size of the vocabulary, i.e. the number of unique words you keep from your data.
  • representation_dimensions: the number of dimensions in which you want to represent a word. A common rule of thumb is (number of unique words)^(1/4).
  • input_size: the size of your padded sequences (maxlen).

  5. Run the model.

Let’s follow the above steps for IMDB raw data. All the code below is present in my Kaggle notebook.

Step 1. Necessary imports
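Roughly the following, assuming a TensorFlow/Keras setup (the exact imports may differ slightly from the notebook):

import numpy as np
import pandas as pd

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, GRU, Dense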

Step 2. Load the text data.

Loading the text data with pandas.
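Something like the following; the file name and column names ('review', 'sentiment') are assumptions based on the common Kaggle IMDB reviews CSV:

# File path and column names are assumptions for illustration
data = pd.read_csv('IMDB Dataset.csv')
texts = data['review'].values
print(data.shape)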

Step 3: Process the data.

Marking 1 for a positive movie review and 0 for a negative review.
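Assuming the 'sentiment' column holds the strings 'positive'/'negative', the labels can be built like this:

# 1 for a positive review, 0 for a negative one
labels = (data['sentiment'] == 'positive').astype(int).values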

Step 4: Creating and padding the sequence.

Creating an instance of Keras’s Tokenizer class and padding the sequences to ‘maxlen’.
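A sketch of that step; max_words and maxlen are illustrative choices, not necessarily the notebook’s values:

max_words = 10000   # keep only the 10,000 most frequent words
maxlen = 100        # pad / truncate every review to 100 tokens

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)                     # build the word index from the reviews
sequences = tokenizer.texts_to_sequences(texts)   # each review becomes a list of integers
x = pad_sequences(sequences, maxlen=maxlen)       # shape: (num_reviews, maxlen)
y = np.asarray(labels)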

Step 5. Initialise our model

A simple recurrent neural network with an Embedding layer as the first layer.
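A minimal version of such a model; the layer sizes are illustrative, and swapping LSTM(32) for GRU(32) gives the GRU variant reported below:

embedding_dim = 32   # illustrative; the rule of thumb above would suggest a smaller value

model = Sequential([
    Embedding(max_words, embedding_dim, input_length=maxlen),  # (max_words, dims, input_size)
    LSTM(32),                                # replace with GRU(32) for the GRU run
    Dense(1, activation='sigmoid')           # binary sentiment output
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()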

Step 6: Run the model!
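For example (the epochs, batch size and validation split are illustrative):

history = model.fit(x, y,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)   # hold out 20% of the data for validation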

Outputs

With GRU:

GRU training and validation

With LSTM:

LSTM training and validation

All the above code is present here.

If this piece of writing helped you in any way, do leave a 👏. It encourages us to write more and better articles.

Thanks for making it all the way here! Great!
