Word Embeddings, WordPiece and Language-Agnostic BERT (LaBSE)

Bijula Ratheesh
7 min read · Feb 20, 2021

Word embeddings are representations of words in a numeric format that a computer can understand. The simplest example would be (Yes, No) represented as (1, 0). But when we are dealing with large texts and corpora, this may not be an efficient way to represent words and sentences. For large corpora, the co-occurrences of words and their probabilities play a major role.

Let’s explore some techniques of word representations…

One-hot encoding

In one-hot encoding, each word in a sentence is represented by a vector.

For example, consider the sentence ‘I love dogs.’ It has 3 words, and each of these words is represented as

I — (1,0,0)

Love — (0,1,0)

Dogs — (0,0,1)

A one is placed at the position corresponding to the word, and zeros everywhere else.
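A minimal sketch of one-hot encoding for this three-word sentence, using NumPy for convenience:

import numpy as np

sentence = "I love dogs"
vocab = sentence.lower().split()          # ['i', 'love', 'dogs']

# Identity matrix gives one row per word: a 1 at the word's index, 0 elsewhere
one_hot = np.eye(len(vocab), dtype=int)
for word, vector in zip(vocab, one_hot):
    print(word, vector)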

However, this method captures no relationship between words, which makes it a poor fit for tasks such as sentiment analysis or question answering.

Moreover, for a large corpus these vectors become enormous, sparse and essentially meaningless.

CBOW and Continuous Skip-Gram Model

It is important to have words represented in a numeric format, but those representations should also carry meaning; simply put, they should have contextual meaning in order to perform various NLP tasks.

In the paper ‘Efficient Estimation of Word Representations in Vector Space’, Google proposed two architectures for computing continuous vector representations of words from very large data sets, known as the Continuous Bag-of-Words (CBOW) and Continuous Skip-Gram models.

The CBOW model’s objective is to predict the current (middle) word, given 4 future and 4 history words as input. The vectors of these future and history words are averaged in a shared projection layer and fed to the model, which is trained as a shallow feed-forward neural network on a corpus of about 6B words from Google News. The output layer uses a softmax over the vocabulary, written in terms of a scoring function s:

p(w_c | w_t) = exp(s(w_t, w_c)) / Σ_{j=1…W} exp(s(w_t, j))

where w_c is the context word, w_t is the given word, W is the size of the vocabulary and s is the scoring function.

Since the model does not depend on the order of the words in the context window, it is called a bag-of-words model.

The skip-gram model predicts words within a certain range before and after the given input word using a log-linear classifier. It is trained in a similar way to CBOW.

These two models produce word vectors that capture more effective semantic and syntactic relationships between words. Their accuracy, measured on word-similarity and word-analogy tasks, was around 60%.

The Word2Vec class in the Gensim package can be used to train CBOW and skip-gram models. See the example below.
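The original embedded notebook is not reproduced here; the following is a minimal sketch using Gensim 4.x, where the sg parameter switches between CBOW (sg=0) and skip-gram (sg=1) and the toy corpus is purely illustrative:

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative only)
sentences = [
    ["i", "love", "dogs"],
    ["i", "love", "cats"],
    ["dogs", "and", "cats", "are", "pets"],
]

# sg=0 trains CBOW, sg=1 trains skip-gram
cbow = Word2Vec(sentences, vector_size=100, window=4, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=4, min_count=1, sg=1)

print(cbow.wv["dogs"][:5])               # first few dimensions of the word vector
print(skipgram.wv.most_similar("dogs"))  # nearest neighbours in the toy space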

GloVe

Stanford University came up with a new model for word embeddings called GloVe (Global Vectors for Word Representation). GloVe achieved an accuracy of 75% on the word-analogy dataset and also outperformed other models on word-similarity tasks.

GloVe works similarly to word2vec, but takes the ratio of co-occurrence probabilities as its signal rather than the probabilities alone with respect to a context word. The GloVe paper illustrates this with the co-occurrence probabilities of the target words ice and steam with selected context words k: compared to the raw probabilities, the ratio P(k|ice)/P(k|steam) is better able to distinguish relevant words (solid and gas) from irrelevant words (water and fashion), and it is also better able to discriminate between the two relevant words.

GloVe uses a log-bilinear model, as opposed to the log-linear model in CBOW and skip-gram. Gensim's word2vec tooling can be used with GloVe by loading a pre-trained GloVe model converted to word2vec format, as seen below; it then performs the same tasks as word2vec.
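A minimal sketch using the pre-converted GloVe vectors distributed through gensim-data (the model name glove-wiki-gigaword-100 is one of the available downloads; the first call fetches it over the network):

import gensim.downloader as api

# Pre-trained GloVe vectors, already converted to word2vec (KeyedVectors) format
glove = api.load("glove-wiki-gigaword-100")

print(glove.most_similar("ice"))         # nearest neighbours of 'ice'
print(glove.similarity("ice", "steam"))  # cosine similarity between two words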

FastText

FastText, from Facebook, works on the same logic as the skip-gram model, with an emphasis on the morphology of words. This means the model considers the inflection, derivation and composition of words. Each word is represented as a bag of character n-grams, also called subwords. A vector representation is associated with each character n-gram, and the word is represented as the sum of these representations, giving the scoring function below:

s(w, c) = Σ_{g ∈ G_w} z_g · v_c

where w is the given word, c is the context word, G_w ⊂ {1, . . . , G} is the set of n-grams appearing in w, z_g is the vector representation of n-gram g, and v_c is the vector of the context word.

This model, also called the subword model, allows sharing representations across words, which makes it possible to learn reliable representations for rare words. This scoring function is then used in the output layer of the feed-forward neural network.

In contrast to the skip-gram model, which uses a softmax in the output layer to predict the context word, FastText treats this as a set of independent binary classification tasks, where the goal is to independently predict the presence (or absence) of context words.

For a chosen context position c, using the binary logistic loss, the negative log-likelihood is given as

ℓ(w_t, w_c) = log(1 + exp(-s(w_t, w_c))) + Σ_{n ∈ N_{t,c}} log(1 + exp(s(w_t, n)))

where N_{t,c} is a set of negative examples sampled from the vocabulary.

The FastText model is available in the Gensim library, and we can either load already-trained vectors or train the model on a completely new corpus. Below is an example.
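The original embedded example is not shown here; the following is a minimal sketch that trains Gensim's FastText on a toy corpus and then queries an out-of-vocabulary word, which works because the vector is built from character n-grams:

from gensim.models import FastText

# Toy corpus (illustrative only)
sentences = [
    ["i", "love", "dogs"],
    ["dogs", "are", "loyal", "pets"],
    ["cats", "are", "independent", "pets"],
]

# min_n / max_n control the length of the character n-grams (subwords)
model = FastText(sentences, vector_size=100, window=3, min_count=1, min_n=3, max_n=6)

print(model.wv["doggos"][:5])        # OOV word: composed from its character n-grams
print(model.wv.most_similar("dogs")) # nearest neighbours in the toy space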

WordPiece Embeddings in BERT

WordPiece embeddings were developed for Google's speech recognition system for Asian languages such as Korean and Japanese. These languages have a large inventory of characters and homonyms, and few or no spaces between words, so the text must be segmented; segmentation, however, produces a lot of out-of-vocabulary (OOV) words in the model. WordPiece was therefore created to learn word units automatically from large amounts of data, and it does not produce any OOVs. This technique of dealing with OOVs is used in BERT. OOVs are simply ignored in word2vec and GloVe, whereas in FastText the character n-gram representation of a word compensates for the OOV.
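A quick way to see WordPiece in action is through a BERT tokenizer; a minimal sketch using the Hugging Face transformers package (the model name bert-base-uncased and the sample output shown in the comment are illustrative assumptions):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Rare or unseen words are split into subword units instead of becoming OOV
print(tokenizer.tokenize("embeddings are wonderful"))
# e.g. ['em', '##bed', '##ding', '##s', 'are', 'wonderful']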

BERT uses WordPiece embeddings with a 30,000-token vocabulary. The first token of every sequence is always a special classification token ([CLS]); the final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence: the sentences are separated by a special token ([SEP]), and a learned segment embedding is added to every token indicating whether it belongs to sentence A or sentence B. For a given token, its input representation is constructed by summing the corresponding token, segment and position embeddings. These inputs are used to pre-train BERT on the masked language modeling and next sentence prediction tasks. BERT has revolutionized many NLP applications; however, its construction makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering and information retrieval via semantic search.
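A minimal sketch of how a sentence pair is packed into a single BERT input sequence, again using the Hugging Face tokenizer (the sentences are illustrative):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sentence pair packed as: [CLS] sentence A [SEP] sentence B [SEP]
enc = tokenizer("I love dogs.", "Dogs are loyal.", return_token_type_ids=True)

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])  # 0 marks sentence A tokens, 1 marks sentence B tokens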

Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses Siamese and triplet network structures, was proposed to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.

Visit https://www.sbert.net/docs/pretrained_models.html for English sentence embeddings.
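A minimal sketch with the sentence-transformers package; all-MiniLM-L6-v2 is one of the pre-trained English models listed on that page, and the sentences are illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(["I love dogs.", "Dogs are my favourite animals."])
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity of the pair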

LaBSE

Language-agnostic BERT Sentence Embeddings (LaBSE) is an adaptation of BERT that produces language-agnostic sentence embeddings for 109 languages. SBERT can produce English sentence embeddings, but these cannot be used in multilingual cases. The LaBSE model combines masked language model (MLM) and translation language model (TLM) pretraining with a translation ranking task using bidirectional dual encoders.

LaBSE uses a dual encoder: paired encoders feed a dot-product scoring function. Source and target sentences are encoded separately using a shared BERT-based encoder. The final-layer [CLS] representations are taken as the sentence embeddings for each input. The similarity between the source and target sentences is then scored using cosine similarity over the sentence embeddings produced by the BERT encoders.

The bidirectional dual encoder is trained using an additive margin softmax loss with in-batch negative sampling, as given below:

L = -(1/N) Σ_{i=1…N} log [ exp(φ(x_i, y_i) - m) / ( exp(φ(x_i, y_i) - m) + Σ_{n≠i} exp(φ(x_i, y_n)) ) ]

The embedding-space similarity of x and y is given by φ(x, y), where φ(x, y) = cosine(x, y); m is the additive margin applied to the true translation pair, and the other target sentences in the batch act as negatives.
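LaBSE is also published as a sentence-transformers model, so a minimal usage sketch looks like the following (the model name sentence-transformers/LaBSE refers to the public checkpoint; the sentences are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

# Translations of the same sentence in English, French and German
sentences = ["I love dogs.", "J'aime les chiens.", "Ich liebe Hunde."]
embeddings = model.encode(sentences)

# Cross-lingual similarity matrix: translation pairs should score close to 1
print(util.cos_sim(embeddings, embeddings))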

The GitHub link to the full notebook, including the LaBSE code, is below:

https://github.com/bijular/datascience/blob/master/Word_Embedding.ipynb
