ELECTRA vs BERT – A comparative study

Bijula Ratheesh
Nov 15, 2020 · 4 min read


With the launch of ELECTRA, we are most likely set for another revolution in NLP and NLU tasks, and we are all looking forward to it. The Google AI blog post of March 2020 gives elaborate details on ELECTRA. In this blog post, however, I thought I would give a comparative study of the model architectures of both.

It is impossible to start without discussing the Transformer, which initiated the move away from RNNs and LSTMs for sequence modeling and transduction problems such as translation and language modeling. Transformers eliminate the sequential nature of those models and introduce parallelism by relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer uses stacked self-attention (multi-head attention) and point-wise, fully connected feed-forward layers in both the encoder and the decoder.
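To make that encoder stack concrete, here is a minimal PyTorch sketch using the library's built-in Transformer modules. The layer sizes below are illustrative only, not the configuration used by BERT or ELECTRA.

```python
# Minimal sketch of a Transformer encoder stack: each layer combines
# multi-head self-attention with a position-wise feed-forward block.
import torch
import torch.nn as nn

d_model, n_heads, ffn_dim, n_layers = 256, 4, 1024, 2  # illustrative sizes

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                   dim_feedforward=ffn_dim, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

tokens = torch.randn(8, 32, d_model)   # (batch, sequence, embedding)
contextual = encoder(tokens)           # every position attends to every other
print(contextual.shape)                # torch.Size([8, 32, 256])
```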

BERT (Bidirectional Encoder Representations from Transformers)

BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning.

The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features.

The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters.

The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.
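The sketch below contrasts the two strategies using the Hugging Face `transformers` library. The model name, classification head, and learning rates are assumptions for illustration, not values prescribed by either paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

# Option A: feature-based (ELMo-style). The pretrained encoder is frozen and
# its representations are used as extra features for a task-specific model.
feature_encoder = AutoModel.from_pretrained("bert-base-uncased")
for p in feature_encoder.parameters():
    p.requires_grad = False
task_head_a = nn.Linear(feature_encoder.config.hidden_size, 2)
optimizer_a = torch.optim.AdamW(task_head_a.parameters(), lr=1e-3)

# Option B: fine-tuning (GPT/BERT-style). A minimal head is added and ALL
# pretrained parameters are updated on the downstream task.
finetune_encoder = AutoModel.from_pretrained("bert-base-uncased")
task_head_b = nn.Linear(finetune_encoder.config.hidden_size, 2)
optimizer_b = torch.optim.AdamW(
    list(finetune_encoder.parameters()) + list(task_head_b.parameters()),
    lr=2e-5)
```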

Pretraining in BERT

BERT eliminates the unidirectionality by introducing Masked Language Modeling (MLM) based pretraining. The masked language model randomly masks some of the tokens in the input, and the objective is to predict the original vocabulary id of each masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pretrain a deep bidirectional Transformer. In addition to the masked language model, a Next Sentence Prediction task is also used that jointly pretrains text-pair representations.

One of the downsides of MLM is that it creates a mismatch between pre-training and fine-tuning, since the [MASK] token never appears during fine-tuning. To mitigate this, the "masked" words are not always replaced with the actual [MASK] token. The training data generator chooses 15% of the token positions at random for prediction; of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged.
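Here is a minimal sketch of that masking rule in plain Python. The token ids, vocabulary size, and the -100 "ignore" label (PyTorch's cross-entropy convention) are illustrative choices, not part of the original implementation.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Apply BERT-style masking: 15% of positions are selected for prediction;
    80% of those become [MASK], 10% a random token, 10% stay unchanged."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:     # position selected for prediction
            labels.append(tok)              # the model must recover the original
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                         # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)    # 10%: random token
            # remaining 10%: keep the original token
        else:
            labels.append(-100)             # ignored by the loss
    return inputs, labels
```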

Fine-tuning BERT

Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks by swapping out the appropriate inputs and outputs, whether they involve single text or text pairs.
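A hedged sketch of this workflow with the Hugging Face `transformers` library is shown below; the model name, example sentence pair, and labels are placeholders, and a real script would wrap this in a proper training loop.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)      # one extra output layer on top

# Single texts or text pairs: the same model handles both input formats.
batch = tokenizer(["a great movie"], ["loved every minute"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)     # all parameters stay trainable
outputs.loss.backward()                     # then step an optimizer (e.g. AdamW)
```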

Performance

BERT obtained a score of 80.5 on the General Language Understanding Evaluation (GLUE) benchmark leaderboard.

On the SQuAD leaderboard, the BERT model obtained an F1 score of 93.2. The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs.

ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)

ELECTRA replaces BERT's MLM with Replaced Token Detection (RTD), which turns out to be more efficient and produces better results. In BERT, some of the input tokens are replaced with [MASK] and a model is then trained to reconstruct the original tokens.

In ELECTRA, instead of masking the input, the approach corrupts it by replacing some input tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, a discriminative model is trained that predicts whether each token in the corrupted input was replaced by a generator sample or not. This new pre-training task is more efficient than MLM because the model learns from all input tokens rather than just the small subset that was masked out.
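To make the objective concrete, here is an illustrative (made-up) corrupted sentence and the per-token labels the discriminator is trained to predict:

```python
# Illustrative example only; the tokens and the replacement are made up.
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]  # "ate" sampled from the generator

# The discriminator predicts, for every token, whether it was replaced.
labels = [o != c for o, c in zip(original, corrupted)]
print(labels)   # [False, False, True, False, False]
```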

Pretraining task — Replaced Token Detection

This approach trains two neural networks, a generator G and a discriminator D. Each primarily consists of an encoder (e.g., a Transformer network) that maps a sequence of input tokens into a sequence of contextualized vector representations. The discriminator then predicts, for each token in the corrupted input, whether it is an original token from the data or a replacement sampled from the generator.

[Figure: ELECTRA model architecture, from the research paper]

The generator is trained to perform masked language modeling (MLM). MLM first selects a random set of positions (integers between 1 and n) to mask out in the input. The tokens at the selected positions are replaced with a [MASK] token, and the generator learns to maximize the likelihood of the masked-out tokens. The corrupted input is then formed by replacing the masked positions with tokens sampled from the generator, and the discriminator is trained to distinguish tokens in the original data from tokens that have been replaced by generator samples.
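The two models are trained jointly by adding the generator's MLM loss to a weighted per-token discriminator loss (the paper weights the discriminator term by λ = 50). Below is a rough PyTorch-style sketch of that combined objective, not the authors' code: `generator` and `discriminator` are assumed to be small Transformer encoders returning token-level vocabulary logits and token-level binary logits respectively, and `mask_tokens` is the masking helper sketched earlier.

```python
import torch
import torch.nn.functional as F

def electra_step(generator, discriminator, input_ids, mask_id, vocab_size,
                 disc_weight=50.0):
    # 1) Mask a random subset of positions; the generator predicts the originals.
    masked_ids, mlm_labels = mask_tokens(input_ids, mask_id, vocab_size)
    gen_logits = generator(torch.tensor([masked_ids]))        # assumed (1, seq, vocab)
    mlm_loss = F.cross_entropy(gen_logits[0], torch.tensor(mlm_labels),
                               ignore_index=-100)

    # 2) Build the corrupted input by sampling generator predictions at the
    #    masked positions; label each token as replaced (1) or original (0).
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits[0]).sample()
    corrupted = [s.item() if lbl != -100 else tok
                 for tok, lbl, s in zip(masked_ids, mlm_labels, sampled)]
    rtd_labels = torch.tensor([float(c != o) for c, o in zip(corrupted, input_ids)])

    # 3) Discriminator: a binary prediction for every position, not just masked ones.
    disc_logits = discriminator(torch.tensor([corrupted]))[0]  # assumed (seq,)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, rtd_labels)

    return mlm_loss + disc_weight * disc_loss
```

Note that if the generator happens to sample the correct original token, that position is labeled as original rather than replaced, which matches the paper's definition of the task.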

Performance

ELECTRA obtained a GLUE score of 85.

Points to note

MLM has not been completely eliminated in this approach, since the generator is still trained with a small MLM objective.

Although training the discriminator looks like a GAN setup, this method is not "adversarial": the generator producing the corrupted tokens is trained with maximum likelihood, owing to the difficulty of applying GANs to text.
