GloVe Research Paper Explained


In continuation of my word2vec research paper explained blog, I have taken up the GloVe research paper [Pennington et al.] to explain my understanding of this detailed and comprehensive paper.

GloVe stands for Global Vectors, where "global" refers to the global statistics of the corpus and "vectors" are the representations of words. Earlier word embedding methods such as LSA and word2vec were able to learn syntactic and semantic relationships between words, but the origin of these regularities was unclear. The GloVe model aims to explicitly induce such regularities in word vectors and to make the origin of these relationships between words explicit.

1. Overview of the GloVe Model

GloVe is a global log-bilinear model. Well, I am sure you want to know what the term "global log-bilinear" means. Global refers to the global statistics of the training corpus. Log-bilinear means that the logarithm of the model's output is a linear function of the product of two types of word vectors. In the case of GloVe, the two types are the word vector (w) and the context vector (w̃). The GloVe model combines two widely adopted approaches for training word vectors, explained below.

1.1 Matrix Factorization

These methods factorize a word-word or document-term co-occurrence matrix. HAL (Hyperspace Analogue to Language) [Lund & Burgess] and LSA (Latent Semantic Analysis) [Landauer, Foltz, & Laham] are popular models that use matrix factorization. Eigenvalue decomposition is used to factorize a square matrix and singular value decomposition (SVD) a rectangular one. In a word-word matrix, rows represent words and columns represent context words; an entry Mᵢⱼ counts how many times context word Wⱼ occurs in the context of word Wᵢ. In a document-term matrix, rows represent words and columns represent documents: a row gives the distribution of a particular word across all documents, and a column gives the distribution of all words within one document.

The matrix factorization methods usually involve the following steps; a small code sketch follows the list.

  1. Define the context for each word and generate the word-word or document-term matrix M.
  2. Normalize the values in the matrix, either by row, by column or by length.
  3. Remove columns with low variance to reduce the dimensionality of the matrix.
  4. Perform matrix factorization by SVD on M to generate the matrices U, S and V.
  5. Sort the singular values of M in descending order and keep the first r < R (where R is the rank of M) to obtain a low-rank approximation of M. The product of the truncated matrices Û, Ŝ and V̂ is a close approximation of the original matrix M.
  6. The vector for word Wⱼ is the jᵗʰ row of the reduced-rank matrix Û.
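To make these steps concrete, here is a minimal NumPy sketch (my own illustration, not code from any of the cited papers) that factorizes a toy word-word matrix with SVD and reads off low-dimensional word vectors. The toy counts and the rank r are assumptions for this sketch.

```python
# A minimal sketch: truncated SVD of a small word-word co-occurrence matrix M,
# keeping the top r singular values.
import numpy as np

# Toy co-occurrence counts: rows = words, columns = context words.
M = np.array([[0., 2., 1., 0.],
              [2., 0., 3., 1.],
              [1., 3., 0., 2.],
              [0., 1., 2., 0.]])

# Step 2: normalize each row so frequent words do not dominate.
M_norm = M / np.maximum(M.sum(axis=1, keepdims=True), 1e-12)

# Steps 4-5: full SVD, then keep the first r components.
U, S, Vt = np.linalg.svd(M_norm, full_matrices=False)
r = 2
U_r, S_r, Vt_r = U[:, :r], S[:r], Vt[:r, :]

# The low-rank reconstruction approximates the original matrix.
M_approx = U_r @ np.diag(S_r) @ Vt_r

# Step 6: the vector for word j is the j-th row of the reduced matrix
# (optionally scaled by the singular values).
word_vectors = U_r * S_r
print(word_vectors.shape)  # (4, 2)
```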

In such methods, the most common words in the corpus, such as "the" and "and", contribute disproportionately to word similarity. Two words Wᵢ and Wⱼ that frequently occur in the context of "the" and "and" receive an inflated similarity score relative to their true similarity, while words that rarely occur in those contexts but are truly related end up with a lower score. Methods such as COALS [T. Rohde et al.] overcome this problem by using correlation-based or entropy-based normalization of the matrix. With entropy-based normalization, words that occur in almost all contexts or documents, such as "the" and "and", have high entropy; dividing the raw counts by this high-entropy term shrinks the large values of such words in the matrix. This resolves the issue of these words adding a disproportionate contribution to word similarity scores.
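As a rough illustration of the entropy idea (a simplified sketch of my own, not the exact COALS or LSA recipe), rows belonging to words that are spread evenly over contexts have high entropy and can be shrunk as follows:

```python
# Rough sketch of entropy-based down-weighting: words that occur in almost
# every context have high entropy, and dividing their counts by that entropy
# shrinks them relative to more concentrated (informative) words.
import numpy as np

X = np.array([[50., 40., 45.],   # a "the"-like word: spread across contexts
              [10.,  0.,  1.]])  # a content word: concentrated

p = X / X.sum(axis=1, keepdims=True)            # per-word context distribution
entropy = -(p * np.log(p + 1e-12)).sum(axis=1)  # high for evenly spread words

X_weighted = X / (1.0 + entropy[:, None])       # shrink high-entropy rows
print(entropy)       # first row has noticeably higher entropy
print(X_weighted)
```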

1.2 Window based methods

These methods work on local context windows rather than the global statistics of the corpus. The language model in [Bengio et al.] trains a neural network which, in turn, learns word vector representations. This model takes the past N words and predicts the next word, following the objective of language modelling. Hence, each training example is a local window of words: the window is the input and the next word is the output. Later, the word2vec models [Mikolov et al.], skip-gram and CBOW, decoupled word vector training from language modelling and trained a single-hidden-layer neural network. For a more detailed understanding, you can go through my previous article on it.

Unlike matrix factorization methods, these methods do not take advantage of the available global statistics, which results in suboptimal vector representations of words. The same word-context pairs are scanned over and over across the corpus; aggregating this repetition once into global statistics could reduce training time and model complexity.

2. Detailed Math behind GloVe Model

Here comes the section that you have all been looking forward to. Understanding the math behind any algorithm is crucial: it helps us see how the algorithm is formulated and grasp the underlying concept. With that said, let's define some notation.

X is the co-occurrence matrix (word-word), where Xᵢⱼ is the count of word Wⱼ in the context of word Wᵢ. What does context mean for a particular word? A symmetric context comprises the past N and the future N words around it, while an asymmetric context comprises only the past N words. Either simple or weighted counts can be used to fill the matrix: a simple count adds 1 per occurrence, whereas a weighted count adds 1/d, where d is the distance from the given word. The motivation behind the weighting is that context words nearer to a given word provide more meaningful context than distant ones. A small sketch of this construction follows.
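Here is a small sketch of this construction (details such as the window size, the whitespace tokenization and the function name build_cooccurrence are my assumptions, not the official GloVe implementation):

```python
# Build a symmetric, distance-weighted word-word co-occurrence matrix X,
# where a context word at distance d contributes 1/d to the count.
from collections import defaultdict

def build_cooccurrence(tokens, window=2):
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    X = defaultdict(float)  # (i, j) -> weighted count X_ij
    for pos, word in enumerate(tokens):
        i = vocab[word]
        # symmetric context: `window` words to the left and right
        for offset in range(1, window + 1):
            for ctx_pos in (pos - offset, pos + offset):
                if 0 <= ctx_pos < len(tokens):
                    j = vocab[tokens[ctx_pos]]
                    X[(i, j)] += 1.0 / offset  # nearer context words count more
    return vocab, X

vocab, X = build_cooccurrence("ice is solid and steam is gas".split())
print(X[(vocab["ice"], vocab["solid"])])  # 0.5: "solid" is 2 positions away
```

A real implementation streams the corpus and keeps X in a sparse format, since only a small fraction of word pairs ever co-occur.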

Eqⁿ (1):  Xᵢ = Σₖ Xᵢₖ

In Eqⁿ (1), Xᵢ is a summation over all the words which occur in the context of word Wᵢ.

Eqⁿ (2):  Pᵢⱼ = P(j|i) = Xᵢⱼ / Xᵢ

In Eqⁿ (2), P is a co-occurrence probability where Pᵢⱼ is a probability of word Wⱼ occurring in the context of word Wᵢ.

GloVe suggests examining the relationship between words in terms of probabilities rather than raw counts. The relationship between two words (Wᵢ and Wⱼ) is examined by comparing their co-occurrence probabilities with various probe words (Wₖ).

Let's say we have two words, Wᵢ = 'Ice' and Wⱼ = 'Steam', and some probe words Wₖ: 'Solid', 'Gas', 'Water', 'Fashion'. From basic understanding, we know that 'Solid' is more related to 'Ice' (Wᵢ) and 'Gas' is more related to 'Steam' (Wⱼ), while 'Fashion' is related to neither and 'Water' is related to both. Our objective is to find, among the probe words, the ones that are relevant to each of the given words.

[Table 1: co-occurrence probabilities and their ratios for target words 'ice' and 'steam' with probe words 'solid', 'gas', 'water' and 'fashion']

From Table 1, for the probe word Wₖ = 'solid', P(solid|ice), the probability of 'solid' appearing in the context of 'ice', is 1.9 × 10⁻⁴, which is greater than P(solid|steam). The ratio P(k|ice) / P(k|steam) is therefore much greater than 1. For the probe word 'gas', the ratio is much smaller than 1, and for the probe words 'water' and 'fashion' the ratio is close to 1. The ratio of co-occurrence probabilities thus separates the words that are relevant to one of the target words (solid, gas) from words related to both (water) or to neither (fashion). Words whose ratio is close to 1 contribute little to distinguishing the target words. This suggests that ratios of co-occurrence probabilities are the right starting point for learning word representations.
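The following toy sketch shows how such ratios are computed from a co-occurrence matrix; the counts are made up for illustration and are not the numbers from Table 1.

```python
# Turn raw co-occurrence counts into probabilities P_ij = X_ij / X_i
# and compare ratios for a set of probe words.
import numpy as np

words = ["ice", "steam"]
probes = ["solid", "gas", "water", "fashion"]
# toy counts X[word][probe]; values are invented for illustration only
X = np.array([[30.,  2., 100.,  1.],    # ice
              [ 3., 25., 110.,  1.]])   # steam

P = X / X.sum(axis=1, keepdims=True)    # P(k|ice), P(k|steam)
ratio = P[0] / P[1]                     # P(k|ice) / P(k|steam)

for k, r in zip(probes, ratio):
    print(f"{k:8s} ratio = {r:5.2f}")
# 'solid' comes out >> 1, 'gas' << 1, 'water' and 'fashion' land near 1.
```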

The ratio of co-occurrence probability Pᵢₖ/Pⱼₖ depends on three words Wᵢ, Wⱼ, Wₖ. The most general form of function F can be defined on word and context vectors as below.

Eqⁿ (3):  F(wᵢ, wⱼ, w̃ₖ) = Pᵢₖ / Pⱼₖ

where w ∈ Rᵈ are word vectors and w̃ ∈ Rᵈ are context vectors. The right side of Eqⁿ (3) consists of probabilities obtained from the training corpus. Although there is a vast space of possible functions F, it should encode the information present in the ratio Pᵢₖ/Pⱼₖ. Since word vectors live in a linear vector space of d dimensions, the relationship between the target words Wᵢ and Wⱼ can be captured by their vector difference. Hence the equation becomes,

Eqⁿ (4):  F(wᵢ − wⱼ, w̃ₖ) = Pᵢₖ / Pⱼₖ

Now, the right side of the equation is a scalar while the inputs to F are d-dimensional vectors. F could be parameterized by a complex neural network, but that would obfuscate the linear structure we are trying to capture. To avoid this, we take the dot product of the arguments of F, which prevents F from mixing vector dimensions in undesirable ways.

Eqⁿ (5):  F((wᵢ − wⱼ)ᵀ w̃ₖ) = Pᵢₖ / Pⱼₖ

A word and a context word can be interchanged, since 'solid' occurring in the context of 'ice' is equivalent to 'ice' occurring in the context of 'solid'. Hence the replacements w ↔ w̃ and X ↔ Xᵀ should leave Eqⁿ (5) intact: changing the inputs to F on the left-hand side must have the corresponding effect on the right-hand side. This is the requirement that F preserve structural similarity between two groups, i.e. that F be a homomorphism. On the left-hand side of Eqⁿ (5) we have the group (R, −) with subtraction as the group operation, and on the right-hand side the group (R>0, ÷) with division as the group operation. To preserve this structure under the replacements w ↔ w̃ and X ↔ Xᵀ, a subtraction of two dot products on the left should correspond to a division of the associated values on the right. Let x = wᵢᵀw̃ₖ and y = wⱼᵀw̃ₖ. F should be a homomorphism between the groups (R, −) and (R>0, ÷); by the definition of a homomorphism, F(x − y) = F(x) ÷ F(y).

Eqⁿ (6):  F((wᵢ − wⱼ)ᵀ w̃ₖ) = F(wᵢᵀ w̃ₖ) / F(wⱼᵀ w̃ₖ)

So what should the function F be? Well, you guessed it right: the exponential function, F(x) = eˣ, since e^(x−y) = eˣ / eʸ, which is exactly the required homomorphism property. Comparing Eqⁿ (5) and Eqⁿ (6), we get F(wᵢᵀ w̃ₖ) = Pᵢₖ = Xᵢₖ / Xᵢ. Taking the logarithm on both sides, we get,

Eqⁿ (7):  wᵢᵀ w̃ₖ = log(Pᵢₖ) = log(Xᵢₖ) − log(Xᵢ)

Eqⁿ (7) would be symmetric in word and context if not for the term log(Xᵢ). Since this term is independent of Wₖ, it can be absorbed into a bias term bᵢ for wᵢ. Adding a corresponding bias term b̃ₖ for w̃ₖ restores the symmetry, and we get

Eqⁿ (8):  wᵢᵀ w̃ₖ + bᵢ + b̃ₖ = log(Xᵢₖ)

Since log(x) diverges as x approaches 0, we apply an additive shift inside the logarithm, log(1 + Xᵢₖ), so that its argument is always ≥ 1,

Eqⁿ (9):  wᵢᵀ w̃ₖ + bᵢ + b̃ₖ = log(1 + Xᵢₖ)

Eqⁿ (9) is analogous to factorizing the logarithm of the co-occurrence matrix, which is the main idea behind the LSA algorithm. But there is a problem with the above equation: if we define the cost using a plain squared-error function, it gives the same weight to every entry of the matrix, even to rare co-occurrences whose counts are close to 0. Such entries are noisy and carry little information. Hence a weighted least-squares cost is used for the GloVe model.

Eqⁿ (10):  J = Σᵢ,ⱼ f(Xᵢⱼ) (wᵢᵀ w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ)²,  where the sum runs over i, j = 1 … V

Note that the additive shift is assumed to have already been applied to the matrix X in Eqⁿ (10). J is the cost function we want to minimize over the co-occurrence matrix with vocabulary size V. f(Xᵢⱼ) assigns a weight to each entry of the matrix; wᵢ is a word vector and w̃ⱼ is a context vector; bᵢ and b̃ⱼ are their bias terms; and Xᵢⱼ is an entry of the co-occurrence matrix.

Now, the question is how to choose the function f. It should satisfy certain conditions: f(0) = 0, f should be non-decreasing, and f(x) should not grow too large for large values of x, so that frequent words are not over-weighted. Xₘ in Eqⁿ (11) is the same as Xₘₐₓ in Figure 1. Substituting Eqⁿ (11) into Eqⁿ (10) for f(Xᵢⱼ) gives the final cost function of the GloVe model. The model is then trained on batches of training examples with an optimizer that minimizes this cost, producing a word vector and a context vector for each word; a small code sketch of f and the cost follows Figure 1.

Eqⁿ (11):  f(x) = (x / xₘₐₓ)^α  if x < xₘₐₓ;  f(x) = 1 otherwise

[Figure 1: the weighting function f(x) with α = 3/4]
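Below is a minimal NumPy sketch of the weighting function of Eqⁿ (11) and the cost of Eqⁿ (10) (my own illustration, not the official GloVe code); names such as W_ctx, b_ctx and the toy dimensions are assumptions for this sketch.

```python
# Weighting function f (Eq. 11) and weighted least-squares cost J (Eq. 10),
# evaluated on a dense toy co-occurrence matrix.
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    # f(0) = 0, non-decreasing, capped at 1 so frequent pairs are not over-weighted
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_ctx, b, b_ctx):
    """X: co-occurrence matrix (V x V); W, W_ctx: (V x d); b, b_ctx: (V,)."""
    mask = X > 0                                  # sum only over non-zero entries
    log_X = np.log(np.where(mask, X, 1.0))        # safe log; zeros are masked out
    pred = W @ W_ctx.T + b[:, None] + b_ctx[None, :]
    return np.sum(f(X) * mask * (pred - log_X) ** 2)

V, d = 5, 3
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(V, V)).astype(float)
W, W_ctx = rng.normal(size=(V, d)) * 0.1, rng.normal(size=(V, d)) * 0.1
b, b_ctx = np.zeros(V), np.zeros(V)
print(glove_loss(X, W, W_ctx, b, b_ctx))
```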

Finally, we have completed the full derivation of the GloVe model. Kudos to all of us for making it this far! If you want to see the relationship between GloVe and skip-gram (the word2vec model) and understand the overall complexity of the GloVe model, you can go through the next sections. After that, the experiments and results of the GloVe model are discussed.

3. Equivalence between GloVe and Skip-Gram (word2vec model)

The GloVe model is based on the global co-occurrence matrix, while skip-gram scans through local windows and does not take global statistics into account. These two approaches can be thought of as two different schools of thought for training word vectors. This section works out the similarity between GloVe and skip-gram, even though at first the two models appear to have very different interpretations.

Let's define the skip-gram model, where Qᵢⱼ is the probability that word j appears in the context of word i. Qᵢⱼ is a probability distribution over context words given Wᵢ and is modelled as a softmax:

Eqⁿ (12):  Qᵢⱼ = exp(wᵢᵀ w̃ⱼ) / Σₖ exp(wᵢᵀ w̃ₖ)

Here, wᵢ and w̃ⱼ are the word and context vectors, respectively, of the skip-gram model. The objective of skip-gram is to maximize the log probability over all local window scans of the training corpus. Training happens stochastically (online) on examples generated by local windows, but a global objective function can be formulated over all local scans of the corpus as

Eqⁿ (13):  J = −Σᵢ∈corpus Σⱼ∈context(i) log(Qᵢⱼ)

To make the cost function positive, a negative sign is added: Qᵢⱼ lies between 0 and 1, so its logarithm is negative. The only difference between the cost function above and the original skip-gram objective is that the former is written globally while the latter is applied locally. Since the corpus contains many repeats of the same word-context pair, we can group identical pairs together, count them as Xᵢⱼ, and multiply that count directly with the log(Qᵢⱼ) term.

Eqⁿ (14):  J = −Σᵢ Σⱼ Xᵢⱼ log(Qᵢⱼ)

We know Pᵢⱼ = Xᵢⱼ / Xᵢ. Substituting Xᵢⱼ = Xᵢ Pᵢⱼ, we get

Eqⁿ (15):  J = −Σᵢ Xᵢ Σⱼ Pᵢⱼ log(Qᵢⱼ) = Σᵢ Xᵢ H(Pᵢ, Qᵢ)

Xᵢ is independent of j, so it can be taken outside the summation over j. H(Pᵢ, Qᵢ) is the cross entropy between the probability distributions Pᵢ and Qᵢ (the negative sign is absorbed into the cross-entropy formula). The cost function thus becomes a weighted sum of cross-entropy errors with weights Xᵢ, and can be interpreted as a global version of the skip-gram objective. There are, however, limitations to using cross entropy as the error measure. P is a long-tailed distribution, and under cross entropy too much weight is given to unlikely or rare events. Also, Q has to be normalized by a sum over the vocabulary V, which is a huge computational bottleneck. Hence a different error metric can be used in place of the cross entropy in Eqⁿ (15), one option being a least-squares objective. The normalization of Q can be dropped by keeping only its numerator, Q̂. The new cost function is defined as,

Eqⁿ (16):  Ĵ = Σᵢ,ⱼ Xᵢ (P̂ᵢⱼ − Q̂ᵢⱼ)²

Here, P̂ᵢⱼ = Xᵢⱼ and Q̂ᵢⱼ = exp(wᵢᵀ w̃ⱼ) are unnormalized distributions. Xᵢⱼ can take very large values compared to Q̂ᵢⱼ, causing large gradients and large optimization steps, which makes learning unstable. Hence the squared error of the logarithms is used instead.

Eqⁿ (17):  Ĵ = Σᵢ,ⱼ Xᵢ (log P̂ᵢⱼ − log Q̂ᵢⱼ)² = Σᵢ,ⱼ Xᵢ (wᵢᵀ w̃ⱼ − log Xᵢⱼ)²

Finally, instead of Xᵢ, the weighting function f(Xᵢⱼ) from Figure 1 is used. The terms inside the square can be swapped without changing the value, since the expression is squared. The cost function in Eqⁿ (17) is then equivalent to the GloVe cost function in Eqⁿ (10), which shows that the skip-gram model is also ultimately based on the co-occurrence matrix of the corpus and is closely related to the GloVe model.

4. The complexity of the GloVe model

The complexity of the GloVe model can be read off the cost function in Eqⁿ (10): it depends on the number of non-zero elements in the co-occurrence matrix. Since the summation runs over i and j up to the vocabulary size V, the complexity is no worse than O(|V|²). However, for a corpus with a few hundred thousand distinct words, |V|² runs into the hundreds of billions, so a tighter bound on the number of non-zero entries of X is needed to characterize the complexity accurately. An entry of the co-occurrence matrix Xᵢⱼ can be modelled as a power-law function of the frequency rank rᵢⱼ of the word-context pair (i, j):

Eqⁿ (18):  Xᵢⱼ = k / (rᵢⱼ)^α

The total number of words in the corpus, |C|, is proportional to the sum of all entries of the co-occurrence matrix X.

Eqⁿ (19):  |C| ∝ Σᵢ,ⱼ Xᵢⱼ = Σᵣ₌₁…|X| k/r^α = k H₍|X|,α₎

Here the sum runs over frequency ranks: ranking all word-context pairs of X together, the maximum rank equals the number of non-zero elements of the matrix, denoted |X|. H₍|X|,α₎ is the generalized harmonic number with |X| terms and exponent α. The maximum value of the frequency rank r can be obtained from Eqⁿ (18) by setting Xᵢⱼ to its minimum value of 1, which gives |X| = k^(1/α). Using this in Eqⁿ (19), we get

Eqⁿ (20):  |C| ∝ |X|^α H₍|X|,α₎

Expanding the harmonic number on the right side using the asymptotic expansion of generalized harmonic numbers (Apostol, 1976), we get,

Eqⁿ (21):  |C| ∝ |X| / (1 − α) + ζ(α) |X|^α + O(1)

where ζ(α) is the Riemann zeta function. When |X| is large, only one of the two terms on the right dominates, depending on whether α > 1 or α < 1. Hence we finally arrive at,

Eqⁿ (22):  |X| = O(|C|)  if α < 1;   |X| = O(|C|^(1/α))  if α > 1

The authors of GloVe observed that Xᵢⱼ is well modelled with α = 1.25. Since α > 1, this gives |X| = O(|C|^0.8). Hence the overall complexity of the model is much better than O(|V|²) and somewhat better than that of the on-line window-based (skip-gram) methods, which scale as O(|C|).
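As a quick worked check of the α > 1 case (my own restatement of the step from Eqⁿ (20) to Eqⁿ (22), not an extra result from the paper):

```latex
% For alpha > 1 the generalized harmonic number converges, H_{|X|,alpha} -> zeta(alpha),
% so Eq. (20) reduces to a single power of |X|:
\[
|C| \;\propto\; |X|^{\alpha}\, H_{|X|,\alpha} \;\approx\; \zeta(\alpha)\,|X|^{\alpha}
\quad\Longrightarrow\quad
|X| \;=\; O\!\bigl(|C|^{1/\alpha}\bigr) \;=\; O\!\bigl(|C|^{1/1.25}\bigr) \;=\; O\!\bigl(|C|^{0.8}\bigr).
\]
```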

5. Experiments

The trained GloVe word vectors are tested on the three NLP tasks below.

i. Word Analogy

The task is to find the word d that answers questions of the form "a is to b as c is to ?". The dataset contains both syntactic and semantic questions. To find word d, the word whose vector Wᵈ is closest to (Wᵇ − Wᵃ + Wᶜ) by cosine similarity is predicted as the answer, as sketched below.
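A minimal sketch of this analogy query (an assumed setup of my own, not the paper's evaluation code; `vectors` and `vocab` are hypothetical inputs):

```python
# Answer "a is to b as c is to ?" by returning the word whose vector is
# closest to w_b - w_a + w_c under cosine similarity, excluding a, b and c.
import numpy as np

def analogy(a, b, c, vectors, vocab):
    """vectors: (V, d) array of word vectors; vocab: word -> row index."""
    target = vectors[vocab[b]] - vectors[vocab[a]] + vectors[vocab[c]]
    target /= np.linalg.norm(target)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ target                       # cosine similarity to every word
    for w in (a, b, c):                          # the question words are not valid answers
        sims[vocab[w]] = -np.inf
    inv_vocab = {i: w for w, i in vocab.items()}
    return inv_vocab[int(np.argmax(sims))]

# usage: analogy("man", "king", "woman", vectors, vocab) should return "queen"
# if the embeddings capture the relation well.
```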

ii. Word Similarity

The task is to rank words by similarity to a given word in decreasing order. Spearman's rank correlation between the model's ranking and human judgements is used to measure performance, as sketched below.
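A minimal sketch of this evaluation using SciPy's spearmanr (an assumed setup, not the paper's evaluation code; `pairs`, `human_scores`, `vectors` and `vocab` are hypothetical inputs):

```python
# Score each word pair by cosine similarity, then compare the resulting ranking
# against human similarity judgements with Spearman's rank correlation.
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(pairs, human_scores, vectors, vocab):
    """pairs: list of (word1, word2); human_scores: human similarity ratings."""
    model_scores = []
    for w1, w2 in pairs:
        v1, v2 = vectors[vocab[w1]], vectors[vocab[w2]]
        cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
        model_scores.append(cos)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```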

iii. Named Entity Recognition (NER)

The task is to assign an entity type to each token in the corpus. The CoNLL-2003 English benchmark dataset has four entity types: person, location, organization, and miscellaneous. Along with discrete token features, word vectors trained by the GloVe model are added as inputs to the NER model, which produces a probability distribution over entity types.

6. GloVe Model Training Details

The GloVe model is trained on five different corpora: a 2010 Wikipedia dump with 1 billion tokens, a 2014 Wikipedia dump with 1.6 billion tokens, Gigaword 5 with 4.3 billion tokens, the combination of Gigaword 5 + Wikipedia 2014 with 6 billion tokens, and 42 billion tokens of web data from Common Crawl.

Preprocessing steps: the corpus text is lowercased and tokenized with the Stanford tokenizer. The co-occurrence matrix is constructed using a vocabulary of the 400,000 most frequent words. Before constructing the matrix, the context for a word has to be defined; a decreasing weighting scheme is employed, in which words d tokens apart contribute 1/d to the count. GloVe explores the effect of asymmetric context, using only tokens to the left of the word (history), as well as symmetric context, using tokens on both sides (history and future).

The following values are set for GloVe model training.

Xₘₐₓ = 100 and α = 3/4 for the weighting function shown in Figure 1.

Optimizer: AdaGrad (Duchi et al., 2011) with an initial learning rate of 0.05 is used on batches of randomly sampled non-zero entries of X. For vectors with fewer than 300 dimensions, 50 iterations over the data are used; otherwise 100 iterations are used. A simplified sketch of such a training loop follows.
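A highly simplified sketch of the training loop (my own illustration; the official GloVe implementation is considerably more elaborate, and the per-entry update shown here is an assumption rather than the exact reference code):

```python
# One AdaGrad update per non-zero entry (i, j, X_ij) of the co-occurrence matrix,
# minimizing the weighted least-squares cost of Eq. (10).
import numpy as np

def train(entries, V, d=50, x_max=100.0, alpha=0.75, lr=0.05, epochs=5, seed=0):
    """entries: list of (i, j, x_ij) triples for the non-zero cells of X."""
    rng = np.random.default_rng(seed)
    W, Wc = rng.normal(scale=0.1, size=(V, d)), rng.normal(scale=0.1, size=(V, d))
    b, bc = np.zeros(V), np.zeros(V)
    # per-parameter AdaGrad accumulators (initialized to 1 to avoid division by zero)
    gW, gWc, gb, gbc = np.ones((V, d)), np.ones((V, d)), np.ones(V), np.ones(V)

    for _ in range(epochs):
        rng.shuffle(entries)
        for i, j, x in entries:
            weight = min((x / x_max) ** alpha, 1.0)           # f(X_ij), Eq. (11)
            diff = W[i] @ Wc[j] + b[i] + bc[j] - np.log(x)    # inner term of Eq. (10)
            grad = 2.0 * weight * diff
            dWi, dWj = grad * Wc[j], grad * W[i]
            # AdaGrad: scale each step by the root of the accumulated squared gradients
            W[i]  -= lr * dWi / np.sqrt(gW[i]);   gW[i]  += dWi ** 2
            Wc[j] -= lr * dWj / np.sqrt(gWc[j]);  gWc[j] += dWj ** 2
            b[i]  -= lr * grad / np.sqrt(gb[i]);  gb[i]  += grad ** 2
            bc[j] -= lr * grad / np.sqrt(gbc[j]); gbc[j] += grad ** 2
    return W + Wc   # sum of word and context vectors, used as the final embedding
```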

The GloVe model produces two vectors for each token: W (the word vector) and W̃ (the context vector). Their sum, W + W̃, is used as the final vector.

7. Results

i. Word Analogy

Results of the word analogy experiment for various models are shown in Table 2 (SG stands for skip-gram, CBOW for continuous bag of words, and SVD for singular value decomposition). Percentage accuracy is reported separately for semantic and syntactic questions, along with the total. The GloVe model outperforms both word2vec models: trained on the 1.6B-token corpus with 300-dimensional vectors, it achieves nearly 70% accuracy, surpassing skip-gram's 61% and CBOW's 36%. The word2vec models improve considerably when the vector dimension is increased from 300 to 1000 and the corpus from 1B to 6B tokens, but they still underperform the GloVe model with 300-dimensional vectors trained on the 1.6B corpus. This shows that 300-dimensional GloVe vectors trained on a smaller dataset carry more meaningful information than 1000-dimensional word2vec vectors trained on a much larger one. The GloVe model's accuracy increases only by a small margin (to 75%) even with a 7-fold increase in corpus size (from 6B to 42B tokens).

[Table 2: results (% accuracy) on the word analogy task, broken down into semantic, syntactic and total, across models]

The accuracy of the GloVe model with respect to symmetric and asymmetric context, window size and vector dimension is shown in Figure 2; all models are trained on the 6B-token corpus. In (a), the context is symmetric with a window size of 10 and the vector dimension is varied: syntactic, semantic, and hence overall accuracy increase steadily as the dimension grows from 50 to 300, after which the gains become negligible relative to the added dimensions. In (b), the vector dimension is fixed at 100 and the symmetric window size is varied: as the window grows, all accuracies increase, with semantic accuracy rising more steeply than syntactic, which suggests that a wider context is needed for semantics while a short context suffices for syntactic questions. In (c), the vector dimension is again 100 and the asymmetric window size is varied, showing behaviour similar to (b). Semantic accuracy surpasses syntactic accuracy at a symmetric window size of about 3, while for asymmetric windows it takes a window size of about 5.

[Figure 2: accuracy on the analogy task as a function of (a) vector dimension, (b) symmetric window size, and (c) asymmetric window size]

Figure 3 shows the GloVe model's accuracy on the different corpora, with the vector dimension fixed at 300 across all training corpora. Syntactic accuracy increases steadily with corpus size, while semantic accuracy does not show such a pattern; hence there is no large improvement in overall accuracy from increasing the corpus size from 1B to 6B and 42B tokens.

[Figure 3: accuracy on the analogy task for 300-dimensional vectors trained on different corpora]

ii. Word Similarity - Various datasets (WordSim-353, MC, RG, SCWS and RW) are used to evaluate the 300-dimensional GloVe word vectors alongside other models. A similarity score is obtained by first normalizing each vector dimension across the vocabulary and then computing cosine similarity on the normalized vectors to find the top n words most similar to a given word. Spearman's rank correlation coefficient is computed between the model's similarity ranking and human judgements.


The results of the various models on the word similarity datasets are shown in Table 3. On the WS353, MC, RG and RW datasets, the GloVe model outperforms SVD, CBOW and SG trained on the same corpus size. On SCWS, the GloVe model trained on the 42B-token corpus achieves a higher Spearman rank correlation than the CBOW model trained on a 100B-token corpus.

[Table 3: Spearman rank correlation on word similarity tasks across models and datasets]

iii. NER - Model performance on the NER task across different datasets (the validation and test sets of CoNLL-2003, plus ACE and MUC7) is shown in Table 4.

[Table 4: F1 score on the NER task with discrete features and different word vectors]

All of the above models are CRF-based with different feature sets. The Discrete model uses the feature set from the Stanford NER model, while the other models use the basic feature set together with their trained word vectors as additional features. F1 score on the CoNLL-2003 validation and test sets, along with the test sets of the other datasets, is used to compare model performance. The GloVe model outperforms all the other models (Discrete, SVD and both word2vec models) on the NER task across these datasets.

8. Conclusion

The GloVe paper discusses two classes of algorithms for learning distributional word representations: count-based and prediction-based. It shows that these two classes are not fundamentally different and that both ultimately rely on the co-occurrence statistics of the corpus to train word vectors. GloVe captures both the global statistics and the linear substructure present in the data; hence this global log-bilinear model outperforms both classes of models on a variety of downstream NLP tasks.

If you want to learn more about how to implement the GloVe model in Python, let me know in the comments.

9. Resources

[1] Original research paper- GloVe: Global Vectors for Word Representation: https://nlp.stanford.edu/pubs/glove.pdf
[2] Producing high-dimensional semantic spaces from lexical co-occurrence : https://link.springer.com/article/10.3758/BF03204766
[3] An Introduction to Latent Semantic Analysis research paper: https://mainline.brynmawr.edu/Courses/cs380/fall2006/intro_to_LSA.pdf
[4] Efficient Estimation of Word Representations in Vector Space: https://arxiv.org/pdf/1301.3781.pdf

Thanks for taking the time to read the post. I hope you enjoyed it. Please let me know your thoughts in the comments. Feel free to reach out to me on LinkedIn and Gmail.
