Word2vec

Introduction

Original Word2vec Paper: Efficient Estimation of Word Representations in Vector Space
Overview: https://myndbook.com/view/4900

Ideas: The idea is to design a model whose parameters are the word vectors. Then, train the model on a certain objective. At every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error. Thus, we learn our word vectors.

Word2vec is a software package that actually includes:

  • 2 algorithms: continuous bag-of-words (CBOW) and skip-gram. CBOW aims to predict a center word from the surrounding context in terms of word vectors. Skip-gram does the opposite, and predicts the distribution (probability) of context words from a center word.
  • 2 training methods: negative sampling and hierarchical softmax. Negative sampling defines an objective by sampling negative examples, while hierarchical softmax defines an objective using an efficient tree structure to compute probabilities for all the words in the vocabulary (see the configuration sketch after this list).
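
As a quick illustration of how these choices show up in practice, here is a minimal sketch using the gensim library (not part of the original word2vec C package). The toy corpus and hyperparameter values are made up, and gensim 4.x parameter names such as `sg`, `hs`, `negative`, and `vector_size` are assumed; the point is only how the two algorithms and two training methods are toggled.

```python
# A minimal sketch, assuming gensim 4.x; corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

toy_corpus = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["word", "vectors", "capture", "distributional", "similarity"],
]

# Skip-gram (sg=1) trained with negative sampling (hs=0, negative=5)
sg_ns = Word2Vec(toy_corpus, vector_size=50, window=2, sg=1, hs=0, negative=5, min_count=1)

# CBOW (sg=0) trained with hierarchical softmax (hs=1, negative=0)
cbow_hs = Word2Vec(toy_corpus, vector_size=50, window=2, sg=0, hs=1, negative=0, min_count=1)

print(sg_ns.wv["fox"].shape)  # (50,) -- the learned vector for "fox"
```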

A detailed introduction can be found in word2vec中的数学 (The Mathematics of word2vec) and word2vec Parameter Learning Explained. Here we only show the derivation from the course under a simplified scenario.

Objective Function

For each position \(t=1,...,T\), predict context words within a window of fixed size \(m\), given center word \(w_t\). Data likelihood: $$
\text{Likelihood}=L(\theta)=\prod_{t=1}^{T} \prod_{\substack{-m \leq j \leq m \\ j \neq 0}} P\left(w_{t+j} \mid w_{t} ; \theta\right) $$
The objective function \(J(\theta)\) is the (average) negative log likelihood: $$ J(\theta)=-\frac{1}{T} \log L(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P\left(w_{t+j} \mid w_{t} ; \theta\right) $$ $$ \text{Minimizing objective function} \Leftrightarrow \text{Maximizing predictive accuracy} $$
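
For concreteness, here is a tiny worked example (the three-word corpus is made up for illustration): with window size \(m=1\) and the corpus "I like NLP", the likelihood expands to $$ L(\theta) = P(\text{like} \mid \text{I}) \cdot P(\text{I} \mid \text{like})\, P(\text{NLP} \mid \text{like}) \cdot P(\text{like} \mid \text{NLP}) $$ and the objective \(J(\theta)\) is \(-\frac{1}{3}\) times the sum of the logs of these four factors.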


Two vectors are used for a word \(w\): 1. \(v_w\): when \(w\) is a center word. 2. \(u_w\): when \(w\) is a context word.
Then for a center word \(c\) and a context word \(o\): $$ P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} $$
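
A numerical sketch of this softmax may help; everything below (vocabulary size, embedding dimension, random vectors) is made up for illustration and is not a trained word2vec model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                       # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))       # u_w: context ("outside") vectors, one row per word
Vc = rng.normal(size=(V, d))      # v_w: center vectors, one row per word

def p_o_given_c(o, c):
    """P(o | c): softmax over u_w^T v_c for all w in the vocabulary, evaluated at w = o."""
    scores = U @ Vc[c]                        # u_w^T v_c for every word w
    scores -= scores.max()                    # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

probs = np.array([p_o_given_c(o, c=2) for o in range(V)])
print(probs, probs.sum())                     # a valid distribution over the vocabulary; sums to 1
```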


Take derivatives to work out the minimum: $$ \begin{aligned} \frac{\partial}{\partial v_c}\log{\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}} &= \frac{\partial}{\partial v_c}\log{\exp \left(u_{o}^{T} v_{c}\right)} - \frac{\partial}{\partial v_c}\log{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}\\ &= u_o - \sum_{x \in V}\frac{1}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}\exp \left(u_{x}^{T} v_{c}\right)u_x\\ &= u_o - \sum_{x \in V}p(x|c)u_x\\ &= \text{observed} - \text{expected} \end{aligned} $$ These are only the derivatives with respect to the center vector parameters; the derivatives with respect to the context (output) vector parameters are also needed, and they are derived similarly. With the derivatives with respect to all parameters, we can then minimize the objective with gradient descent.
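
To sanity-check the "observed minus expected" form, the sketch below compares the analytic gradient \(u_o - \sum_{x \in V} p(x \mid c)\,u_x\) with a finite-difference estimate of \(\frac{\partial}{\partial v_c} \log P(o \mid c)\), using the same kind of made-up toy vectors as above (all names and sizes are illustrative assumptions, not part of the course derivation).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4
U = rng.normal(size=(V, d))       # context vectors u_w
Vc = rng.normal(size=(V, d))      # center vectors v_w

def log_p(o, c, center_vecs):
    """log P(o | c) under the softmax, for a given matrix of center vectors."""
    scores = U @ center_vecs[c]
    m = scores.max()
    return scores[o] - (m + np.log(np.exp(scores - m).sum()))

def analytic_grad(o, c):
    """u_o - sum_x p(x|c) u_x, i.e. observed minus expected context vector."""
    scores = U @ Vc[c]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return U[o] - probs @ U

def numerical_grad(o, c, eps=1e-6):
    """Central finite differences of log P(o|c) with respect to v_c."""
    g = np.zeros(d)
    for i in range(d):
        Vp, Vm = Vc.copy(), Vc.copy()
        Vp[c, i] += eps
        Vm[c, i] -= eps
        g[i] = (log_p(o, c, Vp) - log_p(o, c, Vm)) / (2 * eps)
    return g

o, c = 1, 2
print(np.allclose(analytic_grad(o, c), numerical_grad(o, c), atol=1e-5))  # expected: True
```

The gradient of the full objective \(J\) simply flips the sign and averages these per-pair terms over all center/context pairs in the corpus.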