Language Models
1. Language Modeling
The probability of a sequence of \(m\) words \(\{w_1, \ldots, w_m\}\) is denoted \(P(w_1, \ldots, w_m)\). In practice, \(P(w_1, \ldots, w_m)\) is usually conditioned on a window of \(n\) previous words rather than on all previous words: $$ P\left(w_{1}, \ldots, w_{m}\right)=\prod_{i=1}^{m} P\left(w_{i} \mid w_{1}, \ldots, w_{i-1}\right) \approx \prod_{i=1}^{m} P\left(w_{i} \mid w_{i-n}, \ldots, w_{i-1}\right) $$
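As a concrete illustration of the chain-rule factorization and the windowed approximation, the sketch below scores a sequence by conditioning each word only on its last \(n\) predecessors. The `cond_prob(word, context)` callable is a hypothetical stand-in for whatever model supplies \(P(w_i \mid \text{context})\).

```python
import math

def sequence_log_prob(words, cond_prob, n):
    """Approximate log P(w_1, ..., w_m) with an n-word window.

    `cond_prob(word, context)` is a hypothetical callable returning
    P(word | context), where `context` is a tuple of preceding words.
    """
    log_p = 0.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - n):i])  # last n words (fewer at the start)
        log_p += math.log(cond_prob(word, context))
    return log_p
```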
2. n-gram Language Models
Idea: Collect statistics about how frequent different n-grams are, and use these counts to predict the next word, as in the sketch below.
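A minimal sketch of this counting idea, assuming already-tokenized text and pure relative-frequency estimates (no smoothing); the function names are illustrative:

```python
from collections import Counter

def train_ngram_lm(corpus_tokens, n=2):
    """Estimate P(next word | previous n words) from raw counts.

    Probabilities are relative frequencies:
    count(w_{i-n}, ..., w_i) / count(w_{i-n}, ..., w_{i-1}).
    """
    context_counts = Counter()
    ngram_counts = Counter()
    for i in range(n, len(corpus_tokens)):
        context = tuple(corpus_tokens[i - n:i])
        context_counts[context] += 1
        ngram_counts[context + (corpus_tokens[i],)] += 1

    def prob(word, context):
        context = tuple(context)
        if context_counts[context] == 0:
            return 0.0  # unseen context: the sparsity problem below
        return ngram_counts[context + (word,)] / context_counts[context]

    return prob

# Example usage:
# tokens = "the students opened their books the students opened their minds".split()
# prob = train_ngram_lm(tokens, n=2)
# prob("their", ("students", "opened"))  # -> 1.0
```

Note that the counts themselves expose the two problems named in the next subsections: most long contexts are never observed (sparsity), and the number of distinct n-grams to store grows with both \(n\) and the corpus size (storage).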
2.1 Sparsity Problems
2.2 Storage Problems
3. A Fixed-Window Neural LM
| Improvements over n-gram LM | Remaining Problems |
|---|---|
| No sparsity problem | Fixed window is too small |
| Don't need to store all observed n-grams | No symmetry in how the inputs are processed |
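Below is a sketch of such a fixed-window neural LM in PyTorch; the layer sizes and the `FixedWindowLM` name are illustrative assumptions, not a specific published architecture.

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Fixed-window neural LM sketch: embed the previous `window` words,
    concatenate the embeddings, pass them through one hidden layer, and
    predict a distribution over the vocabulary via a softmax over logits.
    (Sizes below are illustrative assumptions.)
    """

    def __init__(self, vocab_size, window=4, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, window) integer word indices
        e = self.embed(context_ids)        # (batch, window, embed_dim)
        x = e.flatten(start_dim=1)         # concatenate the window's embeddings
        h = torch.tanh(self.hidden(x))     # (batch, hidden_dim)
        return self.output(h)              # logits over the vocabulary

# model = FixedWindowLM(vocab_size=10000)
# logits = model(torch.randint(0, 10000, (8, 4)))  # batch of 8 contexts of 4 words
```

Because the hidden layer multiplies each window position's embedding by its own block of weights, the same word is processed differently depending on where it appears in the window; this is the "no symmetry" problem listed in the table above.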