A language model (LM) focuses on estimating the conditional probability

P(wm | w1, w2, …, wm-1)

for an m-gram LM. This can be estimated by maximum likelihood as

P(wm | w1, w2, …, wm-1) = c(w1, w2, …, wm) / c(w1, w2, …, wm-1),

where c(w1, w2, …, wm) denotes the count of the m-gram (w1, w2, …, wm). This is a good estimate when the count is large enough. However, it becomes a problem if the m-gram is never seen in the training data (corpus). In this case, the maximum likelihood estimate sets the probability to 0, which is often too strict to be true, as language is quite versatile.
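To make the maximum likelihood estimate concrete, here is a minimal sketch in Python (the function name, the toy corpus, and the tuple-based counting are illustrative assumptions, not from the text):

```python
from collections import Counter

def mle_mgram_prob(tokens, context, word):
    """Maximum likelihood estimate of P(word | context) from a token list.

    context is a tuple of m-1 words; the m-gram is context + (word,).
    Returns 0.0 when the m-gram (or its context) is unseen.
    """
    m = len(context) + 1
    # Count every m-gram and every (m-1)-gram in the token sequence.
    mgrams = Counter(tuple(tokens[i:i + m]) for i in range(len(tokens) - m + 1))
    contexts = Counter(tuple(tokens[i:i + m - 1]) for i in range(len(tokens) - m + 2))
    num = mgrams[context + (word,)]
    den = contexts[context]
    return num / den if den else 0.0

# Toy corpus: the trigram ("i", "am", "here") occurs once,
# while the context ("i", "am") occurs twice.
tokens = "i am here because i am happy".split()
print(mle_mgram_prob(tokens, ("i", "am"), "here"))  # 0.5
```

Note that any trigram absent from the corpus, such as ("i", "am", "sad"), gets probability exactly 0 under this estimate, which is the problem discussed above.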

To overcome this issue, a straightforward idea is: if an m-gram does not exist, why not back off and use the (m-1)-gram instead, and recursively so on and so forth until ending at the unigram? This is the basic idea of Katz backoff, but there is one issue here. Take a trigram LM for example. Let’s say that the particular trigram (“I”, “am”, “I”) never appears in our training data, which indicates that the probability of “I” following “I am” is very small; this agrees with our intuition. However, if we back off to the bigram, the probability of “I” following “am” becomes much larger than it should be. In general, since the backoff eventually ends at the unigram, even though the maximum likelihood estimate of the trigram probability is indeed 0, backing off guarantees a result greater than 0.

On the one hand, a naïve backoff makes the total probability no longer add up to 1. On the other hand, the fact that an m-gram does not occur means it is unlikely, and we should respect this information instead of just tossing it away.
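The naïve backoff described above can be sketched as follows (the counts dictionary, its tuple keys, and the toy numbers are illustrative assumptions; counts[()] is assumed to hold the corpus size N so the unigram case divides by the right total):

```python
def naive_backoff_prob(counts, context, word):
    """Naive backoff: if the full n-gram is unseen, drop the oldest
    context word and try again, ending at the unigram."""
    while True:
        if counts.get(context + (word,), 0) > 0:
            return counts[context + (word,)] / counts[context]
        if not context:
            return 0.0  # never seen even as a unigram
        context = context[1:]  # back off to the shorter context

# Toy counts in which the trigram ("i", "am", "i") is unseen but the
# bigram ("am", "i") is not, so backoff yields 0.5 instead of 0.
counts = {
    (): 5,
    ("i",): 2, ("am",): 2, ("sam",): 1,
    ("i", "am"): 2, ("am", "i"): 1, ("am", "sam"): 1,
    ("i", "am", "sam"): 1, ("am", "i", "am"): 1,
}
print(naive_backoff_prob(counts, ("i", "am"), "i"))  # 0.5
```

This illustrates both problems: the unseen trigram jumps from 0 to 0.5, and nothing renormalizes the resulting distribution to sum to 1.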

Good-Turing provides a method to estimate the probability of something that is never seen in the training data: the probability of something never seen is estimated from the probability of things that appear only once in the training data. This makes sense, as the probability of something that appears only once provides an upper bound for something that is never seen. Good-Turing then adjusts the probabilities of things that appear once, twice, three times, …, so that everything adds up to 1.

Imagine a thought experiment. There is a box of balls of different colors. In order to know the distribution of colors in the box, we randomly pick a ball from the box, put it back, and repeat this N times. The colors we observe are denoted l1, l2, …, lk, and the number of times we observe each color is c(l1), c(l2), …, c(lk), with c(l1) + c(l2) + … + c(lk) = N. However, there is a chance that we never pick a ball of some rare color. As introduced above, the Good-Turing method estimates the probability of a ball with an unseen color as the number of balls whose color is observed only once, divided by N. Adjusting the probability of a seen color is more involved. First you need to build a frequency of frequencies from c(l1), c(l2), …, c(lk). Denote this as n(r), meaning that n(r) of the counts c(li) equal r. The smoothed probability for color li is

P(li) = c*(li) / N,

where the adjusted count is

c*(li) = (c(li) + 1) · n(c(li) + 1) / n(c(li)).
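The adjusted-count formula can be sketched in Python (the function name and the toy color draws are illustrative assumptions, not from the text):

```python
from collections import Counter

def good_turing(observations):
    """Good-Turing smoothing over an observation sequence.

    Returns (adjusted_counts, p_unseen): the adjusted count
    c*(li) = (c(li) + 1) * n(c(li) + 1) / n(c(li)) for each seen item,
    and the estimated total probability n(1)/N of unseen items.
    """
    c = Counter(observations)   # raw counts c(li)
    n = Counter(c.values())     # frequency of frequencies n(r)
    N = sum(c.values())
    adjusted = {item: (r + 1) * n[r + 1] / n[r] for item, r in c.items()}
    return adjusted, n[1] / N

# Illustrative draws: red is seen 3 times, blue twice, green once.
adjusted, p_unseen = good_turing(["red", "red", "blue", "green", "blue", "red"])
```

Here adjusted maps green (count 1) to 2·n(2)/n(1) = 2 and blue (count 2) to 3·n(3)/n(2) = 3, while p_unseen = n(1)/N = 1/6; note that red, the most frequent color, gets adjusted count 0 because n(4) = 0.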
Let’s walk through an example. Suppose the sequence of colors we observe is

We can then count the number of times each color is observed as

Then get the count of counts as

Finally, we can calculate the smoothed counts as

Notice that the smoothed count for the black color becomes 0: black is the most frequently observed color, so n(c(black) + 1) = 0. This is undesirable, and in the Katz method, smoothing is only done for items observed 1 to k times. For a frequently observed item (count > k), the probability is set equal to the maximum likelihood estimate.
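The thresholded rule can be sketched as follows (the function name, the choice k = 3, and the toy draws are illustrative assumptions; the full Katz method also renormalizes the probabilities with a backoff weight, which this sketch omits):

```python
from collections import Counter

def katz_smoothed_probs(observations, k=5):
    """Good-Turing-discount items seen 1..k times; items seen more
    than k times keep their maximum likelihood probability.
    Sketch only: the renormalization of full Katz is omitted.
    """
    c = Counter(observations)   # raw counts
    n = Counter(c.values())     # frequency of frequencies
    N = sum(c.values())
    probs = {}
    for item, r in c.items():
        if r > k:
            probs[item] = r / N                            # maximum likelihood
        else:
            probs[item] = (r + 1) * n[r + 1] / (n[r] * N)  # Good-Turing
    return probs

# Illustrative draws: black is frequent (count 6), the rest are rare.
draws = ["a", "b"] + ["c"] * 2 + ["d"] * 2 + ["e"] * 3 + ["black"] * 6
probs = katz_smoothed_probs(draws, k=3)
```

With these draws, black (count 6 > k) keeps its maximum likelihood probability 6/15 = 0.4 instead of being zeroed out, while the rare colors are discounted. Note that the sketch can still return 0 when n(r + 1) = 0 for a rare r; practical implementations avoid this by smoothing the n(r) values themselves.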