The Viterbi Algorithm

and Its Application in the Speech Recognition Programming Assignment, Part I

 

Notations:

Xt denotes the hidden state at time t, and Et denotes the evidence (the observation) at time t; et is the particular evidence value observed at time t. Lowercase letters such as x, y, and s range over the possible states.

Filtering (finding the most likely final state):

Given the observations from time 1 to time t, determine the most likely state at time t.

In other words, determine a state s that maximizes Pr(Xt=s | E1=e1,E2=e2,…,Et=et) over all possible states. More precisely:

s = argmax x  Pr(Xt=x | E1=e1,E2=e2,…,Et=et)

 

Note that

Pr(Xt=x | E1=e1,E2=e2,…,Et=et)

= Pr(E1=e1,E2=e2,…,Et=et, Xt=x) / Pr(E1=e1,E2=e2,…,Et=et),

and Pr(E1=e1,E2=e2,…,Et=et) is simply a constant that does not depend on the state x.

 

Therefore

s = argmax x  Pr(Xt=x | E1=e1,E2=e2,…,Et=et)

   = argmax x  Pr(E1=e1,E2=e2,…,Et=et, Xt=x).

 

In other words, for each state x we need to calculate Pr(E1=e1,E2=e2,…,Et=et, Xt=x), the probability of seeing the evidence e1,e2,…,et at times 1,2,…,t respectively and ending in state x. Then we can find among them a state s with the maximal Pr(E1=e1,E2=e2,…,Et=et, Xt=s) value.

 

So our task is: for every possible state x, compute Pr(E1=e1,E2=e2,…,Et=et, Xt=x). A direct brute-force computation is sketched below; the dynamic-programming approach that follows does the same job efficiently.
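As a minimal brute-force sketch of this task (not the method to use in the assignment), the Python below enumerates every state sequence ending in x and sums the joint probabilities. All names here (obs, states, init_p, trans_p, emit_p) are hypothetical stand-ins for the model parameters, and the running time is exponential in t, which is exactly what the dynamic program below avoids.

```python
from itertools import product

def joint_prob(x, obs, states, init_p, trans_p, emit_p):
    """Pr(E1=e1,...,Et=et, Xt=x): sum over every state path that ends in x.

    Hypothetical parameter shapes:
      init_p[s]     -- Pr(X1=s)
      trans_p[y][s] -- Pr(Xt=s | X(t-1)=y)
      emit_p[s][e]  -- Pr(Et=e | Xt=s)
    """
    t = len(obs)
    total = 0.0
    for prefix in product(states, repeat=t - 1):
        path = prefix + (x,)                     # force the path to end in state x
        p = init_p[path[0]] * emit_p[path[0]][obs[0]]
        for i in range(1, t):
            p *= trans_p[path[i - 1]][path[i]] * emit_p[path[i]][obs[i]]
        total += p
    return total

def most_likely_final_state(obs, states, init_p, trans_p, emit_p):
    """The filtering answer: the state s maximizing Pr(E1..Et, Xt=s)."""
    return max(states, key=lambda x: joint_prob(x, obs, states, init_p, trans_p, emit_p))
```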

 

Using Dynamic Programming for filtering:

Assuming the first-order Markov property and the stationary sensor observation property, the problem of calculating Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), Et=et, Xt=x) can be recursively decomposed in the following way:

 

Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), Et=et, Xt=x)

// where Σy denotes summation over all possible states y of X(t-1)

= Σy Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), Et=et, X(t-1)=y, Xt=x)

= Σy Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), X(t-1)=y, Et=et, Xt=x)

// Using conditional probability, we have

= Σy ( Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), X(t-1)=y, Xt=x) *
       Pr(Et=et | E1=e1,E2=e2,…,E(t-1)=e(t-1), X(t-1)=y, Xt=x) )

// Because of the stationary sensor observation property, we then have

= Σy ( Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), X(t-1)=y, Xt=x) *
       Pr(Et=et | Xt=x) )

// Using conditional probability, we have

= Σy ( Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), X(t-1)=y) *
       Pr(Xt=x | E1=e1,E2=e2,…,E(t-1)=e(t-1), X(t-1)=y) *
       Pr(Et=et | Xt=x) )

// Because of the first-order Markov property, we then have

= Σy ( Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), X(t-1)=y) *
       Pr(Xt=x | X(t-1)=y) *
       Pr(Et=et | Xt=x) )
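It is convenient to give this quantity a name: write f(t, x) for Pr(E1=e1,E2=e2,…,Et=et, Xt=x). The derivation above then reads f(t, x) = Pr(Et=et | Xt=x) * Σy f(t-1, y) * Pr(Xt=x | X(t-1)=y), which is exactly the recurrence the dynamic program exploits.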

 

The t-th stage problem:

Let’s refer to the problem of calculating Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), Et=et, Xt=x) for all possible states x as the t-th stage problem.

 

Solving the t-th stage problem by solving the (t-1)-th stage problem first:

Since we have

Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), Et=et, Xt=x)

= Σy ( Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), X(t-1)=y) *
       Pr(Xt=x | X(t-1)=y) *
       Pr(Et=et | Xt=x) ),

we can solve the t-th stage problem by first solving the (t-1)-th stage problem and then applying the formula above.

 

Base case at the first stage: Similarly, the (t-1)-th stage problem can be solved by solving the (t-2)-th stage problem first, and so forth. The base case is simply the first stage problem of calculating Pr(E1=e1, X1=x) for every state x, which is solved directly by

Pr(E1=e1, X1=x) = Pr(X1=x) * Pr(E1=e1 | X1=x)
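The whole procedure amounts to a short dynamic program. Here is a sketch in Python, using the same hypothetical parameter names as the earlier brute-force sketch (obs, states, init_p, trans_p, emit_p): the dictionary f holds the current stage’s answers, i.e. f[x] = Pr(E1=e1,…,Ei=ei, Xi=x) after processing the i-th observation.

```python
def forward(obs, states, init_p, trans_p, emit_p):
    """Return f with f[x] = Pr(E1=e1,...,Et=et, Xt=x) for every state x."""
    # Base case (first stage): Pr(E1=e1, X1=x) = Pr(X1=x) * Pr(E1=e1 | X1=x)
    f = {x: init_p[x] * emit_p[x][obs[0]] for x in states}
    # Each later stage is computed from the previous one via the recurrence
    # f(t, x) = Pr(Et=et | Xt=x) * sum over y of f(t-1, y) * Pr(Xt=x | X(t-1)=y)
    for e in obs[1:]:
        f = {x: emit_p[x][e] * sum(f[y] * trans_p[y][x] for y in states)
             for x in states}
    return f

# Filtering with the result: the most likely final state is
#   f = forward(obs, states, init_p, trans_p, emit_p)
#   s = max(f, key=f.get)
```

Each stage costs O(|S|^2) work for |S| states, so the whole computation is O(|S|^2 * t) rather than exponential in t.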

 

Application in Speech Recognition Assignment #1

In assignment #1, you need to determine the probability of seeing a specific series of characters given a specific underlying word. The observations are the individual characters you see. For each state x in the word model, we want to determine Pr(E1=e1,E2=e2,…,E(t-1)=e(t-1), Et=et, Xt=x) and then calculate

Σx ( Pr(E1=e1,E2=e2,…,Et=et, Xt=x) *
       Pr(the special final state | Xt=x) ),

 

which is the likelihood of seeing a specific series of characters given a specific underlying word and ending in the special final state when the word has been completely typed.
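As a sketch of that final computation, the snippet below reuses the forward() function from the earlier sketch; FINAL is a hypothetical label for the special final state, and trans_p[x][FINAL] stands in for Pr(the special final state | Xt=x).

```python
FINAL = "final"  # hypothetical label for the word model's special final state

def word_likelihood(obs, states, init_p, trans_p, emit_p):
    """Pr(seeing obs and ending in the special final state), under one word model."""
    f = forward(obs, states, init_p, trans_p, emit_p)
    return sum(f[x] * trans_p[x][FINAL] for x in states)
```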