The Viterbi Algorithm
and Its Application in Speech Recognition Programming, Part I
Notation: X_t denotes the (hidden) state at time t, and E_t = e_t denotes the evidence (observation) seen at time t.
Filtering (finding the most likely final state):
Given the observations from time 1 to time t, determine the most likely state at time t. In other words, determine a state s that maximizes Pr(X_t=s | E_1=e_1, E_2=e_2, ..., E_t=e_t) over all possible states. More precisely, in mathematics:

s = argmax_x Pr(X_t=x | E_1=e_1, E_2=e_2, ..., E_t=e_t)
Note that

Pr(X_t=x | E_1=e_1, E_2=e_2, ..., E_t=e_t)
= Pr(E_1=e_1, E_2=e_2, ..., E_t=e_t, X_t=x) / Pr(E_1=e_1, E_2=e_2, ..., E_t=e_t),

and Pr(E_1=e_1, E_2=e_2, ..., E_t=e_t) is simply a constant that does not depend on the state x. Therefore

s = argmax_x Pr(X_t=x | E_1=e_1, E_2=e_2, ..., E_t=e_t)
  = argmax_x Pr(E_1=e_1, E_2=e_2, ..., E_t=e_t, X_t=x).
In other words, for each state x we need to calculate Pr(E_1=e_1, E_2=e_2, ..., E_t=e_t, X_t=x), the probability of seeing the evidence e_1, e_2, ..., e_t at times 1, 2, ..., t respectively and ending in state x. Then we can find among them a state s with the maximal Pr(E_1=e_1, E_2=e_2, ..., E_t=e_t, X_t=s) value.
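As a minimal sketch of this selection step (the dictionary and its values are illustrative, not from the assignment), suppose the joint probabilities for every state have already been computed; the filtered state is then a plain argmax:

```python
# joint[x] holds Pr(E_1=e_1, ..., E_t=e_t, X_t=x) for every state x
# (hypothetical example values)
joint = {"s1": 0.012, "s2": 0.034, "s3": 0.008}

# The most likely final state is the one with the largest joint probability.
best_state = max(joint, key=joint.get)
print(best_state)  # -> "s2"
```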
So our task is: for each state x, compute Pr(E_1=e_1, E_2=e_2, ..., E_t=e_t, X_t=x).

Using Dynamic Programming for filtering:
Assuming the first-order Markov property and the stationary sensor observation property, the problem of calculating Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), E_t=e_t, X_t=x) can be recursively decomposed in the following way:
Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), E_t=e_t, X_t=x)

= Σ_y Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), E_t=e_t, X_(t-1)=y, X_t=x)

= Σ_y Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), X_(t-1)=y, E_t=e_t, X_t=x)

// Using conditional probability, we have

= Σ_y ( Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), X_(t-1)=y, X_t=x) *
        Pr(E_t=e_t | E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), X_(t-1)=y, X_t=x) )

// Because of the stationary sensor observation property, we then have

= Σ_y ( Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), X_(t-1)=y, X_t=x) *
        Pr(E_t=e_t | X_t=x) )

// Using conditional probability, we have

= Σ_y ( Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), X_(t-1)=y) *
        Pr(X_t=x | E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), X_(t-1)=y) *
        Pr(E_t=e_t | X_t=x) )

// Because of the first-order Markov property, we then have

= Σ_y ( Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), X_(t-1)=y) *
        Pr(X_t=x | X_(t-1)=y) *
        Pr(E_t=e_t | X_t=x) )
The t-th stage problem:
Let's refer to the problem of calculating Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), E_t=e_t, X_t=x) for all possible states x as the t-th stage problem.
Solving the t-th stage problem by solving the (t-1)-th stage problem first:
Since we have

Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), E_t=e_t, X_t=x)
= Σ_y ( Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), X_(t-1)=y) *
        Pr(X_t=x | X_(t-1)=y) *
        Pr(E_t=e_t | X_t=x) ),

we can solve the t-th stage problem by solving the (t-1)-th stage problem first and then using the formula above.
Base case at the first stage:
Similarly, the (t-1)-th stage problem can be solved by solving the (t-2)-th stage problem first, and so forth. The base case is simply the first-stage problem of calculating Pr(E_1=e_1, X_1=x) for every state x, which is solved directly by

Pr(E_1=e_1, X_1=x) = Pr(X_1=x) * Pr(E_1=e_1 | X_1=x)
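Putting the base case and the recursive step together gives the full dynamic program. The sketch below assumes the forward_step function from earlier and a prior dictionary with prior[x] = Pr(X_1=x); all names are my own, not from the assignment:

```python
def forward(observations, prior, transition, emission):
    """Compute alpha[x] = Pr(e_1, ..., e_t, X_t=x) for the full sequence."""
    # Base case: Pr(E_1=e_1, X_1=x) = Pr(X_1=x) * Pr(E_1=e_1 | X_1=x)
    e_1 = observations[0]
    alpha = {x: prior[x] * emission[x].get(e_1, 0.0) for x in prior}

    # Recursive case: solve stage t from the stage-(t-1) values.
    for e_t in observations[1:]:
        alpha = forward_step(alpha, transition, emission, e_t)
    return alpha
```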
In assignment #1, you need to determine the probability of seeing a specific series of characters given a specific underlying word. The observations are the individual characters you see. For each state x in the word model, we want to determine Pr(E_1=e_1, E_2=e_2, ..., E_(t-1)=e_(t-1), E_t=e_t, X_t=x) and then calculate

Σ_x ( Pr(E_1=e_1, E_2=e_2, ..., E_t=e_t, X_t=x) *
      Pr(the special final state | X_t=x) ).

That is the likelihood of seeing a specific series of characters given a specific underlying word and ending in the final state when we finish typing that word.
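As one possible last step (again a sketch with my own names; final_prob[x] stands for Pr(the special final state | X_t=x) in whatever form the assignment's word model provides it), the word likelihood is a single weighted sum over the last stage, reusing the forward function above:

```python
def word_likelihood(observations, prior, transition, emission, final_prob):
    """Likelihood of the character sequence under one word model,
    ending in the special final state."""
    alpha = forward(observations, prior, transition, emission)
    return sum(alpha[x] * final_prob.get(x, 0.0) for x in alpha)
```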