TBA
4.1 The Forward Algorithm
4.2 The Backward Algorithm
4.3 The Forward-Backward Algorithm
Hidden Markov models (HMMs) are used as a modeling tool in a number of application areas, ranging from speech recognition to DNA sequencing to data compression to computer vision. An HMM is a Markov process, described by a Markov matrix (i.e., a transition probability matrix) and an initial distribution, together with an emission probability matrix, which reveals partial state information with each state transition. Complete state information is hidden.
A discrete-time hidden Markov model is a tuple $H = \langle X, Y, \Pi, A, B \rangle$, where time is discrete: i.e., $t \in T = \{1, 2, \ldots\}$, and
N.B.: $\sum_k b_{ijk} = 1$ for all states $x_i, x_j \in X$ s.t. $a_{ij} > 0$.
Example: Figure 1 depicts an example of an HMM. The set of states $X = \{1, 2, 3\}$. All transitions between states occur with uniform probability: i.e., $a_{11} = a_{12} = a_{13} = 1/3$; $a_{22} = a_{23} = 1/2$; $a_{33} = 1$. The initial distribution is also uniform. The alphabet of observations $Y = \{H, T\}$, and at state $i$, $b_{ijH} = 1/i$ (so $b_{ijT} = 1 - 1/i$), for all $j$ s.t. $a_{ij} > 0$.
Figure 1: Example: HMM with uniform transition and emission probabilities.
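The example can be written down as data. The following is a minimal sketch, assuming a dictionary-based encoding; the names `STATES`, `ALPHABET`, `PI`, `A`, `emission`, and `is_valid_hmm` are illustrative choices and not part of the notes. Exact fractions are used so the values can be compared directly with the tables later on.

```python
from fractions import Fraction

STATES = (1, 2, 3)
ALPHABET = ("H", "T")

# Initial distribution: uniform over the three states.
PI = {i: Fraction(1, 3) for i in STATES}

# Transition probabilities a_ij; a missing entry means a_ij = 0.
A = {
    1: {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)},
    2: {2: Fraction(1, 2), 3: Fraction(1, 2)},
    3: {3: Fraction(1)},
}


def emission(i, j, symbol):
    """b_ij(symbol); in this example it depends only on the source state i."""
    heads = Fraction(1, i)
    return heads if symbol == "H" else 1 - heads


def is_valid_hmm():
    """Check that pi, every row of A, and every b_ij sum to 1."""
    if sum(PI.values()) != 1:
        return False
    for i in STATES:
        if sum(A[i].values()) != 1:
            return False
        for j in A[i]:
            if sum(emission(i, j, k) for k in ALPHABET) != 1:
                return False
    return True


print(is_valid_hmm())
```

Storing only the nonzero transitions mirrors the trellis: an arrow exists exactly where an entry of `A` exists.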
The trellis is a visual aid that helps us to envision the state space of an HMM as it evolves over time. The trellis is constructed by listing all the states of an HMM in columns and drawing arrows from each state xi in one column to all the states xj in the next column for which aij > 0: i.e., all the states xj which are reachable from state xi. A state sequence is a path through the trellis.
Figure 2: The Trellis (columns labeled t = 0, 1, 2, 3, ..., T). In this example, it is possible to transition from any state to any other state: i.e., for all states $x_i, x_j$, the transition probability $a_{ij} > 0$.
Exercise: Draw the trellis corresponding to the HMM depicted in Figure 1.
Let $x \equiv (x^{0:T})$ denote a state sequence; let $y \equiv (y^{1:T})$ denote an observation sequence.
Evaluate: Given an HMM $\lambda$ and the sequence of observations $y$, compute the probability $P[y]$: i.e., the probability that the HMM generates the sequence of observations $y$.
Decode: Given an HMM $\lambda$ and the sequence of observations $y$, compute $x$ s.t. $P[x, y]$ is maximal: i.e., compute $x$ s.t. $P[y \mid x] P[x]$ is maximal: i.e., compute the sequence of states $x$ most likely to generate the sequence of observations $y$.
Learn: Given the sequence of observations $y$, estimate the parameters $\Pi, A, B$ of an HMM $\lambda$ s.t. $P[y]$ is maximal: i.e., construct $\lambda$ that is most likely to generate the sequence of observations $y$.
Given the sequence of observations $y$, the probability $P[y]$ can be computed as follows: (i) compute the probability $P[x]$ of path $x$, (ii) compute the probability $P[y \mid x]$ of observation sequence $y$ given path $x$, (iii) sum the product $P[x] P[y \mid x]$ over all paths $x$. Formally,

$$P[y] = \sum_x P[x, y] = \sum_x P[x] \, P[y \mid x] \tag{1}$$
The RHS of Equation 1 simplifies by applying the Markov and conditional independence (CI) assumptions. The probability $P[x]$ of path $x$ equals the probability of state $x^0$, times the probability of transitioning from state $x^0$ to state $x^1$, times the probability of transitioning from state $x^1$ to state $x^2$, and so on.

$$P[x] = \pi(x^0) \prod_{t=1}^{T} a(x^{t-1}, x^t) \tag{2}$$
The probability $P[y \mid x]$ of observation sequence $y$ given path $x$ is computed by multiplying the probability of observing $y^1$ upon transitioning from state $x^0$ to state $x^1$, times the probability of observing $y^2$ upon transitioning from state $x^1$ to state $x^2$, and so on.

$$P[y \mid x] = \prod_{t=1}^{T} b(x^{t-1}, x^t, y^t) \tag{3}$$
Hence,

$$P[x, y] = \pi(x^0) \prod_{t=1}^{T} a(x^{t-1}, x^t) \, b(x^{t-1}, x^t, y^t) \tag{4}$$
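As a concrete check of the joint probability of one path and one observation sequence, the product can be evaluated for the HMM of Figure 1. This is a minimal sketch under an assumed encoding; `PI`, `A`, `emission`, and `joint_probability` are hypothetical names introduced only for illustration.

```python
from fractions import Fraction

STATES = (1, 2, 3)
PI = {i: Fraction(1, 3) for i in STATES}
# Transition probabilities a_ij; a missing entry means a_ij = 0.
A = {
    1: {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)},
    2: {2: Fraction(1, 2), 3: Fraction(1, 2)},
    3: {3: Fraction(1)},
}


def emission(i, j, symbol):
    """b_ij(symbol) for the example: heads with probability 1/i."""
    heads = Fraction(1, i)
    return heads if symbol == "H" else 1 - heads


def joint_probability(path, ys):
    """P[x, y] for a path x^0..x^T and observations y^1..y^T."""
    if len(path) != len(ys) + 1:
        raise ValueError("a path has one more state than there are observations")
    p = PI[path[0]]
    for t, y in enumerate(ys, start=1):
        i, j = path[t - 1], path[t]
        # One transition factor and one emission factor per time step.
        p *= A[i].get(j, Fraction(0)) * emission(i, j, y)
    return p


print(joint_probability((3, 3, 3, 3), "THT"))  # 4/81
print(joint_probability((1, 2, 3, 3), "THT"))  # 0, since state 1 never emits T
```

The first path stays in state 3 throughout, so the product is (1/3)(2/3)(1/3)(2/3) = 4/81.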
and, moreover,

$$P[y] = \sum_x \pi(x^0) \prod_{t=1}^{T} a(x^{t-1}, x^t) \, b(x^{t-1}, x^t, y^t) \tag{5}$$
Direct evaluation of Equation 5 requires $O(T N^T)$ multiplications, where $N$ is the number of states in the HMM, since there are $N^T$ paths through the trellis, each of which requires $T$ multiplications.
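To make the cost of direct evaluation tangible, the sum over paths can be written as an explicit enumeration. The sketch below assumes the HMM of Figure 1 encoded with hypothetical names (`PI`, `A`, `emission`, `brute_force_likelihood`); it is only practical for very short sequences.

```python
from fractions import Fraction
from itertools import product

STATES = (1, 2, 3)
PI = {i: Fraction(1, 3) for i in STATES}
# Transition probabilities a_ij; a missing entry means a_ij = 0.
A = {
    1: {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)},
    2: {2: Fraction(1, 2), 3: Fraction(1, 2)},
    3: {3: Fraction(1)},
}


def emission(i, j, symbol):
    """b_ij(symbol) for the example: heads with probability 1/i."""
    heads = Fraction(1, i)
    return heads if symbol == "H" else 1 - heads


def brute_force_likelihood(ys):
    """P[y] as a sum of P[x, y] over every state sequence x^0..x^T."""
    total = Fraction(0)
    for path in product(STATES, repeat=len(ys) + 1):
        p = PI[path[0]]
        for t, y in enumerate(ys, start=1):
            i, j = path[t - 1], path[t]
            p *= A[i].get(j, Fraction(0)) * emission(i, j, y)
        total += p
    return total


print(brute_force_likelihood("THT"))  # 239/2592, about 0.0922
```

The loop visits every sequence of states, so its running time grows exponentially with the length of `ys`.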
The forward algorithm is an efficient alternative for computing the quantity $P[y]$ based on dynamic programming. This algorithm computes the probabilities of partial observation sequences in terms of the probabilities of shorter observation sequences and visiting intermediate states. It is called the forward algorithm because it works from the start of an observation sequence to its end.
Let $\alpha_t(j)$ denote the joint probability at time $t$ of the state $x_j$ and the partial observation sequence $y^{1:t}$: i.e., $\alpha_t(j) = P[y^{1:t}, X_t = x_j]$. The probability $P[y^{1:t}]$ can be computed by marginalizing $X_t$: i.e., sum the joint probabilities of observing this sequence and visiting state $x_j$ at time $t$, for all $j$:

$$P[y^{1:t}] = \sum_j P[y^{1:t}, X_t = x_j] = \sum_j \alpha_t(j)$$
In particular, $P[y] = \sum_j \alpha_T(j)$. The forward algorithm computes $\alpha$ values inductively as follows: for all states $j$,

$$\alpha_0(j) = \pi(x_j)$$

$$\alpha_{t+1}(j) = \sum_i \alpha_t(i) \, a_{ij} \, b_{ij}(y^{t+1}), \quad \text{for } 0 \le t \le T - 1$$
In particular, $\alpha_0(j)$ is defined as the initial probability of state $x_j$, and $\alpha_{t+1}(j)$ is computed as follows: (i) look up the value $\alpha_t(i)$ for arbitrary state $x_i$: i.e., the joint probability of observing sequence $y^{1:t}$ and visiting state $x_i$ at time $t$; (ii) look up the probability $a_{ij}$ of transitioning from state $x_i$ to state $x_j$; (iii) look up the probability of observing $y^{t+1}$ upon transitioning from state $x_i$ to state $x_j$; and (iv) sum the product $\alpha_t(i) \, a_{ij} \, b_{ij}(y^{t+1})$ for all $i$.
Table 1: Forward Algorithm: What is the probability of “THT”?

| Time | $\alpha_t(1)$ | $\alpha_t(2)$ | $\alpha_t(3)$ | Sum: $P[y^{1:t}]$ |
|---|---|---|---|---|
| 0 | 1/3 | 1/3 | 1/3 | 1 |
| 1 | 0 | 0 + 1/12 | 0 + 1/12 + 2/9 | 14/36 = 0.3889 |
| 2 | 0 | 0 + 1/48 | 0 + 1/48 + 11/108 | 62/432 = 0.1435 |
| 3 | 0 | 0 + 1/192 | 0 + 1/192 + 106/1296 | 478/5184 = 0.0922 |

The complexity of the forward algorithm is $O(T N^2)$ (Why?).
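The forward recursion can be sketched in a few lines. The code below is an illustrative sketch rather than a reference implementation: it assumes the HMM of Figure 1, and the names `PI`, `A`, `emission`, and `forward` are hypothetical. With exact fractions, the row sums it produces can be compared against Table 1.

```python
from fractions import Fraction

STATES = (1, 2, 3)
PI = {i: Fraction(1, 3) for i in STATES}
# Transition probabilities a_ij; a missing entry means a_ij = 0.
A = {
    1: {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)},
    2: {2: Fraction(1, 2), 3: Fraction(1, 2)},
    3: {3: Fraction(1)},
}


def emission(i, j, symbol):
    """b_ij(symbol) for the example: heads with probability 1/i."""
    heads = Fraction(1, i)
    return heads if symbol == "H" else 1 - heads


def forward(ys):
    """Return [alpha_0, ..., alpha_T]; each alpha_t maps state -> probability."""
    alphas = [dict(PI)]  # alpha_0(j) = pi(x_j)
    for y in ys:
        prev = alphas[-1]
        # alpha_{t+1}(j) = sum_i alpha_t(i) * a_ij * b_ij(y^{t+1})
        alphas.append({
            j: sum(prev[i] * A[i].get(j, 0) * emission(i, j, y) for i in STATES)
            for j in STATES
        })
    return alphas


alphas = forward("THT")
for t, alpha in enumerate(alphas):
    print(t, sum(alpha.values()))  # row sums: 1, 7/18, 31/216, 239/2592
```

Each time step fills in one column of the trellis from the previous column only, which is where the savings over path enumeration come from.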
The backward algorithm is another algorithm for computing the quantity $P[y]$. Like the forward algorithm, it is a dynamic programming algorithm. But unlike the forward algorithm, it works from the end of an observation sequence to its start, computing the probabilities of partial observation sequences in terms of the probabilities of shorter observation sequences, given intermediate states.
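As above, the probability $P[y^{t+1:T}]$ can be computed by marginalizing $X_t$: i.e., sum the joint probabilities of observing this sequence and visiting state $x_i$ at time $t$, for all $i$:

$$P[y^{t+1:T}] = \sum_i P[y^{t+1:T}, X_t = x_i] = \sum_i P[y^{t+1:T} \mid X_t = x_i] \, P[X_t = x_i] = \sum_i \beta_t(i) \, P[X_t = x_i]$$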
where $\beta_t(i)$ denotes the probability of the observation sequence $y^{t+1:T}$, given state $x_i$ is visited at time $t$: i.e., $\beta_t(i) = P[y^{t+1:T} \mid X_t = x_i]$. But since $P[X_0 = x_i] = \pi(x_i)$, it follows that

$$P[y] = \sum_i \beta_0(i) \, \pi(x_i).$$
The $\beta$ values are computed via backward induction as follows: for all states $i$,

$$\beta_T(i) = 1$$

$$\beta_{t-1}(i) = \sum_j a_{ij} \, b_{ij}(y^t) \, \beta_t(j), \quad \text{for } T \ge t \ge 1$$
In particular, $\beta_T(i)$ is initialized to 1, and $\beta_{t-1}(i)$ is computed as follows: (i) look up the probability $a_{ij}$ of transitioning from state $x_i$ to arbitrary state $x_j$; (ii) look up the probability of observing $y^t$ upon transitioning from state $x_i$ to state $x_j$; (iii) look up the value $\beta_t(j)$ for state $x_j$: the probability of observing sequence $y^{t+1:T}$, given state $x_j$ is visited at time $t$; and (iv) sum the product $a_{ij} \, b_{ij}(y^t) \, \beta_t(j)$ for all $j$.
The complexity of the backward algorithm is $O(T N^2)$ (Why?).
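The backward induction can be sketched in the same style. This is a hedged illustration for the HMM of Figure 1; `PI`, `A`, `emission`, and `backward` are hypothetical names, and the list returned is indexed so that `betas[t]` holds the values for time t.

```python
from fractions import Fraction

STATES = (1, 2, 3)
PI = {i: Fraction(1, 3) for i in STATES}
# Transition probabilities a_ij; a missing entry means a_ij = 0.
A = {
    1: {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)},
    2: {2: Fraction(1, 2), 3: Fraction(1, 2)},
    3: {3: Fraction(1)},
}


def emission(i, j, symbol):
    """b_ij(symbol) for the example: heads with probability 1/i."""
    heads = Fraction(1, i)
    return heads if symbol == "H" else 1 - heads


def backward(ys):
    """Return [beta_0, ..., beta_T]; each beta_t maps state -> probability."""
    betas = [{i: Fraction(1) for i in STATES}]  # beta_T(i) = 1
    for y in reversed(ys):  # y^T, y^{T-1}, ..., y^1
        nxt = betas[0]
        # beta_{t-1}(i) = sum_j a_ij * b_ij(y^t) * beta_t(j)
        betas.insert(0, {
            i: sum(a_ij * emission(i, j, y) * nxt[j] for j, a_ij in A[i].items())
            for i in STATES
        })
    return betas


betas = backward("THT")
likelihood = sum(PI[i] * betas[0][i] for i in STATES)
print(likelihood)  # 239/2592, about 0.0922
```

Weighting the time-0 values by the initial distribution gives the same probability for “THT” that the forward pass reports.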
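The forward-backward algorithm caches some combination of forward ($\alpha$) and backward ($\beta$) values.

$$
\begin{aligned}
P[y, X_t = x_i] &= P[y^{1:t}, X_t = x_i, y^{t+1:T}] \\
&= P[y^{1:t}, X_t = x_i] \, P[y^{t+1:T} \mid y^{1:t}, X_t = x_i] \\
&= P[y^{1:t}, X_t = x_i] \, P[y^{t+1:T} \mid X_t = x_i] \\
&= \alpha_t(i) \, \beta_t(i)
\end{aligned}
$$

Thus, by marginalization, $P[y] = \sum_i P[y, X_t = x_i] = \sum_i \alpha_t(i) \, \beta_t(i)$, for all $t$.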
Table 2: Backward Algorithm: What is the probability of “THT”?

| Time | $\beta_t(1)$ | $\beta_t(2)$ | $\beta_t(3)$ | Average: $P[y^{t+1:T}]$ |
|---|---|---|---|---|
| 3 | 1 | 1 | 1 | 1 |
| 2 | 0 + 0 + 0 | 1/4 + 1/4 | 2/3 | 7/18 = 0.3889 |
| 1 | 0 + 1/6 + 2/9 | 1/8 + 1/6 | 2/9 | 65/216 = 0.3009 |
| 0 | 0 + 0 + 0 | 7/96 + 1/18 | 4/27 | 239/2592 = 0.0922 |
Table 3: Forward-Backward Algorithm: What is the probability of “THT”?

| Time | $\alpha_t(1)\beta_t(1)$ | $\alpha_t(2)\beta_t(2)$ | $\alpha_t(3)\beta_t(3)$ | Sum: $P[y]$ |
|---|---|---|---|---|
| 0 | (1/3)(0) | (1/3)(0.1285) | (1/3)(0.1481) | 0.0922 |
| 1 | (0)(0.3889) | (0.0833)(0.2917) | (0.3056)(0.2222) | 0.0922 |
| 2 | (0)(0) | (0.0208)(1/2) | (0.1227)(2/3) | 0.0922 |
| 3 | (0)(1) | (0.0052)(1) | (0.0870)(1) | 0.0922 |
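The identity behind Table 3 can be checked numerically: the sum over states of the forward value times the backward value should give the same number at every time step. The sketch below assumes the HMM of Figure 1; all names (`PI`, `A`, `emission`, `forward`, `backward`, `marginal_by_time`) are hypothetical.

```python
from fractions import Fraction

STATES = (1, 2, 3)
PI = {i: Fraction(1, 3) for i in STATES}
# Transition probabilities a_ij; a missing entry means a_ij = 0.
A = {
    1: {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)},
    2: {2: Fraction(1, 2), 3: Fraction(1, 2)},
    3: {3: Fraction(1)},
}


def emission(i, j, symbol):
    """b_ij(symbol) for the example: heads with probability 1/i."""
    heads = Fraction(1, i)
    return heads if symbol == "H" else 1 - heads


def forward(ys):
    """Return [alpha_0, ..., alpha_T]."""
    alphas = [dict(PI)]
    for y in ys:
        prev = alphas[-1]
        alphas.append({
            j: sum(prev[i] * A[i].get(j, 0) * emission(i, j, y) for i in STATES)
            for j in STATES
        })
    return alphas


def backward(ys):
    """Return [beta_0, ..., beta_T]."""
    betas = [{i: Fraction(1) for i in STATES}]
    for y in reversed(ys):
        nxt = betas[0]
        betas.insert(0, {
            i: sum(a_ij * emission(i, j, y) * nxt[j] for j, a_ij in A[i].items())
            for i in STATES
        })
    return betas


def marginal_by_time(ys):
    """Return sum_i alpha_t(i) * beta_t(i) for t = 0, ..., T."""
    alphas, betas = forward(ys), backward(ys)
    return [sum(a[i] * b[i] for i in STATES) for a, b in zip(alphas, betas)]


print(marginal_by_time("THT"))  # four copies of Fraction(239, 2592)
```

Because both passes are cached, the product is available for every state and every time step at no extra asymptotic cost.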
Recall the statement of the decoding problem: find a path $x \in \arg\max_x P[x, y]$. Equivalently, and perhaps more intuitively, find a path $x \in \arg\max_x P[y \mid x] \, P[x]$.
The Viterbi algorithm uses dynamic programming to solve the decoding problem by finding longer and longer subpaths: for all $t = 0, \ldots, T$,

$$x^{0:t} \in \arg\max_{\{x^{0:t}\}} P[x^{0:t}, y^{1:t}]$$
Define $\sigma_t(j) = \max_{x^{0:t-1}} P[x^{0:t-1}, y^{1:t}, X_t = x_j]$. In words, $\sigma_t(j)$ is the value of the best possible path to state $j$ at time $t$, where the “best” possible path is that which maximizes the joint probability of $x^{0:t}$ and $y^{1:t}$, terminating at time $t$ at state $j$. The Viterbi algorithm computes these values inductively as follows:

$$\sigma_0(j) = \pi(x_j)$$

$$\sigma_{t+1}(j) = \max_i \sigma_t(i) \, a_{ij} \, b_{ij}(y^{t+1}), \quad \text{for } 0 \le t \le T - 1$$
Simultaneously, the algorithm stores the states that comprise a most likely path:

$$\tau_{t+1}(j) = \arg\max_i \sigma_t(i) \, a_{ij} \, b_{ij}(y^{t+1}), \quad \text{for } 0 \le t \le T - 1$$

In words, $\tau_t(j)$ is the state at time $t - 1$ on the best possible path to state $j$ at time $t$. Let $x_*^T$ denote the state $j$ that maximizes $\sigma_T(j)$; that is, $x_*^T$ is the state $j$ at time $T$ that maximizes the joint probability of $x^{0:T}$ and $y^{1:T}$. We can compute the most likely path to $x_*^T$ by backtracing through the $\tau$ values, as follows:

$$x_*^T \in \arg\max_j \sigma_T(j)$$

$$x_*^{t-1} = \tau_t(x_*^t), \quad \text{for } T \ge t \ge 1$$
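Table 4: Viterbi Algorithm: What is the most likely path, given “THT”?

| Time | $\sigma_t(1)$ | $\tau_t(1)$ | $\sigma_t(2)$ | $\tau_t(2)$ | $\sigma_t(3)$ | $\tau_t(3)$ | Backtrace |
|---|---|---|---|---|---|---|---|
| 0 | 1/3 | – | 1/3 | – | 1/3 | – | 3 |
| 1 | 0 | 1 | $\max\{0, 1/12\}$ | 2 | $\max\{0, 1/12, 2/9\}$ | 3 | 3 |
| 2 | 0 | 1 | $\max\{0, 1/48\}$ | 2 | $\max\{0, 1/48, 2/27\}$ | 3 | 3 |
| 3 | 0 | 1 | $\max\{0, 1/192\}$ | 2 | $\max\{0, 1/192, 4/81\}$ | 3 | 3 |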
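The Viterbi recursion and the backtrace can be sketched as follows. This is an illustrative sketch for the HMM of Figure 1, not a definitive implementation; `PI`, `A`, `emission`, and `viterbi` are hypothetical names, and ties in the arg max are broken in favor of the lowest-numbered state, which is an assumption of this sketch.

```python
from fractions import Fraction

STATES = (1, 2, 3)
PI = {i: Fraction(1, 3) for i in STATES}
# Transition probabilities a_ij; a missing entry means a_ij = 0.
A = {
    1: {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)},
    2: {2: Fraction(1, 2), 3: Fraction(1, 2)},
    3: {3: Fraction(1)},
}


def emission(i, j, symbol):
    """b_ij(symbol) for the example: heads with probability 1/i."""
    heads = Fraction(1, i)
    return heads if symbol == "H" else 1 - heads


def viterbi(ys):
    """Return (most likely path x^0..x^T, its joint probability with ys)."""
    sigma = dict(PI)  # sigma_0(j) = pi(x_j)
    back = []         # back[t - 1][j] holds tau_t(j)
    for y in ys:
        new_sigma, tau = {}, {}
        for j in STATES:
            scores = {
                i: sigma[i] * A[i].get(j, 0) * emission(i, j, y) for i in STATES
            }
            best = max(scores, key=scores.get)  # first maximizer wins ties
            tau[j] = best
            new_sigma[j] = scores[best]
        back.append(tau)
        sigma = new_sigma
    last = max(sigma, key=sigma.get)
    path = [last]
    for tau in reversed(back):  # follow the stored predecessors back to time 0
        path.append(tau[path[-1]])
    path.reverse()
    return path, sigma[last]


print(viterbi("THT"))  # ([3, 3, 3, 3], Fraction(4, 81))
```

The result agrees with the Backtrace column of Table 4: the most likely explanation of “THT” is to start in state 3 and stay there.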