Hidden Markov Models


This page is under construction.

Introduction

The following presentation is adapted from [Rabiner & Juang, 1986] and [Charniak, 1993].

Notational conventions

T = length of the sequence of observations (training set)
N = number of states (we either know or guess this number)
M = number of possible observations (from the training set)
Omega_X = {q_1,...q_N} (finite set of possible states)
Omega_O = {v_1,...,v_M} (finite set of possible observations)
X_t random variable denoting the state at time t (state variable)
O_t random variable denoting the observation at time t (output variable)
sigma = o_1,...,o_T (sequence of actual observations)

Distributional parameters

A = {a_ij} s.t. a_ij = Pr(X_t+1 = q_j |X_t = q_i) (transition probabilities)
B = {b_i} s.t. b_i(k) = Pr(O_t = v_k | X_t = q_i t) (observation probabilities)
pi = {pi_i} s.t. pi_i = Pr(X_0 = q_i) (initial state distribution)

Definitions

A hidden Markov model (HMM) is a five-tuple (Omega_X,Omega_O,A,B,pi). Let lambda = {A,B,pi} denote the parameters for a given HMM with fixed Omega_X and Omega_O.

Problems

1. Find Pr(sigma|lambda): the probability of the observations given the model.
2. Find the most likely state trajectory given the model and observations.
3. Adjust lambda = {A,B,pi} to maximize Pr(sigma|lambda).

Motivation

A discrete-time, discrete-space dynamical system governed by a Markov chain emits a sequence of observable outputs: one output (observation) for each state in a trajectory of such states. From the observable sequence of outputs, infer the most likely dynamical system. The result is a model for the underlying process. Alternatively, given a sequence of outputs, infer the most likely sequence of states. We might also use the model to predict the next observation or more generally a continuation of the sequence of observations.

Hidden Markov models are used in speech recognition. Suppose that we have a set W of words and a separate training set for each word. Build an HMM for each word using the associated training set. Let lambda_w denote the HMM parameters associated with the word w. When presented with a sequence of observations sigma, choose the word with the most likely model, i.e.,

        w* = arg max_{w in W} Pr(sigma|lambda_w)

Forward-Backward Algorithm

Preliminaries

Define the alpha values as follows,
        alpha_t(i) = Pr(O_1=o_1,...,O_t=o_t, X_t = q_i | lambda)
Note that
        alpha_T(i) = Pr(O_1=o_1,...,O_T=o_T, X_T = q_i | lambda)
                   = Pr(sigma, X_T = q_i | lambda)
The alpha values enable us to solve Problem 1 since, marginalizing, we obtain
        Pr(sigma|lambda) = sum_i=1^N Pr(o_1,...,o_T, X_T = q_i | lambda)
                         = sum_i=1^N alpha_T(i)

Define the beta values as follows,

        beta_t(i) = Pr(O_t+1=o_t+1,...,O_T=o_T | X_t = q_i, lambda)
We will need the beta values later in the Baum-Welch algorithm.

Algorithmic Details

1. Compute the forward (alpha) values:
        a. alpha_1(i) = pi_i b_i(o_1)
        b. alpha_t+1(j) = [sum_i=1^N alpha_t(i) a_ij] b_j(o_t+1)
2. Computing the backward (beta) values:
        a. beta_T(i) = 1
        b. beta_t(i) = sum_j=1^N a_ij b_j(o_t+1) beta_t+1(j)

Viterbi Algorithm

Intuition

Compute the most likely trajectory starting with the empty output sequence; use this result to compute the most likely trajectory with an output sequence of length one; recurse until you have the most likely trajectory for the entire sequence of outputs.

Algorithmic Details

1. Initialization:
        For 1 <= i <= N,

        a. delta_1(i) = pi b_i(o_1)

        b. Phi_1(i) = 0
2. Recursion:
        For 2 <= t <= T, 1 <= j <= N,

        a. delta_t(j) = max_i [delta_t-1(i)a_ij]b_j(o_t)

        b. Phi_t(j) = argmax_i [delta_t-1(i)a_ij]
3. Termination:
        a. p* = max_i [delta_T(i)]

        b. i*_T = argmax_i [delta_T(i)]
4. Reconstruction:
        For t = t-1,t-2,...,1,

        i*_t = Phi_t+1(i*_t+1) 
The resulting trajectory, i*_1,...,i*_T, solves Problem 2.

Baum-Welch Algorithm

Intuition

To solve Problem 3 we need a method of adjusting the lambda parameters to maximize the likelihood of the training set.

Suppose that the outputs (observations) are in a 1-1 correspondence with the states so that N = M, varphi(q_i) = v_i and b_i(j) = 1 for j = i and 0 for j != i. Now the Markov process is not hidden at all and the HMM is just a Markov chain. To estimate the lambda parameters for this Markov chain it is enough just to calculate the appropriate frequencies from the observed sequence of outputs. These frequencies constitute sufficient statistics for the underlying distributions.

In the more general case, we can't observe the states directly so we can't calculate the required frequencies. In the hidden case, we use expectation maximization (EM) as described in [Dempster et al., 1977].

Instead of calculating the required frequencies directly from the observed outputs, we iteratively estimated the parameters. We start by choosing arbitrary values for the parameters (just make sure that the values satisfy the requirements for probability distributions).

We then compute the expected frequencies given the model and the observations. The expected frequencies are obtained by weighting the observed transitions by the probabilities specified in the current model. The expected frequencies so obtained are then substituted for the old parameters and we iterate until there is no improvement. On each iteration we improve the probability of O being observed from the model until some limiting probability is reached. This iterative procedure is guaranteed to converge on a local maximum of the cross entropy (Kullback-Leibler) performance measure.

Preliminaries

The probability of a trajectory being in state q_i at time t and making the transition to q_j at t+1 given the observation sequence and model.
        xi_t(i,j) = Pr(X_t = q_i, X_t+1 = q_j | sigma, lambda)
We compute these probabilities using the forward backward variables.
                       alpha_t(i) a_ij(o_t+1) beta_t+1(j)
        xi_t(i,j) =  -------------------------------------
                                  Pr(O | lambda) 
The probability of being in q_i at t given the observation sequence and model.
        gamma_t(i) = Pr(X_t = q_i | sigma, lambda)
Which we obtain by marginalization.
        gamma_t(i) = sum_j xi_t(i,j)
Note that
        sum_t=1^T gamma_t(i) = expected number of transitions from q_i
and
        sum_t=1^T xi_t(i,j) = expected number of transitions from q_i to q_j

Algorithmic Details

1. Choose the initial parameters, lambda = {A, B, pi}, arbitrarily.
2. Reestimate the parameters.
        a. bar{pi}_i = gamma_t(i) 

                        sum_t=1^T-1 xi_t(i,j) 
        b. bar{a}_ij = ------------------------
                        sum_t=1^T-1 gamma_t(i) 

                          sum_t=1^T-1 gamma_t(j) 1_{o_t = k}
        c. bar{b}_j(k) = ------------------------------------
                               sum_t=1^T-1 gamma_t(j) 
where 1_{o_t = k} = 1 if o_t = k and 0 otherwise.
3. Let bar{A} = {bar{a}_ij}, bar{B} = {bar{b}_i(k)}, and bar{pi} = {{bar{pi}_i}.
4. Set bar{lambda} to be {bar{A}, bar{B}, bar{pi}}.
5. If lambda = bar{lambda} then quit, else set lambda to be bar{lambda} and return to Step 2.

Bayesian Network Algorithms

The Bayesian network representation is shown in Figure 1.
        X_0     X_1     X_2     X_3             X_T-1   X_T
        o ----> o ----> o ----> o       ...     o ----> o 
        |       |       |       |               |       |  
        |       |       |       |               |       |  
        v       v       v       v               v       v  
        o       o       o       o       ...     o       o  
        O_0     O_1     O_2     O_3             O_T-1   O_T

        Fig. 1: Bayesian network representation for an HMM 
In the description of the Baum-Welch algorithm provided above, the computation of the expected sufficient statistics depends on computing the following term for all i and j in Omega_X.
        xi_t(i,j) = Pr( X_t=q_i, X_t+1=q_j | sigma, lambda)
These computations in turn rely on computing the forward and backward variables (the alpha's and beta's).
                       alpha_t(i) a_ij(o_t+1) beta_t+1(j)
        xi_t(i,j) =  -------------------------------------
                              Pr(sigma | lambda)
Generally, the forward and backward variables are computed using the forward-backward procedure which uses dynamic programming to compute the variables in time polynomial in |Omega_X|, |Omega_O|, and T. In the following paragraphs, we show how the xi's can be computed using standard Bayesian network inference algorithms in the same big-Oh complexity. One advantage of this approach is that it extends easily to the case in which the hidden part of the model is factored into some number of state variables.

In the network shown in Figure 1 the O_t's are known. In particular, we have that O_i=o_i. If we assign the O_t's accordingly, use the probabilities indicated by lambda, and apply a standard Bayesian network inference algorithm, we obtain for every X_t the posterior distribution Pr(X_t|sigma,lambda). This isn't exactly what we need since X_t and X_t+1 are clearly not independent. If they were independent, then we could obtain Pr(X_t,X_t+1|sigma,lambda) from the product of Pr(X_t|sigma,lambda) and Pr(X_t+1|sigma,lambda). There are a number of remedies but one approach which is graphically intuitive involves adding a new state variable (X_t,X_t+1) which is the obvious deterministic function X_t and X_t+1. This addition results in the network shown in Figure 2.

             X_0,X_1 X_1,X_2 X_2,X_3    ...          X_T-1,X_T    
                o       o       o               o       o       
              / ^     / ^     / ^               ^     / ^       
            /   |   /   |   /   |               |   /   |       
          /     | /     | /     |               | /     |       
        o ----> o ----> o ----> o       ...     o ----> o       
        |       |       |       |               |       |       
        |       |       |       |               |       |       
        v       v       v       v               v       v       
        o       o       o       o       ...     o       o  
        O_0     O_1     O_2     O_3             O_T-1   O_T     

        Fig. 2: Bayesian network with joint variables, (X_t,X_t+1)
If we update the network in Figure 2 with O_t=o_t then we obtain Pr((X_t,X_t+1)|sigma,lambda) directly. It should be clear that we can eliminate the X_t (except for X_0) to obtain the singly connected network shown in Figure 3 which can be updated in time polynomial in |Omega_X|, |Omega_O|, and T.
       X_0   X_0,X_1 X_1,X_2 X_2,X_3    ...          X_T-1,X_T    
        o ----> o ----> o ----> o               o ----> o       
        |       |       |       |               |       |       
        |       |       |       |               |       |       
        v       v       v       v               v       v       
        o       o       o       o       ...     o       o  
        O_0     O_1     O_2     O_3             O_T-1   O_T     

        Fig. 3: Bayesian network with X_t's eliminated
The extension to HMMs with factored state spaces (e.g., see Figure 4) is graphically straightforward. The computational picture is more complicated and depends on the specifics of the update algorithm. It is important to point out, however, that there is a wide range of update algorithms, both approximate and exact, to choose from.
   X_1  o ----> o ----> o ----> o       ...     o ----> o 
          \       \       \                       \     
            \       \       \                       \   
              \       \       \                       \ 
   X_2  o ----> o ----> o ----> o       ...     o ----> o 
        |       |       |       |               |       |  
        |       |       |       |               |       |  
        v       v       v       v               v       v  
    O   o       o       o       o       ...     o       o  

        Fig. 4: Bayesian network representation for an HMM with
        factored state space Omega_X = Omega_X_1 times Omega_X_2.
	The state variable is two dimensional X_t = X_{1,t},X_{2,t}.

References

See [Rabiner & Juang, 1986] and [Rabiner, 1989] for a general introduction and applications in speech. See [Charniak, 1993] for applications in natural language processing including part of speech tagging. Charniak [1993] provides lots of examples that provide useful insight. See [Rabiner, 1989] and [Fraser & Dimitriadis, 1994] for details regarding numerical issues that arise in implementing the above algorithm. Rabiner and Juang [1986] also discuss variant algorithms for continuous observation spaces using multivariate Gaussian models.
Back to Tutorial