1. Introduction to HMMs
1.1 Markov Assumption
Hidden Markov Models (HMMs) are probabilistic models used for representing sequential data where the underlying process that generates the observations is hidden.
The Markov assumption (also called Limited Horizon) is fundamental to HMMs and states that the probability of the current state depends only on the previous state, not on the entire history of states.
Mathematically, this is expressed as:

P(X_t | X_1, X_2, ..., X_{t-1}) = P(X_t | X_{t-1})

Where:
- X_t is the state at time t
- X_{t-1} is the state at time t-1

The Markov assumption allows us to calculate the joint probability of a sequence of events as:

P(X_1, X_2, ..., X_T) = P(X_1) · P(X_2 | X_1) · P(X_3 | X_2) · ... · P(X_T | X_{T-1})
Importance of the Markov Assumption
The Markov assumption is crucial for HMMs because it:
- Makes calculations tractable by limiting the dependencies
- Reduces data sparseness when estimating parameters
- Allows for efficient algorithms like Viterbi
Example: Weather Prediction
Consider predicting tomorrow's weather based on historical observations:
- Without Markov assumption: P(w_t | w_1, w_2, ..., w_{t-1}), which conditions on the entire history of weather
- With Markov assumption: P(w_t | w_{t-1}), which conditions only on yesterday's weather
The simplification makes the model more manageable, requiring far less data to estimate parameters.
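The saving can be sketched in a few lines: under the Markov assumption, scoring a whole weather sequence needs only an initial distribution and a table of pairwise transition probabilities. The numbers below are made up for illustration.

```python
# Made-up parameters for the two-state weather example.
initial = {"rainy": 0.4, "cloudy": 0.6}
transition = {
    ("rainy", "rainy"): 0.7, ("rainy", "cloudy"): 0.3,
    ("cloudy", "rainy"): 0.5, ("cloudy", "cloudy"): 0.5,
}

def sequence_probability(states):
    # P(X_1, ..., X_T) = P(X_1) * product over t of P(X_t | X_{t-1})
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[(prev, cur)]
    return p

print(sequence_probability(["cloudy", "rainy", "rainy"]))  # 0.6 * 0.5 * 0.7
```

With N states the model needs only on the order of N² transition parameters, regardless of sequence length, which is what makes estimation feasible from modest amounts of data.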
1.2 Hidden vs. Observable States
Markov Chains vs. Hidden Markov Models
Markov Chains:
- States are fully observable (you can directly see what state the system is in)
- Transitions between states are probabilistic
- Can be viewed as a probabilistic finite-state automaton
[Figure: the Markov chain for weather prediction with two states (rainy and cloudy) and transition probabilities between them]
Hidden Markov Models:
- States are hidden (not directly observable)
- We only observe emissions or observations that are probabilistically related to the hidden states
- No one-to-one mapping between observations and hidden states
- A particular observation can be emitted from multiple different hidden states
[Figure: the time-elapsed view of an HMM with hidden states and observations]
Formal Structure
In an HMM:
- Hidden states: The underlying states of the system that we cannot directly observe
- Observations: The visible outputs that depend probabilistically on the hidden states
- Transition probabilities: The probability of moving from one hidden state to another
- Emission probabilities: The probability of an observation being generated from a particular hidden state
Key Difference
The key difference is that in Markov Chains, we know exactly which state the system is in at any time, while in HMMs, we need to infer the most likely sequence of hidden states based on the observations.
1.3 Applications of HMMs
HMMs are widely used in scenarios where:
- We have sequential data
- The underlying process is hidden or not directly observable
- We want to infer the hidden process from observable outputs
Common Applications
Speech Recognition
- Observations: Audio signal (acoustic features)
- Hidden States: Phonemes or words
- Task: Determine the most likely sequence of words given the acoustic signal
Part-of-Speech Tagging
- Observations: Words in a sentence
- Hidden States: Part-of-speech tags (noun, verb, adjective, etc.)
- Task: Assign the correct grammatical category to each word
Machine Translation
- Observations: Target words (translated text)
- Hidden States: Source words (original text)
- Task: Find the most likely translation
Bioinformatics
- Observations: DNA/protein sequences
- Hidden States: Structural or functional elements
- Task: Identify functional regions in biological sequences
Example: The Fraudulent Croupier Scenario
A classic example used to explain HMMs involves a dishonest casino croupier:
- A croupier secretly switches between a fair dice and a loaded dice
- The fair dice has equal probability (1/6) for each number
- The loaded dice has a biased probability distribution
- An observer can only see the sequence of dice rolls (observations)
- The task is to determine when the croupier was using which dice (hidden states)
[Figure: the dice HMM with states, observations, and probabilities]
In this example:
- Hidden States: The type of dice being used (fair or loaded)
- Observations: The numbers rolled (1-6)
- Transition Probabilities: How likely the croupier is to switch dice
- Emission Probabilities: The probability of rolling each number with each dice
This scenario demonstrates how HMMs can be used to infer hidden processes (which dice is being used) from observable data (the dice rolls).
2. Structure of HMMs
2.1 State Transition Probability Matrix
The state transition probability matrix (or simply transition matrix) is a fundamental component of an HMM, represented as matrix A. It defines the probability of moving from one state to another.
For a first-order HMM with N emitting states, plus special start and end states, the transition matrix A is of size (N+2) × (N+2).

Where:
- a_ij represents the probability of transitioning from state s_i to state s_j
- Formally: a_ij = P(X_t = s_j | X_{t-1} = s_i)
Properties of Transition Probabilities
Row Summation: For each state s_i, the sum of all outgoing transition probabilities must equal 1:

∑_j a_ij = 1 for every i

Non-negativity: All transition probabilities must be non-negative:

a_ij ≥ 0 for all i, j
Markov Property: The probability depends only on the current state, not on how the model arrived at that state.
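These properties are easy to check mechanically. A minimal sketch for the two-dice HMM, with made-up transition values:

```python
# Illustrative transition matrix for the two-dice HMM (made-up numbers).
# Rows are "from" states in the order [F, L]; columns are "to" states.
A = [
    [0.95, 0.05],  # from F: stay with the fair dice / switch to loaded
    [0.10, 0.90],  # from L: switch to fair / stay with the loaded dice
]

for row in A:
    assert all(p >= 0 for p in row)        # non-negativity
    assert abs(sum(row) - 1.0) < 1e-9      # each row sums to 1
```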
Example: Dice HMM
In the fraudulent croupier example with two dice (fair and loaded):
- a_FF = probability of using the fair dice again after using the fair dice
- a_FL = probability of switching from fair to loaded dice
- a_LF = probability of switching from loaded to fair dice
- a_LL = probability of using the loaded dice again after using the loaded dice

These probabilities can be estimated from training data:

â_ij = C(s_i → s_j) / C(s_i)
2.2 Emission Probability Matrix
The emission probability matrix (also called observation likelihood matrix) is represented as matrix B. It defines the probability of observing a particular output given that the model is in a specific hidden state.
For an HMM with N emitting states and M possible observations, plus special start and end symbols, the emission matrix B is of size (M+2) × (N+2).

Where:
- b_i(k_j) represents the probability of emitting observation k_j when in state s_i
- Formally: b_i(k_j) = P(O_t = k_j | X_t = s_i)
Output Independence Property
An important property of HMMs is output independence, which states that the probability of an output observation depends only on the current state:

P(O_t | X_1, ..., X_T, O_1, ..., O_{t-1}, O_{t+1}, ..., O_T) = P(O_t | X_t)
This means that the current observation is conditionally independent of all other states and observations given the current state.
Estimation of Emission Probabilities
Emission probabilities can be estimated from training data:

b̂_i(k) = C(s_i, k) / C(s_i)
Example: Dice HMM
In the fraudulent croupier scenario:
For the fair dice (state F):

b_F(k) = 1/6 for each k in {1, 2, 3, 4, 5, 6}

For the loaded dice (state L):
- The probabilities would have a different distribution, reflecting the bias of the loaded dice (for example, a higher probability of rolling a 6).
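Written out explicitly, the two emission distributions might look as follows; the loaded dice's bias toward 6 is a made-up illustration, not a value from the source:

```python
from fractions import Fraction

# Emission distributions for the two dice; exact arithmetic via Fraction.
b_fair = {k: Fraction(1, 6) for k in range(1, 7)}     # uniform over 1..6
b_loaded = {k: Fraction(1, 10) for k in range(1, 6)}  # 1..5 equally unlikely
b_loaded[6] = Fraction(1, 2)                          # 6 is heavily favoured

# Both rows of the emission matrix are valid probability distributions:
assert sum(b_fair.values()) == 1
assert sum(b_loaded.values()) == 1
```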
2.3 Start and End States
HMMs typically include a special start state (s_0) and a special end state (s_f) to explicitly model the beginning and end of observation sequences.

Start State (s_0)
- Not associated with "real" observations
- Transitions from s_0 to other states (represented by a_0j) define the initial state distribution
- No transitions into the start state are defined (all a_i0 are undefined)

End State (s_f)
- Not associated with "real" observations
- Transitions from other states to s_f (represented by a_if) define the probability of sequence termination
- No transitions out of the end state are defined (all a_fj are undefined)

Special Observations
Similarly, special symbols are defined:
- k_0: special start symbol, emitted only by the start state
- k_f: special end symbol, emitted only by the end state
Complete HMM Definition
With these components defined, a complete first-order HMM is formally specified by:
- A set of emitting states S = {s_1, ..., s_N}
- Special start state s_0 and end state s_f
- An output alphabet of observations K = {k_1, ..., k_M}
- Special start symbol k_0 and end symbol k_f
- State transition probability matrix A
- Emission probability matrix B

The HMM can be denoted as μ = (A, B) with these parameters.
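The components above can be collected into a single structure. This is a minimal sketch with made-up dice parameters; the "&lt;s&gt;"/"&lt;/s&gt;" markers for the start and end states are our own convention, not the lecture's notation.

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list    # emitting states
    alphabet: list  # possible observations
    A: dict         # (from_state, to_state) -> transition probability
    B: dict         # (state, observation) -> emission probability

dice_hmm = HMM(
    states=["F", "L"],
    alphabet=[1, 2, 3, 4, 5, 6],
    A={
        ("<s>", "F"): 0.5, ("<s>", "L"): 0.5,  # initial distribution a_0j
        ("F", "F"): 0.90, ("F", "L"): 0.05, ("F", "</s>"): 0.05,
        ("L", "F"): 0.05, ("L", "L"): 0.90, ("L", "</s>"): 0.05,
    },
    B={**{("F", k): 1 / 6 for k in range(1, 7)},
       **{("L", k): 0.5 if k == 6 else 0.1 for k in range(1, 7)}},
)

# Outgoing transitions from each state (including the start state) sum to 1.
for s in ["<s>", "F", "L"]:
    total = sum(p for (i, _), p in dice_hmm.A.items() if i == s)
    assert abs(total - 1.0) < 1e-9
```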
Visual Representation
[Figure: the dice HMM with start state, end state, and transitions between states]
In this diagram, a_0F and a_0L represent the probabilities of starting with the fair or loaded dice, while a_Ff and a_Lf represent the probabilities of ending the sequence when in each state.
3. HMM Problems and Algorithms
3.1 Problem 1: Labelled Learning (Parameter Estimation)
Labelled learning is the problem of estimating the parameters of an HMM (transition and emission probabilities) when we have access to both observation sequences and their corresponding hidden state sequences.
Input and Output
- Input: Parallel sequences of observations and states (dual tape)
- Output: HMM parameters (transition matrix A and emission matrix B)
This is the simplest of the HMM learning problems because we can directly count transitions and emissions from the labelled data.
Parameter Estimation Formulas
The transition probabilities are estimated using:

â_ij = C(s_i → s_j) / C(s_i)

Where:
- C(s_i → s_j) is the number of times state s_i is followed by state s_j
- C(s_i) is the total number of times state s_i appears

The emission probabilities are estimated using:

b̂_i(k) = C(s_i, k) / C(s_i)

Where:
- C(s_i, k) is the number of times observation k is emitted from state s_i
- C(s_i) is the total number of times state s_i appears
Example: Dice HMM Parameter Estimation
Given the following labelled sequence:
States: F F F F L L L F F F F L L L L F F F L L
Observations: 1 3 4 5 6 6 5 1 2 3 1 4 3 5 4 1 2 6 1 2
We can count:
- Transitions: F→F, F→L, L→L, L→F
- Emissions: (F,1), (F,2), (F,3), ..., (L,1), (L,2), ...
For example, to calculate a_FL (probability of transitioning from fair to loaded dice), F is followed by another state 11 times, and 3 of those transitions go to L:

â_FL = C(F → L) / C(F) = 3/11

Similarly, to calculate b_L(6) (probability of rolling a 6 with the loaded dice), L appears 9 times and emits a 6 twice:

b̂_L(6) = C(L, 6) / C(L) = 2/9
Smoothing
In practice, add-one smoothing (or other smoothing techniques) is often applied to handle zero counts and avoid zero probabilities.
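The counting procedure above can be sketched directly on the example sequence, using unsmoothed relative frequencies and exact fractions:

```python
from collections import Counter
from fractions import Fraction

states = "F F F F L L L F F F F L L L L F F F L L".split()
obs = [int(x) for x in "1 3 4 5 6 6 5 1 2 3 1 4 3 5 4 1 2 6 1 2".split()]

trans = Counter(zip(states, states[1:]))  # C(s_i -> s_j)
emit = Counter(zip(states, obs))          # C(s_i, k): k emitted from s_i
from_counts = Counter(states[:-1])        # times s_i is followed by a state
state_counts = Counter(states)            # times s_i appears at all

a_FL = Fraction(trans[("F", "L")], from_counts["F"])  # 3/11
b_L6 = Fraction(emit[("L", 6)], state_counts["L"])    # 2/9
print(a_FL, b_L6)
```

Add-one smoothing would simply add 1 to every count in `trans` and `emit` and enlarge the denominators accordingly.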
3.2 Problem 2: Unlabelled Learning
Unlabelled learning is the problem of estimating the HMM parameters when we only have access to observation sequences, without the corresponding hidden state sequences.
This is a more challenging problem since we don't know which states generated which observations.
Solution Approach
The standard algorithm for unlabelled learning is the Baum-Welch algorithm, which is a special case of the Expectation-Maximization (EM) algorithm:
- Expectation step: Compute the expected counts of transitions and emissions based on current parameter estimates
- Maximization step: Re-estimate the parameters to maximize the likelihood of the observations
The algorithm iterates between these steps until convergence, typically reaching a local optimum.
Challenges
- Requires initialization of parameters
- Converges to local optima, not necessarily global
- More computationally intensive than labelled learning
- May require multiple random restarts with different initializations
3.3 Problem 3: Likelihood Calculation
Likelihood calculation is the problem of determining the probability of an observation sequence given an HMM: P(O | μ).
This calculation is useful for:
- Comparing different HMMs
- Evaluating how well an HMM explains new data
- Component in the Baum-Welch algorithm
Forward Algorithm
The standard approach uses the Forward algorithm, which efficiently computes the likelihood using dynamic programming:
- Define forward variables α_t(j) = P(o_1, ..., o_t, X_t = s_j | μ), the probability of observing the partial sequence o_1, ..., o_t and ending in state s_j
- Recursively calculate these variables
- Sum over all possible final states to get the total likelihood
The algorithm has complexity O(N²T), where N is the number of states and T is the sequence length.
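A compact sketch of the Forward algorithm for the two-dice HMM; the parameter values are made up, and the explicit end state is omitted for brevity:

```python
STATES = ["F", "L"]
INIT = {"F": 0.5, "L": 0.5}  # a_0j: transitions out of the start state
A = {("F", "F"): 0.9, ("F", "L"): 0.1,
     ("L", "F"): 0.1, ("L", "L"): 0.9}
B = {("F", k): 1 / 6 for k in range(1, 7)}
B.update({("L", k): 0.5 if k == 6 else 0.1 for k in range(1, 7)})

def forward_likelihood(observations):
    # alpha[j] = P(o_1..o_t, X_t = j); updated column by column, O(N^2 T).
    alpha = {j: INIT[j] * B[(j, observations[0])] for j in STATES}
    for o in observations[1:]:
        alpha = {j: sum(alpha[i] * A[(i, j)] for i in STATES) * B[(j, o)]
                 for j in STATES}
    return sum(alpha.values())  # sum over all possible final states

print(forward_likelihood([6, 6, 6]))
```

For long sequences these probabilities underflow quickly; real implementations work in log space or rescale each column.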
3.4 Problem 4: Decoding (Viterbi Algorithm)
Decoding is the problem of finding the most likely sequence of hidden states that could have generated a given observation sequence.
Formally, we want to find:

X̂ = argmax_X P(X | O, μ)

Which, by Bayes' rule (the denominator P(O | μ) is constant over X), expands to:

X̂ = argmax_X P(O | X, μ) · P(X | μ)
Dynamic Programming Solution
The search space of all possible state sequences is O(N^T), making brute force search impractical. Instead, we use the Viterbi algorithm, which is a dynamic programming approach.
The Viterbi algorithm works because HMMs satisfy two key properties:
- Optimal substructure: The best path to a state contains within it optimal sub-paths
- Overlapping subproblems: Calculations for later states reuse calculations from earlier states
The Viterbi Algorithm
The algorithm uses a trellis data structure to memoize intermediate results. The trellis is a grid where:
- Rows represent states
- Columns represent time steps
Key Variables
- Viterbi Probability δ_t(j): The probability of the most likely path ending in state s_j at time t
- Backpointer ψ_t(j): The previous state on the most likely path ending in state s_j at time t
Algorithm Steps
1. Initialization (t=0):
- Initialize the Viterbi probability for the start state: δ_0(0) = 1
- Initialize all other Viterbi probabilities to 0: δ_0(j) = 0 for j ≠ 0
2. Recursion (t=1 to T):
- For each state s_j and time t, compute:

δ_t(j) = max_i [δ_{t-1}(i) · a_ij] · b_j(o_t)
ψ_t(j) = argmax_i [δ_{t-1}(i) · a_ij]

3. Termination:
- Compute the probability of the most likely path ending in the final state:

P(X̂, O | μ) = max_i [δ_T(i) · a_if]
ψ_{T+1}(f) = argmax_i [δ_T(i) · a_if]

4. Backtracing:
- Start at the end state: X̂_{T+1} = s_f
- For t = T down to 1, find the previous state: X̂_t = ψ_{t+1}(X̂_{t+1})
The result is the most likely state sequence X̂ = (X̂_1, ..., X̂_T).
Example: Dice HMM Decoding
Suppose we observe the sequence [4, 3, 5] and want to determine whether the fair or loaded dice was used at each step.
The trellis calculation would proceed as follows:
- Initialization: Start at the start state
- For t=1 (observation=4): Calculate δ_1(F) and δ_1(L) based on transitions from the start state
- For t=2 (observation=3): Calculate δ_2(F) and δ_2(L) based on the values from t=1
- For t=3 (observation=5): Calculate δ_3(F) and δ_3(L) based on the values from t=2
- Termination: Calculate transition to end state
- Backtracing: Follow the backpointers to determine the most likely sequence
The output would be the most likely sequence of dice used (F or L) that produced the observed numbers.
Complexity
The time complexity of the Viterbi algorithm is O(N²T), where:
- N is the number of states
- T is the length of the observation sequence
This is much more efficient than the brute force approach of checking all possible state sequences.
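The full algorithm fits comfortably in a short function. This sketch uses made-up two-dice parameters and omits the explicit end-state transition in the termination step for brevity:

```python
STATES = ["F", "L"]
INIT = {"F": 0.5, "L": 0.5}
A = {("F", "F"): 0.9, ("F", "L"): 0.1,
     ("L", "F"): 0.1, ("L", "L"): 0.9}
B = {("F", k): 1 / 6 for k in range(1, 7)}
B.update({("L", k): 0.5 if k == 6 else 0.1 for k in range(1, 7)})

def viterbi(observations):
    # delta[j]: probability of the best path ending in state j at time t.
    delta = {j: INIT[j] * B[(j, observations[0])] for j in STATES}
    backpointers = []
    for o in observations[1:]:
        psi, new_delta = {}, {}
        for j in STATES:
            best = max(STATES, key=lambda i: delta[i] * A[(i, j)])
            psi[j] = best                                 # argmax predecessor
            new_delta[j] = delta[best] * A[(best, j)] * B[(j, o)]
        backpointers.append(psi)
        delta = new_delta
    # Termination: pick the best final state, then backtrace.
    state = max(STATES, key=lambda j: delta[j])
    path = [state]
    for psi in reversed(backpointers):
        path.append(psi[path[-1]])
    return list(reversed(path))

print(viterbi([6, 6, 1, 6, 6]))
```

Each time step does N² work (every state considers every predecessor), giving the O(N²T) bound stated above.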
4. Evaluation and Testing
4.1 Precision and Recall
When evaluating HMM performance, we often need more nuanced metrics than overall accuracy, especially when the classes are imbalanced or we're particularly interested in one type of state.
Precision and recall are evaluation metrics imported from information retrieval that provide a more detailed assessment of performance for specific classes of interest.
Why Not Just Use Accuracy?
Accuracy measures the overall correctness of predictions:

Accuracy = (number of correct predictions) / (total number of predictions)
However, accuracy has limitations:
- It can be misleading when classes are imbalanced
- It doesn't distinguish between different types of errors
- It doesn't focus on the class of interest
Precision, Recall and F-measure
Given a confusion matrix:

|  | System says: F | System says: L | Total |
|---|---|---|---|
| Truth is: F | a | b | a+b |
| Truth is: L | c | d | c+d |
| Total | a+c | b+d | a+b+c+d |
The key metrics are:

Precision of L: The proportion of instances classified as L that are actually L:

P_L = d / (b + d)

Recall of L: The proportion of actual L instances that are correctly classified as L:

R_L = d / (c + d)

F-measure of L: The harmonic mean of precision and recall:

F_L = (2 · P_L · R_L) / (P_L + R_L)
Example: Evaluating Dice HMM
Consider a test set where our HMM tries to determine whether the fair (F) or loaded (L) dice was used:
|  | Predicted: F | Predicted: L | Total |
|---|---|---|---|
| Actual: F | 45 | 5 | 50 |
| Actual: L | 10 | 40 | 50 |
| Total | 55 | 45 | 100 |
Calculating metrics for the loaded dice (L):
- Precision: P_L = 40 / (5 + 40) = 40/45 ≈ 0.889 (88.9% of predictions of L are correct)
- Recall: R_L = 40 / (10 + 40) = 40/50 = 0.80 (80% of actual L instances are detected)
- F-measure: F_L = (2 · 0.889 · 0.80) / (0.889 + 0.80) ≈ 0.842
- Accuracy: (45 + 40) / 100 = 0.85 (85% overall correct predictions)
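The arithmetic can be checked in a few lines, using the cell names from the confusion matrix:

```python
# Confusion-matrix cells: rows (truth F, truth L), columns (predicted F, predicted L).
a, b, c, d = 45, 5, 10, 40

precision_L = d / (b + d)             # 40/45
recall_L = d / (c + d)                # 40/50
f_L = 2 * precision_L * recall_L / (precision_L + recall_L)
accuracy = (a + d) / (a + b + c + d)  # 85/100

print(round(precision_L, 3), round(recall_L, 3), round(f_L, 3), accuracy)
```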
4.2 Baseline Algorithms
A baseline algorithm is a simpler alternative to your proposed solution that serves as a comparison point. Baselines help quantify how much improvement your model provides over simpler approaches.
Importance of Baselines
Baselines serve several crucial purposes:
- They establish a minimum performance level that any useful algorithm should exceed
- They help contextualize how "good" your results actually are
- They demonstrate the value added by your more complex approach
Criteria for Effective Baselines
A good baseline should:
- Require considerably less effort than your proposed solution
- Be easily understandable to other researchers
- Have "fair" access to a reasonable amount of information
- Be strong enough to be meaningful but not so strong as to be nearly equivalent to your solution
Common Baseline Approaches
- Most Frequent Category (MFC): Always predict the most common class in the training data
- Random Prediction: Randomly assign states based on their observed distribution
- Simple Rules: Apply straightforward heuristics (e.g., always predict state X after observation Y)
- Existing Implementations: Use previously published algorithms as comparison points
Baseline for HMM Decoding
For HMM decoding, a common baseline is random guessing of hidden states:
- Only knows how often the HMM is in each state on average (the observed distribution of states)
- Does not use information about sequences or observations
- For each position, randomly selects a state according to the state's overall frequency
This baseline tests how much the sequential modeling in an HMM improves prediction compared to simply knowing state frequencies.
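A sketch of this baseline; the training sequence is the labelled dice example used elsewhere in these notes, and the seed is arbitrary:

```python
import random
from collections import Counter

def random_baseline(train_states, test_length, seed=0):
    # Sample each position independently from the observed state frequencies,
    # ignoring both the observations and the sequential structure.
    counts = Counter(train_states)
    labels = list(counts)
    weights = [counts[s] for s in labels]
    rng = random.Random(seed)
    return [rng.choices(labels, weights=weights)[0] for _ in range(test_length)]

train = "F F F F L L L F F F F L L L L F F F L L".split()
print(random_baseline(train, 10))
```

Because it never looks at the dice rolls, any gain the HMM shows over this baseline is attributable to the transition and emission structure it models.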
4.3 Statistical Significance Testing
When comparing two systems (e.g., your HMM vs. a baseline), observed performance differences might be due to chance rather than actual superiority of one approach. Statistical significance testing helps determine whether observed differences are likely to be meaningful.
Null Hypothesis Testing
The framework is based on testing the null hypothesis that two result sets come from the same distribution (i.e., the systems are equally good).
Steps in statistical testing:
- Choose a significance level (α), typically 0.01 or 0.05
- Calculate a test statistic and its corresponding p-value
- If p-value < α, reject the null hypothesis with confidence 1-α (95% or 99%)
The Sign Test
The sign test is a non-parametric, paired test that's particularly suitable for comparing classification systems:
For each test instance, determine which system performs better:
- Positive outcome: System 1 beats System 2
- Negative outcome: System 2 beats System 1
- Tie: Both systems perform equally
Under the null hypothesis (systems are equal), the probability of positive and negative outcomes should be equal (0.5)
The counts follow a binomial distribution B(N,q) where:
- N is the total number of non-tie comparisons
- q is the probability of a negative outcome (0.5 under the null hypothesis)
Binomial Distribution
The probability of observing exactly k negative events out of N is:

P(X = k) = C(N, k) · q^k · (1-q)^(N-k)

The probability of observing at most k negative events is:

P(X ≤ k) = ∑_{i=0}^{k} C(N, i) · q^i · (1-q)^(N-i)
One-Tailed vs. Two-Tailed Tests
- One-tailed test: Tests if System 1 is better than System 2 (directional)
- Two-tailed test: Tests if the systems are different, regardless of direction (non-directional)
For a two-tailed test, the p-value is 2×P(X ≤ k) for the smaller count.
Example: Sign Test
Suppose we compare our HMM against a baseline on 100 test instances:
- HMM better: 65 instances
- Baseline better: 35 instances
- Ties: 0 instances
Using the binomial distribution with N=100, k=35, q=0.5:
- P(X ≤ 35) ≈ 0.0018
Since 0.0018 < 0.01 (our chosen significance level), we can reject the null hypothesis and conclude that the HMM is significantly better than the baseline with 99% confidence.
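The p-value in this example can be computed exactly with the binomial formula, using only the standard library:

```python
from math import comb

def sign_test_p(k, n, q=0.5):
    # One-tailed p-value: P(X <= k) under Binomial(n, q).
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(k + 1))

p = sign_test_p(35, 100)
print(p < 0.01)  # significant at alpha = 0.01
```

For the two-tailed version, double the tail probability before comparing against α.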
Handling Ties
Ties (where both systems perform equally) are common in classification tasks. They can be handled by:
- Excluding ties from the analysis (reducing N)
- Adding 0.5 to both positive and negative counts for each tie
Reporting Significance
When reporting results, only claim statistical significance if your test indicates it:
- "The difference between System 1 and System 2 is statistically significant at α = 0.01"
- Raw accuracy differences without significance testing may be meaningless
5. Practical Applications
5.1 Fraudulent Croupier Example
The fraudulent croupier scenario is a classic example used to illustrate HMMs in action. This example demonstrates how HMMs can detect hidden states when only observations are visible.
Scenario Description
The scenario involves a dishonest casino croupier who secretly switches between dice:
- The croupier has multiple dice: a fair dice (F) and one or more loaded dice (L)
- The fair dice has equal probability (1/6) for each number
- The loaded dice have biased probability distributions
- When people bet on outcomes, the croupier secretly switches between dice
- An observer can only see the numbers rolled, not which dice is being used
- The goal is to detect when the croupier is cheating by using the loaded dice
HMM Formulation
This scenario can be modeled as an HMM where:
- Hidden states: The type of dice being used (F or L)
- Observations: The numbers rolled (1-6)
- Transition probabilities: How likely the croupier is to switch dice or continue using the same one
- Emission probabilities: The probability of rolling each number with each dice
[Figure: the dice HMM with states, observations, and transition probabilities]
Variant with Three Dice
In a more complex version (as seen in tutorial 04), the croupier has three dice:
- A fair dice (F)
- Two different loaded dice (L1 and L2)
The HMM structure becomes more complex with additional states and transitions.
Parameter Estimation Example
Given training data with known dice types:
States: F F F F L L L F F F F L L L L F F F L L
Observations: 1 3 4 5 6 6 5 1 2 3 1 4 3 5 4 1 2 6 1 2
We can estimate the transition probabilities (counting each state's transitions to the next emitting state):
- a_FF (staying with fair dice): 8/11
- a_FL (switching from fair to loaded): 3/11
- a_LL (staying with loaded dice): 6/8 = 3/4
- a_LF (switching from loaded to fair): 2/8 = 1/4
And the emission probabilities:
- For fair dice: b_F(1) = 4/11, b_F(2) = 2/11, ..., b_F(6) = 1/11
- For loaded dice: b_L(6) = 2/9, b_L(5) = 2/9, ..., b_L(1) = 1/9
Decoding with Viterbi
Given only an observation sequence [4, 3, 5], we can use the Viterbi algorithm to determine the most likely sequence of dice used:
- Construct a trellis with states (F, L) as rows and time steps as columns
- Fill in the probability of reaching each state at each time step
- Backtrace to find the most likely state sequence
The result might be [F, F, L], indicating the first two rolls were likely from the fair dice and the third from the loaded dice.
Variants and Constraints
The tutorial materials (tut04.pdf) discuss several interesting variants of the croupier's behavior:
- Never switching directly from F to L2: This constraint can be represented by setting a_{F,L2} = 0
- Always ending with fair dice: This constraint means a_{L1,f} = a_{L2,f} = 0 (only the fair dice can transition to the end state)
- Never using loaded dice more than twice in a row: This requires a second-order HMM to properly model
- Always switching after rolling a 6: This relates emissions to transitions and cannot be directly modeled in a standard HMM
5.2 Part-of-Speech Tagging
Part-of-Speech (POS) tagging is the task of assigning grammatical categories (noun, verb, adjective, etc.) to words in a text. This is a classic application of HMMs in natural language processing.
The Problem of Ambiguity
Many words in English and other languages are ambiguous regarding their part of speech:
- "can" → verb or auxiliary verb or noun
- "fish" → noun or verb
- "watch" → noun or verb
For example, the sentence "We can fish" is ambiguous:
- "We can fish" = "We are able to fish" (auxiliary verb + verb)
- "We can fish" = "We put fish in cans" (verb + noun)
A human understands the correct interpretation based on context, and an HMM can use probabilistic transitions to model this context.
HMM Formulation for POS Tagging
In the POS tagging HMM:
- Hidden states: The POS tags (noun, verb, pronoun, etc.)
- Observations: The words in the text
- Transition probabilities: The probability of one POS tag following another
- Emission probabilities: The probability of a word being generated by a particular POS tag
Example from Lecture Notes
The lecture presents an example with the sentence "We can fish." and an HMM with four states:
- s_1 = verb
- s_2 = noun
- s_3 = personal pronoun
- s_4 = auxiliary verb
The lecture also gives a transition probability matrix A over these states and an emission probability matrix B specifying, for each state, the probability of emitting each word ("we", "can", "fish"); the numeric values are not reproduced here.
Analyzing Two Interpretations
The notes examine two possible state sequences:
- X = (s_3, s_4, s_1): We = pronoun, can = auxiliary verb, fish = verb
- X' = (s_3, s_1, s_2): We = pronoun, can = verb, fish = noun
Each sequence is scored as a product of transition and emission probabilities:

P(X, O | μ) = a_{0,3} · b_3(we) · a_{3,4} · b_4(can) · a_{4,1} · b_1(fish) · a_{1,f}
P(X', O | μ) = a_{0,3} · b_3(we) · a_{3,1} · b_1(can) · a_{1,2} · b_2(fish) · a_{2,f}

Since P(X, O | μ) > P(X', O | μ), the model predicts the first interpretation: "We (pronoun) can (auxiliary verb) fish (verb)" = "We are able to fish."
Viterbi for POS Tagging
In practice, we use the Viterbi algorithm to find the most likely sequence of POS tags for a given sentence:
- For each word and each possible POS tag, compute the probability of the best path ending with that tag
- Keep track of the previous tag in the best path
- Backtrace from the end to recover the most likely sequence of tags
Importance in NLP
POS tagging is a fundamental task in natural language processing that serves as:
- A preprocessing step for parsing and semantic analysis
- An essential component in information extraction
- A disambiguation technique for machine translation
- A foundation for many higher-level NLP tasks
The success of HMMs in POS tagging demonstrates their power in modeling sequential data with hidden structure, even in complex domains like human language.