1. Introduction to HMMs
1.1 Markov Assumption
Hidden Markov Models (HMMs) are probabilistic models used for representing sequential data where the underlying process that generates the observations is hidden.
The Markov assumption (also called Limited Horizon) is fundamental to HMMs and states that the probability of the current state depends only on the previous state, not on the entire history of states.
Mathematically, this is expressed as:

P(X_t | X_1, X_2, ..., X_{t-1}) = P(X_t | X_{t-1})

Where:
- X_t is the state at time t
- X_{t-1} is the state at time t-1

The Markov assumption allows us to calculate the joint probability of a sequence of events as:

P(X_1, X_2, ..., X_T) = P(X_1) · P(X_2 | X_1) · P(X_3 | X_2) · ... · P(X_T | X_{T-1})
Importance of the Markov Assumption
The Markov assumption is crucial for HMMs because it:
- Makes calculations tractable by limiting the dependencies
- Reduces data sparseness when estimating parameters
- Allows for efficient algorithms like Viterbi
Example: Weather Prediction
Consider predicting tomorrow's weather based on historical observations:
- Without Markov assumption: P(w_t | w_1, w_2, ..., w_{t-1}), which conditions on the entire history of weather
- With Markov assumption: P(w_t | w_{t-1}), which conditions only on yesterday's weather
The simplification makes the model more manageable, requiring far less data to estimate parameters.
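The saving can be sketched in a few lines: under the Markov assumption, scoring a whole weather sequence needs only an initial distribution and a table of pairwise transition probabilities. The numbers below are made up for illustration.

```python
# Made-up parameters for the two-state weather example.
initial = {"rainy": 0.4, "cloudy": 0.6}
transition = {
    ("rainy", "rainy"): 0.7, ("rainy", "cloudy"): 0.3,
    ("cloudy", "rainy"): 0.5, ("cloudy", "cloudy"): 0.5,
}

def sequence_probability(states):
    # P(X_1, ..., X_T) = P(X_1) * product over t of P(X_t | X_{t-1})
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[(prev, cur)]
    return p

print(sequence_probability(["cloudy", "rainy", "rainy"]))  # 0.6 * 0.5 * 0.7
```

With N states the model needs only on the order of N² transition parameters, regardless of sequence length, which is what makes estimation feasible from modest amounts of data.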
1.2 Hidden vs. Observable States
Markov Chains vs. Hidden Markov Models
Markov Chains:
- States are fully observable (you can directly see what state the system is in)
- Transitions between states are probabilistic
- Can be viewed as a probabilistic finite-state automaton
[Figure: the Markov chain for weather prediction with two states (rainy and cloudy) and transition probabilities between them]
Hidden Markov Models:
- States are hidden (not directly observable)
- We only observe emissions or observations that are probabilistically related to the hidden states
- No one-to-one mapping between observations and hidden states
- A particular observation can be emitted from multiple different hidden states
[Figure: the time-elapsed view of an HMM with hidden states and observations]
Formal Structure
In an HMM:
- Hidden states: The underlying states of the system that we cannot directly observe
- Observations: The visible outputs that depend probabilistically on the hidden states
- Transition probabilities: The probability of moving from one hidden state to another
- Emission probabilities: The probability of an observation being generated from a particular hidden state
Key Difference
The key difference is that in Markov Chains, we know exactly which state the system is in at any time, while in HMMs, we need to infer the most likely sequence of hidden states based on the observations.
1.3 Applications of HMMs
HMMs are widely used in scenarios where:
- We have sequential data
- The underlying process is hidden or not directly observable
- We want to infer the hidden process from observable outputs
Common Applications
Speech Recognition
- Observations: Audio signal (acoustic features)
- Hidden States: Phonemes or words
- Task: Determine the most likely sequence of words given the acoustic signal
Part-of-Speech Tagging
- Observations: Words in a sentence
- Hidden States: Part-of-speech tags (noun, verb, adjective, etc.)
- Task: Assign the correct grammatical category to each word
Machine Translation
- Observations: Target words (translated text)
- Hidden States: Source words (original text)
- Task: Find the most likely translation
Bioinformatics
- Observations: DNA/protein sequences
- Hidden States: Structural or functional elements
- Task: Identify functional regions in biological sequences
Example: The Fraudulent Croupier Scenario
A classic example used to explain HMMs involves a dishonest casino croupier:
- A croupier secretly switches between a fair dice and a loaded dice
- The fair dice has equal probability (1/6) for each number
- The loaded dice has a biased probability distribution
- An observer can only see the sequence of dice rolls (observations)
- The task is to determine when the croupier was using which dice (hidden states)
[Figure: the dice HMM with states, observations, and probabilities]
In this example:
- Hidden States: The type of dice being used (fair or loaded)
- Observations: The numbers rolled (1-6)
- Transition Probabilities: How likely the croupier is to switch dice
- Emission Probabilities: The probability of rolling each number with each dice
This scenario demonstrates how HMMs can be used to infer hidden processes (which dice is being used) from observable data (the dice rolls).
2. Structure of HMMs
2.1 State Transition Probability Matrix
The state transition probability matrix (or simply transition matrix) is a fundamental component of an HMM, represented as matrix A. It defines the probability of moving from one state to another.
For a first-order HMM with N emitting states, plus special start and end states, the transition matrix A is of size (N+2) × (N+2).

Where:
- a_ij represents the probability of transitioning from state s_i to state s_j
- Formally: a_ij = P(X_t = s_j | X_{t-1} = s_i)
Properties of Transition Probabilities
Row Summation: For each state s_i, the sum of all outgoing transition probabilities must equal 1:

∑_j a_ij = 1 for every i

Non-negativity: All transition probabilities must be non-negative:

a_ij ≥ 0 for all i, j
Markov Property: The probability depends only on the current state, not on how the model arrived at that state.
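These properties are easy to check mechanically. A minimal sketch for the two-dice HMM, with made-up transition values:

```python
# Illustrative transition matrix for the two-dice HMM (made-up numbers).
# Rows are "from" states in the order [F, L]; columns are "to" states.
A = [
    [0.95, 0.05],  # from F: stay with the fair dice / switch to loaded
    [0.10, 0.90],  # from L: switch to fair / stay with the loaded dice
]

for row in A:
    assert all(p >= 0 for p in row)        # non-negativity
    assert abs(sum(row) - 1.0) < 1e-9      # each row sums to 1
```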
Example: Dice HMM
In the fraudulent croupier example with two dice (fair and loaded):
- a_FF = probability of using the fair dice again after using the fair dice
- a_FL = probability of switching from fair to loaded dice
- a_LF = probability of switching from loaded to fair dice
- a_LL = probability of using the loaded dice again after using the loaded dice

These probabilities can be estimated from training data:

â_ij = C(s_i → s_j) / C(s_i)
2.2 Emission Probability Matrix
The emission probability matrix (also called observation likelihood matrix) is represented as matrix B. It defines the probability of observing a particular output given that the model is in a specific hidden state.
For an HMM with N emitting states and M possible observations, plus special start and end symbols, the emission matrix B is of size (M+2) × (N+2).

Where:
- b_i(k_j) represents the probability of emitting observation k_j when in state s_i
- Formally: b_i(k_j) = P(O_t = k_j | X_t = s_i)
Output Independence Property
An important property of HMMs is output independence, which states that the probability of an output observation depends only on the current state:

P(O_t | X_1, ..., X_T, O_1, ..., O_{t-1}, O_{t+1}, ..., O_T) = P(O_t | X_t)
This means that the current observation is conditionally independent of all other states and observations given the current state.
Estimation of Emission Probabilities
Emission probabilities can be estimated from training data:

b̂_i(k) = C(s_i, k) / C(s_i)
Example: Dice HMM
In the fraudulent croupier scenario:
For the fair dice (state F):

b_F(k) = 1/6 for each k in {1, 2, 3, 4, 5, 6}

For the loaded dice (state L):
- The probabilities would have a different distribution, reflecting the bias of the loaded dice (for example, a higher probability of rolling a 6).
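Written out explicitly, the two emission distributions might look as follows; the loaded dice's bias toward 6 is a made-up illustration, not a value from the source:

```python
from fractions import Fraction

# Emission distributions for the two dice; exact arithmetic via Fraction.
b_fair = {k: Fraction(1, 6) for k in range(1, 7)}     # uniform over 1..6
b_loaded = {k: Fraction(1, 10) for k in range(1, 6)}  # 1..5 equally unlikely
b_loaded[6] = Fraction(1, 2)                          # 6 is heavily favoured

# Both rows of the emission matrix are valid probability distributions:
assert sum(b_fair.values()) == 1
assert sum(b_loaded.values()) == 1
```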
2.3 Start and End States
HMMs typically include a special start state (s_0) and a special end state (s_f) to explicitly model the beginning and end of observation sequences.

Start State (s_0)
- Not associated with "real" observations
- Transitions from s_0 to other states (represented by a_0j) define the initial state distribution
- No transitions into the start state are defined (all a_i0 are undefined)

End State (s_f)
- Not associated with "real" observations
- Transitions from other states to s_f (represented by a_if) define the probability of sequence termination
- No transitions out of the end state are defined (all a_fj are undefined)

Special Observations
Similarly, special symbols are defined:
- k_0: special start symbol, emitted only by the start state
- k_f: special end symbol, emitted only by the end state
Complete HMM Definition
With these components defined, a complete first-order HMM is formally specified by:
- A set of emitting states S = {s_1, ..., s_N}
- Special start state s_0 and end state s_f
- An output alphabet of observations K = {k_1, ..., k_M}
- Special start symbol k_0 and end symbol k_f
- State transition probability matrix A
- Emission probability matrix B

The HMM can be denoted as μ = (A, B) with these parameters.
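The components above can be collected into a single structure. This is a minimal sketch with made-up dice parameters; the "&lt;s&gt;"/"&lt;/s&gt;" markers for the start and end states are our own convention, not the lecture's notation.

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list    # emitting states
    alphabet: list  # possible observations
    A: dict         # (from_state, to_state) -> transition probability
    B: dict         # (state, observation) -> emission probability

dice_hmm = HMM(
    states=["F", "L"],
    alphabet=[1, 2, 3, 4, 5, 6],
    A={
        ("<s>", "F"): 0.5, ("<s>", "L"): 0.5,  # initial distribution a_0j
        ("F", "F"): 0.90, ("F", "L"): 0.05, ("F", "</s>"): 0.05,
        ("L", "F"): 0.05, ("L", "L"): 0.90, ("L", "</s>"): 0.05,
    },
    B={**{("F", k): 1 / 6 for k in range(1, 7)},
       **{("L", k): 0.5 if k == 6 else 0.1 for k in range(1, 7)}},
)

# Outgoing transitions from each state (including the start state) sum to 1.
for s in ["<s>", "F", "L"]:
    total = sum(p for (i, _), p in dice_hmm.A.items() if i == s)
    assert abs(total - 1.0) < 1e-9
```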
Visual Representation
[Figure: the dice HMM with start state, end state, and transitions between states]
In this diagram, a_0F and a_0L represent the probabilities of starting with the fair or loaded dice, while a_Ff and a_Lf represent the probabilities of ending the sequence when in each state.
3. HMM Problems and Algorithms
3.1 Problem 1: Labelled Learning (Parameter Estimation)
Labelled learning is the problem of estimating the parameters of an HMM (transition and emission probabilities) when we have access to both observation sequences and their corresponding hidden state sequences.
Input and Output
- Input: Parallel sequences of observations and states (dual tape)
- Output: HMM parameters (transition matrix A and emission matrix B)
This is the simplest of the HMM learning problems because we can directly count transitions and emissions from the labelled data.
Parameter Estimation Formulas
The transition probabilities are estimated using:

â_ij = C(s_i → s_j) / C(s_i)

Where:
- C(s_i → s_j) is the number of times state s_i is followed by state s_j
- C(s_i) is the total number of times state s_i appears

The emission probabilities are estimated using:

b̂_i(k) = C(s_i, k) / C(s_i)

Where:
- C(s_i, k) is the number of times observation k is emitted from state s_i
- C(s_i) is the total number of times state s_i appears
Example: Dice HMM Parameter Estimation
Given the following labelled sequence:
States: F F F F L L L F F F F L L L L F F F L L
Observations: 1 3 4 5 6 6 5 1 2 3 1 4 3 5 4 1 2 6 1 2
We can count:
- Transitions: F→F, F→L, L→L, L→F
- Emissions: (F,1), (F,2), (F,3), ..., (L,1), (L,2), ...
For example, to calculate a_FL (probability of transitioning from fair to loaded dice), F is followed by another state 11 times, and 3 of those transitions go to L:

â_FL = C(F → L) / C(F) = 3/11

Similarly, to calculate b_L(6) (probability of rolling a 6 with the loaded dice), L appears 9 times and emits a 6 twice:

b̂_L(6) = C(L, 6) / C(L) = 2/9
Smoothing
In practice, add-one smoothing (or other smoothing techniques) is often applied to handle zero counts and avoid zero probabilities.
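The counting procedure above can be sketched directly on the example sequence, using unsmoothed relative frequencies and exact fractions:

```python
from collections import Counter
from fractions import Fraction

states = "F F F F L L L F F F F L L L L F F F L L".split()
obs = [int(x) for x in "1 3 4 5 6 6 5 1 2 3 1 4 3 5 4 1 2 6 1 2".split()]

trans = Counter(zip(states, states[1:]))  # C(s_i -> s_j)
emit = Counter(zip(states, obs))          # C(s_i, k): k emitted from s_i
from_counts = Counter(states[:-1])        # times s_i is followed by a state
state_counts = Counter(states)            # times s_i appears at all

a_FL = Fraction(trans[("F", "L")], from_counts["F"])  # 3/11
b_L6 = Fraction(emit[("L", 6)], state_counts["L"])    # 2/9
print(a_FL, b_L6)
```

Add-one smoothing would simply add 1 to every count in `trans` and `emit` and enlarge the denominators accordingly.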
3.2 Problem 2: Unlabelled Learning
Unlabelled learning is the problem of estimating the HMM parameters when we only have access to observation sequences, without the corresponding hidden state sequences.
This is a more challenging problem since we don't know which states generated which observations.
Solution Approach
The standard algorithm for unlabelled learning is the Baum-Welch algorithm, which is a special case of the Expectation-Maximization (EM) algorithm:
- Expectation step: Compute the expected counts of transitions and emissions based on current parameter estimates
- Maximization step: Re-estimate the parameters to maximize the likelihood of the observations
The algorithm iterates between these steps until convergence, typically reaching a local optimum.
Challenges
- Requires initialization of parameters
- Converges to local optima, not necessarily global
- More computationally intensive than labelled learning
- May require multiple random restarts with different initializations
3.3 Problem 3: Likelihood Calculation
Likelihood calculation is the problem of determining the probability of an observation sequence given an HMM: P(O | μ).
This calculation is useful for:
- Comparing different HMMs
- Evaluating how well an HMM explains new data
- Component in the Baum-Welch algorithm
Forward Algorithm
The standard approach uses the Forward algorithm, which efficiently computes the likelihood using dynamic programming:
- Define forward variables α_t(j) = P(o_1, ..., o_t, X_t = s_j | μ), the probability of observing the partial sequence o_1, ..., o_t and ending in state s_j
- Recursively calculate these variables
- Sum over all possible final states to get the total likelihood
The algorithm has complexity O(N²T), where N is the number of states and T is the sequence length.
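A compact sketch of the Forward algorithm for the two-dice HMM; the parameter values are made up, and the explicit end state is omitted for brevity:

```python
STATES = ["F", "L"]
INIT = {"F": 0.5, "L": 0.5}  # a_0j: transitions out of the start state
A = {("F", "F"): 0.9, ("F", "L"): 0.1,
     ("L", "F"): 0.1, ("L", "L"): 0.9}
B = {("F", k): 1 / 6 for k in range(1, 7)}
B.update({("L", k): 0.5 if k == 6 else 0.1 for k in range(1, 7)})

def forward_likelihood(observations):
    # alpha[j] = P(o_1..o_t, X_t = j); updated column by column, O(N^2 T).
    alpha = {j: INIT[j] * B[(j, observations[0])] for j in STATES}
    for o in observations[1:]:
        alpha = {j: sum(alpha[i] * A[(i, j)] for i in STATES) * B[(j, o)]
                 for j in STATES}
    return sum(alpha.values())  # sum over all possible final states

print(forward_likelihood([6, 6, 6]))
```

For long sequences these probabilities underflow quickly; real implementations work in log space or rescale each column.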
3.4 Problem 4: Decoding (Viterbi Algorithm)
Decoding is the problem of finding the most likely sequence of hidden states that could have generated a given observation sequence.
Formally, we want to find:

X̂ = argmax_X P(X | O, μ)

Which, by Bayes' rule (the denominator P(O | μ) is constant over X), expands to:

X̂ = argmax_X P(O | X, μ) · P(X | μ)
Dynamic Programming Solution
The search space of all possible state sequences is O(N^T), making brute force search impractical. Instead, we use the Viterbi algorithm, which is a dynamic programming approach.
The Viterbi algorithm works because HMMs satisfy two key properties:
- Optimal substructure: The best path to a state contains within it optimal sub-paths
- Overlapping subproblems: Calculations for later states reuse calculations from earlier states
The Viterbi Algorithm
The algorithm uses a trellis data structure to memoize intermediate results. The trellis is a grid where:
- Rows represent states
- Columns represent time steps
Key Variables
- Viterbi Probability δ_t(j): The probability of the most likely path ending in state s_j at time t
- Backpointer ψ_t(j): The previous state on the most likely path ending in state s_j at time t
Algorithm Steps
1. Initialization (t=0):
- Initialize the Viterbi probability for the start state: δ_0(0) = 1
- Initialize all other Viterbi probabilities to 0: δ_0(j) = 0 for j ≠ 0
2. Recursion (t=1 to T):
- For each state s_j and time t, compute:

δ_t(j) = max_i [δ_{t-1}(i) · a_ij] · b_j(o_t)
ψ_t(j) = argmax_i [δ_{t-1}(i) · a_ij]

3. Termination:
- Compute the probability of the most likely path ending in the final state:

P(X̂, O | μ) = max_i [δ_T(i) · a_if]
ψ_{T+1}(f) = argmax_i [δ_T(i) · a_if]

4. Backtracing:
- Start at the end state: X̂_{T+1} = s_f
- For t = T down to 1, find the previous state: X̂_t = ψ_{t+1}(X̂_{t+1})
The result is the most likely state sequence X̂ = (X̂_1, ..., X̂_T).
Example: Dice HMM Decoding
Suppose we observe the sequence [4, 3, 5] and want to determine whether the fair or loaded dice was used at each step.
The trellis calculation would proceed as follows:
- Initialization: Start at the start state
- For t=1 (observation=4): Calculate δ_1(F) and δ_1(L) based on transitions from the start state
- For t=2 (observation=3): Calculate δ_2(F) and δ_2(L) based on the values from t=1
- For t=3 (observation=5): Calculate δ_3(F) and δ_3(L) based on the values from t=2
- Termination: Calculate transition to end state
- Backtracing: Follow the backpointers to determine the most likely sequence
The output would be the most likely sequence of dice used (F or L) that produced the observed numbers.
Complexity
The time complexity of the Viterbi algorithm is O(N²T), where:
- N is the number of states
- T is the length of the observation sequence
This is much more efficient than the brute force approach of checking all possible state sequences.
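The full algorithm fits comfortably in a short function. This sketch uses made-up two-dice parameters and omits the explicit end-state transition in the termination step for brevity:

```python
STATES = ["F", "L"]
INIT = {"F": 0.5, "L": 0.5}
A = {("F", "F"): 0.9, ("F", "L"): 0.1,
     ("L", "F"): 0.1, ("L", "L"): 0.9}
B = {("F", k): 1 / 6 for k in range(1, 7)}
B.update({("L", k): 0.5 if k == 6 else 0.1 for k in range(1, 7)})

def viterbi(observations):
    # delta[j]: probability of the best path ending in state j at time t.
    delta = {j: INIT[j] * B[(j, observations[0])] for j in STATES}
    backpointers = []
    for o in observations[1:]:
        psi, new_delta = {}, {}
        for j in STATES:
            best = max(STATES, key=lambda i: delta[i] * A[(i, j)])
            psi[j] = best                                 # argmax predecessor
            new_delta[j] = delta[best] * A[(best, j)] * B[(j, o)]
        backpointers.append(psi)
        delta = new_delta
    # Termination: pick the best final state, then backtrace.
    state = max(STATES, key=lambda j: delta[j])
    path = [state]
    for psi in reversed(backpointers):
        path.append(psi[path[-1]])
    return list(reversed(path))

print(viterbi([6, 6, 1, 6, 6]))
```

Each time step does N² work (every state considers every predecessor), giving the O(N²T) bound stated above.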
4. Evaluation and Testing
4.1 Precision and Recall
When evaluating HMM performance, we often need more nuanced metrics than overall accuracy, especially when the classes are imbalanced or we're particularly interested in one type of state.
Precision and recall are evaluation metrics imported from information retrieval that provide a more detailed assessment of performance for specific classes of interest.
Why Not Just Use Accuracy?
Accuracy measures the overall correctness of predictions:

Accuracy = (number of correct predictions) / (total number of predictions)
However, accuracy has limitations:
- It can be misleading when classes are imbalanced
- It doesn't distinguish between different types of errors
- It doesn't focus on the class of interest
Precision, Recall and F-measure
Given a confusion matrix:

|  | System says: F | System says: L | Total |
|---|---|---|---|
| Truth is: F | a | b | a+b |
| Truth is: L | c | d | c+d |
| Total | a+c | b+d | a+b+c+d |
The key metrics are:

Precision of L: The proportion of instances classified as L that are actually L:

P_L = d / (b + d)

Recall of L: The proportion of actual L instances that are correctly classified as L:

R_L = d / (c + d)

F-measure of L: The harmonic mean of precision and recall:

F_L = (2 · P_L · R_L) / (P_L + R_L)
Example: Evaluating Dice HMM
Consider a test set where our HMM tries to determine whether the fair (F) or loaded (L) dice was used:
|  | Predicted: F | Predicted: L | Total |
|---|---|---|---|
| Actual: F | 45 | 5 | 50 |
| Actual: L | 10 | 40 | 50 |
| Total | 55 | 45 | 100 |
Calculating metrics for the loaded dice (L):
- Precision: P_L = 40 / (5 + 40) = 40/45 ≈ 0.889 (88.9% of predictions of L are correct)
- Recall: R_L = 40 / (10 + 40) = 40/50 = 0.80 (80% of actual L instances are detected)
- F-measure: F_L = (2 · 0.889 · 0.80) / (0.889 + 0.80) ≈ 0.842
- Accuracy: (45 + 40) / 100 = 0.85 (85% overall correct predictions)
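The arithmetic can be checked in a few lines, using the cell names from the confusion matrix:

```python
# Confusion-matrix cells: rows (truth F, truth L), columns (predicted F, predicted L).
a, b, c, d = 45, 5, 10, 40

precision_L = d / (b + d)             # 40/45
recall_L = d / (c + d)                # 40/50
f_L = 2 * precision_L * recall_L / (precision_L + recall_L)
accuracy = (a + d) / (a + b + c + d)  # 85/100

print(round(precision_L, 3), round(recall_L, 3), round(f_L, 3), accuracy)
```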
4.2 Baseline Algorithms
A baseline algorithm is a simpler alternative to your proposed solution that serves as a comparison point. Baselines help quantify how much improvement your model provides over simpler approaches.
Importance of Baselines
Baselines serve several crucial purposes:
- They establish a minimum performance level that any useful algorithm should exceed
- They help contextualize how "good" your results actually are
- They demonstrate the value added by your more complex approach
Criteria for Effective Baselines
A good baseline should:
- Require considerably less effort than your proposed solution
- Be easily understandable to other researchers
- Have "fair" access to a reasonable amount of information
- Be strong enough to be meaningful but not so strong as to be nearly equivalent to your solution
Common Baseline Approaches
- Most Frequent Category (MFC): Always predict the most common class in the training data
- Random Prediction: Randomly assign states based on their observed distribution
- Simple Rules: Apply straightforward heuristics (e.g., always predict state X after observation Y)
- Existing Implementations: Use previously published algorithms as comparison points
Baseline for HMM Decoding
For HMM decoding, a common baseline is random guessing of hidden states:
- Only knows how often the HMM is in each state on average (the observed distribution of states)
- Does not use information about sequences or observations
- For each position, randomly selects a state according to the state's overall frequency
This baseline tests how much the sequential modeling in an HMM improves prediction compared to simply knowing state frequencies.
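A sketch of this baseline; the training sequence is the labelled dice example used elsewhere in these notes, and the seed is arbitrary:

```python
import random
from collections import Counter

def random_baseline(train_states, test_length, seed=0):
    # Sample each position independently from the observed state frequencies,
    # ignoring both the observations and the sequential structure.
    counts = Counter(train_states)
    labels = list(counts)
    weights = [counts[s] for s in labels]
    rng = random.Random(seed)
    return [rng.choices(labels, weights=weights)[0] for _ in range(test_length)]

train = "F F F F L L L F F F F L L L L F F F L L".split()
print(random_baseline(train, 10))
```

Because it never looks at the dice rolls, any gain the HMM shows over this baseline is attributable to the transition and emission structure it models.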
4.3 Statistical Significance Testing
When comparing two systems (e.g., your HMM vs. a baseline), observed performance differences might be due to chance rather than actual superiority of one approach. Statistical significance testing helps determine whether observed differences are likely to be meaningful.
Null Hypothesis Testing
The framework is based on testing the null hypothesis that two result sets come from the same distribution (i.e., the systems are equally good).
Steps in statistical testing:
- Choose a significance level (α), typically 0.01 or 0.05
- Calculate a test statistic and its corresponding p-value
- If p-value < α, reject the null hypothesis with confidence 1-α (95% or 99%)
The Sign Test
The sign test is a non-parametric, paired test that's particularly suitable for comparing classification systems:
For each test instance, determine which system performs better:
- Positive outcome: System 1 beats System 2
- Negative outcome: System 2 beats System 1
- Tie: Both systems perform equally
Under the null hypothesis (systems are equal), the probability of positive and negative outcomes should be equal (0.5)
The counts follow a binomial distribution B(N,q) where:
- N is the total number of non-tie comparisons
- q is the probability of a negative outcome (0.5 under the null hypothesis)
Binomial Distribution
The probability of observing exactly k negative events out of N is:

P(X = k) = C(N, k) · q^k · (1-q)^(N-k)

The probability of observing at most k negative events is:

P(X ≤ k) = ∑_{i=0}^{k} C(N, i) · q^i · (1-q)^(N-i)
One-Tailed vs. Two-Tailed Tests
- One-tailed test: Tests if System 1 is better than System 2 (directional)
- Two-tailed test: Tests if the systems are different, regardless of direction (non-directional)
For a two-tailed test, the p-value is 2×P(X ≤ k) for the smaller count.
Example: Sign Test
Suppose we compare our HMM against a baseline on 100 test instances:
- HMM better: 65 instances
- Baseline better: 35 instances
- Ties: 0 instances
Using the binomial distribution with N=100, k=35, q=0.5:
- P(X ≤ 35) ≈ 0.0018
Since 0.0018 < 0.01 (our chosen significance level), we can reject the null hypothesis and conclude that the HMM is significantly better than the baseline with 99% confidence.
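The p-value in this example can be computed exactly with the binomial formula, using only the standard library:

```python
from math import comb

def sign_test_p(k, n, q=0.5):
    # One-tailed p-value: P(X <= k) under Binomial(n, q).
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(k + 1))

p = sign_test_p(35, 100)
print(p < 0.01)  # significant at alpha = 0.01
```

For the two-tailed version, double the tail probability before comparing against α.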
Handling Ties
Ties (where both systems perform equally) are common in classification tasks. They can be handled by:
- Excluding ties from the analysis (reducing N)
- Adding 0.5 to both positive and negative counts for each tie
Reporting Significance
When reporting results, only claim statistical significance if your test indicates it:
- "The difference between System 1 and System 2 is statistically significant at α = 0.01"
- Raw accuracy differences without significance testing may be meaningless
5. Practical Applications
5.1 Fraudulent Croupier Example
The fraudulent croupier scenario is a classic example used to illustrate HMMs in action. This example demonstrates how HMMs can detect hidden states when only observations are visible.
Scenario Description
The scenario involves a dishonest casino croupier who secretly switches between dice:
- The croupier has multiple dice: a fair dice (F) and one or more loaded dice (L)
- The fair dice has equal probability (1/6) for each number
- The loaded dice have biased probability distributions
- When people bet on outcomes, the croupier secretly switches between dice
- An observer can only see the numbers rolled, not which dice is being used
- The goal is to detect when the croupier is cheating by using the loaded dice
HMM Formulation
This scenario can be modeled as an HMM where:
- Hidden states: The type of dice being used (F or L)
- Observations: The numbers rolled (1-6)
- Transition probabilities: How likely the croupier is to switch dice or continue using the same one
- Emission probabilities: The probability of rolling each number with each dice
[Figure: the dice HMM with states, observations, and transition probabilities]
Variant with Three Dice
In a more complex version (as seen in tutorial 04), the croupier has three dice:
- A fair dice (F)
- Two different loaded dice (L1 and L2)
The HMM structure becomes more complex with additional states and transitions.
Parameter Estimation Example
Given training data with known dice types:
States: F F F F L L L F F F F L L L L F F F L L
Observations: 1 3 4 5 6 6 5 1 2 3 1 4 3 5 4 1 2 6 1 2
We can estimate the transition probabilities (counting each state's transitions to the next emitting state):
- a_FF (staying with fair dice): 8/11
- a_FL (switching from fair to loaded): 3/11
- a_LL (staying with loaded dice): 6/8 = 3/4
- a_LF (switching from loaded to fair): 2/8 = 1/4
And the emission probabilities:
- For fair dice: b_F(1) = 4/11, b_F(2) = 2/11, ..., b_F(6) = 1/11
- For loaded dice: b_L(6) = 2/9, b_L(5) = 2/9, ..., b_L(1) = 1/9
Decoding with Viterbi
Given only an observation sequence [4, 3, 5], we can use the Viterbi algorithm to determine the most likely sequence of dice used:
- Construct a trellis with states (F, L) as rows and time steps as columns
- Fill in the probability of reaching each state at each time step
- Backtrace to find the most likely state sequence
The result might be [F, F, L], indicating the first two rolls were likely from the fair dice and the third from the loaded dice.
Variants and Constraints
The tutorial materials (tut04.pdf) discuss several interesting variants of the croupier's behavior:
- Never switching directly from F to L2: This constraint can be represented by setting a_{F,L2} = 0
- Always ending with fair dice: This constraint means a_{L1,f} = a_{L2,f} = 0 (only the fair dice can transition to the end state)
- Never using loaded dice more than twice in a row: This requires a second-order HMM to properly model
- Always switching after rolling a 6: This relates emissions to transitions and cannot be directly modeled in a standard HMM
5.2 Part-of-Speech Tagging
Part-of-Speech (POS) tagging is the task of assigning grammatical categories (noun, verb, adjective, etc.) to words in a text. This is a classic application of HMMs in natural language processing.
The Problem of Ambiguity
Many words in English and other languages are ambiguous regarding their part of speech:
- "can" → verb or auxiliary verb or noun
- "fish" → noun or verb
- "watch" → noun or verb
For example, the sentence "We can fish" is ambiguous:
- "We can fish" = "We are able to fish" (auxiliary verb + verb)
- "We can fish" = "We put fish in cans" (verb + noun)
A human understands the correct interpretation based on context, and an HMM can use probabilistic transitions to model this context.
HMM Formulation for POS Tagging
In the POS tagging HMM:
- Hidden states: The POS tags (noun, verb, pronoun, etc.)
- Observations: The words in the text
- Transition probabilities: The probability of one POS tag following another
- Emission probabilities: The probability of a word being generated by a particular POS tag
Example from Lecture Notes
The lecture presents an example with the sentence "We can fish." and an HMM with four states:
- s_1 = verb
- s_2 = noun
- s_3 = personal pronoun
- s_4 = auxiliary verb
The lecture also gives a transition probability matrix A over these states and an emission probability matrix B specifying, for each state, the probability of emitting each word ("we", "can", "fish"); the numeric values are not reproduced here.
Analyzing Two Interpretations
The notes examine two possible state sequences:
- X = (s_3, s_4, s_1): We = pronoun, can = auxiliary verb, fish = verb
- X' = (s_3, s_1, s_2): We = pronoun, can = verb, fish = noun
Each sequence is scored as a product of transition and emission probabilities:

P(X, O | μ) = a_{0,3} · b_3(we) · a_{3,4} · b_4(can) · a_{4,1} · b_1(fish) · a_{1,f}
P(X', O | μ) = a_{0,3} · b_3(we) · a_{3,1} · b_1(can) · a_{1,2} · b_2(fish) · a_{2,f}

Since P(X, O | μ) > P(X', O | μ), the model predicts the first interpretation: "We (pronoun) can (auxiliary verb) fish (verb)" = "We are able to fish."
Viterbi for POS Tagging
In practice, we use the Viterbi algorithm to find the most likely sequence of POS tags for a given sentence:
- For each word and each possible POS tag, compute the probability of the best path ending with that tag
- Keep track of the previous tag in the best path
- Backtrace from the end to recover the most likely sequence of tags
Importance in NLP
POS tagging is a fundamental task in natural language processing that serves as:
- A preprocessing step for parsing and semantic analysis
- An essential component in information extraction
- A disambiguation technique for machine translation
- A foundation for many higher-level NLP tasks
The success of HMMs in POS tagging demonstrates their power in modeling sequential data with hidden structure, even in complex domains like human language.