
1. Introduction to HMMs

1.1 Markov Assumption

Hidden Markov Models (HMMs) are probabilistic models used for representing sequential data where the underlying process that generates the observations is hidden.

The Markov assumption (also called Limited Horizon) is fundamental to HMMs and states that the probability of the current state depends only on the previous state, not on the entire history of states.

Mathematically, this is expressed as:

P(X_t | X_1, X_2, ..., X_{t-1}) = P(X_t | X_{t-1})

Where:

  • X_t is the state at time t
  • X_{t-1} is the state at time t-1

The Markov assumption allows us to calculate the joint probability of a sequence of events as:

P(X_1, X_2, ..., X_T) = P(X_1) × P(X_2 | X_1) × P(X_3 | X_2) × ... × P(X_T | X_{T-1})

Importance of the Markov Assumption

The Markov assumption is crucial for HMMs because it:

  1. Makes calculations tractable by limiting the dependencies
  2. Reduces data sparseness when estimating parameters
  3. Allows for efficient algorithms like Viterbi

Example: Weather Prediction

Consider predicting tomorrow's weather based on historical observations:

  • Without Markov assumption: P(w_t | w_1, w_2, ..., w_{t-1}), conditioning on the entire history of past weather
  • With Markov assumption: P(w_t | w_{t-1}), conditioning only on yesterday's weather

The simplification makes the model more manageable, requiring far less data to estimate parameters.
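The simplification can be made concrete with a short Python sketch that scores a weather sequence using only previous-day dependencies. The initial and transition probabilities below are illustrative assumptions, not values from the text:

```python
# Score a weather sequence under the Markov assumption:
# P(s1..sT) = P(s1) * product of P(s_t | s_{t-1}).
# All probabilities are hypothetical, for illustration only.
INIT = {"rainy": 0.5, "cloudy": 0.5}
TRANS = {
    ("rainy", "rainy"): 0.6, ("rainy", "cloudy"): 0.4,
    ("cloudy", "rainy"): 0.3, ("cloudy", "cloudy"): 0.7,
}

def sequence_probability(states):
    p = INIT[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= TRANS[(prev, cur)]   # only the previous state matters
    return p

print(sequence_probability(["rainy", "rainy", "cloudy"]))  # 0.5 * 0.6 * 0.4 = 0.12
```

Only N² transition parameters are needed, rather than one parameter per possible history.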

1.2 Hidden vs. Observable States

Markov Chains vs. Hidden Markov Models

Markov Chains:

  • States are fully observable (you can directly see what state the system is in)
  • Transitions between states are probabilistic
  • Can be viewed as a probabilistic finite-state automaton

[Figure: A Markov Chain for weather prediction with two states (rainy and cloudy) and transition probabilities between them]

Hidden Markov Models:

  • States are hidden (not directly observable)
  • We only observe emissions or observations that are probabilistically related to the hidden states
  • No one-to-one mapping between observations and hidden states
  • A particular observation can be emitted from multiple different hidden states

[Figure: Time-elapsed view of an HMM with hidden states and observations]

Formal Structure

In an HMM:

  • Hidden states: The underlying states of the system that we cannot directly observe
  • Observations: The visible outputs that depend probabilistically on the hidden states
  • Transition probabilities: The probability of moving from one hidden state to another
  • Emission probabilities: The probability of an observation being generated from a particular hidden state

Key Difference

The key difference is that in Markov Chains, we know exactly which state the system is in at any time, while in HMMs, we need to infer the most likely sequence of hidden states based on the observations.

1.3 Applications of HMMs

HMMs are widely used in scenarios where:

  1. We have sequential data
  2. The underlying process is hidden or not directly observable
  3. We want to infer the hidden process from observable outputs

Common Applications

  1. Speech Recognition

    • Observations: Audio signal (acoustic features)
    • Hidden States: Phonemes or words
    • Task: Determine the most likely sequence of words given the acoustic signal
  2. Part-of-Speech Tagging

    • Observations: Words in a sentence
    • Hidden States: Part-of-speech tags (noun, verb, adjective, etc.)
    • Task: Assign the correct grammatical category to each word
  3. Machine Translation

    • Observations: Target words (translated text)
    • Hidden States: Source words (original text)
    • Task: Find the most likely translation
  4. Bioinformatics

    • Observations: DNA/protein sequences
    • Hidden States: Structural or functional elements
    • Task: Identify functional regions in biological sequences

Example: The Fraudulent Croupier Scenario

A classic example used to explain HMMs involves a dishonest casino croupier:

  • A croupier secretly switches between a fair dice and a loaded dice
  • The fair dice has equal probability (1/6) for each number
  • The loaded dice has a biased probability distribution
  • An observer can only see the sequence of dice rolls (observations)
  • The task is to determine when the croupier was using which dice (hidden states)

[Figure: The dice HMM with states, observations, and probabilities]

In this example:

  • Hidden States: The type of dice being used (fair or loaded)
  • Observations: The numbers rolled (1-6)
  • Transition Probabilities: How likely the croupier is to switch dice
  • Emission Probabilities: The probability of rolling each number with each dice

This scenario demonstrates how HMMs can be used to infer hidden processes (which dice is being used) from observable data (the dice rolls).
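The generative story behind this scenario can be sketched in a few lines of Python. The transition probabilities and the loaded dice's bias are illustrative assumptions (the text does not fix their values):

```python
import random

# Hypothetical parameters for the two-dice HMM (illustrative only).
TRANS = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},              # fair dice
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},   # biased toward 6
}

def roll_sequence(n, state="F", seed=42):
    """Sample n rolls: at each step emit a face, then maybe switch dice."""
    rng = random.Random(seed)
    states, rolls = [], []
    for _ in range(n):
        states.append(state)
        faces = list(EMIT[state])
        rolls.append(rng.choices(faces, weights=list(EMIT[state].values()))[0])
        state = rng.choices(list(TRANS[state]), weights=list(TRANS[state].values()))[0]
    return states, rolls

states, rolls = roll_sequence(10)
```

The observer sees only `rolls`; the decoding problem (Section 3.4) is to recover `states` from them.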

2. Structure of HMMs

2.1 State Transition Probability Matrix

The state transition probability matrix (or simply transition matrix) is a fundamental component of an HMM, represented as matrix A. It defines the probability of moving from one state to another.

For a first-order HMM with N emitting states, plus special start and end states, the transition matrix A is of size (N+2) × (N+2):

    a_ij = P(q_t = s_j | q_{t-1} = s_i)

Where:

  • a_ij represents the probability of transitioning from state s_i to state s_j

Properties of Transition Probabilities

  1. Row Summation: For each state s_i, the sum of all outgoing transition probabilities must equal 1: Σ_j a_ij = 1

  2. Non-negativity: All transition probabilities must be non-negative: a_ij ≥ 0

  3. Markov Property: The probability depends only on the current state, not on how the model arrived at that state.

Example: Dice HMM

In the fraudulent croupier example with two dice (fair and loaded):

  • a_FF = probability of using the fair dice again after using the fair dice
  • a_FL = probability of switching from fair to loaded dice
  • a_LF = probability of switching from loaded to fair dice
  • a_LL = probability of using the loaded dice again after using the loaded dice

These probabilities can be estimated from training data:

    a_ij = C(i → j) / C(i)

2.2 Emission Probability Matrix

The emission probability matrix (also called observation likelihood matrix) is represented as matrix B. It defines the probability of observing a particular output given that the model is in a specific hidden state.

For an HMM with N emitting states and M possible observations, plus special start and end symbols, the emission matrix B is of size (M+2) × (N+2):

    b_i(k) = P(o_t = k | q_t = s_i)

Where:

  • b_i(k) represents the probability of emitting observation k when in state s_i

Output Independence Property

An important property of HMMs is output independence, which states that the probability of an output observation depends only on the current state:

    P(o_t | q_1, ..., q_T, o_1, ..., o_T) = P(o_t | q_t)

This means that the current observation is conditionally independent of all other states and observations given the current state.

Estimation of Emission Probabilities

Emission probabilities can be estimated from training data:

    b_i(k) = C(i, k) / C(i)

Example: Dice HMM

In the fraudulent croupier scenario:

  • For the fair dice (state F): b_F(1) = b_F(2) = ... = b_F(6) = 1/6

  • For the loaded dice (state L): the probabilities would have a different distribution, reflecting the bias of the loaded dice.
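As a sketch, the emission matrix B for the two dice can be written out directly. The fair distribution (1/6 each) comes from the text; the loaded bias toward 6 is an assumed example:

```python
from fractions import Fraction

# Emission probabilities b_i(k): the fair dice is uniform (from the text);
# the loaded dice's bias toward 6 is a hypothetical choice.
B = {
    "F": {face: Fraction(1, 6) for face in range(1, 7)},
    "L": {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
          4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2)},
}

# Each row is a proper probability distribution over the six faces.
for state, dist in B.items():
    assert sum(dist.values()) == 1
```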

2.3 Start and End States

HMMs typically include a special start state (s_0) and a special end state (s_{N+1}) to explicitly model the beginning and end of observation sequences.

Start State (s_0)

  • Not associated with "real" observations
  • Transitions from s_0 to other states (represented by a_{0j}) define the initial state distribution
  • No transitions into the start state are defined (all a_{i0} are undefined)

End State (s_{N+1})

  • Not associated with "real" observations
  • Transitions from other states to s_{N+1} (represented by a_{i,N+1}) define the probability of sequence termination
  • No transitions out of the end state are defined (all a_{N+1,j} are undefined)

Special Observations

Similarly, special symbols are defined:

  • A special start symbol, emitted only by the start state
  • A special end symbol, emitted only by the end state

Complete HMM Definition

With these components defined, a complete first-order HMM is formally specified by:

  1. A set of emitting states
  2. Special start state and end state
  3. An output alphabet of observations
  4. Special start symbol and end symbol
  5. State transition probability matrix
  6. Emission probability matrix

The HMM can be denoted as λ = (A, B) with these parameters.

Visual Representation

[Figure: The dice HMM with start state, end state, and transitions between states]

In this diagram, a_{0L} and a_{0F} represent probabilities of starting with the loaded or fair dice, while a_{L,end} and a_{F,end} represent probabilities of ending the sequence when in each state.

3. HMM Problems and Algorithms

3.1 Problem 1: Labelled Learning (Parameter Estimation)

Labelled learning is the problem of estimating the parameters of an HMM (transition and emission probabilities) when we have access to both observation sequences and their corresponding hidden state sequences.

Input and Output

  • Input: Parallel sequences of observations and states (dual tape)
  • Output: HMM parameters (transition matrix A and emission matrix B)

This is the simplest of the HMM learning problems because we can directly count transitions and emissions from the labelled data.

Parameter Estimation Formulas

The transition probabilities are estimated using:

    a_ij = C(i → j) / C(i)

Where:

  • C(i → j) is the number of times state i is followed by state j
  • C(i) is the total number of times state i appears

The emission probabilities are estimated using:

    b_i(k) = C(i, k) / C(i)

Where:

  • C(i, k) is the number of times observation k is emitted from state i
  • C(i) is the total number of times state i appears

Example: Dice HMM Parameter Estimation

Given the following labelled sequence:

States:       F F F F L L L F F F F L L L L F F F L L
Observations: 1 3 4 5 6 6 5 1 2 3 1 4 3 5 4 1 2 6 1 2

We can count:

  • Transitions: F→F, F→L, L→L, L→F
  • Emissions: (F,1), (F,2), (F,3), ..., (L,1), (L,2), ...

For example, to calculate a_FL (the probability of transitioning from fair to loaded dice):

    a_FL = C(F → L) / C(F) = 3/11 ≈ 0.27

Similarly, to calculate b_L(6) (the probability of rolling a 6 with the loaded dice):

    b_L(6) = C(L, 6) / C(L) = 2/9 ≈ 0.22

Smoothing

In practice, add-one smoothing (or other smoothing techniques) is often applied to handle zero counts and avoid zero probabilities.
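The counting procedure above can be sketched directly from the labelled sequence in the example (a minimal version, without smoothing):

```python
from collections import Counter
from fractions import Fraction

# Labelled training data from the example above.
states = list("FFFFLLLFFFFLLLLFFFLL")
rolls = [1, 3, 4, 5, 6, 6, 5, 1, 2, 3, 1, 4, 3, 5, 4, 1, 2, 6, 1, 2]

trans = Counter(zip(states, states[1:]))  # C(i -> j): transition counts
emit = Counter(zip(states, rolls))        # C(i, k): emission counts
src = Counter(states[:-1])                # states that have a successor
occ = Counter(states)                     # all state occurrences

a_FL = Fraction(trans[("F", "L")], src["F"])  # P(fair -> loaded) = 3/11
b_L6 = Fraction(emit[("L", 6)], occ["L"])     # P(roll 6 | loaded) = 2/9
print(a_FL, b_L6)
```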

3.2 Problem 2: Unlabelled Learning

Unlabelled learning is the problem of estimating the HMM parameters when we only have access to observation sequences, without the corresponding hidden state sequences.

This is a more challenging problem since we don't know which states generated which observations.

Solution Approach

The standard algorithm for unlabelled learning is the Baum-Welch algorithm, which is a special case of the Expectation-Maximization (EM) algorithm:

  1. Expectation step: Compute the expected counts of transitions and emissions based on current parameter estimates
  2. Maximization step: Re-estimate the parameters to maximize the likelihood of the observations

The algorithm iterates between these steps until convergence, typically reaching a local optimum.

Challenges

  • Requires initialization of parameters
  • Converges to local optima, not necessarily global
  • More computationally intensive than labelled learning
  • May require multiple random restarts with different initializations

3.3 Problem 3: Likelihood Calculation

Likelihood calculation is the problem of determining the probability of an observation sequence given an HMM: P(O | λ).

This calculation is useful for:

  • Comparing different HMMs
  • Evaluating how well an HMM explains new data
  • Serving as a component in the Baum-Welch algorithm

Forward Algorithm

The standard approach uses the Forward algorithm, which efficiently computes the likelihood using dynamic programming:

  1. Define forward variables α_t(j) as the probability of observing the partial sequence o_1, ..., o_t and ending in state s_j
  2. Recursively calculate these variables
  3. Sum over all possible final states to get the total likelihood

The algorithm has complexity O(N²T), where N is the number of states and T is the sequence length.
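A minimal forward pass for the two-dice HMM might look as follows; the start, transition, and emission probabilities are illustrative assumptions:

```python
# Hypothetical parameters for the two-dice HMM (illustrative only).
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {"F": {f: 1 / 6 for f in range(1, 7)},
        "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def likelihood(obs):
    """P(O | model) via the forward recursion: O(N^2 * T) time."""
    # alpha[j] = probability of the observations so far, ending in state j
    alpha = {j: START[j] * EMIT[j][obs[0]] for j in START}
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * TRANS[i][j] for i in alpha) * EMIT[j][o]
                 for j in TRANS}
    return sum(alpha.values())

print(likelihood([6, 6, 6]))
```

Summing over states at each step is what distinguishes the forward algorithm from Viterbi, which takes a max instead.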

3.4 Problem 4: Decoding (Viterbi Algorithm)

Decoding is the problem of finding the most likely sequence of hidden states that could have generated a given observation sequence.

Formally, we want to find:

    Q̂ = argmax_Q P(Q | O)

Which expands to:

    Q̂ = argmax_Q P(O | Q) P(Q) = argmax_Q ∏_t P(o_t | q_t) P(q_t | q_{t-1})

Dynamic Programming Solution

The search space of all possible state sequences is of size N^T, making brute force search impractical. Instead, we use the Viterbi algorithm, which is a dynamic programming approach.

The Viterbi algorithm works because HMMs satisfy two key properties:

  1. Optimal substructure: The best path to a state contains within it optimal sub-paths
  2. Overlapping subproblems: Calculations for later states reuse calculations from earlier states

The Viterbi Algorithm

The algorithm uses a trellis data structure to memoize intermediate results. The trellis is a grid where:

  • Rows represent states
  • Columns represent time steps


Key Variables

  1. Viterbi Probability δ_t(j): The probability of the most likely path ending in state s_j at time t
  2. Backpointer ψ_t(j): The previous state in the most likely path to state s_j at time t

Algorithm Steps

1. Initialization (t=0):

  • Initialize the Viterbi probability for the start state: δ_0(s_0) = 1
  • Initialize all other Viterbi probabilities to 0: δ_0(j) = 0 for j ≠ 0

2. Recursion (t=1 to T):

  • For each state s_j and time t, compute:

    δ_t(j) = max_i [δ_{t-1}(i) · a_ij] · b_j(o_t)
    ψ_t(j) = argmax_i [δ_{t-1}(i) · a_ij]

3. Termination:

  • Compute the probability of the most likely path ending in the final state:

    P* = max_i [δ_T(i) · a_{i,end}]

4. Backtracing:

  • Start at the end state: q̂_T = argmax_i [δ_T(i) · a_{i,end}]
  • For t = T down to 2, find the previous state: q̂_{t-1} = ψ_t(q̂_t)

The result is the most likely state sequence q̂_1, q̂_2, ..., q̂_T.
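The steps above can be sketched as a compact decoder for the two-dice HMM. The probabilities are illustrative assumptions, and for brevity the special start/end states are folded into an initial distribution and a final max:

```python
# Hypothetical two-dice HMM parameters (illustrative only).
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.2, "L": 0.8}}
EMIT = {"F": {f: 1 / 6 for f in range(1, 7)},
        "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}

def viterbi(obs):
    # Initialization: best length-1 path ending in each state.
    delta = {j: START[j] * EMIT[j][obs[0]] for j in START}
    backptrs = []
    # Recursion: extend the best path one observation at a time.
    for o in obs[1:]:
        bp, new_delta = {}, {}
        for j in TRANS:
            best_i = max(delta, key=lambda i: delta[i] * TRANS[i][j])
            bp[j] = best_i
            new_delta[j] = delta[best_i] * TRANS[best_i][j] * EMIT[j][o]
        backptrs.append(bp)
        delta = new_delta
    # Termination: most probable final state; then backtrace.
    state = max(delta, key=delta.get)
    path = [state]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi([6, 6, 6]))  # a run of sixes decodes as the loaded dice
```

In production code the products would be replaced by sums of log probabilities to avoid numerical underflow on long sequences.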

Example: Dice HMM Decoding

Suppose we observe the sequence [4, 3, 5] and want to determine whether the fair or loaded dice was used at each step.

The trellis calculation would proceed as follows:

  1. Initialization: Start at the start state
  2. For t=1 (observation=4):
    • Calculate δ_1(F) and δ_1(L) based on transitions from the start state
  3. For t=2 (observation=3):
    • Calculate δ_2(F) and δ_2(L) based on the values from t=1
  4. For t=3 (observation=5):
    • Calculate δ_3(F) and δ_3(L) based on the values from t=2
  5. Termination: Calculate transition to end state
  6. Backtracing: Follow the backpointers to determine the most likely sequence

The output would be the most likely sequence of dice used (F or L) that produced the observed numbers.

Complexity

The time complexity of the Viterbi algorithm is O(N²T), where:

  • N is the number of states
  • T is the length of the observation sequence

This is much more efficient than the brute force approach of checking all possible state sequences.

4. Evaluation and Testing

4.1 Precision and Recall

When evaluating HMM performance, we often need more nuanced metrics than overall accuracy, especially when the classes are imbalanced or we're particularly interested in one type of state.

Precision and recall are evaluation metrics imported from information retrieval that provide a more detailed assessment of performance for specific classes of interest.

Why Not Just Use Accuracy?

Accuracy measures the overall correctness of predictions:

    Accuracy = (number of correct predictions) / (total number of predictions)

However, accuracy has limitations:

  • It can be misleading when classes are imbalanced
  • It doesn't distinguish between different types of errors
  • It doesn't focus on the class of interest

Precision, Recall and F-measure

Given a confusion matrix:

                 System says: F   System says: L   Total
Truth is: F            a                b           a+b
Truth is: L            c                d           c+d
Total                 a+c              b+d        a+b+c+d

The key metrics are:

Precision of L: The proportion of instances classified as L that are actually L

    P_L = d / (b + d)

Recall of L: The proportion of actual L instances that are correctly classified as L

    R_L = d / (c + d)

F-measure of L: The harmonic mean of precision and recall

    F_L = 2 · P_L · R_L / (P_L + R_L)

Example: Evaluating Dice HMM

Consider a test set where our HMM tries to determine whether the fair (F) or loaded (L) dice was used:

             Predicted: F   Predicted: L   Total
Actual: F         45              5          50
Actual: L         10             40          50
Total             55             45         100

Calculating metrics for the loaded dice (L):

  • Precision: 40/45 ≈ 0.889 (88.9% of predictions of L are correct)
  • Recall: 40/50 = 0.80 (80% of actual L instances are detected)
  • F-measure: 2 × 0.889 × 0.80 / (0.889 + 0.80) ≈ 0.842
  • Accuracy: (45 + 40)/100 = 0.85 (85% overall correct predictions)
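These calculations can be checked with a short script using the confusion-matrix cells from the example (a = 45, b = 5, c = 10, d = 40):

```python
# Confusion-matrix cells from the dice example above.
a, b, c, d = 45, 5, 10, 40  # (F,F), (F,L), (L,F), (L,L) counts

precision_L = d / (b + d)                  # 40/45
recall_L = d / (c + d)                     # 40/50
f_measure_L = (2 * precision_L * recall_L) / (precision_L + recall_L)
accuracy = (a + d) / (a + b + c + d)       # 85/100

print(round(precision_L, 3), round(recall_L, 3),
      round(f_measure_L, 3), round(accuracy, 3))
```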

4.2 Baseline Algorithms

A baseline algorithm is a simpler alternative to your proposed solution that serves as a comparison point. Baselines help quantify how much improvement your model provides over simpler approaches.

Importance of Baselines

Baselines serve several crucial purposes:

  • They establish a minimum performance level that any useful algorithm should exceed
  • They help contextualize how "good" your results actually are
  • They demonstrate the value added by your more complex approach

Criteria for Effective Baselines

A good baseline should:

  1. Require considerably less effort than your proposed solution
  2. Be easily understandable to other researchers
  3. Have "fair" access to a reasonable amount of information
  4. Be strong enough to be meaningful but not so strong as to be nearly equivalent to your solution

Common Baseline Approaches

  1. Most Frequent Category (MFC): Always predict the most common class in the training data
  2. Random Prediction: Randomly assign states based on their observed distribution
  3. Simple Rules: Apply straightforward heuristics (e.g., always predict state X after observation Y)
  4. Existing Implementations: Use previously published algorithms as comparison points

Baseline for HMM Decoding

For HMM decoding, a common baseline is random guessing of hidden states:

  • Only knows how often the HMM is in each state on average (the observed distribution of states)
  • Does not use information about sequences or observations
  • For each position, randomly selects a state according to the state's overall frequency

This baseline tests how much the sequential modeling in an HMM improves prediction compared to simply knowing state frequencies.
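Both the Most Frequent Category baseline and this random baseline can be sketched in a few lines; the helper names are our own, and the training sequence reuses the dice example:

```python
import random
from collections import Counter

# Example labelled training states (from the dice example).
train_states = list("FFFFLLLFFFFLLLLFFFLL")

def mfc_baseline(n):
    """Most Frequent Category: always predict the commonest state."""
    most_common = Counter(train_states).most_common(1)[0][0]
    return [most_common] * n

def random_baseline(n, seed=0):
    """Random prediction weighted by each state's overall frequency."""
    counts = Counter(train_states)
    states, weights = zip(*counts.items())
    return random.Random(seed).choices(states, weights=weights, k=n)

print(mfc_baseline(5))
```

Neither baseline looks at the observations or at sequence context, which is exactly what makes them fair points of comparison for an HMM.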

4.3 Statistical Significance Testing

When comparing two systems (e.g., your HMM vs. a baseline), observed performance differences might be due to chance rather than actual superiority of one approach. Statistical significance testing helps determine whether observed differences are likely to be meaningful.

Null Hypothesis Testing

The framework is based on testing the null hypothesis that two result sets come from the same distribution (i.e., the systems are equally good).

Steps in statistical testing:

  1. Choose a significance level (α), typically 0.01 or 0.05
  2. Calculate a test statistic and its corresponding p-value
  3. If p-value < α, reject the null hypothesis with confidence 1-α (95% or 99%)

The Sign Test

The sign test is a non-parametric, paired test that's particularly suitable for comparing classification systems:

  1. For each test instance, determine which system performs better:

    • Positive outcome: System 1 beats System 2
    • Negative outcome: System 2 beats System 1
    • Tie: Both systems perform equally
  2. Under the null hypothesis (systems are equal), the probability of positive and negative outcomes should be equal (0.5)

  3. The counts follow a binomial distribution B(N,q) where:

    • N is the total number of non-tie comparisons
    • q is the probability of a negative outcome (0.5 under the null hypothesis)

Binomial Distribution

The probability of observing exactly k negative events out of N is:

    P(X = k) = C(N, k) · q^k · (1 − q)^(N−k)

The probability of observing at most k negative events:

    P(X ≤ k) = Σ_{i=0}^{k} C(N, i) · q^i · (1 − q)^(N−i)

One-Tailed vs. Two-Tailed Tests

  • One-tailed test: Tests if System 1 is better than System 2 (directional)
  • Two-tailed test: Tests if the systems are different, regardless of direction (non-directional)

For a two-tailed test, the p-value is 2×P(X ≤ k) for the smaller count.

Example: Sign Test

Suppose we compare our HMM against a baseline on 100 test instances:

  • HMM better: 65 instances
  • Baseline better: 35 instances
  • Ties: 0 instances

Using the binomial distribution with N=100, k=35, q=0.5:

  • P(X ≤ 35 | N = 100) ≈ 0.0018

Since 0.0018 < 0.01 (our chosen significance level), we can reject the null hypothesis and conclude that the HMM is significantly better than the baseline with 99% confidence.
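The p-value in this example can be reproduced with an exact binomial CDF, using only the standard library:

```python
from math import comb

def binom_cdf(k, n, q=0.5):
    """P(X <= k) for X ~ B(n, q): probability of at most k negative events."""
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(k + 1))

# N = 100 non-tie comparisons, k = 35 cases where the baseline won.
p_value = binom_cdf(35, 100)
print(p_value < 0.01)  # significant at alpha = 0.01
```

For a two-tailed test, the same value would simply be doubled before comparing against α.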

Handling Ties

Ties (where both systems perform equally) are common in classification tasks. They can be handled by:

  • Excluding ties from the analysis (reducing N)
  • Adding 0.5 to both positive and negative counts for each tie

Reporting Significance

When reporting results, only claim statistical significance if your test indicates it:

  • "The difference between System 1 and System 2 is statistically significant at α = 0.01"
  • Raw accuracy differences without significance testing may be meaningless

5. Practical Applications

5.1 Fraudulent Croupier Example

The fraudulent croupier scenario is a classic example used to illustrate HMMs in action. This example demonstrates how HMMs can detect hidden states when only observations are visible.

Scenario Description

The scenario involves a dishonest casino croupier who secretly switches between dice:

  • The croupier has multiple dice: a fair dice (F) and one or more loaded dice (L)
  • The fair dice has equal probability (1/6) for each number
  • The loaded dice have biased probability distributions
  • When people bet on outcomes, the croupier secretly switches between dice
  • An observer can only see the numbers rolled, not which dice is being used
  • The goal is to detect when the croupier is cheating by using the loaded dice

HMM Formulation

This scenario can be modeled as an HMM where:

  • Hidden states: The type of dice being used (F or L)
  • Observations: The numbers rolled (1-6)
  • Transition probabilities: How likely the croupier is to switch dice or continue using the same one
  • Emission probabilities: The probability of rolling each number with each dice

[Figure: The dice HMM with states, observations, and transition probabilities]

Variant with Three Dice

In a more complex version (as seen in tutorial 04), the croupier has three dice:

  • A fair dice (F)
  • Two different loaded dice (L1 and L2)

The HMM structure becomes more complex with additional states and transitions.

Parameter Estimation Example

Given training data with known dice types:

States:       F F F F L L L F F F F L L L L F F F L L
Observations: 1 3 4 5 6 6 5 1 2 3 1 4 3 5 4 1 2 6 1 2

We can estimate the transition probabilities:

  • a_FF (staying with fair dice): 8/11 ≈ 0.73
  • a_FL (switching from fair to loaded): 3/11 ≈ 0.27
  • a_LL (staying with loaded dice): 6/8 = 0.75
  • a_LF (switching from loaded to fair): 2/8 = 0.25

And the emission probabilities:

  • For the fair dice: b_F(1) = 4/11, b_F(2) = 2/11, b_F(3) = 2/11, b_F(4) = 1/11, b_F(5) = 1/11, b_F(6) = 1/11
  • For the loaded dice: b_L(1) = 1/9, b_L(2) = 1/9, b_L(3) = 1/9, b_L(4) = 2/9, b_L(5) = 2/9, b_L(6) = 2/9

Decoding with Viterbi

Given only an observation sequence [4, 3, 5], we can use the Viterbi algorithm to determine the most likely sequence of dice used:

  1. Construct a trellis with states (F, L) as rows and time steps as columns
  2. Fill in the probability of reaching each state at each time step
  3. Backtrace to find the most likely state sequence

The result might be [F, F, L], indicating the first two rolls were likely from the fair dice and the third from the loaded dice.

Variants and Constraints

The tutorial materials (tut04.pdf) discuss several interesting variants of the croupier's behavior:

  1. Never switching directly from F to L2: This constraint can be represented by setting a_{F,L2} = 0
  2. Always ending with fair dice: This constraint means a_{L1,end} = a_{L2,end} = 0 (only the fair dice can transition to the end state)
  3. Never using loaded dice more than twice in a row: This requires a second-order HMM to properly model
  4. Always switching after rolling a 6: This relates emissions to transitions and cannot be directly modeled in a standard HMM

5.2 Part-of-Speech Tagging

Part-of-Speech (POS) tagging is the task of assigning grammatical categories (noun, verb, adjective, etc.) to words in a text. This is a classic application of HMMs in natural language processing.

The Problem of Ambiguity

Many words in English and other languages are ambiguous regarding their part of speech:

  • "can" → verb or auxiliary verb or noun
  • "fish" → noun or verb
  • "watch" → noun or verb

For example, the sentence "We can fish" is ambiguous:

  • "We can fish" = "We are able to fish" (auxiliary verb + verb)
  • "We can fish" = "We put fish in cans" (verb + noun)

A human understands the correct interpretation based on context, and an HMM can use probabilistic transitions to model this context.

HMM Formulation for POS Tagging

In the POS tagging HMM:

  • Hidden states: The POS tags (noun, verb, pronoun, etc.)
  • Observations: The words in the text
  • Transition probabilities: The probability of one POS tag following another
  • Emission probabilities: The probability of a word being generated by a particular POS tag

Example from Lecture Notes

The lecture presents an example with the sentence "We can fish." and an HMM with four states:

  • = verb
  • = noun
  • = personal pronoun
  • = auxiliary verb

The transition probability matrix A is:

And the emission probability matrix B is:

Analyzing Two Interpretations

The notes examine two possible state sequences:

  1. Q_1 = (pronoun, auxiliary verb, verb): We = pronoun, can = auxiliary verb, fish = verb
  2. Q_2 = (pronoun, verb, noun): We = pronoun, can = verb, fish = noun

Calculating the joint probabilities P(Q_1, O) and P(Q_2, O) for the observation sequence O = "We can fish":

Since P(Q_1, O) > P(Q_2, O), the model predicts the first interpretation: "We (pronoun) can (auxiliary verb) fish (verb)" = "We are able to fish."
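The comparison of the two readings can be sketched in code. Note that the lecture's actual A and B matrices are not reproduced above, so every probability below is a hypothetical stand-in chosen only to illustrate the computation:

```python
# Hypothetical transition and emission probabilities (NOT the lecture's values).
TRANS = {("<s>", "PRON"): 0.4, ("PRON", "AUX"): 0.4, ("PRON", "VERB"): 0.2,
         ("AUX", "VERB"): 0.5, ("VERB", "NOUN"): 0.3,
         ("VERB", "</s>"): 0.2, ("NOUN", "</s>"): 0.3}
EMIT = {("PRON", "we"): 0.1, ("AUX", "can"): 0.3,
        ("VERB", "can"): 0.005, ("VERB", "fish"): 0.01, ("NOUN", "fish"): 0.02}

def joint_probability(tags, words):
    """P(Q, O): product of transition and emission probabilities."""
    p, prev = 1.0, "<s>"
    for tag, word in zip(tags, words):
        p *= TRANS[(prev, tag)] * EMIT[(tag, word)]
        prev = tag
    return p * TRANS[(prev, "</s>")]

words = ["we", "can", "fish"]
p1 = joint_probability(["PRON", "AUX", "VERB"], words)   # "able to fish"
p2 = joint_probability(["PRON", "VERB", "NOUN"], words)  # "put fish in cans"
print(p1 > p2)
```

With these assumed numbers the auxiliary reading wins, mirroring the lecture's conclusion; with different matrices the comparison could of course come out the other way.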

Viterbi for POS Tagging

In practice, we use the Viterbi algorithm to find the most likely sequence of POS tags for a given sentence:

  1. For each word and each possible POS tag, compute the probability of the best path ending with that tag
  2. Keep track of the previous tag in the best path
  3. Backtrace from the end to recover the most likely sequence of tags

Importance in NLP

POS tagging is a fundamental task in natural language processing that serves as:

  • A preprocessing step for parsing and semantic analysis
  • An essential component in information extraction
  • A disambiguation technique for machine translation
  • A foundation for many higher-level NLP tasks

The success of HMMs in POS tagging demonstrates their power in modeling sequential data with hidden structure, even in complex domains like human language.