When One Model Isn't Enough: Ensemble Regime Detection

Part 72 — HMM Series 3 of 3 — CUSUM, BOCPD, feature engineering, and the three-signal framework I'm testing for V6

Jan 30, 2026

This is part 72 of my series — Building & Scaling Algorithmic Trading Strategies

Parts 1 and 2 covered Hidden Markov Models: the theory, the algorithms, and the unavoidable detection lag. HMMs are the default tool for regime detection, but they’re not the only tool.

This post explores alternatives: changepoint detection methods, machine learning approaches, and ensemble techniques. I’ll also share what I’m currently evaluating for V6 — not a final decision, but a framework I’m testing because it’s practical and I can explain every component.

The goal isn’t to eliminate detection lag (we can’t) but to build a system that’s robust, interpretable, and doesn’t require a PhD to maintain.

Changepoint Detection: A Different Framing

HMMs model regimes as discrete states with transitions. Changepoint detection takes a different view: the data-generating process is stable until it suddenly shifts.

The question changes from “what state are we in?” to “did something just change?”

CUSUM (Cumulative Sum Control Charts)

CUSUM tracks cumulative deviations from an expected mean. When the cumulative deviation exceeds a threshold, a changepoint is signaled.

S⁺(t) = max(0, S⁺(t-1) + (r_t - μ₀) - k)
S⁻(t) = max(0, S⁻(t-1) - (r_t - μ₀) - k)

Signal changepoint when S⁺(t) > h or S⁻(t) > h

Where:
  μ₀ = expected mean under "normal" conditions
  k  = allowance parameter (slack before accumulating)
  h  = decision threshold

S⁺ detects upward shifts in mean; S⁻ detects downward shifts. The allowance k prevents small fluctuations from accumulating. The threshold h controls sensitivity vs. false alarms.

The middle panel shows the CUSUM statistics. Notice how S⁺ (red) accumulates during the bear market as negative returns pile up. When it crosses the threshold h=8, a changepoint is signaled.

Pros of CUSUM:

Simple and interpretable
Well-understood statistical properties
Easy to tune (just k and h)

Cons:

Assumes you know the “normal” mean μ₀
Threshold h trades off speed vs. false positives
Doesn’t give probability—just a binary signal

Bayesian Online Changepoint Detection (BOCPD)

BOCPD, introduced by Adams and MacKay (2007), is more sophisticated. It maintains a probability distribution over “run lengths” — how long since the last changepoint.

The intuition: at each time step, either we’re continuing the current regime (run length increases by 1) or a changepoint occurred (run length resets to 0). BOCPD computes the posterior probability of each possibility.

The bottom panel shows P(recent changepoint) — a continuous probability rather than a binary flag. This is more useful for sizing positions: you can reduce exposure proportionally as changepoint probability rises.

The BOCPD recursion:

P(r_t = 0 | data) ∝ H × Σ P(r_{t-1} | data) × P(x_t | r_{t-1})
P(r_t = r_{t-1}+1 | data) ∝ (1-H) × P(r_{t-1} | data) × P(x_t | r_{t-1})

Where:
  r_t     = run length at time t
  H       = hazard rate (prior probability of changepoint)
  P(x_t | r) = predictive probability of observation given run length

Pros of BOCPD:

Produces probabilities, not just flags
Naturally handles uncertainty
Can model different segment distributions

Cons:

More complex to implement
Requires specifying hazard rate and segment model
Still has detection lag (it’s Bayesian, not clairvoyant)

Machine Learning: Feature Engineering Matters More Than Algorithms

You can frame regime detection as a classification problem: given features, predict Bull or Bear. Random forests, gradient boosting, neural networks — all the usual suspects.

But here’s the thing: the features matter far more than the model.

A random forest with good features will crush a neural network with bad features. For regime detection, good features include:

Price-based:
  - Realized volatility (20-day, 60-day)
  - Return momentum (various lookbacks)
  - Drawdown from recent high
  - % of up days in lookback window

Volatility surface:
  - VIX level
  - VIX term structure (contango/backwardation)
  - VVIX (volatility of volatility)
  - Put/call skew

Credit/macro:
  - High yield spreads
  - Investment grade spreads
  - TED spread
  - Yield curve slope

The chart shows several features and their behavior during bear regimes (shaded). Realized volatility is the single best feature — it reliably spikes during bear markets. But combining multiple features improves robustness.

Feature importance from a random forest trained on regime labels:

Feature             Importance
-----------------   ----------
Realized Vol        28%
VIX Level           22%
Credit Spread       18%
Mean Return         15%
Skewness            10%
Momentum             7%

Volatility dominates. This makes sense — bear markets are defined as much by volatility expansion as by negative returns.

The ML caveat:

Machine learning models are prone to overfitting, especially with regime data where you might only have a handful of bear markets in your training set. A model that perfectly classifies historical regimes may fail on the next one.

Mitigations:

Use simple models (random forest with few trees, logistic regression)
Regularize aggressively
Test on true out-of-sample periods (not random splits)
Prefer interpretable features over complex transformations

Ensemble Methods: Don’t Bet on One Signal

Single detectors fail in different ways:

HMM might lag during sharp crashes but handle gradual regime shifts well
VIX threshold is fast but noisy
CUSUM catches sustained shifts but misses brief spikes

Combining signals reduces reliance on any single method’s weaknesses.

The top panel shows three individual signals: HMM, VIX-based, and momentum. Each has different timing and noise characteristics. The middle panel shows the ensemble (weighted average). The bottom panel shows signal agreement — how many detectors agree at each time.

Two ensemble approaches:

1. Weighted Average:

P(Bear)_ensemble = w₁ × P(Bear)_HMM + w₂ × P(Bear)_VIX + w₃ × P(Bear)_momentum

Example weights: 0.40, 0.35, 0.25

This gives a smooth probability that incorporates all signals. Tune weights based on historical performance or use equal weights as a baseline.

2. Confirmation (Majority Vote):

Signal Bear only if ≥2 of 3 detectors agree.

More conservative: require all 3 to agree.
More aggressive: require any 1 to trigger.

Confirmation reduces false positives but increases lag. You’re waiting for multiple signals to align.

Which is better? Depends on your cost function. If false positives are expensive (transaction costs, whipsaw), use confirmation. If lag is expensive (missing crash protection), use weighted average with lower threshold.

Comparing Methods

No method dominates on all dimensions:

The left panel plots detection lag vs. false positive rate. Lower-left is better (fast and accurate), but no method lives there. You’re always trading off.

The right panel compares methods on multiple criteria:

Speed: How fast to detect (VIX threshold wins)
Accuracy: False positive rate (ensemble wins)
Simplicity: Ease of implementation (VIX threshold wins)
Interpretability: Can you explain decisions? (HMM, VIX win)
Adaptability: Handles novel regimes (ensemble, ML win)

My take: start simple, add complexity only if it helps.

A VIX threshold at 25 is dead simple and catches most bear markets within a day. It has high false positives, but you can filter with confirmation from a second signal. This beats a complex system you don’t understand.

What I’m Evaluating for V6

After all this analysis, here’s what I’m testing for potential integration into V6:

Three signal generators:

HMM Filter (40% weight)
- 2-state Gaussian HMM fitted on rolling 2-year window
- Online filtering (forward algorithm only)
- Outputs P(Bear) between 0 and 1
VIX Regime (35% weight)
- Simple threshold logic with hysteresis
- VIX < 18: P(Bear) = 0.1
- VIX 18-25: P(Bear) = 0.3
- VIX 25-35: P(Bear) = 0.6
- VIX > 35: P(Bear) = 0.9
Volatility Breakout (25% weight)
- 20-day realized vol vs. 60-day baseline
- If current vol > 1.5× baseline: P(Bear) = 0.7
- Else: P(Bear) = 0.2

Ensemble aggregation:

P_bear = 0.40 * hmm_signal + 0.35 * vix_signal + 0.25 * vol_signal

Confirmation layer:

For high-confidence calls (P_bear > 0.7 or < 0.3), require at least 2/3 signals to agree
If signals disagree significantly, stay at neutral allocation

Allocation rules:

P(Bear) < 0.3    → Full risk-on (100% TQQQ)
P(Bear) 0.3-0.6  → Reduced exposure (50% TQQQ, 50% hedge)
P(Bear) > 0.6    → Risk-off (100% hedge)

Hysteresis: ±0.1 buffer to prevent thrashing

Why I’m testing this design:

Interpretable: I can explain every component. When the system signals Bear, I’ll know why.
Robust: No single signal failure kills the system. If HMM breaks, VIX and vol still work.
Tunable: Weights can be adjusted without rewriting code. Thresholds are explicit.
Testable: Each component can be backtested independently before committing to live use.

Implementation (Under Evaluation)

Core ensemble detector I’m testing:

"""
Ensemble Regime Detector - Under Evaluation for V6
Math & Markets - Post 71

All code is open source—use it, modify it, build on it. 
No guarantees about correctness or performance. 
Test everything yourself before deploying with real capital.

Author: K. Iyer
Substack: https://kniyer.substack.com
"""

import numpy as np

class EnsembleRegimeDetector:
    """
    Combines multiple regime signals into ensemble P(Bear).
    """
    
    def __init__(self, weights=None):
        """
        weights: dict with keys 'hmm', 'vix', 'vol'
        """
        self.weights = weights or {'hmm': 0.40, 'vix': 0.35, 'vol': 0.25}
        self.hmm_filter = None  # Initialize separately
        self.vol_baseline = None
        self.last_p_bear = 0.3  # For hysteresis
        
    def vix_signal(self, vix):
        """Convert VIX level to P(Bear)."""
        if vix < 18:
            return 0.1
        elif vix < 25:
            return 0.3
        elif vix < 35:
            return 0.6
        else:
            return 0.9
    
    def vol_signal(self, realized_vol, baseline_vol):
        """Compare current vol to baseline."""
        if baseline_vol <= 0:
            return 0.3
        
        ratio = realized_vol / baseline_vol
        if ratio > 1.5:
            return 0.7
        elif ratio > 1.2:
            return 0.4
        else:
            return 0.2
    
    def compute_ensemble(self, hmm_p_bear, vix, realized_vol, baseline_vol):
        """
        Compute weighted ensemble P(Bear).
        
        Returns: (p_bear, signals_dict, agreement)
        """
        # Individual signals
        p_hmm = hmm_p_bear
        p_vix = self.vix_signal(vix)
        p_vol = self.vol_signal(realized_vol, baseline_vol)
        
        signals = {'hmm': p_hmm, 'vix': p_vix, 'vol': p_vol}
        
        # Weighted average
        p_bear = (
            self.weights['hmm'] * p_hmm +
            self.weights['vix'] * p_vix +
            self.weights['vol'] * p_vol
        )
        
        # Check agreement (how many signal Bear > 0.5)
        bear_votes = sum(1 for p in [p_hmm, p_vix, p_vol] if p > 0.5)
        agreement = bear_votes / 3
        
        return p_bear, signals, agreement
    
    def get_allocation(self, p_bear, agreement, hysteresis=0.1):
        """
        Convert P(Bear) to allocation decision with hysteresis.
        
        Returns: allocation string and confidence
        """
        # Apply hysteresis
        if abs(p_bear - self.last_p_bear) < hysteresis:
            # Small change, stick with previous zone
            effective_p = self.last_p_bear
        else:
            effective_p = p_bear
            self.last_p_bear = p_bear
        
        # Confirmation check for extreme calls
        if effective_p > 0.6 and agreement < 0.67:
            # Want to go risk-off but signals don't agree
            effective_p = 0.5  # Stay neutral
        
        if effective_p < 0.3 and agreement > 0.33:
            # Want to go risk-on but some signals say Bear
            effective_p = 0.4  # Stay cautious
        
        # Allocation zones
        if effective_p < 0.3:
            return 'risk_on', 1 - effective_p
        elif effective_p > 0.6:
            return 'risk_off', effective_p
        else:
            return 'neutral', 0.5
        

# Usage example
if __name__ == "__main__":
    detector = EnsembleRegimeDetector()
    
    # Simulated inputs
    test_cases = [
        {'hmm': 0.15, 'vix': 14, 'rvol': 12, 'baseline': 15},  # Calm
        {'hmm': 0.45, 'vix': 22, 'rvol': 18, 'baseline': 15},  # Elevated
        {'hmm': 0.75, 'vix': 32, 'rvol': 28, 'baseline': 15},  # Crisis
        {'hmm': 0.80, 'vix': 18, 'rvol': 14, 'baseline': 15},  # Disagreement
    ]
    
    print("Ensemble Regime Detector Demo")
    print("=" * 60)
    
    for i, tc in enumerate(test_cases):
        p_bear, signals, agreement = detector.compute_ensemble(
            tc['hmm'], tc['vix'], tc['rvol'], tc['baseline']
        )
        allocation, confidence = detector.get_allocation(p_bear, agreement)
        
        print(f"\nCase {i+1}: VIX={tc['vix']}, RVol={tc['rvol']}")
        print(f"  Signals: HMM={signals['hmm']:.2f}, VIX={signals['vix']:.2f}, Vol={signals['vol']:.2f}")
        print(f"  Ensemble P(Bear): {p_bear:.2f}")
        print(f"  Agreement: {agreement:.0%}")
        print(f"  Allocation: {allocation} (confidence: {confidence:.2f})")

Output:

Case 1: VIX=14, RVol=12
  Signals: HMM=0.15, VIX=0.10, Vol=0.20
  Ensemble P(Bear): 0.14
  Allocation: risk_on (confidence: 0.86)

Case 2: VIX=22, RVol=18
  Signals: HMM=0.45, VIX=0.30, Vol=0.40
  Ensemble P(Bear): 0.39
  Allocation: neutral (confidence: 0.50)

Case 3: VIX=32, RVol=28
  Signals: HMM=0.75, VIX=0.60, Vol=0.70
  Ensemble P(Bear): 0.69
  Allocation: risk_off (confidence: 0.69)

Case 4: VIX=18, RVol=14
  Signals: HMM=0.80, VIX=0.30, Vol=0.20
  Ensemble P(Bear): 0.45
  Allocation: neutral (confidence: 0.50)

Case 4 is interesting: HMM says Bear (0.80) but VIX and vol say Bull. The ensemble stays neutral due to disagreement. This is the confirmation layer in action — preventing whipsaws when signals conflict.

Code Walkthrough and Sample Output

The EnsembleRegimeDetector class encapsulates the entire framework. Here’s what each test case reveals about how the system behaves:

Case 1: Calm Bull Market

Inputs:  HMM=0.15, VIX=14, RVol=12%, Baseline=15%
Signals: HMM=0.15, VIX=0.10, Vol=0.20
Ensemble P(Bear): 0.14
Allocation: RISK_ON (confidence: 0.86)

All three signals agree: this is a low-risk environment. VIX at 14 is well below the 18 threshold, realized vol is actually below baseline (ratio = 0.8), and HMM sees only 15% bear probability. The ensemble P(Bear) of 0.14 triggers full risk-on allocation. Confidence is high (0.86) because the signal is unambiguous.

Case 2: Elevated Uncertainty

Inputs:  HMM=0.45, VIX=22, RVol=18%, Baseline=15%
Signals: HMM=0.45, VIX=0.30, Vol=0.40
Ensemble P(Bear): 0.39
Allocation: NEUTRAL (confidence: 0.50)

The market is showing stress but not crisis. VIX has climbed into the 18-25 range (P=0.30), vol is running 20% above baseline (ratio = 1.2, P=0.40), and HMM is uncertain at 0.45. No signal crosses the 0.5 threshold, so agreement is 0%. The ensemble lands at 0.39—above the 0.3 risk-on cutoff but below 0.6 risk-off. Result: neutral allocation, reduced exposure, wait for clarity.

Case 3: Crisis Mode

Inputs:  HMM=0.75, VIX=32, RVol=28%, Baseline=15%
Signals: HMM=0.75, VIX=0.60, Vol=0.70
Ensemble P(Bear): 0.69
Allocation: RISK_OFF (confidence: 0.69)

Full agreement on danger. VIX at 32 puts us in the 25-35 band (P=0.60), realized vol is nearly double baseline (ratio = 1.87, P=0.70), and HMM is confident at 0.75. All three signals exceed 0.5, so agreement is 100%. Ensemble P(Bear) of 0.69 triggers risk-off. This is the scenario the system is designed for—clear, confirmed danger.

Case 4: Signal Disagreement

Inputs:  HMM=0.80, VIX=18, RVol=14%, Baseline=15%
Signals: HMM=0.80, VIX=0.30, Vol=0.20
Ensemble P(Bear): 0.48
Allocation: NEUTRAL (confidence: 0.50)

This is the interesting case. HMM screams Bear at 0.80—maybe it’s picking up on subtle return patterns or volatility clustering. But VIX is calm at 18 (just at the threshold, P=0.30) and realized vol is actually below baseline (P=0.20).

The weighted average gives P(Bear) = 0.48, which would normally trigger neutral. But here’s where confirmation matters: only 1 of 3 signals (33%) agrees on Bear. The system recognizes this disagreement and stays neutral rather than acting on a single outlier signal.

This prevents a common failure mode: the HMM occasionally gives false positives during choppy-but-not-bearish markets. Without confirmation, you’d whipsaw into risk-off and back. With confirmation, you wait for corroborating evidence.

Case 5: VIX Spike, Vol Lag

Inputs:  HMM=0.50, VIX=38, RVol=20%, Baseline=15%
Signals: HMM=0.50, VIX=0.90, Vol=0.40
Ensemble P(Bear): 0.61
Agreement: 33%
Allocation: NEUTRAL (confidence: 0.50)

Another disagreement scenario. VIX has spiked to 38 (P=0.90)—this often happens on a single large down day. But realized vol over 20 days hasn’t caught up yet (ratio = 1.33, P=0.40), and HMM is ambivalent at 0.50.

The weighted average gives 0.61, which would normally trigger risk-off. But only VIX (1 of 3) signals Bear > 0.5. The confirmation layer overrides: with only 33% agreement on a risk-off call, the system stays neutral.

This is intentional. A VIX spike on one bad day isn’t the same as a sustained bear market. By requiring confirmation, we avoid overreacting to single-day events while still responding quickly when multiple signals align.

Key Takeaways from the Output:

Agreement matters as much as level. P(Bear) = 0.65 with 100% agreement means something different than P(Bear) = 0.65 with 33% agreement.
Confirmation prevents whipsaws. Single-signal outliers get filtered out. You need corroboration for high-conviction calls.
Neutral is a valid allocation. When signals conflict, the right move is often to reduce exposure and wait—not to pick a side.
Hysteresis prevents thrashing. Small moves in P(Bear) don’t trigger allocation changes. You need a meaningful shift to overcome the buffer.

Full script: The complete implementation with synthetic data generation, all three models, and comparison plots is available in ensemble_regime_detector.py (link in the GitHub repo).

ensemble_regime_detector.py

All code is open source — use it, modify it, build on it. No guarantees about correctness or performance. Test everything yourself before deploying with real capital.

The Chess Principle

In chess, there’s a saying: “The threat is stronger than the execution.” A piece threatening multiple squares controls the board more effectively than one committed to a single attack.

Ensemble regime detection follows this principle. Rather than committing to one model’s view of the market, we maintain multiple “threats” — multiple hypotheses about the current regime. When they align, we act decisively. When they disagree, we stay flexible.

The goal isn’t to be right about every regime change. It’s to avoid being catastrophically wrong. An ensemble that’s slow but robust beats a single model that’s fast but fragile.

Summary

Changepoint Detection
  CUSUM: simple, interpretable, threshold-based
  BOCPD: probabilistic, handles uncertainty, more complex
  Different framing than HMM (change vs. state)

Machine Learning
  Features matter more than algorithms
  Realized vol is the most important feature
  Watch out for overfitting with few bear market samples

Ensemble Methods
  Combine multiple signals to reduce single-point failures
  Weighted average for smooth probability
  Confirmation for reducing false positives

What I'm Evaluating
  3 signals: HMM (40%), VIX (35%), Vol breakout (25%)
  Confirmation layer for high-confidence calls
  Hysteresis to prevent thrashing
  Simple, interpretable, testable—still validating before live deployment

This concludes the Regime Detection series. The key insight across all three parts: regime detection is fundamentally about trading off speed vs. accuracy, and no method eliminates the detection lag inherent in statistical inference. The practical solution is building systems that degrade gracefully—ensembles that don’t fail catastrophically when one component breaks. Whether this specific framework survives out-of-sample testing remains to be seen.

As always: Alpha is never guaranteed. And the backtest is a liar until proven otherwise.

The material presented in Math & Markets is for informational purposes only. It does not constitute investment or financial advice.

Machina Quanta

Feb 7

Always cool to see these experiments!

Michael

Jan 30

I started with HMM for regime detection though probabilities were poorly calibrated for my setup, so I moved to supervised transformers. I haven't explored the changepoint methods (CUSUM, BOCPD) at all. Makes me wonder if combining approaches would've worked better than abandoning HMM entirely. Definitely some stuff to tinker with the weekend. Thanks for the great read.

Math & Markets

Discussion about this post

Ready for more?