Lookahead Bias: From 3 Strategies to 1

Part 31 talks about the importance of backtesting in filtering out biases

Nov 23, 2025

This is part 31 of my series — Building & Scaling Algorithmic Trading Strategies

The Harsh Truth About Backtesting

I thought I had three exceptional trading strategies. The numbers looked incredible: CAGRs ranging from 31% to 58%, Sharpe ratios as high as 2.92, and ROIs that would make any quant fund jealous. Then I did what every trader should do but many don’t: I looked for lookahead bias.

By the end of this analysis, only one strategy remained standing. The other two collapsed spectacularly when I fixed critical flaws in my backtesting logic.

The Starting Lineup

I had narrowed my research down to three promising strategies:

1. Two-Factor Dual Allocator (V3)

A volatility-enhanced allocation strategy that dynamically adjusts leverage between SPY and leveraged ETFs based on market conditions.

Initial metrics (≈10 years):

ROI: 14.48x
CAGR: 31.5%
Sharpe: 1.08
Max Drawdown: 31.2%

2. VIX Term Structure Strategy

A mean-reversion play on the VIX futures curve, going long XIV (short VIX) when the curve is in contango and flat otherwise.

Initial metrics (≈5.3 years):

ROI: 11.28x
CAGR: 57.7%
Sharpe: 0.70
Max Drawdown: 28.0%

3. SPY/VIX Synthetic Toggle

A machine learning model that predicts when to be long SPY vs. flat based on VIX and market dynamics.

Initial metrics (≈9.7 years):

ROI: 31.96x
CAGR: 43.3%
Sharpe: 2.92
Max Drawdown: 10.9%

Initial Takeaways

Looking at these numbers, the SPY/VIX toggle seemed almost too good to be true: a Sharpe of 2.92 with only 11% drawdown? That’s institutional-grade performance. The Dual Allocator delivered solid returns with reasonable risk, while the VIX strategy showed explosive growth (though over a suspiciously short period).

Numbers like these almost always mean some bias or overfitting is hiding in the backtest.
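
A cheap first check is whether the headline numbers even agree with each other. The sketch below recomputes the CAGR implied by each ROI and backtest length quoted above. It assumes "ROI" means the ending equity multiple (if it means net gain, the implied figures shift up slightly), and `implied_cagr` is a hypothetical helper name. The backtest lengths are approximate, so small gaps are expected.

```python
def implied_cagr(multiple: float, years: float) -> float:
    """Annualized growth rate implied by an ending equity multiple."""
    return multiple ** (1.0 / years) - 1.0

# (ending multiple, approximate years, reported CAGR) taken from the metrics above.
# Assumption: ROI is read as ending equity / starting equity.
reported = {
    "Dual Allocator V3": (14.48, 10.0, 0.315),
    "VIX Term Structure": (11.28, 5.3, 0.577),
    "SPY/VIX Toggle": (31.96, 9.7, 0.433),
}

for name, (multiple, years, cagr) in reported.items():
    implied = implied_cagr(multiple, years)
    print(f"{name}: implied {implied:.1%}, reported {cagr:.1%}, gap {implied - cagr:+.1%}")
```

A large gap would point to a reporting or bookkeeping error. A small gap says nothing about whether the backtest itself is honest, which is what the rest of this post is about.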

The Overfitting Question

Before committing capital, I asked myself: Are these numbers real or artifacts of data mining?

My initial plan was sensible:

Primary sleeve: Use the SPY/VIX Synthetic Toggle with walk-forward validation
Satellite sleeve: Deploy Dual Allocator V3 with tighter risk controls
Exploratory: Keep VIX term structure on the bench until more data validated it

But I needed to do one more thing first: stress-test each strategy for the most common backtesting errors.

The Autopsy Begins

Question 1: Is the SPY/VIX Toggle Too Good to Be True?

The First Red Flag

I ran a sanity check by flipping the trading logic. The strategy was coded as “go long when prob_long < 0.5”—which seemed backwards. When I tested the natural rule (long when prob_long > 0.5), the results were catastrophic:

ROI: 0.11x (lost 89% of capital)
CAGR: -20.6%
Sharpe: -1.76
Max Drawdown: -91%

The Problem: The model’s probability labels were inverted relative to the trading rule. The “32x ROI” wasn’t skill—it was accidentally trading against a broken model and getting lucky.
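
A minimal sketch of this kind of flip test, on made-up data and with hypothetical function names (not the code behind the numbers above). It assumes the positions are already aligned so that each one was decided before its return was known.

```python
def long_flat_positions(prob_long, threshold=0.5, invert=False):
    """1.0 = long, 0.0 = flat. invert=True reproduces the 'long when p < threshold' rule."""
    if invert:
        return [1.0 if p < threshold else 0.0 for p in prob_long]
    return [1.0 if p > threshold else 0.0 for p in prob_long]

def equity_multiple(positions, returns):
    """Compound returns, assuming positions[i] was fixed before returns[i] was known."""
    equity = 1.0
    for pos, ret in zip(positions, returns):
        equity *= 1.0 + pos * ret
    return equity

# Toy data (made up): probabilities that are anti-correlated with the next return,
# which is what a model with inverted labels would produce.
prob_long = [0.2, 0.8, 0.3, 0.7, 0.1, 0.9]
next_ret = [0.01, -0.02, 0.015, -0.01, 0.02, -0.03]

natural = equity_multiple(long_flat_positions(prob_long), next_ret)
flipped = equity_multiple(long_flat_positions(prob_long, invert=True), next_ret)
print(f"natural rule: {natural:.3f}x, flipped rule: {flipped:.3f}x")
```

If the rule that matches the model's stated meaning loses money while its mirror image prints money, the first thing to suspect is the labels, not the market.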

The Fix

I rebuilt the strategy from scratch:

Replaced XGBoost with a simple logistic regression
Aligned the exposure rule: long when prob_long > threshold
Added proper train/test split with grid search on training data only
Best threshold found: 0.3
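
The key discipline in that list is that the threshold is chosen on training data only and the test data is touched once. Here is a rough sketch of that step alone. It skips the model fitting entirely: the probabilities are toy values generated for illustration, the function names are hypothetical, and scoring by Sharpe with a zero risk-free rate is an assumption rather than a statement of what the original grid search optimized.

```python
import random

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio with a zero risk-free rate (a simplification)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return 0.0 if var == 0 else mean / var ** 0.5 * periods_per_year ** 0.5

def long_flat_returns(prob_long, next_returns, threshold):
    """Long when prob_long > threshold, flat otherwise."""
    return [r if p > threshold else 0.0 for p, r in zip(prob_long, next_returns)]

def pick_threshold(prob_long, next_returns, grid):
    """Grid search scored on whatever data is passed in; pass training data only."""
    return max(grid, key=lambda t: sharpe(long_flat_returns(prob_long, next_returns, t)))

# Toy data (made up): noisy probabilities weakly related to the next return.
rng = random.Random(0)
next_ret = [rng.gauss(0.0005, 0.01) for _ in range(500)]
prob_long = [min(1.0, max(0.0, 0.5 + 10 * r + rng.gauss(0, 0.2))) for r in next_ret]

cut = len(next_ret) * 7 // 10  # chronological split, no shuffling
train_p, test_p = prob_long[:cut], prob_long[cut:]
train_r, test_r = next_ret[:cut], next_ret[cut:]

grid = [0.3, 0.4, 0.5, 0.6, 0.7]
best = pick_threshold(train_p, train_r, grid)        # chosen on training data only
test_rets = long_flat_returns(test_p, test_r, best)  # test data touched once
print(best, round(sharpe(test_rets), 2))
```

The split is chronological on purpose: shuffling daily rows before splitting would let the training set see days that come after the test days.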

The Truth (after fix):

Full period: 2.80x equity, 11.2% CAGR, 0.69 Sharpe, -34.1% drawdown
Test period: 1.79x equity, 22.7% CAGR, 1.39 Sharpe, -19.0% drawdown
Buy-and-hold SPY: 3.55x equity, 13.9% CAGR, 0.81 Sharpe, -34.1% drawdown

The model achieved ~97-100% exposure (essentially always long) and barely beat buy-and-hold on the training set while underperforming on the full period. The 32x ROI was pure fantasy.

Verdict: ❌ Strategy eliminated

Question 2: What About the Other Two?

I documented my concerns:

Dual Allocator V3:

Positions applied to same-day returns (lookahead bias)
Missing borrow/financing costs for leveraged ETFs

VIX Term Structure:

No borrow costs despite 5x leverage on short positions
Only 5.3 years of data (small sample)
Sample dominated by 2020-2022 volatility regime

Both needed immediate fixes.

The Dual Allocator Collapse

The Lookahead Problem

The original backtest calculated signals using end-of-day data and then applied those positions to that same day’s returns. This is a classic lookahead bias: you’re implicitly assuming you knew the day’s price movement before it happened.

The Fix: Shift execution by one day. Signals computed at close of day T apply to returns from day T+1.
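
In code, the fix is a one-bar lag on the position series. The sketch below uses plain lists and a hypothetical `lag_positions` helper. With pandas the same idea is usually written as `positions.shift(1)`. The toy "signal" deliberately peeks at the same day's return to show how much the lag matters.

```python
def lag_positions(positions, lag=1, fill=0.0):
    """Delay positions by `lag` bars: the signal from day T is held on day T+lag."""
    if lag <= 0:
        return list(positions)
    return [fill] * min(lag, len(positions)) + list(positions[:-lag])

def compound(positions, returns):
    """Ending equity multiple from position-weighted daily returns."""
    equity = 1.0
    for pos, ret in zip(positions, returns):
        equity *= 1.0 + pos * ret
    return equity

# Toy data (made up). The signal is long only on days that end up positive,
# so it can only be known after the close.
daily_ret = [0.01, -0.02, 0.03, -0.01, 0.02]
signal = [1.0 if r > 0 else 0.0 for r in daily_ret]

biased = compound(signal, daily_ret)                 # same-day application (lookahead)
honest = compound(lag_positions(signal), daily_ret)  # day T signal earns day T+1 return
print(f"same-day: {biased:.4f}x, lagged: {honest:.4f}x")
```

The first lagged position is filled with 0.0 (flat) because there is no prior signal to act on. That choice is an assumption, and it costs one day of exposure at the start of the backtest.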

The Results:

V3 (Volatility-Enhanced) Strategy:

Before fix: 1,447% return, 33.8% CAGR, 1.078 Sharpe ✨ FAKE
After fix: 251% return, 13.4% CAGR, 0.514 Sharpe 📉 REAL
Impact: total return cut by roughly 83%
Verdict: Still profitable but underperforms buy-and-hold

V5 (ML-Optimized) Strategy:

Before fix: 6,387,419% return, 202% CAGR, 3.97 Sharpe 🚀 FANTASY
After fix: -27.97% return, -3.2% CAGR, 0.043 Sharpe 💀 DISASTER
Impact: Complete collapse—the strategy loses money
Verdict: Total failure

Benchmarks (unchanged):

QQQ Buy & Hold: 417% return, 17.9% CAGR, 0.848 Sharpe
TQQQ Buy & Hold: 1,773% return, 34.1% CAGR, 0.779 Sharpe

The V5 Disaster: The ML optimization didn’t learn market patterns—it learned to exploit the lookahead bias. When I removed the bias, the emperor had no clothes. The SHAP analysis was picking up on future returns because they were leaking into the features.
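
One way to catch leaking features before a model learns from them is a truncation test: recompute each feature using only data up to day T and confirm the value at T does not change. The sketch below is a generic illustration with two toy features and hypothetical names. It is not the V5 feature pipeline.

```python
def trailing_mean(prices, window=3):
    """Causal feature: mean of the last `window` prices up to and including day t."""
    out = []
    for t in range(len(prices)):
        chunk = prices[max(0, t - window + 1): t + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def centered_mean(prices, window=3):
    """Leaky feature: the window is centered, so it reaches into future days."""
    half = window // 2
    out = []
    for t in range(len(prices)):
        chunk = prices[max(0, t - half): t + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def leaks_future(feature_fn, prices, tol=1e-12):
    """True if any feature value at day t changes when data after t is removed."""
    full = feature_fn(prices)
    for t in range(len(prices)):
        truncated = feature_fn(prices[: t + 1])
        if abs(truncated[t] - full[t]) > tol:
            return True
    return False

prices = [100.0, 101.0, 103.0, 102.0, 105.0, 107.0]  # made-up prices
print(leaks_future(trailing_mean, prices), leaks_future(centered_mean, prices))
```

The test is quadratic in the length of the series, so on real data it is usually enough to run it on a sample of dates rather than every day.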

Verdict: ❌ Both strategies eliminated

I am still hopeful that TLT hedging might resurrect this strategy, but for now it has been eliminated.

Question 3: The Last Strategy Standing

Only the VIX Term Structure strategy remained. But was it clean?

Potential Overstatement Factors:

Missing borrow costs: The backtest uses 5x leverage with no borrow fees. Real-world costs would include borrow fees for short VIXM positions and margin financing costs, which together could reduce returns by 200-500 bps annually
High leverage assumption: 5× notional with 50% position sizing assumes perfect margin availability—unlikely in stressed markets when you’d want the strategy most
Short sample bias: 5.3 years starting in 2020 captures the COVID volatility spike and subsequent regime. The 57.7% CAGR is inflated by this unusual period
Sharpe of only 0.70: Despite the high CAGR, the Sharpe ratio suggests significant volatility—consistent with a carry trade that occasionally blows up
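
To get a feel for what a 200-500 bps annual drag does, the sketch below subtracts a flat daily charge from a stand-in return path. Everything here is an assumption for illustration: the path is one year of constant daily returns that compound to the reported 57.7%, the drag is spread evenly across 252 trading days, and the function names are hypothetical. Real borrow fees vary and tend to spike in stressed markets.

```python
def apply_annual_drag(daily_returns, annual_drag_bps, days_per_year=252):
    """Subtract a flat financing/borrow drag, spread evenly across trading days."""
    daily_drag = (annual_drag_bps / 10_000) / days_per_year
    return [r - daily_drag for r in daily_returns]

def annualized(daily_returns, days_per_year=252):
    """Geometric annualized return of a daily return series."""
    equity = 1.0
    for r in daily_returns:
        equity *= 1.0 + r
    return equity ** (days_per_year / len(daily_returns)) - 1.0

# Stand-in path: constant daily returns compounding to the reported 57.7% CAGR.
gross = [1.577 ** (1 / 252) - 1.0] * 252

for bps in (200, 500):
    net = annualized(apply_annual_drag(gross, bps))
    print(f"{bps} bps drag: {annualized(gross):.1%} -> {net:.1%}")
```

On a real return series the same `apply_annual_drag` call would go between the position-weighted returns and the equity curve, ideally with a drag that scales with the actual exposure held each day.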

What’s Right:

Positions are already shifted by one day (no lookahead)
Trading costs included (5 bps + 15 bps slippage + 2 bps liquidity)
The strategy logic is sound: contango means VIX futures roll down over time
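
For reference, here is one common way costs like these get applied: charge the total (5 + 15 + 2 = 22 bps) on every unit of position change. Whether the original backtest charges per unit of turnover in exactly this way is an assumption, and `net_returns` is a hypothetical name.

```python
COST_BPS = 5 + 15 + 2  # base trading cost + slippage + liquidity, from the list above

def net_returns(positions, returns, cost_bps=COST_BPS):
    """Position-weighted returns minus cost_bps per unit of position change.

    Assumes positions are already lagged and the book starts flat (0.0).
    """
    net, prev = [], 0.0
    for pos, ret in zip(positions, returns):
        turnover = abs(pos - prev)
        net.append(pos * ret - turnover * cost_bps / 10_000)
        prev = pos
    return net

# Toy example: enter on day 1, hold on day 2, exit on day 3.
example = net_returns([1.0, 1.0, 0.0], [0.01, 0.0, 0.02])
print(example)
```

Entering and exiting each cost 22 bps here, while the holding day costs nothing, so a strategy that flips often pays this charge far more than one that rarely trades.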

Estimated Reality Check:

If I add realistic borrow costs and extend the backtest to cover 2004-2020 (which means reconstructing the VIX futures curve for those years), I’d expect:

CAGR to drop to 25-35% range
Sharpe to remain around 0.5-0.7
Max drawdown to potentially increase (2008, 2011 were rough for short vol)

Verdict: ⚠️ Strategy survives but needs conservative assumptions

The Key Lessons

1. Lookahead Bias is Everywhere

The most common mistake in backtesting is using information you wouldn’t have had at trade time. My two biggest culprits:

Same-day execution: Applying signals to same-day returns
Inverted logic: Building a model on one relationship and trading the opposite

2. Spectacular Returns Deserve Spectacular Scrutiny

A Sharpe ratio above 2.0 in long/flat equity strategies? Almost certainly wrong.
CAGR of 202%? You’ve found a bug, not alpha.
6+ million percent returns? Your code is broken.

If your backtest shows better risk-adjusted returns than Renaissance or Citadel, you probably have lookahead bias, not a billion-dollar strategy.

3. ML Can Learn Your Mistakes

The V5 strategy’s complete collapse was the most educational failure. The machine learning model achieved its amazing results by learning to exploit the lookahead bias, not by discovering real market patterns. When I removed the bias, the model had nothing left.

Lesson: ML models are incredibly good at finding patterns—including patterns in your bugs.

4. Simple Often Beats Complex

After all this analysis:

QQQ buy-and-hold: 17.9% CAGR, 0.848 Sharpe
My best “strategy”: 13.4% CAGR, 0.514 Sharpe (V3 after fixes)

The simple approach won. And that’s before considering:

Implementation complexity
Transaction costs
Slippage in real execution
Psychological difficulty of trading actively

5. Short Samples Lie

The VIX Term Structure strategy’s 57.7% CAGR came from a 5.3-year sample heavily weighted toward an unusual volatility regime. That’s not a robust estimate; it’s a regime-dependent outcome that will likely disappoint going forward.
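
A rough way to put a number on "short samples lie" is the standard error of the Sharpe ratio. The sketch below uses the textbook large-sample approximation for i.i.d. returns, applied to daily data and then annualized. The i.i.d. assumption is generous for a short-volatility strategy, so the real uncertainty is probably wider. `sharpe_standard_error` is a hypothetical helper name.

```python
import math

def sharpe_standard_error(annual_sharpe, years, periods_per_year=252):
    """Approximate standard error of an annualized Sharpe ratio, assuming i.i.d. returns."""
    per_period = annual_sharpe / math.sqrt(periods_per_year)
    n = years * periods_per_year
    se_per_period = math.sqrt((1.0 + 0.5 * per_period ** 2) / n)
    return se_per_period * math.sqrt(periods_per_year)

# Reported figures for the VIX Term Structure strategy: Sharpe 0.70 over about 5.3 years.
se = sharpe_standard_error(0.70, 5.3)
low, high = 0.70 - 1.96 * se, 0.70 + 1.96 * se
print(f"SE = {se:.2f}, rough 95% interval = [{low:.2f}, {high:.2f}]")
```

Under these assumptions the interval around 0.70 is wide enough to include zero, which is another way of saying 5.3 years cannot distinguish a decent strategy from noise.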

Final Thoughts

This is why you backtest and always critically evaluate your strategies. Counting the V5 variant, I tested four strategies and ended up with:

One that accidentally inverted its logic
Two that exploited lookahead bias
One that might work but needs conservative assumptions

If you’re backtesting your own strategies, I hope this post helps you find your bugs before the market does.

Leveraged ETFs carry significant risks and may not be suitable for all investors. The information presented in Math & Markets is not investment or financial advice and should not be construed as such.
