Three-Phase Backtesting Protocol Stops Curve Fitting

You hand a swing system fourteen years of daily bars, sweep eight parameter knobs, and the equity curve climbs at a 38-degree slope with a 1.92 Sharpe and a 9.4 percent maximum drawdown. You go live in March. By August the strategy is flat. By November it is bleeding. Nothing about the market broke. The backtest broke before the first trade was ever placed.

That gap between a printed curve and a real one is the problem the three-phase backtesting protocol exists to close. The protocol splits the historical record into three roles instead of one. An in-sample window where you are allowed to optimise. A walk-forward stage where rolling re-optimisation tests how stable those parameter choices really are. A locked out-of-sample holdout that you never touch until the very end. Used together, the three windows turn a single confident curve into a much smaller, much more honest estimate of edge.

What follows is the practitioner version of that protocol. No formulas, just the discipline.

Why a single history is not the three-phase backtesting protocol

The instinctive way to test a strategy is to take the longest history you can find, run your full parameter grid across it, and pick the variant that produced the best result. The problem is that you have used the entire dataset to choose your rules. The curve you are looking at is not a forecast of how the system performs in the future. It is the upper envelope of how it performed on the exact data you tuned it against.

The longer the history and the wider the parameter sweep, the more brutal this gets. A grid with three knobs of ten values each is a thousand candidate strategies. By pure chance, at least one of them will look spectacular on the history it was selected from, and a walk-forward analysis done badly will not flag it. That is the curve-fit. A single contiguous backtest cannot detect it by construction, because there is no honest data left to detect it with.
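The selection effect is easy to see on pure noise. The sketch below is an illustration, not part of the protocol: it assumes 1,000 candidate strategies with zero true edge, 500 daily bars each, and normally distributed returns with a 1 percent daily standard deviation. All of those numbers and both function names are made up for the demonstration.

```python
import random
import statistics


def annualised_sharpe(daily_returns, bars_per_year=252):
    """Annualised Sharpe of a daily return series, risk-free rate taken as zero."""
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return (mean / stdev) * bars_per_year ** 0.5


def best_sharpe_from_noise(n_candidates, n_bars, seed=0):
    """Generate zero-edge return series and report the best and the average Sharpe."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_candidates):
        # Every candidate has an expected return of exactly zero.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_bars)]
        sharpes.append(annualised_sharpe(returns))
    return max(sharpes), statistics.fmean(sharpes)


best, average = best_sharpe_from_noise(n_candidates=1000, n_bars=500)
print(f"average Sharpe: {average:.2f}, best of 1000: {best:.2f}")
```

The average sits near zero, which is the truth. The best of the thousand does not, and the best is the one a full-history sweep hands you.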

The three-phase backtesting protocol partitions the problem. You sacrifice some of your history to a holdout you are not allowed to touch. You spend the rest on a fitting window and a rolling stability test. The number it produces is smaller. It is also far closer to what live trading will deliver.
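A minimal sketch of that partition, assuming the history is a chronologically ordered list of bars. The 25 percent holdout fraction and the function name are illustrative assumptions; the protocol itself sizes the holdout in trades, not in a fixed share of bars.

```python
def partition_history(bars, holdout_fraction=0.25):
    """Split an ordered history into a working set and a locked, most-recent holdout."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    cut = int(len(bars) * (1.0 - holdout_fraction))
    # No shuffling: the holdout must be strictly later than everything used for fitting.
    return bars[:cut], bars[cut:]


bars = list(range(100))  # stand-in for 100 daily bars
working, holdout = partition_history(bars)
print(len(working), len(holdout))
```

The working set feeds phases one and two. The holdout goes in a drawer until phase three.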

Phase one of the three-phase backtesting protocol: in-sample optimisation

The in-sample period is where you are allowed to look at outcomes and adjust. You choose the indicator family, you set the parameter grid, you decide on entry and exit logic. Every modelling choice gets made here.

Two sizing rules carry most of the weight. The first is sample size measured in trades, not in years. A daily swing system that averages twelve trading days per round-trip needs to log at least 200 completed in-sample trades for parameter selection to mean anything. Below that, the optimiser is fitting to noise. The second is parameter discipline. I cap each parameter at five values during the sweep. Five times five times five is 125 candidates. Eight values across three knobs is 512. The candidate count is the multiplier on your false-positive rate, and it compounds harder than people expect.
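The candidate arithmetic can be sketched in a few lines. The false-positive function is my own rough illustration, not something the protocol prescribes: it assumes every candidate is an independent test at a 5 percent significance level, which neighbouring grid points are not, so read it as a rough upper bound.

```python
import math


def candidate_count(values_per_knob):
    """Number of strategies in a full grid: the product of the per-knob value counts."""
    return math.prod(values_per_knob)


def chance_of_false_positive(n_candidates, alpha=0.05):
    """Chance that at least one zero-edge candidate passes, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_candidates


for grid in ([5, 5, 5], [8, 8, 8]):
    n = candidate_count(grid)
    print(grid, n, round(chance_of_false_positive(n), 4))
```

Going from five to eight values per knob looks like a small change and quadruples the number of chances to get lucky.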

The output of phase one is not a winning strategy. It is a parameter region. If the best Sharpe sits at a thirty-bar lookback and a 1.5 ATR stop, but the second, third, and fourth best variants all sit within two bars and a tenth of an ATR of that point, you have a region. If the top variant is one tight peak surrounded by losers, you have a fluke. Regions survive walk-forward. Peaks do not.
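One way to make the region-versus-peak test mechanical is sketched below. It assumes the sweep results are a dictionary keyed by (lookback, ATR stop) with Sharpe as the value. The function name, the data, and the default tolerances (two bars, a tenth of an ATR, top four variants) are taken from the example above and are illustrative rather than fixed rules.

```python
def is_region(results, top_k=4, max_lookback_gap=2, max_stop_gap=0.1):
    """True if the top_k variants all sit close to the best one in parameter space."""
    ranked = sorted(results, key=results.get, reverse=True)
    best_lookback, best_stop = ranked[0]
    eps = 1e-9  # guards against float noise such as 1.6 - 1.5 != 0.1 exactly
    return all(
        abs(lookback - best_lookback) <= max_lookback_gap
        and abs(stop - best_stop) <= max_stop_gap + eps
        for lookback, stop in ranked[1:top_k]
    )


# Hypothetical sweep results: (lookback bars, ATR stop) -> in-sample Sharpe.
plateau = {(30, 1.5): 1.90, (31, 1.5): 1.85, (29, 1.5): 1.80, (30, 1.6): 1.78, (40, 2.0): 0.30}
peak = {(30, 1.5): 1.90, (50, 3.0): 1.20, (10, 0.5): 1.10, (31, 1.5): -0.20, (29, 1.5): -0.40}

print(is_region(plateau), is_region(peak))
```

The second grid has the same best Sharpe as the first. Its nearest neighbours lose money, which is the signature of a fluke.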

Concrete negative: a high in-sample Sharpe is not a result. It is the starting condition for the next two phases. Treating it as a result is how good engineers ship terrible systems.

Phase two of the three-phase backtesting protocol: walk-forward analysis

Walk-forward is the bridge between fitting and forecasting. You split the in-sample period into a series of rolling windows. On window one, you re-run the optimisation. You take the winning parameter set, you apply it forward to a small look-ahead chunk, you record the result. You slide both windows forward, and you do it again. Window after window, the optimisation is allowed to redecide, but the forward-applied trades are forced to live with the previous decision.

Window sizing depends on the holding period. The fitting window should hold at least 60 to 100 in-window trades. The forward step should hold at least 15 to 25 trades. For a system averaging twelve bars per round-trip and carrying two or three positions at a time, that translates to roughly an eighteen-month fit and a five-month step. A system that holds only one position at a time needs windows about two and a half times longer to reach the same trade counts. Anchored walk-forward keeps the start of the fit window fixed and moves only the end. Rolling walk-forward moves both ends. Anchored is gentler on systems whose edge degrades slowly. Rolling is harsher and catches regime drift sooner. I use rolling by default.
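The two windowing schemes differ by one line. This is a minimal sketch that works in bar indices, with half-open ranges and a forward step equal to the test length. The function name and the example sizes are assumptions for illustration.

```python
def walk_forward_windows(n_bars, fit_bars, step_bars, anchored=False):
    """Yield (fit_start, fit_end, test_start, test_end) as half-open bar index ranges."""
    fit_end = fit_bars
    while fit_end + step_bars <= n_bars:
        # Anchored: the fit always starts at bar 0. Rolling: the fit keeps a fixed length.
        fit_start = 0 if anchored else fit_end - fit_bars
        yield (fit_start, fit_end, fit_end, fit_end + step_bars)
        fit_end += step_bars


rolling = list(walk_forward_windows(1000, fit_bars=400, step_bars=100))
anchored = list(walk_forward_windows(1000, fit_bars=400, step_bars=100, anchored=True))
print(rolling[0], rolling[-1])
print(anchored[0], anchored[-1])
```

In both schemes the test range starts exactly where the fit range ends, so the forward trades never overlap the data that chose their parameters.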

What the walk-forward output really tells you is parameter stability. If every window picks roughly the same parameter set, the strategy has a stable preference and you have found something real. If each window’s optimum jumps wildly from the last, the surface is flat and the optimiser is choosing among look-alike candidates. That second case looks fine on the equity curve and fails the moment live regimes shift, because there was never a stable peak to lock onto.
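A crude stability score, as a sketch: how much of the available grid range did the per-window winners wander across? The metric, the function name, and the example values are my own illustration. Nothing here is a standard measure.

```python
def parameter_drift(chosen_values, grid_values):
    """Share of the grid range spanned by the per-window optimal values (0 = rock steady)."""
    grid_span = max(grid_values) - min(grid_values)
    if grid_span == 0:
        raise ValueError("grid must contain at least two distinct values")
    return (max(chosen_values) - min(chosen_values)) / grid_span


lookback_grid = list(range(10, 51, 2))   # hypothetical grid: 10, 12, ..., 50 bars
stable_picks = [30, 30, 32, 28, 30]      # winners from five walk-forward windows
jumpy_picks = [10, 50, 20, 44, 16]

print(parameter_drift(stable_picks, lookback_grid))
print(parameter_drift(jumpy_picks, lookback_grid))
```

A score near zero means the windows agree. A score near one means the optimiser used the whole grid, which is what a flat surface looks like.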

This stage is also where survivorship bias in your backtest universe can quietly inflate every window. If your ticker list is today’s index members, you have already filtered out the failures. Use a point-in-time membership list or accept that the result is an optimistic ceiling.
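A point-in-time membership list can be as simple as a table of add and remove dates. The sketch below assumes that structure. The tickers, dates, and names are invented for illustration.

```python
from datetime import date
from typing import NamedTuple, Optional


class Membership(NamedTuple):
    ticker: str
    added: date
    removed: Optional[date]  # None means still a member today


def universe_on(memberships, as_of):
    """Tickers that were index members on the given date, not as of today."""
    return sorted(
        m.ticker
        for m in memberships
        if m.added <= as_of and (m.removed is None or as_of < m.removed)
    )


memberships = [
    Membership("AAA", date(2005, 1, 3), None),
    Membership("BBB", date(2005, 1, 3), date(2012, 6, 1)),  # dropped from the index later
    Membership("CCC", date(2015, 3, 2), None),
]

print(universe_on(memberships, date(2010, 1, 4)))
print(universe_on(memberships, date(2020, 1, 2)))
```

A backtest run on today's list would trade CCC in 2010 and never see BBB at all. Both errors flatter the curve.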

Phase three of the three-phase backtesting protocol: the out-of-sample holdout

The out-of-sample holdout is the part of history you locked away at the beginning. You did not look at it during phase one. You did not let walk-forward touch it. The parameters chosen by phases one and two are now frozen, applied once to the holdout, and the result is read once. That is the entire protocol for phase three. One pass. No tweaks.
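The one-pass rule can be enforced in code rather than by willpower. This is a sketch of the idea, not a real library: the class, the toy strategy, and the `size` parameter are all hypothetical.

```python
class LockedHoldout:
    """Wraps holdout data so it can be evaluated exactly once."""

    def __init__(self, bars):
        self._bars = list(bars)
        self._used = False

    def evaluate_once(self, strategy, frozen_params):
        if self._used:
            raise RuntimeError("holdout already read; it is no longer out-of-sample")
        self._used = True
        return strategy(self._bars, frozen_params)


def toy_strategy(bars, params):
    # Placeholder logic: total of bar returns scaled by a hypothetical size parameter.
    return sum(bars) * params["size"]


holdout = LockedHoldout([0.01, -0.02, 0.03])
result = holdout.evaluate_once(toy_strategy, {"size": 1.0})
print(round(result, 4))
```

A second call to `evaluate_once` raises. That is the point: the failure is loud instead of silent.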

The discipline is procedural, not technical. The moment you let a poor holdout result push you back into phase one to “improve” the strategy, the holdout is no longer out-of-sample. It has been seen. Once seen, it cannot be unseen, and the next holdout you carve out of the same dataset will be smaller, more recent, and more dangerous. The only honest response to a failed holdout is to abandon the strategy or to walk all the way back to a different idea that asks a different question.

How big should the holdout be? Two answers, both measured in trades. The minimum is 30 to 40 completed trades, below which the result is dominated by sampling noise. The preference is closer to 80 to 100, because that is where confidence intervals start tightening enough to distinguish a real edge from a flat zero. A four-year holdout with 18 trades is a coin flip. A two-year holdout with 80 trades is information.
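Why trade count beats calendar length shows up in the width of the interval around the mean trade. The sketch below uses a normal approximation with z = 1.96, which is optimistic for very small samples, and an invented trade profile in R multiples. Both are assumptions for illustration.

```python
import math
import statistics


def mean_trade_interval(trade_returns, z=1.96):
    """Approximate 95% interval for the mean per-trade return (normal approximation)."""
    n = len(trade_returns)
    mean = statistics.fmean(trade_returns)
    half_width = z * statistics.stdev(trade_returns) / math.sqrt(n)
    return mean - half_width, mean + half_width


# Same hypothetical per-trade profile, two different sample sizes.
profile = [1.7, -1.0]
low18, high18 = mean_trade_interval(profile * 9)    # 18 trades
low80, high80 = mean_trade_interval(profile * 40)   # 80 trades

print(f"18 trades: {low18:.2f} to {high18:.2f}")
print(f"80 trades: {low80:.2f} to {high80:.2f}")
```

With 18 trades the interval straddles zero. With 80 trades of the same profile it does not. The edge did not change, only the amount of evidence.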

Concrete negative: a positive OOS Sharpe with 11 trades is not evidence of edge. It is one good streak that survived random chance. The protocol cannot rescue a sample size that small.

Failure modes the three-phase backtesting protocol cannot rescue

Splitting the data into three phases prevents specific errors. It does not prevent all errors. Four leaks make the protocol cosmetic if you let them in.

The first is parameter sprawl. Five values per knob is discipline. Twelve values per knob is a multiple-comparisons problem dressed up as research. Even a clean OOS pass means less when you started with 4,000 candidates than when you started with 60.

The second is reusing the holdout. The first time you re-touch the OOS window to fix a strategy, it stops being out-of-sample for that strategy family. Keep a written log of which idea touched which holdout. Carve a fresh holdout from a different period for a fresh question.

The third is look-ahead bias built into the inputs themselves. If your feature uses end-of-day data that was not available until two hours after the bar closed, your backtest is reading the future. The cleanest test is to ask, for every input value at bar T, exactly when it was knowable in real time. Anything that becomes knowable after the trade decision is contamination.
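The knowability question can be asked in code if every input carries the timestamp at which it became available. A minimal sketch under that assumption follows. The field names, input names, and times are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class InputValue(NamedTuple):
    name: str
    available_at: datetime  # when the value was actually knowable in real time


def lookahead_leaks(inputs, decision_time):
    """Names of inputs that were not yet knowable when the trade decision was made."""
    return [item.name for item in inputs if item.available_at > decision_time]


decision_time = datetime(2024, 3, 1, 16, 0)  # decision taken at the bar close
inputs = [
    InputValue("close", datetime(2024, 3, 1, 16, 0)),
    InputValue("adjusted_volume", datetime(2024, 3, 1, 18, 0)),  # published two hours later
]

print(lookahead_leaks(inputs, decision_time))
```

Any name in that list either gets lagged by a bar or gets dropped from the feature set.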

The fourth is reporting Sharpe as a single number when the per-window dispersion was huge. A 1.4 average Sharpe across walk-forward windows that ranged from minus 0.3 to plus 3.1 is not the same animal as a 1.4 average from windows that ranged 1.1 to 1.7. The second is a strategy. The first is a coin that flips harder than it should.
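Reporting the spread next to the average takes four lines. The per-window Sharpe values below are invented so that both sets average 1.4 and span the ranges described above. They are not results from any real system.

```python
import statistics


def window_summary(window_sharpes):
    """Average, range, and dispersion of per-window walk-forward Sharpe ratios."""
    return {
        "mean": statistics.fmean(window_sharpes),
        "min": min(window_sharpes),
        "max": max(window_sharpes),
        "stdev": statistics.stdev(window_sharpes),
    }


wide = window_summary([-0.3, 3.1, 1.4, 0.6, 2.2])
narrow = window_summary([1.1, 1.7, 1.4, 1.3, 1.5])

for label, summary in (("wide", wide), ("narrow", narrow)):
    print(label, {key: round(value, 2) for key, value in summary.items()})
```

Same headline number, very different strategies. If the report only has room for one extra figure, make it the minimum.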

The checklist I run before trusting any three-phase backtesting protocol result

Once a strategy has gone through the protocol, the result needs a final read. The check is procedural and ordered.

How many in-sample trades drove parameter selection? Below 200, treat the choice as provisional.
How wide was the parameter grid? Multiply the value counts. If it crossed 500 candidates, the OOS bar has to rise.
Did walk-forward produce a stable parameter region, or did each window jump? Stability matters more than the headline Sharpe.
How many out-of-sample trades drove the holdout result? Below 30, the result is not yet data.
What is the OOS drawdown profile compared to in-sample? If the holdout drawdown is materially deeper, the in-sample fit was hiding tail risk.
What is the trade distribution? A strategy that owes its OOS Sharpe to two outsized winners is fragile to the next missed signal.
How does the OOS performance hold under Monte Carlo resampling of the equity curve? A real edge survives bootstrapped reordering. A fragile one collapses.
Were the inputs strictly point-in-time, with no look-ahead?
Was the universe point-in-time, or today’s survivors?
Has the OOS window been touched before for any earlier variant of this idea?
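The Monte Carlo item in the checklist above can be sketched as a bootstrap over completed trades. This version assumes per-trade results are additive (for example R multiples), resamples them with replacement, and reports three summary figures. The function names, the 2,000 resamples, and both example trade lists are illustrative assumptions.

```python
import random


def max_drawdown(trade_returns):
    """Largest peak-to-trough drop of the cumulative (additive) equity curve."""
    equity = peak = worst = 0.0
    for r in trade_returns:
        equity += r
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def bootstrap_trades(trade_returns, n_resamples=2000, seed=0):
    """Resample trades with replacement and summarise totals and drawdowns."""
    rng = random.Random(seed)
    n = len(trade_returns)
    totals, drawdowns = [], []
    for _ in range(n_resamples):
        sample = rng.choices(trade_returns, k=n)
        totals.append(sum(sample))
        drawdowns.append(max_drawdown(sample))
    totals.sort()
    drawdowns.sort()
    return {
        "p05_total": totals[int(0.05 * n_resamples)],
        "p95_drawdown": drawdowns[int(0.95 * n_resamples)],
        "share_profitable": sum(t > 0 for t in totals) / n_resamples,
    }


steady = bootstrap_trades([1.0, 1.0, -1.0] * 30)       # many modest winners
fragile = bootstrap_trades([-0.2] * 38 + [6.0, 6.0])   # two outsized winners carry it
print(steady)
print(fragile)
```

Both example lists are profitable as recorded. Only one stays profitable when the two big winners are no longer guaranteed to show up.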

Anything that fails one of those checks is not a system to fund. It is a hypothesis to either fix at the design layer or shelve. The discipline echoes how Ed Seykota framed systematic trading from the start: a system you cannot test honestly is a system you cannot trust under pressure.

From confidence trap to filter with the three-phase backtesting protocol

A backtest done as one continuous pass tells you what the best variant of your strategy did on the data you tuned it on. That is a confidence trap, and it has emptied more accounts than any one bad trade ever has. The three-phase backtesting protocol does something different. It tells you, with a smaller and more sober number, what a frozen version of your strategy would have done on a window it never saw. That is a filter. The number is lower. The decision it supports is far better.

The discipline is annoying. You will sometimes spend a month on a strategy that posts a beautiful in-sample curve and dies in walk-forward. Good. That month is the month you did not put live capital into a curve-fit. Learn the pattern. Ride the trend. Keep the gains.

Educational content only. Not investment advice. Trading involves risk. You are responsible for your decisions.
