
How a Backtest Lies (And What to Do About It)

2026-07-01 · investing, quantitative, backtesting, statistics, overfitting, market-neutral

A confession before we start

We spent a long stretch of compute time trying to find an intraday trading edge in liquid US tech stocks. We looked at semiconductor names, we looked at the full XLK universe, we sliced them into groups of two, three, four, and five. By the end we had run roughly sixty-five million backtests.

The best group we found had a Sharpe ratio of 2.70, was positive in every single one of the three years in the sample, and included five completely reasonable-sounding tech names.

We do not recommend trading it. In fact, we recommend actively avoiding it. Here is why, and what the whole exercise taught us about how backtests trick people into losing money.

The idea that started it

The seed was straightforward. Within a tight universe of related stocks, say the semiconductor cluster, names occasionally dislocate from their peers over short windows. If Stock A drops one percent in the last thirty minutes while its twenty peers drift sideways, is that dislocation likely to revert by the close? If yes, you have a recipe for a market-neutral trade. Buy the laggard, short the leader, wait for the pack to close up, exit at the bell. Repeat every day. The market's overall direction cancels out. All you're capturing is the tiny elastic band pulling names back to their peers.
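As a rough sketch of that recipe, the snippet below turns each name's return over a recent window into a weight: names that lagged the peer average are bought, names that led are shorted. The function name `reversal_weights`, the tickers, and the returns are all hypothetical. The book here is only dollar-neutral (weights sum to zero); it makes no attempt to neutralize beta, and it ignores costs.

```python
from statistics import mean

def reversal_weights(window_returns):
    """Map ticker -> weight from returns over a recent window.

    Laggards (below the peer average) get positive weights, leaders get
    negative weights. Weights sum to zero and absolute weights sum to one.
    """
    peer_avg = mean(window_returns.values())
    # Dislocation: how far each name moved relative to the group.
    dislocation = {t: r - peer_avg for t, r in window_returns.items()}
    gross = sum(abs(d) for d in dislocation.values())
    if gross == 0:
        # Nobody dislocated, so there is nothing to trade.
        return {t: 0.0 for t in window_returns}
    # Bet against the dislocation, scaled to unit gross exposure.
    return {t: -d / gross for t, d in dislocation.items()}

# Hypothetical 30-minute returns: "AAA" dropped 1% while its peers drifted.
example = {"AAA": -0.010, "BBB": 0.001, "CCC": 0.000, "DDD": -0.001}
weights = reversal_weights(example)
for ticker, w in weights.items():
    print(f"{ticker}: {w:+.3f}")
```

The laggard ends up as the only long, funded by small shorts spread across the peers that held up.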

The academic literature calls this cross-sectional reversal, and it's a well-documented effect at longer horizons. The question was whether it worked at the intraday scale with a small basket of related tickers.

What we found (or thought we found)

The first cut looked wonderful. On a 24-name semiconductor cluster, a signal-weighted, beta-neutral book entered at noon and exited at the close produced a net Sharpe ratio of roughly 1.14 over 518 trading days, after realistic transaction costs measured from actual bid-ask spreads.

For context, most professional quant strategies would kill for a real Sharpe of one. This was the kind of number you want to see: positive in every year of the sample, market-neutral so it shouldn't blow up when the S&P coughs, a thin edge per trade but repeatable. Everything on paper said build it, paper-trade it, ship it.

We didn't ship it. Here's what stopped us.

Trap one: the edge lives in one crazy stock

The first thing to do with any multi-name strategy is ask what happens when you remove a single name. If the whole thing collapses without that one ticker, you don't have a diversified strategy. You have a bet on that ticker in a 24-name costume.

We did the leave-one-out test. Remove any given semiconductor name and the Sharpe barely wobbled. Except for one. Remove SMCI and the Sharpe fell from 1.14 to 0.59. Remove three of the most volatile names in the basket and it fell to 0.24. Restrict the universe to only the constituents of SMH, the mainstream semiconductor ETF, dropping the wildest AI-hardware names, and the Sharpe went to negative 0.14. The "diversified 24-name market-neutral strategy" was, to first approximation, a bet on SMCI reverting, in a 24-name costume.
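A minimal sketch of a leave-one-out check, assuming you already have each name's daily P&L contribution to the book. The function names and the toy numbers are hypothetical. Note the shortcut: this version only drops the name's P&L, whereas a stricter test would rerun the whole backtest on the reduced universe, because removing a name also changes the peer average the signal is built from.

```python
import math
from statistics import mean, stdev

def annualized_sharpe(daily_returns, periods_per_year=252):
    sd = stdev(daily_returns)
    if sd == 0:
        # Flat book: report zero rather than divide by zero.
        return 0.0
    return mean(daily_returns) / sd * math.sqrt(periods_per_year)

def leave_one_out_sharpes(pnl_by_name):
    """pnl_by_name: ticker -> list of daily P&L contributions (equal lengths).

    Returns ticker -> Sharpe of the book with that ticker removed.
    """
    names = list(pnl_by_name)
    n_days = len(pnl_by_name[names[0]])
    result = {}
    for dropped in names:
        book = [
            sum(pnl_by_name[t][d] for t in names if t != dropped)
            for d in range(n_days)
        ]
        result[dropped] = annualized_sharpe(book)
    return result

# Toy data: "WILD" carries all the profit, the two calm names cancel out.
pnl = {
    "WILD":  [0.004, -0.001, 0.005, -0.002, 0.006, -0.001],
    "CALM1": [0.001, -0.001, 0.001, -0.001, 0.001, -0.001],
    "CALM2": [-0.001, 0.001, -0.001, 0.001, -0.001, 0.001],
}
loo = leave_one_out_sharpes(pnl)
weakest_without = min(loo, key=loo.get)
print("Sharpe collapses most when dropping:", weakest_without)
```

If one ticker dominates that ranking the way SMCI did here, the book is a single-name bet.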

SMCI had a legendary 2024–2026 window: an accounting scandal, a near-delisting from a major exchange, the AI-server boom, wild multi-hundred-percent moves in both directions. Any strategy that fed on that name's volatility reverting would have looked great in that window. Whether it works when SMCI calms down, or has one big move that doesn't revert, is a very different question. And it is not a question you can answer from data that includes only the wild years.

The lesson: whenever a strategy leans on a small number of names, be honest about what you actually own. It's not a diversified portfolio. It's a concentrated bet with a lot of legs.

Trap two: the beautiful average that hides a regime flip

Even after removing SMCI, an interesting pattern remained. A broader cross-sectional reversion on the full XLK tech ETF universe (73 names) still showed a full-sample Sharpe of about 1.36.

Then we broke it out by year.

[Figure: Bar chart showing the same broad tech reversion strategy with a Sharpe of -1.55 in 2025 and +2.17 in 2026, averaging to a headline +1.36]

The strategy had a Sharpe of negative 1.55 in 2025 (a full calendar year) and positive 2.17 in the first half of 2026. The pretty +1.36 average is what you'd calculate on paper. It is not what anyone would have actually experienced. What they would have experienced is a year of bleeding, quitting the strategy in disgust, and then watching it work brilliantly for someone else the next year.
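The per-year breakdown is cheap to compute. Here is a small sketch, assuming the backtest output is a list of `(date, daily_return)` pairs; the helper names and the toy returns are hypothetical and are not the numbers from our search.

```python
import math
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

def annualized_sharpe(daily_returns, periods_per_year=252):
    if len(daily_returns) < 2:
        return float("nan")
    sd = stdev(daily_returns)
    if sd == 0:
        return float("nan")
    return mean(daily_returns) / sd * math.sqrt(periods_per_year)

def sharpe_by_year(dated_returns):
    """dated_returns: iterable of (datetime.date, daily_return) pairs."""
    by_year = defaultdict(list)
    for day, ret in dated_returns:
        by_year[day.year].append(ret)
    return {year: annualized_sharpe(rets) for year, rets in sorted(by_year.items())}

# Toy series (calendar days used for simplicity): a losing year, then a winning one.
toy = [(date(2025, 1, 1 + i), -0.003 if i % 2 == 0 else 0.001) for i in range(20)]
toy += [(date(2026, 1, 1 + i), 0.004 if i % 2 == 0 else -0.001) for i in range(20)]

per_year = sharpe_by_year(toy)
pooled = annualized_sharpe([r for _, r in toy])
print("pooled Sharpe is positive:", pooled > 0)
for year, s in per_year.items():
    print(year, "positive" if s > 0 else "negative")
```

In the toy series the pooled figure comes out positive even though one of the two years loses money, which is the same shape of problem as the headline number above.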

This is not the "same edge, some noise around it" kind of variability. This is the sign flipping. In 2025, dislocations continued (momentum). In 2026, they reverted (mean reversion). The strategy is fundamentally regime-dependent, and the two regimes point in opposite directions.

You can only harvest an edge like this if you know which regime you're in before the fact. With two and a half years of data and one flip, you cannot build a reliable regime detector. You would be guessing, and the cost of guessing wrong is a year of drawdowns.

The lesson: never trust a headline Sharpe on a multi-year backtest without breaking it out year by year. If a full calendar year is negative, you don't have an edge. You have a lucky average.

Trap three: the more you test, the more "signal" you invent

This is the trap that catches almost every retail quant, and a fair number of professional ones. You sit down to build a strategy and try one idea. Sharpe of 0.5. Meh. Try another. And another. And another. Eventually one has a Sharpe of 2.7. You've found the winner.

Except you haven't. You've discovered a statistical certainty called the expected maximum. Even when the true edge of every strategy you're testing is exactly zero, if you test enough of them, some will look great by chance alone. Not "might" look great. Will.

The math is uncomfortably simple. Any annualized Sharpe ratio measured over a couple of years of daily returns has a built-in estimation error (standard error) of about 0.7, roughly one divided by the square root of the number of years in the sample. That's not sloppiness; that's just how much the average of a random sequence of 500 numbers naturally jitters around its true mean. Take the maximum of a lot of noisy estimates like that, and you get a predictable answer.
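To see where a figure of about 0.7 comes from, here is a small sketch. It assumes the usual approximation for a strategy whose true Sharpe is near zero (standard error of roughly the square root of periods per year over the number of observations) and checks it against a seeded simulation of zero-edge Gaussian returns. The function names, the 1% daily volatility, and the trial count are arbitrary illustrative choices.

```python
import math
import random
from statistics import mean, stdev

def annualized_sharpe(daily_returns, periods_per_year=252):
    return mean(daily_returns) / stdev(daily_returns) * math.sqrt(periods_per_year)

def sharpe_standard_error(n_days, periods_per_year=252):
    """Approximate standard error of an annualized Sharpe when the true Sharpe is ~0."""
    return math.sqrt(periods_per_year / n_days)

def simulated_sharpe_spread(n_days, n_trials, seed=0):
    """Standard deviation of measured Sharpes across strategies with zero true edge."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_trials):
        rets = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpes.append(annualized_sharpe(rets))
    return stdev(sharpes)

analytic = sharpe_standard_error(500)            # sqrt(252 / 500)
simulated = simulated_sharpe_spread(500, 400)    # 400 zero-edge strategies
print(f"analytic  standard error: {analytic:.2f}")
print(f"simulated standard error: {simulated:.2f}")
```

Both numbers should land near each other, and neither depends on the daily volatility you pick, because the Sharpe ratio divides it out.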

[Figure: Line chart showing expected maximum Sharpe rising with the number of strategies tested, reaching about 4 at 10 million tests]

Test 100 strategies and you're owed a maximum Sharpe of about 2.1 by luck. Test a thousand and you're owed 2.6. Test a million and you're owed 3.7. When we ran our exhaustive XLK search across all pairs and triplets and 4-name and 5-name groups at three different times of day, we tested roughly sixty-five million combinations. The maximum Sharpe you'd expect from pure noise on that many tests is around 4.

The best group we found had a Sharpe of 2.70. It didn't even beat what noise would have handed us for free.
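The "owed" numbers above can be approximated in a few lines. This sketch assumes the textbook approximation for the expected maximum of N independent Gaussian estimates, the standard error times the square root of 2 ln N, with the 0.7 standard error quoted earlier. It treats every test as independent, which overlapping groups are not, and the approximation runs slightly high for finite N, so read it as a rough bar rather than an exact threshold. The function name is hypothetical.

```python
import math

def expected_max_sharpe(n_tests, sharpe_se=0.7):
    """Rough expected best Sharpe from n_tests zero-edge strategies.

    Uses the asymptotic approximation se * sqrt(2 * ln(N)) for the
    maximum of N independent Gaussian estimates.
    """
    return sharpe_se * math.sqrt(2.0 * math.log(n_tests))

for n in (100, 1_000, 1_000_000, 65_000_000):
    print(f"{n:>12,d} tests -> luck alone gives about {expected_max_sharpe(n):.1f}")

best_found = 2.70
beats_luck = best_found > expected_max_sharpe(65_000_000)
print("best result beats the luck bar:", beats_luck)
```

The useful habit is to compute this bar before looking at the leaderboard, using the honest count of everything you tried.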

[Figure: Histogram showing 62,196 group Sharpe ratios distributed around zero with a small right tail containing the "best" result]

That's the whole distribution the best result came from. It's centered on zero. There is no edge in the population. The right tail exists anyway, because right tails always exist, and the strategy at the far end of it looks like a genius. It isn't. It's a coin that came up heads seven times in a row when a hundred million coins were being flipped.

The lesson: before you report a backtested Sharpe, ask what Sharpe you were "owed" by the size of your search. If your best strategy's Sharpe doesn't clear that number by a lot, you found nothing.

Trap four: significance without a story

There's a related test worth running. How often does a randomly chosen group of names have a positive Sharpe in every year of the sample? If groups were randomly good or bad in each year, you'd expect that rate to be about 12.5 percent (0.5 times 0.5 times 0.5). That's the baseline. A real edge should push the rate above it.

In our giant XLK search, the actual rate across 15 million five-name groups at noon was 10.8 percent. Below chance. Even the small hint of a real reversion signal in that slice wasn't enough to push the stability rate above the baseline. The "positive every year" groups we found in the top-N list are simply the luckiest of a population that has no more edge than a random walk.
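The stability check is easy to reproduce. A minimal sketch, assuming each group's result is stored as a tuple of per-year Sharpe ratios; the function names and the toy groups are hypothetical. The baseline assumes each year is an independent coin flip with a 50% chance of being positive.

```python
def chance_baseline(n_years, p_positive=0.5):
    """Share of groups expected to be positive in every year by luck alone."""
    return p_positive ** n_years

def all_years_positive_rate(yearly_sharpes_by_group):
    """yearly_sharpes_by_group: group id -> tuple of per-year Sharpe ratios."""
    groups = list(yearly_sharpes_by_group.values())
    hits = sum(1 for years in groups if all(s > 0 for s in years))
    return hits / len(groups)

# Toy population: eight groups covering every sign pattern across three years.
toy_groups = {
    "g1": (0.8, 1.1, 0.4),
    "g2": (0.8, 1.1, -0.4),
    "g3": (0.8, -1.1, 0.4),
    "g4": (0.8, -1.1, -0.4),
    "g5": (-0.8, 1.1, 0.4),
    "g6": (-0.8, 1.1, -0.4),
    "g7": (-0.8, -1.1, 0.4),
    "g8": (-0.8, -1.1, -0.4),
}
baseline = chance_baseline(3)
observed = all_years_positive_rate(toy_groups)
print(f"baseline {baseline:.3f}, observed {observed:.3f}")
print("population beats chance:", observed > baseline)
```

A population with a real edge should show an observed rate clearly above the baseline; the toy set, like our search, does not.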

Related to this: the top-ranked "robust" groups all shared the same handful of overlapping names. That's not ten independent discoveries. It's two or three names that happened to co-move in a particular way once, showing up in every combination that contained them. One coincidence wearing many hats.
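One quick way to spot this is to count how often each ticker appears across the top-ranked groups. The sketch below uses made-up tickers and a hypothetical `name_frequency` helper; it is not the list from our search.

```python
from collections import Counter

def name_frequency(top_groups):
    """Count how many of the top-ranked groups each ticker appears in."""
    counts = Counter(name for group in top_groups for name in set(group))
    return counts.most_common()

# Hypothetical top-ranked groups from a search.
top_groups = [
    ("AAA", "BBB", "CCC"),
    ("AAA", "BBB", "DDD"),
    ("AAA", "BBB", "EEE"),
    ("AAA", "CCC", "FFF"),
]
ranking = name_frequency(top_groups)
for ticker, n in ranking:
    print(f"{ticker}: appears in {n} of {len(top_groups)} groups")
```

If two or three tickers show up in nearly every "winner", treat the whole list as one discovery, not many.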

The lesson: for a backtested edge to be trustworthy, it needs to (a) beat the chance baseline for its search size, and (b) have a reason to work. If you can't explain in one sentence why the market is paying you (some risk premium, some behavioral pattern, some institutional flow), you don't have an edge. You have a rationalization looking for a cause.

The four honest questions

If you take one thing away from this post, take these four questions. Ask them of every backtested strategy you're tempted to trade, especially your own.

One. If I remove the top-contributing name from the universe, does the Sharpe still hold? If the answer is "no, it collapses" then you don't have a portfolio. You have a single-name bet.

Two. What does the Sharpe look like broken out by calendar year? If any full year is meaningfully negative, the headline is a mirage. You cannot harvest an average you never experience.

Three. How many strategies did I test to find this one? If the answer is "a lot," compare your top Sharpe to what pure luck would have produced from a search that big. If you didn't beat luck comfortably, you found nothing.

Four. Can I state, in one sentence, why the market is paying me? If not, you've discovered a coincidence, and coincidences do not repeat on schedule.

Why any of this matters

Backtesting is a genuinely powerful tool. You can characterize an idea in seconds that would have taken years to observe live. The problem is not the tool. The problem is that human pattern-matching, in the presence of noisy data and a large search, will always find something that looks amazing. That's not a bug in the human. It's what our brains are for.

The discipline is to build the guardrails before you look at the winner. Set your chance baseline before you start. Split your data by year before you decide it's a great strategy. Ask what would break your thesis before you fall in love with it. When you find something that survives all of those, you have something worth paper-trading. When you find something that only survives one of them, you have something worth writing about, in a post like this one, as a cautionary tale.

We spent weeks looking for an edge and did not find one worth trading. That's not a disappointment. That's the search working. The people who get hurt are the ones whose searches "succeed" and who put real money on the winner. The tools that saved us — the leave-one-out test, the per-year breakdown, the chance-max calculation, the demand for a story — are cheap. Use them on every backtest, especially your own. The best backtest you ever run is the one you talk yourself out of trading.