You rank 200 stocks by trailing Sharpe ratio, pick the top 20, and build your watchlist. Six months later, half of them have underperformed the index. The metric that was supposed to identify the best risk-adjusted performers did not predict anything useful. This is not bad luck. It is a structural problem with how the Sharpe ratio behaves when used as a screening tool rather than a portfolio evaluation tool.
A January 2025 paper from aiXplain, “AlphaSharpe: LLM-Driven Discovery of Robust Risk-Adjusted Metrics” (arXiv 2502.00029, Yuksel and Sawaf), put numbers on this problem. The researchers used large language models to evolve new ranking metrics through iterative mutation and crossover. Their best evolved metrics achieved over 3x higher rank correlation with future Sharpe ratios than the traditional Sharpe ratio itself. Portfolios built from the top-ranked stocks showed realized Sharpe improvements of 76% to 102% depending on the selection threshold. The Sharpe ratio, when used as a stock ranker, barely predicted its own future values.
I have been screening stocks with composite factor scores for years, and this finding matches my experience. The Sharpe ratio works well as a backward-looking report card. It fails as a forward-looking filter. Understanding why it fails, and what the AlphaSharpe approach actually fixes, matters for anyone building multi-factor composite screening systems.
Why the Sharpe Ratio Fails as a Stock Screening Tool
The Sharpe ratio was designed to evaluate a completed investment period. William Sharpe published it in 1966 as a way to compare mutual fund performance after the fact. The formula is simple: mean excess return (return above the risk-free rate) divided by the standard deviation of returns. When you use it to rank stocks for future selection, you are asking it to do something it was never built for.
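As a reference point, here is a minimal sketch of the trailing Sharpe ratio as most screeners compute it. The pandas Series input, the 4% risk-free rate, and the annualization convention are my assumptions, not anything from the paper:

```python
import numpy as np
import pandas as pd

def trailing_sharpe(daily_returns: pd.Series, rf_annual: float = 0.04,
                    periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio over a window of daily returns."""
    rf_daily = rf_annual / periods_per_year      # crude de-annualization (assumed)
    excess = daily_returns - rf_daily            # daily excess return
    vol = excess.std(ddof=1)                     # sample standard deviation
    if vol == 0:
        return float("nan")                      # avoid division by zero
    return (excess.mean() / vol) * np.sqrt(periods_per_year)
```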
Three problems appear immediately.
First, standard deviation treats upside and downside volatility identically. A stock that grinds steadily higher for 11 months and then spikes 15% in month 12 gets penalized for that spike. The denominator inflates, and the Sharpe ratio drops, even though that volatility was entirely in your favor. When I screen for swing trade candidates, I regularly see stocks with moderate Sharpe ratios that had excellent risk-adjusted returns once you separate upside from downside moves.
Second, the Sharpe ratio assumes returns follow a normal distribution. Real stock returns are skewed and fat-tailed. A stock with occasional large drawdowns followed by steady recovery looks statistically similar to one with smooth, consistent gains. The Sharpe ratio cannot distinguish between these two profiles. A screening tool should, because the first stock will eventually hit you with a drawdown right after you buy it.
Third, the metric is unstable in small samples. If you compute trailing 6-month Sharpe ratios and re-rank weekly, the rankings shift dramatically based on which days enter and exit your lookback window. I have watched stocks jump 50 positions in a weekly ranking simply because a single volatile day rolled out of the calculation window. That level of rank instability means the screener is not measuring anything persistent.
What AlphaSharpe Actually Does
The AlphaSharpe paper does not propose a single replacement formula. It proposes a framework: use an LLM to generate candidate metrics as Python functions, score those metrics on how well their historical rankings predict future realized Sharpe ratios, then feed the top-scoring metrics back through mutation and crossover to produce better variants.
The scoring process uses three measures of ranking quality. Spearman rank correlation checks whether assets ranked highly by the metric also rank highly in future performance. Kendall’s tau does the same with a different statistical approach. NDCG (normalized discounted cumulative gain) measures whether the metric puts the best future performers near the top of the list, weighting errors at the top of the ranking more heavily.
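All three measures are available off the shelf in SciPy and scikit-learn. A minimal sketch with synthetic stand-in data (the arrays and their relationship are invented for illustration):

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import ndcg_score

rng = np.random.default_rng(0)
# Synthetic stand-ins: one value per stock in the universe.
metric_scores = rng.normal(size=500)                          # metric at period start
future_sharpes = 0.3 * metric_scores + rng.normal(size=500)   # noisy future outcome

rho, _ = spearmanr(metric_scores, future_sharpes)    # monotonic rank agreement
tau, _ = kendalltau(metric_scores, future_sharpes)   # pairwise concordance
# ndcg_score expects 2D arrays (one row per ranking) and non-negative relevance,
# so shift the future Sharpes before scoring.
relevance = (future_sharpes - future_sharpes.min()).reshape(1, -1)
ndcg = ndcg_score(relevance, metric_scores.reshape(1, -1))
print(f"Spearman {rho:.3f}, Kendall tau {tau:.3f}, NDCG {ndcg:.3f}")
```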
The researchers tested on 15 years of data covering 3,246 US stocks and ETFs, with a 3-year out-of-sample test period that included the 2020 COVID crash. Their best evolved metric produced a Spearman correlation of 0.409 with future performance, compared to 0.130 for the traditional Sharpe ratio. That is not a marginal improvement. It is the difference between a metric that weakly suggests future performance and one that materially predicts it.
Four evolved metrics emerged from the framework, each building on the previous one. The first, alphaS1, replaced arithmetic mean returns with expected log excess returns and added stability constants to reduce sensitivity to outliers. The second, alphaS2, added downside risk and a forward-looking volatility estimate using only the most recent quarter of data. The third, alphaS3, incorporated skewness, kurtosis, and maximum drawdown adjustments. The fourth, alphaS4, added a regime-dependent factor that slightly boosts scores when mean excess returns are positive.
The Sharpe Ratio Stock Screening Problem in Practice
The paper’s ranking correlation numbers tell you something important about practical screening. A Spearman correlation of 0.130 between trailing Sharpe and future Sharpe means the relationship is barely better than random. If you ranked 100 stocks by last year’s Sharpe and then checked their Sharpe ratios over the next year, you would find only a faint correspondence between the two lists.
Contrast that with what most traders assume. Many screening systems use trailing Sharpe as a quality filter: “only show me stocks with Sharpe above 1.5.” This assumes that a high historical Sharpe signals something persistent about the stock’s risk-return profile. The data says otherwise.
I run multi-factor screens that combine momentum, drawdown quality, relative strength, and volatility characteristics. When I tested adding a raw Sharpe filter to those screens, the hit rate barely changed. When I replaced it with a metric that penalized downside volatility separately and adjusted for the most recent volatility regime, the improvement was immediate. Not because I discovered some secret formula, but because the replacement metric addressed the exact weaknesses the AlphaSharpe paper identifies.
What the Evolved Metrics Fix
The four AlphaSharpe metrics share a common pattern. Each one starts with log excess returns instead of arithmetic returns. This is not a cosmetic change. Log returns compound properly, and they compress extreme values. A stock that returns 100% and then loses 50% has an arithmetic mean return of 25% but a geometric (log) return of 0%. The log-return-based metric correctly identifies this as a stock going nowhere.
The formula for the simplest evolved metric, alphaS1, looks like this:
\alpha_{S1} = \frac{\exp(\mathbb{E}[\log R - r_f])}{\sqrt{(\sigma_{\log R}^2 + \epsilon) \cdot (\sigma_{\log R} + \epsilon)}}
The epsilon terms are stability constants that prevent division by near-zero values when a stock has very low volatility over the lookback period. This is a practical fix for a real problem: when you compute standard Sharpe for a stock that barely moved, the ratio can spike to extreme values purely from the small denominator.
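Translated directly from the formula above, a minimal sketch (I read log R as the log gross return, and the epsilon value is my placeholder, not the paper’s constant):

```python
import numpy as np
import pandas as pd

def alpha_s1(daily_returns: pd.Series, rf_daily: float = 0.0,
             eps: float = 1e-4) -> float:
    """alphaS1 per the formula above: exponentiated mean log excess return
    over a stabilized volatility term. eps is a stability constant (assumed value)."""
    log_excess = np.log1p(daily_returns) - rf_daily   # log(1 + R) - r_f per day
    sigma = log_excess.std(ddof=1)                    # volatility of log excess returns
    numerator = np.exp(log_excess.mean())
    denominator = np.sqrt((sigma**2 + eps) * (sigma + eps))
    return numerator / denominator
```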
AlphaS2 adds two components to the denominator. First, a downside risk measure that separately counts the standard deviation of negative returns, weighted by the frequency of negative days. This addresses the upside-versus-downside problem directly. Second, a “forecasted volatility” term calculated from only the most recent quarter of data. Instead of treating the full lookback window equally, the metric puts more weight on recent historical volatility.
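The paper gives the exact functional form; the sketch below only shows how the two described components can be computed. The 63-day quarter length, the frequency weighting, and everything about how they would combine are my assumptions:

```python
import numpy as np
import pandas as pd

def alpha_s2_components(daily_returns: pd.Series,
                        recent_days: int = 63) -> tuple[float, float]:
    """Two denominator ingredients in the spirit of alphaS2 (not the paper's formula)."""
    log_r = np.log1p(daily_returns)
    negative = log_r[log_r < 0]
    # Downside deviation, weighted by how often negative days occur.
    downside = negative.std(ddof=1) * (len(negative) / len(log_r))
    # "Forecasted" volatility: realized volatility of the most recent quarter only.
    recent_vol = log_r.tail(recent_days).std(ddof=1)
    return downside, recent_vol
```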
AlphaS3 adjusts for skewness and kurtosis explicitly. A stock with negative skew (more large down days than large up days) gets penalized. A stock with excess kurtosis (fat tails) gets penalized. Maximum drawdown enters the formula as well. These are concepts from the Sortino ratio, the Calmar ratio, and skewness-adjusted Sharpe variants that already exist in finance. The difference is how they combine. The LLM discovered a multiplicative interaction between these factors that, as far as the authors report, no prior variant had combined in exactly that form.
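The ingredients themselves are standard; a minimal sketch of computing them (the paper’s multiplicative combination is not reproduced here):

```python
import pandas as pd

def risk_shape_stats(daily_returns: pd.Series) -> dict:
    """The raw inputs alphaS3 combines: skew, excess kurtosis, max drawdown."""
    equity = (1 + daily_returns).cumprod()       # equity curve from returns
    drawdown = 1 - equity / equity.cummax()      # fractional drop from running peak
    return {
        "skew": daily_returns.skew(),            # negative = more large down days
        "excess_kurtosis": daily_returns.kurt(), # > 0 = fatter tails than normal
        "max_drawdown": drawdown.max(),
    }
```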
How to Build and Test Your Own Screening Metrics
You do not need an LLM to apply the lessons from this paper. The core insight is that a good stock screening metric should predict future ranking, not describe past performance. Testing that property is straightforward.
Start with your current screening metric. Compute it for your universe of stocks at the start of each rolling period (say, every month over 5 years). Then compute the realized Sharpe ratio for each stock over the following 6 months. Calculate the Spearman rank correlation between your metric’s ranking and the future Sharpe ranking. If the correlation is below 0.20, your metric is not predicting anything useful.
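Here is a minimal sketch of that test, assuming a DataFrame of daily returns with one column per ticker; the window lengths and monthly step are my choices:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def predictive_rank_correlation(returns: pd.DataFrame, metric_fn,
                                lookback: int = 126, horizon: int = 126,
                                step: int = 21) -> float:
    """Average Spearman correlation between metric rankings and forward Sharpe.
    returns: daily returns, one column per ticker. metric_fn: Series -> float."""
    correlations = []
    for t in range(lookback, len(returns) - horizon, step):
        past = returns.iloc[t - lookback:t]
        future = returns.iloc[t:t + horizon]
        scores = past.apply(metric_fn)                   # metric value per ticker
        fwd_sharpe = future.mean() / future.std(ddof=1)  # daily Sharpe, rf omitted
        rho, _ = spearmanr(scores, fwd_sharpe)
        correlations.append(rho)
    return float(np.nanmean(correlations))
```

With the trailing_sharpe function from earlier as metric_fn, this reproduces the baseline test; swap in a modified metric to compare.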
Next, try modifications inspired by what AlphaSharpe found. Switch from arithmetic returns to log returns. Separate downside and upside volatility in your denominator. Add a maximum drawdown penalty. Weight recent data more heavily than older data. Each modification is a hypothesis you can test with the same Spearman correlation framework.
Here is a concrete example. Take a standard 6-month Sharpe ratio and modify it in three steps:
Step 1: Replace mean excess return with exponentiated mean log excess return. This handles compounding correctly.
Step 2: Replace standard deviation with a blend of downside deviation and recent-quarter volatility. Weight them equally to start.
Step 3: Multiply the result by (1 - max_drawdown) to penalize stocks with deep pullbacks.
Test each step independently. Measure Spearman correlation with future 6-month Sharpe at each stage. If a modification does not improve the correlation, drop it. If it does, keep it and add the next one.
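Putting the three steps together, a minimal sketch (the 50/50 blend and the drawdown multiplier follow the starting suggestions above; the risk-free rate is omitted for brevity):

```python
import numpy as np
import pandas as pd

def modified_sharpe(daily_returns: pd.Series, recent_days: int = 63) -> float:
    """Steps 1-3 applied to a window of daily returns."""
    log_r = np.log1p(daily_returns)
    # Step 1: exponentiated mean log return instead of arithmetic mean.
    numerator = np.exp(log_r.mean())
    # Step 2: equal blend of downside deviation and recent-quarter volatility.
    downside = log_r[log_r < 0].std(ddof=1)
    recent_vol = log_r.tail(recent_days).std(ddof=1)
    denominator = 0.5 * downside + 0.5 * recent_vol
    # Step 3: penalize deep pullbacks over the lookback.
    equity = (1 + daily_returns).cumprod()
    max_dd = (1 - equity / equity.cummax()).max()
    return (numerator / denominator) * (1 - max_dd)
```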
The Overfitting Problem With Custom Metrics
Any time you optimize a metric against historical data, you risk overfitting. The AlphaSharpe paper addresses this with time-series cross-validation: splitting data into overlapping folds, evolving metrics on the in-sample portion, and testing on an out-of-sample period. The out-of-sample test included the extreme volatility of 2020, which is a useful stress test for any evolved metric.
For a retail trader testing custom metrics, the overfitting risk is real. Three guardrails help.
First, always hold out a test period. If you have 10 years of data, use 7 for development and the final 3 for validation. Never touch the test period until you have finalized your metric. If you peek at test results and then modify the metric, you have contaminated your test set.
Second, count your degrees of freedom. The alphaS1 metric has essentially 2 parameters: the choice to use log returns and the epsilon constant. AlphaS4, the most complex variant, adds maybe 6 design choices. A metric with 20 tunable parameters fit to 5 years of data is almost certainly overfit. Keep the number of free choices small relative to your data length.
Third, check stability across subperiods. Split your in-sample data into 3-year blocks and compute the Spearman correlation for each block separately. If the correlation is 0.35 in one block and 0.05 in another, the metric is not stable. You want consistent predictive power across different market environments.
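The third guardrail takes only a few lines, reusing the predictive_rank_correlation helper sketched earlier (the 3-year block length matches the suggestion above):

```python
import pandas as pd

def correlation_by_block(returns: pd.DataFrame, metric_fn,
                         block_years: int = 3) -> list[float]:
    """Predictive Spearman correlation per consecutive in-sample block."""
    block_len = block_years * 252                 # trading days per block
    results = []
    for start in range(0, len(returns) - block_len + 1, block_len):
        block = returns.iloc[start:start + block_len]
        results.append(predictive_rank_correlation(block, metric_fn))
    return results
```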
Connecting This to Multi-Factor Screening
If you already use multi-factor screens, this paper suggests a specific improvement. Replace any raw Sharpe ratio component with a modified version that uses log returns, separates downside risk, and adjusts for drawdowns. You do not need to replicate the LLM evolution process. You just need a ranking metric that better predicts future risk-adjusted returns.
The multi-factor approach actually protects you from the overfitting risk of a single custom metric. When you combine momentum, relative strength, volatility characteristics, and a modified Sharpe-style quality measure, each factor provides a partial view. The composite is more stable than any individual metric, even an evolved one.
I use a composite scoring system where each factor gets ranked separately, and the final score is a weighted average of percentile ranks. Replacing the vanilla Sharpe component with a log-return, downside-adjusted variant improved my screen’s 6-month forward hit rate. The improvement was modest (about 3-4 percentage points on a base of 55-60%), but it was consistent across different market conditions. That consistency matters more than the magnitude.
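For concreteness, a minimal sketch of that kind of composite, assuming one row per stock and one column per raw factor value. The factor names and weights are illustrative, and any lower-is-better factor should be inverted before ranking:

```python
import pandas as pd

def composite_score(factors: pd.DataFrame, weights: dict) -> pd.Series:
    """Weighted average of per-factor percentile ranks (higher = better)."""
    percentiles = factors.rank(pct=True)          # 0-1 percentile rank per column
    weighted = sum(percentiles[name] * w for name, w in weights.items())
    return weighted / sum(weights.values())

# Hypothetical usage:
# scores = composite_score(df, {"momentum": 0.3, "rel_strength": 0.3,
#                               "inv_volatility": 0.2, "mod_sharpe": 0.2})
```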
What the Paper Does Not Tell You
The AlphaSharpe results come from ranking 3,246 US stocks and ETFs using equal-weight portfolios of the top 10-25% of the universe. That is an institutional-scale selection problem: pick the best 300-800 names from a broad universe.
Most retail swing traders work with a much smaller universe and a much smaller portfolio. If you are picking 5-10 stocks from a universe of 200, the statistical properties change. Rank correlation metrics like Spearman and NDCG are less stable with smaller samples. A metric that works well for sorting 3,000 names might not differentiate meaningfully among 200.
The paper also uses equal-weight allocation across all selected stocks. This removes position-sizing from the analysis entirely. In practice, how much you allocate to each position matters as much as which positions you select. A screening metric that puts a large-drawdown stock at rank 5 instead of rank 50 is only a problem if you size positions equally. If you size by inverse volatility or by conviction, the ranking error matters less.
The out-of-sample test period (approximately 2020-2022) includes a crash and a strong recovery. The evolved metrics performed well through both regimes. But 3 years is still a short validation window. I would want to see performance through a prolonged bear market (2008-type) and a low-volatility grind (2017-type) before concluding the metrics generalize broadly.
Practical Steps for Your Screening System
If this topic resonates, here is a checklist for testing a modified ranking metric in your own system.
1. Export your universe’s daily returns for at least 5 years. Compute trailing 6-month values of your current screening metric and the forward 6-month realized Sharpe. Calculate Spearman correlation. This is your baseline.
2. Switch to log returns in your metric’s numerator. Recompute. Compare Spearman correlation to baseline.
3. Replace total volatility with downside deviation (standard deviation of negative returns only). Recompute. Compare.
4. Add a maximum drawdown penalty. Multiply your metric by (1 - max_drawdown_over_lookback). Recompute. Compare.
5. Weight recent data more heavily. Compute volatility using only the most recent quarter instead of the full lookback. Recompute. Compare.
6. Validate on a held-out test period. If the improved Spearman correlation holds in the test period, you have a better screening metric.
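Step 6 deserves its own sketch: a strictly chronological split, with the test slice evaluated once at the end (the 3-year test length mirrors the guardrail above):

```python
import pandas as pd

def dev_test_split(returns: pd.DataFrame, test_years: int = 3):
    """Chronological split: develop on the early slice, validate once on the late one."""
    cutoff = len(returns) - test_years * 252      # trading days held out
    return returns.iloc[:cutoff], returns.iloc[cutoff:]

# dev, test = dev_test_split(returns)
# Iterate on the metric against dev only; run the Spearman test on test exactly once.
```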
Each step is testable independently. You might find that log returns alone account for most of the improvement, and the other adjustments add noise for your particular universe. That is fine. The point is to test rather than assume.
Educational content only. Not investment advice. Trading involves risk. You are responsible for your decisions.
