LLM-Conditioned Momentum - News Sentiment Screening

You run a standard 12-month minus 1-month momentum screen on S&P 500 names. Twenty stocks pass. Three of them have earnings misses buried in last week’s news. Two have analyst downgrades. One faces a regulatory investigation. The raw momentum score does not know any of this. It only sees trailing price returns. An LLM-conditioned momentum filter catches what the price curve misses by reading the news before you commit capital.

I started experimenting with this approach after reading Anic, Barbon, Seiz, and Zarattini’s 2025 paper “ChatGPT in Systematic Investing” (arXiv 2510.26228). Their results were clear enough to test: an LLM that evaluates whether recent news supports a continuation of past returns can improve Sharpe and Sortino ratios relative to a naive momentum benchmark. The gains are strongest in concentrated portfolios. That matters for swing traders who typically hold 5-15 names, not 50.

What LLM-Conditioned Momentum Actually Means

Classic cross-sectional momentum ranks stocks by their trailing 12-month return, skipping the most recent month (the “12M-1M” screen). You buy the top decile, hold for a month, rebalance. This has been one of the most studied factors in quantitative finance since Jegadeesh and Titman documented it in 1993.
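
To make the 12M-1M definition concrete, here is a minimal sketch in plain Python. It assumes you already have dividend-adjusted month-end prices per ticker, oldest first; the helper names `momentum_12m_1m` and `top_fraction` and the example tickers are hypothetical, not taken from the paper.

```python
def momentum_12m_1m(month_end_prices):
    """Return from 12 months ago to 1 month ago, skipping the latest month.

    month_end_prices: at least 13 month-end prices, oldest first.
    """
    if len(month_end_prices) < 13:
        raise ValueError("need at least 13 month-end prices")
    start = month_end_prices[-13]  # price 12 months ago
    end = month_end_prices[-2]     # price 1 month ago (latest month skipped)
    return end / start - 1.0


def top_fraction(prices_by_ticker, fraction=0.10):
    """Rank tickers by 12M-1M momentum and keep the top fraction (at least one)."""
    scores = {t: momentum_12m_1m(p) for t, p in prices_by_ticker.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return [(t, scores[t]) for t in ranked[:keep]]


# Made-up example data: one steadily rising name, one flat name.
example = {
    "AAA": [100.0 + i for i in range(13)],
    "BBB": [100.0] * 13,
}
candidates = top_fraction(example, fraction=0.5)
```

With `fraction=0.10` on a 500-name universe this keeps the top decile; the monthly hold and rebalance loop is left out on purpose.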

The problem: momentum crashes. Stocks that screen well on trailing returns sometimes reverse violently because the price trend was driven by a catalyst that has expired. An earnings beat six months ago inflated the return. The company just lost a major contract. The momentum score still looks strong. The forward return does not.

LLM-conditioned momentum adds a second layer. After ranking by trailing returns, you feed each candidate’s recent news to a large language model and ask a specific question: “Does the news flow support a continuation of this stock’s recent trend?” The model returns a score. Stocks with low continuation scores get filtered out or down-weighted.

This is not sentiment analysis in the traditional NLP sense, where you count positive and negative words. The LLM processes context, nuance, and the relationship between a news event and a stock’s existing price trend. A positive earnings surprise for a stock already up 40% might get a neutral continuation score if the surprise was already priced in via pre-announcement guidance. That contextual reasoning is what makes LLMs different from keyword-based sentiment tools.

The Research Behind It

Anic et al. (2025) tested this on S&P 500 constituents using daily returns and high-frequency news data. Their methodology used prompt-engineered queries that informed ChatGPT when a stock was about to enter a momentum portfolio, then asked the model to evaluate whether recent news supported continuation. The LLM produced scores that conditioned both stock selection and portfolio weights.

Key findings from the paper:

The LLM-enhanced strategy outperformed a standard long-only momentum benchmark on both Sharpe and Sortino ratios
Results held in-sample and in a truly out-of-sample period after the model’s pre-training cutoff
Gains held up against transaction costs, prompt design variations, and portfolio constraints
The improvement was strongest for concentrated, high-conviction portfolios

That last point is the most relevant for individual traders. Institutional quant funds run diversified 100-stock momentum portfolios where the LLM filter adds modest incremental alpha. A swing trader running a 10-stock portfolio gets a proportionally larger benefit from removing the two or three names where the news narrative has already turned.

A Practical Workflow for Swing Traders

I run this as a weekend screening process. Here is the workflow I use, adapted for anyone with access to a momentum screener and ChatGPT (or Claude, or any capable LLM):

Step 1: Run the 12M-1M screen. I use a universe of S&P 500 or Russell 1000 names. Rank by total return over the past 12 months, excluding the most recent month. Pull the top 20-30 candidates. This is your raw momentum list.

Step 2: Gather recent news for each candidate. I pull headlines and summaries from the past 2-4 weeks. You need firm-specific news, not market commentary. Earnings, analyst actions, product launches, executive changes, legal issues, contract wins or losses. Most financial data providers surface this. Free sources include company press releases and SEC filings.

Step 3: Prompt the LLM with context. The key insight from Anic et al. is that you tell the model the stock is a momentum candidate. My prompt template looks roughly like this:

“Stock [TICKER] has returned [X]% over the past 12 months (excluding the most recent month). It currently ranks in the top decile of a cross-sectional momentum screen. Below are the most recent news items for this company from the past 3 weeks. Based on this news, rate on a scale of 1-5 how likely the stock’s positive trend is to continue over the next 1-3 months. A score of 1 means the news strongly contradicts continuation. A score of 5 means the news strongly supports continuation. Provide a one-sentence rationale.”

Step 4: Filter or weight by score. I typically require a score of 3 or higher to keep a stock in my watchlist. Scores of 4-5 get full position sizing. Scores of 3 get half size. Below 3, the stock is removed regardless of its momentum rank.

Step 5: Cross-check drawdown quality. As a secondary momentum filter, confirm that the price path was orderly rather than the product of a single spike.
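
Steps 3 and 4 can be sketched in a few lines of Python. This is an illustration under assumptions: `build_prompt` and `position_size` are hypothetical helper names, the ticker and headlines are made up, and the call to the model is deliberately left out because it depends on the client you use.

```python
PROMPT_TEMPLATE = (
    "Stock {ticker} has returned {ret_pct:.1f}% over the past 12 months "
    "(excluding the most recent month). It currently ranks in the top decile "
    "of a cross-sectional momentum screen. Below are the most recent news "
    "items for this company from the past {weeks} weeks. Based on this news, "
    "rate on a scale of 1-5 how likely the stock's positive trend is to "
    "continue over the next 1-3 months. A score of 1 means the news strongly "
    "contradicts continuation. A score of 5 means the news strongly supports "
    "continuation. Provide a one-sentence rationale.\n\nNews:\n{news}"
)


def build_prompt(ticker, ret_pct, headlines, weeks=3):
    """Fill the Step 3 template with one candidate's return and news items."""
    news = "\n".join(f"- {h}" for h in headlines)
    return PROMPT_TEMPLATE.format(
        ticker=ticker, ret_pct=ret_pct, weeks=weeks, news=news
    )


def position_size(score):
    """Step 4 rule: 4-5 full size, 3 half size, below 3 removed."""
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    if score >= 4:
        return 1.0
    if score >= 3:
        return 0.5
    return 0.0


# Made-up example inputs.
prompt = build_prompt("XYZ", 42.0, ["Beats Q3 estimates", "Analyst upgrade"])
```

Keeping the template in one constant makes it easier to run the same prompt every week, which matters later when you evaluate the filter.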

What the LLM Catches That Price Does Not

Running this screen over the past several months, I have seen the filter remove candidates for reasons a pure price screen would miss entirely:

A semiconductor stock ranked in the top quintile on momentum. The LLM flagged it with a score of 2 because three separate news items discussed rising inventory levels at key customers and potential order deferrals. The trailing return reflected a surge from an AI-spending narrative six months prior. The news had shifted to oversupply concerns. The stock declined 12% over the following month.

A healthcare name scored a 5 from the LLM. Trailing momentum was strong, and the recent news included FDA approval of a supplemental indication, analyst upgrades, and an institutional accumulation filing. The news confirmed continuation. The stock added another 8% over six weeks.

These are not cherry-picked results from backtests. They are observations from live screening where the LLM filter changed my allocation versus what pure momentum would have dictated.

Prompt Design Matters More Than Model Choice

Anic et al. found that results held across prompt design variations, which is encouraging. But in practice, I have noticed three prompt elements that improve output quality:

First, always tell the model the stock’s recent return. Without this context, the LLM evaluates news in a vacuum. A negative earnings revision for a stock up 60% means something different than for a stock up 10%. The model needs the baseline to assess continuation probability.

Second, specify the holding period. “Continuation over the next 1-3 months” produces different scores than “next week” or “next year.” Match your prompt to your actual holding period.

Third, request a rationale. A one-sentence explanation forces the model to commit to a reasoning chain. When the rationale contradicts the score (a score of 4 with a rationale that mentions “headwinds”), I re-evaluate manually. The rationale also helps you build intuition over time about what the model finds important.
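
The score-versus-rationale check can be partly automated. The sketch below assumes the model replies in a "Score: N" then "Rationale: ..." layout, which the prompt above does not enforce, so you would need to request that format explicitly. The caution-word list is a crude, hypothetical starting point, not a validated lexicon.

```python
import re

# Hypothetical list of words that sit oddly next to a high continuation score.
CAUTION_WORDS = ("headwind", "downgrade", "investigation", "deferral", "oversupply")


def parse_reply(reply):
    """Extract the 1-5 score and the rationale text from a model reply."""
    score_match = re.search(r"score\s*[:=]?\s*([1-5])\b", reply, flags=re.IGNORECASE)
    if score_match is None:
        raise ValueError("no 1-5 score found in reply")
    score = int(score_match.group(1))
    rationale_match = re.search(
        r"rationale\s*[:=]?\s*(.+)", reply, flags=re.IGNORECASE | re.DOTALL
    )
    if rationale_match:
        rationale = rationale_match.group(1).strip()
    else:
        # Fall back to whatever follows the score.
        rationale = reply[score_match.end():].strip()
    return score, rationale


def needs_manual_review(score, rationale):
    """Flag high scores whose rationale mentions a cautionary term."""
    text = rationale.lower()
    return score >= 4 and any(word in text for word in CAUTION_WORDS)


score, rationale = parse_reply(
    "Score: 4\nRationale: Strong orders, though margin headwinds persist."
)
flag = needs_manual_review(score, rationale)
```

Substring matching will produce false alarms, so treat a flag as a prompt to read the rationale yourself rather than as a verdict.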

Model choice is less critical. I have tested this with GPT-4, Claude, and open-source models. The scores differ slightly but the directional signal is consistent across models for clear-cut cases. Borderline cases (scores of 2-3) vary more between models, which is where your own judgement adds the most value.
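
If you do score each candidate with more than one model, a small helper can surface the borderline and high-disagreement cases for manual review. This is a sketch under assumptions: the model labels are placeholders, and the thresholds (a spread above 1, or a median of 2-3) are illustrative choices rather than tested values.

```python
from statistics import median


def consensus(scores_by_model, max_spread=1):
    """Summarize 1-5 scores from several models for one stock."""
    values = list(scores_by_model.values())
    if not values:
        raise ValueError("need at least one score")
    spread = max(values) - min(values)
    med = median(values)
    borderline = 2 <= med <= 3
    return {
        "median": med,
        "spread": spread,
        "needs_review": spread > max_spread or borderline,
    }


# Placeholder model labels and made-up scores.
clear_case = consensus({"model_a": 5, "model_b": 4, "model_c": 5})
murky_case = consensus({"model_a": 2, "model_b": 3, "model_c": 4})
```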

Combining LLM Scores with Multi-Factor Screening

The LLM sentiment filter works best as one layer in a multi-factor composite screen. I use it after momentum ranking and before final position selection. The sequence matters: screen for momentum first (quantitative, deterministic), then apply the LLM filter (qualitative, contextual), then check technical structure for entry timing.

If you already use regime-based factor screening, the LLM filter slots in at the individual-stock level after the regime filter has already adjusted your factor tilts at the portfolio level. Regime filtering tells you whether momentum as a factor is currently working. LLM conditioning tells you whether a specific momentum candidate deserves inclusion.

For position sizing, the Anic et al. paper found that using LLM scores as portfolio weights (higher scores get more capital) outperformed equal-weight allocation among filtered names. I have implemented a simplified version: full size for scores of 4-5, half size for 3, excluded for 1-2. This avoids over-engineering while capturing most of the information.
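
The two sizing approaches can be compared side by side. In this sketch, `tiered_weights` follows the simplified rule described above, while `score_proportional_weights` is only my reading of "scores as weights" and is not the paper's exact scheme. Both assume a fully invested portfolio whose weights sum to 1; the tickers and scores are made up.

```python
def _normalize(raw):
    """Scale raw sizes so they sum to 1 (all zeros if nothing qualifies)."""
    total = sum(raw.values())
    if total == 0:
        return {t: 0.0 for t in raw}
    return {t: w / total for t, w in raw.items()}


def tiered_weights(scores):
    """Full size for scores of 4-5, half size for 3, excluded for 1-2."""
    raw = {t: (1.0 if s >= 4 else 0.5 if s >= 3 else 0.0) for t, s in scores.items()}
    return _normalize(raw)


def score_proportional_weights(scores, min_score=3):
    """Assumed variant: weight each surviving name in proportion to its score."""
    raw = {t: (float(s) if s >= min_score else 0.0) for t, s in scores.items()}
    return _normalize(raw)


example_scores = {"AAA": 5, "BBB": 3, "CCC": 2}
tiered = tiered_weights(example_scores)
proportional = score_proportional_weights(example_scores)
```

In this example the tiered rule gives AAA twice the weight of BBB, while the proportional variant gives it 5/3 of BBB's weight; CCC is excluded in both.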

Limitations and Where This Breaks Down

This approach has clear failure modes. Recognizing them prevents you from over-relying on the filter.

News lag is the biggest issue. If you screen weekly, even the freshest news the LLM sees can be up to a week old by the time you act on it. In fast-moving situations (earnings week, regulatory actions), the stock may have already repriced before your next rebalance. The filter adds the most value during quiet news periods, when a slow narrative shift has not yet hit the price.

LLM knowledge cutoffs matter. If you use a model with a training cutoff six months ago, it cannot reason about recent industry dynamics it never learned. Feeding the model the full text of the news items (rather than asking it to recall information), or using a model with web access, largely addresses this. The Anic et al. paper explicitly tested out-of-sample performance beyond the model’s pre-training cutoff and found that results held, suggesting the model generalizes from patterns rather than memorizing specific events.

Prompt injection via news text is a theoretical risk. If you feed raw article text to the model, adversarial content could bias the output. In practice, using headlines and summaries from reputable financial news sources makes this a low-probability concern for stock screening. But be aware of it if you scrape unusual sources.

The filter does not help with market-wide momentum crashes. When the entire factor reverses (March 2009, March 2020), individual stock news will not save you. You need separate regime detection for that.

Cost and Scalability for Individual Traders

Running 20-30 prompts per week through a paid LLM API costs well under $2 at current GPT-4o pricing. Each prompt with 3 weeks of news summaries runs roughly 500-800 tokens of input and 100 tokens of output. At $2.50 per million input tokens, 30 prompts of 800 tokens each come to about $0.06 of input per week, so the total is economically trivial for a retail swing trader rebalancing weekly.
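
The arithmetic behind that estimate is easy to check. In this sketch, the token counts and the $2.50 input price come from the paragraph above. The $10.00 per million output tokens is my assumption and does not appear in the text, so confirm it against your provider's current price list.

```python
def weekly_cost_usd(prompts, input_tokens, output_tokens,
                    input_price_per_m, output_price_per_m):
    """Weekly API cost for a batch of identical-size prompts."""
    input_cost = prompts * input_tokens * input_price_per_m / 1_000_000
    output_cost = prompts * output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Upper end of the ranges in the text: 30 prompts, 800 input tokens each.
# The output price of 10.00 USD per million tokens is an assumed figure.
upper_estimate = weekly_cost_usd(
    prompts=30,
    input_tokens=800,
    output_tokens=100,
    input_price_per_m=2.50,
    output_price_per_m=10.00,
)
print(f"Upper-end weekly cost: ${upper_estimate:.2f}")
```

Under these assumptions the upper-end weekly cost is about nine cents, far below the $2 ceiling.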

If you want to avoid API costs entirely, open-source models like Llama 3 running locally produce acceptable results for this task. The scoring is slightly noisier but directionally similar to frontier models. A gaming GPU from the last 3 years can run inference on a quantized 8B-parameter model in seconds per prompt.

Scaling to larger universes (2000+ stocks) requires batching and parallel processing, but that is an institutional concern. For a swing trader screening 20-50 candidates from a pre-filtered universe, manual copy-paste into ChatGPT works fine as a starting point.

Building Conviction Without Overfitting

I treat the LLM score as one vote among several, not an oracle. My decision framework for a momentum candidate looks like this:

Momentum rank (top 20% of universe): pass/fail gate
LLM continuation score (3+): pass/fail gate
Drawdown quality (no single drawdown exceeding 15% during the lookback): pass/fail gate
Volume trend (not declining over the past month): soft filter
Technical structure (price above 20-day and 50-day moving averages): entry timing

A stock needs to pass all three hard gates to enter my watchlist. The path quality of the return matters as much as its magnitude. The LLM adds a fundamentals-narrative layer without requiring me to read 30 earnings transcripts every weekend.
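
The three hard gates can be expressed as one function. This sketch assumes the momentum rank is given as a percentile, where 0.05 means the stock sits in the top 5% of the universe, and it reads "no single drawdown exceeding 15%" as a cap on the maximum peak-to-trough decline over the lookback. The function names and example prices are hypothetical.

```python
def max_drawdown(prices):
    """Largest peak-to-trough decline, as a positive fraction."""
    if not prices:
        raise ValueError("need at least one price")
    peak = prices[0]
    worst = 0.0
    for p in prices:
        peak = max(peak, p)
        worst = max(worst, 1.0 - p / peak)
    return worst


def passes_hard_gates(momentum_percentile, llm_score, lookback_prices,
                      top_pct=0.20, min_score=3, max_dd=0.15):
    """True only if momentum rank, LLM score, and drawdown quality all pass."""
    in_top = momentum_percentile <= top_pct
    news_ok = llm_score >= min_score
    path_ok = max_drawdown(lookback_prices) <= max_dd
    return in_top and news_ok and path_ok


# Made-up price paths: one orderly, one with a 25% drawdown.
orderly = [100, 110, 105, 120]
spiky = [100, 120, 90, 130]
keep = passes_hard_gates(0.05, 4, orderly)
drop = passes_hard_gates(0.05, 4, spiky)
```

The volume and moving-average checks are left out because the text treats them as a soft filter and an entry-timing tool, not as gates.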

The goal is not prediction accuracy on any single name. It is improving the hit rate of your momentum screen from roughly 55-60% (typical for raw 12M-1M) to 65-70% by removing the most obvious false positives before they consume capital and attention.

Keeping the LLM Filter Honest

Track your results. I log every LLM score alongside the actual 1-month forward return. After three months of data, you can calculate whether the filter actually improves your hit rate versus raw momentum alone. If scores of 4-5 do not outperform scores of 1-2 in your own results, something is wrong with your prompt, your news sources, or the model you are using.
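
A log like this takes only a few lines to summarize. The sketch below assumes each entry is a pair of LLM score and 1-month forward return, and it defines a "hit" as a positive forward return, which is my assumption because the text does not fix a definition. The sample log is made up.

```python
def hit_rate_by_bucket(log):
    """Summarize forward returns by LLM score bucket.

    log: iterable of (llm_score, forward_1m_return) pairs.
    """
    buckets = {"high (4-5)": [], "mid (3)": [], "low (1-2)": []}
    for score, fwd in log:
        if score >= 4:
            key = "high (4-5)"
        elif score >= 3:
            key = "mid (3)"
        else:
            key = "low (1-2)"
        buckets[key].append(fwd)

    summary = {}
    for key, rets in buckets.items():
        if rets:  # skip empty buckets
            summary[key] = {
                "n": len(rets),
                "hit_rate": sum(r > 0 for r in rets) / len(rets),
                "mean_return": sum(rets) / len(rets),
            }
    return summary


# Made-up log entries for illustration only.
sample_log = [(5, 0.08), (4, -0.02), (3, 0.00), (2, -0.12), (1, 0.01)]
summary = hit_rate_by_bucket(sample_log)
```

With only a handful of entries per bucket the numbers are noise, so compare buckets only once each has a meaningful count.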

Resist the temptation to tweak prompts to fit past results. Prompt-mining (testing dozens of prompt variations until one backtests well) produces overfitting just like parameter-mining in technical analysis. Pick a reasonable prompt, run it consistently, and evaluate over a meaningful sample (50+ trades minimum).

The Anic et al. paper’s consistency across prompt designs is the strongest evidence that this is a real signal, not a prompt-specific artifact. That gives me confidence the approach works at the category level (LLMs reading news can filter momentum) even if any specific prompt implementation can be improved.

Educational content only. Not investment advice. Trading involves risk. You are responsible for your decisions.