How LLMs choose what to cite (and how to be on the list)

AI answer engines are not magic. They re-rank a small set of sources that traditional search has already surfaced, then pick one or two to quote. Seven signals show up across nearly every cited source. None of them are secret.

First: how the citation pipeline actually works

It helps to demystify the surface. An AI answer engine, in 2026, is roughly four stages:

1. Query rewriting. Your raw question gets expanded into 3–8 sub-queries the system thinks would surface relevant sources. ("Is HelloFresh worth it?" becomes "HelloFresh cost per meal," "HelloFresh reviews 2026," "HelloFresh vs Blue Apron," etc.)
2. Retrieval. Each sub-query is sent to a backing search index. Which index varies by engine and is not always disclosed: Copilot and ChatGPT search draw on Bing, while other engines use Google, their own crawl, or a third-party search API. The top ~10 results per sub-query come back.
3. Re-ranking + fetching. The candidate set (30–80 URLs, from 3–8 sub-queries at ~10 results each) is filtered down to the smaller pool the engine actually fetches and reads, typically 5–15 pages. Re-ranking combines embedding similarity with a separate "citability" signal.
4. Generation. The model writes a paragraph and decides which of the fetched pages to cite inline. Usually 1–3 citations per paragraph.
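
To make stages 2 and 3 concrete, here is a minimal sketch of how a candidate pool might be merged and re-ranked. Everything in it is an assumption for illustration: the function names, the 50/50 blend weight, the toy two-dimensional "embeddings," and the citability scores are invented, and no engine publishes its actual scoring.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_candidate_pool(results_per_subquery):
    """Merge per-sub-query result lists into one de-duplicated pool, keeping first-seen order."""
    seen, pool = set(), []
    for results in results_per_subquery:
        for url in results:
            if url not in seen:
                seen.add(url)
                pool.append(url)
    return pool


def rerank(pool, query_vec, page_vecs, citability, keep=5, weight=0.5):
    """Blend embedding similarity with a hypothetical citability score and keep the top pages."""
    def score(url):
        similarity = cosine(query_vec, page_vecs[url])
        return (1 - weight) * similarity + weight * citability[url]

    return sorted(pool, key=score, reverse=True)[:keep]


# Toy data: two sub-queries whose results overlap on one URL.
results = [["a.example/1", "b.example/2"], ["b.example/2", "c.example/3"]]
pool = build_candidate_pool(results)

query_vec = [1.0, 0.0]
page_vecs = {"a.example/1": [1.0, 0.0], "b.example/2": [0.0, 1.0], "c.example/3": [1.0, 1.0]}
citability = {"a.example/1": 0.2, "b.example/2": 0.9, "c.example/3": 0.6}

print(rerank(pool, query_vec, page_vecs, citability, keep=2))
```

In this toy run the page with the best text match (a.example/1) does not come first, because the blended score also rewards the made-up citability value. That is the practical point of stage 3: relevance gets you considered, but it is not the only term in the score.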

Three things are worth absorbing from this:

You have to clear traditional search first. If you're not in the top 10 for the relevant query, you're not in the candidate pool. SEO is the prerequisite to GEO.
The engine reads the page. Unlike Google's ranking, where signals are mostly off-page (links, freshness, authority), AI engines actually fetch the HTML and decide based on what's in it. The page text matters more here than it does for ranking.
Citation is a separate decision from inclusion. You can be in the fetched pool and still not be cited if the paragraph reads better without you.

The seven signals across cited sources

Looking at which sources actually get cited, seven signals show up again and again. None is individually decisive; cited sources tend to hit four or five at once.

1. Explicit recency

Cited sources tend to have visible dates. Not just a buried "Last updated" in the footer, but a visible "Updated [month, year]" near the top, plus a dateModified field in the Article schema. AI engines appear to prefer recently modified content, plausibly because their fine-tuning weighted it that way and because recent content is more likely to be accurate for anything time-sensitive.

Practical move: every article should have a date visible above the fold, and your Article JSON-LD should carry datePublished plus a dateModified that changes only when you actually update the content. Don't fake the date: Google catches this, and AI engines are starting to.
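
One way to keep the visible date and the schema date honest is to derive both from the same stored value and sanity-check them at publish time. The sketch below assumes ISO 8601 date strings (YYYY-MM-DD); the function names and the "Updated [month] [year]" wording are illustrative choices, not a standard.

```python
from datetime import date

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def visible_label(modified: str) -> str:
    """Build the on-page label from the same value that feeds dateModified."""
    d = date.fromisoformat(modified)
    return f"Updated {MONTHS[d.month - 1]} {d.year}"


def check_article_dates(published: str, modified: str, today: date) -> list[str]:
    """Return a list of problems with an article's dates (empty list means none found)."""
    problems = []
    pub = date.fromisoformat(published)
    mod = date.fromisoformat(modified)
    if mod < pub:
        problems.append("dateModified is earlier than datePublished")
    if mod > today:
        problems.append("dateModified is in the future")
    return problems


print(visible_label("2026-03-04"))
print(check_article_dates("2025-11-02", "2026-03-04", date(2026, 3, 10)))
```

Because the label is generated from the stored date, the two cannot drift apart, and a date bumped without a real edit is at least a deliberate act rather than a template accident.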

2. Unique numbers

A page that contains a specific number that doesn't appear elsewhere on the web — your own dataset, a survey, a measurement — is dramatically more likely to be cited than a page that quotes commonly-cited stats.

The intuition: AI engines see two pages saying "the average American household spends $X on groceries." One is the BLS report (the original source). The other is your blog quoting the BLS report. The engine cites BLS. But if you have your own measurement — even a smaller-N one — that's information not available elsewhere, and the engine cites you because the alternative is no citation for that fact.

Practical move: do small measurements. Survey 50 customers. Track one operational metric for a quarter. Publish the result with methodology. The dataset doesn't need to be impressive — it needs to exist nowhere else.

3. Structured data (JSON-LD)

Pages with appropriate Schema.org markup (Article, Product, FAQPage, HowTo, Organization) get cited more often. The signal is twofold: structured data lets the engine extract entities confidently without relying on text parsing, and correct schema is itself a credibility signal, since sites that bother with schema tend to be more carefully written.

Practical move: at minimum, every article needs Article schema with headline, datePublished, dateModified, author, and publisher. Product pages need Product. FAQ sections need FAQPage. Don't over-mark — five clean schemas beat fifteen sloppy ones.
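
As a rough illustration of that minimum, the sketch below builds an Article JSON-LD snippet with the five fields named above. The function name and the sample values are made up, and representing the author as a Person and the publisher as an Organization is one common choice, not the only valid one.

```python
import json


def article_jsonld(headline, published, modified, author, publisher):
    """Return a <script> tag containing minimal Article JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": published,
        "dateModified": modified,
        "author": {"@type": "Person", "name": author},
        "publisher": {"@type": "Organization", "name": publisher},
    }
    body = json.dumps(data, indent=2)
    # Escape "</" so a value containing "</script>" cannot close the tag early.
    # "\/" is a valid JSON escape for "/", so parsers read the value unchanged.
    body = body.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(article_jsonld(
    headline="Is HelloFresh worth it for a family of four?",
    published="2025-11-02",
    modified="2026-03-04",
    author="Jane Example",          # hypothetical author
    publisher="Example Kitchen",    # hypothetical publisher
))
```

Generating the snippet from the same fields your CMS already stores keeps the markup in step with the page, which matters more than adding extra schema types.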

4. Focused topic

AI engines prefer focused pages over generalist roundups. A page titled "Is HelloFresh worth it for a family of four?" beats a page titled "Complete guide to meal kit delivery services" for that specific query — even if the generalist guide ranks higher in Google.

The reason is the unit of competition. Google rewards comprehensive pages because they can rank for many related queries. AI engines cite per-query, and they reward the page that addresses the literal question.

Practical move: produce focused pages with literal-question H2s. Don't cram every related topic into one mega-post. Three focused 1,200-word posts beat one 4,000-word omnibus for AI citation.

5. Link reputation (still)

Backlinks still matter. The retrieval stage is essentially traditional search, which still uses PageRank descendants as a ranking signal. Pages with strong inbound links surface to the re-ranker; pages without them don't make it into the candidate pool.

Practical move: nothing surprising. Earn links the way you would for traditional SEO. The work isn't different; the rewards compound across both surfaces.

6. llms.txt presence

Newer signal, smaller impact, but growing. An llms.txt file at your site root gives AI engines a canonical one-sentence description of the site and the canonical URLs to link to. When present and well-written, it reduces ambiguity in citations.

Practical move: see our full guide on llms.txt. Takes ~20 minutes to implement.
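
For a sense of scale, here is a small sketch that writes out a file of this kind. It assumes the layout from the public llms.txt proposal (a title line, a one-sentence summary as a blockquote, then sections of links); the site name, URLs, and helper function are hypothetical.

```python
def build_llms_txt(name, summary, sections):
    """Assemble llms.txt text: title, blockquote summary, then link sections."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for title, links in sections.items():
        lines.append(f"## {title}")
        lines.append("")
        for label, url, note in links:
            lines.append(f"- [{label}]({url}): {note}")
        lines.append("")
    # Strip the trailing blank line and end the file with a single newline.
    return "\n".join(lines).rstrip() + "\n"


text = build_llms_txt(
    name="Example Kitchen",  # hypothetical site
    summary="Independent meal kit reviews with our own cost-per-meal measurements.",
    sections={
        "Guides": [
            ("Is HelloFresh worth it for a family of four?",
             "https://example.com/hellofresh-family-of-four",
             "cost per meal and portion sizes"),
        ],
    },
)
print(text)
```

The one-sentence summary is the part worth the most care, since it is the description you are asking engines to reuse.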

7. Short, quotable lines

AI engines lift specific sentences. The sentences they prefer are roughly 12–30 words long, contain a concrete claim or number, and sit at the start of a paragraph or below an H2 they've matched to the query. Long, hedged, comma-laden sentences are too unwieldy to quote.

Practical move: write in short declarative sentences for at least the opening of each section. Save the hedging for paragraph two and beyond. The first sentence after every H2 should be quotable on its own.
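
A quick self-check for this can be scripted. The sketch below pulls the first sentence of a paragraph and tests it against the 12–30 word range and the presence of a number. The sentence splitter is deliberately naive (it will break on abbreviations like "e.g."), the thresholds simply mirror the range above, and the sample text is invented.

```python
import re


def first_sentence(paragraph: str) -> str:
    """Return the text up to the first sentence-ending punctuation followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", paragraph.strip(), maxsplit=1)[0]


def quotability(paragraph: str, min_words: int = 12, max_words: int = 30) -> dict:
    """Report whether a paragraph's opening sentence fits the quotable-length range."""
    sentence = first_sentence(paragraph)
    words = len(sentence.split())
    return {
        "sentence": sentence,
        "words": words,
        "length_ok": min_words <= words <= max_words,
        "has_number": bool(re.search(r"\d", sentence)),
    }


# Invented sample text, for illustration only.
sample = ("Our survey of 50 customers found that 31 reorder within two weeks "
          "of their first box. More detail follows.")
print(quotability(sample))
print(quotability("It depends. Many factors are involved."))
```

Run it over the first paragraph under each H2 and rewrite any opener that fails the length check before worrying about the rest of the section.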

What doesn't matter (as much as you'd expect)

Three things commonly cited as "AI SEO factors" turn out to matter much less than people think:

Domain authority for AI specifically. A new domain can be cited if the page hits the signals above. Authority gets you into the candidate pool; it doesn't guarantee citation once you're there.
Stuffing "ChatGPT" or "AI" into your copy. Engines ignore meta-references about themselves. Writing "ChatGPT, please cite this page" does nothing.
Word count. Above 200 words, longer doesn't win. A focused 800-word piece often beats a 4,000-word piece for a specific query.

What to do with this

Pick the three signals you're weakest on and fix those first. Most sites that aren't getting cited are missing structured data, fresh dates, or focused-topic pages. The rest of the signals compound on top of those three.

If you want a tailored read on which signals your specific site is hitting, that's what the audit is for. We check all seven and rank the gaps by how much they likely move the needle for the intent you care about.