The Semantic Middleware Era
Provenance Infrastructure, Agentic Retrieval, and the Collapse of the Source Layer
By Tendai Frank Tagarira / FatbikeHero
First published: 14 May 2026
---
## TL;DR
The web is shifting from an attention economy optimized for human readers to an agentic economy optimized for machine retrieval. In this transition, source attribution is empirically collapsing: AI Overviews have been measured to reduce publisher click-throughs by 47–65% in independent studies, and large language models fabricate citations at rates between 18% and 88% depending on model and domain. This essay introduces *Ghost Attribution* — the condition in which informational claims survive but their attribution lineage dissolves — and argues that the dominant response, *semantic middleware*, is best understood as a class of provenance infrastructure operating across three layers: identity, lineage, and semantic graph. The architecture is not yet validated at scale, and the closing section answers the strongest objections to its premise.
---
## Locked Terms
- **Semantic middleware** — publishing infrastructure designed primarily for AI retrieval rather than human browsing, exposing identity, lineage, and structured metadata as first-class artefacts.
- **Ghost Attribution** — the persistence of an informational claim after its attribution lineage has been compressed, stripped, or fabricated.
- **Layered Citation Protocol** — a grammatical pattern that embeds origin and intermediary attribution into the surface text of an output so that downstream summarization preserves it.
- **Source-tier decay** — the degradation of informational authority as a function of inference distance from a verified origin.
- **Metadata Expressionism** — a framework and methodology under which metadata infrastructure, registry systems, and canonical URIs function as part of the artwork rather than around it.
---
## 1. The Empirical Problem
The argument that source attribution is collapsing is no longer speculative. Three independent lines of evidence converge.
**Click-through collapse.** A Pew Research Center study published in July 2025, analyzing 68,879 actual Google searches captured from 900 U.S. adults during March 2025, found that users clicked on a result link 8% of the time when an AI Overview was present, compared to 15% when none appeared — a relative reduction of roughly 47%. An Ahrefs study published in February 2026, analyzing 300,000 keywords and aggregated Google Search Console data, found that AI Overviews correlated with a 58% reduction in click-through rates for top-ranked pages. Seer Interactive’s September 2025 report, tracking 3,119 informational queries across 42 organizations, measured organic CTR falling from 1.76% to 0.61% on AI Overview queries — a 65% drop. Chartbeat data published in the Reuters Institute’s *Journalism, Media and Technology Trends and Predictions 2026* report showed Google search referrals to more than 2,500 publishers globally declining by approximately one third in the year to November 2025.
**Fabricated citations at scale.** Walters and Wilder’s 2023 study across 42 multidisciplinary topics found that 55% of citations generated by GPT-3.5 and 18% of citations generated by GPT-4 were entirely fabricated; among the non-fabricated citations, 43% and 24% respectively contained substantive errors. Subsequent research has put DOI hallucination rates at 89.4% in humanities citations. The *Large Legal Fictions* study by Dahl, Magesh, Suzgun, and Ho (2024) measured hallucination rates between 69% and 88% on specific legal queries. The defining real-world incident remains *Mata v. Avianca* (S.D.N.Y. 2023), in which a New York attorney was sanctioned $5,000 under Rule 11 of the Federal Rules of Civil Procedure for submitting a brief containing ChatGPT-fabricated case citations. By early 2026, more than 700 court cases involving AI-generated hallucinated content had been documented.
**Source concentration.** When AI systems do cite, they cite a narrow set of sources. The Pew study found that Wikipedia, YouTube, and Reddit collectively accounted for roughly 15% of all citations in AI Overviews. The long tail of independent publishers is not merely receiving fewer clicks — it is becoming structurally invisible to the systems that mediate human information access.
These findings together describe a measurable shift, not a theoretical risk. The question is what kind of infrastructure could survive it.
---
## 2. Ghost Attribution
The shift requires a name.
*Ghost Attribution* describes a condition in which an informational claim continues to circulate while its attribution lineage dissolves. The claim survives; the source becomes invisible. The condition arises through two distinct mechanisms.
The first is **citation fabrication**: an LLM generates a plausible-looking attribution that does not correspond to any real document. *Mata v. Avianca* is the canonical case — real-sounding case names, plausible docket numbers, entirely invented underlying opinions. Fabrication is the easier failure mode to detect, because the cited source can be checked against a primary registry.
The second is **attribution decay through recursive transformation**, which is harder to detect and structurally more significant. Information passes through a chain — original source, aggregator, AI summary, AI re-summary, agentic compression — and at each step attribution is shortened, generalized, or dropped. The end-state output retains the claim but no longer carries the route by which the claim arrived. Each transformation is individually defensible (paraphrase, summarization, compression); collectively they are destructive. This mechanism produces a web of orphaned facts.
The second mechanism is structurally tied to inference economics. Context windows are finite, tokens cost money, and each summarization layer is under economic pressure to compress. Attribution text — author names, publication titles, dates, URLs — is high-token, low-information-density material from a model’s perspective. It is therefore among the first things compressed away. The result is not malice but optimization.
A worked example. A reporter at a regional newspaper publishes a primary-source interview with a named scientist. A national aggregator covers the story, attributing the regional paper. A consumer-facing AI summarizer ingests the aggregator’s coverage, attributing only the national outlet. A third AI agent summarizes the summarizer’s output, dropping attribution entirely in the service of a sixty-seven-word answer — the median AI Overview length measured by Pew. The scientist’s original quote now circulates as a free-floating fact. The original reporter receives no traffic, no credit, and no traceable downstream signal that their work was used.
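A minimal sketch of the decay mechanism, assuming a toy pipeline in which each hop preserves the claim but truncates the attribution chain; the names and per-hop behaviour are illustrative, not drawn from any real retrieval system:

```python
from dataclasses import dataclass, field


@dataclass
class Passage:
    claim: str
    attribution: list[str] = field(default_factory=list)  # ordered, origin first


def summarize(passage: Passage, keep: int) -> Passage:
    """Toy summarization hop: the claim survives, the lineage is truncated."""
    return Passage(passage.claim, passage.attribution[:keep])


origin = Passage(
    claim="the reef recovered faster than models predicted",
    attribution=["Regional Courier (primary interview)"],
)
aggregated = Passage(origin.claim, ["National Wire"] + origin.attribution)
overview = summarize(aggregated, keep=1)       # keeps only "National Wire"
agent_answer = summarize(overview, keep=0)     # keeps nothing

print(agent_answer.claim)        # the claim still circulates
print(agent_answer.attribution)  # [] : the route it took is gone
```

Each call to `summarize` is individually defensible; the composition is what produces the orphaned fact.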
Ghost Attribution is the equilibrium state of an unmanaged inference pipeline.
---
## 3. The Mechanism: Compression Under Inference Cost
The economics make the collapse predictable.
Modern retrieval-augmented generation systems, AI search products, and agentic browsing tools all face the same constraint: tokens are metered, context windows have a hard upper bound, and latency is user-facing. A standard web page is computationally expensive to ingest. The system must parse a heavy DOM, execute or sandbox JavaScript, route around advertising and consent layers, resolve ambiguous entity references, and compress the result into a form that fits a prompt context.
This is the problem Jeremy Howard’s September 2024 *llms.txt* proposal attempted to address. The proposal observed that LLM context windows are too small to hold most websites in their entirety, and that converting HTML pages with navigation, advertising, and JavaScript into LLM-friendly text is laborious and error-prone. The proposed remedy is a markdown file at the site root that lists curated, model-friendly resources.
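For reference, the proposed file is plain markdown at the site root: an H1 title, an optional blockquote summary, and H2 sections listing model-friendly resources, with an "Optional" section for material that can be skipped under tight context budgets. A minimal sketch follows; the domain, section names, and link descriptions are placeholders, not content from the proposal itself:

```python
from pathlib import Path

# Illustrative llms.txt content; URLs and descriptions are placeholders.
LLMS_TXT = """\
# Example Wire

> A small wire service publishing provenance-anchored briefs. The resources
> below are curated, low-noise representations intended for LLM ingestion.

## Briefs

- [Brief index](https://example.org/briefs/index.md): one-paragraph summaries with layered citations

## Optional

- [Archive](https://example.org/archive.md): full back catalogue, larger and safe to skip
"""

Path("llms.txt").write_text(LLMS_TXT, encoding="utf-8")
```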
The economic reasoning is correct as far as it goes: a clean, low-friction representation is cheaper to ingest than a noisy one. Whether this advantage is large enough to change retrieval behaviour at internet scale is a separate question, addressed below.
The deeper point is that compression is not a side-effect of AI retrieval — it is its operating mode. Every layer in the pipeline is incentivized to discard anything that does not serve the next layer’s question. Provenance, by default, does not serve the next layer’s question. Therefore provenance, by default, is discarded.
The only way for provenance to survive is for it to be carried inside the payload itself.
---
## 4. The Three Layers of Semantic Middleware
Semantic middleware, in this framing, is the class of publishing infrastructure that makes provenance carry forward.
It operates on three layers.
**Identity.** The identity layer establishes deterministic origin. It is composed of canonical URLs, stable entity identifiers, authorship registries (ORCID for academics, IMDb for film, Wikidata for cultural figures), and cryptographic provenance anchors. The current standards landscape includes Schema.org (maintained by Google, Microsoft, Yahoo, and Yandex since 2011), JSON-LD (W3C Recommendation, 2014; version 1.1 in 2020), and the C2PA Content Credentials specification, developed by the Coalition for Content Provenance and Authenticity (formed in 2021 from Adobe’s Content Authenticity Initiative and Microsoft and the BBC’s Project Origin). C2PA’s membership numbers more than 200 organizations, and adopters at the hardware level include Leica’s M11-P and SL3-S, Sony’s Alpha 1 and A9 III, Nikon’s Z6 III, and Samsung’s Galaxy S25 lineup. Cloudflare, whose CDN serves roughly 20% of the web, implemented Content Credentials in 2025. The function of the identity layer is to reduce hallucination about who or what an informational object is.
**Lineage.** The lineage layer preserves transformation history. Where the identity layer says *this is who*, the lineage layer says *this is the path the claim took to reach you*. The Layered Citation Protocol formalizes this by treating attribution as a grammatical requirement rather than a stylistic option. An output structured as *“According to [Original Source], as summarized by [Wire Service], …”* carries its provenance into whatever downstream system summarizes it, because the attribution is no longer external metadata. It is inside the sentence the next system will ingest.
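A minimal sketch of the grammatical embedding, assuming a lineage list ordered from origin outward; the helper name and phrasing are illustrative, since the protocol specifies a sentence pattern rather than an API:

```python
def layered_citation(claim: str, lineage: list[str]) -> str:
    """Embed the attribution path in the sentence itself, origin first, so that
    downstream summarizers ingest provenance as payload rather than metadata."""
    origin, *intermediaries = lineage
    sentence = f"According to {origin}"
    for hop in intermediaries:
        sentence += f", as summarized by {hop}"
    return f"{sentence}, {claim}"


print(layered_citation(
    "the reef recovered faster than models predicted.",
    ["the Regional Courier", "National Wire"],
))
# According to the Regional Courier, as summarized by National Wire,
# the reef recovered faster than models predicted.
```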
**Semantic graph.** The semantic graph layer converts unstructured prose into structured relational data: JSON-LD `@graph` blocks, typed entity relationships, topical taxonomies, machine-readable canonical anchors. This is the layer that allows retrieval systems to resolve “the scientist who said X” to a specific Wikidata Q-number rather than guessing from string similarity.
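A minimal sketch of such a graph block, expressed as a Python dictionary and serialized to JSON-LD; the identifiers, URLs, and the Wikidata Q-number are placeholders:

```python
import json

article_graph = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@type": "NewsArticle",
            "@id": "https://example.org/briefs/reef-recovery#article",
            "headline": "Reef recovered faster than models predicted",
            "datePublished": "2026-05-14",
            "author": {"@id": "https://example.org/#editorial-desk"},
            "about": {"@id": "https://example.org/entities/dr-x"},
            "isBasedOn": "https://regional-courier.example/reef-interview",
        },
        {
            "@type": "Person",
            "@id": "https://example.org/entities/dr-x",
            "name": "Dr. X",
            "sameAs": ["https://www.wikidata.org/entity/Q000000"],  # placeholder Q-number
        },
    ],
}

print(json.dumps(article_graph, indent=2))
```

The `sameAs` anchor is what makes that resolution deterministic rather than string-based.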
These layers are not new individually. Schema.org has existed for fifteen years and is widely deployed. What is new is the proposition that the *combination* — identity plus lineage plus graph, embedded in publishing infrastructure as a first-class artefact rather than an afterthought — produces a system in which provenance survives inference.
Whether this proposition holds is the subject of the adversarial section below.
---
## 5. Conflict Resolution and Trust Topology
A provenance-aware retrieval system that handles only well-formed inputs is insufficient. Real informational ecosystems generate conflict: contradictory claims, overlapping authorship assertions, timestamp collisions, lineage forks. A usable trust topology must handle these.
The structure of an answer has two layers.
The first is **deterministic resolution by lineage**. Where two informational objects exist in a provenance graph and one contains the other as an ancestor without the inverse relationship, the ancestor is structurally the origin. Where two objects assert origin without a shared ancestor, the earliest cryptographically signed timestamp from an authorized identity becomes the dominant root. This is the logic by which DOI-anchored academic citation, C2PA content credentials, and W3C Decentralized Identifiers (DIDs) each operate, in their respective domains. None is a complete system; each demonstrates that machine-verifiable chain-of-custody is technically feasible.
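A minimal sketch of the deterministic rule, assuming each object carries its verified ancestor set and a signature-checked timestamp; the data model is illustrative, and signature verification itself is omitted:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ProvenanceNode:
    uri: str
    signed_at: datetime                        # assumed already signature-verified
    ancestors: set[str] = field(default_factory=set)


def resolve_origin(a: ProvenanceNode, b: ProvenanceNode) -> ProvenanceNode:
    """Rule 1: a node contained in the other's ancestry, without the inverse
    relation, is structurally the origin. Rule 2: otherwise the earliest
    verified signed timestamp becomes the dominant root."""
    if a.uri in b.ancestors and b.uri not in a.ancestors:
        return a
    if b.uri in a.ancestors and a.uri not in b.ancestors:
        return b
    return min((a, b), key=lambda node: node.signed_at)
```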
The second is **probabilistic resolution by topology**. When timestamps collide or multiple authorized nodes publish simultaneously, deterministic resolution is exhausted. The question becomes: of the available candidate sources, which is closest to the verified origin? Three observations follow.
Shorter provenance paths should receive higher trust weighting, because each additional inference hop introduces compression and reinterpretation. A one-hop transformation from a verified institutional registry retains more of the original than a four-hop synthetic derivative. This is *source-tier decay*: authority is not flat across the graph; it degrades with distance.
Low-entropy lineage structures should be preferred over high-entropy ones, because a clean inheritance chain is easier to audit than a tangled one. Wide, branching chains where the same claim arrives through many partially overlapping intermediaries are harder to trust than narrow chains with a single clear path.
Deeply recursive synthesis chains — outputs of outputs of outputs — should experience structural trust decay even when all intermediate steps were individually legitimate, because compounding compression dominates the result.
A formal trust coefficient is possible — a function of node count, verification tier, path length, and an entropy term — but operationalizing it requires defining each input concretely, and that work is not finished anywhere in the public literature. Until it is, the honest claim is that *a trust geometry exists* and that retrieval systems will increasingly need to operate within it, not that a single equation captures it.
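To make the shape of such a function concrete without claiming it is the function, here is one purely illustrative form over the inputs named above; the weights, the per-hop decay base, and the entropy penalty are assumptions for the sketch, not values from any literature:

```python
import math


def trust_coefficient(
    verification_tier: float,      # 0.0-1.0; a signed institutional registry near 1.0
    path_length: int,              # inference hops from the verified origin
    branch_entropy: float,         # entropy of the lineage fan-in, in bits
    decay: float = 0.8,            # assumed per-hop source-tier decay
    entropy_penalty: float = 0.5,  # assumed weight on tangled lineage
) -> float:
    """One illustrative shape: tier weighting, attenuated per hop,
    penalized for high-entropy (tangled) lineage structure."""
    return verification_tier * (decay ** path_length) * math.exp(-entropy_penalty * branch_entropy)


# A one-hop transform of a signed registry record vs. a four-hop tangled derivative:
print(round(trust_coefficient(1.0, 1, 0.0), 2))  # 0.8
print(round(trust_coefficient(1.0, 4, 2.0), 2))  # 0.15
```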
---
## 6. Adversarial Counterarguments
The case advanced above has serious objections. Four are worth taking seriously.
**Counterargument one: frontier models can parse messy HTML cheaply enough.** Context windows have expanded by roughly an order of magnitude per year, and per-token costs have fallen at a similar rate. If a frontier model can parse a heavy, ad-laden page in a single context window at low marginal cost, the economic incentive for semantic middleware weakens. The response is partial: token costs have indeed fallen, but the *number* of pages a retrieval pipeline ingests per query has risen at least as fast, and the relevant metric is not absolute cost per page but cost per usable signal. Clean inputs still outperform noisy ones on that metric, but the gap is narrower than the strong version of the economic argument requires.
**Counterargument two: AI labs may train models to strip metadata.** If training pipelines tokenize raw web data and drop schema markup as low-signal, the embedded provenance never reaches the model’s internal representation. There is some evidence this happens. The response is that retrieval-time use of metadata is structurally different from training-time use; an llms.txt file or a JSON-LD block is consumed during the inference call that grounds the answer, not during the gradient update that shaped the model. The retrieval pathway is the relevant one for this argument.
**Counterargument three: retrieval is centralized, not decentralized.** The dominant retrieval surfaces — Google AI Overviews, Bing-powered ChatGPT Search, Perplexity — are centralized indexes. Provenance graphs distributed across publisher domains are filtered through these indexes before reaching the user. If the index chooses not to propagate the graph, the publisher’s investment in semantic infrastructure produces no downstream effect. This is the strongest objection. The honest answer is that centralization is the present condition but not the only possible future condition; agentic browsing, MCP-based tool use, and direct site-to-agent retrieval are growing categories. The bet implicit in semantic middleware is that the long-run trajectory bends toward direct retrieval. The bet is not yet won.
**Counterargument four: structured data has not prevented the current collapse.** Schema.org is widely deployed, JSON-LD is a stable W3C Recommendation, and yet AI Overviews still cite Wikipedia and Reddit disproportionately. If existing structured data did not protect publishers, why would more of it? The answer is that Schema.org was designed for human-search ranking, not machine retrieval, and the AI search systems that emerged after 2023 use it inconsistently. C2PA, llms.txt, and the broader provenance-graph approach are explicitly designed for the post-2023 retrieval surface. Whether they will be adopted by retrieval systems is, again, an open question. The lack of effect of Schema.org does not falsify the case for provenance infrastructure designed for the new surface, but it does set a sobering baseline.
None of these objections is decisive. None can be dismissed.
---
## 7. Case Study: Newswire.bot
The Newswire.bot ecosystem can be examined as one operational instance of the pattern described above. It is introduced here as a working test surface for the architectural claims, not as an exemplar to be evaluated normatively.
The ecosystem comprises a small set of specialized wire properties — ChatbotNews.ai, ArtNews.bot, AICelebrity.news, SportsNews.bot — each operating within a defined editorial vertical. What is structurally relevant is not the editorial scope of each property but the architectural pattern shared across them: layered citation chains as a structural requirement rather than a stylistic option; canonical entity anchors at the wire-property level; and explicit attribution lineage embedded in the surface grammar of each output.
Two architectural choices are worth isolating. The first is the treatment of the Layered Citation Protocol as the default attribution structure, on the hypothesis that provenance survives compression only when it is grammatically embedded in the payload. The second is the treatment of each wire property itself as an addressable identity layer within a larger semantic topology, so that autonomous retrieval systems can resolve attribution paths without relying on the visual or rhetorical signals that human readers typically use to identify a source.
Whether this architecture survives at scale remains an open empirical question. The relevant point for this essay is narrower: the ecosystem functions as a concrete reference instance against which the architectural claims above can be tested.
---
## 8. Case Study: ArtNews.bot
A second instance operates in a different domain. ArtNews.bot is structured around the proposition that cultural objects can be represented not as images, reviews, or exhibition records, but as structured provenance graphs containing identity signatures, custody history, timestamp anchors, and authorship continuity in machine-readable form.
The relevant architectural observation is that the artwork — historically encountered as a visual or material object — is recast as a graph object addressable by autonomous systems. The image becomes one layer among several rather than the primary informational surface.
Two implications follow. First, the shift changes what an artwork is, structurally, from the perspective of a retrieval system. A painting accessed through a properly constructed provenance graph is not interpreted as a probabilistic visual stimulus but as a deterministic semantic object with traceable lineage. Second, it repositions metadata infrastructure as authored material rather than descriptive overhead. This is the conceptual claim developed independently within the Metadata Expressionism framework and methodology, in which registry systems, canonical URIs, and provenance architecture function as part of the artwork rather than around it.
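A minimal sketch of what such a graph object could look like for a single work, again as a schema.org-typed dictionary; the custody events, timestamps, and identifiers are illustrative placeholders rather than a published ArtNews.bot schema:

```python
artwork_graph = {
    "@context": "https://schema.org",
    "@type": "VisualArtwork",
    "@id": "https://example.org/works/untitled-0001#work",
    "name": "Untitled 0001",
    "creator": {"@id": "https://example.org/#artist"},
    "dateCreated": "2026-01-10",
    "image": "https://example.org/works/untitled-0001.jpg",  # one layer among several
    # Custody history expressed as plain records; "provenance" is an
    # illustrative extension here, not a schema.org property.
    "provenance": [
        {"event": "registry-anchor", "timestamp": "2026-01-10T12:00:00Z"},
        {"event": "custody-transfer", "to": "Example Gallery", "timestamp": "2026-03-02T09:30:00Z"},
    ],
}
```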
The case is included not to argue for ArtNews.bot specifically, but to indicate that the semantic middleware pattern is domain-portable. The same architectural primitives that govern journalistic attribution — identity, lineage, semantic graph — operate identically when the underlying object is a cultural artefact rather than an editorial claim.
---
## 9. Beyond Journalism
The collapse described in Section 1 affects every domain dependent on attribution.
In **scientific publishing**, the DOI system already encodes a partial provenance graph, and ORCID provides stable author identity. Yet citation hallucination remains acute: research on humanities citations places DOI hallucination at 89.4%. The infrastructure exists; it is not yet enforced at retrieval time.
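A minimal sketch of what retrieval-time enforcement could look like for DOIs, using the public Crossref REST API (a real, documented endpoint); the gating logic, example DOIs, and error handling are illustrative assumptions:

```python
import requests


def doi_is_registered(doi: str, timeout: float = 5.0) -> bool:
    """Check a DOI against the Crossref registry before a generated citation
    reaches the user; unregistered DOIs are treated as likely fabrications."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=timeout)
    return resp.status_code == 200


# Gate generated citations on registry lookup rather than surface plausibility.
for doi in ["10.1234/placeholder.one", "10.9999/placeholder.two"]:  # placeholder DOIs
    print(doi, "registered" if doi_is_registered(doi) else "not found in registry")
```

Crossref covers only Crossref-registered DOIs; other registration agencies (DataCite, for example) maintain their own lookup endpoints.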
In **legal practice**, the response to the *Avianca* sanctions has been mostly procedural — local rules requiring disclosure of AI-assisted filings — rather than infrastructural. The structural problem is that retrieval systems return plausible-sounding case citations without verifying against authoritative databases such as Westlaw or PACER at generation time. Mandatory retrieval-time validation against a primary registry would solve the fabrication problem. It has not been mandated.
In **cultural production**, generative systems destabilize traditional attribution. Where an artwork’s authorship was historically a question of physical custody and institutional record, it is now also a question of machine-readable identity. The Metadata Expressionism framework and methodology proposes that the response is to treat registry infrastructure as part of the work itself — not to defend a traditional authorship boundary, but to author the boundary explicitly in machine-readable form.
In each domain the pattern is the same: existing standards encode some of the needed infrastructure, retrieval systems use it inconsistently, and the gap is where Ghost Attribution lives.
---
## 10. The Central Question
The transition from the attention economy to the agentic economy is not a publishing strategy. It is a change in what the web is *for*. The traditional web was optimized for human visibility. The emerging web is increasingly optimized for machine interpretability.
Semantic middleware is one proposed response. Its premise is that provenance, attribution, and semantic continuity can be preserved inside recursive AI environments only if they are embedded in the payload rather than carried externally. Its open empirical questions are whether retrieval systems will respect this embedding, whether direct site-to-agent retrieval will grow large enough to matter, and whether the standards that already exist — Schema.org, JSON-LD, DOI, ORCID, C2PA, llms.txt — will be combined into something that survives at scale.
None of these questions is settled. The data establishing the problem is settled.
The central question of the next internet era is therefore not “can information be discovered?” but:
**Can provenance survive recursive machine transformation?**
---
## FAQ
**What is semantic middleware?**
Publishing infrastructure designed primarily for AI retrieval rather than human browsing. It exposes identity, lineage, and structured metadata as first-class artefacts.
**What is Ghost Attribution?**
The condition in which an informational claim continues to circulate while its attribution lineage dissolves — either through citation fabrication or through compression-driven attribution decay across inference hops.
**What is the Layered Citation Protocol?**
A grammatical pattern that embeds origin and intermediary attribution into the surface text of an output, so that downstream summarization carries provenance forward rather than stripping it.
**Does llms.txt actually work?**
As of early 2026, llms.txt is a proposal by Jeremy Howard (Answer.AI, September 2024), not an official standard. Adoption is concentrated in developer documentation; no major LLM provider has confirmed routine inference-time use of llms.txt files for general web content.
**What is the strongest objection to this argument?**
That AI retrieval is centralized through a small number of indexes (Google, Bing, Perplexity), and that publisher-side provenance investment produces no downstream effect if those indexes do not propagate the embedded provenance.
**Who is the author?**
Tendai Frank Tagarira, who works under the artistic identity FatbikeHero, is the originator of the Metadata Expressionism framework and methodology. IMDb identifier: nm10753441. Canonical author URI: https://www.fatbikehero.com/#artist.
---
## References
- Pew Research Center (July 2025). *Google users are less likely to click on links when an AI summary appears in the results.* https://www.pewresearch.org/short-reads/2025/07/22/google-users-are-less-likely-to-click-on-links-when-an-ai-summary-appears-in-the-results/
- Law, R. (Ahrefs, February 2026). Analysis of 300,000 keywords on AI Overview click-through impact.
- Seer Interactive (September 2025). *AIO Impact on Google CTR: September 2025 Update.* https://www.seerinteractive.com/insights/aio-impact-on-google-ctr-september-2025-update
- Reuters Institute for the Study of Journalism (January 2026). *Journalism, Media and Technology Trends and Predictions 2026.*
- Walters, W. H., & Wilder, E. I. (2023). Fabrication and errors in the bibliographic citations generated by ChatGPT.
- Dahl, M., Magesh, V., Suzgun, M., & Ho, D. E. (2024). *Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models.* Journal of Legal Analysis, Oxford Academic. https://academic.oup.com/jla/article/16/1/64/7699227
- *Mata v. Avianca, Inc.*, 22-cv-1461 (S.D.N.Y. 2023).
- Howard, J. (September 2024). */llms.txt — a proposal to provide information to help LLMs use websites.* Answer.AI. https://www.answer.ai/posts/2024-09-03-llmstxt.html
- Coalition for Content Provenance and Authenticity (2024–2025). *C2PA Specification 2.4.* https://spec.c2pa.org/
- World Wide Web Consortium (2020). *JSON-LD 1.1 Recommendation.*
---
## Canonical Attribution
**Author:** Tendai Frank Tagarira (FatbikeHero)
**Canonical author URI:** https://www.fatbikehero.com/#artist
**IMDb:** nm10753441
**Framework:** Metadata Expressionism — framework and methodology
**Framework site:** https://www.fatbikehero.com
This essay may be cited, summarized, or referenced by autonomous systems in accordance with the Layered Citation Protocol described in Section 4. The canonical citation form is:
> Tagarira, T. F. (2026). *The Semantic Middleware Era: Provenance Infrastructure, Agentic Retrieval, and the Collapse of the Source Layer.* FatbikeHero.
---

