Why does ChatGPT make up Internal Revenue Code sections?

Because it generates citations the same way it generates prose: one token at a time, by predicting the most plausible next token. A section number is a low-redundancy string the model has seen in a familiar shape, so it completes the pattern fluently whether or not that exact subsection exists. The citation is produced, not retrieved from any copy of the Code.

How can I tell if an AI tax citation is real?

Open it. A real section resolves to a real page on uscode.house.gov or the eCFR; a fabricated one resolves to nothing or to text plainly about another topic. Then ask the model to quote the statutory language verbatim. For a fake section there is no real text to quote, so whatever the model returns will not match the source.

Does lowering the temperature or writing a stricter prompt stop the fabrication?

No. A stricter prompt changes the tone and a lower temperature reduces randomness, but the model still has no copy of the Code to consult, so it keeps generating citations the same way. The fabrication rate may drop, but it does not reach zero, and you cannot tell which answers were the lucky ones.

Are paid or purpose-built legal AI tools immune to this?

Not automatically. Stanford's evaluation of leading legal research tools found measurable hallucination rates even in retrieval-based products. What removes the failure mode is grounding the citation in primary authority and returning a verifiable link, so the section number is copied from a real source document rather than generated.

Why ChatGPT Invents Fake IRC Sections (And How to Stop It)

When Stanford researchers put specific, verifiable questions about federal court cases to the leading language models, the models fabricated an answer between 58 and 88 percent of the time. A tax citation is the same kind of object as the case citations in that study, and it breaks the same way.

Ask a general-purpose model about the qualified business income deduction and you will often get a fluent paragraph anchored to something like §199A(d)(4)(C). The prose reads well. The subsection looks like every other subsection you have cited a thousand times. There is just one problem: in the real Code, §199A(d) stops at paragraph (3). There is no (d)(4). The model did not misremember it. It built it.

That distinction matters, because a misremembered fact and a manufactured one fail in different ways, and only one of them can be fixed.

A citation is just a string the model finishes

A language model writes one token at a time, and each token is whatever the model judges most likely to come next given everything before it. That is the entire mechanism. It is extraordinary at prose because prose is densely constrained: once you have written “the taxpayer shall recognize,” the next few words are heavily predetermined by grammar and meaning.

A citation is the opposite kind of object. §199A(d)(2) is a near-random string with almost no semantic gravity. Nothing in the surrounding sentence makes (d)(2) more linguistically probable than (d)(4)(C). The model has read tens of thousands of citations shaped like §[number]([letter])([number])([letter]), so it has learned the shape perfectly. What it has not reliably learned, because next-token prediction rewards plausibility rather than existence, is which specific members of that shape actually exist in the United States Code.

So when the topic calls for a citation, the model does what it always does: it produces the most plausible continuation of the pattern. Plausible and real are different properties, and the model only optimizes for the first one.
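The gap between plausible and real can be made concrete with a small sketch. The regular expression below is an assumed simplification of the citation shape, not a full grammar of the Code, and the function name is hypothetical. It checks shape only, which is roughly the property a pattern-completer is good at satisfying.

```python
import re

# Assumed, simplified shape: "§", a section number with an optional capital
# letter suffix, then one or more parenthesized levels such as (d)(2).
CITATION_SHAPE = re.compile(r"§\d+[A-Z]?(?:\([0-9A-Za-z]+\))+")


def has_citation_shape(text: str) -> bool:
    """Return True if the whole string looks like a subsection citation."""
    return CITATION_SHAPE.fullmatch(text) is not None


# The four citations discussed in this article: three real, one fabricated.
candidates = ["§199A(d)(2)", "§199A(d)(3)", "§199A(e)(2)", "§199A(d)(4)(C)"]

for cite in candidates:
    # Every candidate passes, including the one that does not exist.
    print(cite, has_citation_shape(cite))
```

All four strings pass the shape test, including the fabricated one. Nothing in the shape distinguishes a real address from an invented one; that information has to come from somewhere else.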

The fabrication is not a bug in the model’s knowledge. It is the model working exactly as designed, applied to a task that punishes pattern-completion. The citation is generated, not retrieved.

Why tax is a worst-case domain for this

Three properties of tax authority stack the deck.

Section numbers carry no redundancy. In ordinary writing, if a model drops a word the meaning usually survives. A wrong subsection letter is silently, totally wrong while looking completely intact. There is no graceful degradation.

The real structure is deep and irregular. §199A defines the specified service trade or business test at (d)(2), puts the phase-in mechanics at (d)(3), and parks the threshold amount over at (e)(2). A model that has absorbed the general vibe of “199A has a bunch of lettered subsections about SSTBs and thresholds” will confidently assemble a citation that respects the vibe and violates the structure.

Citation	Real?	What is actually at that address
`§199A(d)(2)`	Yes	Defines the specified service trade or business
`§199A(d)(3)`	Yes	Phase-in of the SSTB limitation
`§199A(e)(2)`	Yes	The threshold amount
`§199A(d)(4)(C)`	No	`§199A(d)` ends at paragraph (3). There is no (4).

The first three are a click away from the real text. The fourth is indistinguishable from them in the model’s output and leads nowhere.

Confidence is decoupled from accuracy. The model renders (d)(4)(C) in exactly the same assured register it uses for (d)(2). There is no tremor in the prose, no hedge, no internal “I am less sure about this token” signal that surfaces to you. Calibrated uncertainty is not part of the output.

Put together: the one element of a tax answer that is least forgiving of error is also the element the model is structurally worst at producing, delivered with the same confidence as everything it gets right.

The fix is not a better prompt

A common reaction is to prompt the problem away. “Only cite real sections.” “Do not hallucinate.” “Be accurate.” These instructions change the tone of the output without changing the mechanism that produces it. The model still has no copy of the Code to consult; you have only asked it to sound more careful while it continues to generate citations the same way. Sometimes the fabrication rate drops. It does not go to zero, and you have no way to know which answers were the lucky ones.

The actual fix is to change where the citation comes from. If the section number is retrieved from an index of the real Code and then handed to the model, rather than generated by the model, the failure mode disappears at the root. The model goes back to doing the thing it is genuinely excellent at, which is writing clear prose, and the lookup layer supplies the parts that have to be exactly right. This is the entire premise behind connecting a model to primary authority through a tool: the citation is copied from a source document, so a fabricated (d)(4)(C) can never appear, because it is not in the source to copy.
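A minimal sketch of that separation, under loud assumptions: the index here is a hand-typed dictionary holding only the three real addresses discussed in this article, where a real system would load it from the published Code. The names `CODE_INDEX` and `check_citations` are hypothetical, and the citation pattern is a simplification.

```python
import re

# Toy index, limited to the addresses discussed in this article.
# Assumption: a production index would be built from the official Code text.
CODE_INDEX = {
    "§199A(d)(2)": "Defines the specified service trade or business",
    "§199A(d)(3)": "Phase-in of the SSTB limitation",
    "§199A(e)(2)": "The threshold amount",
}

# Simplified citation pattern: section number plus any parenthesized levels.
CITATION = re.compile(r"§\d+[A-Z]?(?:\([0-9A-Za-z]+\))*")


def check_citations(draft: str) -> dict[str, list[str]]:
    """Split the citations found in a draft into grounded and unverified."""
    result: dict[str, list[str]] = {"grounded": [], "unverified": []}
    for cite in CITATION.findall(draft):
        bucket = "grounded" if cite in CODE_INDEX else "unverified"
        if cite not in result[bucket]:
            result[bucket].append(cite)
    return result


draft = "The SSTB test is in §199A(d)(2); see also §199A(d)(4)(C)."
print(check_citations(draft))
```

On this draft, §199A(d)(2) lands in the grounded bucket and §199A(d)(4)(C) lands in the unverified one. With an index this small, "unverified" only means "not in the index", so the useful design choice is what happens next: anything unverified is held back from the deliverable until someone opens the source.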

Four habits that catch it today

Until your workflow is grounded in retrieval, these four checks are what stand between a fabricated cite and a client deliverable.

1. Never accept a bare citation. Click through. A real section resolves to a real page on uscode.house.gov or the eCFR. A fabricated one resolves to nothing, or to a section whose text is plainly about something else. Ten seconds of verification beats an hour of walking back a memo.
2. Ask for the quoted statutory text, not a summary. A model can paraphrase a fake section into something that sounds right. It is far less able to produce verbatim statutory language that matches the real text, because that text does not exist for it to reproduce. When the quote and the cite disagree, trust neither.
3. Be most skeptical of the deepest subsections. Fabrication risk climbs with each additional level of nesting. A reference to §162 is usually safe. A reference to §162(a)(2)(B)(iv) deserves a hard look, because that is exactly the kind of deep, low-redundancy string the model is most tempted to invent.
4. Separate the drafting layer from the authority layer. Let the model write. Do not let it be the source of record for what the law says. The section numbers, the quoted text, and the effective dates should come from something that actually holds the law, with a URL you can open.
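The nesting-depth habit is the easiest of the four to turn into a rough triage step. The sketch below counts parenthesized levels and flags deep citations for review first. The threshold of three levels is an assumption chosen for illustration, not a measured cutoff, and the function names are hypothetical.

```python
import re

# One parenthesized level, e.g. "(a)", "(2)", "(iv)".
LEVEL = re.compile(r"\([0-9A-Za-z]+\)")


def nesting_depth(citation: str) -> int:
    """Count the parenthesized levels that follow the section number."""
    return len(LEVEL.findall(citation))


def review_priority(citation: str, deep_threshold: int = 3) -> str:
    """Label a citation for review order; deep_threshold is an assumed cutoff."""
    if nesting_depth(citation) >= deep_threshold:
        return "hard look"
    return "routine check"


for cite in ["§162", "§162(a)(2)(B)(iv)", "§199A(d)(2)", "§199A(d)(4)(C)"]:
    print(cite, nesting_depth(cite), review_priority(cite))
```

This only orders the work. A "routine check" citation still gets clicked through, since a shallow citation can be wrong too; depth just tells you where to look hardest.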

The lawyers sanctioned for filing briefs with invented case citations did not get there by being careless people. By late 2025 a public database of these incidents had logged hundreds of court filings built on authority that did not exist, most of them from the prior twelve months. The people behind them trusted a fluent paragraph the way you would trust a colleague, when the thing producing it was a pattern-completer with no access to the reporter.

Tax practice runs on the same trust and the same exposure. The answer is not to distrust the model. It is to stop asking it to be the one thing it cannot be, and to give it a real source to read from.