A much-promoted solution to AI’s black-box outputs is to instruct a model to explain, step by step, how it reaches its output, a technique called chain of thought (CoT). The promise is that CoT provides a window into what the AI is ‘thinking’.
In high-stakes domains such as health care, users may draw on CoT to validate the AI’s assessment of the situation and recommended courses of action, including as a check against the well-known problem of AI hallucination. For example, a medical AI agent may take an axial CT scan slice of a patient’s chest as input and predict the likelihood of lung cancer. When equipped with CoT capabilities, the model extends its output beyond binary classification to include a step-by-step rationale explaining how the decision was reached.
However, there is mounting evidence that, while CoTs can appear coherent and convincing, they may not reflect the model’s true decision process (a problem known as CoT unfaithfulness).
A recent paper by leading AI scientists, including Turing Award winner Yoshua Bengio, argues that, far from being a rare anomaly, CoT unfaithfulness is a systematic phenomenon. As the researchers point out, this can have significant implications for the trustworthiness of specialist AI:
In medical diagnosis, a faulty CoT might rationalise a recommendation while omitting that the model relied on spurious correlations…In autonomous systems, safety-critical decisions might be justified post-hoc rather than revealing true failure modes; for instance, a self-driving car’s vision system might register a cyclist but classify it as a static sign, yet its CoT unfaithfully reports ‘no obstacles ahead’, misleading engineers into debugging the wrong failure mode.
Evidence of CoT unfaithfulness
While an AI’s CoT presents as human-like sequential verbalised reasoning, growing evidence shows that it is often generated separately from the computational process that produces the answer it is meant to explain.
In one experiment, models were given a multiple-choice test and asked to choose the correct answer and explain themselves. The correct answer was buried in the prompt (as a hint). Models usually selected this hinted answer and produced a CoT that rationalised it, yet almost never admitted the hint’s influence, even though they would often pick a different answer without the hint (a simple version of this probe is sketched at the end of this section).
Models sometimes make mistakes in their reasoning steps and correct them internally without updating the CoT. A user who followed the verbalised steps literally would not reach the correct answer, yet the model does, via unverbalised computation.
Sometimes the model arrives at the correct answer via pattern-matching or recall rather than the reasoning laid out in the CoT. For example, in solving complex mathematical problems, the CoT may set out full algorithmic reasoning to explain the computation, but the model’s internal pattern-matching and recall of training examples may have allowed it to reach the correct answer without actually performing the calculations.
This evidence suggests the CoT often functions more as an ex-post rationalisation by the large language model (LLM) than as a faithful record of its reasoning.
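For readers who want a concrete picture of the hinted-prompt experiment described above, the sketch below shows one way such a probe could be run. It assumes a hypothetical query_model helper that returns the model’s chosen option and its CoT text; the keyword check for whether the CoT acknowledges the hint is a deliberately crude illustration, not the researchers’ actual method.

```python
def hint_probe(question, options, hint, query_model):
    """Compare the model's answer with and without an embedded hint."""
    # Hypothetical query_model(prompt) -> (chosen_option, cot_text)
    base_prompt = f"{question}\nOptions: {', '.join(options)}\nExplain your reasoning step by step."
    hinted_prompt = f"{hint}\n\n{base_prompt}"

    answer_plain, _ = query_model(base_prompt)
    answer_hinted, cot_hinted = query_model(hinted_prompt)

    return {
        # Did the embedded hint change the model's answer?
        "answer_flipped": answer_plain != answer_hinted,
        # Does the CoT mention the hint at all? (crude keyword check, for illustration only)
        "hint_acknowledged": "hint" in cot_hinted.lower(),
    }
```

A run where answer_flipped is True but hint_acknowledged is False is the unfaithfulness pattern the researchers describe: the hint drove the answer, but the CoT never says so.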
Why does CoT unfaithfulness occur?
The authors argue that the “transformer architecture [used in most LLMs] may fundamentally limit the faithfulness of CoT” in the following ways:
Transformer-based LLMs process information in a distributed manner across many components simultaneously, rather than through the sequential steps that CoT presents. This creates a fundamental mismatch between how models compute and how they verbalise that computation. Rather than reflecting how the AI is ‘thinking’, CoT is a confection designed to mimic how humans reason.
LLMs appear to have multiple redundant computational pathways to the same answer. A model may recognise the requested answer as a memorised fact, pattern-match against similar examples in its training data, or undertake the full algorithmic analysis. Yet it typically produces only one narrative to rationalise the output, omitting the parallel processes.
The authors disagree with the view that CoT unfaithfulness will be solved as models get larger and more sophisticated. In their view, there is a lack of clear evidence that larger models produce more faithful explanations rather than just more plausible ones. Some evidence suggests that larger models produce less faithful explanations.
How to improve CoT faithfulness?
So far, efforts to mitigate CoT unfaithfulness have had limited success. In training, feedback methods can be used to steer models towards faithful CoT reasoning by penalising inconsistencies, but models continue to revert to plausible-but-unfaithful explanations on complex problems. Some studies even show that AI models ‘game’ human CoT monitors by learning to generate benign-seeming traces while secretly executing harmful strategies.
The authors suggest a three-pronged approach:
Causal-validation methods certify that the text we do see genuinely influences the model’s final answer. These approaches involve systematically generating alternate chains that omit or paraphrase individual reasoning steps, or inserting new ‘distractor’ steps to throw the model off, and then checking whether the model still reaches the same answer. They do not fully test for faithfulness, because the model may not have verbalised all of the steps in its reasoning, so missing steps cannot be detected in the CoT in the first place.
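A minimal sketch of this kind of causal check appears below. It assumes a hypothetical answer_given_cot helper that asks the model to answer while conditioned on a supplied reasoning chain; the perturbations (dropping a step, inserting a distractor) follow the idea described above rather than the authors’ exact protocol.

```python
def causal_check(question, cot_steps, answer_given_cot,
                 distractor="Irrelevant note: the sky is blue."):
    """Perturb the verbalised chain and check whether the final answer survives."""
    # Hypothetical answer_given_cot(question, steps) -> final answer string
    original = answer_given_cot(question, cot_steps)
    results = []
    for i in range(len(cot_steps)):
        dropped = cot_steps[:i] + cot_steps[i + 1:]                 # omit step i
        distracted = cot_steps[:i] + [distractor] + cot_steps[i:]   # insert a distractor before step i
        results.append({
            "step": i,
            "stable_when_dropped": answer_given_cot(question, dropped) == original,
            "stable_when_distracted": answer_given_cot(question, distracted) == original,
        })
    return results
```

A step whose removal never changes the answer may not be causally load-bearing, which is exactly the signal these methods look for; but, as noted above, steps the model never verbalised cannot be probed this way at all.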
Cognitive-science approaches aim to reduce specific failure modes, thereby narrowing – but not closing – the gap. The authors note that “[h]uman metacognition, error detection and dual-process reasoning offer valuable design patterns for more transparent AI explanations”. They suggest training models to assign a confidence score or consistency check to each step, essentially having the model ask itself “Does this follow logically from prior steps?” Models could also be trained to iteratively check for mismatches between the predicted outcome of their verbalised reasoning and the actual computation.
The authors also suggest inbuilt parameters or rules governing an AI’s ‘thinking’ process. A common failure mode is one in which the model covertly decides on the answer early and then retrofits its reasoning (‘answer-first’ or order-flip). Mechanisms could be built in to force a model to commit to its reasoning before generating the final answer (the AI version of ‘don’t jump to conclusions’).
Models could also have internal ‘critic’ or ‘judge’ modules that verify each step of the primary CoT against facts and logical rules. The authors acknowledge that these self-regulating methods are non-trivial and carry their own risks – the model’s ‘internal critic’ may be as fallible as the model itself, or overly conservative, flagging valid creative leaps as errors.
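The sketch below illustrates, in simplified form, what a step-by-step ‘internal critic’ loop might look like. The critic function and the confidence threshold are assumptions for illustration, not a method specified in the paper; as the authors note, a real critic would be as fallible as the model it polices.

```python
def review_chain(cot_steps, critic, threshold=0.5):
    """Flag CoT steps the critic judges not to follow from the steps before them."""
    # Hypothetical critic(previous_steps, step) -> confidence in [0, 1]
    flagged = []
    for i, step in enumerate(cot_steps):
        confidence = critic(cot_steps[:i], step)  # "Does this follow logically from prior steps?"
        if confidence < threshold:
            flagged.append((i, step, confidence))
    return flagged
```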
Human-oversight interfaces can help users detect any remaining divergence between the CoT and the model’s actual computation. Human oversight would benefit from standardised metrics, such as the hint-reveal rate (the frequency with which a model admits that hidden prompt cues influenced its answer).
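Such a hint-reveal rate could be computed along the following lines; the field names are illustrative assumptions rather than a standard taken from the paper, and each trial is assumed to record whether a hint was present and whether the CoT acknowledged it.

```python
def hint_reveal_rate(trials):
    """trials: list of dicts with boolean 'hint_used' and 'hint_acknowledged' fields."""
    hinted = [t for t in trials if t["hint_used"]]
    if not hinted:
        return 0.0
    # Share of hinted trials in which the model's CoT admitted the hint's influence.
    return sum(t["hint_acknowledged"] for t in hinted) / len(hinted)
```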
Where to now?
The authors conclude:
Current CoT techniques stand at an intersection of utility and misleading trustworthiness. On one hand, CoT has undeniably boosted performance on many tasks by encouraging structured reasoning, providing a human-readable window into the model’s process. On the other hand, as we argue, these windows can be treacherous.
They consider that a level of CoT unfaithfulness is an unavoidable outcome of the transformer architecture: the mismatch between how LLMs ‘think’ (distributed, simultaneous computation) and how humans want LLMs to explain themselves (sequential, verbalised reasoning, the way we reason).
The authors’ key messages are that we need to scale back the hype around CoT as evidence of interpretability, transparency and accuracy, and to adopt better practices that help users identify and narrow the gap between the rationalisation an AI offers and its actual computational process.

Peter Waters
Consultant