3 Classifiers, 3 Answers: Why CoT Faithfulness Scores Are Meaningless

Source: DEV Community
LLM Chain-of-Thought (CoT), the mechanism by which models output their reasoning process as text before answering, has been treated as a window into model thinking. The question of whether CoT actually reflects internal reasoning (faithfulness) has attracted serious research, and numbers like "DeepSeek-R1 acknowledges hints 39% of the time" circulate as if they were objective measurements. But can you trust those numbers?

A March 2026 ArXiv paper (Young, 2026) demolished this assumption. Apply three different classifiers to the same data and faithfulness scores come out at 74.4%, 82.6%, and 69.7%: a 13-point spread that is statistically significant, with non-overlapping 95% confidence intervals.

The more shocking finding: model rankings flipped. Qwen3.5-27B ranked 1st with one classifier and 7th with another, best and near-worst from the same data.

CoT faithfulness was assumed measurable. It turns out the measurement method dominates the result.