DeepSeek‑R1 hallucinates at nearly 4x the rate of V3

DeepSeek‑R1 posted a 14.3% hallucination rate on Vectara’s HHEM 2.1 benchmark versus 3.9% for DeepSeek‑V3, a gap with direct implications for crypto AI agent tokens built on reasoning models.

Vectara’s HHEM 2.1 benchmark measured a 14.3% hallucination rate for DeepSeek‑R1 and 3.9% for DeepSeek‑V3. The lab reported the results after running both models through the same hallucination evaluation framework and cross‑checking the findings against a FACTS‑style grounding methodology.
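
As an illustration of what such a measurement involves, the sketch below computes a hallucination rate the way grounded‑summarization benchmarks generally do: score each (source, summary) pair for consistency, then report the share judged unsupported. The `judge_consistency` function is a hypothetical placeholder, not Vectara’s actual HHEM classifier.

```python
from typing import Callable, List, Tuple

def hallucination_rate(
    pairs: List[Tuple[str, str]],
    judge_consistency: Callable[[str, str], float],
    threshold: float = 0.5,
) -> float:
    """Fraction of (source, summary) pairs whose consistency score
    falls below the threshold, i.e. summaries judged unsupported."""
    if not pairs:
        return 0.0
    unsupported = sum(
        1
        for source, summary in pairs
        if judge_consistency(source, summary) < threshold
    )
    return unsupported / len(pairs)
```

At a 14.3% rate, roughly 143 of every 1,000 generated summaries fail the grounding check, compared with about 39 for V3.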

Vectara’s analysis found that R1 tends to add information not present in source texts, a behavior the company described as “overhelping.” Added details can be independently true, but they count as hallucinations when the original evidence does not support them. Vectara published a summary reporting the 14.3% and 3.9% rates.
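
A toy example makes the distinction concrete. The strings are hypothetical, and the lexical check is a deliberately crude stand‑in for the NLI‑style judges real benchmarks use; the point is only that a detail can be true in the world yet absent from the source.

```python
import re

source = "DeepSeek released R1, a reasoning-focused model, in January."

faithful = "DeepSeek released a reasoning-focused model, R1, in January."
overhelping = ("DeepSeek, a Hangzhou-based AI lab, released R1, "
               "a reasoning-focused model, in January.")

def tokens(text: str) -> set:
    """Lowercased word tokens, hyphenated compounds kept intact."""
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))

# Every token of the faithful summary appears in the source.
print(tokens(faithful) - tokens(source))     # set()

# The "overhelping" summary adds a true but unsourced detail,
# so a grounded evaluation counts it as a hallucination.
print(tokens(overhelping) - tokens(source))  # {'hangzhou-based', 'ai', 'lab'}
```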

DeepSeek is the developer; R1 is its reasoning‑focused model and V3 is its earlier non‑reasoning model. Vectara reported R1 produced more false or unsupported statements than V3 across multiple test configurations.

The gap between R1 and V3 follows a pattern seen in other reasoning‑trained models. Training techniques that improve step‑by‑step reasoning often rely on reinforcement signals that reward correct final answers and encourage chain‑of‑thought behavior. Those signals can produce bolder, more confident outputs, and testing of other labs’ models has shown the same trade‑off.
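
A toy reward function shows where the trade‑off can enter; this is a simplified illustration under assumed conditions, not DeepSeek’s actual training objective.

```python
def outcome_only_reward(final_answer: str, reference: str) -> float:
    """Rewards only final-answer correctness; intermediate claims in
    the chain of thought are never checked against sources."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def grounded_reward(final_answer: str, reference: str,
                    unsupported_claims: int, penalty: float = 0.2) -> float:
    """Hypothetical variant that also penalizes unsupported
    intermediate claims, which an outcome-only signal leaves unpriced."""
    return (outcome_only_reward(final_answer, reference)
            - penalty * unsupported_claims)
```

Under the outcome‑only signal, a trajectory that fabricates supporting facts but lands the right answer scores exactly as well as a fully grounded one.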

The results have practical implications for crypto AI agent tokens, a group of projects that wrap language models in tooling to post on social media, route trades, mint tokens, or execute on‑chain actions. The category includes Virtuals Protocol (VIRTUAL), ai16z (AI16Z), and aixbt (AIXBT). Market trackers show the category rose about 39.4% over a recent 30‑day window, and Virtuals has a market capitalization exceeding $576 million.

Agent workflows that use reasoning models for multi‑step planning carry a compounding risk: a single unsupported fact introduced early in a plan can propagate into later outputs and, from there, into on‑chain transactions. One analysis of AIXBT’s agent activity recorded 416 promoted tokens with an average return of 19% on promoted positions; the same mechanics that generate those calls can produce incorrect ones when the model invents unsupported information.
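
A minimal sketch of that propagation, with hypothetical step and variable names, shows how a plan that never re‑verifies earlier outputs carries one fabricated claim straight into the final instruction.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanState:
    facts: List[str] = field(default_factory=list)

def plan_step(state: PlanState, model_output: str) -> PlanState:
    # No verification: whatever the model asserted becomes ground
    # truth for every later step.
    state.facts.append(model_output)
    return state

state = PlanState()
state = plan_step(state, "Token X migrated to a new contract address")  # unsupported
state = plan_step(state, f"Route trade using: {state.facts[0]}")
print(state.facts[-1])
# Route trade using: Token X migrated to a new contract address
```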

Technical debate about causes continues. Meta’s chief AI scientist has argued that hallucinations arise from autoregressive model design and proposed an approach he calls “Objective Driven AI,” which plans answers by optimizing against objectives at inference time. Other developers point to retrieval augmentation, post‑training fine‑tuning, and verifier models as ways to reduce unsupported claims.
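
As one sketch of the retrieval‑augmentation idea, the helper below builds a prompt from retrieved passages and instructs the model to answer only from them. `retrieve` and `generate` are hypothetical placeholders for a retriever and a model call, not any specific vendor’s API.

```python
from typing import Callable, List

def answer_with_retrieval(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    k: int = 3,
) -> str:
    """Constrains the model to retrieved evidence instead of parametric
    memory, which reduces unsupported claims but does not eliminate them."""
    passages = retrieve(question, k)
    prompt = (
        "Answer using ONLY the passages below. If they do not contain "
        "the answer, say you cannot answer.\n\n"
        "Passages:\n" + "\n\n".join(passages) +
        f"\n\nQuestion: {question}"
    )
    return generate(prompt)
```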

Some teams building agents apply verification layers that check model assertions against trusted sources before executing actions. Others separate roles, using smaller models for financial execution and larger reasoning models for planning and commentary. One AI researcher, describing debugging sessions with R1 on social media, reported inconsistent thought traces and frequent unsupported assertions, consistent with the benchmark patterns Vectara reported.
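
A sketch of the verification‑layer pattern, with `verify_against_sources` and `execute_onchain` as hypothetical placeholders, gates every assertion from the planning model before the execution path is allowed to act.

```python
from typing import Callable, Dict, List

def gated_execute(
    assertions: List[str],
    action: Dict,
    verify_against_sources: Callable[[str], bool],
    execute_onchain: Callable[[Dict], str],
) -> str:
    """Refuses to execute an action whose plan contains any claim that
    cannot be verified against trusted sources."""
    unverified = [a for a in assertions if not verify_against_sources(a)]
    if unverified:
        raise ValueError(f"Blocked action; unverified claims: {unverified}")
    return execute_onchain(action)
```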

Vectara’s HHEM 2.1 numbers offer a metric for one iteration of these models: 14.3% for R1 versus 3.9% for V3. Future benchmark cycles and model updates will provide further data on how reasoning capability and factual accuracy compare over time.
