Microsoft: LLMs silently corrupt documents in long workflows
Microsoft researchers found that LLMs can delete or hallucinate content during long editing workflows; DELEGATE-25 tests on 19 models showed about 25% content loss in top models and about 50% on average across all systems.
Microsoft researchers Philippe Laban, Tobias Schnabel and Jennifer Neville reported in a preprint that large language models can silently corrupt documents during extended, multi-step editing workflows. The team built a benchmark called DELEGATE-25 to simulate long edits across 52 professional domains, including programming, finance and music notation.
DELEGATE-25 ran each model through five to ten complex editing tasks per domain and tracked changes to measure deleted or improperly altered content. The experiment evaluated 19 systems, including Gemini 3.1 Pro, Claude Opus 4.6 and GPT-5.4.
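To make the measurement concrete: one simple way to quantify "deleted or improperly altered content" is to diff a document before and after an edit and count how much of the original survives. The sketch below is an illustrative proxy in Python; the `content_retention` helper and its line-level diff are assumptions for illustration, not the paper's actual metric.

```python
import difflib

def content_retention(original: str, edited: str) -> float:
    """Fraction of the original document's lines still present after editing.

    A simple line-diff proxy; DELEGATE-25's actual metric may differ
    (this sketch is illustrative only).
    """
    orig_lines = original.splitlines()
    edit_lines = edited.splitlines()
    matcher = difflib.SequenceMatcher(a=orig_lines, b=edit_lines)
    # Count lines that survive unchanged, in order, in the edited version.
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / max(len(orig_lines), 1)

# Example: a model that silently dropped half the document.
before = "line 1\nline 2\nline 3\nline 4"
after = "line 1\nline 3"
print(f"retention: {content_retention(before, after):.0%}")  # retention: 50%
```

In practice the benchmark would presumably compare against the intended result of each edit rather than the raw original; the point here is only that retention can be computed mechanically from diffs and tracked across a long workflow.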
Across the frontier models tested, the researchers observed about 25% content loss on average; when all 19 models were included, average loss rose to roughly 50% in the study's scenarios. Performance varied by domain: models did best on programming tasks and made larger errors in natural language and in niche formats such as earnings statements and musical notation.
The authors defined a model as “delegable” in a domain if it reached 98% accuracy after 20 interactions. Only one domain — Python coding — met that threshold for many models. Gemini 3.1 Pro reached at least 98% accuracy in 11 of the 52 domains tested.
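Applied as a filter, the criterion is straightforward to operationalize. The sketch below checks which domains clear the 98% bar; the `delegable_domains` helper and its input layout (accuracy measured at interaction 20, keyed by domain) are hypothetical conveniences, not the paper's code.

```python
def delegable_domains(domain_accuracy: dict[str, float],
                      threshold: float = 0.98) -> list[str]:
    """Domains where a model meets the article's delegability bar:
    at least 98% accuracy after 20 interactions. `domain_accuracy` maps
    a domain name to the accuracy measured at interaction 20 (an assumed
    data layout for illustration).
    """
    return [d for d, acc in sorted(domain_accuracy.items()) if acc >= threshold]

# Toy example with made-up numbers:
scores = {"python": 0.991, "finance": 0.93, "music_notation": 0.71}
print(delegable_domains(scores))  # ['python']
```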
The team also ran experiments with agent-based workflows and longer documents. Adding agents did not reduce document degradation, and larger documents and workflows that required more interactions showed higher levels of corruption.
The researchers described the failure mode as infrequent but severe: models often worked correctly for several steps and then removed or hallucinated substantial amounts of content in a single interaction. These large, sparse failures compounded over multiple edits.
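The compounding arithmetic is easy to see: even a small per-step chance of a large deletion erodes a document quickly over many edits. The Monte Carlo sketch below uses made-up parameters (a 5% chance per step of losing 40% of the content) purely to illustrate the dynamic; none of these numbers come from the paper.

```python
import random

def simulate_retention(steps: int = 20, p_fail: float = 0.05,
                       loss_frac: float = 0.4, trials: int = 100_000,
                       seed: int = 0) -> float:
    """Monte Carlo sketch of the 'sparse but severe' failure mode:
    most edits are clean, but each step carries a small chance of
    deleting a large chunk. Parameters are illustrative guesses,
    not figures from the paper.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        retained = 1.0
        for _ in range(steps):
            if rng.random() < p_fail:       # rare corrupting edit
                retained *= (1.0 - loss_frac)
        total += retained
    return total / trials

# With a 5% chance per step of losing 40% of the document, expected
# retention after 20 steps is (1 - 0.05 * 0.4) ** 20, roughly 0.67.
print(f"mean retention after 20 steps: {simulate_retention():.3f}")
```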
The paper noted improvement across model generations: with GPT-4o and GPT-5.4 included in the benchmark, reported performance on the team's metrics rose from about 14.7% to 71.5% among the systems examined. The authors also cited other studies that measured the time workers spend correcting AI errors and documented flaws in AI-assisted code.
The authors called for longer, multi-step benchmarks to reveal failures that appear only after many edits. They wrote, “Delegation requires trust — the expectation that the LLM will faithfully execute the task without introducing errors into documents.”