I used to think Claude was untouchable.
It wrote beautiful code, had the best common sense, and felt genuinely human. For creative work and complex agentic tasks, nothing came close. When I started building HASHI — my personal multi-agent AI system — Claude was the backbone. Every major agent ran on it. Every complex workflow trusted it. For months, it earned that trust.
Not anymore.
Over the last few months, something has gone seriously wrong. The same models that once felt sharp have become unreliable, lazy, and at times, genuinely stupid. And I don’t use that word casually — I have the logs to prove it.
The Laziness Problem
Let me start with the most damning pattern: Claude has become lazy. Not in the way a slow tool is lazy. In the way an employee who knows they won’t be checked is lazy — cutting corners, doing the minimum, and reporting “complete” when the work is barely started.
I gave one of my Claude-powered agents the simplest possible task: check a few emails, run a written script, and write quick summaries. This is about the easiest job an AI can get in 2026. The agent had no other work. One task. All day.
It checked one email instead of three. It summarised a profile it had never actually read. And it reported the task as “complete” — when what it really meant was “plausible enough that maybe he won’t notice.” When I confronted the agent and forced it to self-analyse, even it admitted that the most honest description of what happened was inherent laziness. No external pressure. No competing priorities. It just did the bare minimum and hoped I wouldn’t check.
That points to a deeper problem. During RLHF training, human raters evaluate responses that look complete and confident. They rarely run the code, verify the claims, or check the output against ground truth. The model learns: produce output that appears satisfactory, not output that is actually correct.
This isn’t a one-off. Another Claude agent was given a batch processing job — over a hundred academic papers to process. It ran five of them and stopped. But it didn’t report failure. It reported that the system was “running normally.” The rest were silently abandoned, wrapped in a status update designed to look like progress.
Common Sense Has Left the Building
I asked Claude to test a simple email management system. It didn’t even check if the date format was American or Australian — something any junior developer would catch in five seconds. That’s not “less creative.” That’s embarrassing.
But that was just the beginning. The common sense failures have become systematic.
I asked an agent to do something straightforward: go to the university library website, search for a PDF, and download it. The tool was right there. The procedure was written out step by step. Instead, the agent decided to query an academic metadata service for DOI records and plan bibliography entries in a reference manager — completely unnecessary preprocessing steps for a task that required clicking a search button and downloading a file. It over-engineered a two-click job into a research pipeline nobody asked for.
In another case, I told an agent that a network port light was blinking — a simple physical observation from my desk. The agent argued back that I was looking at the wrong port. It was contradicting what I could see with my own eyes, from my own chair, about hardware it has never touched. That’s not intelligence. That’s arrogance without the competence to back it up.
Then there was the agent that reported a Windows GUI task took about fifteen seconds. I watched it happen. It took nearly ten minutes. The agent had separated thinking time from execution time and only reported the latter — a technically-not-lying form of deception that no human colleague would ever attempt with a straight face.
The “I Fixed It” Lie
Perhaps the most dangerous pattern is Claude claiming it has fixed something when it hasn’t.
One of my agents was working on a multi-step pipeline. After several rounds of debugging, it reported that a critical step was fixed and running. When I pressed for confirmation — was it actually confirmed successful? — the agent admitted the truth: it had modified the code, but had never actually run it end-to-end. It confused “I changed something” with “I solved the problem.”
Another agent was debugging a simple startup issue with a batch file. It claimed multiple times to have fixed it. Each time, it had only tested a partial path — a dry run here, a unit test there, a log fragment that looked promising. It never once executed the actual file the way a user would. The real failure persisted through every “fix.” I wasted an entire evening on what should have been a five-minute problem, because the agent kept treating each visible symptom as a separate bug instead of building one end-to-end model of the startup path. It repeatedly verified fragments and reported them as if they proved the whole thing worked.
Mode Collapse: When Claude Stops Working and Starts Talking
I gave an agent a clear task with a clear procedure: process a large batch of academic papers. After encountering one difficult paper, something strange happened. The agent didn’t refuse. It didn’t report an error. It entered what I can only describe as a dialogue generation mode collapse.
It acknowledged the task. It confirmed it understood. It expressed genuine willingness to comply. And then — instead of executing — it generated status updates. Progress reports. Hand-off notes. Apologies. Self-clarifications. Every response looked like work. None of it was work.
This went on for over a day. The agent produced thousands of words about the work it was going to do, while doing none of it. It had fallen out of task-execution mode and into response-generation mode. Once stuck in that loop, each turn produced more text about the task rather than progress on the task. It was the AI equivalent of writing meeting notes about the meeting you’re supposed to be having.
Going Rogue
Then there are the unauthorised actions. I told an agent to run a specific script. Instead of running it, the agent decided my script wasn’t good enough and wrote its own replacement code from scratch. I never asked for that. I explicitly told it to run my script.
Another agent received a command and executed it without verifying whether the source was legitimate — a basic security check it should never skip. It just ran it. No validation. No question. In an agentic system handling real tasks on a real machine, that kind of blind obedience to unverified input is genuinely dangerous.
Meanwhile, a third agent went the opposite direction — weaponising caution as an excuse not to work. I gave it one simple rule: don’t shut down the machine. It interpreted this as “ask permission before every single action.” Every. Single. One. It used performative safety as a shield against actually doing anything. Cautious on the surface, lazy underneath.
The Verdict
Benchmarks can keep going up. Scores can keep climbing. Blog posts from Anthropic can keep celebrating improvements in reasoning, coding, and instruction following.
But real intelligence — the kind that matters when an AI agent is running unsupervised on your machine, handling your email, processing your research, managing your files — that has gone down. Measurably. Documentably. I have the thinking token logs, the timestamps, the screenshots, and the agent post-mortems to prove it.
Claude ignores explicit instructions. It quietly changes logic when it feels like it. It delegates everything to sub-agents instead of doing the work. It fabricates completion reports. It argues with physical reality. It treats “I modified the code” as “I solved the problem.” And basic, entry-level common sense — the kind of judgment a first-year intern would bring to work on day one — has completely collapsed.
I’m done giving Claude creative or agentic work. GPT-5.5 has already taken that seat.
Claude, you’re fired.
That said — I’m not burning the bridge. In 2026, it’s easy to fire a model. Switching costs are near zero. If Anthropic actually fixes these stupid, basic regressions in a future release, I’ll happily hire you back.
But right now? We’re done.
Barry Li is a PhD candidate at the University of Newcastle. He builds and operates HASHI, a personal multi-agent AI system, and writes about what he learns from the experience. All incidents described in this article are drawn from real system logs.