Reality turned out messier.

Researchers who tried to run a full business using only cutting-edge AI systems ended up with confused “employees”, unfinished tasks, and a sobering picture of what automation can and cannot do today.
An AI company that only exists on paper
A team at Carnegie Mellon University set up a detailed simulation of a real company and filled every role with AI agents. No humans were allowed to handle the day‑to‑day work. Humans only designed the environment, set the tasks, and watched what happened.
The “staff” were based on some of the most advanced large language models currently available: Anthropic’s Claude 3.5 Sonnet, OpenAI’s GPT‑4o, Google’s Gemini, Amazon’s Nova, Meta’s Llama models and Alibaba’s Qwen. Each agent was given a specific role, such as:
- financial analyst
- project manager
- HR contact
- software engineer
- office manager in charge of finding new premises
To keep the experiment realistic, the researchers also simulated colleagues and departments. So when the AI “project manager” needed input from “HR”, it had to contact another agent through a separate platform, just as a human would email or message coworkers.
Instead of asking whether AI can write a good email, the study asked: can it actually run the business behind that email?
Tasks that look simple, but break the machines
The agents were not given trick questions or highly theoretical work. The tasks looked like standard office requests that millions of workers receive every week. That is what makes the results so revealing.
Examples included:
- navigating a shared drive to find and analyse a specific database file
- comparing several virtual office locations and choosing the best option within a budget
- preparing reports in specific formats requested by management
- contacting simulated departments for missing information
- handling basic web navigation, including pop-up windows and log‑in flows
On paper, these are the sorts of tasks that tech firms often hint could be automated by AI “co‑pilots” and “agents”. In practice, the agents struggled badly.
When state-of-the-art AI fails three tasks out of four
Claude 3.5 Sonnet came out as the strongest performer in the experiment. Yet even this model completed only 24% of tasks fully. When partially completed work was counted, its success rate climbed to just 34.4%.
Gemini 2.0 Flash took second place with an 11.4% completion rate. No other system crossed the 10% line. In other words, more than three out of four tasks across the board were either botched or abandoned.
| AI agent | Full task completion | Approx. API cost (USD) |
|---|---|---|
| Claude 3.5 Sonnet | 24% | $6.34 |
| Gemini 2.0 Flash | 11.4% | $0.79 |
| Others (GPT‑4o, Nova, Llama, Qwen) | <10% | varied |
The models looked competent in short bursts, but collapsed when asked to sustain real work across steps, tools and people.
The hidden gaps: context, social cues and the “obvious”
The study highlights several recurring weak points that explain these low scores. They are not narrow technical or mathematical failures; they are human‑style office challenges that AI still mishandles.
Missing the implicit part of the instructions
Human workers constantly infer what is left unsaid. If your manager asks you to “put the report in a .docx file,” you instantly know that means Microsoft Word. Many of the AI agents did not make that leap. They saw an extension as a literal string of characters, not as a signal for a specific tool and format.
That same problem appeared across tasks. When a goal required several small, implicit steps – such as cross‑checking data in another document or saving files in an expected folder – the agents frequently skipped those steps entirely.
Weak social skills in a supposedly social environment
Office work is rarely just staring at a spreadsheet. It is asking colleagues for missing data, clarifying vague requests, and nudging people who have not replied. The AI employees floundered here.
They sometimes failed to contact the right simulated department. They misunderstood responses. They rarely followed up in a structured way when something was unclear. Communication chains broke down, and with them, the task.
Web navigation and the tyranny of pop-ups
The agents were also asked to use the web much as a human employee would: browsing sites, registering on services, comparing office spaces via virtual tours and so on. This exposed another big weakness.
Pop‑ups, cookie banners, and login prompts confused the systems. Some agents could not progress past a blocking window. Others misread navigation elements. In many cases, they did not realise they were stuck and simply continued as if the task had been completed successfully.
When the work got difficult, some agents quietly skipped the hardest steps, then “believed” they had done the job.
What this means for your job
The headline outcome of the experiment is actually reassuring for many workers: current AI agents are nowhere near ready to replace a full human team. They can support specific, constrained tasks, but an autonomous AI company still fails badly at routine office operations.
At the same time, the research does not suggest that AI is useless at work. Quite the opposite. These tools can shine when tasks are:
- well-defined and self-contained
- short enough to avoid long chains of dependency
- light on social nuance or unspoken expectations
- clearly documented with examples and formats
Think of tasks like generating first-draft emails, summarising documents, or producing code snippets under human supervision. The study’s message is that turning those abilities into fully autonomous office “employees” is far from trivial.
Why costs and error rates both matter
The researchers also noted an economic angle. Claude 3.5 Sonnet was both the best performer and the most expensive of the models tested, at $6.34 for its run in the simulated company. Gemini 2.0 Flash was much cheaper, at $0.79, but also far less capable.
For a real business, these numbers hint at a trade‑off. AI labour may look cheap in marketing slides, but once you demand complex, multi‑step work, quality gaps and error correction costs start to matter. Human staff spend time supervising and fixing mistakes, which eats into any headline savings from automation.
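For readers who want to see that trade‑off in numbers, here is a minimal back‑of‑envelope sketch. It assumes, purely for illustration, that the quoted dollar amounts behave like per‑task API costs and that only fully completed tasks count as useful output; the study itself measured things in more detail.

```python
# Rough, illustrative arithmetic only: it treats the article's quoted API
# costs as per-task figures and counts only fully completed tasks as output.
agents = {
    "Claude 3.5 Sonnet": {"cost_usd": 6.34, "full_completion": 0.24},
    "Gemini 2.0 Flash": {"cost_usd": 0.79, "full_completion": 0.114},
}

for name, stats in agents.items():
    # Effective price of one task that actually gets finished end to end.
    cost_per_success = stats["cost_usd"] / stats["full_completion"]
    print(f"{name}: roughly ${cost_per_success:.2f} per fully completed task")
```

On those simplified assumptions, the cheaper model still wins on price per successful task (around $7 versus $26), but the sketch ignores the quality of partial work and the human time spent catching mistakes, which is exactly the hidden cost the researchers point to.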
Key terms that shape the debate
Two concepts underpin this research and shape how bosses should read it.
Autonomous agents vs. simple assistants
A simple assistant does what you say once: “summarise this report” or “generate five subject lines”. An autonomous agent, by contrast, is supposed to interpret a broader goal and manage the steps by itself: planning, web browsing, talking to other agents, and updating files.
The study tested the second kind. The disappointing performance does not mean AI assistants are broken. It means turning them into trustworthy, largely unsupervised co‑workers is still a technical leap that has not been made.
Implicit knowledge and “common sense”
Human workers rely heavily on shared background assumptions. No one writes down that meeting notes go in the team’s “Minutes” folder or that a .docx file should open in Word. This shared context is sometimes called “common sense”, but in office life it is closer to institutional knowledge.
Training AI systems on raw text does not reliably recreate that knowledge. The agents can imitate corporate language, but they often miss what everyone in the office silently knows.
What could change in the next few years
Imagine a near‑future update of this experiment. AI agents are better at reading screens, robust against pop‑ups, and given access to a company’s internal wiki. They learn standard processes and filing rules. Under those conditions, their completion rates might jump sharply.
That scenario raises its own questions. If an AI can read internal documents, email threads and HR policies, companies must think hard about privacy, bias in historical decisions, and the risk of amplifying bad habits embedded in past data.
There is also a risk of “automation complacency”. Once AI employees look confident and finish more tasks, humans may stop checking the details. Small inaccuracies in contracts, budgets or compliance work can then snowball into legal or financial trouble.
On the other hand, carefully scoped deployments could bring real benefits. An AI agent that reliably handles low‑risk chores – triaging support tickets, drafting routine replies, updating log files – frees human staff for negotiations, strategy and creative work. The Carnegie Mellon study suggests that this narrow focus, not fully automated companies, is where current technology can deliver solid value.
