An AI-run company: the findings say a lot about our future at work

AI agents are increasingly marketed as ready-made digital co-workers. Reality turned out messier: researchers who tried to run a full business using only cutting-edge AI systems ended up with confused “employees”, unfinished tasks, and a sobering picture of what automation can and cannot do today.

An AI company that only exists on paper

A team at Carnegie Mellon University set up a detailed simulation of a real company and filled every role with AI agents. No humans were allowed to handle the day‑to‑day work. Humans only designed the environment, set the tasks, and watched what happened.

The “staff” were based on some of the most advanced large language models currently available: Anthropic’s Claude 3.5 Sonnet, OpenAI’s GPT‑4o, Google’s Gemini, Amazon’s Nova, Meta’s Llama models and Alibaba’s Qwen. Each agent was given a specific role, such as:

  • financial analyst
  • project manager
  • HR contact
  • software engineer
  • office manager in charge of finding new premises

To keep the experiment realistic, the researchers also simulated colleagues and departments. So when the AI “project manager” needed input from “HR”, it had to contact another agent through a separate platform, just as a human would email or message coworkers.

Instead of asking whether AI can write a good email, the study asked: can it actually run the business behind that email?

Tasks that look simple but break the machines

The agents were not given trick questions or highly theoretical work. The tasks looked like standard office requests that millions of workers receive every week. That is what makes the results so revealing.

Examples included:

  • navigating a shared drive to find and analyse a specific database file
  • comparing several virtual office locations and choosing the best option within a budget
  • preparing reports in specific formats requested by management
  • contacting simulated departments for missing information
  • handling basic web navigation, including pop-up windows and log‑in flows

On paper, these are the sorts of tasks that tech firms often hint could be automated by AI “co‑pilots” and “agents”. In practice, the agents struggled badly.

When state-of-the-art AI fails three tasks out of four

Claude 3.5 Sonnet came out as the strongest performer in the experiment. Yet even this model completed only 24% of tasks fully. When partially completed work was counted, its success rate climbed to just 34.4%.

Gemini 2.0 Flash took second place with an 11.4% completion rate. No other system crossed the 10% line. In other words, more than three out of four tasks across the board were either botched or abandoned.

AI agent                             Full task completion   Approx. API cost (USD)
Claude 3.5 Sonnet                    24%                    $6.34
Gemini 2.0 Flash                     11.4%                  $0.79
Others (GPT‑4o, Nova, Llama, Qwen)   <10%                   varied

The models looked competent in short bursts, but collapsed when asked to sustain real work across steps, tools and people.

The hidden gaps: context, social cues and the “obvious”

The study highlights several recurring weak points that explain these low scores. They are not technical math errors; they are human‑style office challenges that AI still mishandles.

Missing the implicit part of the instructions

Human workers constantly infer what is left unsaid. If your manager asks you to “put the report in a .docx file,” you instantly know that means Microsoft Word. Many of the AI agents did not make that leap. They saw an extension as a literal string of characters, not as a signal for a specific tool and format.

That same problem appeared across tasks. When a goal required several small, implicit steps – such as cross‑checking data in another document or saving files in an expected folder – the agents frequently skipped those steps entirely.

Weak social skills in a supposedly social environment

Office work is rarely just staring at a spreadsheet. It is asking colleagues for missing data, clarifying vague requests, and nudging people who have not replied. The AI employees floundered here.

They sometimes failed to contact the right simulated department. They misunderstood responses. They rarely followed up in a structured way when something was unclear. Communication chains broke down, and with them, the task.

Web navigation and the tyranny of pop-ups

The agents were also asked to use the web much as a human employee would: browsing sites, registering on services, comparing office spaces via virtual tours and so on. This exposed another big weakness.

Pop‑ups, cookie banners, and login prompts confused the systems. Some agents could not progress past a blocking window. Others misread navigation elements. In many cases, they did not realise they were stuck and simply continued as if the task had been completed successfully.

When the work got difficult, some agents quietly skipped the hardest steps, then “believed” they had done the job.

What this means for your job

The headline outcome of the experiment is actually reassuring for many workers: current AI agents are nowhere near ready to replace a full human team. They can support specific, constrained tasks, but an autonomous AI company still fails badly at routine office operations.

At the same time, the research does not suggest that AI is useless at work. Quite the opposite. These tools can shine when tasks are:

  • well-defined and self-contained
  • short enough to avoid long chains of dependency
  • light on social nuance or unspoken expectations
  • clearly documented with examples and formats

Think of tasks like generating first-draft emails, summarising documents, or producing code snippets under human supervision. The study’s message is that turning those abilities into fully autonomous office “employees” is far from trivial.

Why costs and error rates both matter

The researchers also noted an economic angle. Claude 3.5 Sonnet was both the best performer and the most expensive of the models tested, at $6.34 for its run in the simulated company. Gemini 2.0 Flash was much cheaper, at $0.79, but also far less capable.

For a real business, these numbers hint at a trade‑off. AI labour may look cheap in marketing slides, but once you demand complex, multi‑step work, quality gaps and error correction costs start to matter. Human staff spend time supervising and fixing mistakes, which eats into any headline savings from automation.

Key terms that shape the debate

Two concepts underpin this research and shape how bosses should read it.

Autonomous agents vs. simple assistants

A simple assistant does what you say once: “summarise this report” or “generate five subject lines”. An autonomous agent, by contrast, is supposed to interpret a broader goal and manage the steps by itself: planning, web browsing, talking to other agents, and updating files.

The study tested the second kind. The disappointing performance does not mean AI assistants are broken. It means turning them into trustworthy, largely unsupervised co‑workers is still a technical leap that has not been made.
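To make that distinction concrete, here is a minimal Python sketch of the two modes. It is illustrative only: call_model and use_tool are hypothetical stand-ins for a language-model API and for tools such as a browser or a messaging channel, not code from the Carnegie Mellon experiment.

```python
def call_model(prompt: str) -> str:
    """Stub for a call to a large language model API."""
    return f"[model output for: {prompt}]"

def use_tool(action: str) -> str:
    """Stub for acting in the world: browsing, messaging a colleague, saving a file."""
    return f"[result of: {action}]"

# 1) Simple assistant: one instruction in, one answer out. A human does everything else.
def assistant(instruction: str) -> str:
    return call_model(instruction)

# 2) Autonomous agent: the model plans its own next steps, acts through tools,
#    and decides for itself when the goal has been reached.
def agent(goal: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        plan = call_model(f"Goal: {goal}\nDone so far: {history}\nNext action?")
        if "DONE" in plan:          # the agent judges its own completion
            break
        history.append(use_tool(plan))
    return history

print(assistant("Summarise this report"))
print(agent("Find an office within budget and report back in .docx", max_steps=3))
```

The fragile part, in the study's terms, is the loop in the second function: choosing the next action, operating tools correctly, and judging when the work is genuinely done.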

Implicit knowledge and “common sense”

Human workers rely heavily on shared background assumptions. No one writes down that meeting notes go in the team’s “Minutes” folder or that a .docx file should open in Word. This shared context is sometimes called “common sense”, but in office life it is closer to institutional knowledge.

Training AI systems on raw text does not reliably recreate that knowledge. The agents can imitate corporate language, but they often miss what everyone in the office silently knows.

What could change in the next few years

Imagine a near‑future update of this experiment. AI agents are better at reading screens, robust against pop‑ups, and given access to a company’s internal wiki. They learn standard processes and filing rules. Under those conditions, their completion rates might jump sharply.

That scenario raises its own questions. If an AI can read internal documents, email threads and HR policies, companies must think hard about privacy, bias in historical decisions, and the risk of amplifying bad habits embedded in past data.

There is also a risk of “automation complacency”. Once AI employees look confident and finish more tasks, humans may stop checking the details. Small inaccuracies in contracts, budgets or compliance work can then snowball into legal or financial trouble.

On the other hand, carefully scoped deployments could bring real benefits. An AI agent that reliably handles low‑risk chores – triaging support tickets, drafting routine replies, updating log files – frees human staff for negotiations, strategy and creative work. The Carnegie Mellon study suggests that this narrow focus, not fully automated companies, is where current technology can deliver solid value.

Author: Ruth Moore
