AI agents at work couldn't even find their colleagues in a chat
As talk of the "singularity" and AI-driven mass unemployment grows louder, a group of researchers from Carnegie Mellon University decided to test these fears in practice. They simulated a company called TheAgentCompany, in which every job, from financial analyst to project manager, was filled by autonomous AI agents from Google, OpenAI, Anthropic, Meta, and Amazon. The "staff" also included a virtual HR department and even a digital CTO.
Futurism reports on the study.
What are the results of the study?
The researchers recreated the typical daily routine of a software company: the agents had to browse file storage, take virtual tours of new office premises, and assess programmers' performance based on collected feedback. The results cited by Business Insider are telling: Anthropic's Claude 3.5 Sonnet was the most successful, but it completed only 24% of the tasks, taking an average of almost thirty steps and costing more than USD 6 per task.
Google's Gemini 2.0 Flash came second. It needed even more steps, about forty per task, and completed just over 10% of them. The worst performer was Amazon's Nova Pro v1: it completed less than 2% of the tasks, averaging twenty steps per task.
The authors of the study attribute the failures to the agents' lack of common sense, poor social skills, and poor understanding of how to navigate web resources. The report also mentions amusing episodes of self-deception: when one bot could not find the right colleague in the corporate chat, it simply renamed another user to the "correct" name and considered the problem solved.
Although they can handle individual small tasks, current agents are clearly unable to keep a complete picture of a complex process in view, accumulate experience, and apply it in new situations. So, even though tech giants promise rapid office automation, real-world experiments so far show the opposite: AI is unlikely to take over your desk in the near future.
As a reminder, an Australian radio station aired a programme hosted by an AI for months. The host's appearance was based on a real employee, and listeners were not told about the replacement either in the programme description or on air.
We also wrote that Duolingo, the developer of the popular language-learning app, announced its intention to switch to an AI-first model. This means replacing external contractors with AI.