A groundbreaking study has introduced the Remote Labour Index (RLI), a new benchmark designed to test whether artificial intelligence can truly replace human freelancers. The findings reveal that today’s most advanced AI agents can complete less than 3% of real-world remote work projects, with the highest-performing model achieving an automation rate of just 2.5%.
The Remote Labour Index: A Real-World Test
Developed by researchers from the Center for AI Safety and Scale AI, the RLI is a dataset of 240 end-to-end freelance projects sourced from real professionals across multiple industries. Unlike many AI tests that focus on narrow tasks, the RLI evaluates full projects drawn directly from online freelance platforms, each including a brief, input files, and a ‘gold-standard’ human deliverable.
The projects span 23 categories of remote work and represent more than 6,000 hours of human labour valued at over $140,000. The average project took nearly 29 hours to complete and cost about $633.
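To make the structure of the benchmark concrete, the sketch below models what one RLI-style project record might look like, with the fields the article describes (a brief, input files, a reference deliverable, plus hours and price), and derives aggregate figures from per-project values. The field names and sample entries are illustrative assumptions, not data from the actual RLI release.

```python
from dataclasses import dataclass, field

@dataclass
class Project:
    """Illustrative record for one RLI-style freelance project (field names assumed)."""
    category: str                 # e.g. "game development", "data visualisation"
    brief: str                    # client instructions given to the worker or agent
    input_files: list[str] = field(default_factory=list)
    gold_deliverable: str = ""    # the human freelancer's accepted deliverable
    hours: float = 0.0            # time the human freelancer took
    price_usd: float = 0.0        # what the client paid

# Toy dataset: three hypothetical projects, not actual RLI entries.
projects = [
    Project("data visualisation", "Build an interactive sales dashboard",
            ["sales.csv"], "dashboard.html", 12.0, 260.0),
    Project("video production", "Edit a 3-minute promotional video",
            ["raw_footage.mp4"], "promo_final.mp4", 35.0, 800.0),
    Project("architectural design", "Draft a floor plan from a client sketch",
            ["sketch.pdf"], "floorplan.dwg", 40.0, 900.0),
]

total_hours = sum(p.hours for p in projects)
total_value = sum(p.price_usd for p in projects)
print(f"{len(projects)} projects, {total_hours:.0f} hours, ${total_value:,.0f} total")
print(f"average: {total_hours / len(projects):.1f} h, ${total_value / len(projects):,.0f} per project")
```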
AI Performance: Far from Human-Level
Despite rapid advances in reasoning and knowledge benchmarks, frontier AI systems remain far from automating economically valuable remote work. The highest-performing model, Manus, achieved an automation rate of 2.5%, meaning it produced work comparable to human freelancers on only a handful of projects. Other leading systems, including GPT-5, Claude Sonnet 4.5, Grok 4, ChatGPT agent, and Gemini 2.5 Pro, scored between 0.8% and 2.1%.
In practical terms, this means more than 97% of the projects—ranging from 3D product rendering and architectural design to game development, data visualisation, and video production—were not completed at a level that would be accepted by a paying client.
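As a rough illustration of how an automation rate of this kind can be computed, the sketch below marks each project as accepted or rejected and reports both a project-count rate and a value-weighted rate. The article does not spell out which definition the RLI uses, so both are offered as plausible readings rather than the benchmark's actual formula, and the toy numbers are only meant to reproduce the 2.5% headline figure.

```python
def automation_rates(outcomes):
    """outcomes: list of (accepted: bool, price_usd: float) pairs, one per project.

    Returns (share of projects accepted, share of project value accepted).
    Which of these the RLI reports is an assumption here, not a claim
    about the benchmark's exact metric.
    """
    n = len(outcomes)
    total_value = sum(price for _, price in outcomes)
    accepted_n = sum(1 for accepted, _ in outcomes if accepted)
    accepted_value = sum(price for accepted, price in outcomes if accepted)
    return accepted_n / n, accepted_value / total_value

# Toy numbers: 6 acceptances out of 240 equally priced projects gives roughly 2.5%.
toy = [(i < 6, 633.0) for i in range(240)]
count_rate, value_rate = automation_rates(toy)
print(f"project-count rate: {count_rate:.1%}, value-weighted rate: {value_rate:.1%}")
```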
Common AI Failures
Researchers manually evaluated AI outputs against human work, using a holistic standard: would a reasonable client accept the AI’s submission as commissioned work? Inter-annotator agreement among evaluators exceeded 94%, suggesting strong reliability in the scoring process.
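For intuition on what an agreement figure like this measures, here is a minimal sketch that computes mean pairwise percent agreement on binary accept/reject labels. The study's exact agreement statistic is not specified in this summary, so the calculation, the annotator names, and the labels below are all assumptions for illustration.

```python
from itertools import combinations

def percent_agreement(labels_by_annotator):
    """labels_by_annotator: dict mapping annotator name -> list of accept/reject
    labels, all lists aligned to the same projects.
    Returns the mean pairwise fraction of matching labels."""
    pair_scores = []
    for (_, a), (_, b) in combinations(labels_by_annotator.items(), 2):
        matches = sum(x == y for x, y in zip(a, b))
        pair_scores.append(matches / len(a))
    return sum(pair_scores) / len(pair_scores)

# Hypothetical accept/reject labels for five projects from three annotators.
labels = {
    "annotator_1": [False, False, True, False, False],
    "annotator_2": [False, False, True, False, False],
    "annotator_3": [False, True, True, False, False],
}
print(f"mean pairwise agreement: {percent_agreement(labels):.0%}")
```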
The study found recurring weaknesses in AI-generated deliverables, including:
- Incomplete or truncated outputs
- Corrupted or unusable files
- Poor professional quality
- Inconsistencies across assets
For instance, some AI systems produced videos far shorter than requested, child-like graphics for design tasks, or floor plans that failed to match supplied sketches.
Steady Progress but a Long Way to Go
While AI performed better on certain creative and text-heavy tasks such as audio editing, report writing, and basic data visualisation, these represented a small slice of the broader remote work economy.
The study’s Elo scoring system, which measures models’ performance relative to one another, indicates steady progress: newer models consistently outperformed older ones. However, all AI systems fell well below the human baseline score of 1,000 on the benchmark.
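The standard Elo update, sketched below, is one common way such relative scores are maintained from head-to-head comparisons of outputs. The specific variant and constants the study uses, including how the human baseline of 1,000 is anchored, are assumptions here; the numbers in the example are purely hypothetical.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, result_a, k=32.0):
    """Update both ratings after one comparison. result_a is 1.0 if A's
    deliverable was judged better, 0.0 if B's was, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (result_a - e_a)
    r_b_new = r_b + k * ((1.0 - result_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Hypothetical comparison: a newer agent (rated 850) beats an older one (rated 800).
newer, older = elo_update(850.0, 800.0, 1.0)
print(f"newer agent: {newer:.0f}, older agent: {older:.0f}")
```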
Implications for the Future of Remote Work
The findings may temper fears of immediate large-scale displacement of freelance digital workers, while also underscoring the need to track AI progress using real-world economic metrics rather than theoretical performance alone. The researchers argue that the RLI provides a more economically grounded measure of AI capability than previous tests, offering policymakers and businesses a clearer picture of automation risks.
As one researcher noted, “Despite rapid progress on other AI benchmarks, current systems remain far from capable of autonomously handling the diverse and complex demands of the remote labor market.”