GPT-5.4 Can Use Your Computer Better Than You Can — And I Have Mixed Feelings
There's a benchmark called OSWorld that tests whether AI can use a computer the way a human does — navigate a desktop, click through browsers, manage files, fill out forms, complete multi-step tasks across applications. The human baseline is 72.4%.
GPT-5.4's "Thinking" variant just scored 75%.
That's not "AI is getting close to human level." That's "AI is measurably better than the average human at operating a computer." And it's not just OpenAI — Claude, Gemini, and a handful of others now offer agent modes that control your browser, navigate your files, and execute multi-step workflows autonomously.
I use these tools daily. I have complicated feelings about them.
What Agent Mode Is Actually Good At
Let's start with what works, because a lot of it genuinely works.
Repetitive multi-step tasks. The kind of thing where you'd normally open a spreadsheet, copy a value, switch to a web app, paste it into a search field, click three buttons, copy the result, go back to the spreadsheet. Ten minutes of your life you'll never get back. An AI agent does this in seconds, without getting bored, without making copy-paste errors, without accidentally closing the wrong tab.
Form filling and data entry. Give an agent a PDF and a web form and watch it map fields and fill things in. It's not perfect — it occasionally misreads a handwritten field or puts a zip code where a phone number should go — but for typed documents, the accuracy is surprisingly high.
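The field-mapping step is where most of that accuracy comes from. A minimal sketch of how an agent might reconcile extracted PDF labels with web-form field names, using simple normalization plus a synonym table — the function names and the synonym entries here are illustrative, not any particular product's logic:

```python
import re

def normalize(label: str) -> str:
    """Lowercase a field label and strip punctuation/whitespace."""
    return re.sub(r"[^a-z0-9]", "", label.lower())

# Hypothetical synonym table; a real agent learns these equivalences.
SYNONYMS = {
    "zipcode": "postalcode",
    "zip": "postalcode",
    "phonenumber": "phone",
    "tel": "phone",
}

def map_fields(pdf_fields: dict[str, str], form_fields: list[str]) -> dict[str, str]:
    """Match extracted PDF values to web-form field names by normalized label."""
    form_index = {}
    for name in form_fields:
        key = normalize(name)
        form_index[SYNONYMS.get(key, key)] = name
    mapping = {}
    for label, value in pdf_fields.items():
        key = normalize(label)
        key = SYNONYMS.get(key, key)
        if key in form_index:
            mapping[form_index[key]] = value
    return mapping

filled = map_fields(
    {"Zip Code": "94107", "Phone Number": "555-0100"},
    ["postal_code", "phone"],
)
```

The zip-code-in-the-phone-field failure mode is exactly what happens when this kind of mapping runs on labels it has never seen.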
Testing and QA. Point an agent at your web app and tell it to try every link, fill every form, and click every button. It won't catch subtle UX issues, but it'll find broken links, 500 errors, and form validation gaps faster than you'd do it manually.
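The crawl-and-probe part of that workflow doesn't even need a model. A standard-library sketch of the link-harvesting step, assuming you then issue a request to each collected URL and flag non-2xx responses (the request step is omitted here):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect every href/src an agent would need to probe for dead links."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def extract_links(html: str) -> list[str]:
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

links = extract_links('<a href="/pricing">Plans</a><img src="/logo.png">')
```

Where the agent earns its keep is the part this sketch skips: actually filling the forms with plausible values and noticing when validation silently fails.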
Research and data gathering. "Go to these five competitor websites, find their pricing pages, and put the plans in a table." Twenty minutes of tab-switching for a human. Sixty seconds for an agent.
What It's Bad At
The failures are instructive because they reveal where "using a computer" diverges from "understanding what you're trying to accomplish."
Anything requiring judgment about ambiguous success. An agent can fill out a form, but it can't tell if the form is the right approach. It can navigate to a page, but it can't assess whether the information there is trustworthy. It operates on clear instructions and clear interfaces. The moment either gets fuzzy, the failure rate spikes.
Context that isn't on screen. AI agents see what's visible. They don't know about the conversation you had yesterday that changed the requirements, the unwritten company norm about how expenses get categorized, or the fact that this particular client prefers emails to Slack messages. Humans carry a mountain of implicit context. Agents don't.
Recovery from unexpected states. A modal dialog, a CAPTCHA, a two-factor auth prompt, a page that loaded differently than expected — humans handle these interruptions reflexively. Agents often stall or click the wrong thing. They're optimized for the happy path through an interface.
Tasks where mistakes are expensive. An agent can draft an email to your client, but should it send it? It can navigate to your bank's website, but should it initiate a transfer? The gap between "capable of doing it" and "should be trusted to do it" is enormous for anything with real consequences.
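The standard way to bridge that gap is a human-in-the-loop gate: let the agent execute reversible actions freely, but block irreversible ones until a person signs off. A minimal sketch — the action names and the risk tier are assumptions, and which actions count as irreversible is a policy choice, not something the code can decide:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical risk tier; deciding what belongs here is the human's job.
IRREVERSIBLE = {"send_email", "transfer_funds", "delete_record"}

@dataclass
class Action:
    name: str
    params: dict

def execute(action: Action,
            run: Callable[[Action], str],
            confirm: Callable[[Action], bool]) -> str:
    """Run safe actions immediately; gate irreversible ones behind a human check."""
    if action.name in IRREVERSIBLE and not confirm(action):
        return "blocked: awaiting human approval"
    return run(action)

result = execute(
    Action("send_email", {"to": "client@example.com"}),
    run=lambda a: f"ran {a.name}",
    confirm=lambda a: False,  # no human signed off, so the send is blocked
)
```

Draft-but-don't-send is the same pattern with the confirmation moved into your email client.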
The Solo Operator Paradox
Here's what I keep thinking about: these agents are simultaneously the best productivity tool a solo developer has ever had AND a preview of the technology that makes some solo developer work obsolete.
If an AI can navigate a browser, fill out forms, gather data, and execute multi-step workflows — well, that's a meaningful chunk of what a lot of freelancers and contractors get paid to do. Virtual assistants, data entry specialists, manual QA testers, basic research services — the value proposition of these roles just got harder to defend.
But flip it: if you're the solo operator who USES these agents, you just got a massive leverage upgrade. One person with agent automation can handle the operational workload that used to require three or four people. You can process more data, serve more clients, test more thoroughly, and move faster — not by working harder, but by delegating the mechanical parts to software that's now genuinely capable of doing them.
The question is which side of that equation you're on.
What I'm Actually Doing With Agent Mode
Day to day, I use agent capabilities for two categories of work:
Automating my own operations. Data extraction, research gathering, testing workflows on this site, processing bulk content. Things I would have done manually or hired someone for on Fiverr. The time savings are real — hours per week, not minutes.
Building agent features into products. This is where it gets interesting for indie builders. If AI can navigate and interact with web interfaces, you can build products that automate workflows for other people. The tooling is mature enough now — MCP for integrations, agent frameworks for orchestration, computer use APIs for the interface layer — that a solo developer can build meaningful automation products.
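Underneath every framework in that stack is the same control flow: observe the interface, decide an action, execute it, repeat until done. A bare sketch of that loop with the model and the browser stubbed out — in a real product, decide would call a model API and act would drive a browser or computer-use API, neither of which is shown here:

```python
from typing import Callable

def run_agent(observe: Callable[[], str],
              decide: Callable[[str], str],
              act: Callable[[str], None],
              done: Callable[[str], bool],
              max_steps: int = 20) -> list[str]:
    """Generic observe -> decide -> act loop behind most computer-use agents.

    max_steps caps runaway loops when the agent never reaches a done state.
    """
    trace = []
    for _ in range(max_steps):
        state = observe()
        if done(state):
            break
        action = decide(state)
        trace.append(action)
        act(action)
    return trace

# Toy environment: a two-field form the "agent" fills one field per step.
form = {"name": "", "email": ""}

def observe() -> str:
    empties = [k for k, v in form.items() if not v]
    return empties[0] if empties else "complete"

trace = run_agent(
    observe=observe,
    decide=lambda field: f"type:{field}",
    act=lambda action: form.__setitem__(action.split(":")[1], "filled"),
    done=lambda state: state == "complete",
)
```

The max_steps cap is doing real work: it's the difference between an agent that stalls gracefully on an unexpected modal and one that clicks in circles forever.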
I'm not building agents that replace people's jobs. I'm building agents that handle the parts of people's jobs they hate — the data entry, the copy-pasting between systems, the repetitive clicks through admin panels. There's a difference, and the market for "eliminate your most tedious two hours per week" is enormous.
The Honest Take
GPT-5.4 scoring above human level on OSWorld is a milestone, but it's a narrow one. Using a computer and understanding work are different skills. The benchmark tests interface operation — clicking, typing, navigating — not the judgment, context, and creativity that make human work valuable.
What agents are replacing is the mechanical layer of computer use. The clicking-through-interfaces layer. And honestly? Good riddance. I never wanted to be good at filling out forms or switching between tabs. I wanted to think about problems and build solutions.
The mixed feelings come from watching a capability that's incredibly useful to me also being potentially disruptive to people doing the work I'm automating away. That tension isn't going to resolve cleanly. The best I can do is use the tools honestly, build things that genuinely help people, and be transparent about the trade-offs.
If you're a solo operator reading this: learn to use agent mode. Not because it's cool, but because the alternative is competing against people who do.