Control Chrome From Hermes to bushwhack APIs

I send a voice message to Telegram, and a bot on my server opens Chrome, clicks around a few websites, reads my email, and sends me back a summary. That's the whole thing in one sentence — the rest of this post is how and why.

Why I built this

The idea — a universal remote for the web

Every AI agent I tried hit the same wall. They work through APIs and MCP servers, which means they can only do what someone's already built a connector for. The moment you ask one to do something that has no API, or something slightly outside what the connector was designed for, you're stuck.

I ran into this constantly. Want the agent to check your Woolworths order? No API for that. Sign up for a newsletter using a specific email? No MCP server exists. Sort through your inbox and actually read what's in each email? You'd need a custom integration for every provider. Every new task meant either waiting for someone to ship a connector or building one yourself.

The fix was dead simple — just control Chrome. Every website already has a web interface, and if the agent can drive a real browser, click buttons, fill forms, read what's on the page, then there's basically nothing it can't do. No API to wait for, no connector to build. If you can do it in a browser, the agent can do it.

So I wired Telegram to a Playwright-controlled Chrome instance on my server. One message in, real browser work happens, results come back. The browser is the universal interface — it's already the thing you'd use to do all of this yourself, except now it works for you while you get on with your day.

Browser automation from chat

Playwright controls a real Chrome instance on the server. The agent navigates, clicks, types, and extracts data — all triggered by a Telegram message.

The browser lives on a server behind a virtual display. I fire off a message through Telegram, the agent spins up Playwright, and from there it's driving a live Chrome instance — not an API call, not some brittle scraper script, but an actual browser that renders JavaScript, handles logins, and works on any site you point it at.

So it'll navigate to Woolworths, log in, check the status of my order, and send back a screenshot of the delivery tracker. Or take something like "sign up for Harvey Norman's newsletter using my secondary email" — it opens the site, hunts down the signup form, fills in the details, and confirms the subscription. If a CAPTCHA pops up, it sends me a link to solve on my phone and picks up right where it left off once I'm done.

Ask it to "generate an image of Hogwarts characters in designer fashion" and it'll open an image-gen site, type the prompt, wait around for the result, grab the image, and send it back.

The important bit is that it can actually see the page. It screenshots, reads the DOM, finds buttons by their ref IDs, and clicks them — and when it gets well and truly stuck, it just tells me rather than flailing on.
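Here's a minimal sketch of that see-then-click step using Playwright's Python API. It isn't the agent's actual code: the function names are made up for illustration, and it finds the button by its accessible name with `get_by_role` as a stand-in for the ref-ID lookup described above. It assumes Playwright and Chrome are installed and that a display (real or virtual) is available.

```python
from urllib.parse import urlparse


def screenshot_name(url: str) -> str:
    """Derive a file name such as 'example.com.png' from a URL."""
    host = urlparse(url).hostname or "page"
    return f"{host}.png"


def click_and_capture(url: str, button_name: str) -> str:
    """Open url in Chrome, click a button by its visible name, save a screenshot."""
    # Imported lazily so screenshot_name() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # channel="chrome" asks for the installed Chrome rather than the
        # bundled Chromium; headless=False needs a display to draw on.
        browser = p.chromium.launch(channel="chrome", headless=False)
        page = browser.new_page()
        page.goto(url)
        # Role-based lookup: an assumption standing in for ref IDs.
        page.get_by_role("button", name=button_name).click()
        path = screenshot_name(url)
        page.screenshot(path=path, full_page=True)
        browser.close()
    return path
```

Calling `click_and_capture("https://example.com", "Sign in")` would leave `example.com.png` in the working directory, ready to send back through the chat.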

See it in action

The agent executing tasks

I send a message asking it to place my usual Woolworths order, and the agent takes it from there — it checks my order history, rebuilds the cart, and runs the checkout without me babysitting it.

The spinning gear bar up the top tells you which tool is doing the work at any given moment: Playwright navigating to Woolworths, building the cart, pushing through checkout. Start to finish, it's about 30 seconds.

Sorting through the noise

Morning digest: 100 emails triaged into action items, orders, bills, and newsletters. Read in 2 minutes instead of 20.

My inbox fills up faster than I can keep up with — personal messages, order confirmations, bills, and a ridiculous amount of promotional spam. I used to burn twenty minutes every morning just working out what actually mattered. Now I get a summary instead.

The agent reads every email and sorts the pile into buckets:

Action required — things that need a response or decision
Orders and deliveries — tracking numbers, shipping updates
Bills and statements — financial items that need filing
Newsletters — content I might want to read later

Each bucket comes with a short summary. If an order's got a tracking number, the number's right there. If a bill's due soon, it flags the amount and the deadline. If a newsletter's actually worth reading, I get the headline and a line of context to decide on.
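To make the bucketing concrete, here's a rough sketch that sorts emails with plain keyword rules. This is an assumption for illustration only: the post doesn't say how the agent decides, and the keywords below are hypothetical and deliberately crude.

```python
from collections import defaultdict

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("Orders and deliveries", ("tracking", "shipped", "delivery", "order")),
    ("Bills and statements", ("invoice", "statement", "amount due", "bill")),
    ("Newsletters", ("unsubscribe", "newsletter")),
]
# Anything unrecognised lands here so it gets seen rather than buried.
DEFAULT_BUCKET = "Action required"


def bucket_for(subject: str, body: str) -> str:
    """Return the first bucket whose keywords appear in the email text."""
    text = f"{subject} {body}".lower()
    for bucket, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return bucket
    return DEFAULT_BUCKET


def triage(emails: list) -> dict:
    """Group email subjects by bucket."""
    digest = defaultdict(list)
    for email in emails:
        bucket = bucket_for(email["subject"], email.get("body", ""))
        digest[bucket].append(email["subject"])
    return dict(digest)
```

Defaulting to "Action required" is a deliberate choice in this sketch: a misfiled newsletter costs a few seconds, while a missed request can cost a lot more.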

It all runs on a schedule — every morning at 8am it checks the inbox and sends through the digest, which I read over coffee.
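The scheduling itself can be as small as working out when the next 8am is. A minimal sketch, assuming a long-running process that sleeps until the next run; `next_run` is a hypothetical helper, not something from the actual setup.

```python
from datetime import datetime, time, timedelta


def next_run(now: datetime, at: time = time(8, 0)) -> datetime:
    """Return the next occurrence of `at` strictly after `now`."""
    # Reuse now's tzinfo so naive and timezone-aware datetimes both work.
    candidate = datetime.combine(now.date(), at, tzinfo=now.tzinfo)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

A cron entry of `0 8 * * *` would do the same job without keeping a process alive, which is probably the simpler option on a server.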

The workflow

The loop: ask → agent inspects → executes → returns results → you decide what happens next.

The loop itself is dead simple:

I send a message — voice or text
The agent figures out what to do
It does the work — opens browsers, reads emails, runs scripts
It sends me back the results
I decide what happens next

What makes it work is context. The agent remembers my preferences, knows my login details (which live in the config, never in the chat), and carries a stack of skills it's picked up from previous tasks. When I ask it to check my email, it doesn't have to rediscover how to connect to my provider — it already knows.

And every time it works through something new — wrestling a fiddly website, getting past a CAPTCHA, learning to sort a new flavour of email — it saves the approach as a skill. Next time the same thing comes up, it loads that skill and just gets on with it.
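One way to picture the skill memory is a small store that writes down the steps that worked and hands them back next time. This is a sketch under assumptions: the `SkillStore` class, the JSON file, and the task names are all hypothetical, and the real agent's skill format may look nothing like this.

```python
import json
from pathlib import Path


class SkillStore:
    """Saves the steps that worked for a task so they can be reused later."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier skills if the file exists, otherwise start empty.
        self.skills = json.loads(self.path.read_text()) if self.path.exists() else {}

    def save(self, task: str, steps: list) -> None:
        """Record the steps for a task and persist the whole store to disk."""
        self.skills[task] = steps
        self.path.write_text(json.dumps(self.skills, indent=2))

    def load(self, task: str):
        """Return the saved steps for a task, or None if it is new."""
        return self.skills.get(task)
```

The agent would call `load` before starting a task and fall back to working it out from scratch only when it gets `None`, then call `save` once something works.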

After a few months of this it genuinely knows my setup. It isn't any single feature that makes this worth running — it's the compounding.

The bits I had to wrestle with

It doesn't look like much from the outside. Send a message, get a result. Getting there was a fair bit of holding things together with duct tape.

APIs were basically a dead end. Most services either paywall their API or rate-limit you until you give up. The Twitter API was blocked entirely. My email provider has no REST endpoint; it's just a web app with no IMAP access. Browser automation wasn't even the clever choice, just the only one left. If you can't talk to a site through its API, you drive a real browser instead.

Getting a browser to actually automate was painful. Playwright's bundled Firefox kept crashing on my system; it turned out to be incompatible with my setup. I had to fall back to system Firefox with a compatibility patch, and even then navigation didn't work properly, so I needed a workaround just to load pages. I also set up a virtual display so I could watch the browser through VNC when things inevitably went wrong.

Deploying the blog was its own adventure. Next.js on a hosting platform, behind a reverse proxy worker, with a custom path prefix. Sounds fine until you discover the proxy deploys to one account but the DNS zone lives on another, so your changes "deploy" but traffic routes to the old version. Took multiple rounds to figure out the account mismatch. Then the hosting platform's edge cache creates separate entries for every combination of request headers your proxy sends, so redeploying doesn't clear stale cache — you hit a completely different cache key and get yesterday's version.
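The cache problem is easier to see in miniature. The sketch below is a toy model, not the hosting platform's real behaviour: the header name `x-proxy-variant` and the key format are assumptions made up for illustration.

```python
import hashlib


def cache_key(url: str, headers: dict, vary_on: tuple) -> str:
    """Build a cache key from the URL plus the headers the cache varies on."""
    parts = [url] + [f"{name}={headers.get(name, '')}" for name in sorted(vary_on)]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]


# Hypothetical header name; a real proxy may send several like it.
VARY_ON = ("x-proxy-variant",)
URL = "https://example.com/blog/post"

key_a = cache_key(URL, {"x-proxy-variant": "a"}, VARY_ON)
key_b = cache_key(URL, {"x-proxy-variant": "b"}, VARY_ON)

# Same URL, two entries: refreshing one leaves the other stale.
cache = {key_a: "yesterday's build", key_b: "yesterday's build"}
cache[key_a] = "today's build"
```

Any request that arrives with the second header value still gets yesterday's build, which matches the symptom: the deploy succeeds but the page doesn't change.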

MDX rendering broke half the site. The serializer silently strips inline style attributes from JSX, so everything you style gets dropped without a warning. Custom components imported in MDX files don't render at runtime because the serializer drops them too. Put a CSS style string on an iframe and it crashes the whole page, because React expects the style prop to be an object, not a string. And client-side hydration strips the path prefix from every image tag, so all local images break the moment React takes over.

Animated demo iframes look frozen when you scroll back to them. The JavaScript inside keeps running while the demo is off-screen: timers drift, the animation finishes its loop, and by the time you scroll back you're staring at a finished, motionless demo. Solved it with an observer in the parent page that sends a message to restart the animation when the demo scrolls back into view.

CAPTCHAs are the worst. The bot hits a signup form, a CAPTCHA appears, and it can't solve it. So it sends me a link to solve on my phone and waits. Every CAPTCHA is a human-in-the-loop pause. Not elegant, but it works.
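The pause-and-resume part can be sketched with an `asyncio.Event`: the task stops at the CAPTCHA, sends the link, and carries on once something marks it solved. The `HumanGate` class, the `signup` steps, and the five-minute timeout are hypothetical names and values for illustration, not the bot's real code.

```python
import asyncio


class HumanGate:
    """Blocks a task until a human signals that the CAPTCHA is solved."""

    def __init__(self):
        self._solved = asyncio.Event()

    def mark_solved(self) -> None:
        """Called when the human reports back, e.g. from a chat handler."""
        self._solved.set()

    async def wait(self, timeout: float) -> bool:
        """Return True once solved, or False if the timeout expires first."""
        try:
            await asyncio.wait_for(self._solved.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def signup(gate: HumanGate, notify, captcha_url: str) -> list:
    """Run a signup flow that pauses at the CAPTCHA step."""
    log = ["fill form"]
    notify(f"CAPTCHA needs solving: {captcha_url}")
    if not await gate.wait(timeout=300):
        log.append("gave up")
        return log
    log.append("submit form")
    return log
```

The timeout matters more than it looks: without it, a message I never see would leave a browser session hanging open indefinitely.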

What actually changed

The compound effect — chaos to serenity

None of the pieces here are new. Playwright's been around for years, Telegram bots are straightforward, and email APIs are about as well-documented as anything gets. What changed for me was the interface.

I used to open five apps to do five tasks; now I open one chat. I used to keep a mental map of which website hid which setting; now I just tell the agent what I want and let it go find it. The twenty-minute morning email sort is now two minutes of reading a summary.

And because every task it learns makes the next one a little easier, the skills keep stacking up and the whole workflow quietly tightens over time. I use it every single day. The boring stuff didn't go away — it just got a lot cheaper to deal with.