
My OpenClaw platform engineer


OpenClaw agents are very good at brute-forcing their way through tasks and being hyper-proactive. Give them a task, they'll smash their head against it, get stuck, and then somehow come back five minutes later with exactly what you wanted. It's like watching a dog figure out a door handle. Ugly, but it works.

This is great for the day-to-day stuff. Small tasks, medium tasks, general operational grunt work. Especially if you have harnessed them with the right skills. The agents handle it.

But for the big stuff. Major version upgrades, building new agents from scratch, wiring up complex skill chains, infrastructure that touches everything. The agents aren't the right tool. They don't hold enough context. They can't see the whole board. They do one thing well and then lose the plot when it cascades into three other things.

Maybe it was out of habit, maybe it was because I'm lazy, but I've been using Claude Code more than OpenClaw itself to manage and run my OpenClaw instance.

Not instead of the agents. Alongside them. Claude Code became the thing I'd open when I needed to think across the whole system — read the config, understand the state of all eleven agents, trace why a cron job was silently failing, work out which of the 43 ways my setup could break was actually breaking it this time.

Without going too deep into it, my OpenClaw setup has five business units with GMs for each unit and a total of eleven agents. 91 cron jobs. 7 Telegram groups with 40+ topic bindings.

One of the agents is a platform engineer called Cyclawps. Cyclawps runs routine health monitoring and is well scaffolded for the rinse-and-repeat tasks for new agents, skills or discovery. But when I need to do deep work or new development that requires thinking across the whole system, I open Claude Code.

Over four months of doing this, the notes, gotchas, and workflows I'd accumulated turned into something worth packaging. So I did. cyCLAWps. Forty-three gotchas, thirteen guided workflows, and everything I've learned from running a production instance since day one.

And the timing is interesting. As of today, Anthropic has essentially banned OpenClaw from Claude subscriptions. Your agents now need to run on pay-as-you-go or cheaper models. The big, meaty platform tasks (upgrades, debugging cascades, fleet-wide audits), the stuff that actually needs Opus-level thinking, are the exception: you can still do all of that through Claude Code. Cyclawps is a Claude Code skill. Your agents run on whatever model makes economic sense (GLM-5 and MiniMax M2.7 are genuinely strong for daily agent work; stay away from Gemini models), and you open /cyclawps when you need the heavy lifting.

Let me cut to the chase and show you three of many ways I actually use it.

1. Production upgrades across breaking changes

This morning I needed to upgrade from v2026.3.24 to v2026.4.2. Three intermediate releases. Breaking changes in each one.

I typed /cyclawps, pasted the release link, and said "what breaks if I upgrade?"

It fetched the release notes, mapped every breaking change against my specific config (11 agents, 91 cron jobs, 7 Telegram groups with 40+ topic bindings), and came back with a risk matrix. Three critical risks. One pre-existing config error that had to be fixed before we could even start.
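To make the mapping step concrete, here is a minimal sketch of how a risk matrix like that could be built. It assumes each breaking change can be reduced to the config key it touches. The `BreakingChange` shape, the dotted key names, and the severity labels are all hypothetical illustrations, not OpenClaw's config schema or how Cyclawps actually does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakingChange:
    release: str      # release that introduces the change
    config_key: str   # hypothetical dotted config key the change affects
    severity: str     # "critical", "high", or "low" (labels are assumptions)


def build_risk_matrix(changes, config_keys):
    """Keep only the breaking changes that touch keys present in this config,
    ordered so the most severe risks come first."""
    order = {"critical": 0, "high": 1, "low": 2}
    hits = [c for c in changes if c.config_key in config_keys]
    return sorted(hits, key=lambda c: (order[c.severity], c.release))
```

The useful part is the filter: a breaking change that touches a feature you don't use drops out, so the matrix only lists risks that apply to your instance.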

Then it built a plan. 17 tasks across 5 phases.

Phase 0 was pre-flight. Four subagents launched simultaneously. One removed the pre-existing config error and migrated 21 CLI commands to a new syntax. One cleaned stale entries from the security sandbox. One verified plugin compatibility by tracing every import path. One backed up everything. All four finished in 140 seconds because they ran in parallel.
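The reason parallel pre-flight is fast is that wall-clock time is bounded by the slowest task rather than the sum of all of them. A rough Python sketch of that fan-out pattern follows. The task names and callables are placeholders you would supply yourself; this is not how Claude Code launches subagents internally.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_phase(tasks):
    """Run independent tasks concurrently and collect their results by name.

    `tasks` maps a task name to a zero-argument callable. If any task raises,
    the exception is re-raised here, so a failed pre-flight step stops the run.
    Returns (results, elapsed_seconds).
    """
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        results = {name: fut.result() for name, fut in futures.items()}
    return results, time.monotonic() - started
```

This only works because the four pre-flight jobs don't depend on each other. Anything that needs judgment between steps, like the upgrade itself, stays sequential.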

Phase 1 was the actual upgrade. Cyclawps handled this directly: npm install, doctor validation, gateway restart. This phase needed judgment at each step, so no delegation. 30 seconds of downtime.

Phase 2 was smoke tests. Four subagents. Telegram routing, cron execution, WhatsApp health, subagent infrastructure. All parallel. All passed.

Phase 3 was deep verification. Three more subagents. Every Telegram bot binding, every background service, the lossless context plugin (814MB database, 952 conversations). All healthy.

Then Phase 4. This is the one that surprised me. Four fresh Opus-model agents, each with zero context from the build session, each auditing a different scope. Config integrity, documentation accuracy, TOOLS.md correctness, plan coverage gaps. They found 3 critical gaps that the in-session work missed. A stale version string in the main CLAUDE.md. An unstaged memory file. A documentation cross-reference that was wrong.
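The stale version string is the kind of gap that is easy to check mechanically once you know to look. A small sketch, assuming versions follow the `v2026.3.24` shape used in this post; the function name and regex are illustrative and not part of Cyclawps.

```python
import re

# Matches version strings shaped like v2026.3.24 (format assumed from this post).
VERSION_RE = re.compile(r"v\d{4}\.\d+\.\d+")


def find_stale_versions(text, current_version):
    """Return (line_number, version) pairs for every version string in `text`
    that differs from `current_version`."""
    stale = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in VERSION_RE.findall(line):
            if match != current_version:
                stale.append((lineno, match))
    return stale
```

It will also flag legitimate mentions of old versions, such as changelog entries, so treat the output as a list to review rather than a list to fix.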

15 subagents. 312 tool calls. 30 seconds of downtime. Zero rollbacks.

The QA audit pattern is the interesting bit. Fresh agents with no context bias catching things the build agents couldn't see. Same reason you get someone else to proofread your work. They don't have the "I just wrote this so it must be right" blind spot.

2. Cron reliability surgery

A week ago, I had 107 cron jobs. 46 of them were in an error state.

Some were obvious. Stale one-shot reminder jobs that should have been deleted months ago. Some were hitting timeout ceilings because the model was too slow for the task. But the biggest category, 25 jobs, shared one root cause that took me weeks to identify.

The symptom: "Outbound not configured for channel: telegram." The natural assumption: Telegram outbound is broken. I spent days chasing that hypothesis. Restarting gateways. Checking bot tokens. Reviewing channel configs.

Wrong. The actual root cause: certain skills were calling the message tool directly during execution, but isolated cron sessions have no Telegram channel registered. The cron delivery system works fine; it's the skills trying to bypass it that break.

The evidence was right there the whole time. Two jobs with identical delivery config. Same agent, same account ID, same target topic. One succeeded, one failed. The difference: the successful one wrote its output and let the cron system deliver it. The failing one tried to send a Telegram message itself.
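A toy model makes the difference between those two jobs easier to see. Everything here is a hypothetical stand-in: the `Session` class, the skill functions, and `run_cron_job` are invented for illustration and are not OpenClaw APIs. Only the error text is taken from the real symptom.

```python
class OutboundNotConfigured(RuntimeError):
    """Raised when a session has no registered channel to send through."""


class Session:
    """Toy session; `channels` is the set of registered outbound channels."""

    def __init__(self, channels=()):
        self.channels = set(channels)

    def send(self, channel, text):
        if channel not in self.channels:
            raise OutboundNotConfigured(
                f"Outbound not configured for channel: {channel}"
            )
        return f"[{channel}] {text}"


def skill_returns_output(session):
    # Well-behaved: produce the report and let the scheduler deliver it.
    return "daily report"


def skill_sends_directly(session):
    # Misbehaving: tries to message Telegram from inside the isolated session.
    session.send("telegram", "daily report")
    return ""


def run_cron_job(skill, deliver):
    """Run `skill` in an isolated session with no channels, then hand its
    output to `deliver`, which stands in for the cron delivery system."""
    session = Session()  # isolated cron session: nothing registered
    try:
        output = skill(session)
    except OutboundNotConfigured as exc:
        return ("error", str(exc))
    return ("ok", deliver(output))
```

Both skills get the same delivery path. Only the one that reaches for the channel itself ever sees the error, which is why identical delivery config told me nothing.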

Cyclawps helped me trace this by cross-referencing 107 jobs against their execution logs and categorising every failure by root cause. The remediation plan: 20 stale jobs removed, 4 timeout limits bumped, and a pre-batch outbound probe deployed at 00:45 AWST that tests all bot tokens before the morning cron rush starts.
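Categorising by root cause can start as simply as matching known error text to a label. A minimal sketch follows. The first rule uses the error message quoted above; the timeout needle, the cause labels, and the input shape (job id mapped to last error message) are assumptions for illustration.

```python
from collections import Counter

# Substring -> root-cause label. Only the first message is quoted from the
# article; the second needle and all labels are illustrative assumptions.
RULES = [
    ("Outbound not configured for channel: telegram", "skill_bypasses_cron_delivery"),
    ("timed out", "timeout_ceiling"),
]


def classify(error_message):
    """Return the first matching root-cause label, or 'uncategorised'."""
    lowered = error_message.lower()
    for needle, cause in RULES:
        if needle.lower() in lowered:
            return cause
    return "uncategorised"


def summarise(failures):
    """`failures` maps job id -> last error message. Returns cause -> count."""
    return Counter(classify(msg) for msg in failures.values())
```

The counts are what matter. One label covering 25 jobs is a single fix, not 25.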

The probe runs every morning now. If outbound is dead, it auto-restarts the gateway before any real jobs fire. If 3+ jobs fail in the same batch, the watchdog flags it as systemic rather than individual. Different diagnosis, different fix.
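The systemic-versus-individual rule is small enough to sketch directly. The threshold of 3 comes from the rule above; the function name, labels, and the input shape (job id mapped to pass or fail) are assumptions, not the watchdog's real interface.

```python
SYSTEMIC_THRESHOLD = 3  # "3+ jobs fail in the same batch"


def diagnose_batch(results):
    """`results` maps job id -> True (succeeded) or False (failed).

    Returns (diagnosis, failed_job_ids), where diagnosis is "healthy",
    "individual", or "systemic".
    """
    failed = sorted(job for job, ok in results.items() if not ok)
    if len(failed) >= SYSTEMIC_THRESHOLD:
        return "systemic", failed
    if failed:
        return "individual", failed
    return "healthy", failed
```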

From 46 errors to single digits.

3. Fleet-wide health audits

When you're running 11 agents, things drift. Slowly. Files end up in the wrong directory. Memory documents bloat past size limits. Agent knowledge goes stale. No single failure; just gradual entropy.

Cyclawps runs a daily health check. But the real value is the deep audits I do every few weeks through Claude Code.

Last month I ran a bootstrap audit across all 11 agents. Every .md file in an agent's workspace root gets loaded into context at session start. That's by design; it's how agents get their identity and operating knowledge. But a 117KB deep-research output had accidentally landed in Cyclawps's workspace root. It was being loaded into every single session, wasting tokens, degrading performance. No error. No warning. Just silently eating context budget on every conversation.

Same audit found Bobo (my advisory COO) had three research files totalling 224KB sitting in his root. Every time Bobo started a session, it was loading a quarter-megabyte of research output as if it were identity instructions. That's not a bug you notice until you wonder why Bobo's responses are getting slower and his context window keeps filling up.
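A bootstrap audit of this kind is mostly a size check on the workspace root. Here is a minimal sketch, assuming the rule stated above that only `.md` files directly in the root are loaded at session start. The 20,000-byte default is an arbitrary assumption, not an OpenClaw limit.

```python
from pathlib import Path


def audit_bootstrap(workspace_root, max_bytes=20_000):
    """List .md files directly in `workspace_root` that exceed `max_bytes`.

    Only the top level is scanned; subdirectories are ignored because (per the
    article) only root-level files are loaded into context at session start.
    Returns (filename, size_in_bytes) pairs, largest first.
    """
    root = Path(workspace_root)
    oversized = [
        (p.name, p.stat().st_size)
        for p in root.glob("*.md")
        if p.is_file() and p.stat().st_size > max_bytes
    ]
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

Run it once per agent workspace. Anything it reports is either identity material that needs trimming or research output that belongs in a subdirectory.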

Then the hippocampus audit. Every agent has a rolling 14-day context file (we call it HIPPOCAMPUS; you can download it at hippocampus.lovabo.com). The audit found 5 agents with domain violations: platform data leaking into advisory files, CRM details duplicated across agents, and one agent reporting a bug as active that had been fixed 10 days earlier. Three agents were echoing each other's operational data verbatim instead of synthesising it.
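Verbatim echoing is the easiest of those violations to detect automatically. A rough sketch: look for long lines that appear in more than one agent's context file. The function name, the whitespace normalisation, and the 40-character minimum are assumptions for illustration, not how the HIPPOCAMPUS audit is implemented.

```python
from collections import defaultdict


def find_echoed_lines(agent_notes, min_length=40):
    """`agent_notes` maps agent name -> context file text.

    Returns {line: [agents...]} for every line of at least `min_length`
    characters (after collapsing whitespace) that appears in two or more
    agents' files.
    """
    seen = defaultdict(set)
    for agent, text in agent_notes.items():
        for line in text.splitlines():
            normalised = " ".join(line.split())
            if len(normalised) >= min_length:
                seen[normalised].add(agent)
    return {line: sorted(agents) for line, agents in seen.items() if len(agents) > 1}
```

The length floor keeps headings and short boilerplate from showing up as false positives. Stale facts and domain leaks still need a reader with judgment.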

None of these are catastrophic. All of them compound. An agent with stale context makes worse decisions. An agent with bloated bootstrap wastes money on every interaction. An agent absorbing another agent's raw data instead of its own summary starts giving confused, contradictory answers.

The fleet audit pattern is: scan everything, categorise by severity, fix the criticals, schedule the rest. Cyclawps does the scanning. I make the judgment calls on what matters.

So how can you use it?

Just the skill. Clone the repo, run ./setup, type /cyclawps in Claude Code. This is how most people will use it. Talk to your platform engineer. It'll figure out what you need.

As a plugin. openclaw plugins install @sacheeperera/cyclawps. Registers health check tools and CLI commands inside OpenClaw. Any agent on your instance can call them programmatically.

As an agent. Register Cyclawps on your instance. It runs health checks on a schedule, alerts when something drifts, and shuts up when everything's fine.

Clone it, try /cyclawps, tell me what breaks.

Something will. And when it does, I'll add it to the gotchas file.

Sachee Perera runs a GTM advisory practice and an unreasonable number of AI agents from a Mac Mini in Perth. He writes at The Markdown.
