COODY AI TRAINING · 16 APRIL 2026

Building an AI coding toolbox
that actually works

A year of lessons from coding agents, context curation, and a personal memory bank — tailored for Coodyans and friends.

Magnus Gille · Magnus Gille Consulting
Embedded systems · Web · AI workflows

About me

Who’s talking

A bit about me

Magnus Gille

AI practitioner · Magnus Gille Consulting

Background: Systems architect and product owner at Ericsson, Scania, Adage

Focus: Practical AI. Making it work, not just talking about it

Day-to-day: I use AI tools intensively every day. This is experience from practice, not theory

Fun fact: Reigning AI Prompt Champion

Contact: [email protected] · gille.ai

Agenda

What we’re doing today

One hour, not a monologue.

~40 minutes of talking: what’s happened, how I work, what actually works
Interrupt whenever. This isn’t a stage lecture — it’s a dialogue

Timeline

From reasoning to long-horizon agents

18 months — and a 36× leap in AI time horizon

2024-09-12

OpenAI releases its first reasoning model, o1 (as o1-preview)

2024-11-25

Anthropic releases the Model Context Protocol (MCP)

2025-02-03

“Vibe Coding” coined by Karpathy

2025-02-24

Claude Code CLI ships as a research preview

2025-04-24

“30% of our code is written by AI” — Pichai

2025-05-18

“80% of my code is Codex” — McLaughlin, OpenAI

2025-10-03

44% of devs mainly use agents — Karpathy poll

2025-10-06

“Almost all new OpenAI code is Codex” — Altman

2025-12-26

“I’ve never felt so behind as a programmer.” — Karpathy

2026-01-12

“It was basically just Claude Code” — Cherny

[Chart: METR time horizon, task length at 50% success, on a log₂ time axis (gridlines at 1 h, 2 h, 4 h, 8 h). Plotted values: 20 min, 39 min, 1 h, 2 h, 3.4 h, 4.9 h, 5.9 h, 12 h. Doubles every ~4 months.]
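
The headline figures can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the chart's first and last values (20 minutes and 12 hours) and the 18 months from the slide title; the variable names are illustrative only.

```python
import math

# Figures read off the chart: 20 minutes at the start, 12 hours at the end.
start_minutes = 20
end_minutes = 12 * 60
months_elapsed = 18  # the "18 months" in the slide title

leap = end_minutes / start_minutes           # overall growth factor
doublings = math.log2(leap)                  # how many times the horizon doubled
months_per_doubling = months_elapsed / doublings

print(f"leap: {leap:.0f}x")
print(f"doublings: {doublings:.2f}")
print(f"months per doubling: {months_per_doubling:.1f}")
```

Under these assumptions the leap is 36× and the doubling time comes out at roughly three and a half months, in the same range as the "~4 months" on the chart.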

And my own kick-started journey

GitHub contribution graph showing acceleration late 2025

1 210 commits in the last year — almost all of them in the last three months.

Nuance

But does it actually make us faster?

METR, July 2025: −19 %

What the study found

Experienced OSS developers on their own codebases
They predicted: 24 % faster
Reality: 19 % slower
Large, mature projects → the agent has to read a lot of context

Why it’s not the end of the story

The study measured Feb–Jun 2025, the era of Claude 3.5 and 3.7 Sonnet
Q4 2025 tooling has completely different context mechanics
Stack Overflow 2025: positive sentiment toward AI tools fell from over 70 % to 60 %
The lesson: generic “use AI” ≠ value. Workflow is what decides.

We’re past the hype peak. That’s where the real craft starts — how you actually extract value in a real project.

Concept

Where the value actually gets created

Model ≠ Harness

The model

Claude Opus 4.6, GPT 5.4, Gemini 3.1
A frozen file of 100–1000 GB
Knows nothing about your project
Knows nothing about your filesystem, terminal, or test suite
Without a harness: a very knowledgeable buddy in a chat window

The harness

Claude Code, Codex, Cursor, OpenCode
Gives the model tools: read file, run command, search, edit
Manages context, memory, agent loops
Defines how MCP servers, skills, and subagents work
Ongoing debate on where the value sits — models, harness, or both

The industry still talks a lot about models — who has the highest MMLU score. The real difference between getting 2× or 20× lives in the harness and how you use it.

Surfaces

Where are you running it?

One model, five surfaces

Terminal (CLI) — claude, codex. Right inside your dev environment, sees git, files, processes. My primary surface.
Desktop app — Claude Desktop, Cursor. Good when you want to chat with documents or kick off jobs in parallel.
Cloud / web — claude.ai, chatgpt.com. Good for research, quick questions.
Mobile — Claude iOS, ChatGPT. I mostly use it to log into my own memory system on the go.
Pi / always-on — I also run an agent on a Raspberry Pi that does background work while I sleep.

The point: pick the surface by task, not by habit. Same model, completely different tool reach.

Paradigm shift

A mental shift

From craft to factory

Craft (where we came from)

One developer, one task, full attention
The code is a personal expression
“It takes the time it takes”
Quality lives in the craftsman’s head

Factory (where we’re heading)

The developer orchestrates multiple agents in parallel
Quality lives in the process: tests, CI, context, loops
Repetitive work becomes automation, not grind
Time freed up for architecture, design, decisions

This is not a value judgement — it’s a mode of production. Craft doesn’t disappear. It moves up the stack.

Embedded

The frustration is real

Why embedded has felt left behind

Training-data skew. Ratio of public React : STM32L4 DMA examples ≈ 1000 : 1. The model is on thin ice.
Hardware context. The LLM can’t see your wiring, your missing pull-up resistor, your actual logic analyzer.
Toolchain lock-in. CubeIDE, Keil µVision, IAR. Your web colleagues have Cursor and Claude Code.
NDA / IP / air-gap. Reference manuals can’t leave the network. ChatGPT is banned outright.
Real-time, safety, resources. AI-generated code misses race conditions and timing deadlines, and is almost always too heavy.

Embedded devs aren’t “behind”. They’ve been burned in areas where the tools really were bad. Six months ago, opting out was perfectly reasonable.

Embedded

What’s changed in the last six months

The conversation has shifted

Context engineering. A CLAUDE.md with target MCU, memory map, RTOS, and critical “never do this” rules produces completely different output.
Hardware-aware agents. Embedder (YC S25) ingests datasheets, refuses to generate code for registers not cited in the docs, runs air-gapped.
MCP servers for hardware loops. Serial Console MCP, probe-rs MCP — the agent flashes, reads RTT, tests on real hardware.
Chalmers / Software Center paper (Jan 2026): Swedish industry partners demanding MCP-compliant APIs on their tools.
Beningo’s insight: ~20 % of firmware is register manipulation. The rest is state machines, protocols, business logic — that’s what AI is good at.

You don’t need to change careers. You need to change tools, and write your first CLAUDE.md.

Practice

This is the heart of everything

Context → skills → subagents

CLAUDE.md / AGENTS.md — auto-loaded every session. 20–50 lines. Architecture, conventions, do’s and don’ts, where things live.
Skills: reusable playbooks. I have 19, including /commit, /close, /deploy, /debate-codex, and /review-pr. Write once, use a hundred times.
Subagents — model tiering per task. Haiku for grep, Sonnet for implementation, Opus for architecture. Protects root context when one task dumps 100k tokens of logs.
MCP servers — the agent gets access to your APIs: Fortnox, Microsoft 365, Google Workspace, Playwright, your own memory system.

Rule of thumb: anything you explain to the model twice should move into a CLAUDE.md or a skill.
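
The model-tiering idea from the subagents bullet can be sketched as a small lookup. This is an illustration only, not how any particular harness implements it: the task kinds, the `Task` class, and both helper functions are hypothetical, and the tier names simply echo the slide.

```python
from dataclasses import dataclass

# Hypothetical tier table mirroring the slide: a cheap model for search,
# a mid-tier model for implementation, the top tier for architecture.
TIER_BY_KIND = {
    "search": "haiku",
    "implement": "sonnet",
    "architecture": "opus",
}


@dataclass
class Task:
    kind: str
    prompt: str


def pick_model(task: Task, default: str = "sonnet") -> str:
    """Return the model tier for a task, falling back to the default tier."""
    return TIER_BY_KIND.get(task.kind, default)


def trim_for_root(raw_output: str, max_chars: int = 500) -> str:
    """Pass only the tail of a subagent's output back to the root context."""
    if len(raw_output) <= max_chars:
        return raw_output
    return raw_output[-max_chars:]


print(pick_model(Task("search", "find every caller of init_dma")))
print(len(trim_for_root("log line\n" * 20_000)))
```

The second helper is the part that protects the root context: a subagent may read 100k tokens of logs, but only a bounded slice travels back up.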

Architecture

Build for a dual audience

AI-ready architecture

Docs as a runnable surface — not a PDF collecting dust. A README, an ARCHITECTURE.md, a CLAUDE.md that the agent actually reads.
Self-describing APIs — clear names, good error messages, examples. For humans and for models alike.
AI agent surface — MCP or CLI, one or the other. If an internal tool has no safe way for an agent to reach it today, it’s behind. Pick the surface that fits — but pick one.
Bounded modules — small surfaces, clear contracts. Lets you hand off a slice of code to an agent without loading the whole project.
Testability — a fast test suite, clear “it works” signals. Without it, the agent has no feedback loop.

Interactive

A quick vote

Team CLI vs Team MCP

If you could only pick one abstraction to extend your agent’s abilities — which one do you take?

Team CLI

The command line

Full control. Scriptable. No abstraction to leak through. The agent learns the same tools you already know.

Team MCP

The protocol

Language-agnostic. Composable. Write once, works in Claude, Codex, ChatGPT, Cursor. Future-proof interface.

Principle

CLI first — even when GUI is the goal

Make everything scriptable

Example: Sagascript — a writing tool; the audience is a human using a GUI. I still build the CLI first, functionally complete.
Why? Debugging and testing are 5× faster. The agent can run the CLI in loops. No click robot needed.
Bonus: my own usage keeps migrating to the terminal. Just transcribed a call — I prompted Claude Code to transcribe the file using sagascript, and the result landed straight in the chat.
Consequence: all my internal tools are CLI-first. noxctl for Fortnox. sagascript for voice-to-text transcription. Same surface, same habits — agent or me, no difference.

A CLI that covers the whole domain is the cheapest MCP server you’ll ever build — the agent can already use it.
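
A minimal sketch of the CLI-first shape, assuming Python's standard argparse. The tool name `wordtool`, its `count` command, and the `--json` flag are hypothetical and are not taken from sagascript or noxctl.

```python
import argparse
import json


def word_count(text: str) -> dict:
    """Core logic lives in a plain function so the CLI, tests, and a GUI can share it."""
    return {"words": len(text.split()), "characters": len(text)}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="wordtool")  # hypothetical tool name
    sub = parser.add_subparsers(dest="command", required=True)
    count = sub.add_parser("count", help="count words in a text")
    count.add_argument("text")
    count.add_argument("--json", action="store_true", help="machine-readable output")
    return parser


def main(argv: list[str]) -> int:
    """Entry point that takes argv explicitly, so an agent or a test can call it in a loop."""
    args = build_parser().parse_args(argv)
    if args.command == "count":
        result = word_count(args.text)
        if args.json:
            print(json.dumps(result))
        else:
            print(f"{result['words']} words, {result['characters']} characters")
    return 0


# Human-readable and machine-readable output come from the same command.
main(["count", "make everything scriptable"])
main(["count", "make everything scriptable", "--json"])
```

In a real tool, `main(sys.argv[1:])` would be wired up as the script entry point. The design choice is that a GUI becomes a thin layer over `word_count`, while the agent and the test suite drive the same commands a human types.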

UI/UX

Agent-first, human-first, or both?

I actually prefer a clean terminal.

My 9-year-old, iterating on her game — in a terminal.

The game she’s building — top-down adventure, browser-based.

Claude Code in my terminal isn’t just a coding tool anymore. Email, calendar, files, transcription — and, since recently, noxctl for Fortnox. Same surface for writing code, reading mail, or booking an invoice.

When a 9-year-old prefers the terminal, something has shifted. The real question isn’t “CLI or GUI?” — it’s what interface works when half the user is an agent?

Technique

The cheapest regression engine you’ll ever own

Red/green TDD with the agent

1. Ask the agent to first write a failing test that describes what you want.
2. Run the test — verify it’s red.
3. Let the agent implement until the test goes green.
4. “Run the tests first” — four words that anchor every session to reality.
5. Bake it in. Put the literal line “Use red/green TDD” in your CLAUDE.md: one sentence, every session, no re-explaining.
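
The red/green steps above can be sketched in one file. The `slugify` function and its expected outputs are made-up examples, and a stub stands in for the state of the code before the agent has implemented anything.

```python
def slugify_stub(title: str) -> str:
    """State of the code at step 2: the function exists but is not implemented."""
    raise NotImplementedError


def slugify(title: str) -> str:
    """Step 3: the implementation, written only after the test existed."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return "-".join(cleaned.split())


def check_slugify(fn) -> bool:
    """Step 1: the test, written first. True means green, False means red."""
    try:
        return (
            fn("Hello, World!") == "hello-world"
            and fn("  Red / Green  ") == "red-green"
        )
    except NotImplementedError:
        return False


print("before implementation:", "green" if check_slugify(slugify_stub) else "red")
print("after implementation:", "green" if check_slugify(slugify) else "red")
```

Seeing red first is what proves the test can fail at all. A test that was green before any implementation existed tells you nothing.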

“A significant risk with coding agents is that they write code that doesn’t work, or build code that never gets used — or both.” — Simon Willison, Agentic Engineering Patterns

Principle

Measure everything you own

Hoard the things you build yourself

Every skill, tool, agent you write should produce data — not just perform a task.
Logs → analytics → evidence of what works → improvements.
Example: my /debate-codex skill logs which critiques Codex finds vs. what the model caught by itself.
After 23 debates, 294 critiques: single-model self-review misses 77 % of what cross-model debate catches — two models reading the same text see different things.

“‘Storage is manageable with retention policies’ is not a plan.” — Codex, from my debate logs, March 2026

If your tool can’t answer “is it getting better?” with a number, you don’t have a tool — you have a habit.
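
A sketch of what "produce data" can look like in practice: one JSON line per critique, and a number computed from the log. The record fields and the four sample records are invented for illustration; they are not the debate logs behind the 77 % figure.

```python
import json


def miss_rate(log_lines):
    """Share of logged critiques that single-model self-review did not catch."""
    records = [json.loads(line) for line in log_lines if line.strip()]
    if not records:
        return None  # no data yet, so no number to report
    missed = sum(1 for r in records if not r["caught_by_self_review"])
    return missed / len(records)


# Toy log in a hypothetical JSON-lines format, one critique per line.
sample_log = [
    '{"debate": 1, "critique": "no retention plan", "caught_by_self_review": false}',
    '{"debate": 1, "critique": "missing index", "caught_by_self_review": true}',
    '{"debate": 2, "critique": "unbounded retry loop", "caught_by_self_review": false}',
    '{"debate": 2, "critique": "secret in log output", "caught_by_self_review": false}',
]

print(f"self-review miss rate: {miss_rate(sample_log):.0%}")
```

Once the skill appends a line like this on every run, "is it getting better?" becomes a query over the log rather than a feeling.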

Thesis

Karpathy, autoresearch · 2026

The bigger thesis

AI can improve itself — as long as you have enough compute and a clear signal for “what’s good”.

Karpathy’s autoresearch: autonomous agents run ML experiments overnight, measure val_bpb, keep or discard changes.
Exactly 5 minutes training budget → directly comparable results.
The human writes a markdown file that steers the exploration — not Python directly.
Translates to your day job: hoard signals + agent loops + clear metric = self-improving system.

It’s also the thesis behind my whole toolbox: if I have a clear signal, the agent can do the work while I sleep.
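
The keep-or-discard loop can be shown with a toy. This is not Karpathy's autoresearch code: the `evaluate` function is a stand-in for a real training run and metric, `propose` is a stand-in for the agent suggesting a change, and all names are hypothetical.

```python
import random


def evaluate(params: dict) -> float:
    """Stand-in metric, lower is better, playing the role val_bpb plays on the slide."""
    return (params["lr"] - 0.3) ** 2 + (params["width"] - 0.7) ** 2


def propose(params: dict, rng: random.Random) -> dict:
    """Stand-in for the agent proposing a change: nudge one parameter."""
    key = rng.choice(sorted(params))
    changed = dict(params)
    changed[key] = params[key] + rng.uniform(-0.1, 0.1)
    return changed


def run_loop(start: dict, rounds: int, seed: int = 0):
    """Try one change per round and keep it only if the metric improved."""
    rng = random.Random(seed)
    best, best_score = start, evaluate(start)
    history = [best_score]
    for _ in range(rounds):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score < best_score:
            best, best_score = candidate, score
        history.append(best_score)
    return best, history


best, history = run_loop({"lr": 0.0, "width": 0.0}, rounds=200)
print(f"start {history[0]:.3f} -> end {history[-1]:.3f}")
```

The loop is only as good as the signal. Because every candidate is judged by the same `evaluate` call, results stay directly comparable, which is the role the fixed 5-minute budget plays on the slide.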

Scope

Same toolbox, wider problem set

It’s not just code

Regulation & compliance. EU AI Act, GDPR, DORA, CRA, NIS2. Feed the agent the regulation + your architecture → a gap analysis in an hour, not a week.
Risk & threat modelling. STRIDE walkthroughs, dependency / supply-chain review, CVE triage. Two models disagreeing on the same system beats one human guessing.
Documentation & knowledge. Architecture docs, runbooks, onboarding guides, handovers — generated from the code that actually runs, not from memory.
Research & evaluation. Vendor comparisons, framework shortlists, competitor landscapes. A well-prompted afternoon replaces a consultant-week of desk research.
Proposals, specs, client comms. RFP responses, tech specs, RFC drafts. Translate between engineer-speak and client-speak without losing precision.

A consultancy that uses AI only inside the IDE is leaving most of the value at the door. The same harness that edits code also reads 400-page regulations and writes the memo for the customer.

Frontier

One agent or many?

Multi-agent orchestration

This is a very new area. Nobody has landed the pattern yet — and new model releases keep re-opening the question.

Expert agents. One per domain — security, docs, tests, refactor. Cheap to write, hard to coordinate.
Agent swarms. Many peers on the same task, debating or voting. Catches blind spots (see Hoard slide: 77 %). Expensive; fan-out is real.
Coordinator + workers. A main agent plans, dispatches, integrates. Simple to reason about, bottlenecked by the planner’s context.
Scaffolding & harnesses. Hand-built pipelines — explicit graph of who does what. More control, less magic, more maintenance.
…or just wait. Every six months, a smarter single model solves in one shot what last quarter needed a swarm. Betting on orchestration can be betting against the model curve.

My own view is pragmatic: if it works, it works — don’t overdo it. Where I actually feel friction is data & trust: I want a coordination layer that routes tasks by sensitivity — frontier models for the public stuff, self-hosted for what can’t leave the building. That’s the orchestration problem worth solving.
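
Routing by sensitivity could start as small as the sketch below. The three sensitivity levels, the backend names, and the task dictionary shape are assumptions made for illustration; a real coordination layer would also need auditing and a trustworthy way to label tasks.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


# Hypothetical backend names, not real endpoints.
FRONTIER = "frontier-hosted"
SELF_HOSTED = "self-hosted"


def classify(task: dict) -> Sensitivity:
    """Read the task's label; missing or unknown labels fail closed to RESTRICTED."""
    label = task.get("sensitivity")
    try:
        return Sensitivity[label.upper()]
    except (AttributeError, KeyError):
        return Sensitivity.RESTRICTED


def route(task: dict) -> str:
    """Only explicitly public work goes to a frontier model; the rest stays in the building."""
    return FRONTIER if classify(task) is Sensitivity.PUBLIC else SELF_HOSTED


print(route({"prompt": "summarise this blog post", "sensitivity": "public"}))
print(route({"prompt": "review the customer contract"}))
```

Failing closed is the important choice here: a task nobody labelled is treated as something that cannot leave the network.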

Links

Take home

Resources

Simon Willison — Agentic Engineering Patterns
simonwillison.net/guides/agentic-engineering-patterns · read the whole thing, it’s worth it.

Karpathy — autoresearch
github.com/karpathy/autoresearch · the thesis on self-improving loops.

Beningo — Why Claude Code for Firmware Development Matters
beningo.com/why-claude-code-for-firmware-development-matters · the best embedded-specific piece right now.

Chalmers / Software Center — Agentic Pipelines in Embedded SW Engineering
arXiv 2601.10220 · Swedish industry partners, relevant to you.

METR, July 2025 — that −19 % study. Read it before you hype.

noxctl — my Fortnox CLI + MCP server: github.com/Magnus-Gille/noxctl.

All today’s material lands at coody.gille.ai (soon).

Thank you.

Questions, thoughts, things we didn’t get to — let’s hear them now.

Magnus Gille
[email protected]
linkedin.com/in/magnusgille
gille.ai
