What is Agentic Engineering? A Definitive Guide

The definition

Agentic engineering is the discipline of designing software development workflows where AI agents complete substantial tasks autonomously, from reading a ticket to creating a reviewed, CI-green pull request, without a human steering every step.

The word "engineering" is deliberate. It is not a creative trick. It is not a chat-based shortcut. It is a set of repeatable practices: structured context files, specialized agents, memory architecture, review pipelines, and orchestration patterns. You design it, you maintain it, and it compounds over time.

I started building this in 2026 at Kleinanzeigen, where I work as a staff engineer. It took eight sessions of infrastructure work before a session produced a real autonomous PR. After that, the return on investment shifted permanently. That journey is documented in The Spark, the post that started this whole thing.

How it differs from AI-assisted coding

AI-assisted coding means you write code and an AI helps. Tab-completion. Inline suggestions. The occasional "rewrite this function." You are at the keyboard. The AI accelerates your hands.

Agentic engineering inverts that model. The agent is at the keyboard. You set the goal, define the constraints, review the output, and correct the direction. The agent reads the codebase, plans the implementation, writes the code, runs the tests, handles the failures, and submits the PR.

This is not a style difference. It is a structural one. In AI-assisted coding, your output rate is bounded by how fast you can type and think. In agentic engineering, your output rate is bounded by how well you can specify, review, and orchestrate. That bound is different. It scales differently. It requires different skills.

I spent three weeks tab-completing through features with Copilot before I realized I was writing every line of logic myself. The AI was filling in syntax. I was doing all the thinking. The post From Beta Tester to Agentic Engineer captures that realization precisely.

The visible sign that you have made the switch: you stop interrupting the agent too early. You give it a task and let it run, not because you trust it blindly, but because you have designed a pipeline with enough structure and review stages that early intervention becomes unnecessary.

Versus prompt engineering

Prompt engineering is the craft of writing inputs that get better outputs from a language model. It matters. A poorly written specification produces poor agent output. But it is not the differentiator.

A colleague once asked me for my prompt. I shared it. Three sentences. They were disappointed. The prompt was never the skill. The post Your Prompt Is Not the Point explains why.

The skill is everything around the prompt: the CLAUDE.md context files that give the agent permanent knowledge of your codebase, the memory architecture that persists decisions across sessions, the review pipelines that catch errors before they reach production, the orchestration logic that routes tasks to the right specialist agent.

You can share a prompt. You cannot share judgment. Knowing when to stop the agent, when to push it further, when to accept an output that is 90% right, and when to reject an output that looks correct but violates an unstated constraint: those are the skills that compound.

The core practices

1. Specialized agents over generalists

A generalist agent can do everything. It can also do everything badly at the same time. When an agent loads instructions for every domain it might touch, the context window fills up before the actual work starts.

The solution is specialization. A Kotlin implementer that only knows Spring Boot conventions. A git agent that only knows commit patterns. A PR handler that only knows how to write a coherent description and assign reviewers.

I went from one generalist to 18 specialists in five days. The specialization was not about what they could do. It was about what they did not have to consider. See From 1 to 18: Building an Agent Army for the full story.
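
A minimal sketch of the routing idea, assuming a hypothetical roster built from the three examples above. The names, the `handles` task kinds, and the one-line instruction strings are illustrative and do not describe the real 18-agent setup. The point it shows is that each specialist carries only its own instructions, so the dispatcher hands a task to an agent with a small, focused context.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Specialist:
    name: str
    instructions: str        # the only domain context this agent loads
    handles: frozenset[str]  # task kinds it accepts (hypothetical labels)


# Hypothetical roster, modelled on the three examples in the text.
SPECIALISTS = [
    Specialist("kotlin-implementer", "Spring Boot conventions only.", frozenset({"implement"})),
    Specialist("git-agent", "Commit patterns only.", frozenset({"commit"})),
    Specialist("pr-handler", "PR descriptions and reviewer assignment only.", frozenset({"open_pr"})),
]


def route(task_kind: str) -> Specialist:
    """Return the first specialist that accepts this kind of task."""
    for specialist in SPECIALISTS:
        if task_kind in specialist.handles:
            return specialist
    raise LookupError(f"no specialist handles {task_kind!r}")
```

Calling `route("commit")` returns the git agent, whose context contains nothing about Kotlin or pull requests.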

2. CLAUDE.md context files

The most important file in any agentic codebase is CLAUDE.md. It is a persistent context file that the agent reads at the start of every session. It contains the things you would tell a new team member on their first day: naming conventions, testing patterns, which flags the build command needs, where the config files live, what decisions have already been made.

Without CLAUDE.md, every session starts cold. The agent re-discovers your stack from scratch. With it, the agent arrives already oriented.

A well-maintained CLAUDE.md is the highest-leverage file in your repository. An hour spent writing it saves hours of context-setting across every session that follows.
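
One way to get started is to scaffold the file from the first-day topics listed above. The sketch below is only an assumption about layout: the section titles are taken from this section's list, while the heading structure and the `TODO` placeholders are illustrative, not a required format.

```python
# Section titles mirror the "first day" topics named in the text.
SECTIONS = [
    "Naming conventions",
    "Testing patterns",
    "Build commands and required flags",
    "Where the config files live",
    "Decisions already made",
]


def render_context_skeleton(project: str) -> str:
    """Return an empty CLAUDE.md skeleton with one heading per topic."""
    lines = [f"# {project}", ""]
    for title in SECTIONS:
        # Each section starts as a placeholder to be filled in by hand.
        lines += [f"## {title}", "", "TODO", ""]
    return "\n".join(lines)
```

Writing `render_context_skeleton("my-service")` to `CLAUDE.md` gives you five empty sections to fill in before the first session.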

3. Three-tier memory architecture

Sessions end. Context resets. The knowledge the agent built up about your codebase, the decisions you made together, the architectural patterns that work: all of it is gone unless you design a system to preserve it.

The pattern I use has three tiers: a STATUS.md file that acts as working memory (overwritten at the start of each session), daily journals that act as an append-only archive, and distilled facts that capture patterns appearing across multiple sessions.

This system lets an agent restore full context in under 60 seconds. The full design is in Three-Tier Memory: How I Taught My AI to Remember.
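
The three tiers can be sketched with plain files. This is a simplified illustration under stated assumptions, not the design from the linked post: `STATUS.md` comes from the text, while the `journal/` directory, the `FACTS.md` file name, and the method names are hypothetical.

```python
from __future__ import annotations

from datetime import date
from pathlib import Path


class Memory:
    def __init__(self, root: Path):
        self.status = root / "STATUS.md"   # tier 1: working memory
        self.journals = root / "journal"   # tier 2: daily journals (assumed layout)
        self.facts = root / "FACTS.md"     # tier 3: distilled facts (assumed name)
        self.journals.mkdir(parents=True, exist_ok=True)

    def write_status(self, text: str) -> None:
        """Tier 1 is overwritten, never appended to."""
        self.status.write_text(text, encoding="utf-8")

    def append_journal(self, entry: str, day: date | None = None) -> Path:
        """Tier 2 is append-only, one file per day."""
        day = day or date.today()
        path = self.journals / f"{day.isoformat()}.md"
        with path.open("a", encoding="utf-8") as handle:
            handle.write(entry.rstrip() + "\n")
        return path

    def add_fact(self, fact: str) -> None:
        """Tier 3 stores each distilled fact once."""
        line = f"- {fact.strip()}"
        existing = self.facts.read_text(encoding="utf-8").splitlines() if self.facts.exists() else []
        if line not in existing:
            with self.facts.open("a", encoding="utf-8") as handle:
                handle.write(line + "\n")

    def restore_context(self) -> str:
        """Session start: load working memory and facts, not the whole archive."""
        parts = [p.read_text(encoding="utf-8") for p in (self.status, self.facts) if p.exists()]
        return "\n".join(parts)
```

The design choice worth noticing is in `restore_context`: the journal archive is deliberately left out of the startup load, so restoring context stays cheap no matter how long the history grows.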

Memory can also go wrong. An agent memory file that grew to 95KB and 2,133 lines is a real failure mode. The post The Memory Bloat Crisis shows what happens and how to fix it.
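
A cheap guard against that failure mode is a size check that runs before a session starts. The sketch below is an assumption, not the fix described in the linked post, and the default limits are arbitrary placeholders you would tune for your own files.

```python
from __future__ import annotations

from pathlib import Path


def memory_budget_report(path: Path, max_bytes: int = 20_000, max_lines: int = 400) -> list[str]:
    """Return a list of budget violations for one memory file (empty if fine)."""
    data = path.read_bytes()
    line_count = len(data.splitlines())
    problems = []
    if len(data) > max_bytes:
        problems.append(f"{path.name}: {len(data)} bytes exceeds {max_bytes}")
    if line_count > max_lines:
        problems.append(f"{path.name}: {line_count} lines exceeds {max_lines}")
    return problems
```

An empty list means the file is within budget. Anything else is a prompt to distill or archive before the next session loads it.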

4. Skills over agents (the current frontier)

When Claude Code shipped skills with forked execution contexts, the 18-agent architecture I had built became partially obsolete. Skills can do most of what custom agents did, with less overhead and more composability.

I had to confront this. The post Skills Ate My Agents (And I'm Okay With That) is the honest technical reckoning. The conclusion: agents are not dead, they are demoted. Skills are the what. Agents are the how.

5. Review pipelines

Autonomous agents need review gates. Not because they are unreliable, but because the stakes are real. Production code, infrastructure changes, security configurations: these require human judgment at the final step.

The review pipeline I run includes: a code reviewer agent that checks changes against acceptance criteria, an AI code review step in GitHub Actions that runs on every PR, and a human final approval gate before any merge.
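
The three gates above can be modelled as functions that run in order and stop at the first rejection. This is a hedged sketch: the gate bodies are stand-ins (a real reviewer agent or GitHub Actions step would replace them), and all names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    gate: str
    approved: bool
    notes: str = ""


Gate = Callable[[str], GateResult]


def agent_review(diff: str) -> GateResult:
    # Stand-in for a reviewer agent checking acceptance criteria.
    ok = bool(diff.strip())
    return GateResult("agent-review", ok, "" if ok else "empty diff")


def ci_review(diff: str) -> GateResult:
    # Stand-in for an automated review step in CI.
    ok = "PASSWORD=" not in diff
    return GateResult("ci-review", ok, "" if ok else "possible hard-coded secret")


def human_approval(approved: bool) -> Gate:
    # The human decision is passed in; nothing here can approve on its own.
    def gate(diff: str) -> GateResult:
        return GateResult("human-approval", approved)
    return gate


def run_review_pipeline(diff: str, gates: list[Gate]) -> list[GateResult]:
    """Run gates in order and stop at the first rejection."""
    results = []
    for gate in gates:
        result = gate(diff)
        results.append(result)
        if not result.approved:
            break
    return results


def can_merge(results: list[GateResult], gates: list[Gate]) -> bool:
    """Mergeable only if every gate ran and every gate approved."""
    return len(results) == len(gates) and all(r.approved for r in results)
```

Because `can_merge` requires a result from every gate, a change that is rejected early never counts as approved just because the later gates were skipped.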

The GitHub Actions reviewer is particularly interesting because it reviewed its own deployment PR automatically. It found a real security vulnerability. The full story: The Reviewer That Reviewed Itself.

6. Orchestration and autonomous pipelines

Once you have specialized agents, memory, and review gates, you can wire them into pipelines. My current system handles the full ticket lifecycle: Jira ticket analysis, implementation planning, code changes, test execution, code review, PR creation, CI monitoring, and Slack notification.

The triggering mechanism: one Slack message. The output: a CI-green PR, ready for human review, 21 minutes later. See "One Slack Message. Two Hours of Work." for that session.

The first time that pipeline ran on a real production ticket, it took nine agents in sequence, under 20 minutes. That first proof point is in My First Autonomous Ticket.
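
As an illustration of the sequencing only, the lifecycle stages listed above can be run as an ordered list that halts on the first failure. The stage names follow the text; the `TicketRun` record, the stage signature, and the `DEMO-1` ticket id in the usage note are assumptions, and each real stage would call an agent or an external system.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Order follows the ticket lifecycle described in the text.
STAGE_NAMES = [
    "ticket_analysis",
    "implementation_planning",
    "code_changes",
    "test_execution",
    "code_review",
    "pr_creation",
    "ci_monitoring",
    "slack_notification",
]


@dataclass
class TicketRun:
    ticket_id: str
    completed: list[str] = field(default_factory=list)
    failed_at: str | None = None


# A stage receives the run record and returns True on success.
Stage = Callable[[TicketRun], bool]


def run_pipeline(ticket_id: str, stages: dict[str, Stage]) -> TicketRun:
    """Run every stage in order and record where the run stopped."""
    run = TicketRun(ticket_id)
    for name in STAGE_NAMES:
        if not stages[name](run):
            run.failed_at = name
            break
        run.completed.append(name)
    return run
```

With every stage stubbed as `lambda run: True`, `run_pipeline("DEMO-1", stages)` completes all eight stages. Make `test_execution` return `False` and the run stops there, with `failed_at` telling you which stage to inspect.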

7. Sandboxing and hard walls

Every boundary in a CLAUDE.md is a suggestion. Soft constraints. The agent will follow them, until a model update changes something subtle about instruction-following, or a context anomaly pushes behavior in an unexpected direction.

Hard walls are different. Docker-based sandboxing restricts filesystem access at the OS level. No instruction override can bypass it. The post From Soft Trust to Hard Walls explains why soft constraints are not enough and what the alternative looks like.
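
To make the difference concrete, here is a sketch that builds a `docker run` command in which the container's root filesystem is read-only and only the project directory is writable. It is an assumption about one possible setup, not the configuration from the linked post; the image name and agent command in the usage note are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def sandbox_command(workspace: str, image: str, agent_cmd: list[str]) -> list[str]:
    """Build a docker invocation that confines writes to the workspace."""
    mount = f"{Path(workspace).resolve()}:/workspace"
    return [
        "docker", "run", "--rm",
        "--read-only",        # container root filesystem cannot be modified
        "--tmpfs", "/tmp",    # scratch space that vanishes with the container
        "-v", mount,          # the only host directory the agent can see
        "-w", "/workspace",
        image,
        *agent_cmd,
    ]
```

Passing the result of `sandbox_command(".", "agent-sandbox:latest", ["run-agent"])` to `subprocess.run` would start the agent inside the wall. Nothing the agent reads in its context can widen that mount, because the restriction is enforced outside the model.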

Security matters more than most teams realize. When I gave Claude SSH access to a security lab and let it run an attack chain, it cleared three missions before Anthropic terminated the session. The lessons from that experiment are in I Let Claude Hack My Security Training, and they changed how I think about every MCP-backed agent I build.

8. Token and context cost management

Context quality degrades as context length grows. Every unnecessary token loaded at startup is a tax on everything that follows. This is not a theoretical concern: research on long-context behavior reports that context length alone can hurt LLM performance, even when the relevant information is present.

I killed my Atlassian MCP server after discovering it was burning 22,000 tokens per session before I typed a single prompt. Seven shell scripts replaced it with zero startup cost. The analysis: The 22,000 Token Tax: Why I Killed My MCP Server.
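
You can get a rough picture of your own startup tax before measuring it properly. The sketch below assumes about four characters per token, which is a crude rule of thumb for English text and not a real tokenizer, so treat the numbers as order-of-magnitude estimates. The source names in the usage note are hypothetical.

```python
from __future__ import annotations

CHARS_PER_TOKEN = 4  # rough heuristic, NOT a real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate token count by character length, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def startup_cost(sources: dict[str, str]) -> dict[str, int]:
    """Estimate the tokens each startup source loads into context."""
    return {name: estimate_tokens(body) for name, body in sources.items()}


def over_budget(costs: dict[str, int], budget: int) -> bool:
    """True if everything loaded at startup exceeds the token budget."""
    return sum(costs.values()) > budget
```

Feeding it something like `{"CLAUDE.md": claude_md_text, "tool schemas": schema_text}` shows which source dominates, which is usually enough to decide what to cut or replace first.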

What changes for your team

The first change is what developers do. Less hands-on-keyboard, more specification writing, review, and direction-setting. Senior engineers who are good at clear thinking become more valuable, not less. The engineers who struggle are the ones who were producing output by grinding, not by judgment.

The second change is the nature of progress. You spend several weeks building infrastructure before you ship a single autonomous ticket. The overhead is real. That overhead is investment, not waste. Most teams give up before they reach the threshold where the investment pays off.

The third change is velocity. After the threshold, productivity does not increase by 10 or 20 percent. It changes category. My repo merged 110 pull requests in one week. I wrote almost none of the code. The post Stop Micromanaging Your Agents explains what that looks like in practice.

The fourth change is culture. Agentic engineering requires teams to think carefully about context sharing, convention enforcement, and quality gates. The CLAUDE.md file becomes a shared artifact. The review pipeline becomes a shared responsibility. Individual coding style matters less. Shared precision matters more.

Some teams resist this. I showed my own OpenAPI toolchain to my team. They said no, for completely reasonable reasons. The story of that no, and what I did with it, is in "I Built an OpenAPI Toolchain. My Own Team Rejected It."

How to start

The temptation is to build the full pipeline immediately. Do not do this.

Start with one thing: write a CLAUDE.md for your project. Add your build commands. Add your naming conventions. Add the three things you always have to explain to new team members. Run one session with it. Iterate.

Step two: let the agent complete one full task without interrupting it. Pick something concrete and bounded. A bug fix. A small feature. A refactor with a clear definition of done. Watch the output. Notice where it goes wrong. Do not fix the agent's mistakes by taking over. Fix them by improving the context it had.

Step three: add a second agent for a specific domain. Not because two agents are better than one, but because the act of splitting forces you to think about what each agent needs to know and what it does not. That thinking is the real skill.

Step four: add memory. Build STATUS.md. Write the first journal entry after a session. The goal is not a perfect system. The goal is not losing context between sessions.

The rest follows from those four steps. Skills. Review pipelines. Orchestration. Sandboxing. Each one becomes obvious when you reach the problem it solves.

The hardest part is not technical. It is staying in the investment phase long enough. The post The Walls That Taught Me More Than the Breakthroughs is about the invisible ceilings at each level and how to break through them.