How to Build Your Agentic AI Roadmap in 2026
From first idea to production-grade agentic system. A phase-by-phase roadmap with frameworks, worksheets, and actionable phase gates.
How to Use This Playbook
This playbook serves two different readers. Identify which one you are before going further. Your track determines which sections are your primary outputs and which are reference material.
Track 1: The First-Time Builder
You have a use case and a rough idea. You have not built an agent before and need to understand what you’re building before you commit to building it.
- Read every section front to back
- The Build chapter is your primary output
- Complete all worksheets before moving to Build
- Use the First 48 Hours section as your literal starting point
Track 2: The Experienced Builder
You know how to build agents. You need an organizational framework to align stakeholders, justify budget, and run the project without it stalling.
- Agent Design Canvas is your primary output
- Business Case Template for stakeholder alignment
- Phase Gate Checklists to prevent sequencing errors
- Build chapter is a reference, not a tutorial
Three Prerequisites
Before going further, answer these honestly. If you cannot answer all three, the relevant phases address each directly.
1. Can you state your AI agent problem as a number? Not “improve customer service.” Something specific: “First-response time is 48 hours and needs to be under 4.”
2. Do you have one named person with P&L accountability who owns this? Not a committee. One person who can say yes to spend.
3. Do you know the difference between your current process and the ideal process if a human were not the bottleneck? If no, start at Phase 1 regardless of anything else.
The technology is not failing. The approach is.
There is a version of this moment told as a success story. AI agents are everywhere. Funding is flowing. Boards are demanding AI strategies. Here is the version that does not make the keynotes.
The teams that reach production are not smarter or better-funded. They are more structured. They ask different questions at the start. They phase their work in a specific order. They build governance before they need it.
The market opportunity is real. The global agentic AI market sits at $7.3–7.6 billion in 2025 and is projected to reach $139–199 billion by 2034 (40–44% CAGR). Google Cloud and BCG have identified approximately $1 trillion in global systems integrator services driven by agentic AI adoption.
What an AI Agent Actually Is
Gartner estimates that of the thousands of vendors claiming agentic capabilities, only approximately 130 offer genuine agentic features. The rest are rebranded chatbots, RPA tools, and AI assistants: a practice Gartner calls “agent washing.” You cannot build a roadmap for something you cannot accurately define.
A true AI agent has five capabilities. The first four define what it is. The fifth defines how it works in practice.
Perception
Receives and interprets inputs from its environment: natural language, structured data, API responses, file contents, system events. A chatbot that only responds to typed text is not perceiving. It is matching patterns.
Action
Takes actions that affect the world beyond generating text: calls APIs, updates databases, sends messages, executes code. An agent that only produces outputs for a human to act on is an assistant. An agent that acts directly is an agent.
Memory
Retains relevant information across interactions: short-term session memory, long-term persistent memory, or domain knowledge in a retrieval layer. Without memory, every interaction starts from zero.
Autonomy
Can pursue a goal across multiple steps without human input at each step. The level of autonomy varies. Autonomy without governance is a risk. Autonomy with governance is the goal.
Orchestration
The coordination layer that sequences multiple steps, routes between tools, handles conditional logic, and manages retries when something fails. Even a simple Level 1 agent typically has at least two steps: an input parsing step and an output generation step. Orchestration is what connects them. Without it, you do not have an agent. You have a prompt with extra steps. This is the component most first-time builders underestimate, and the one most responsible for agents that work in demos but break in production.
The 5-Phase Framework
Most agent projects fail because teams do things in the wrong order. They build before validating the problem. They deploy before designing governance. They scale before a single agent has proven value. Each phase produces something the next phase depends on. Skip one and the dependency breaks.
Why Agent Projects Fail: The 5 Structural Root Causes
Each failure mode includes a self-diagnostic. Run these against your current situation before you start. The “Problem” column tells you what goes wrong. The “Diagnostic” column tells you whether it is already going wrong for you.
Root Cause 1: Hype-Driven Adoption
Organizations commit budget because leadership read an article or saw a competitor announcement, without identifying a specific problem with a measurable cost. Gartner: “Most agentic AI projects right now are driven by hype and are often misapplied.”
The business problem must be stated as a number before any technology is touched. Not “improve customer service.” Something like: “First-response time is 48 hours and needs to be under 4.”
- Can you name the specific person whose job gets measurably easier if this works?
- Can you state the current cost of the problem in hours or dollars per month?
- Has someone said “we should do something with AI” without a specific process in mind?
If you answered no, no, yes: you are in hype-driven territory. Do not proceed to Architecture until you have a number.
Root Cause 2: Automating a Broken Process
Organizations that succeed are more than twice as likely to have redesigned their workflows before selecting technology (MIT NANDA, 2025). Agentic AI does not improve a broken process. It automates it. The broken parts run faster and create problems at higher volume.
Map the current state. Design the ideal state assuming no human bottleneck. Build the agent for the redesigned process, not the existing one.
- Can you draw the current process on a whiteboard, every step, every handoff, every system?
- Are there steps in the current process that exist only because of human limitations (scheduling, manual lookup, copy-paste)?
- Does the process produce consistent outputs, or does it depend heavily on who is doing it?
If you cannot draw the current process, or if it only works when specific people do it: redesign first. Build second.
Root Cause 3: Governance Bolted On After Deployment
Gartner names three causes for its 40%+ cancellation prediction: escalating costs, unclear business value, and inadequate risk controls. All three are governance failures. Costs escalate without cost architecture. Value is unclear without SLOs. Risk controls fail when RBAC is bolted on after the fact.
RBAC, audit trails, cost caps, and approval gates are designed in Phase 2, not added after deployment. Governance is not friction. It is how agents earn the organizational trust needed to expand.
- Do you know what a failed agent run will cost you in LLM tokens?
- Have you defined what “working correctly” means as a number, before writing any code?
- If the agent sends a wrong message to a customer tomorrow, who finds out and how?
If you cannot answer all three, governance is not designed. Fill in the Phase 2 canvas before starting Build.
Root Cause 4: Integration Assumed Instead of Confirmed
70% of developers cite integration problems as a primary challenge. 42% of enterprises need access to 8+ data sources to deploy agents successfully, with 79% expecting data challenges to impact rollouts. An agent with a perfect prompt but unreliable data retrieval is not a production agent.
Integration mapping is Phase 2 work, not Phase 3. Every data source is identified, access is confirmed, and auth is resolved before the first line of agent code is written.
- Can you list every system the agent needs to read from or write to?
- For each system, do you have confirmed API access, or are you assuming you can get it?
- Do any of those systems require IT procurement approval that has not started?
Assumed access is not confirmed access. If any system says “we think we can get that,” stop and verify before scoping the build.
Root Cause 5: No Champion with Decision Authority
Mid-market companies move from pilot to production in an average of 90 days. Large enterprises average 9 months or more. The difference is not resources. It is decision authority (MIT NANDA, 2025). Without someone with P&L accountability invested in the outcome, the first real obstacle pauses the project permanently.
Before scoping begins: name one person who can say yes to spend without a committee, who feels the cost of the problem in their own metrics, and who will still care in 90 days.
- If the project hits a real obstacle in month two, is there one specific person who will fight to keep it moving?
- Does that person’s team directly feel the pain of the problem this agent solves?
- Can that person approve spend without going to a committee?
Three yeses = you have a champion. Anything less = you have enthusiasm, not ownership. Do not start scoping without a champion.
The Champion-Budget-Scope Framework
Before any agent is scoped, before any tool is selected, before any prompt is written: three things must be true. If any one of them is missing, the project will either never start or never finish.
What it is
One named person with P&L accountability who owns the outcome and can approve spend without a committee.
What it is NOT
A senior person who is “supportive.” A team that is “interested.” A steering group that will “review progress.”
Red Flags
- “The team is very excited about this”
- Sponsor changes quarterly
- No single name when asked who owns it
- Champion’s team is not the primary user
The Conversation to Have
Ask the champion directly: “Is this in your current FY budget or does it need a new approval?” In budget: proceed. Needs approval, champion can give it: proceed with timeline. Needs approval above champion: you need a co-sponsor.
Cost Range for First Agent
- Internal developer: $15K–45K (3–6 weeks, 1–2 devs)
- External implementation: $25K–75K
- Platform infrastructure: $500–3K/month ongoing
A project without a committed budget number is a conversation, not a project.
The Test
YES: “Our support team handles 2,400 tier-1 tickets per month. 68% require no human judgment. We want an agent to resolve that 68% without escalation.”
NO: “We want to use AI to improve customer experience.”
Where Your First Scope Should Sit
- Current-state cost is measurable
- Agent handles a meaningful % without human judgment
- Data is accessible (not locked in legacy systems)
- Failure is visible and recoverable, not catastrophic
The 30-Minute Discovery Conversation
Run this with your champion before anything else. These questions separate real projects from wishful thinking. Take notes. The answers are the inputs to your business case.
| Topic | Question to Ask |
|---|---|
| Problem | “If this agent works perfectly, which number in your business changes, and by how much?” |
| Current cost | “How many hours per month does your team spend on this today? What is the error rate? What is the escalation volume?” |
| Data access | “What systems hold the data this agent would need? Do you have API access or would that require IT approval?” |
| Timeline | “What does success look like in 30 days? 90 days? Is there a business event this needs to land before?” |
| Constraints | “What would stop this? Who in the organization would push back, and what would they say?” |
| Measure | “How will we know, on a Tuesday afternoon three months from now, whether this was worth doing?” |
Use Case Selection & The Opportunity Matrix
Start with a structured inventory of where repetitive, structured, high-volume work already exists.
Where to Look: The Six Categories
| Category | Agent-Ready Examples |
|---|---|
| Customer Operations | Tier-1 support tickets, FAQ resolution, returns processing, onboarding steps that follow a decision tree |
| Finance & Compliance | Invoice matching, expense categorization, reconciliation, audit trail generation, KYC checks |
| Sales & GTM | Lead qualification scoring, outbound research, CRM data enrichment, proposal generation |
| HR & Internal Ops | Employee FAQ handling, onboarding document routing, leave request processing, policy lookups |
| Data & Reporting | Weekly report generation, data normalization, dashboard population, alert triage |
| Supply Chain & Ops | Inventory status queries, supplier communication drafting, shipment tracking, exception flagging |
The 4-Quadrant Opportunity Matrix
Plot your candidates. Two axes: Business Impact (what does this cost today, or what does it unlock?) and Implementation Complexity (data accessibility, integration count, governance requirements).
Plan for Phase 2
High value but too complex for a first build. Document it. Revisit after Level 1 proves value.
Build First ✓
Clear value + achievable scope. This is your first agent. Quantify the baseline cost today.
Skip
Neither the value nor the complexity justifies the distraction. Leave it off the roadmap.
Consider for Quick Win
Low complexity, lower returns. Only pursue if you need an early internal demonstration of value.
Use Case Scoring Worksheet
Score your top 3–4 use cases across 5 dimensions. Each dimension scored 1–5. Maximum 25 points. The use case with the highest score AND a committed champion is your first build. Score ≥18 with a committed champion = proceed to Phase 2.
| Dimension | What It Measures | Use Case A /5 | Use Case B /5 | Use Case C /5 |
|---|---|---|---|---|
| Measurable Cost | Can you quantify what this costs today in time, errors, or revenue impact? (1=no data, 5=precise numbers) | __ /5 | __ /5 | __ /5 |
| Data Accessibility | Is the required data in accessible systems with existing APIs? (1=locked legacy, 5=clean API ready) | __ /5 | __ /5 | __ /5 |
| Agent Coverage | What % of total volume can the agent handle without human judgment? (1=<20%, 5=>70%) | __ /5 | __ /5 | __ /5 |
| Low Governance Risk | Is failure recoverable? Can humans review before permanent action? (1=irreversible/public, 5=internal/reversible) | __ /5 | __ /5 | __ /5 |
| Champion Commitment | Does your champion personally feel this problem and own the outcome? (1=indirect, 5=primary pain owner) | __ /5 | __ /5 | __ /5 |
| Total Score | Maximum: 25 points | __ /25 | __ /25 | __ /25 |
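The worksheet's proceed rule can be sketched as code. This is a minimal sketch: the dimension keys are made up for illustration, and "committed champion" is approximated here as a Champion Commitment score of 4 or above, which is an assumption rather than a rule from the worksheet.

```python
# Hypothetical scoring helper for the 5-dimension worksheet above.
DIMENSIONS = [
    "measurable_cost",
    "data_accessibility",
    "agent_coverage",
    "low_governance_risk",
    "champion_commitment",
]

def score_use_case(scores: dict) -> dict:
    """Sum the five 1-5 dimension scores and apply the proceed rule."""
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be scored 1-5")
    total = sum(scores[dim] for dim in DIMENSIONS)
    # Proceed to Phase 2 at 18+ points, with a committed champion
    # (approximated here as champion_commitment >= 4).
    proceed = total >= 18 and scores["champion_commitment"] >= 4
    return {"total": total, "proceed": proceed}
```

Run it over your top 3–4 candidates and compare totals; the tie-breaker is always the champion, not the score.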
Building Your Business Case
A business case is not a slide deck with market projections. It is a one-page document that answers three questions: what does this cost today, what will the agent change, and when do we break even? If you cannot answer all three, you are not ready to build.
| Line Item | How to Calculate | Your Number |
|---|---|---|
| Current-state baseline | Volume handled per month × average handling time × fully-loaded cost per hour = monthly cost before agent | $______ |
| Agent-handled volume | Total volume × coverage % (use a conservative estimate) | ___ units |
| Post-agent monthly cost | Remaining human-handled cases × handling time × hourly cost + agent infrastructure cost per month | $______ |
| Gross monthly savings | Current-state baseline minus post-agent monthly cost | $______ |
| Implementation cost | All-in: development, infrastructure setup, integration work, testing. Typical range: $25K–75K | $______ |
| Break-even timeline | Implementation cost ÷ gross monthly savings = break-even months. Target: <6 months for first agent | __ months |
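The worksheet math in one place, as a sketch. The example numbers below (2,400 tickets, 15 minutes each, $40/hour fully loaded, 68% coverage) are illustrative inputs, not benchmarks.

```python
def business_case(volume_per_month, handling_minutes, hourly_cost,
                  coverage, infra_cost_per_month, implementation_cost):
    """One-page business case math from the worksheet above."""
    # Monthly cost before the agent exists.
    baseline = volume_per_month * (handling_minutes / 60) * hourly_cost
    # Cases the agent does NOT handle still cost human time.
    remaining_human = (volume_per_month * (1 - coverage)
                       * (handling_minutes / 60) * hourly_cost)
    post_agent = remaining_human + infra_cost_per_month
    gross_savings = baseline - post_agent
    breakeven_months = implementation_cost / gross_savings
    return {"baseline": baseline,
            "gross_savings": gross_savings,
            "breakeven_months": round(breakeven_months, 1)}

# Illustrative: 2,400 tickets/month, 15 min each, $40/hr, 68% coverage,
# $1,500/month infrastructure, $40K implementation.
result = business_case(2400, 15, 40, 0.68, 1500, 40000)
```

With these inputs the baseline is $24,000/month and break-even lands under 3 months, comfortably inside the <6 month target.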
LLM Cost Estimation & Integration Patterns
Two things that consistently surprise first-time builders: how much LLM calls cost at real volume, and how long integration confirmation actually takes. Both need to be in your business case and your timeline before Build starts.
Approximate LLM Cost at 1,000 Queries Per Month
These are order-of-magnitude estimates for a standard support agent handling queries of 200–500 tokens each, with a response of similar length. Actual costs vary significantly by prompt length, response length, and task complexity.
| Model | Provider | Approx. Monthly Cost | Best For | Trade-off |
|---|---|---|---|---|
| GPT-4o mini | OpenAI | $2–8 | Simple classification, FAQ, routing | Lower reasoning quality on complex tasks |
| Claude Haiku 3.5 | Anthropic | $2–6 | Document processing, structured extraction | Less capable on open-ended generation |
| GPT-4o | OpenAI | $25–80 | Complex reasoning, multi-step workflows | Cost grows fast at high volume |
| Claude Sonnet 4.5 | Anthropic | $20–70 | Analysis, long documents, nuanced tasks | Cost grows fast at high volume |
| Llama 3.1 (self-hosted) | Meta / your infra | Infra only | Sensitive data, high volume, cost control | Requires engineering to deploy and maintain |
| GPT-4o (10K queries) | OpenAI | $250–800 | Same as above, 10x volume | Choose a cheaper model or self-host first |
| LyzrGPT (platform) | Lyzr · multi-model | $0.03–0.08/run + LLM cost | Teams that want automatic model routing without managing separate API contracts per provider | Platform fee on top of underlying LLM cost. Best value above 5K runs/month where routing savings offset the platform layer. |
Rule of thumb: start with the cheapest model that passes your quality test. For most first agents, GPT-4o mini or Claude Haiku handles the task adequately and costs 10–20x less than flagship models at the same volume.
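The per-query math is simple enough to sanity-check yourself. A sketch follows; the per-million-token prices are placeholders, not current rates (check your provider's pricing page), and note that real agents consume far more tokens per query than the raw question/answer pair once system prompts, retrieved context, and retries are counted, which is why the table's estimates run higher than naive math suggests.

```python
# Illustrative (input_$, output_$) prices per million tokens - NOT real rates.
PRICE_PER_MTOK = {
    "small-model": (0.15, 0.60),
    "flagship-model": (2.50, 10.00),
}

def monthly_llm_cost(model, queries_per_month, in_tokens, out_tokens):
    """Order-of-magnitude monthly LLM spend for a simple agent."""
    p_in, p_out = PRICE_PER_MTOK[model]
    per_query = (in_tokens * p_in + out_tokens * p_out) / 1_000_000
    return queries_per_month * per_query
```

Run it with your real prompt sizes before writing the business case; the input-token count (system prompt plus retrieved context) usually dominates.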
Three Integration Patterns to Know Before You Build
Every first agent involves at least one integration. Here are the three patterns that cover 80% of cases, what you need to confirm before Build starts, and what breaks in production when you skip the verification step.
The most common pattern. The agent calls an external API (CRM, ticketing system, database service) using a key or OAuth token.
- What you need before Build: API documentation, a test API key, and a confirmed sandbox environment
- Typical timeline to confirm: 1–5 days if IT owns the API
- What breaks in production: tokens expire (usually 30–90 days) without automatic refresh logic. Auth breaks silently and the agent fails without a clear error.
The agent queries a SQL or NoSQL database directly. Most common for internal reporting tasks, data lookup, or enrichment tasks.
- What you need before Build: a read-only service account, confirmed schema access, and a test query that returns real data
- Typical timeline to confirm: 1–3 weeks if DBA approval is required
- What breaks in production: schema changes in the source database break agent queries with no warning. Add schema monitoring to your SLO check.
The agent processes uploaded documents (PDFs, CSVs, DOCX). Most common for document review tasks, contract analysis, or report generation tasks.
- What you need before Build: sample documents in production format, confirmed file size limits, and a clear understanding of the document’s internal structure
- Typical timeline to confirm: 1–2 days
- What breaks in production: real documents are messier than sample documents. Tables, embedded images, and scanned PDFs break extraction logic that worked perfectly in testing.
All three integration patterns are handled through native tool connectivity. You authenticate once per integration at the platform level. Token refresh, schema-change monitoring, and document preprocessing are managed by the platform rather than built per agent — which removes the most common single cause of production failures from your Build scope entirely. Start at architect.new.
The Agent Design Canvas
Complete all 9 sections before writing any code. If you cannot fill in a section, that section is your next task, not something to figure out during Build. An incomplete canvas is a risk register. The right column shows a worked example for a tier-1 support agent.
| # | Canvas Section | Worked Example: Tier-1 Support Agent |
|---|---|---|
| 01 | Goal Statement | Resolve the 68% of tier-1 tickets that require no human judgment, without escalation |
| 02 | Trigger | A new ticket lands in the support queue |
| 03 | Inputs | Ticket text, customer record, relevant knowledge-base articles |
| 04 | Actions | Draft and send a resolution, update ticket status, or escalate |
| 05 | Tools & Integrations | Ticketing system API, knowledge-base retrieval, CRM lookup |
| 06 | Memory Requirements | Short-term session memory only for the first build |
| 07 | Handoff Conditions | Agent cannot determine the resolution with confidence, or the request involves billing or refunds |
| 08 | SLO (Success Metric) | Error rate below 3% on first-pass resolution; latency under 8 seconds P95 |
| 09 | Failure Mode | Failed runs route the ticket to the human queue and fire a notification |
Building Your First Agent
Build starts only after the Agent Design Canvas is complete and signed. Everything in Phase 2 was the work that makes Build predictable. This chapter covers what to build with, how to choose a model, how the components fit together, and what to do on your literal first day.
Choosing Your Model
The most consequential early decision is which model to use. The right answer depends on your task type, your volume, and your cost constraints, not on which model sounds most impressive. Start with the cheapest model that passes your quality test. Upgrade only when you can measure the quality gap.
If you are unsure: start with GPT-4o mini or Claude Haiku for your first build. Run 50 real test cases. If quality is insufficient, upgrade to GPT-4o or Claude Sonnet. The quality gap is usually smaller than expected, and the cost gap is always larger.
Choosing How to Build
There are three categories of build approach. The right one depends on your team’s technical capability, not on which sounds most sophisticated. The goal is a working agent in production, not an impressive architecture diagram.
A Level 1 agent has one trigger, one primary flow, and one output. Resist scope expansion during build. Every feature added before the base case works is a failure mode that is harder to debug. Build the happy path first. Add edge case handling second. Add features third.
Use Guided Mode to generate your Plan Document and agent architecture from your Design Canvas. The platform selects models and populates prompts from enterprise templates. Recommended for teams without dedicated AI engineering.
Define “working” in numbers before a single line is written. “Error rate below 3% on first-pass resolution. Latency under 8 seconds P95. Volume threshold: handles 100+ queries/day without degradation.” These numbers are your exit criteria for Build and your entry criteria for Deploy.
Push to Agents creates your agent architecture with locked-in prompts, KB integration, and tool connections. Run logs are visible in the platform. Edit agent parameters in Studio before deployment.
Before any user-facing deployment: run 50+ test scenarios, including adversarial inputs. Test the handoff condition: confirm the agent actually escalates when it should. Test the failure mode: confirm the failure notification fires. A production agent with an untested handoff path is a liability, not a product.
Use the Live Preview in the App tab to test agent responses in real time. Deploy to a staged URL on Netlify before sharing externally. The staged URL is your test environment.
Agent Architecture: The Four Components
Every agent, regardless of platform or framework, is made of four components. Understanding what each one does tells you what to configure, what to test, and what breaks when something goes wrong.
The model that reasons, plans, and decides what to do next. It reads the input, retrieves from memory if needed, chooses which tools to call, and generates the output. The LLM does not take actions directly. It decides what actions to take, then the orchestration layer executes them.
What the agent knows and remembers. Short-term memory holds the current session: the conversation so far, retrieved documents, intermediate results. Long-term memory persists across sessions: customer history, learned preferences, past outcomes. Most first agents need only short-term memory. Add long-term memory only when you can measure the quality improvement it provides.
The list of actions the agent can take. Each tool is a function with a name, a description, and an input/output schema. The LLM reads the tool descriptions and decides which one to call based on the task. Tools are what turn a chatbot into an agent. Without tools, the agent can only generate text. With tools, it can search a database, send an email, update a CRM record, or call any API you have defined.
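Concretely, a tool definition is just a name, a description the LLM reads, and an input schema. Here is a minimal sketch in the JSON-schema style most agent frameworks use; the tool name, fields, and ticketing example are hypothetical.

```python
# Hypothetical tool definition: name + description + input schema.
# The LLM picks this tool by reading the description, so write it
# for the model, not for other developers.
lookup_ticket_tool = {
    "name": "lookup_ticket",
    "description": "Fetch a support ticket by ID from the ticketing system.",
    "parameters": {
        "type": "object",
        "properties": {
            "ticket_id": {
                "type": "string",
                "description": "Ticket identifier, e.g. 'T-1042'",
            },
        },
        "required": ["ticket_id"],
    },
}
```

The schema doubles as your test contract: anything the agent passes to the tool should validate against it.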
The engine that keeps the agent working. It runs a cycle: observe the current state, reason about what to do next, take an action, check if the goal is reached. It continues until either the goal is complete, a stop condition fires (max steps, error threshold), or the agent decides to hand off to a human. The run loop is also where retry logic, error handling, and fallback behavior live. A well-designed run loop is the difference between an agent that fails gracefully and one that fails silently.
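The observe-reason-act cycle described above can be sketched in a few lines. This is a toy illustration, not any framework's API: `llm_decide` stands in for the model call and returns a ("call", tool, args), ("done", answer), or ("handoff", reason) decision.

```python
def run_loop(goal, llm_decide, tools, max_steps=10):
    """Minimal observe-reason-act loop (illustrative sketch)."""
    state = {"goal": goal, "history": []}
    for _ in range(max_steps):           # stop condition: max steps
        decision = llm_decide(state)     # reason: what to do next?
        if decision[0] == "done":
            return {"status": "done", "answer": decision[1]}
        if decision[0] == "handoff":
            return {"status": "handoff", "reason": decision[1]}
        _, tool_name, args = decision
        try:
            result = tools[tool_name](**args)   # act
        except Exception as exc:                # fail loudly, not silently
            result = {"error": str(exc)}
        state["history"].append((tool_name, result))   # observe
    return {"status": "handoff", "reason": "max steps exceeded"}
```

Everything that makes an agent production-grade lives in this loop: the max-step cap, the error capture, and the fact that exhaustion ends in a handoff rather than a silent stall.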
Your First 48 Hours
This is the section that converts a completed Design Canvas into something running. Follow these steps in order. Do not skip to step 4 because it sounds more interesting. The point of steps 1–3 is to eliminate variables before you add complexity.
1. Get API access to your chosen model
Create an account with your model provider (OpenAI, Anthropic, or your platform of choice). Get an API key. Set a spending limit before you do anything else: set it low, around $20. You will hit it and need to raise it. You will never accidentally spend $500 on a misconfigured loop.
If you are using a no-code platform like Architect, create your account and run one of the default example agents before building your own. Confirm it works end-to-end before touching your use case.
2. Build a version with no tools and no integrations
Write a system prompt that describes your agent’s role and goal. Send it a real example input from your use case. Look at the output. Is the reasoning coherent? Is the format correct? Fix the prompt until you get output you would be comfortable showing a user. This step has zero integration risk and tells you how hard your prompt engineering job is going to be.
Time to complete: 2–4 hours. If it takes longer, your task is more complex than your Design Canvas suggests. Revisit section 08 (SLO) before continuing.
3. Add one tool. Test it in isolation.
Add the first tool from the Tools & Integrations section of your Design Canvas. Call it manually with a test input. Confirm the output is what you expect. Then add it to the agent and run the same test input you used in step 2. Confirm the agent uses the tool correctly and that the output improves. Add tools one at a time. Never add two at once.
The most common mistake: adding all tools at once, getting a failure, and not knowing which tool caused it.
4. Test your handoff condition before anything else goes to production
Before adding more tools or more complexity: deliberately trigger the handoff condition from your Design Canvas section 07. Send an input that should cause escalation. Confirm the agent escalates correctly and that the right person is notified. This is the most commonly skipped test and the most consequential one.
5. Run 20 real inputs, document every failure
Pull 20 real examples from your use case (not invented test cases). Run them through the agent. For every failure, note what went wrong: wrong tool call, wrong format, wrong reasoning, or missed handoff trigger. Fix the most common failure pattern before adding more test cases. Repeat until pass rate is above 80% on 20 cases, then scale to 50.
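A tiny harness makes this step mechanical. The sketch below (names are illustrative) computes the pass rate and surfaces the single most common failure category to fix first, matching the 80% threshold above.

```python
from collections import Counter

def review_test_run(results):
    """results: list of (passed: bool, failure_category: str or None).
    Returns pass rate plus the most common failure to fix first."""
    passed = sum(1 for ok, _ in results if ok)
    rate = passed / len(results)
    failures = Counter(cat for ok, cat in results if not ok)
    top = failures.most_common(1)[0][0] if failures else None
    return {"pass_rate": rate,
            "fix_first": top,
            "ready_to_scale": rate >= 0.8}
```

Record the failure category for every run (wrong tool call, wrong format, wrong reasoning, missed handoff) so `fix_first` is meaningful.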
6. Deploy to a staged URL and test with 3 internal users
Get the agent running at a URL that real users can access. Ask 3 people from the target team to use it for one week with real inputs. Collect every failure. Fix the top 3 most common issues. Only after this step should you consider production deployment.
On Architect: use the Deploy button to get a live Netlify URL. Share it with your internal testers before announcing it broadly.
7. Activate governance before opening to more users
Before more than 5 people have access: turn on audit logging, set your cost cap, configure RBAC, and confirm the SLO monitoring dashboard is live. These are not optional extras to add later. They are Phase Gate 3 requirements. If they are not active, you are not ready to deploy broadly.
Common First Agent Mistakes
These are the patterns that consistently kill agents between demo and production. They are not edge cases. They are the norm.
Long inputs break what short inputs proved
Your prompt works perfectly on a 200-word test input and breaks silently on a 2,000-word real document. Set a hard limit on input size and add a truncation or chunking step before the agent runs. Test with your longest real inputs, not your shortest.
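A minimal input guard looks like this. It is a character-based sketch with made-up limits; production code should count tokens with your model's tokenizer, not characters.

```python
def clamp_input(text, max_chars=8000, chunk=4000, overlap=200):
    """Pass short inputs through untouched; split long ones into
    overlapping chunks so nothing is lost at a boundary."""
    if len(text) <= max_chars:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk])
        start += chunk - overlap   # overlap keeps boundary context
    return chunks
```

Run each chunk through the agent separately, or summarize-then-answer; either way the hard limit means a 2,000-word document can no longer fail silently.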
The prompt only covers the happy path
The model does exactly what your prompt says. The problem is that your prompt describes what to do in the happy path, not what to do when inputs are ambiguous, incomplete, or adversarial. Add explicit instructions for what the agent should do when it is uncertain. “If you cannot determine X with confidence, escalate” beats leaving it to the model’s judgment.
Tools that work alone fail in sequence
Tool A works. Tool B works. When the agent calls Tool A and passes its output to Tool B, the output format from A does not match the input format B expects, and the whole chain fails. Always test the full sequence end-to-end, not just individual tools. Define the input and output schema for every tool explicitly and test handoffs between them.
Auth tokens expire silently
You authenticate, the agent works, you deploy. 30 days later, the auth token expires and the agent fails on every call with a cryptic error. Implement token refresh logic before production or set a calendar reminder to rotate tokens before they expire. The first expiry usually happens when you are not watching.
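The simplest defense is a proactive expiry check that refreshes well before the deadline. A sketch, assuming your auth provider gives you a timezone-aware expiry timestamp; the 7-day margin is an illustrative choice.

```python
from datetime import datetime, timedelta, timezone

def token_needs_refresh(expires_at, safety_margin_days=7):
    """True when the token is inside the safety window before expiry.
    expires_at: timezone-aware datetime from your auth provider."""
    return datetime.now(timezone.utc) >= expires_at - timedelta(days=safety_margin_days)
```

Call this at the start of every run (or on a daily schedule) and refresh when it returns True, so auth never breaks between runs.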
Model output is almost-but-not-quite JSON
You ask the model to return JSON. It returns JSON most of the time. On 3% of calls, it wraps the JSON in markdown backticks, adds an explanation before it, or returns slightly different field names. Your downstream system breaks. Use structured output mode (function calling / JSON mode) where available. Never parse free-text output with regex in production.
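Where structured output mode is not available, a tolerant fallback parser at least survives the fence-and-prose wrapping. A sketch; it extracts the outermost JSON object and still raises loudly when there is none.

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Fallback parser: strip the markdown fences and prose a model
    sometimes wraps around JSON. Prefer structured-output mode."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)   # outermost {...}
    if not match:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```

Note this only handles wrapping, not drifting field names; validate the parsed dict against an expected schema before passing it downstream.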
No retry logic for transient failures
API calls fail. Networks have timeouts. Model providers have brief outages. An agent with no retry logic treats a 500ms network hiccup the same as a genuine error. Implement exponential backoff with a maximum of 3 retries on any external API call. Log every retry. Alert on any call that exhausts all retries.
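Exponential backoff with a retry cap fits in a dozen lines. A minimal sketch; real code would log through your logging stack instead of `print`, and might retry only on specific exception types.

```python
import random
import time

def call_with_retry(fn, max_retries=3, base_delay=1.0):
    """Exponential backoff with jitter; raises after the last attempt
    so exhaustion is visible to alerting, never swallowed."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries:
                raise   # exhausted: let monitoring see it
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)
            print(f"retry {attempt + 1} after {delay:.1f}s: {exc}")
            time.sleep(delay)
```

The jitter term prevents many agents from retrying in lockstep against the same recovering service.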
Real documents are messier than test documents
You test with 5 clean PDFs. Production has 500 PDFs, 30% of which are scanned images with no text layer, have tables in non-standard formats, or are password-protected. Test your document processing pipeline on your 20 messiest real documents before build is complete. If any fail, solve that before deployment.
The escalation path is never tested
The agent handles the happy path in testing. Nobody tests whether the escalation path actually works. The first time a production case should escalate, the escalation fails silently and the customer gets no response. Test your handoff condition before anything else goes to production. It is step 4 in the First 48 Hours section for this reason.
Governance: The Layer That Determines Whether Agents Scale
Governance is not a compliance checkbox. It is the mechanism by which agents earn the organizational trust needed to expand. Build it before you need it. By the time you need it, it is already too late to add it cleanly.
| Layer | What It Is | Minimum Requirement |
|---|---|---|
| RBAC | Role-Based Access Control | Four roles defined before deployment: Admin, Operator, Viewer, Override. No agent accessible without authentication. |
| Audit Trail | Every action, logged | Every agent action logged with timestamp, agent ID, input received, output produced, and any human override. Logs retained minimum 90 days. |
| Cost Caps | Per-run and monthly limits | Maximum cost per agent run + monthly budget ceiling. Alert at >2x normal token consumption per session. |
| Approval Gates | Human-in-the-loop for high stakes | Any irreversible action (external communication, financial record modification, data deletion) requires human approval until error rate is below SLO for 30+ consecutive days. |
| SLO Monitoring | Numbers, not feelings | Monitor: error rate (% of runs with incorrect output), latency (P95 response time), volume throughput. Alert at 1.5x baseline. Escalate at 2x. |
| Prompt Versioning | Change control for AI | Every change to prompt, model, or tool configuration version-controlled with timestamp and author. Rollback possible within 10 minutes. No prompt changes in production without signed review. |
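The SLO Monitoring row's 1.5x/2x rule is trivially checkable in code. A sketch; wire the result into whatever alerting you already run rather than treating this as a monitoring system.

```python
def slo_status(value, baseline):
    """Classify a metric against its baseline per the governance table:
    alert at 1.5x baseline, escalate at 2x."""
    if value >= 2 * baseline:
        return "escalate"
    if value >= 1.5 * baseline:
        return "alert"
    return "ok"
```

Run it on error rate, P95 latency, and per-session token consumption; the baseline comes from your first weeks of logged runs.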
Pre-Deployment Checklist
Every box must be checked before production deployment. If any box remains unchecked, it is a risk, not a detail.
- Agent Design Canvas complete and signed by champion and project lead
- Integration Map: all data sources confirmed with API access and auth
- SLOs defined in writing: error rate, latency, volume thresholds
- RBAC configured: all four roles defined and assigned
- Audit logging active: test log entries verified
- Cost caps set: per-run limit and monthly ceiling configured
- Approval gates configured for all irreversible actions
- 50+ test scenarios run including adversarial inputs
- Handoff condition tested: agent escalates correctly
- Failure notification tested: alert fires correctly
- Staged deployment URL tested by 3+ internal users
- Rollback procedure documented and tested
- Auth token expiry dates noted and refresh schedule set
- SLO monitoring dashboard live and verified
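The "audit logging active: test log entries verified" item can be smoke-tested in a few lines. The JSON-lines record below is an assumed format covering the minimum audit fields named in the governance table (timestamp, agent ID, input, output, human override), not a standard:

```python
import json
import time

def audit_record(agent_id, input_text, output_text, human_override=None):
    """Write one audit entry as a JSON line (illustrative schema)."""
    return json.dumps({
        "ts": time.time(),
        "agent_id": agent_id,
        "input": input_text,
        "output": output_text,
        "human_override": human_override,
    })

def verify_record(line):
    """Pre-deployment check: the entry parses and has every required field."""
    entry = json.loads(line)
    required = {"ts", "agent_id", "input", "output", "human_override"}
    return required.issubset(entry)
```

Running `verify_record(audit_record(...))` against your staging pipeline is a cheap way to turn the checklist item from "we think logging works" into a repeatable test.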
What a Real Agent Looks Like
at 30 Days
The most common unknown for first-time builders is not how to build the agent. It is how to know whether the agent is working after deployment. This section tells you what a healthy agent looks like, what a struggling one looks like, and what to do about the difference.
Week 1: Measure
- Log every run. Review logs daily, not weekly.
- Record actual error rate, latency, and volume
- Note every case that escalated to a human and why
- Document every case that should have escalated but didn’t
- Compare actual cost per run to estimate from business case
- Your SLO targets are aspirational at this stage. Your job is measurement, not optimization.

Weeks 2–3: Diagnose and Fix
- Categorize failure modes: are they prompt failures, tool failures, or integration failures?
- Identify the most common input type that causes failures
- Check whether escalation rate is trending up or down
- Confirm cost per run is stable and predictable
- Fix the top 2 failure patterns. Version-control every change.
- Do not fix more than 2 things at once. You need to know which fix worked.

Day 30: Report to the Champion
- Total volume handled vs. baseline estimate
- Actual resolution rate vs. SLO target
- Actual cost per run vs. business case estimate
- Top 3 failure modes identified and status (fixed / in progress / accepted risk)
- Recommendation: proceed to scale or extend Level 1 stabilization
- This report is Phase Gate 4 input. No scale decision without it.
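The report numbers fall straight out of the run log. A minimal sketch, assuming each run is logged as a dict with `ok`, `escalated`, and `cost_usd` fields (hypothetical names; match them to your own log schema):

```python
# Hypothetical run-log rows: one dict per agent run.
runs = [
    {"ok": True,  "escalated": False, "cost_usd": 0.04},
    {"ok": False, "escalated": True,  "cost_usd": 0.06},
    {"ok": True,  "escalated": False, "cost_usd": 0.05},
    {"ok": True,  "escalated": True,  "cost_usd": 0.05},
]

def thirty_day_report(runs):
    """Compute the three headline metrics the 30-day report compares
    against the business case: error rate, escalation rate, cost per run."""
    n = len(runs)
    return {
        "error_rate": sum(not r["ok"] for r in runs) / n,
        "escalation_rate": sum(r["escalated"] for r in runs) / n,
        "cost_per_run": sum(r["cost_usd"] for r in runs) / n,
    }
```

If you logged every run in Week 1, producing this report is a query, not a project.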
Drift and Degradation: When to Worry
Agents degrade over time even when nothing in the agent changes. Source data drifts, user input patterns shift, and the real-world distribution diverges from the test distribution. Here is what to watch for and what to do about it.
| Warning Sign | Severity | Likely Cause | First Response |
|---|---|---|---|
| Error rate up 25–50% from baseline | WATCH | Input pattern shift, prompt edge case surfacing | Review last 20 failed cases. Identify common pattern. Update prompt. |
| Error rate up 2x or more from baseline | ACT NOW | Integration failure, source data change, model behavior shift | Pause agent. Check integrations. Check source data schema. Roll back last prompt change. |
| Latency up 50% with no volume change | WATCH | Provider API slowdown, tool timeout, context window growth | Check provider status. Check tool response times. Check if prompt has grown. |
| Escalation rate trending steadily upward | WATCH | New input types not covered by training distribution | Categorize escalation reasons. Add handling for top 2 uncovered cases. |
| Cost per run trending upward unexpectedly | WATCH | Input length growth, prompt chain getting longer, retry rate increasing | Check average input token count. Check retry logs. Check if any tool is failing and triggering retries. |
| Integration auth failure | ACT NOW | Token expiry, API key rotation, endpoint change | Check auth token expiry. Rotate and update. Check API changelog for endpoint changes. |
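The WATCH and ACT NOW bands for error rate reduce to a ratio check against baseline. A sketch, with the 1.25x lower bound taken from the table's "up 25–50%" band and everything from 1.25x up to 2x treated as WATCH (an assumption; the table does not name the 1.5x–2x range):

```python
def drift_severity(metric_now, baseline):
    """Map an error-rate reading to the table's severity bands.
    1.25x-2x baseline -> WATCH; 2x or more -> ACT NOW."""
    if baseline <= 0:
        return "ACT NOW"  # no valid baseline: treat as an incident
    ratio = metric_now / baseline
    if ratio >= 2.0:
        return "ACT NOW"
    if ratio >= 1.25:
        return "WATCH"
    return "OK"
```

Wiring this into a daily job against your monitoring store is usually enough to catch drift before users do.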
When to Expand:
Scale Signals and Level 2 Readiness
Scaling before Level 1 has proven its value is not ambition. It is risk amplification. Multi-agent orchestration inherits every governance gap from Level 1 and multiplies it. The scale signals below are your unlock criteria. None of them are optional.
| Signal | Requirement | Verified By |
|---|---|---|
| Production stability | 30+ days in production without a P1 incident | Champion sign-off |
| SLO performance | Error rate below threshold for 2 consecutive weeks | Monitoring dashboard review |
| Cost predictability | Per-run cost within 15% of estimate for 4 consecutive weeks | Finance sign-off |
| Volume proof | Agent handled planned volume target at planned accuracy | Ops lead sign-off |
| Next use case identified | Level 2 use case scoped and champion named | Project lead sign-off |
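The cost-predictability signal is mechanical to verify from weekly data. A sketch, assuming one cost-per-run figure per week; the function name and parameters are illustrative:

```python
def cost_predictable(weekly_cost_per_run, estimate,
                     tolerance=0.15, weeks_required=4):
    """Scale signal: per-run cost within +/-15% of the business-case
    estimate for 4 consecutive weeks. Any out-of-band week resets the streak."""
    streak = 0
    for cost in weekly_cost_per_run:
        if abs(cost - estimate) <= tolerance * estimate:
            streak += 1
            if streak >= weeks_required:
                return True
        else:
            streak = 0
    return False
```

The consecutive-weeks requirement matters: four in-band weeks scattered across a noisy quarter do not demonstrate predictability, and the streak reset encodes that.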
What the Three Levels Actually Mean
Your 90-Day Roadmap Grid
Fill in your specific use case, champion name, and target dates. Each phase gate must be signed before the next phase begins. The rows labeled Phase Gate in the Phase column are gates: no phase begins until its gate is signed.
| Timeline | Phase | Actions | Owner | Date |
|---|---|---|---|---|
| Days 1–10 | Discovery: Foundation | Complete 30-min discovery conversation with champion | _________ | ___/___ |
| | | State the problem as a number (current-state baseline) | _________ | ___/___ |
| | | Confirm budget: in FY budget or approval path identified | _________ | ___/___ |
| | | Map use case candidates against the 4-quadrant matrix | _________ | ___/___ |
| Days 11–20 | Discovery: Use Case Lock | Score top 3 use cases on the 5-dimension scorecard | _________ | ___/___ |
| | | Select one use case with champion sign-off | _________ | ___/___ |
| | | Map the current process: every step, every system | _________ | ___/___ |
| | | Design the ideal process (no human bottleneck) | _________ | ___/___ |
| Days 21–30 | Phase Gate 1 | Business case drafted: current cost, projection, break-even | _________ | ___/___ |
| | | All data sources identified + API access confirmed | _________ | ___/___ |
| | | Phase Gate 1 signed: champion + project lead | _________ | ___/___ |
| | | Kick off Architecture phase | _________ | ___/___ |
| Days 31–45 | Architecture: Design | Agent Design Canvas: all 9 sections complete with worked example reviewed | _________ | ___/___ |
| | | Integration Map: every source confirmed with auth method | _________ | ___/___ |
| | | LLM cost estimate complete: model chosen, volume projected | _________ | ___/___ |
| | | Governance design: RBAC, audit, cost caps, approval gates | _________ | ___/___ |
| Days 46–55 | Phase Gate 2 | Architecture review with technical lead + champion | _________ | ___/___ |
| | | All integrations confirmed accessible (not assumed) | _________ | ___/___ |
| | | Phase Gate 2 signed: champion + tech lead | _________ | ___/___ |
| | | Begin Build sprint (First 48 Hours checklist active) | _________ | ___/___ |
| Days 56–70 | Build: Level 1 Agent | Build core agent flow per canvas spec (no-tools version first) | _________ | ___/___ |
| | | Run 50+ test scenarios including adversarial cases | _________ | ___/___ |
| | | Test handoff conditions and failure notifications | _________ | ___/___ |
| | | Staged deployment tested by 3+ internal users for 1 week | _________ | ___/___ |
| Days 71–80 | Phase Gate 3 | Full pre-deployment checklist signed (14 items) | _________ | ___/___ |
| | | RBAC and audit logging confirmed active | _________ | ___/___ |
| | | Cost caps and SLO monitoring dashboard live | _________ | ___/___ |
| | | Production deployment: champion notified | _________ | ___/___ |
| Days 81–90 | Monitor & Scale Signals | Daily log review (Week 1), weekly thereafter | _________ | ___/___ |
| | | Identify top 2 failure patterns and fix | _________ | ___/___ |
| | | 30-day production report delivered to champion | _________ | ___/___ |
| | | Scale signal checklist initiated (Phase Gate 4) | _________ | ___/___ |
Phase Gate Checklists
No gate opens until every box is checked and signed by both the project lead and the executive champion. These are not bureaucratic formalities. They are the structural mechanism that prevents the five failure modes from taking hold.
Phase Gate 1: Discovery → Architecture
- Problem stated as a number (current-state cost quantified)
- One named champion with P&L accountability confirmed
- Budget confirmed: in FY or approval path named
- One use case selected with scorecard score ≥ 18
- Current-state process mapped: every step and system identified
- Ideal-state process designed: agent role clearly defined
- Business case drafted: cost, savings projection, break-even < 6 months

Phase Gate 2: Architecture → Build
- Agent Design Canvas: all 9 sections filled and reviewed
- Integration Map: every data source confirmed with API access (not assumed)
- LLM cost estimate complete: model chosen, volume projected
- Integration pattern identified for each source (REST, DB, or file-based)
- SLOs defined in writing: error rate, latency, volume
- Governance design complete: RBAC roles, audit trail, cost caps, approval gates
- Rollback procedure documented
- Technical lead sign-off on architecture feasibility

Phase Gate 3: Build → Production
- Base agent (no tools) tested and producing correct outputs
- All tools tested in isolation before being added to agent
- 50+ test scenarios run including adversarial inputs
- Handoff condition tested: agent escalates correctly
- Failure notification tested: alert fires correctly
- Auth token expiry dates logged and refresh schedule set
- All governance infrastructure live in staging: RBAC, audit, cost caps
- Staged URL tested by 3+ internal users for minimum 5 days
- SLO baseline established from staging data

Phase Gate 4: Production → Scale
- 30+ days in production without a P1 incident
- Error rate below SLO threshold for 2 consecutive weeks
- Cost per run within 15% of estimate for 4 consecutive weeks
- Volume target met at target accuracy
- Top 2 failure patterns identified and addressed
- 30-day production report delivered to champion
- Level 2 use case identified and champion named
Start building at
architect.new
The fastest path from a completed Agent Design Canvas to a deployed agent. Plan Mode generates your PRD and agent architecture. Push to Agents brings it to life. One-click deployment puts it in production. Recommended for teams without dedicated AI engineering.
Open Architect → architect.new