How to Build Your Agentic AI Roadmap in 2026 | Architect by Lyzr

How to Build Your
Agentic AI Roadmap
in 2026

From first idea to production-grade agentic system. A phase-by-phase roadmap with frameworks, worksheets, and actionable phase gates.

95%
of enterprise AI pilots deliver no measurable P&L impact
MIT Project NANDA, July 2025
40%+
of agentic AI projects predicted to be cancelled by end of 2027
Gartner, June 2025
$1T
in SI services opportunity driven by agentic AI adoption
Google Cloud / BCG, 2025
Before You Begin

How to Use This Playbook

This playbook serves two different readers. Identify which one you are before going further. Your track determines which sections are your primary outputs and which are reference material.

Track A
First-Time Builder

You have a use case and a rough idea. You have not built an agent before and need to understand what you’re building before you commit to building it.

  • Read every section front to back
  • The Build chapter is your primary output
  • Complete all worksheets before moving to Build
  • Use the First 48 Hours section as your literal starting point
Track B
Experienced Builder, New Project

You know how to build agents. You need an organizational framework to align stakeholders, justify budget, and run the project without it stalling.

  • Agent Design Canvas is your primary output
  • Business Case Template for stakeholder alignment
  • Phase Gate Checklists to prevent sequencing errors
  • Build chapter is a reference, not a tutorial

Three Prerequisites

Before going further, answer these honestly. If you cannot answer all three yet, the phases below address each gap directly.

  1. Can you state your AI agent problem as a number? Not “improve customer service.” Something specific: “First-response time is 48 hours and needs to be under 4.”

  2. Do you have one named person with P&L accountability who owns this? Not a committee. One person who can say yes to spend.

  3. Do you know the difference between your current process and the ideal process if a human were not the bottleneck? If no, start at Phase 1 regardless of anything else.

The State of AI Agents in 2026

The technology is not failing.
The approach is.

There is a version of this moment told as a success story. AI agents are everywhere. Funding is flowing. Boards are demanding AI strategies. Here is the version that does not make the keynotes.

95%
of enterprise AI pilots deliver no measurable P&L impact despite an estimated $30–40 billion in investment
MIT Project NANDA, July 2025
42%
of companies abandoned most AI initiatives in 2025, up from just 17% the year before
S&P Global, 2025
23%
of enterprises have actually integrated agentic AI into operations, while 72–79% report adoption or testing. That gap is the opportunity.
McKinsey, State of AI 2025, November 2025

The teams that reach production are not smarter or better-funded. They are more structured. They ask different questions at the start. They phase their work in a specific order. They build governance before they need it.

The market opportunity is real. The global agentic AI market sits at $7.3–7.6 billion in 2025 and is projected to reach $139–199 billion by 2034 (40–44% CAGR). Google Cloud and BCG have identified approximately $1 trillion in global systems integrator services driven by agentic AI adoption.

Foundations

What an AI Agent Actually Is

Gartner estimates that of the thousands of vendors claiming agentic capabilities, only approximately 130 offer genuine agentic features. The rest are rebranded chatbots, RPA tools, and AI assistants: what Gartner calls “agent washing.” You cannot build a roadmap for something you cannot accurately define.

A true AI agent has five capabilities. The first four define what it is. The fifth defines how it works in practice.


Perception

Receives and interprets inputs from its environment: natural language, structured data, API responses, file contents, system events. A chatbot that only responds to typed text is not perceiving. It is matching patterns.


Action

Takes actions that affect the world beyond generating text: calls APIs, updates databases, sends messages, executes code. An agent that only produces outputs for a human to act on is an assistant. An agent that acts directly is an agent.


Memory

Retains relevant information across interactions: short-term session memory, long-term persistent memory, or domain knowledge in a retrieval layer. Without memory, every interaction starts from zero.


Autonomy

Can pursue a goal across multiple steps without human input at each step. The level of autonomy varies. Autonomy without governance is a risk. Autonomy with governance is the goal.


Orchestration

The coordination layer that sequences multiple steps, routes between tools, handles conditional logic, and manages retries when something fails. Even a simple Level 1 agent typically has at least two steps: an input parsing step and an output generation step. Orchestration is what connects them. Without it, you do not have an agent. You have a prompt with extra steps. This is the component most first-time builders underestimate, and the one most responsible for agents that work in demos but break in production.
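The idea that even a "simple" agent is orchestrated steps, not one prompt, can be sketched in a few lines. A minimal sketch: the step functions and the retry policy below are illustrative stand-ins, not any real framework's API.

```python
# Minimal orchestration sketch: two steps plus retry handling.
# parse_input and generate_output are illustrative placeholders.

def parse_input(raw: str) -> dict:
    # Step 1: turn the raw message into structured fields.
    return {"text": raw.strip(), "length": len(raw.strip())}

def generate_output(parsed: dict) -> str:
    # Step 2: produce the final response from the parsed fields.
    return f"Handled request of {parsed['length']} characters."

def run_pipeline(raw, steps, max_retries: int = 2):
    # The orchestration layer: sequence the steps, retry a failed step.
    state = raw
    for step in steps:
        for attempt in range(max_retries + 1):
            try:
                state = step(state)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: surface the failure
    return state

print(run_pipeline("  Where is my order?  ", [parse_input, generate_output]))
# -> Handled request of 18 characters.
```

The retry loop and the explicit step sequence are the "extra steps" that distinguish an agent from a bare prompt: they are where failures get handled instead of swallowed.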

The Roadmap

The 5-Phase Framework

Most agent projects fail because teams do things in the wrong order. They build before validating the problem. They deploy before designing governance. They scale before a single agent has proven value. Each phase produces something the next phase depends on. Skip one and the dependency breaks.

01
Discovery
Quantified problem, named champion, locked use case. Prevents hype-driven selection.
02
Architecture
Agent Design Canvas, Integration Map, SLOs, governance design. Prevents integration surprises.
03
Build
Level 1 agent: one trigger, one flow, one output. Prevents over-scoping and untested handoffs.
04
Deploy & Govern
Production agent with RBAC, audit, cost caps, monitoring live from day one.
05
Scale
Level 2 multi-agent architecture only after Level 1 proves value in production.
Root Causes

Why Agent Projects Fail:
The 5 Structural Root Causes

Each failure mode includes a self-diagnostic. Run these against your current situation before you start. The “Problem” column tells you what goes wrong. The “Diagnostic” column tells you whether it is already going wrong for you.

01
Hype-Driven Selection

Organizations commit budget because leadership read an article or saw a competitor announcement, without identifying a specific problem with a measurable cost. Gartner: “Most agentic AI projects right now are driven by hype and are often misapplied.”

The Fix

The business problem must be stated as a number before any technology is touched. Not “improve customer service.” Something like: “First-response time is 48 hours and needs to be under 4.”

Self-Diagnostic

Can you name the specific person whose job gets measurably easier if this works?

Can you state the current cost of the problem in hours or dollars per month?

Has someone said “we should do something with AI” without a specific process in mind?

If you answered no, no, yes: you are in hype-driven territory. Do not proceed to Architecture until you have a number.

02
Automating a Broken Process

Organizations that succeed are more than twice as likely to have redesigned their workflows before selecting technology (MIT NANDA, 2025). Agentic AI does not improve a broken process. It automates it. The broken parts run faster and create problems at higher volume.

The Fix

Map the current state. Design the ideal state assuming no human bottleneck. Build the agent for the redesigned process, not the existing one.

Self-Diagnostic

Can you draw the current process on a whiteboard, every step, every handoff, every system?

Are there steps in the current process that exist only because of human limitations (scheduling, manual lookup, copy-paste)?

Does the process produce consistent outputs, or does it depend heavily on who is doing it?

If you cannot draw the current process, or if it only works when specific people do it: redesign first. Build second.

03
Governance as an Afterthought

Gartner names three causes for its 40%+ cancellation prediction: escalating costs, unclear business value, and inadequate risk controls. All three are governance failures. Costs escalate without cost architecture. Value is unclear without SLOs. Risk controls fail when RBAC is bolted on after the fact.

The Fix

RBAC, audit trails, cost caps, and approval gates are designed in Phase 2, not added after deployment. Governance is not friction. It is how agents earn the organizational trust needed to expand.

Self-Diagnostic

Do you know what a failed agent run will cost you in LLM tokens?

Have you defined what “working correctly” means as a number, before writing any code?

If the agent sends a wrong message to a customer tomorrow, who finds out and how?

If you cannot answer all three, governance is not designed. Fill in the Phase 2 canvas before starting Build.

04
Underestimating Integration

70% of developers cite integration problems as a primary challenge. 42% of enterprises need access to 8+ data sources to deploy agents successfully, with 79% expecting data challenges to impact rollouts. An agent with a perfect prompt but unreliable data retrieval is not a production agent.

The Fix

Integration mapping is Phase 2 work, not Phase 3. Every data source is identified, access is confirmed, and auth is resolved before the first line of agent code is written.

Self-Diagnostic

Can you list every system the agent needs to read from or write to?

For each system, do you have confirmed API access, or are you assuming you can get it?

Do any of those systems require IT procurement approval that has not started?

Assumed access is not confirmed access. If any system says “we think we can get that,” stop and verify before scoping the build.

05
No Named Champion

Mid-market companies move from pilot to production in an average of 90 days. Large enterprises average 9 months or more. The difference is not resources. It is decision authority (MIT NANDA, 2025). Without someone with P&L accountability invested in the outcome, the first real obstacle pauses the project permanently.

The Fix

Before scoping begins: name one person who can say yes to spend without a committee, who feels the cost of the problem in their own metrics, and who will still care in 90 days.

Self-Diagnostic

If the project hits a real obstacle in month two, is there one specific person who will fight to keep it moving?

Does that person’s team directly feel the pain of the problem this agent solves?

Can that person approve spend without going to a committee?

Three yeses = you have a champion. Anything less = you have enthusiasm, not ownership. Do not start scoping without a champion.

Phase 01 · Discovery

The Champion-Budget-Scope
Framework

Before any agent is scoped, before any tool is selected, before any prompt is written: three things must be true. If any one of them is missing, the project will either never start or never finish.

Champion
What it is

One named person with P&L accountability who owns the outcome and can approve spend without a committee.

What it is NOT

A senior person who is “supportive.” A team that is “interested.” A steering group that will “review progress.”

Red Flags
  • “The team is very excited about this”
  • Sponsor changes quarterly
  • No single name when asked who owns it
  • Champion’s team is not the primary user
Budget
The Conversation to Have

Ask the champion directly: “Is this in your current FY budget or does it need a new approval?” In budget: proceed. Needs approval, champion can give it: proceed with timeline. Needs approval above champion: you need a co-sponsor.

Cost Range for First Agent
  • Internal developer: $15K–45K (3–6 weeks, 1–2 devs)
  • External implementation: $25K–75K
  • Platform infrastructure: $500–3K/month ongoing

A project without a committed budget number is a conversation, not a project.

Scope
The Test

YES: “Our support team handles 2,400 tier-1 tickets per month. 68% require no human judgment. We want an agent to resolve that 68% without escalation.”

NO: “We want to use AI to improve customer experience.”

First Scope Sits Where
  • Current-state cost is measurable
  • Agent handles a meaningful % without human judgment
  • Data is accessible (not locked in legacy systems)
  • Failure is visible and recoverable, not catastrophic
Phase 01 · Discovery, Continued

The 30-Minute Discovery Conversation

Run this with your champion before anything else. These questions separate real projects from wishful thinking. Take notes. The answers are the inputs to your business case.

Problem: “If this agent works perfectly, which number in your business changes, and by how much?”
Current cost: “How many hours per month does your team spend on this today? What is the error rate? What is the escalation volume?”
Data access: “What systems hold the data this agent would need? Do you have API access or would that require IT approval?”
Timeline: “What does success look like in 30 days? 90 days? Is there a business event this needs to land before?”
Constraints: “What would stop this? Who in the organization would push back, and what would they say?”
Measure: “How will we know, on a Tuesday afternoon three months from now, whether this was worth doing?”
Phase 01 · Discovery

Use Case Selection &
The Opportunity Matrix

Start with a structured inventory of where repetitive, structured, high-volume work already exists.

Where to Look: The Six Categories

| Category | Agent-Ready Examples |
|---|---|
| Customer Operations | Tier-1 support tickets, FAQ resolution, returns processing, onboarding steps that follow a decision tree |
| Finance & Compliance | Invoice matching, expense categorization, reconciliation, audit trail generation, KYC checks |
| Sales & GTM | Lead qualification scoring, outbound research, CRM data enrichment, proposal generation |
| HR & Internal Ops | Employee FAQ handling, onboarding document routing, leave request processing, policy lookups |
| Data & Reporting | Weekly report generation, data normalization, dashboard population, alert triage |
| Supply Chain & Ops | Inventory status queries, supplier communication drafting, shipment tracking, exception flagging |

The 4-Quadrant Opportunity Matrix

Plot your candidates. Two axes: Business Impact (what does this cost today, or what does it unlock?) and Implementation Complexity (data accessibility, integration count, governance requirements).

The four quadrants:

  • Build First ✓ (high impact, low complexity): Clear value + achievable scope. This is your first agent. Quantify the baseline cost today.
  • Plan for Phase 2 (high impact, high complexity): High value but too complex for a first build. Document it. Revisit after Level 1 proves value.
  • Consider for Quick Win (low impact, low complexity): Lower returns. Only pursue if you need an early internal demonstration of value.
  • Skip (low impact, high complexity): Neither the value nor the complexity justifies the distraction. Leave it off the roadmap.
Phase 01 · Discovery: Worksheet

Use Case Scoring Worksheet

Score your top 3–4 use cases across 5 dimensions. Each dimension scored 1–5. Maximum 25 points. The use case with the highest score AND a committed champion is your first build. Score ≥18 with a committed champion = proceed to Phase 2.

| Dimension | What It Measures | Use Case A | Use Case B | Use Case C |
|---|---|---|---|---|
| Measurable Cost | Can you quantify what this costs today in time, errors, or revenue impact? (1=no data, 5=precise numbers) | __ /5 | __ /5 | __ /5 |
| Data Accessibility | Is the required data in accessible systems with existing APIs? (1=locked legacy, 5=clean API ready) | __ /5 | __ /5 | __ /5 |
| Agent Coverage | What % of total volume can the agent handle without human judgment? (1=<20%, 5=>70%) | __ /5 | __ /5 | __ /5 |
| Low Governance Risk | Is failure recoverable? Can humans review before permanent action? (1=irreversible/public, 5=internal/reversible) | __ /5 | __ /5 | __ /5 |
| Champion Commitment | Does your champion personally feel this problem and own the outcome? (1=indirect, 5=primary pain owner) | __ /5 | __ /5 | __ /5 |
| Total Score | Maximum: 25 points | __ /25 | __ /25 | __ /25 |
Phase 02 · Architecture

Building Your Business Case

A business case is not a slide deck with market projections. It is a one-page document that answers three questions: what does this cost today, what will the agent change, and when do we break even? If you cannot answer all three, you are not ready to build.

| Line Item | How to Calculate | Your Number |
|---|---|---|
| Current-state baseline | Volume handled per month × average handling time × fully-loaded cost per hour = monthly cost before agent | $______ |
| Agent-handled volume | Total volume × coverage % (use a conservative estimate) | ___ units |
| Cost per intervention | Remaining human-handled cases × handling time × hourly cost + agent infrastructure cost per month | $______ |
| Gross monthly savings | Current-state baseline minus cost per intervention (agent-handled + remaining human) | $______ |
| Implementation cost | All-in: development, infrastructure setup, integration work, testing. Typical range: $25K–75K | $______ |
| Break-even timeline | Implementation cost ÷ gross monthly savings = break-even months. Target: <6 months for first agent | __ months |
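The worksheet rows reduce to a few lines of arithmetic. A runnable sketch, loosely based on the tier-1 ticket example elsewhere in this playbook; every number below is a hypothetical placeholder to replace with your own:

```python
# Break-even arithmetic from the business-case worksheet.
# All values are hypothetical assumptions, not benchmarks.
volume = 2400             # units handled per month
handle_hours = 0.25       # average handling time per unit, in hours
hourly_cost = 40.0        # fully-loaded cost per hour, in dollars
coverage = 0.60           # conservative share the agent handles
infra_monthly = 1500.0    # agent infrastructure cost per month
implementation = 45000.0  # all-in build cost

baseline = volume * handle_hours * hourly_cost            # cost before agent
remaining_human = volume * (1 - coverage) * handle_hours * hourly_cost
cost_with_agent = remaining_human + infra_monthly
gross_savings = baseline - cost_with_agent
break_even_months = implementation / gross_savings

print(f"Baseline: ${baseline:,.0f}/mo")
print(f"Gross savings: ${gross_savings:,.0f}/mo")
print(f"Break-even: {break_even_months:.1f} months")
```

With these assumed inputs the break-even lands around 3.5 months, inside the <6-month target; if your own numbers push past 6 months, the use case or the coverage estimate needs rework before Build.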
Phase 02 · Architecture: Technical Grounding

LLM Cost Estimation &
Integration Patterns

Two things that consistently surprise first-time builders: how much LLM calls cost at real volume, and how long integration confirmation actually takes. Both need to be in your business case and your timeline before Build starts.

Approximate LLM Cost at 1,000 Queries Per Month

These are order-of-magnitude estimates for a standard support agent handling queries of 200–500 tokens each, with a response of similar length. Actual costs vary significantly by prompt length, response length, and task complexity.

| Model | Provider | Approx. Monthly Cost | Best For | Trade-off |
|---|---|---|---|---|
| GPT-4o mini | OpenAI | $2–8 | Simple classification, FAQ, routing | Lower reasoning quality on complex tasks |
| Claude Haiku 3.5 | Anthropic | $2–6 | Document processing, structured extraction | Less capable on open-ended generation |
| GPT-4o | OpenAI | $25–80 | Complex reasoning, multi-step workflows | Cost grows fast at high volume |
| Claude Sonnet 4.5 | Anthropic | $20–70 | Analysis, long documents, nuanced tasks | Cost grows fast at high volume |
| Llama 3.1 (self-hosted) | Meta / your infra | Infra only | Sensitive data, high volume, cost control | Requires engineering to deploy and maintain |
| GPT-4o (10K queries) | OpenAI | $250–800 | Same as above, 10x volume | Choose a cheaper model or self-host first |
| LyzrGPT (platform) | Lyzr · multi-model | $0.03–0.08/run + LLM cost | Teams that want automatic model routing without managing separate API contracts per provider | Platform fee on top of underlying LLM cost. Best value above 5K runs/month where routing savings offset the platform layer. |

Rule of thumb: start with the cheapest model that passes your quality test. For most first agents, GPT-4o mini or Claude Haiku handles the task adequately and costs 10–20x less than flagship models at the same volume.
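To sanity-check these orders of magnitude against your own traffic, a small estimator helps. The per-1M-token prices below are assumed placeholder values, not current provider pricing; check your provider's pricing page before relying on the output.

```python
# Order-of-magnitude LLM cost estimator.
# PRICES holds assumed (input, output) USD per 1M tokens; replace with
# your provider's current published pricing.
PRICES = {
    "small-model": (0.15, 0.60),
    "flagship-model": (2.50, 10.00),
}

def monthly_cost(model: str, queries: int, in_tokens: int, out_tokens: int) -> float:
    """Estimated monthly spend for a given query volume and token profile."""
    p_in, p_out = PRICES[model]
    return queries * (in_tokens * p_in + out_tokens * p_out) / 1_000_000

# 1,000 queries/month at ~400 input and ~400 output tokens each.
print(round(monthly_cost("small-model", 1000, 400, 400), 2))
print(round(monthly_cost("flagship-model", 1000, 400, 400), 2))
```

Real prompts are usually longer than 400 tokens once retrieved context is included, which is why measured costs land above the naive per-query math; rerun the estimate with your actual average prompt size.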

Three Integration Patterns to Know Before You Build

Every first agent involves at least one integration. Here are the three patterns that cover 80% of cases, what you need to confirm before Build starts, and what breaks in production when you skip the verification step.

Pattern 01
REST API with Auth Token

The most common pattern. The agent calls an external API (CRM, ticketing system, database service) using a key or OAuth token.

  • What you need before Build: API documentation, a test API key, and a confirmed sandbox environment
  • Typical timeline to confirm: 1–5 days if IT owns the API
  • What breaks in production: tokens expire (usually 30–90 days) without automatic refresh logic. Auth breaks silently and the agent fails without a clear error.
Confirm token expiry policy and build refresh logic before go-live, not after.
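A minimal sketch of that refresh logic, refreshing proactively before expiry instead of reacting to a silent auth failure. The refresh callable and TTL handling are illustrative assumptions; wire in your provider's actual OAuth refresh call.

```python
# Proactive token refresh sketch. The refresh callable and expiry policy
# are hypothetical; adapt to your provider's OAuth documentation.
import time

class TokenManager:
    """Refreshes an access token before it expires, not after it fails."""

    def __init__(self, refresh_fn, margin_seconds: int = 300):
        self._refresh_fn = refresh_fn  # callable returning (token, ttl_seconds)
        self._margin = margin_seconds  # refresh this long before expiry
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        if self._token is None or time.time() >= self._expires_at - self._margin:
            self._token, ttl = self._refresh_fn()
            self._expires_at = time.time() + ttl
        return self._token
```

Route every API call's auth header through `get()` so an expired token becomes a refresh, never a mid-run failure with no clear error.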
Pattern 02
Database with Read-Only Service Account

The agent queries a SQL or NoSQL database directly. Most common for internal reporting tasks, data lookup, or enrichment tasks.

  • What you need before Build: a read-only service account, confirmed schema access, and a test query that returns real data
  • Typical timeline to confirm: 1–3 weeks if DBA approval is required
  • What breaks in production: schema changes in the source database break agent queries with no warning. Add schema monitoring to your SLO check.
DBA approval cycles are the most common cause of Build delays. Start the request in Phase 2, not Phase 3.
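The schema-monitoring check can be as simple as comparing the columns your agent's queries assume against what the database currently exposes. A sketch: the table and column names are hypothetical, and in production the actual columns would come from an `information_schema` query rather than a hand-built dict.

```python
# Schema drift check sketch. EXPECTED_SCHEMA lists the columns the
# agent's queries depend on; table/column names are illustrative.
EXPECTED_SCHEMA = {
    "tickets": {"id", "body", "customer_id", "status"},
}

def schema_drift(actual: dict) -> list:
    """Return human-readable drift findings; an empty list means no drift."""
    findings = []
    for table, expected_cols in EXPECTED_SCHEMA.items():
        actual_cols = actual.get(table, set())
        for col in sorted(expected_cols - actual_cols):
            findings.append(f"{table}.{col} missing")
    return findings
```

Run this on a schedule and alert on any non-empty result, so a renamed column surfaces as a warning instead of a silent string of failed agent runs.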
Pattern 03
File-Based Input via Document Upload

The agent processes uploaded documents (PDFs, CSVs, DOCX). Most common for document review tasks, contract analysis, or report generation tasks.

  • What you need before Build: sample documents in production format, confirmed file size limits, and a clear understanding of the document’s internal structure
  • Typical timeline to confirm: 1–2 days
  • What breaks in production: real documents are messier than sample documents. Tables, embedded images, and scanned PDFs break extraction logic that worked perfectly in testing.
Test with your 10 messiest real documents, not your 10 cleanest sample files.
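That messy-document test is worth automating so every failure is recorded rather than stopping at the first. A sketch: `extract` stands in for whatever PDF/CSV parsing you actually use, and the empty-extraction check catches scanned or image-only files that parse "successfully" into nothing.

```python
# Extraction smoke-test sketch: run the extractor over real sample
# documents and collect every failure. `extract` is a stand-in for
# your actual parsing function.

def smoke_test(extract, samples: dict) -> dict:
    """Return {document_name: error description} for every failing sample."""
    failures = {}
    for name, payload in samples.items():
        try:
            text = extract(payload)
            if not text or not text.strip():
                failures[name] = "empty extraction (scanned or image-only?)"
        except Exception as exc:
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures
```

Feed it the ten messiest production documents you can find; the failure dict becomes your edge-case backlog for Build.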
On Architect

All three integration patterns are handled through native tool connectivity. You authenticate once per integration at the platform level. Token refresh, schema-change monitoring, and document preprocessing are managed by the platform rather than built per agent — which removes the most common single cause of production failures from your Build scope entirely. Start at architect.new.

Phase 02 · Architecture: Worksheet

The Agent Design Canvas

Complete all 9 sections before writing any code. If you cannot fill in a section, that section is your next task, not something to figure out during Build. An incomplete canvas is a risk register. The right column shows a worked example for a tier-1 support agent.

01 · Purpose
One sentence: what does this agent do, for whom, and how is success measured?
Example: Resolve tier-1 support tickets for SaaS customers without human intervention. Success: 65%+ resolution rate with <3% error rate.

02 · Trigger
What event starts the agent? (Incoming message / scheduled time / API event / manual)
Example: Incoming support ticket via API webhook from Zendesk. Fires on every new ticket tagged “tier-1”.

03 · Inputs
What data does the agent receive? Where does it come from? What format?
Example: Ticket body (plain text), customer ID, account tier. Source: Zendesk webhook JSON. Also retrieves account history from CRM via REST API call.

04 · Actions
What does the agent do? List every action it can take: API calls, DB writes, messages sent.
Example: 1. Query knowledge base for relevant articles. 2. Draft response using retrieved context. 3. Post reply to Zendesk ticket. 4. Mark ticket as resolved or escalate tag.

05 · Integrations
Every external system. Confirm API access and auth method for each.
Example: Zendesk API (OAuth, confirmed). Internal knowledge base (REST API, read-only key, confirmed). CRM (service account, DBA approval pending).

06 · Memory
Short-term (session only) or long-term (persist across sessions)? What must it remember?
Example: Short-term only for v1. Agent holds ticket context within a single session. Long-term (customer history retrieval) is Phase 2 scope.

07 · Handoff
When does the agent escalate to a human? Under what conditions does it stop?
Example: Escalate if: ticket contains billing dispute keyword, customer is enterprise tier, confidence score below 0.7, or agent has attempted 2 responses with no resolution.

08 · SLOs
Numbers: error rate ≤ X%, response latency ≤ Y seconds, volume threshold ≥ Z/day.
Example: Error rate ≤ 3%. Latency ≤ 8 seconds P95. Resolution rate ≥ 65% of assigned tickets. Volume: handles 80+ tickets/day without degradation.

09 · Failure Mode
What does a bad outcome look like? Is it reversible? Who is notified?
Example: Agent sends incorrect resolution. Reversible (customer can reopen). Alert fires to support lead via Slack within 5 minutes of error detection. Weekly error report to champion.
Phase 03 · Build

Building Your First Agent

Build starts only after the Agent Design Canvas is complete and signed. Everything in Phase 2 was the work that makes Build predictable. This chapter covers what to build with, how to choose a model, how the components fit together, and what to do on your literal first day.

Phase 03 · Build: Model Selection

Choosing Your Model

The most consequential early decision is which model to use. The right answer depends on your task type, your volume, and your cost constraints, not on which model sounds most impressive. Start with the cheapest model that passes your quality test. Upgrade only when you can measure the quality gap.

| Task Type | Reasoning | Default Model |
|---|---|---|
| Classification, routing, simple FAQ | Pattern-matching tasks. Response quality difference between cheap and expensive models is minimal. Cost difference is 10–20x. | GPT-4o mini or Claude Haiku |
| Structured data extraction from documents | Consistent output format matters more than reasoning depth. Cheaper models handle this well when the prompt is precise. | Claude Haiku or GPT-4o mini |
| Multi-step reasoning, complex analysis | Tasks where the model needs to hold multiple variables, weigh trade-offs, or follow complex instructions benefit from a stronger model. | GPT-4o or Claude Sonnet |
| Long document processing (>50 pages) | Context window size becomes the constraint. You need a model that stays reliable across a large context window, not one that merely advertises a large window. | Gemini 1.5 Pro or Claude Sonnet |
| Sensitive data (PII, healthcare, finance) | Data must not leave your infrastructure. Self-hosted open-source models are the only option that guarantees this at the infrastructure level. | Llama 3.1 (self-hosted) |
| High-volume production (>50K queries/month) | At this volume, model cost dominates your infrastructure cost. A 10x cost reduction per query is worth significant engineering investment. | Self-hosted or fine-tuned cheaper model |

If you are unsure: start with GPT-4o mini or Claude Haiku for your first build. Run 50 real test cases. If quality is insufficient, upgrade to GPT-4o or Claude Sonnet. The quality gap is usually smaller than expected, and the cost gap is always larger.
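That quality test can be as simple as a pass-rate comparison over the same 50 cases. A sketch: the model callable and the keyword-containment check below are illustrative assumptions; substitute your real LLM call and whatever "correct" means for your task.

```python
# Pass-rate comparison sketch for the "50 real test cases" quality test.
# ask_model is any callable taking a prompt and returning a string.

def pass_rate(ask_model, cases: list) -> float:
    """cases: (input, keyword the answer must contain). Returns pass fraction."""
    passed = sum(
        1 for prompt, expected in cases
        if expected.lower() in ask_model(prompt).lower()
    )
    return passed / len(cases)

# Decide on data, not vibes: upgrade only if the measured gap justifies
# the measured cost difference, e.g.
# gap = pass_rate(ask_flagship, cases) - pass_rate(ask_cheap, cases)
```

If the cheap model's pass rate is within a few points of the flagship's, the 10–20x cost difference settles the choice.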

Phase 03 · Build: Platform Choice

Choosing How to Build

There are three categories of build approach. The right one depends on your team’s technical capability, not on which sounds most sophisticated. The goal is a working agent in production, not an impressive architecture diagram.

Start Narrow, Not Broad

A Level 1 agent has one trigger, one primary flow, and one output. Resist scope expansion during build. Every feature added before the base case works is a failure mode that is harder to debug. Build the happy path first. Add edge case handling second. Add features third.

Tool Category
No-Code
Architect, n8n, Relevance AI. Best for teams without a dedicated developer. Visual workflow builders with pre-built integrations.
SDK
LangChain, LlamaIndex, CrewAI. Best for teams with a developer who can write Python. More control, more setup.
Raw API
Direct API calls to OpenAI, Anthropic, etc. Best for production-grade systems where you need full control. Highest engineering overhead.
On Architect (architect.new)

Use Guided Mode to generate your Plan Document and agent architecture from your Design Canvas. The platform selects models and populates prompts from enterprise templates. Recommended for teams without dedicated AI engineering.

Write the SLO Before the Code

Define “working” in numbers before a single line is written. “Error rate below 3% on first-pass resolution. Latency under 8 seconds P95. Volume threshold: handles 100+ queries/day without degradation.” These numbers are your exit criteria for Build and your entry criteria for Deploy.
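Those exit criteria are only useful if you can compute them from run logs. A minimal sketch, assuming each run log is a dict with `latency_s` and `error` fields; adapt the field names to whatever your logging layer records.

```python
# SLO check sketch: compute error rate and P95 latency from run logs
# and compare against the targets set before Build. Log shape assumed.

def slo_report(runs: list, max_error_rate: float = 0.03, max_p95_s: float = 8.0) -> dict:
    """Summarize a batch of run logs against the SLO targets."""
    latencies = sorted(r["latency_s"] for r in runs)
    # Index of the 95th-percentile latency (nearest-rank, clamped).
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    error_rate = sum(1 for r in runs if r["error"]) / len(runs)
    return {
        "error_rate": error_rate,
        "p95_latency_s": p95,
        "meets_slo": error_rate <= max_error_rate and p95 <= max_p95_s,
    }
```

Run this weekly at minimum; a `meets_slo` of False is the signal to pause expansion and debug, not a number to explain away.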

What SLO Monitoring Looks Like
No-Code
Most platforms have built-in run logging. Export logs to a spreadsheet or dashboard. Check weekly minimum.
SDK
LangSmith (LangChain), Helicone, or Langfuse for tracing. Add a logging wrapper to every agent call.
Raw API
Build your own logging layer. Log input, output, latency, and token count for every call. Store in your database of choice.
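A minimal version of that logging layer for the raw-API route, written as a decorator so every call is recorded, including failures. The log shape is an assumption; extend it with the token counts your provider returns in its usage metadata.

```python
# Logging-wrapper sketch: record input, output, error, and latency for
# every LLM call. Extend the dict with token counts from your provider.
import time

RUN_LOG: list = []

def logged(call):
    """Wrap an LLM call so every run is recorded, even when it raises."""
    def wrapper(prompt: str, **kwargs):
        start = time.perf_counter()
        try:
            result = call(prompt, **kwargs)
            error = None
            return result
        except Exception as exc:
            result, error = None, repr(exc)
            raise
        finally:
            RUN_LOG.append({
                "prompt": prompt,
                "output": result,
                "error": error,
                "latency_s": time.perf_counter() - start,
            })
    return wrapper
```

In production, replace the in-memory list with writes to your database of choice; the wrapper shape stays the same.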
On Architect (architect.new)

Push to Agents creates your agent architecture with locked-in prompts, KB integration, and tool connections. Run logs are visible in the platform. Edit agent parameters in Studio before deployment.

Test Before Trust

Before any user-facing deployment: run 50+ test scenarios, including adversarial inputs. Test the handoff condition: confirm the agent actually escalates when it should. Test the failure mode: confirm the failure notification fires. A production agent with an untested handoff path is a liability, not a product.

What Testing Looks Like
No-Code
Run test cases manually through the platform UI. Document pass/fail. Use your 10 most common real inputs and your 5 most problematic edge cases.
SDK
Write a test suite using pytest or equivalent. Run it before every deployment. Add adversarial cases as you discover edge cases in production.
Raw API
Build an evaluation harness. Run the same 50 cases through every prompt iteration. Track pass rate across versions.
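A minimal version of such a harness: run every fixed case through the current agent version and report the pass rate plus the failing cases. Exact-match checking is a placeholder for whatever "correct" means for your task.

```python
# Evaluation-harness sketch: same cases, every prompt iteration, tracked
# pass rate. The exact-match check is illustrative; swap in your own.

def evaluate(agent, cases: list) -> dict:
    """cases: (input, expected output). Returns pass rate and failures."""
    results = [(inp, expected, agent(inp)) for inp, expected in cases]
    failures = [(inp, expected, got) for inp, expected, got in results if got != expected]
    return {
        "pass_rate": 1 - len(failures) / len(cases),
        "failures": failures,
    }
```

Store each version's pass rate alongside the prompt text, so a prompt "improvement" that regresses older cases is caught before it ships.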
On Architect (architect.new)

Use the Live Preview in the App tab to test agent responses in real time. Deploy to a staged URL on Netlify before sharing externally. The staged URL is your test environment.

Phase 03 · Build: How It Works

Agent Architecture:
The Four Components

Every agent, regardless of platform or framework, is made of four components. Understanding what each one does tells you what to configure, what to test, and what breaks when something goes wrong.

Component 01
The LLM (The Brain)

The model that reasons, plans, and decides what to do next. It reads the input, retrieves from memory if needed, chooses which tools to call, and generates the output. The LLM does not take actions directly. It decides what actions to take, then the orchestration layer executes them.

What you configure: model choice, system prompt, temperature, max tokens, and stop conditions.
Component 02
Memory

What the agent knows and remembers. Short-term memory holds the current session: the conversation so far, retrieved documents, intermediate results. Long-term memory persists across sessions: customer history, learned preferences, past outcomes. Most first agents need only short-term memory. Add long-term memory only when you can measure the quality improvement it provides.

What you configure: context window size, retrieval method (RAG, vector search, SQL lookup), what gets stored and for how long.
Component 03
Tools

The list of actions the agent can take. Each tool is a function with a name, a description, and an input/output schema. The LLM reads the tool descriptions and decides which one to call based on the task. Tools are what turn a chatbot into an agent. Without tools, the agent can only generate text. With tools, it can search a database, send an email, update a CRM record, or call any API you have defined.

What you configure: tool name, description (the LLM reads this to decide when to use it), input parameters, output format, and error handling.
Component 04
The Run Loop (Orchestration)

The engine that keeps the agent working. It runs a cycle: observe the current state, reason about what to do next, take an action, check if the goal is reached. It continues until either the goal is complete, a stop condition fires (max steps, error threshold), or the agent decides to hand off to a human. The run loop is also where retry logic, error handling, and fallback behavior live. A well-designed run loop is the difference between an agent that fails gracefully and one that fails silently.

What you configure: max iterations, retry logic, fallback model, stop conditions, and handoff triggers.
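The observe-reason-act cycle can be sketched in a few lines. Below, `decide` stands in for the LLM's reasoning step and `TOOLS` for your registered actions; both are hypothetical placeholders, not a real framework's API.

```python
# Run-loop sketch: observe, reason, act, check, with a max-step stop
# condition and an explicit human-handoff path.

TOOLS = {
    "lookup": lambda arg: f"result for {arg}",  # illustrative tool
}

def run_agent(goal: str, decide, max_steps: int = 5) -> dict:
    state = {"goal": goal, "history": []}
    for _ in range(max_steps):
        action = decide(state)                        # reason: pick next action
        if action["type"] == "finish":
            return {"status": "done", "answer": action["answer"]}
        if action["type"] == "handoff":
            return {"status": "escalated", "reason": action["reason"]}
        tool = TOOLS[action["tool"]]                  # act: call the chosen tool
        state["history"].append(tool(action["arg"]))  # observe the result
    # Stop condition fired: fail loudly to a human, never silently.
    return {"status": "escalated", "reason": "max steps reached"}
```

Note that every exit path returns an explicit status: that is the difference between an agent that fails gracefully and one that fails silently.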
Phase 03 · Build: Getting Started

Your First 48 Hours

This is the section that converts a completed Design Canvas into something running. Follow these steps in order. Do not skip to step 4 because it sounds more interesting. The point of steps 1–3 is to eliminate variables before you add complexity.

  • 1
    Get API access to your chosen model

Create an account with your model provider (OpenAI, Anthropic, or your platform of choice). Get an API key. Set a spending limit before you do anything else, and set it low: around $20. You will hit it and raise it deliberately; what you will never do is accidentally spend $500 on a misconfigured loop.

    If you are using a no-code platform like Architect, create your account and run one of the default example agents before building your own. Confirm it works end-to-end before touching your use case.

  • 2
    Build a version with no tools and no integrations

    Write a system prompt that describes your agent’s role and goal. Send it a real example input from your use case. Look at the output. Is the reasoning coherent? Is the format correct? Fix the prompt until you get output you would be comfortable showing a user. This step has zero integration risk and tells you how hard your prompt engineering job is going to be.

    Time to complete: 2–4 hours. If it takes longer, your task is more complex than your Design Canvas suggests. Revisit section 08 (SLO) before continuing.

  • 3
    Add one tool. Test it in isolation.

    Add the first tool from your Design Canvas section 04. Call it manually with a test input. Confirm the output is what you expect. Then add it to the agent and run the same test input you used in step 2. Confirm the agent uses the tool correctly and that the output improves. Add tools one at a time. Never add two at once.

    The most common mistake: adding all tools at once, getting a failure, and not knowing which tool caused it.

  • 4
    Test your handoff condition before anything else goes to production

    Before adding more tools or more complexity: deliberately trigger the handoff condition from your Design Canvas section 07. Send an input that should cause escalation. Confirm the agent escalates correctly and that the right person is notified. This is the most commonly skipped test and the most consequential one.

  • 5
    Run 20 real inputs, document every failure

    Pull 20 real examples from your use case (not invented test cases). Run them through the agent. For every failure, note what went wrong: wrong tool call, wrong format, wrong reasoning, or missed handoff trigger. Fix the most common failure pattern before adding more test cases. Repeat until pass rate is above 80% on 20 cases, then scale to 50.

  • 6
    Deploy to a staged URL and test with 3 internal users

    Get the agent running at a URL that real users can access. Ask 3 people from the target team to use it for one week with real inputs. Collect every failure. Fix the top 3 most common issues. Only after this step should you consider production deployment.

    On Architect: use the Deploy button to get a live Netlify URL. Share it with your internal testers before announcing it broadly.

  • 7
    Activate governance before opening to more users

    Before more than 5 people have access: turn on audit logging, set your cost cap, configure RBAC, and confirm the SLO monitoring dashboard is live. These are not optional extras to add later. They are Phase Gate 3 requirements. If they are not active, you are not ready to deploy broadly.
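Step 5's run-and-categorize pass can be sketched as a small evaluation harness: run real inputs, tag each failure with a category, and surface the most common pattern to fix first. The stub agent, check functions, and category names here are all illustrative assumptions.

```python
# Sketch of the 20-input evaluation pass from step 5: each case pairs a real
# input with a check that returns None on pass or a failure category string.
from collections import Counter

def evaluate(agent, cases):
    """cases: list of (input_text, check_fn); check_fn returns None on pass,
    or a category like 'wrong_tool', 'wrong_format', 'missed_handoff'."""
    failures = Counter()
    for text, check in cases:
        category = check(agent(text))
        if category:
            failures[category] += 1
    pass_rate = 1 - sum(failures.values()) / len(cases)
    return pass_rate, failures.most_common()

# Illustrative stub agent and checks.
agent = lambda text: text.upper()
cases = [
    ("hello", lambda out: None if out == "HELLO" else "wrong_format"),
    ("abc",   lambda out: None if out == "abc" else "wrong_format"),  # will fail
]
rate, top = evaluate(agent, cases)
```

The output of `most_common()` is your fix queue: address the top pattern, re-run, repeat until the pass rate clears 80%.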

Phase 03 · Build: What to Avoid

Common First Agent Mistakes

These are the patterns that consistently kill agents between demo and production. They are not edge cases. They are the norm.

Context windows that overflow on real data

Your prompt works perfectly on a 200-word test input and breaks silently on a 2,000-word real document. Set a hard limit on input size and add a truncation or chunking step before the agent runs. Test with your longest real inputs, not your shortest.
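The guard described above can be sketched as a pre-flight check that runs before the agent ever sees the input. The 4-characters-per-token estimate and the budget numbers are assumptions; use your provider's tokenizer and your model's real context limit.

```python
# Sketch of a hard input-size limit with truncation and a chunking alternative.
# Token counts are approximated as len(text) // 4 -- an assumption.

MAX_INPUT_TOKENS = 6000

def approx_tokens(text: str) -> int:
    return len(text) // 4

def guard_input(text: str) -> str:
    """Truncate oversized input to the budget before the agent runs."""
    if approx_tokens(text) <= MAX_INPUT_TOKENS:
        return text
    return text[: MAX_INPUT_TOKENS * 4]

def chunk_input(text: str, chunk_tokens: int = 2000) -> list[str]:
    """Chunking alternative: split instead of cutting, for multi-pass processing."""
    size = chunk_tokens * 4
    return [text[i : i + size] for i in range(0, len(text), size)]
```

Whether you truncate or chunk depends on the task; what matters is that the decision is explicit rather than left to a silent overflow.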

Prompts that work in testing and break on edge cases

The model does exactly what your prompt says. The problem is that your prompt describes what to do in the happy path, not what to do when inputs are ambiguous, incomplete, or adversarial. Add explicit instructions for what the agent should do when it is uncertain. “If you cannot determine X with confidence, escalate” beats leaving it to the model’s judgment.

Tools that work in isolation but fail when chained

Tool A works. Tool B works. When the agent calls Tool A and passes its output to Tool B, the output format from A does not match the input format B expects, and the whole chain fails. Always test the full sequence end-to-end, not just individual tools. Define the input and output schema for every tool explicitly and test handoffs between them.
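An explicit check at the seam can be sketched as below: validate Tool A's output against Tool B's declared input schema before chaining, so a format mismatch fails loudly at the handoff instead of deep inside Tool B. The tools, fields, and schema format here are illustrative assumptions.

```python
# Sketch of a schema check at a tool-to-tool handoff.

def validate(payload: dict, schema: dict) -> dict:
    """schema maps field name -> expected Python type."""
    for field, ftype in schema.items():
        if field not in payload:
            raise ValueError(f"handoff failed: missing field '{field}'")
        if not isinstance(payload[field], ftype):
            raise TypeError(f"handoff failed: '{field}' is not {ftype.__name__}")
    return payload

TOOL_B_INPUT = {"customer_id": str, "amount": float}

def tool_a(order_id: str) -> dict:
    return {"customer_id": "C-42", "amount": 19.99}

def tool_b(payload: dict) -> str:
    return f"refunded {payload['amount']} to {payload['customer_id']}"

# Validate A's output against B's input schema before chaining.
result = tool_b(validate(tool_a("O-1"), TOOL_B_INPUT))
```

The same check doubles as an end-to-end test fixture: run it with every tool pair in your chain, not just each tool alone.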

Auth tokens that expire after 30 days

You authenticate, the agent works, you deploy. 30 days later, the auth token expires and the agent fails on every call with a cryptic error. Implement token refresh logic before production or set a calendar reminder to rotate tokens before they expire. The first expiry usually happens when you are not watching.
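Refresh-before-expiry logic can be sketched as below, assuming your auth provider returns an expiry timestamp alongside each token. `fetch_token` is a stand-in for your auth endpoint, not a real API.

```python
# Sketch of proactive token refresh: renew before expiry, not after failure.
import time

class TokenManager:
    def __init__(self, fetch_token, refresh_margin_s: float = 24 * 3600):
        self._fetch = fetch_token          # returns (token, expires_at_epoch)
        self._margin = refresh_margin_s    # refresh this long before expiry
        self._token, self._expires_at = self._fetch()

    def get(self) -> str:
        # Refresh inside the margin window, so a live call never hits an
        # already-expired token.
        if time.time() >= self._expires_at - self._margin:
            self._token, self._expires_at = self._fetch()
        return self._token

calls = []
def fetch_token():
    calls.append(1)
    return f"tok-{len(calls)}", time.time() + 30 * 24 * 3600  # 30-day token

tm = TokenManager(fetch_token)
```

Every tool call then goes through `tm.get()` instead of a token constant. If your provider offers no expiry metadata, the calendar reminder is the fallback.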

No structured output enforcement

You ask the model to return JSON. It returns JSON most of the time. On 3% of calls, it wraps the JSON in markdown backticks, adds an explanation before it, or returns slightly different field names. Your downstream system breaks. Use structured output mode (function calling / JSON mode) where available. Never parse free-text output with regex in production.
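When structured output mode is not available, the fallback is defensive parsing: strip a markdown wrapper if present, parse, and validate field names before anything downstream runs. This sketch assumes the field names and wrapper shapes shown; prefer your provider's JSON or function-calling mode where it exists.

```python
# Sketch of defensive parsing of model output that is supposed to be JSON.
import json

REQUIRED_FIELDS = {"intent", "confidence"}

def parse_model_output(raw: str) -> dict:
    text = raw.strip()
    if text.startswith("```"):
        # Drop a ```json ... ``` wrapper if the model added one.
        text = text.split("```")[1]
        text = text.removeprefix("json").strip()
    data = json.loads(text)          # raises on non-JSON, by design
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

ok = parse_model_output('```json\n{"intent": "refund", "confidence": 0.92}\n```')
```

Failing loudly on the 3% of malformed calls is the point: a raised exception routes to your retry or escalation path, while a regex that "mostly works" corrupts data silently.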

No retry logic on transient failures

API calls fail. Networks have timeouts. Model providers have brief outages. An agent with no retry logic treats a 500ms network hiccup the same as a genuine error. Implement exponential backoff with a maximum of 3 retries on any external API call. Log every retry. Alert on any call that exhausts all retries.
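The retry policy above can be sketched as a small wrapper: exponential backoff, at most 3 retries, a log line per retry, and a re-raise when retries are exhausted (which is what your alert fires on). `call` stands in for any external API call; the sleep function is injectable so the sketch stays testable.

```python
# Sketch of exponential backoff with a retry cap and per-retry logging.
import time

def with_retries(call, max_retries: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as exc:
            if attempt == max_retries:
                # All retries exhausted: re-raise so monitoring can alert.
                raise
            delay = base_delay * (2 ** attempt)   # 0.5s, 1s, 2s, ...
            print(f"retry {attempt + 1}/{max_retries} after {delay}s: {exc}")
            sleep(delay)

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("transient network hiccup")
    return "ok"

result = with_retries(flaky, sleep=lambda s: None)  # succeeds on the third attempt
```

A production version would retry only on transient error classes (timeouts, 429s, 5xx) rather than every exception; catching everything here keeps the sketch short.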

Real documents are messier than sample documents

You test with 5 clean PDFs. Production has 500 PDFs; 30% of them are scanned images with no text layer, contain tables in non-standard formats, or are password-protected. Test your document processing pipeline on your 20 messiest real documents before build is complete. If any fail, solve that before deployment.

Skipping the handoff test

The agent handles the happy path in testing. Nobody tests whether the escalation path actually works. The first time a production case should escalate, the escalation fails silently and the customer gets no response. Test your handoff condition before anything else goes to production. It is step 4 in the First 48 Hours section for this reason.

Phase 04 · Deploy & Govern

Governance: The Layer That
Determines Whether Agents Scale

Governance is not a compliance checkbox. It is the mechanism by which agents earn the organizational trust needed to expand. Build it before you need it. By the time you need it, it is already too late to add it cleanly.

Layer | What It Is | Minimum Requirement
RBAC | Role-Based Access Control | Four roles defined before deployment: Admin, Operator, Viewer, Override. No agent accessible without authentication.
Audit Trail | Every action, logged | Every agent action logged with timestamp, agent ID, input received, output produced, and any human override. Logs retained minimum 90 days.
Cost Caps | Per-run and monthly limits | Maximum cost per agent run + monthly budget ceiling. Alert at >2x normal token consumption per session.
Approval Gates | Human-in-the-loop for high stakes | Any irreversible action (external communication, financial record modification, data deletion) requires human approval until error rate is below SLO for 30+ consecutive days.
SLO Monitoring | Numbers, not feelings | Monitor: error rate (% of runs with incorrect output), latency (P95 response time), volume throughput. Alert at 1.5x baseline. Escalate at 2x.
Prompt Versioning | Change control for AI | Every change to prompt, model, or tool configuration version-controlled with timestamp and author. Rollback possible within 10 minutes. No prompt changes in production without signed review.
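The alert and escalation thresholds in the SLO Monitoring row reduce to a simple comparison against baseline. The metric names below are illustrative:

```python
# Sketch of the SLO thresholds: alert at 1.5x baseline, escalate at 2x.

def slo_status(current: float, baseline: float) -> str:
    if current >= 2.0 * baseline:
        return "ESCALATE"
    if current >= 1.5 * baseline:
        return "ALERT"
    return "OK"

# Example: baseline error rate 2%, current 4.5% -> more than 2x baseline.
status = slo_status(current=0.045, baseline=0.02)
```

The same function applies to error rate, latency, or cost per run; what changes per metric is the baseline, which Week 1 of production exists to establish.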
Phase 04 · Deploy: Pre-Flight

Pre-Deployment Checklist

Every box must be checked before production deployment. If any box remains unchecked, it is a risk, not a detail.

  • Agent Design Canvas complete and signed by champion and project lead
  • Integration Map: all data sources confirmed with API access and auth
  • SLOs defined in writing: error rate, latency, volume thresholds
  • RBAC configured: all four roles defined and assigned
  • Audit logging active: test log entries verified
  • Cost caps set: per-run limit and monthly ceiling configured
  • Approval gates configured for all irreversible actions
  • 50+ test scenarios run including adversarial inputs
  • Handoff condition tested: agent escalates correctly
  • Failure notification tested: alert fires correctly
  • Staged deployment URL tested by 3+ internal users
  • Rollback procedure documented and tested
  • Auth token expiry dates noted and refresh schedule set
  • SLO monitoring dashboard live and verified
Phase 04 · Deploy: What Comes Next

What a Real Agent Looks Like
at 30 Days

The most common unknown for first-time builders is not how to build the agent. It is how to know whether the agent is working after deployment. This section tells you what a healthy agent looks like, what a struggling one looks like, and what to do about the difference.

Week 1
Establish Baseline
  • Log every run. Review logs daily, not weekly.
  • Record actual error rate, latency, and volume
  • Note every case that escalated to a human and why
  • Document every case that should have escalated but didn’t
  • Compare actual cost per run to estimate from business case
  • Your SLO targets are aspirational at this stage. Your job is measurement, not optimization.
Weeks 2–3
Identify Patterns
  • Categorize failure modes: are they prompt failures, tool failures, or integration failures?
  • Identify the most common input type that causes failures
  • Check whether escalation rate is trending up or down
  • Confirm cost per run is stable and predictable
  • Fix the top 2 failure patterns. Version-control every change.
  • Do not fix more than 2 things at once. You need to know which fix worked.
Week 4
30-Day Report to Champion
  • Total volume handled vs. baseline estimate
  • Actual resolution rate vs. SLO target
  • Actual cost per run vs. business case estimate
  • Top 3 failure modes identified and status (fixed / in progress / accepted risk)
  • Recommendation: proceed to scale or extend Level 1 stabilization
  • This report is Phase Gate 4 input. No scale decision without it.

Drift and Degradation: When to Worry

Agents degrade over time even when nothing in the agent changes. Source data drifts, user input patterns shift, and the real-world distribution diverges from the test distribution. Here is what to watch for and what to do about it.

Warning Sign | Severity | Likely Cause | First Response
Error rate up 25–50% from baseline | WATCH | Input pattern shift, prompt edge case surfacing | Review last 20 failed cases. Identify common pattern. Update prompt.
Error rate up 2x or more from baseline | ACT NOW | Integration failure, source data change, model behavior shift | Pause agent. Check integrations. Check source data schema. Roll back last prompt change.
Latency up 50% with no volume change | WATCH | Provider API slowdown, tool timeout, context window growth | Check provider status. Check tool response times. Check if prompt has grown.
Escalation rate trending steadily upward | WATCH | New input types not covered by training distribution | Categorize escalation reasons. Add handling for top 2 uncovered cases.
Cost per run trending upward unexpectedly | WATCH | Input length growth, prompt chain getting longer, retry rate increasing | Check average input token count. Check retry logs. Check if any tool is failing and triggering retries.
Integration auth failure | ACT NOW | Token expiry, API key rotation, endpoint change | Check auth token expiry. Rotate and update. Check API changelog for endpoint changes.
Phase 05 · Scale

When to Expand:
Scale Signals and Level 2 Readiness

Scaling before Level 1 proves value is not ambition. It is risk amplification. Multi-agent orchestration inherits every governance gap from Level 1 and multiplies it. The scale signals below are your unlock criteria. None of them are optional.

Signal | Requirement | Verified By
Production stability | 30+ days in production without a P1 incident | Champion sign-off
SLO performance | Error rate below threshold for 2 consecutive weeks | Monitoring dashboard review
Cost predictability | Per-run cost within 15% of estimate for 4 consecutive weeks | Finance sign-off
Volume proof | Agent handled planned volume target at planned accuracy | Ops lead sign-off
Next use case identified | Level 2 use case scoped and champion named | Project lead sign-off

What the Three Levels Actually Mean

Level 1: Start Here
Architecture: Single agent
Characteristics: One trigger, one flow, defined SLOs, human oversight high
Build this first. Prove value. Then earn Level 2.

Level 2: After Scale Signals
Architecture: Manager + specialist agents
Characteristics: Parallel execution, cross-agent handoffs, governance layer critical
Only after all scale signals confirmed. New governance review required.

Level 3: Enterprise Only
Architecture: Cross-functional agent networks
Characteristics: Agents across departments, shared memory + knowledge graph
Enterprise-only. Requires dedicated AI ops function.
Operating Plan

Your 90-Day Roadmap Grid

Fill in your specific use case, champion name, and target dates. Each phase gate must be signed before the next phase begins. Amber rows are Phase Gates: no phase begins without the gate signed.

Days 1–10 · Discovery: Foundation
  • Complete 30-min discovery conversation with champion · Owner: ________ · Date: ___/___
  • State the problem as a number (current-state baseline) · Owner: ________ · Date: ___/___
  • Confirm budget: in FY budget or approval path identified · Owner: ________ · Date: ___/___
  • Map use case candidates against the 4-quadrant matrix · Owner: ________ · Date: ___/___

Days 11–20 · Discovery: Use Case Lock
  • Score top 3 use cases on the 5-dimension scorecard · Owner: ________ · Date: ___/___
  • Select one use case with champion sign-off · Owner: ________ · Date: ___/___
  • Map the current process: every step, every system · Owner: ________ · Date: ___/___
  • Design the ideal process (no human bottleneck) · Owner: ________ · Date: ___/___

Days 21–30 · Phase Gate 1
  • Business case drafted: current cost, projection, break-even · Owner: ________ · Date: ___/___
  • All data sources identified + API access confirmed · Owner: ________ · Date: ___/___
  • Phase Gate 1 signed: champion + project lead · Owner: ________ · Date: ___/___
  • Kick off Architecture phase · Owner: ________ · Date: ___/___

Days 31–45 · Architecture: Design
  • Agent Design Canvas: all 9 sections complete with worked example reviewed · Owner: ________ · Date: ___/___
  • Integration Map: every source confirmed with auth method · Owner: ________ · Date: ___/___
  • LLM cost estimate complete: model chosen, volume projected · Owner: ________ · Date: ___/___
  • Governance design: RBAC, audit, cost caps, approval gates · Owner: ________ · Date: ___/___

Days 46–55 · Phase Gate 2
  • Architecture review with technical lead + champion · Owner: ________ · Date: ___/___
  • All integrations confirmed accessible (not assumed) · Owner: ________ · Date: ___/___
  • Phase Gate 2 signed: champion + tech lead · Owner: ________ · Date: ___/___
  • Begin Build sprint (First 48 Hours checklist active) · Owner: ________ · Date: ___/___

Days 56–70 · Build: Level 1 Agent
  • Build core agent flow per canvas spec (no-tools version first) · Owner: ________ · Date: ___/___
  • Run 50+ test scenarios including adversarial cases · Owner: ________ · Date: ___/___
  • Test handoff conditions and failure notifications · Owner: ________ · Date: ___/___
  • Staged deployment tested by 3+ internal users for 1 week · Owner: ________ · Date: ___/___

Days 71–80 · Phase Gate 3
  • Full pre-deployment checklist signed (14 items) · Owner: ________ · Date: ___/___
  • RBAC and audit logging confirmed active · Owner: ________ · Date: ___/___
  • Cost caps and SLO monitoring dashboard live · Owner: ________ · Date: ___/___
  • Production deployment: champion notified · Owner: ________ · Date: ___/___

Days 81–90 · Monitor & Scale Signals
  • Daily log review (Week 1), weekly thereafter · Owner: ________ · Date: ___/___
  • Identify top 2 failure patterns and fix · Owner: ________ · Date: ___/___
  • 30-day production report delivered to champion · Owner: ________ · Date: ___/___
  • Scale signal checklist initiated (Phase Gate 4) · Owner: ________ · Date: ___/___
Master Checklist

Phase Gate Checklists

No gate opens until every box is checked and signed by both the project lead and the executive champion. These are not bureaucratic formalities. They are the structural mechanism that prevents the five failure modes from taking hold.

Phase Gate 01
Discovery Complete
Champion: _________________    Date: _______    Sign: _________
  • Problem stated as a number (current-state cost quantified)
  • One named champion with P&L accountability confirmed
  • Budget confirmed: in FY or approval path named
  • One use case selected with scorecard score ≥ 18
  • Current-state process mapped: every step and system identified
  • Ideal-state process designed: agent role clearly defined
  • Business case drafted: cost, savings projection, break-even < 6 months
Phase Gate 02
Architecture Complete
Champion: _________________    Date: _______    Sign: _________
  • Agent Design Canvas: all 9 sections filled and reviewed
  • Integration Map: every data source confirmed with API access (not assumed)
  • LLM cost estimate complete: model chosen, volume projected
  • Integration pattern identified for each source (REST, DB, or file-based)
  • SLOs defined in writing: error rate, latency, volume
  • Governance design complete: RBAC roles, audit trail, cost caps, approval gates
  • Rollback procedure documented
  • Technical lead sign-off on architecture feasibility
Phase Gate 03
Build Complete / Pre-Deployment
Champion: _________________    Date: _______    Sign: _________
  • Base agent (no tools) tested and producing correct outputs
  • All tools tested in isolation before being added to agent
  • 50+ test scenarios run including adversarial inputs
  • Handoff condition tested: agent escalates correctly
  • Failure notification tested: alert fires correctly
  • Auth token expiry dates logged and refresh schedule set
  • All governance infrastructure live in staging: RBAC, audit, cost caps
  • Staged URL tested by 3+ internal users for minimum 5 days
  • SLO baseline established from staging data
Phase Gate 04
Production Validated / Scale Unlock
Champion: _________________    Date: _______    Sign: _________
  • 30+ days in production without a P1 incident
  • Error rate below SLO threshold for 2 consecutive weeks
  • Cost per run within 15% of estimate for 4 consecutive weeks
  • Volume target met at target accuracy
  • Top 2 failure patterns identified and addressed
  • 30-day production report delivered to champion
  • Level 2 use case identified and champion named
Where to Go From Here

Start building at
architect.new

The fastest path from a completed Agent Design Canvas to a deployed agent. Plan Mode generates your PRD and agent architecture. Push to Agents brings it to life. One-click deployment puts it in production. Recommended for teams without dedicated AI engineering.

Open Architect → architect.new
Guided Mode: step by step
One Shot Mode: single prompt
5 Phases: Plan → Agents → App