← Back to blog

Guardrails for AI Agents: How to Keep Autonomous Systems Safe

An AI agent without guardrails is a liability. An agent with too many guardrails is useless. The engineering challenge is finding the right constraints — boundaries that keep the system safe while preserving the autonomy that makes agents valuable in the first place.

Here’s how we think about it.

Input validation: control what goes in

Every agent interaction starts with input. That input might come from a user, an upstream system, or another agent. Before the agent reasons about anything, validate it.

  • Schema enforcement. If the agent expects structured input, reject anything that doesn’t conform. Don’t let the agent try to interpret garbage data.
  • Content filtering. Screen for prompt injection attempts, personally identifiable information that shouldn’t be processed, or input that’s outside the agent’s defined scope.
  • Size limits. An unexpectedly large payload can blow out context windows and costs. Set hard limits and fail cleanly.

This is basic defensive programming, applied to agents. Nothing exotic, but skipping it is how you get agents processing data they were never designed to handle.

Output filtering: control what comes out

The agent will produce outputs — text responses, API calls, generated documents. Every output should pass through validation before it reaches users or downstream systems.

  • Format validation. If the agent is supposed to produce JSON, validate the schema before passing it along. If it’s generating a customer-facing email, check that required fields are populated and tone markers are within bounds.
  • Content screening. Check for hallucinated data, leaked internal information, or content that violates your policies. This isn’t theoretical — production agents will occasionally produce outputs that are factually wrong or contextually inappropriate.
  • Confidence thresholds. When the agent’s confidence in its output is low, route to human review instead of sending it out the door.

Action scoping: limit what the agent can do

This is the most critical guardrail category. An agent with access to tools can take real actions in real systems. Scope those actions tightly.

  • Least privilege. Give the agent access to exactly the tools it needs, nothing more. An agent that processes invoices doesn’t need access to your HR system.
  • Read vs. write separation. Start agents with read-only access. Graduate to write access only after you’ve validated behavior in production. This one principle prevents most catastrophic failures.
  • Action whitelisting. Define an explicit list of allowed actions. Everything else is denied by default. This is more work than blacklisting, but it’s fundamentally safer.
  • Resource constraints. Cap the number of API calls per task, the total tokens consumed, and the dollar spend per execution. An agent stuck in a reasoning loop shouldn’t be able to run up an unlimited bill.

Human-in-the-loop escalation

Not every decision should be autonomous. Design clear escalation paths for situations where the agent should stop and ask a human.

  • High-stakes actions. Deleting data, sending external communications, approving transactions above a threshold — these should require human confirmation, at least until you have high confidence in agent behavior.
  • Low-confidence situations. When the agent can’t determine the right course of action, it should escalate rather than guess. A well-designed agent that says “I’m not sure, here are my options, what should I do?” is far more valuable than one that confidently does the wrong thing.
  • Novel scenarios. If the agent encounters a situation significantly different from its training distribution, escalate. This requires monitoring for distribution drift, which brings us to observability.

The escalation UX matters. If escalation is clunky or slow, people will either bypass it or stop using the agent. Make it seamless: clear context, specific questions, easy approval flows.

Audit logging: record everything

Every action an agent takes should be logged with enough detail to reconstruct the full decision chain after the fact.

  • Input received. What triggered the agent.
  • Reasoning trace. What the agent considered and why it chose a particular action.
  • Actions taken. Every tool call, every API request, every output generated.
  • Outcomes. What happened as a result, including errors and retries.

This is non-negotiable for regulated industries, but it’s good practice everywhere. When an agent does something unexpected, you need to answer “why” quickly and definitively. Without comprehensive logging, you’re debugging in the dark.

Budget and rate limits

Agents consume resources: LLM tokens, API calls, compute time. Without hard limits, a single misbehaving agent can generate significant costs.

  • Per-task budgets. Set a maximum token and cost limit for each agent execution. If the agent exceeds it, terminate gracefully and alert.
  • Rate limiting. Cap the rate of external API calls to prevent agents from overwhelming downstream services.
  • Circuit breakers. If an agent’s error rate spikes above a threshold, stop it automatically and notify the operations team.

Guardrails are a feature

We sometimes hear resistance to guardrails framed as “you’re limiting the agent’s capability.” That’s backwards. Guardrails are what make agents deployable. An agent you can’t trust in production isn’t capable — it’s a demo.

The goal is agents that are both powerful and predictable. That requires engineering the boundaries with the same rigor you apply to the capabilities. Get the guardrails right, and you can give agents more autonomy over time as you build confidence. Get them wrong, and you’ll spend your time cleaning up incidents instead of delivering value.