One prompt becomes a self-organizing hierarchy of AI agents that plan, delegate, and execute in parallel, auto-routing each step to the cheapest model that handles it well. Up to 90% cheaper than running everything on one frontier model.
Every prompt flows through a hierarchy. T1 plans, spawns T2 managers in parallel, each of which orchestrates T3 workers — all streaming back into a single final answer.
Analyzes complexity · Selects models · Decomposes into sections · Compiles final output
Cascade classifies your prompt before dispatching — simple questions go direct to a T3 worker, complex implementations spin up a full hierarchy.
| Classification | Example | Route | T2 Managers |
|---|---|---|---|
| Simple | "What is a closure?" | T3 | 0 |
| Moderate | "Add pagination to the users API" | T2 → T3 ×2 | 1 |
| Complex | "Refactor auth module to JWT, add tests, open PR" | T1 → T2 ×3 → T3 ×n | 3–5 |
| Highly Complex | "Research, benchmark, and document the full auth ecosystem" | T1 → T2 ×5+ → T3 ×n | 5+ |
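The routing table can be read as a lookup from classification to a tier plan. The sketch below is a minimal illustration under stated assumptions: `Complexity`, `RoutePlan`, `tiersFor` and `managersFor` are hypothetical names, not Cascade's API, and the manager bounds simply mirror the "T2 Managers" column. How Cascade actually classifies a prompt is not shown here.

```typescript
// Hypothetical types that encode the routing table; not Cascade's internals.
type Complexity = "simple" | "moderate" | "complex" | "highly_complex";

interface RoutePlan {
  useT1: boolean;                                   // does a T1 planner run?
  directToT3: boolean;                              // simple prompts skip planning
  t2Managers: { min: number; max: number | null };  // null = open-ended ("5+")
}

const ROUTES: Record<Complexity, RoutePlan> = {
  simple: { useT1: false, directToT3: true, t2Managers: { min: 0, max: 0 } },
  moderate: { useT1: false, directToT3: false, t2Managers: { min: 1, max: 1 } },
  complex: { useT1: true, directToT3: false, t2Managers: { min: 3, max: 5 } },
  highly_complex: { useT1: true, directToT3: false, t2Managers: { min: 5, max: null } },
};

// Which tiers participate for a given classification.
function tiersFor(c: Complexity): string[] {
  const r = ROUTES[c];
  if (r.directToT3) return ["T3"];
  return r.useT1 ? ["T1", "T2", "T3"] : ["T2", "T3"];
}

// Clamp a requested number of T2 managers into the range the table allows.
function managersFor(c: Complexity, requested: number): number {
  const { min, max } = ROUTES[c].t2Managers;
  const atLeast = Math.max(requested, min);
  return max === null ? atLeast : Math.min(atLeast, max);
}

console.log(tiersFor("moderate").join(" → ")); // T2 → T3
console.log(managersFor("complex", 8));        // 5
```

Keeping the bounds as data rather than branching logic makes the table and the code easy to keep in sync.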
Works with every major model provider
No plugin store to browse. The tools your agents need are already wired in.
Set any tier to Auto and Cascade picks the best-value model per task — fusing live public benchmark scores with live OpenRouter pricing. Cheap models win trivial work; frontier models win the hard parts.
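One plausible reading of "best value" is quality per dollar among models that clear a difficulty floor. The sketch below assumes exactly that scoring rule; `ModelInfo`, `pickBestValue`, the `minScore` threshold and every number in the catalog are hypothetical placeholders, not real benchmark or OpenRouter data, and Cascade's actual formula may differ.

```typescript
// Hypothetical model record combining a benchmark score with a price.
interface ModelInfo {
  id: string;
  benchmarkScore: number; // assumed normalized to 0..1
  usdPerMTokens: number;  // assumed blended price per million tokens
}

// Highest score-per-dollar among models scoring at least minScore.
// Harder tasks raise the floor, pushing selection toward stronger models.
// Zero-priced entries are skipped here to avoid dividing by zero.
function pickBestValue(models: ModelInfo[], minScore: number): ModelInfo | undefined {
  let best: ModelInfo | undefined;
  for (const m of models) {
    if (m.benchmarkScore < minScore || m.usdPerMTokens <= 0) continue;
    const value = m.benchmarkScore / m.usdPerMTokens;
    if (best === undefined || value > best.benchmarkScore / best.usdPerMTokens) {
      best = m;
    }
  }
  return best;
}

// Illustrative numbers only.
const catalog: ModelInfo[] = [
  { id: "small-model", benchmarkScore: 0.55, usdPerMTokens: 0.2 },
  { id: "mid-model", benchmarkScore: 0.75, usdPerMTokens: 3 },
  { id: "frontier-model", benchmarkScore: 0.9, usdPerMTokens: 15 },
];

console.log(pickBestValue(catalog, 0.5)?.id);  // small-model (trivial task)
console.log(pickBestValue(catalog, 0.85)?.id); // frontier-model (hard task)
```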
Type /auto on for hands-off runs: the plan auto-approves and safe tools run without prompts — while dangerous tools still ask and budget caps stay the hard stop.
Pause before any worker spawns to review T1's plan — an AI reviewer critiques it, you drop sections inline or add a steering note, and it re-plans until you approve.
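The review step is a loop: plan, critique, collect the user's edits, and re-plan until the user approves. A minimal sketch, assuming hypothetical `Plan`, `Review`, `Planner`, `Critic` and `UserReview` shapes (none of these names come from Cascade) and an assumed cap on rounds so the loop always terminates:

```typescript
interface Plan { sections: string[] }
interface Review { approved: boolean; dropSections: string[]; steeringNote?: string }

type Planner = (prompt: string, steering: string[]) => Plan;
type Critic = (plan: Plan) => string;                      // AI reviewer's critique
type UserReview = (plan: Plan, critique: string) => Review;

function reviewUntilApproved(
  prompt: string,
  planner: Planner,
  critic: Critic,
  user: UserReview,
  maxRounds = 5, // assumed safety cap
): Plan {
  const steering: string[] = [];
  const dropped = new Set<string>();
  for (let round = 0; round < maxRounds; round++) {
    // (Re-)plan with all steering notes so far, minus sections already dropped.
    const plan: Plan = {
      sections: planner(prompt, steering).sections.filter((s) => !dropped.has(s)),
    };
    const review = user(plan, critic(plan));
    review.dropSections.forEach((s) => dropped.add(s));
    if (review.approved) {
      return { sections: plan.sections.filter((s) => !dropped.has(s)) };
    }
    if (review.steeringNote) steering.push(review.steeringNote);
  }
  throw new Error(`plan not approved after ${maxRounds} rounds`);
}

const approved = reviewUntilApproved(
  "Refactor auth module to JWT",
  (_p, notes) => ({ sections: notes.length ? ["jwt", "tests"] : ["jwt", "tests", "docs"] }),
  (plan) => `${plan.sections.length} sections`,
  (plan) => plan.sections.includes("docs")
    ? { approved: false, dropSections: [], steeringNote: "skip docs" }
    : { approved: true, dropSections: ["tests"] },
);
console.log(approved.sections.join(",")); // jwt
```

Remembering dropped sections across rounds keeps a re-plan from silently reintroducing work the user already removed.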
Hit the budget cap on a big task? /continue resumes with a raised budget — files already created persist on disk, so only the remaining work runs. No redo.
A worker that discovers its task should fan out asks its manager to spawn bounded sibling workers on the fly — dynamic parallelism, no rigid up-front plan, no runaway recursion.
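"Bounded" suggests the manager caps both how many workers it owns and how many spawn generations may chain. A sketch under assumed limits; `Manager`, `WorkerHandle`, `maxWorkers`, `maxGeneration` and `requestSpawn` are hypothetical names, not Cascade's API:

```typescript
interface WorkerHandle { id: string; generation: number; task: string }

class Manager {
  private readonly workers: WorkerHandle[] = [];
  private nextId = 0;
  private readonly maxWorkers: number;    // total workers this manager may own
  private readonly maxGeneration: number; // how many spawn generations may chain

  constructor(maxWorkers: number, maxGeneration: number) {
    this.maxWorkers = maxWorkers;
    this.maxGeneration = maxGeneration;
  }

  spawnRoot(task: string): WorkerHandle {
    return this.add(task, 0);
  }

  // A worker asks for siblings; the manager grants only what fits the limits.
  requestSpawn(requester: WorkerHandle, tasks: string[]): WorkerHandle[] {
    if (requester.generation + 1 > this.maxGeneration) return []; // no runaway recursion
    const room = Math.max(this.maxWorkers - this.workers.length, 0);
    return tasks.slice(0, room).map((t) => this.add(t, requester.generation + 1));
  }

  get count(): number {
    return this.workers.length;
  }

  private add(task: string, generation: number): WorkerHandle {
    const w: WorkerHandle = { id: `w${this.nextId++}`, generation, task };
    this.workers.push(w);
    return w;
  }
}

const mgr = new Manager(4, 1);
const root = mgr.spawnRoot("migrate 40 files");
const kids = mgr.requestSpawn(root, ["a", "b", "c", "d", "e"]);
console.log(kids.length, mgr.count); // 3 4
const grand = mgr.requestSpawn(kids[0], ["x"]);
console.log(grand.length); // 0 (generation limit reached)
```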
Watch the T1→T2→T3 hierarchy execute in real time in the terminal, rendered with Ink.
Dangerous tool calls escalate through T2 → T1 → user before executing. Never a silent file delete.
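The escalation path can be pictured as a chain of approvers in which each level may deny or defer upward and the user has the final say. A sketch with hypothetical types (`Decision`, `ToolCall`, `Approver`, `escalate`); the rule that T2 and T1 can block a dangerous call but never approve it on their own is an assumption chosen to match "never a silent file delete":

```typescript
type Decision = "approve" | "deny" | "defer";
interface ToolCall { tool: string; args: string[]; dangerous: boolean }
type Approver = (call: ToolCall) => Decision;

// Safe calls run immediately. Dangerous calls climb T2 -> T1 -> user.
// Intermediate tiers may deny; their "approve" is treated as "defer",
// so only the user can let a dangerous call execute.
function escalate(call: ToolCall, t2: Approver, t1: Approver, user: Approver): boolean {
  if (!call.dangerous) return true;
  for (const tier of [t2, t1]) {
    if (tier(call) === "deny") return false; // blocked before reaching the user
  }
  return user(call) === "approve";
}

const rm: ToolCall = { tool: "shell", args: ["rm", "-r", "build/"], dangerous: true };
const defer: Approver = () => "defer";

console.log(escalate(rm, defer, defer, () => "approve"));         // true
console.log(escalate(rm, () => "approve", defer, () => "deny"));  // false
```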
Rate-limit hit? Cascade auto-switches providers with exponential backoff. Zero config required.
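Provider failover of this kind typically combines rotation through providers with exponential backoff. A minimal sketch, assuming rate limits surface as errors carrying an HTTP `status` of 429 and that each provider is wrapped as a hypothetical `ProviderCall`; Cascade's real retry schedule, attempt limits and provider order are not documented here:

```typescript
type ProviderCall = () => Promise<string>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumption: rate-limit errors expose an HTTP status of 429.
function isRateLimit(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: unknown }).status === 429;
}

// Try each provider in turn. After a full pass in which every provider was
// rate-limited, wait baseDelayMs * 2^pass before the next pass.
async function withFailover(
  providers: ProviderCall[],
  maxPasses = 3,
  baseDelayMs = 500,
): Promise<string> {
  for (let pass = 0; pass < maxPasses; pass++) {
    for (const call of providers) {
      try {
        return await call();
      } catch (err) {
        if (!isRateLimit(err)) throw err; // only rate limits trigger failover
      }
    }
    if (pass < maxPasses - 1) await sleep(baseDelayMs * 2 ** pass);
  }
  throw new Error("all providers rate-limited");
}

const rateLimited: ProviderCall = () =>
  Promise.reject(Object.assign(new Error("rate limited"), { status: 429 }));
const providerB: ProviderCall = async () => "answer from B";

withFailover([rateLimited, providerB], 3, 10).then((text) => console.log(text)); // answer from B
```

Other errors are rethrown immediately, since retrying a malformed request against another provider would only repeat the failure.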
Shell, file CRUD, git, GitHub/GitLab PRs, Playwright browser automation, PDF creation, code interpreter.
React + ReactFlow live topology graph, session browser, cost tracker, JWT auth, WebSocket updates.
Connect any Model Context Protocol server. Its tools become available to every T3 worker automatically.
Every result exposes costByTier, tokensByTier, and percentage attribution. Set a
live session budget with /budget set 0.50 — Cascade warns you at 80% spend (configurable via
warnAtPct) and stops new tasks the moment the cap is hit, with no config-file edits required.
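The budget behavior reduces to two thresholds: a one-time warning at `warnAtPct` of the cap and a hard stop at the cap itself. A sketch under assumptions; `BudgetTracker`, `record` and `canStartTask` are hypothetical names, and only `warnAtPct`, the 80% default and the $0.50 example come from the description above:

```typescript
class BudgetTracker {
  private spent = 0;
  private warned = false;
  readonly capUsd: number;
  readonly warnAtPct: number;

  constructor(capUsd: number, warnAtPct = 80) {
    this.capUsd = capUsd;
    this.warnAtPct = warnAtPct;
  }

  // Record spend; returns a message only the first time the warning threshold is crossed.
  record(costUsd: number): string | undefined {
    this.spent += costUsd;
    if (!this.warned && this.spent >= (this.capUsd * this.warnAtPct) / 100) {
      this.warned = true;
      return `Budget warning: $${this.spent.toFixed(2)} of $${this.capUsd.toFixed(2)} spent`;
    }
    return undefined;
  }

  // Checked before each new task: once the cap is reached, nothing new starts.
  canStartTask(): boolean {
    return this.spent < this.capUsd;
  }
}

const budget = new BudgetTracker(0.5); // analogous to /budget set 0.50
budget.record(0.3);                    // 60% of cap: no warning
console.log(budget.record(0.12));      // 84% of cap: Budget warning: $0.42 of $0.50 spent
budget.record(0.1);
console.log(budget.canStartTask());    // false
```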
The first-run TUI collects API keys for every provider, including multiple Azure deployments and custom OpenAI-compatible endpoints. It fetches live model lists; you then assign T1/T2/T3 models yourself or let Cascade Auto decide.
Redesigned terminal UI with a top status bar showing live tier models and cost, a compact agent tree for T1→T2→T3 progress, and a keyboard hint bar — all purpose-built for Cascade's multi-tier hierarchy.
Run /model inside the REPL for a three-step picker — provider → tier → model — with Auto at
every step. Arrow keys, Tab, j/k and number keys all work; selections write
.cascade/config.json and hot-swap the live router, no restart required.
Pass an AbortSignal to cascade.run() to stop any in-progress run mid-flight. All
active tiers (T1 → T2 → T3) halt at the next safe checkpoint before the next LLM call — no mid-stream
interruptions, no orphaned agents. A run:cancelled event fires with partial output so you can
still surface what was produced. Prevents runaway token spend on long tasks.
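Checkpoint-style cancellation means the signal is polled between steps instead of being used to kill a request mid-stream. The sketch below shows that pattern with the standard `AbortController`/`AbortSignal` built into modern Node.js and browsers. `Step`, `RunResult` and `runSteps` are hypothetical stand-ins for Cascade's tiers, and the returned `partial` array only plays the role of the partial output a `run:cancelled` event carries.

```typescript
type Step = () => Promise<string>;
interface RunResult { cancelled: boolean; partial: string[] }

// The signal is checked at each safe checkpoint: before every step, never mid-step.
async function runSteps(steps: Step[], signal: AbortSignal): Promise<RunResult> {
  const partial: string[] = [];
  for (const step of steps) {
    if (signal.aborted) return { cancelled: true, partial };
    partial.push(await step()); // an in-flight call is allowed to finish
  }
  return { cancelled: false, partial };
}

const controller = new AbortController();
const steps: Step[] = [
  async () => "T3 result 1",
  async () => { controller.abort(); return "T3 result 2"; }, // cancel arrives during step 2
  async () => "T3 result 3",
];

runSteps(steps, controller.signal).then((r) => console.log(r.cancelled, r.partial.length)); // true 2
```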
```jsonc
// .cascade/config.json
{
  "version": "1.0",
  "providers": [
    { "type": "anthropic", "apiKey": "sk-ant-..." },
    { "type": "ollama" }
  ],
  "models": {
    "t1": "claude-opus-4",
    "t2": "claude-sonnet-4",
    "t3": "llama3.2:3b"
  },
  "tools": {
    "shellBlocklist": ["rm -rf"],
    "requireApprovalFor": ["shell"]
  }
}
```
Cascade exposes a first-class TypeScript SDK. Bring your own approval flow, stream tokens to any UI, or wire it into a CI pipeline.
- Full TypeScript types for every option and result
- Token-by-token streaming via callback
- Custom approval callbacks for tool gating
- Per-tier cost and token breakdown: costByTier, tokensByTier, and percentage attribution in every result
- Live budget management: /budget set <$amount> caps session spend at runtime; /budget shows a visual spend bar; a proactive warning fires at 80% (configurable via warnAtPct) before the hard stop
- Three entry points: runCascade, createCascade, streamCascade
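```typescript
// Top-level await requires running this file as an ES module.
import { streamCascade } from 'cascade-ai';

await streamCascade(
  'Refactor auth module to use JWT, add tests, open a PR',
  // Called for every streamed token.
  (token) => process.stdout.write(token),
  {
    workspacePath: '/my/project',
    // Called before a gated tool runs. This example logs the request
    // and then approves it unconditionally.
    approvalCallback: async (req) => {
      console.log(`Allow ${req.toolName}?`);
      return { approved: true, always: false };
    },
  }
);
```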
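Percentage attribution is each tier's share of the run's total cost. A sketch of that arithmetic, assuming a hypothetical result shape in which `costByTier` maps tier names to USD amounts; the field name comes from the SDK feature list, but its exact type in the SDK is not specified here, and `percentByTier` is an illustrative helper, not an SDK export:

```typescript
type Tier = "T1" | "T2" | "T3";
type CostByTier = Record<Tier, number>; // assumed: USD per tier

// Each tier's share of total spend, as a percentage rounded to one decimal.
function percentByTier(costByTier: CostByTier): Record<Tier, number> {
  const total = costByTier.T1 + costByTier.T2 + costByTier.T3;
  const pct = (x: number) => (total === 0 ? 0 : Math.round((x / total) * 1000) / 10);
  return { T1: pct(costByTier.T1), T2: pct(costByTier.T2), T3: pct(costByTier.T3) };
}

console.log(percentByTier({ T1: 0.06, T2: 0.03, T3: 0.01 })); // { T1: 60, T2: 30, T3: 10 }
```

Returning 0 for an empty run avoids a division by zero when nothing has been spent yet.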
Open-source, MIT licensed. No telemetry by default. Runs local models via Ollama — your code never leaves your machine if you don't want it to.