Open runtime for long-running agents
Open source — Apache 2.0

Keep long-running agents moving across interruptions — on any model, any cloud.

The open-source runtime for Python agents. Durable execution, durable memory, replay, wait on humans — deploy on your infrastructure.

pip install kitaru
Watch the demo
WHAT BREAKS FIRST

Long-running agents fail in boring, expensive ways.

Agents die when your laptop sleeps

Hours of work disappear on routine failures

A laptop sleep, pod eviction, or API timeout can wipe a run that took hours to build. You burn tokens and start over.

Agents can't coordinate

They need to wait on humans and other systems

Real agents pause for approvals, webhooks, and other agents. Fire-and-forget loops do not handle that well.

Results vanish when agents crash

State either vanishes or gets stitched together by hand

Without checkpoints, artifacts, and memory, you lose context when things go wrong and debugging becomes guesswork.

That is why Kitaru focuses on runtime primitives: checkpoints, replay, wait/resume, and durable state on real infrastructure.

SEE THE CODE

Core primitives. Full durability.

agent.py
import kitaru
from kitaru import flow, checkpoint
 
kitaru.configure(stack="kubernetes")
 
@checkpoint
def research(topic: str) -> dict:
    results = search_web(topic)
    kitaru.save("sources", results)
    return summarize(results)
 
@checkpoint
def write_draft(context: str, prev_id: str) -> str:
    prior = kitaru.load(prev_id, "sources")
    return kitaru.llm(
        f"Draft a report on: {context}\nPrior sources: {prior}",
        model="gpt-4o",
    )
 
@flow
def report_agent(topic: str, prev_id: str) -> str:
    data = research(topic)
    draft = write_draft(str(data), prev_id)
    kitaru.log(topic=topic, words=len(draft.split()))
 
    approved = kitaru.wait(
        schema=bool, question="Publish?"
    )
    if approved:
        publish(draft)
    return draft

@flow

Top-level orchestration boundary. Marks a function as a durable workflow.

@checkpoint

Persists output. Crash at step 3? Steps 1-2 never re-run.

kitaru.wait()

Suspends the process. Resume when a human responds, 30s or 3 days later.

kitaru.llm()

Resolves model alias, injects API key, logs cost automatically.

kitaru.log()

Structured metadata on every execution. Query it in the dashboard.

kitaru.save()

Persist any artifact by name inside a checkpoint.

kitaru.load()

Retrieve saved artifacts from any prior execution by ID.

kitaru.configure()

Set stack, project, and runtime defaults. Zero config locally.

CORE RUNTIME PRIMITIVES

The primitives long-running agents keep needing.

These are the runtime basics teams keep rebuilding once agents leave the laptop.

01 — Wait & Resume

Pause. Get input. Continue later.

Suspends at decision points, releases compute, and resumes when input arrives from a human, another agent, or a webhook — even hours or days later.
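Kitaru's actual suspend machinery isn't shown here, but the pattern is simple to sketch: persist the pending state outside the process so the process can exit, then restore it when the answer arrives. Everything below (file path, function names) is invented for illustration.

```python
import json
import pathlib
import tempfile

# Pending state lives outside the process, so the process can exit.
STATE = pathlib.Path(tempfile.gettempdir()) / "wait_demo.json"

def run_until_decision(topic: str) -> str:
    """First process: work up to the decision point, persist, stop."""
    draft = f"Report on {topic}"
    STATE.write_text(json.dumps({"draft": draft, "question": "Publish?"}))
    return "waiting"

def resume(approved: bool) -> str:
    """Second process, seconds or days later: restore state and continue."""
    state = json.loads(STATE.read_text())
    return state["draft"] if approved else "discarded"
```

A real runtime stores this state in a database rather than a temp file, but the lifecycle is the same: no process needs to stay alive while the question is open.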

02 — Replay from Failure

Crash at step 6? Resume from step 6.

Every step is checkpointed. Fix the issue and replay from the point of failure instead of re-burning tokens.
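The mechanics behind replay can be sketched as a caching decorator: a step whose result is already stored is skipped, so a rerun only executes what hasn't completed. This is a toy stand-in, not Kitaru's implementation; a real runtime persists `_results` to durable storage.

```python
import functools

_results = {}    # a real runtime persists this to durable storage
executions = []  # tracks which steps actually ran

def checkpoint(fn):
    """Skip any step whose result is already stored."""
    @functools.wraps(fn)
    def wrapper(*args):
        key = (fn.__name__, args)
        if key not in _results:          # only run on a cache miss
            executions.append(fn.__name__)
            _results[key] = fn(*args)
        return _results[key]             # replay path: stored result, no tokens
    return wrapper

@checkpoint
def research(topic):
    return f"notes on {topic}"

@checkpoint
def write_draft(notes):
    return f"draft from {notes}"

# First run executes both steps; the rerun replays them from storage.
write_draft(research("agents"))
write_draft(research("agents"))
```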

03 — Framework Portability

Keep your framework. Add durability.

PydanticAI, CrewAI, or raw Python — wrap it with Kitaru and get checkpointed execution without rewriting your agent.

04 — Parallel Recovery

Fan out work without losing recovery.

checkpoint.submit() dispatches branches concurrently. Each keeps its own checkpoint history, so you can replay only the failed branch.
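A rough sketch of the idea, using `concurrent.futures` as a stand-in for `checkpoint.submit()`: each branch consults its own checkpoint store before running, so a second pass re-executes only the branch that failed. All names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

done = {}   # per-branch checkpoint store
runs = []   # which branches actually executed

def branch(name, fail=frozenset()):
    if name in done:                 # branch already checkpointed: replay
        return done[name]
    runs.append(name)
    if name in fail:
        raise RuntimeError(f"{name} failed")
    done[name] = f"{name}: ok"
    return done[name]

def fan_out(names, fail=frozenset()):
    results, failed = {}, []
    with ThreadPoolExecutor() as pool:
        futures = {n: pool.submit(branch, n, fail) for n in names}
        for n, fut in futures.items():
            try:
                results[n] = fut.result()
            except RuntimeError:
                failed.append(n)
    return results, failed

# First run: "b" fails, but "a" and "c" checkpoint their results.
_, failed = fan_out(["a", "b", "c"], fail={"b"})
# Second run: only the failed branch executes; the rest replay.
results, _ = fan_out(["a", "b", "c"])
```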

DURABLE MEMORY

State your agents can come back to.

Versioned, scoped memory across Python, CLI, and MCP — not a bolt-on scratchpad.

memory_demo.py
from kitaru import memory

memory.configure(
    scope="my_repo",
    scope_type="namespace",
)

# Store durable state
memory.set("conventions/test_runner", {
    "command": "just test",
    "notes": "Start targeted, then full suite.",
})

# Next run — any surface — instant recall
conventions = memory.get(
    "conventions/test_runner"
)
Memory store view: conventions/test_runner (v1 v2 v3) · sessions/topic_count (v1 v2) · scratch/obsolete (v1, deleted)
Scopes

Scoped to your use case

namespace for shared team state, flow for per-agent knowledge, execution for single-run scratch. Pick the right boundary.
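Selecting a boundary uses the same `memory.configure()` call shown above; the scope values here are illustrative placeholders, not fixed names.

```python
from kitaru import memory

# Shared team state: visible across every flow in the namespace
memory.configure(scope="my_repo", scope_type="namespace")

# Per-agent knowledge: visible to every run of one flow
memory.configure(scope="report_agent", scope_type="flow")

# Single-run scratch: state scoped to this execution only
memory.configure(scope="run-2024-001", scope_type="execution")
```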

Compaction

LLM-powered compaction

When memory grows, compact it: an LLM reads source entries and writes a concise summary as a new version. Sources stay intact — compaction is additive.
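The additive shape of compaction can be sketched in a few lines: read the live versions, append a summary as a new version, and emit an audit-style record. This is a simplified stand-in (in Kitaru the summarizer is an LLM and the record is a CompactionRecord); the function and field names here are invented.

```python
def compact(versions, summarize):
    """Append a summary as a new version; never rewrite the sources."""
    live = [v for v in versions if v is not None]   # skip tombstones
    summary = summarize(live)
    record = {"compacted": len(live), "summary": summary}
    return versions + [summary], record

history = [
    "ran tests with pytest",
    "switched runner to just test",
    "always start targeted, then full suite",
]
history, record = compact(
    history, lambda vs: "test runner: just test; targeted first"
)
```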

Audit

Audit trail built in

Every compaction and purge writes a CompactionRecord to a reserved audit log. Hard purge cleans old versions. You always know what happened.

Versioned

Versioned, not overwritten

Every memory.set() creates a new version. Deletes are soft — tombstones preserve the full history for audit and rollback.
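Version-append semantics with tombstone deletes look roughly like this. A minimal sketch, not Kitaru's storage layer: every `set` appends, `delete` appends a `None` tombstone, and the full history stays readable for audit and rollback.

```python
class VersionedStore:
    """Minimal sketch: every set appends; deletes append a tombstone."""

    def __init__(self):
        self._versions = {}

    def set(self, key, value):
        self._versions.setdefault(key, []).append(value)

    def get(self, key):
        history = self._versions.get(key)
        if not history or history[-1] is None:   # missing or tombstoned
            raise KeyError(key)
        return history[-1]

    def delete(self, key):
        # Soft delete: the tombstone hides the key but keeps every version.
        self._versions.setdefault(key, []).append(None)

    def history(self, key):
        return list(self._versions.get(key, []))
```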

WHAT'S BUILT IN

What Kitaru includes

Observability built-in

Execution visibility built in

The server UI lets you inspect checkpoints, execution logs, LLM calls, and costs without buying a separate control plane.

Full execution control

Execution control, not just traces

Pause at decision points, get human input, and resume hours or days later. Replay from checkpoints instead of rerunning the whole flow.

Deployment flexibility

Your infrastructure, your cloud

Run the same Python code locally, on Kubernetes, or across AWS, GCP, and Azure. Move stacks without rewriting the agent.

Python-first, no lock-in

Python-first portability

Use PydanticAI, CrewAI, the OpenAI SDK, or raw Python. Kitaru adds runtime primitives without locking you into a framework.

OPEN RUNTIME

An orchestration layer for long-running agents.
Not a framework.

Keep your existing agent code. Kitaru adds checkpoints, memory, and execution control underneath it, then runs it on your infrastructure.

Your Agent Code: PydanticAI, OpenAI SDK, CrewAI, raw Python. Write agents your way.
Kitaru SDK: @flow, @checkpoint, wait(), llm(), log(), save(), load(), memory.*, configure(). Core primitives, durable execution + memory.
Kitaru Engine: Checkpointer, DAG Builder, Replay, Cost Engine. Built on ZenML foundations.
Infrastructure: Kubernetes, AWS / GCP / Azure, S3 / GCS, SQL database. Your cloud.
agent.py
import kitaru
from kitaru import flow, checkpoint

@flow
def coding_agent(issue: str) -> str:
    plan = analyze_issue(issue)
    patch = write_code(plan)

    # Pauses. Resumes when input arrives.
    approved = kitaru.wait(
        schema=bool, question="Merge this PR?"
    )
    if approved:
        merge(patch)
    return patch

Your agent crashed at step 5.
Stop re-running steps 1 through 4.

pip install kitaru

Open source (Apache 2.0). pip install and go.