Open runtime for long-running agents
Open source — Apache 2.0

Keep long-running agents moving across interruptions — on any model, any cloud.

The open-source runtime for Python agents. Durable execution, durable memory, replay, wait on humans — deploy on your infrastructure.

pip install kitaru
Watch the demo
WHAT BREAKS FIRST

Long-running agents fail in boring, expensive ways.

Agents die when your laptop sleeps

Hours of work disappear on routine failures

A laptop sleep, pod eviction, or API timeout can wipe a run that took hours to build. You burn tokens and start over.

Agents can't coordinate

They need to wait on humans and other systems

Real agents pause for approvals, webhooks, and other agents. Fire-and-forget loops do not handle that well.

Results vanish when agents crash

State either vanishes or gets stitched together by hand

Without checkpoints, artifacts, and memory, you lose context when things go wrong and debugging becomes guesswork.

That is why Kitaru focuses on runtime primitives: checkpoints, replay, wait/resume, and durable state on real infrastructure.

SEE THE CODE

Core primitives. Full durability.

agent.py
import kitaru
from kitaru import flow, checkpoint
 
kitaru.configure(stack="kubernetes")
 
@checkpoint
def research(topic: str) -> dict:
    results = search_web(topic)
    kitaru.save("sources", results)
    return summarize(results)
 
@checkpoint
def write_draft(context: str, prev_id: str) -> str:
    prior = kitaru.load(prev_id, "sources")
    return kitaru.llm(
        f"Draft a report on: {context}\nPrior sources: {prior}",
        model="gpt-4o",
    )
 
@flow
def report_agent(topic: str, prev_id: str) -> str:
    data = research(topic)
    draft = write_draft(str(data), prev_id)
    kitaru.log(topic=topic, words=len(draft.split()))
 
    approved = kitaru.wait(
        schema=bool, question="Publish?"
    )
    if approved:
        publish(draft)
    return draft

@flow

Top-level orchestration boundary. Marks a function as a durable workflow.

@checkpoint

Persists output. Crash at step 3? Steps 1-2 never re-run.

kitaru.wait()

Suspends the process. Resume when a human responds, 30s or 3 days later.

kitaru.llm()

Resolves model alias, injects API key, logs cost automatically.

kitaru.log()

Structured metadata on every execution. Query it in the dashboard.

kitaru.save()

Persist any artifact by name inside a checkpoint.

kitaru.load()

Retrieve saved artifacts from any prior execution by ID.

kitaru.configure()

Set stack, project, and runtime defaults. Zero config locally.

CORE RUNTIME PRIMITIVES

The primitives long-running agents keep needing.

These are the runtime basics teams keep rebuilding once agents leave the laptop.

01 — Wait & Resume

Pause. Get input. Continue later.

Suspends at decision points, releases compute, and resumes when input arrives from a human, another agent, or a webhook — even hours or days later.
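Kitaru's actual suspend machinery isn't shown here, but the pattern is simple to sketch: persist the pending state outside the process so the process can exit, then restore it when the answer arrives. Everything below (file path, function names) is invented for illustration.

```python
import json
import pathlib
import tempfile

# Pending state lives outside the process, so the process can exit.
STATE = pathlib.Path(tempfile.gettempdir()) / "wait_demo.json"

def run_until_decision(topic: str) -> str:
    """First process: work up to the decision point, persist, stop."""
    draft = f"Report on {topic}"
    STATE.write_text(json.dumps({"draft": draft, "question": "Publish?"}))
    return "waiting"

def resume(approved: bool) -> str:
    """Second process, seconds or days later: restore state and continue."""
    state = json.loads(STATE.read_text())
    return state["draft"] if approved else "discarded"
```

A real runtime stores this state in a database rather than a temp file, but the lifecycle is the same: no process needs to stay alive while the question is open.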

02 — Replay from Failure

Crash at step 6? Resume from step 6.

Every step is checkpointed. Fix the issue and replay from the point of failure instead of re-burning tokens.
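The mechanics behind replay can be sketched as a caching decorator: a step whose result is already stored is skipped, so a rerun only executes what hasn't completed. This is a toy stand-in, not Kitaru's implementation; a real runtime persists `_results` to durable storage.

```python
import functools

_results = {}    # a real runtime persists this to durable storage
executions = []  # tracks which steps actually ran

def checkpoint(fn):
    """Skip any step whose result is already stored."""
    @functools.wraps(fn)
    def wrapper(*args):
        key = (fn.__name__, args)
        if key not in _results:          # only run on a cache miss
            executions.append(fn.__name__)
            _results[key] = fn(*args)
        return _results[key]             # replay path: stored result, no tokens
    return wrapper

@checkpoint
def research(topic):
    return f"notes on {topic}"

@checkpoint
def write_draft(notes):
    return f"draft from {notes}"

# First run executes both steps; the rerun replays them from storage.
write_draft(research("agents"))
write_draft(research("agents"))
```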

03 — Framework Portability

Keep your framework. Add durability.

PydanticAI, CrewAI, or raw Python — wrap it with Kitaru and get checkpointed execution without rewriting your agent.

04 — Parallel Recovery

Fan out work without losing recovery.

checkpoint.submit() dispatches branches concurrently. Each keeps its own checkpoint history, so you can replay only the failed branch.
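A rough sketch of the idea, using `concurrent.futures` as a stand-in for `checkpoint.submit()`: each branch consults its own checkpoint store before running, so a second pass re-executes only the branch that failed. All names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

done = {}   # per-branch checkpoint store
runs = []   # which branches actually executed

def branch(name, fail=frozenset()):
    if name in done:                 # branch already checkpointed: replay
        return done[name]
    runs.append(name)
    if name in fail:
        raise RuntimeError(f"{name} failed")
    done[name] = f"{name}: ok"
    return done[name]

def fan_out(names, fail=frozenset()):
    results, failed = {}, []
    with ThreadPoolExecutor() as pool:
        futures = {n: pool.submit(branch, n, fail) for n in names}
        for n, fut in futures.items():
            try:
                results[n] = fut.result()
            except RuntimeError:
                failed.append(n)
    return results, failed

# First run: "b" fails, but "a" and "c" checkpoint their results.
_, failed = fan_out(["a", "b", "c"], fail={"b"})
# Second run: only the failed branch executes; the rest replay.
results, _ = fan_out(["a", "b", "c"])
```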

DURABLE MEMORY

State your agents can come back to.

Versioned, scoped memory across Python, CLI, and MCP — not a bolt-on scratchpad.

memory_demo.py
from kitaru import memory

memory.configure(
    scope="my_repo",
    scope_type="namespace",
)

# Store durable state
memory.set("conventions/test_runner", {
    "command": "just test",
    "notes": "Start targeted, then full suite.",
})

# Next run — any surface — instant recall
conventions = memory.get(
    "conventions/test_runner"
)
Memory store view: conventions/test_runner (v1 v2 v3) · sessions/topic_count (v1 v2) · scratch/obsolete (v1, deleted)
Scopes

Scoped to your use case

namespace for shared team state, flow for per-agent knowledge, execution for single-run scratch. Pick the right boundary.
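Selecting a boundary uses the same `memory.configure()` call shown above; the scope values here are illustrative placeholders, not fixed names.

```python
from kitaru import memory

# Shared team state: visible across every flow in the namespace
memory.configure(scope="my_repo", scope_type="namespace")

# Per-agent knowledge: visible to every run of one flow
memory.configure(scope="report_agent", scope_type="flow")

# Single-run scratch: state scoped to this execution only
memory.configure(scope="run-2024-001", scope_type="execution")
```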

Compaction

LLM-powered compaction

When memory grows, compact it: an LLM reads source entries and writes a concise summary as a new version. Sources stay intact — compaction is additive.
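The additive shape of compaction can be sketched in a few lines: read the live versions, append a summary as a new version, and emit an audit-style record. This is a simplified stand-in (in Kitaru the summarizer is an LLM and the record is a CompactionRecord); the function and field names here are invented.

```python
def compact(versions, summarize):
    """Append a summary as a new version; never rewrite the sources."""
    live = [v for v in versions if v is not None]   # skip tombstones
    summary = summarize(live)
    record = {"compacted": len(live), "summary": summary}
    return versions + [summary], record

history = [
    "ran tests with pytest",
    "switched runner to just test",
    "always start targeted, then full suite",
]
history, record = compact(
    history, lambda vs: "test runner: just test; targeted first"
)
```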

Audit

Audit trail built in

Every compaction and purge writes a CompactionRecord to a reserved audit log. Hard purge cleans old versions. You always know what happened.

Versioned

Versioned, not overwritten

Every memory.set() creates a new version. Deletes are soft — tombstones preserve the full history for audit and rollback.
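Version-append semantics with tombstone deletes look roughly like this. A minimal sketch, not Kitaru's storage layer: every `set` appends, `delete` appends a `None` tombstone, and the full history stays readable for audit and rollback.

```python
class VersionedStore:
    """Minimal sketch: every set appends; deletes append a tombstone."""

    def __init__(self):
        self._versions = {}

    def set(self, key, value):
        self._versions.setdefault(key, []).append(value)

    def get(self, key):
        history = self._versions.get(key)
        if not history or history[-1] is None:   # missing or tombstoned
            raise KeyError(key)
        return history[-1]

    def delete(self, key):
        # Soft delete: the tombstone hides the key but keeps every version.
        self._versions.setdefault(key, []).append(None)

    def history(self, key):
        return list(self._versions.get(key, []))
```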

WHAT'S BUILT IN

What Kitaru includes

Observability built-in

Execution visibility built in

The server UI lets you inspect checkpoints, execution logs, LLM calls, and costs without buying a separate control plane.

Full execution control

Execution control, not just traces

Pause at decision points, get human input, and resume hours or days later. Replay from checkpoints instead of rerunning the whole flow.

Deployment flexibility

Your infrastructure, your cloud

Run the same Python code locally, on Kubernetes, or across AWS, GCP, and Azure. Move stacks without rewriting the agent.

Python-first, no lock-in

Python-first portability

Use PydanticAI, CrewAI, the OpenAI SDK, or raw Python. Kitaru adds runtime primitives without locking you into a framework.

OPEN RUNTIME

An orchestration layer for long-running agents.
Not a framework.

Keep your existing agent code. Kitaru adds checkpoints, memory, and execution control underneath it, then runs it on your infrastructure.

Your Agent Code: PydanticAI, OpenAI SDK, CrewAI, raw Python. Write agents your way.
Kitaru SDK: @flow, @checkpoint, wait(), llm(), log(), save(), load(), memory.*, configure(). Core primitives, durable execution + memory.
Kitaru Engine: Checkpointer, DAG Builder, Replay, Cost Engine. Built on ZenML foundations.
Infrastructure: Kubernetes, AWS / GCP / Azure, S3 / GCS, SQL database. Your cloud.
agent.py
import kitaru
from kitaru import flow, checkpoint

@flow
def coding_agent(issue: str) -> str:
    plan = analyze_issue(issue)
    patch = write_code(plan)

    # Pauses. Resumes when input arrives.
    approved = kitaru.wait(
        schema=bool, question="Merge this PR?"
    )
    if approved:
        merge(patch)
    return patch

Your agent crashed at step 5.
Stop re-running steps 1 through 4.

pip install kitaru

Open source (Apache 2.0). pip install and go.