Keep long-running agents moving across interruptions — on any model, any cloud.
The open-source runtime for Python agents. Durable execution, durable memory, replay, wait on humans — deploy on your infrastructure.
pip install kitaru

Long-running agents fail in boring, expensive ways.
Hours of work disappear on routine failures
A laptop sleep, pod eviction, or API timeout can wipe a run that took hours to build. You burn tokens and start over.
They need to wait on humans and other systems
Real agents pause for approvals, webhooks, and other agents. Fire-and-forget loops do not handle that well.
State either vanishes or gets stitched together by hand
Without checkpoints, artifacts, and memory, you lose context when things go wrong and debugging becomes guesswork.
That is why Kitaru focuses on runtime primitives: checkpoints, replay, wait/resume, and durable state on real infrastructure.
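The checkpoint idea can be sketched in plain Python. This is a toy illustration of the pattern, not Kitaru's implementation; the `checkpoints/` directory and the hashing scheme are assumptions made up for this sketch:

```python
import functools
import hashlib
import json
from pathlib import Path

STORE = Path("checkpoints")  # hypothetical on-disk store for this sketch

def checkpoint(fn):
    """Persist fn's result; on replay, steps that already finished are skipped."""
    @functools.wraps(fn)
    def wrapper(*args):
        digest = hashlib.sha256(repr(args).encode()).hexdigest()[:12]
        key = STORE / f"{fn.__name__}_{digest}.json"
        if key.exists():
            # Step already completed on a previous run: reuse the saved result
            # instead of re-burning compute (or tokens).
            return json.loads(key.read_text())
        result = fn(*args)
        STORE.mkdir(exist_ok=True)
        key.write_text(json.dumps(result))
        return result
    return wrapper

@checkpoint
def step_one(x: int) -> int:
    return x * 2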
Core primitives. Full durability.
import kitaru
from kitaru import flow, checkpoint

kitaru.configure(stack="kubernetes")

@checkpoint
def research(topic: str) -> dict:
    results = search_web(topic)
    kitaru.save("sources", results)
    return summarize(results)

@checkpoint
def write_draft(context: str, prev_id: str) -> str:
    prior = kitaru.load(prev_id, "sources")
    return kitaru.llm(
        "Draft a report on: " + context
        + "\nSources: " + str(prior),
        model="gpt-4o",
    )
@flow
def report_agent(topic: str, prev_id: str) -> str:
    data = research(topic)
    draft = write_draft(str(data), prev_id)
    kitaru.log(topic=topic, words=len(draft.split()))
    approved = kitaru.wait(
        schema=bool, question="Publish?"
    )
    if approved:
        publish(draft)
    return draft

@flow Top-level orchestration boundary. Marks a function as a durable workflow.
@checkpoint Persists output. Crash at step 3? Steps 1-2 never re-run.
kitaru.wait() Suspends the process. Resume when a human responds, 30s or 3 days later.
kitaru.llm() Resolves model alias, injects API key, logs cost automatically.
kitaru.log() Structured metadata on every execution. Query it in the dashboard.
kitaru.save() Persist any artifact by name inside a checkpoint.
kitaru.load() Retrieve saved artifacts from any prior execution by ID.
kitaru.configure() Set stack, project, and runtime defaults. Zero config locally.
The primitives long-running agents keep needing.
These are the runtime basics teams keep rebuilding once agents leave the laptop.
Pause. Get input. Continue later.
Suspends at decision points, releases compute, and resumes when input arrives from a human, another agent, or a webhook — even hours or days later.
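The suspend/resume cycle can be illustrated with a standalone toy: persist the pending question and run state, unwind the process, and continue once an answer is written back. This is a sketch of the pattern only; `pending_wait.json`, `Suspended`, and `resume` are names invented here, not Kitaru's API:

```python
import json
from pathlib import Path

PENDING = Path("pending_wait.json")  # hypothetical suspension record

class Suspended(Exception):
    """Raised to unwind the process; compute is released until input arrives."""

def wait(question: str, run_state: dict):
    """Suspend: persist state and question, then exit. Resume: return the answer."""
    if PENDING.exists():
        record = json.loads(PENDING.read_text())
        if "answer" in record:
            PENDING.unlink()
            return record["answer"]
    PENDING.write_text(json.dumps({"question": question, "state": run_state}))
    raise Suspended(question)

def resume(answer):
    """Called hours or days later, when a human or webhook responds."""
    record = json.loads(PENDING.read_text())
    record["answer"] = answer
    PENDING.write_text(json.dumps(record))
```

The first call to `wait` raises `Suspended` and the process exits; after `resume(True)`, re-running the flow reaches the same `wait` call and returns the answer.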
Crash at step 6? Resume from step 6.
Every step is checkpointed. Fix the issue and replay from the point of failure instead of re-burning tokens.
Keep your framework. Add durability.
PydanticAI, CrewAI, or raw Python — wrap it with Kitaru and get checkpointed execution without rewriting your agent.
Fan out work without losing recovery.
checkpoint.submit() dispatches branches concurrently. Each keeps its own checkpoint history, so you can replay only the failed branch.
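Per-branch recovery can be sketched with `concurrent.futures` and an in-memory checkpoint per branch: on replay, finished branches return their stored result and only failed branches run again. The helper names here are invented for the sketch and do not mirror Kitaru's `checkpoint.submit()` API:

```python
from concurrent.futures import ThreadPoolExecutor

branch_results = {}  # per-branch checkpoint store (in-memory for this sketch)

def run_branch(name, fn, arg):
    """Each branch checkpoints independently; replay skips finished branches."""
    if name in branch_results:
        return branch_results[name]
    result = fn(arg)
    branch_results[name] = result
    return result

def fan_out(fn, inputs):
    """Dispatch branches concurrently; only failed branches re-run on replay."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(run_branch, name, fn, arg)
                   for name, arg in inputs.items()}
    errors = {n: f.exception() for n, f in futures.items() if f.exception()}
    done = {n: f.result() for n, f in futures.items() if not f.exception()}
    return done, errors
```

If one branch fails, calling `fan_out` again retries only that branch; the others come back from their checkpoints untouched.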
State your agents can come back to.
Versioned, scoped memory across Python, CLI, and MCP — not a bolt-on scratchpad.
from kitaru import memory

memory.configure(
    scope="my_repo",
    scope_type="namespace",
)

# Store durable state
memory.set("conventions/test_runner", {
    "command": "just test",
    "notes": "Start targeted, then full suite.",
})

# Next run — any surface — instant recall
conventions = memory.get(
    "conventions/test_runner"
)

Scoped to your use case
namespace for shared team state, flow for per-agent knowledge, execution for single-run scratch. Pick the right boundary.
LLM-powered compaction
When memory grows, compact it: an LLM reads source entries and writes a concise summary as a new version. Sources stay intact — compaction is additive.
Audit trail built in
Every compaction and purge writes a CompactionRecord to a reserved audit log. Hard purge cleans old versions. You always know what happened.
Versioned, not overwritten
Every memory.set() creates a new version. Deletes are soft — tombstones preserve the full history for audit and rollback.
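The versioning, tombstone, and additive-compaction behaviors described above can be modeled in a few lines. This is a toy model of the semantics, not Kitaru's storage engine; the class and field names are invented for the sketch:

```python
class VersionedMemory:
    """Toy versioned store: set appends, delete tombstones, compaction is additive."""

    def __init__(self):
        self._log = {}    # key -> list of versions, never overwritten
        self._audit = []  # reserved audit log of compactions

    def set(self, key, value):
        # Every set creates a new version; prior versions stay intact.
        self._log.setdefault(key, []).append({"value": value, "deleted": False})

    def get(self, key):
        versions = self._log.get(key, [])
        if versions and not versions[-1]["deleted"]:
            return versions[-1]["value"]
        return None

    def delete(self, key):
        # Soft delete: a tombstone preserves full history for audit and rollback.
        self._log.setdefault(key, []).append({"value": None, "deleted": True})

    def compact(self, key, summarize):
        # Additive: write the summary as a new version; source versions survive.
        sources = [v["value"] for v in self._log.get(key, []) if not v["deleted"]]
        summary = summarize(sources)
        self.set(key, summary)
        self._audit.append({"op": "compact", "key": key, "sources": len(sources)})
        return summary
```

In the real system the `summarize` step is LLM-powered; here any callable that reduces the source entries to one value stands in for it.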
What Kitaru includes
Execution visibility built in
The server UI lets you inspect checkpoints, execution logs, LLM calls, and costs without buying a separate control plane.
Execution control, not just traces
Pause at decision points, get human input, and resume hours or days later. Replay from checkpoints instead of rerunning the whole flow.
Your infrastructure, your cloud
Run the same Python code locally, on Kubernetes, or across AWS, GCP, and Azure. Move stacks without rewriting the agent.
Python-first portability
Use PydanticAI, CrewAI, the OpenAI SDK, or raw Python. Kitaru adds runtime primitives without locking you into a framework.
An orchestration layer for long-running agents.
Not a framework.
Keep your existing agent code. Kitaru adds checkpoints, memory, and execution control underneath it, then runs it on your infrastructure.
import kitaru
from kitaru import flow, checkpoint

@flow
def coding_agent(issue: str) -> str:
    plan = analyze_issue(issue)
    patch = write_code(plan)
    # Pauses. Resumes when input arrives.
    approved = kitaru.wait(
        bool, question="Merge this PR?"
    )
    if approved:
        merge(patch)
    return patch

Your agent crashed at step 5.
Stop re-running steps 1 through 4.
pip install kitaru

Open source (Apache 2.0). pip install and go.
Foundations and proof for Kitaru
From production-proven foundations to purpose-built agent primitives.
Kitaru inherits orchestration foundations from ZenML and provides its own concrete primitives — durable execution, replay, wait/resume, durable memory, and portability.