v0.0.3

Krypton Runtime

Kubernetes-native serving for AI agents, self-hosted LLMs, and MCP servers. Deploy containers or Hugging Face GGUF models with CRDs, then reach them through stable HTTP and OpenAI-compatible endpoints.

# 1. Install Krypton with Helm
$ helm install krypton oci://ghcr.io/kryptonhq/charts/krypton \
    --namespace krypton-system --create-namespace

# 2. Deploy a no-secrets agent
$ kubectl apply -f \
    https://raw.githubusercontent.com/kryptonhq/runtime/main/examples/agent/python/helloworld/agent.yaml

# 3. Serve Qwen2.5 from Hugging Face
$ kubectl create namespace models
$ kubectl apply -f \
    https://raw.githubusercontent.com/kryptonhq/runtime/main/config/samples/llm/qwen2.5-0.5b.yaml

# 4. Call the model with the OpenAI API shape
#    (assumes the gateway is reachable on localhost:8080)
$ curl http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model":"qwen2-0-5b","messages":[{"role":"user","content":"Hi"}]}'

Agents as cluster resources

A single Agent custom resource registers your A2A, plain HTTP, or framework-backed container. Krypton handles lifecycle, routing, scaling signals, and operator visibility.

Self-host LLMs on Kubernetes

A Model custom resource names a Hugging Face GGUF file and runs it with llama.cpp in your cluster. Serve local models with Kubernetes-native lifecycle, resources, and observability.

MCP, first-class

Run any HTTP-transport MCP server as an Agent, or wrap a stdio MCP binary in the bundled bridge. The operator UI introspects each server’s tools.

Prometheus-native observability

Every component exposes krypton_* series: invocations, latency, desired replicas, scaler decisions, and sidecar in-flight requests. A starter Grafana dashboard ships in the repo.
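To get a feel for working with these series outside Grafana, here is a minimal sketch that sums samples per metric name from a Prometheus text-format scrape. Only the krypton_ prefix comes from this page; the function name and the metric names in the usage note are hypothetical, and the parser assumes samples carry no trailing timestamp.

```python
from collections import defaultdict


def krypton_totals(exposition: str) -> dict[str, float]:
    """Sum sample values per metric name for series starting with krypton_."""
    totals: dict[str, float] = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # Assumption: no trailing timestamp, so the value is the last field.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # not a "series value" pair
        series, value = parts
        name = series.split("{", 1)[0]  # drop the label set, if any
        if name.startswith("krypton_"):
            totals[name] += float(value)
    return dict(totals)
```

Fed the body of a /metrics scrape, it returns one total per krypton_ metric, for example a made-up krypton_invocations_total summed across its label sets, and ignores everything else.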

BYO ingress

The gateway ships as a ClusterIP Service. Put your existing ingress (Envoy, Nginx, ALB, Cloudflare) in front for TLS, auth, and rate limiting; Krypton doesn’t reinvent any of it.

Streaming-native

SSE, chunked HTTP, and WebSocket upgrades pass through the gateway with immediate flushing, so chat completions stream from the model’s first token instead of sitting in a buffer.
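On the client side, a streamed chat completion arrives as SSE lines. The sketch below assumes the OpenAI streaming convention: each event is a `data:` line holding a JSON chunk with `choices[].delta.content`, and the stream ends with `data: [DONE]`. The function name is illustrative and not part of Krypton.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines as they arrive."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

The response object returned by `urllib.request.urlopen` iterates line by line as bytes, so it can be passed in directly once the request body sets `"stream": true`.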

Concurrency-aware agents

For agent workloads, the per-pod sidecar enforces in-flight caps and surfaces live load. Replicas can keep up with traffic without exceeding the configured per-pod ceiling.
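The idea of an in-flight cap can be illustrated with a small counter that refuses work once a limit is reached and reports the live count. This is a generic sketch, not Krypton's sidecar code; the class and method names are hypothetical.

```python
import threading


class InFlightLimiter:
    """Admit at most `limit` concurrent requests and report the live count."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self._limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Reserve a slot; return False when the cap is already reached."""
        with self._lock:
            if self._in_flight >= self._limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free a slot once a request has finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1

    @property
    def in_flight(self) -> int:
        """Current number of admitted, unfinished requests."""
        with self._lock:
            return self._in_flight
```

A proxy built this way would call `try_acquire()` before forwarding a request, `release()` when the response completes, and export `in_flight` as the load signal a scaler reads.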

OpenAI-compatible serving

Each self-hosted Model is reachable through familiar OpenAI API paths like /v1/models and /v1/chat/completions, so existing SDKs can call your in-cluster llama.cpp pods.
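For a dependency-free client, the same request as the curl example can be built with the Python standard library. This is a sketch under assumptions: the gateway address and model name are copied from the quick-start, and the helper names are illustrative.

```python
import json
import urllib.request

# Assumption: the gateway is reachable here, as in the curl example.
BASE_URL = "http://localhost:8080"


def build_chat_request(model: str, prompt: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /v1/chat/completions path."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> str:
    """Send the request and return the first choice's message text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("qwen2-0-5b", "Hi")` mirrors step 4 of the quick-start. An OpenAI SDK pointed at the same base URL should work the same way, since only the standard paths are used.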

llama.cpp built in

Start with GGUF models from Hugging Face. Krypton creates the Deployment and Service, passes the right llama.cpp flags, and tracks model readiness in Kubernetes status.