Agents as cluster resources
A single Agent custom resource registers your A2A, plain HTTP, or
framework-backed container. Krypton handles lifecycle, routing, scaling
signals, and operator visibility.
Self-host LLMs on Kubernetes
A Model custom resource names a Hugging Face GGUF file and runs it
with llama.cpp in your cluster. Serve local models with Kubernetes-native
lifecycle, resources, and observability.
MCP, first-class
Run any HTTP-transport MCP server as an Agent, or wrap a stdio MCP
binary in the bundled bridge. The operator UI introspects each server’s
tools.
Prometheus-native observability
Every component exposes krypton_* series — invocations, latency,
desired replicas, scaler decisions, sidecar in-flight. A starter
Grafana dashboard ships in the repo.
BYO ingress
The gateway ships as a ClusterIP. Put your existing ingress (Envoy / Nginx / ALB / Cloudflare) in front for TLS, auth, rate limiting — Krypton doesn’t reinvent any of it.
Streaming-native
SSE, chunked HTTP, and WebSocket upgrades pass through the gateway with immediate flushing. Chat completions can stream without buffering away the model’s first token.
Concurrency-aware agents
For agent workloads, the per-pod sidecar enforces in-flight caps and surfaces live load. Replicas can keep up with traffic without exceeding the configured per-pod ceiling.
OpenAI-compatible serving
Each self-hosted Model is reachable through familiar OpenAI API paths
like /v1/models and /v1/chat/completions, so existing SDKs can call
your in-cluster llama.cpp pods.
llama.cpp built in
Start with GGUF models from Hugging Face. Krypton creates the Deployment and Service, passes the right llama.cpp flags, and tracks model readiness in Kubernetes status.