
LLM Observability SDK

Python SDK for capturing, enriching, and analyzing LLM call telemetry with zero-to-minimal code changes.



What is this?

The LLM Observability SDK instruments your Python application to automatically capture every LLM API call — latency, token counts, cost, PII detection, streaming TTFT, and more — then sends it all to a pre-wired Grafana + Prometheus + Tempo stack.

Your App ──► LLM Provider (OpenAI / Anthropic / LiteLLM / LangChain)
    ▼ (auto-instrumented)
instrumentation-sdk
    ├──► FastAPI REST API   (localhost:8002)
    ├──► Prometheus         (localhost:9090)
    ├──► Grafana Dashboards (localhost:3002)
    └──► Tempo Traces       (localhost:4317)
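Cost is derived from the captured token counts and a per-model price table. The sketch below shows one way a captured call and its cost could be modeled. It assumes prices are stored in USD per million tokens with separate input and output rates. `LLMCallRecord`, `MODEL_PRICES`, and `example-model` are hypothetical names, and the prices are placeholders, not real rates. The SDK's actual price format is covered on the Config Files Reference page.

```python
from dataclasses import dataclass
from typing import Dict, Mapping

# Hypothetical price table in USD per 1M tokens (the unit is an assumption).
# Placeholder numbers only; the SDK reads real prices from its config files.
MODEL_PRICES: Dict[str, Dict[str, float]] = {
    "example-model": {"input": 1.00, "output": 3.00},
}


@dataclass
class LLMCallRecord:
    """Hypothetical shape of the telemetry captured for one LLM call."""

    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float

    def cost_usd(self, prices: Mapping[str, Mapping[str, float]] = MODEL_PRICES) -> float:
        # Prompt (input) and completion (output) tokens are priced separately.
        rates = prices[self.model]
        return (
            self.prompt_tokens * rates["input"]
            + self.completion_tokens * rates["output"]
        ) / 1_000_000


call = LLMCallRecord("example-model", prompt_tokens=1200, completion_tokens=300, latency_s=0.84)
print(f"{call.cost_usd():.6f}")  # 0.002100
```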

Observability Stack

The all-in-one container ships four pre-built dashboards:

Dashboard                What it shows
-----------------------  ----------------------------------------------------------
LLM Latency & TTFT       p50 / p95 / p99 latency and time-to-first-token per model
LLM Cost                 USD cost per service and model over time
LLM Error & Retry        Success vs error rate, finish reason distribution
LLM Security & Safety    PII detection rate, prompt injection attempts
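Time-to-first-token (TTFT) is the delay between issuing a streaming request and receiving its first chunk. The SDK's `wrap_stream` handles this for you. The sketch below only illustrates the idea with a hypothetical `measure_ttft` helper that reports timings through a callback; it is not the SDK's implementation. The clock starts when the wrapper is created, so in this sketch you would wrap the stream right after sending the request.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def measure_ttft(
    stream: Iterable[T],
    on_done: Callable[[Optional[float], float], None],
) -> Iterator[T]:
    """Hypothetical helper: pass chunks through unchanged while timing them.

    The clock starts when measure_ttft() is called, not at first iteration.
    on_done(ttft_s, total_s) runs once after iteration has started and then
    ends (exhaustion, an exception, or an early close()); ttft_s is None if
    no chunk ever arrived.
    """
    start = time.perf_counter()

    def _timed() -> Iterator[T]:
        ttft: Optional[float] = None
        try:
            for chunk in stream:
                if ttft is None:
                    ttft = time.perf_counter() - start  # first chunk arrived
                yield chunk
        finally:
            on_done(ttft, time.perf_counter() - start)

    return _timed()


def fake_stream() -> Iterator[str]:
    time.sleep(0.05)  # simulated wait before the first token
    yield "Hello"
    yield ", world"


timings = []
text = "".join(measure_ttft(fake_stream(), lambda ttft, total: timings.append((ttft, total))))
print(text, timings)
```

An async variant (the role `wrap_async_stream` plays) follows the same pattern with `async for` inside an async generator.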

Prometheus Metrics Dashboard: Prometheus metrics scraped from the SDK every 5 seconds.

Distributed Tracing Dashboard: distributed traces sent via OTLP to Grafana Tempo.
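Percentile panels such as p95 latency are typically computed in Grafana with PromQL's `histogram_quantile()` over a histogram's cumulative `le` buckets. Assuming the SDK exports latency as a Prometheus histogram (metric names are not listed on this page), the Python sketch below mirrors how Prometheus estimates a quantile: find the bucket that contains the target rank, then interpolate linearly inside it. The function name echoes PromQL, but this is an independent re-implementation for illustration only.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs sorted by bound and
    ending with (math.inf, total), like Prometheus `_bucket{le="..."}` series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if i == 0 and bound <= 0:
                return bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Cumulative counts per latency bucket (seconds): 10 calls <= 0.1 s, 60 <= 0.5 s, ...
latency_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(f"p50={histogram_quantile(0.5, latency_buckets):.3f}")  # p50=0.420
print(f"p95={histogram_quantile(0.95, latency_buckets):.3f}")  # p95=1.000
```

Note how p95 is capped at the highest finite bucket bound: when the top percentiles land in the `+Inf` bucket, the reported value depends on where the bucket boundaries are placed.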


5-Minute Quick Start

pip install instrumentation-sdk
llm-observe start

Then enable auto-instrumentation in your app's startup code:

from instrumentation_sdk import init_auto_instrumentation
init_auto_instrumentation()

Open Grafana at http://localhost:3002 — spans appear within 5–10 seconds.


SDK Feature Map

instrumentation-sdk & temporal-ewma-worker
├── Auto-Instrumentation        → zero-code patching
│   ├── OpenAI
│   ├── Anthropic
│   ├── LiteLLM
│   └── LangChain
├── Manual Instrumentation
│   ├── @llm_observe            → decorator
│   ├── llm_span                → context manager
│   └── llm_span_with_tokens    → context manager + pre-call token count
├── Streaming Observability
│   ├── wrap_stream             → sync TTFT tracking
│   └── wrap_async_stream       → async TTFT tracking
├── Security
│   ├── PII Scanning            → Aho-Corasick + regex redaction
│   └── Injection Detection     → SQL / prompt-override patterns
├── Sampling
│   └── Deterministic Gate      → SHA-256 % 100 (1% sampled)
├── Embeddings
│   └── MiniLM                  → async 384-dim prompt embeddings
├── Cost Anomaly Detection
│   └── Temporal EWMA worker    → decoupled, scheduled baseline computation
└── Observability Backend
    ├── Prometheus Metrics       → 8 metric families
    ├── Grafana Dashboards       → 4 pre-built dashboards
    └── Tempo Traces             → OTLP distributed tracing
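The sampling gate is deterministic: the same input always gets the same keep/drop decision, in every process and after every restart. A minimal sketch of a SHA-256 modulo-100 gate follows. This page does not say which value the SDK hashes, so the sketch assumes a request or trace ID string, and `should_sample` is a hypothetical name.

```python
import hashlib


def should_sample(key: str, percent: int = 1) -> bool:
    """Hypothetical deterministic sampling gate (SHA-256 modulo 100).

    The same key always yields the same decision, in every process.
    `percent` is the share of keys kept (1 keeps about 1%).
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest, "big") % 100  # map the hash into 0..99
    return bucket < percent


# Assumption: the gate is keyed on a request or trace ID.
kept = sum(should_sample(f"req-{i}") for i in range(10_000))
print(f"kept {kept} of 10000 requests")  # should be close to 1%
```

Python's built-in `hash()` would not work as a substitute, because string hashing is salted per process (see `PYTHONHASHSEED`), so two workers could disagree about the same ID.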

Documentation Pages

Page                                  What it covers
------------------------------------  -------------------------------------------------------------------------
Installation & Quick Start            Install, first span, verify it works
Auto-Instrumentation                  Zero-code patching for OpenAI, Anthropic, LiteLLM, LangChain
Manual Spans — Decorator              @llm_observe decorator usage
Manual Spans — Context Manager        llm_span / llm_span_with_tokens context managers
Streaming Observability               TTFT tracking, wrap_stream, wrap_async_stream
PII & Injection Scanning              Aho-Corasick redaction, scan API
Deterministic Sampling                SHA-256 modulo-100 gate
MiniLM Embeddings                     Async 384-dim prompt embeddings
Prometheus Metrics & Grafana          Cost, latency, TTFT dashboards
Temporal EWMA Cost Anomaly Detection  Decoupled worker for EWMA baseline computation and cost anomaly detection
REST Management API                   Full endpoint reference
Docker & CLI Deployment               llm-observe CLI, all-in-one container
Config Files Reference                Model prices, PII patterns, infra configs
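The EWMA baseline idea behind the cost anomaly worker can be sketched in a few lines. This is an illustration under stated assumptions, not the worker's code: it keeps an exponentially weighted mean and variance of per-interval cost and flags a value that lands more than `threshold` standard deviations above the baseline. `EWMABaseline`, `alpha`, `threshold`, `warmup`, and `min_std` are hypothetical names and defaults, and the scheduling side (the Temporal worker) is out of scope.

```python
import math
from dataclasses import dataclass, field


@dataclass
class EWMABaseline:
    """Hypothetical EWMA baseline for a cost series (e.g. USD per interval)."""

    alpha: float = 0.3      # smoothing factor; higher reacts faster to change
    threshold: float = 3.0  # flag values this many std devs above the mean
    warmup: int = 5         # observations absorbed before anything is flagged
    min_std: float = 1e-6   # floor so a perfectly flat series can still flag spikes
    mean: float = field(default=0.0, init=False)
    var: float = field(default=0.0, init=False)
    n: int = field(default=0, init=False)

    def update(self, x: float) -> bool:
        """Score x against the current baseline, then fold it in.

        Returns True if x is an upward anomaly.
        """
        anomalous = False
        if self.n >= self.warmup:
            z = (x - self.mean) / max(math.sqrt(self.var), self.min_std)
            anomalous = z > self.threshold
        if self.n == 0:
            self.mean = x
        else:
            # Incremental exponentially weighted mean and variance.
            diff = x - self.mean
            incr = self.alpha * diff
            self.mean += incr
            self.var = (1 - self.alpha) * (self.var + diff * incr)
        self.n += 1
        return anomalous


costs = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 5.0]
baseline = EWMABaseline()
print([baseline.update(c) for c in costs])
# [False, False, False, False, False, False, False, True]
```

Scoring each value before folding it in keeps a spike from inflating the baseline it is judged against. Whether flagged values should update the baseline at all is a judgment call: folding them in, as here, lets the baseline adapt to a lasting price or traffic change.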

Current Version

1.8.2 — see Changelog