LLM Observability SDK

Python SDK for capturing, enriching, and analyzing LLM call telemetry with zero-to-minimal code changes.

What is this?

The LLM Observability SDK instruments your Python application to capture telemetry for every LLM API call automatically: latency, token counts, cost, PII detection, streaming time-to-first-token (TTFT), and more. It then sends that telemetry to a pre-wired Grafana + Prometheus + Tempo stack.

Your App ──► LLM Provider (OpenAI / Anthropic / LiteLLM / LangChain)
    │
    ▼ (auto-instrumented)
instrumentation-sdk
    │
    ├──► FastAPI REST API  (localhost:8002)
    ├──► Prometheus        (localhost:9090)
    ├──► Grafana Dashboards (localhost:3002)
    └──► Tempo Traces      (localhost:4317)

Observability Stack

The all-in-one container ships four pre-built dashboards:

Dashboard	What it shows
LLM Latency & TTFT	p50 / p95 / p99 latency and time-to-first-token per model
LLM Cost	USD cost per service and model over time
LLM Error & Retry	Success vs error rate, finish reason distribution
LLM Security & Safety	PII detection rate, prompt injection attempts

Figure: Prometheus Metrics Dashboard. Prometheus metrics are scraped from the SDK every 5 seconds.

Figure: Distributed Tracing Dashboard. Distributed traces are sent via OTLP to Grafana Tempo.
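
The time-to-first-token (TTFT) figure on the latency dashboard can be pictured with a small wrapper around a streamed response. The sketch below is illustrative only: `timed_stream`, the `stats` dictionary, and its keys are hypothetical names, not the SDK's `wrap_stream` API, and the clock here starts when iteration begins rather than when the request is sent.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, TypeVar

T = TypeVar("T")


def timed_stream(
    chunks: Iterable[T],
    stats: Dict[str, float],
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[T]:
    """Yield chunks unchanged while recording timing into `stats`.

    Hypothetical helper for illustration; not part of the SDK.
    """
    start = clock()  # taken when the consumer starts iterating
    first_seen = False
    count = 0
    for chunk in chunks:
        if not first_seen:
            # Time from start until the first chunk arrives.
            stats["ttft_seconds"] = clock() - start
            first_seen = True
        count += 1
        yield chunk
    # Recorded only once the stream is fully consumed.
    stats["total_seconds"] = clock() - start
    stats["chunks"] = count
```

Passing the clock as a parameter keeps the wrapper easy to test with a fake time source. A production wrapper would more likely take the request start time from the caller so that network latency before the first chunk is included.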

5-Minute Quick Start

pip install instrumentation-sdk
llm-observe start

Then add one line to your app:

from instrumentation_sdk import init_auto_instrumentation
init_auto_instrumentation()

Open Grafana at http://localhost:3002 — spans appear within 5–10 seconds.

SDK Feature Map

instrumentation-sdk & temporal-ewma-worker
│
├── Auto-Instrumentation        → zero-code patching
│   ├── OpenAI
│   ├── Anthropic
│   ├── LiteLLM
│   └── LangChain
│
├── Manual Instrumentation
│   ├── @llm_observe            → decorator
│   ├── llm_span                → context manager
│   └── llm_span_with_tokens    → context manager + pre-call token count
│
├── Streaming Observability
│   ├── wrap_stream             → sync TTFT tracking
│   └── wrap_async_stream       → async TTFT tracking
│
├── Security
│   ├── PII Scanning            → Aho-Corasick + regex redaction
│   └── Injection Detection     → SQL / prompt-override patterns
│
├── Sampling
│   └── Deterministic Gate      → SHA-256 % 100 (1% sampled)
│
├── Embeddings
│   └── MiniLM                  → async 384-dim prompt embeddings
│
├── Cost Anomaly Detection
│   └── Temporal EWMA worker    → decoupled scheduled baseline computing
│
└── Observability Backend
    ├── Prometheus Metrics       → 8 metric families
    ├── Grafana Dashboards       → 4 pre-built dashboards
    └── Tempo Traces             → OTLP distributed tracing
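
The "Deterministic Gate" entry in the feature map describes a SHA-256 modulo-100 sampling decision. A minimal sketch of that idea follows, assuming the hashed key is a trace or request identifier and that a bucket below the configured percentage means "sampled". The function name `is_sampled` and its parameters are hypothetical, and the SDK's actual key and comparison may differ.

```python
import hashlib


def is_sampled(key: str, rate_percent: int = 1) -> bool:
    """Return True when the SHA-256 bucket of `key` falls below `rate_percent`.

    Hypothetical helper for illustration; not part of the SDK.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the 256-bit digest onto a bucket in the range 0..99.
    bucket = int.from_bytes(digest, "big") % 100
    return bucket < rate_percent
```

Because the decision depends only on the key, every process that sees the same identifier makes the same choice, so a sampled trace stays complete across services without any shared state.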

Documentation Pages

Page	What it covers
Installation & Quick Start	Install, first span, verify it works
Auto-Instrumentation	Zero-code patching for OpenAI, Anthropic, LiteLLM, LangChain
Manual Spans — Decorator	`@llm_observe` decorator usage
Manual Spans — Context Manager	`llm_span` / `llm_span_with_tokens` context managers
Streaming Observability	TTFT tracking, `wrap_stream`, `wrap_async_stream`
PII & Injection Scanning	Aho-Corasick redaction, scan API
Deterministic Sampling	SHA-256 modulo-100 gate
MiniLM Embeddings	Async 384-dim prompt embeddings
Prometheus Metrics & Grafana	Cost, latency, TTFT dashboards
Temporal EWMA Cost Anomaly Detection	Decoupled EWMA baseline computing & cost anomaly detection worker
REST Management API	Full endpoint reference
Docker & CLI Deployment	`llm-observe` CLI, all-in-one container
Config Files Reference	Model prices, PII patterns, infra configs
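
As a rough picture of what an EWMA cost baseline involves, the sketch below keeps an exponentially weighted mean and variance over a cost series and flags points that deviate strongly from the baseline. It is an assumption-laden illustration: the function name `ewma_anomalies`, the `alpha`, `threshold`, and `warmup` parameters, and the "k standard deviations" rule are all hypothetical, and the worker's real algorithm and scheduling are covered on the Temporal EWMA page.

```python
from typing import Iterable, List, Optional, Tuple


def ewma_anomalies(
    costs: Iterable[float],
    alpha: float = 0.3,
    threshold: float = 3.0,
    warmup: int = 5,
) -> List[Tuple[int, float]]:
    """Return (index, cost) pairs that deviate from an EWMA baseline.

    Hypothetical helper for illustration; not the worker's implementation.
    """
    mean: Optional[float] = None
    var = 0.0
    flagged: List[Tuple[int, float]] = []
    for i, cost in enumerate(costs):
        if mean is None:
            mean = cost  # seed the baseline with the first observation
            continue
        deviation = cost - mean
        std = var ** 0.5
        if i >= warmup and std > 0 and abs(deviation) > threshold * std:
            flagged.append((i, cost))
            continue  # keep anomalies out of the baseline
        # Standard incremental updates for an exponentially weighted
        # mean and variance.
        mean += alpha * deviation
        var = (1 - alpha) * (var + alpha * deviation ** 2)
    return flagged
```

Skipping the update for flagged points is one design choice among several. It stops a single spike from inflating the baseline, at the cost of adapting more slowly to a genuine step change in spend.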

Current Version

1.8.2 — see Changelog