Open source evaluation orchestration for AI systems.
EvalHub is a platform for running systematic evaluations of models, agents, and AI systems across multiple frameworks, without locking you into any single one.
Run evaluations against any registered benchmark, whether built-in or one you create yourself, track experiments in MLflow, and store immutable results as OCI artefacts.
It works locally for development and scales on Kubernetes for production.
| Repository | Description |
|---|---|
| eval-hub | Go REST API server: evaluation orchestration, provider registry, benchmark discovery, collection management |
| eval-hub-sdk | Python SDK: async/sync clients, adapter framework (BYOF), CLI tools, MCP server for agent integration |
| eval-hub-contrib | Community-contributed framework adapters (LightEval, GuideLLM, MTEB, and more) |
| eval-hub.github.io | Documentation site: architecture, guides, SDK reference |
- Versioned REST API (v1) with OpenAPI specification and interactive docs
- Provider registry with benchmark discovery and category filtering
- Benchmark collections with weighted scoring and compliance requirements
- Bring Your Own Framework (BYOF): implement a single method to add any evaluation framework
- Kubernetes-native job orchestration with resource isolation
- MLflow integration for experiment tracking, lineage, and result comparison
- OCI artefact persistence for reproducible, immutable evaluation results
- Multi-provider batching groups compatible benchmarks to reduce execution time
- Prometheus metrics and OpenTelemetry tracing
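
To make two of the features above more concrete, here is a minimal sketch of a single-method adapter and of weighted scoring across a benchmark collection. It does not use the real eval-hub-sdk API: `FrameworkAdapter`, `run_benchmark`, `ConstantAdapter`, and `weighted_score` are hypothetical names chosen for illustration, and the weighted mean shown is only one plausible way a collection score could be computed.

```python
from abc import ABC, abstractmethod


class FrameworkAdapter(ABC):
    """Hypothetical adapter base class; NOT the eval-hub-sdk interface."""

    @abstractmethod
    def run_benchmark(self, benchmark_id: str, model: str) -> float:
        """Run one benchmark against a model and return a score in [0, 1]."""


class ConstantAdapter(FrameworkAdapter):
    """Toy adapter that returns canned scores instead of calling a framework."""

    def __init__(self, scores: dict[str, float]) -> None:
        self._scores = scores

    def run_benchmark(self, benchmark_id: str, model: str) -> float:
        # A real adapter would invoke the wrapped evaluation framework here.
        return self._scores[benchmark_id]


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of benchmark scores, normalised by the total weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weight for name, weight in weights.items()) / total


# Assumed example data: two benchmarks, with reasoning weighted 3x safety.
adapter = ConstantAdapter({"reasoning": 0.8, "safety": 0.5})
weights = {"reasoning": 3.0, "safety": 1.0}
scores = {name: adapter.run_benchmark(name, model="demo-model") for name in weights}
collection_score = weighted_score(scores, weights)
print(f"{collection_score:.3f}")  # prints 0.725
```

Keeping the adapter surface to one abstract method means a new framework only has to translate a benchmark identifier and a model reference into a score; everything else (weighting, aggregation) stays outside the adapter.
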
| Adapter | What it evaluates |
|---|---|
| lm-eval-harness | 167 benchmarks across 12 categories (reasoning, math, science, safety, ...) |
| LightEval | Accuracy, normalised accuracy, exact match |
| GuideLLM | TTFT, ITL, throughput, latency |
| MTEB | Semantic similarity, retrieval, classification |
| Garak | OWASP Top 10, vulnerability scanning, safety probes |
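
For readers unfamiliar with the performance metrics in the GuideLLM row, the sketch below computes them from per-token timestamps using their commonly used definitions: TTFT is time to first token, and ITL is the mean inter-token latency. This is an illustration only; `latency_metrics` is a hypothetical helper, the timestamps are made up, and GuideLLM's own definitions may differ in detail.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Derive latency metrics from a request start time and token arrival times (seconds)."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start        # time to first token
    latency = token_times[-1] - request_start    # end-to-end request latency
    # Gaps between consecutive tokens; empty when only one token was produced.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0  # mean inter-token latency
    throughput = len(token_times) / latency if latency > 0 else 0.0  # tokens per second
    return {"ttft": ttft, "itl": itl, "latency": latency, "throughput": throughput}


# Assumed example: request sent at t=0.0, five tokens arriving 0.25 s apart.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics["ttft"], metrics["itl"], metrics["latency"])  # prints 0.5 0.25 1.5
```
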
Full documentation is available at eval-hub.github.io:
- Overview — architecture and core concepts
- Installation — install server and SDK
- Quick start — run your first evaluation
- Python SDK reference — client and adapter API
- Building adapters — create your own framework adapter
Apache 2.0 — see LICENCE.