
Benchmarks API

A/B benchmarking endpoints. Create experiments, manage lifecycle, and retrieve statistical comparison results.

POST /api/benchmarks

Create a new benchmark.

Request Body

```json
{
  "name": "GPT-4o vs Claude Sonnet",
  "description": "Compare cost and latency for customer support agent",
  "agentId": "support-agent",
  "variants": [
    { "name": "GPT-4o", "tag": "v-gpt4o", "description": "OpenAI GPT-4o" },
    { "name": "Claude Sonnet", "tag": "v-claude-sonnet", "description": "Anthropic Claude 3 Sonnet" }
  ],
  "metrics": ["avg_cost", "avg_latency", "error_rate", "completion_rate"],
  "minSessionsPerVariant": 30,
  "timeRange": {
    "from": "2026-02-01T00:00:00Z",
    "to": "2026-02-28T23:59:59Z"
  }
}
```

Body Fields

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Benchmark name |
| description | string | No | Optional description |
| agentId | string | No | Scope to a specific agent |
| variants | array | Yes | 2–10 variants to compare |
| variants[].name | string | Yes | Variant display name |
| variants[].tag | string | Yes | Session tag for this variant |
| variants[].description | string | No | Variant description |
| variants[].agentId | string | No | Override agent ID for this variant |
| metrics | array | No | Metrics to compare (default: all except health_score) |
| minSessionsPerVariant | number | No | Minimum sessions for meaningful results |
| timeRange | object | No | Time range filter for sessions |

Valid Metrics

error_rate, avg_cost, avg_latency, tool_success_rate, completion_rate, avg_tokens, avg_duration

health_score is not yet supported for benchmarks.

Response (201)

Returns the created benchmark object with generated id and status: "draft".

Errors

| Status | Condition |
|---|---|
| 400 | Validation error (missing name, <2 variants, invalid metric, etc.) |
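The create-time rules above (required name, 2–10 variants, the fixed metric list) can be checked client-side before issuing the request. The following is a minimal sketch of such a validator; `validate_benchmark` is a hypothetical helper, not part of this API, and it mirrors only the rules documented here.

```python
# Client-side pre-validation of a POST /api/benchmarks body, mirroring
# the documented rules: required name, 2-10 variants, valid metric names.
VALID_METRICS = {
    "error_rate", "avg_cost", "avg_latency", "tool_success_rate",
    "completion_rate", "avg_tokens", "avg_duration",
}

def validate_benchmark(body: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the body looks valid."""
    errors = []
    if not body.get("name"):
        errors.append("name is required")
    variants = body.get("variants", [])
    if not 2 <= len(variants) <= 10:
        errors.append("variants must contain 2-10 entries")
    for i, v in enumerate(variants):
        if not v.get("name") or not v.get("tag"):
            errors.append(f"variants[{i}] needs both name and tag")
    for m in body.get("metrics", []):
        if m not in VALID_METRICS:
            errors.append(f"invalid metric: {m}")
    return errors
```

A body that fails any of these checks would be rejected by the server with a 400, so validating early avoids a round trip.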

GET /api/benchmarks

List benchmarks.

Query Parameters

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| status | string | | | Filter by status: draft, running, completed, cancelled |
| agentId | string | | | Filter by agent ID |
| limit | number | 20 | 1–100 | Results per page |
| offset | number | 0 | ≥ 0 | Pagination offset |

Response (200)

```json
{
  "benchmarks": [
    {
      "id": "bench_abc123",
      "name": "GPT-4o vs Claude Sonnet",
      "description": "...",
      "status": "running",
      "agentId": "support-agent",
      "metrics": ["avg_cost", "avg_latency"],
      "variants": [
        { "id": "var_001", "name": "GPT-4o", "tag": "v-gpt4o" },
        { "id": "var_002", "name": "Claude Sonnet", "tag": "v-claude-sonnet" }
      ],
      "createdAt": "2026-02-01T00:00:00.000Z",
      "updatedAt": "2026-02-05T10:00:00.000Z"
    }
  ],
  "total": 5,
  "hasMore": false
}
```
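The limit/offset parameters and the hasMore flag support straightforward pagination. A minimal sketch, with the HTTP call abstracted behind a caller-supplied `fetch_page(limit, offset)` (a hypothetical stand-in for a GET of /api/benchmarks returning the response shape above):

```python
def iter_benchmarks(fetch_page, limit=20):
    """Yield benchmark objects across all pages.

    fetch_page(limit, offset) must return the documented list response,
    i.e. a dict with "benchmarks" and "hasMore" keys.
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page["benchmarks"]
        if not page["hasMore"]:
            break
        offset += limit
```

Injecting the fetch function keeps the pagination logic independent of any particular HTTP client and makes it easy to test against canned pages.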

GET /api/benchmarks/:id

Get benchmark detail. Includes per-variant session counts.

Path Parameters

| Parameter | Type | Description |
|---|---|---|
| id | string | Benchmark ID |

Response (200)

Returns the benchmark object with variants enriched with sessionCount.

```json
{
  "id": "bench_abc123",
  "name": "GPT-4o vs Claude Sonnet",
  "status": "running",
  "variants": [
    { "id": "var_001", "name": "GPT-4o", "tag": "v-gpt4o", "sessionCount": 25 },
    { "id": "var_002", "name": "Claude Sonnet", "tag": "v-claude-sonnet", "sessionCount": 31 }
  ],
  "metrics": ["avg_cost", "avg_latency", "error_rate", "completion_rate"],
  "minSessionsPerVariant": 30,
  "createdAt": "2026-02-01T00:00:00.000Z",
  "updatedAt": "2026-02-05T10:00:00.000Z"
}
```

Errors

| Status | Condition |
|---|---|
| 404 | Benchmark not found |

PUT /api/benchmarks/:id/status

Transition benchmark status.

Path Parameters

| Parameter | Type | Description |
|---|---|---|
| id | string | Benchmark ID |

Request Body

```json
{
  "status": "running"
}
```

Valid Transitions

| From | To |
|---|---|
| draft | running, cancelled |
| running | completed, cancelled |

Pre-conditions

  • draft → running: Each variant must have at least 1 session.
  • running → completed: Triggers result computation and caching.
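The transition table is a small state machine; a client can check a move locally before calling the endpoint. A sketch (the server remains authoritative, and this covers only the transitions documented above):

```python
# Allowed status transitions for PUT /api/benchmarks/:id/status.
# completed and cancelled are terminal states.
VALID_TRANSITIONS = {
    "draft": {"running", "cancelled"},
    "running": {"completed", "cancelled"},
    "completed": set(),
    "cancelled": set(),
}

def can_transition(current: str, target: str) -> bool:
    """True if the documented state machine allows current -> target."""
    return target in VALID_TRANSITIONS.get(current, set())
```

Note that even an allowed draft → running move can still fail with a 409 if any variant has zero sessions.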

Response (200)

Returns the updated benchmark object.

Errors

| Status | Condition |
|---|---|
| 400 | Invalid status value |
| 404 | Benchmark not found |
| 409 | Invalid transition (e.g., completed → running) or no sessions for a variant |

GET /api/benchmarks/:id/results

Get statistical comparison results.

Path Parameters

| Parameter | Type | Description |
|---|---|---|
| id | string | Benchmark ID |

Query Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| includeDistributions | boolean | false | Include raw value arrays for distribution charts |

Response (200)

```json
{
  "benchmarkId": "bench_abc123",
  "tenantId": "default",
  "variants": [
    {
      "variantId": "var_001",
      "variantName": "GPT-4o",
      "sessionCount": 50,
      "metrics": {
        "avg_cost": { "mean": 0.032, "median": 0.028, "stddev": 0.012, "min": 0.005, "max": 0.089, "count": 50 },
        "avg_latency": { "mean": 1200, "median": 1100, "stddev": 300, "min": 400, "max": 2500, "count": 50 }
      }
    },
    {
      "variantId": "var_002",
      "variantName": "Claude Sonnet",
      "sessionCount": 48,
      "metrics": {
        "avg_cost": { "mean": 0.021, "median": 0.019, "stddev": 0.008, "min": 0.003, "max": 0.052, "count": 48 },
        "avg_latency": { "mean": 890, "median": 820, "stddev": 250, "min": 300, "max": 1800, "count": 48 }
      }
    }
  ],
  "comparisons": [
    {
      "metric": "avg_cost",
      "variantA": { "id": "var_001", "name": "GPT-4o", "stats": { "..." : "..." } },
      "variantB": { "id": "var_002", "name": "Claude Sonnet", "stats": { "..." : "..." } },
      "absoluteDiff": -0.011,
      "percentDiff": -34.4,
      "testType": "welch_t",
      "testStatistic": 3.12,
      "pValue": 0.0023,
      "confidenceInterval": { "lower": -0.018, "upper": -0.004 },
      "effectSize": 0.89,
      "significant": true,
      "winner": "Claude Sonnet",
      "confidence": "★★★"
    }
  ],
  "summary": "Claude Sonnet wins on avg_cost (p=0.002) and avg_latency (p=0.016). No significant difference on error_rate or completion_rate.",
  "computedAt": "2026-02-08T10:30:00.000Z"
}
```

Response Fields — Comparisons

| Field | Type | Description |
|---|---|---|
| metric | string | Metric being compared |
| testType | string | welch_t (continuous) or chi_squared (rates) |
| testStatistic | number | Test statistic value |
| pValue | number | p-value |
| confidenceInterval | object | 95% CI for the difference |
| effectSize | number | Cohen's d or phi coefficient |
| significant | boolean | true if p < 0.05 |
| winner | string \| undefined | Variant name of the winner (if significant) |
| confidence | string | ★★★ (p < 0.01), ★★ (p < 0.05), ★ (p < 0.1), empty otherwise |
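For continuous metrics, the testStatistic can be reproduced from the per-variant stats in the response. The sketch below computes Welch's t statistic with the Welch–Satterthwaite degrees of freedom from means, variances, and sample sizes, plus the star mapping for the confidence field (assuming, per the table above, that ★ covers p < 0.1 and an empty string marks non-significant results). This is an illustration of the documented test, not the server's implementation.

```python
import math

def welch_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for the difference of two means with unequal variances."""
    se2_a, se2_b = var_a / n_a, var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df

def confidence_stars(p: float) -> str:
    """Map a p-value to the confidence field's star rating."""
    if p < 0.01:
        return "★★★"
    if p < 0.05:
        return "★★"
    if p < 0.1:
        return "★"
    return ""
```

Converting the t statistic to a p-value requires a t-distribution CDF (e.g. scipy.stats), which is omitted here to keep the sketch dependency-free.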

Errors

| Status | Condition |
|---|---|
| 400 | Benchmark is still in draft status |
| 404 | Benchmark not found |

DELETE /api/benchmarks/:id

Delete a benchmark. Only draft and cancelled benchmarks can be deleted.

Path Parameters

| Parameter | Type | Description |
|---|---|---|
| id | string | Benchmark ID |

Response (204)

No content.

Errors

| Status | Condition |
|---|---|
| 404 | Benchmark not found |
| 409 | Benchmark is running or completed |

Released under the MIT License.