LongMemEval
The cross-SDK benchmark harness that drives every memory daemon through the same question set and scores the results.
LongMemEval is the long-horizon memory benchmark jeffs-brain adopts to verify that its TypeScript, Go, and Python SDKs answer retrieval-grounded questions at parity. The eval/ harness in the repo spawns each SDK’s memory serve daemon, drives a JSONL of questions through the shared POST /v1/brains/{brainId}/ask SSE contract, scores the streamed answer, and writes a per-SDK result matrix.
The target is a 93.4% pass rate against the TypeScript baseline. The floor below which the harness fails a run is configurable and defaults to 0.90.
Layout
- eval/runner.py: the Click CLI entry point. Loads a dataset, starts the chosen SDK's daemon, POSTs each question to /v1/brains/{brain}/ask, folds answer_delta/token events into the final answer, collects citation events, scores the result, and writes results/<date>/<sdk>.json.
- eval/datasets/: JSONL fixtures plus the full 500-question LongMemEval-S binary.
- eval/scorer/: two scorers: ExactScorer (deterministic substring match) and JudgeScorer (OpenAI LLM-as-judge with a USD budget guard).
- eval/scripts/run_tri_lme.sh: end-to-end tri-SDK orchestration: extract once, spawn all three daemons on ports 18850–18852, run memory eval lme run per SDK in parallel, and write a summary.
- eval/sdks/{ts,go,py}.py: per-SDK runner registry. Implement SdkRunner.build_command to plug in another SDK.
Install
cd ~/code/jeffs-brain/memory/eval
uv sync
Plain pip install -e .[dev] works too; the backend is hatchling.
Environment
| Variable | Purpose |
|---|---|
| JB_LLM_PROVIDER | Pin the daemon’s provider: openai, anthropic, ollama, fake. |
| JB_LLM_MODEL | Pin the reader model. |
| OPENAI_API_KEY | Required for OpenAI readers and the default judge. |
| JB_EVAL_JUDGE_MODEL | Override the judge (default gpt-4o). |
| JB_EVAL_BUDGET_USD | Fail-fast spend cap enforced in the judge scorer. |
| OLLAMA_HOST | Default http://localhost:11434. |
| ANTHROPIC_API_KEY | When the daemon runs Anthropic. |
| JB_LME_JUDGE_MODEL | Go LME-specific judge override. |
| JB_LME_ACTOR_MODEL | Go LME-specific actor override. |
Running a smoke benchmark
The smoke.jsonl fixture is provider-agnostic, runs against an empty brain, and grades the LLM’s direct answer — no API cost when paired with Ollama.
uv run python runner.py --sdk ts --dataset datasets/smoke.jsonl --scorer exact
uv run python runner.py --sdk go --dataset datasets/smoke.jsonl --scorer exact
uv run python runner.py --sdk py --dataset datasets/smoke.jsonl --scorer exact
Each run prints a one-line summary and a result path:
ts: 19/20 pass_rate=0.95 mean_score=0.95 -> results/2026-04-19/ts.json
If pass_rate drops below --floor, the process exits 1.
Running the judge benchmark
OPENAI_API_KEY=sk-... \
uv run python runner.py --sdk ts --dataset datasets/lme.jsonl --scorer judge
Every flag:
| Flag | Default | Purpose |
|---|---|---|
| --sdk {ts,go,py} | required | Which SDK daemon to drive. |
| --mode {direct,agentic} | direct | Ask-flow mode. |
| --dataset | required | Path to the JSONL fixture. |
| --scorer {exact,judge} | exact | Scorer. |
| --limit | none | Stop after N questions (useful for smoke). |
| --output | auto | Override the output path. |
| --port | 0 | Daemon port (0 picks a free port). |
| --floor | 0.90 | Minimum pass rate; below this the run fails. |
| --brain | eval | Brain id the daemon reads from. |
| --top-k | 8 | Top-k passed to the ask endpoint. |
Full LongMemEval replay (Go)
The 500-question replay lives in the Go SDK:
memory eval lme run \
--dataset longmemeval_s.json \
--ingest-mode replay \
--concurrency 8 \
--judge claude-haiku-4-5 \
--actor gpt-4o \
--max-cost-usd 20 \
--output lme-go.json
Replay mode reconstructs the corpus by replaying each session through the SDK’s extract stage, lets the agentic or direct ask loop answer every question, then scores with the configured judge.
Plugging in your own SDK
The harness is plug-and-play. Subclass SdkRunner:
- Drop in sdks/<name>.py implementing build_command(port), workdir, and the daemon’s /healthz expectations.
- Register it in sdks/__init__.py:get_runner.
- Add the value to the --sdk Click choice in runner.py.
- Add a matrix entry to .github/workflows/eval-nightly.yml.
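As an illustration, a hypothetical sdks/rust.py runner might look like the sketch below. The RustRunner name, repo paths, and cargo command are all invented for this example; only build_command(port) and workdir come from the harness contract described above.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RustRunner:
    """Hypothetical runner for a Rust SDK (illustrative only)."""
    repo_root: Path = field(default_factory=lambda: Path.home() / "code" / "jeffs-brain")

    @property
    def workdir(self) -> Path:
        # Directory the daemon process is launched from.
        return self.repo_root / "memory" / "rust"

    def build_command(self, port: int) -> list[str]:
        # Command that starts this SDK's serve daemon on the given port.
        return ["cargo", "run", "--release", "--", "serve", "--port", str(port)]
```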
The wire contract the daemon must honour: POST /v1/brains/{brainId}/ask returning text/event-stream with retrieve, answer_delta, citation, done, and error events.
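A minimal sketch of how a client might fold that stream into a final answer and citation list. It assumes standard SSE event:/data: framing and a text field on answer_delta payloads; both are assumptions, not details confirmed by the contract above.

```python
import json


def fold_sse(stream_text: str) -> tuple[str, list[dict]]:
    """Fold a text/event-stream body into (final_answer, citations).

    Assumes `event:` / `data:` lines separated by blank lines, and a
    hypothetical `text` field on answer_delta payloads."""
    answer_parts: list[str] = []
    citations: list[dict] = []
    event, data_lines = None, []
    for line in stream_text.splitlines() + [""]:
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data_lines.append(line.split(":", 1)[1].strip())
        elif line == "" and event is not None:
            # Blank line terminates the event; dispatch it.
            payload = json.loads("\n".join(data_lines)) if data_lines else {}
            if event == "answer_delta":
                answer_parts.append(payload.get("text", ""))
            elif event == "citation":
                citations.append(payload)
            elif event == "error":
                raise RuntimeError(payload)
            event, data_lines = None, []
    return "".join(answer_parts), citations
```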
Scorers
ExactScorer reads item["expected_substrings"] and returns 1.0 if any expected substring appears in the answer (case-insensitive by default), else 0.0. No network traffic.
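A minimal sketch consistent with that description (not the actual ExactScorer implementation):

```python
def exact_score(item: dict, answer: str, case_sensitive: bool = False) -> float:
    """Return 1.0 if any expected substring occurs in the answer, else 0.0.

    Sketch of the ExactScorer contract described above: case-insensitive
    by default, no network traffic."""
    haystack = answer if case_sensitive else answer.lower()
    for needle in item["expected_substrings"]:
        if (needle if case_sensitive else needle.lower()) in haystack:
            return 1.0
    return 0.0
```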
JudgeScorer sends {question, reference_answer, candidate} to OpenAI Chat Completions in strict JSON-object mode at temperature=0.0. The default model is gpt-4o ($2.50 input / $10 output per 1M tokens). The budget is enforced through JB_EVAL_BUDGET_USD; exceeding it raises BudgetExceededError and halts the run.
Runner-level pass threshold is score >= 0.5, independent of the scorer’s granularity.
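The budget guard can be sketched as a running-cost accumulator. The per-token prices below are the gpt-4o figures quoted above; the BudgetGuard class and its charge API are invented for illustration, not the scorer's actual code.

```python
class BudgetExceededError(RuntimeError):
    """Raised when cumulative judge spend exceeds the USD cap."""


class BudgetGuard:
    """Fail-fast spend cap, sketching the JB_EVAL_BUDGET_USD behaviour."""

    def __init__(self, cap_usd: float,
                 in_per_m: float = 2.50, out_per_m: float = 10.0):
        self.cap_usd, self.spent_usd = cap_usd, 0.0
        self.in_per_m, self.out_per_m = in_per_m, out_per_m

    def charge(self, prompt_tokens: int, completion_tokens: int) -> None:
        # Accumulate USD cost from token counts, then enforce the cap.
        self.spent_usd += (prompt_tokens * self.in_per_m
                           + completion_tokens * self.out_per_m) / 1_000_000
        if self.spent_usd > self.cap_usd:
            raise BudgetExceededError(
                f"spent ${self.spent_usd:.4f} > cap ${self.cap_usd:.2f}")
```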
Datasets
Every dataset line is a JSON object with id, question, expected_substrings (required for exact), reference_answer (required for judge), optional tags. Blank lines and lines starting with # are skipped.
| Dataset | Purpose |
|---|---|
| smoke.jsonl | 20 provider-agnostic factual questions. Fast, no API cost with Ollama. |
| lme.jsonl | 100-question benchmark spanning facts, definitions, temporal, procedural, and memory-retrieval concepts. Ollama-friendly. |
| longmemeval_s.json | Upstream LongMemEval-S, 500 questions. Run via the Go replay path. |
Add a new dataset by dropping <name>.jsonl into datasets/ and passing --dataset datasets/<name>.jsonl. No code change.
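The per-line rules can be sketched as a small loader. The helper name and validation details are illustrative, not the harness's actual code; the skip rules and required fields come from the description above.

```python
import json

# Field each scorer requires on every dataset line (from the rules above).
REQUIRED_BY_SCORER = {"exact": "expected_substrings", "judge": "reference_answer"}


def load_dataset(path: str, scorer: str = "exact") -> list[dict]:
    """Load a JSONL dataset, skipping blank lines and `#` comments, and
    check the field the chosen scorer requires."""
    items = []
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # blank lines and comments are skipped
            item = json.loads(line)
            for key in ("id", "question", REQUIRED_BY_SCORER[scorer]):
                if key not in item:
                    raise ValueError(f"{item.get('id', '?')}: missing {key}")
            items.append(item)
    return items
```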
Interpreting results
runner.py writes an EvalScore JSON at results/<date>/<sdk>.json:
{
"sdk": "ts",
"scorer": "exact",
"total": 20,
"passed": 19,
"pass_rate": 0.95,
"mean_score": 0.95,
"started_at": "2026-04-19T09:00:00Z",
"finished_at": "2026-04-19T09:00:17Z",
"brain": "eval",
"questions": [
{
"id": "q-001",
"question": "...",
"answer": "...",
"score": 1.0,
"passed": true,
"latency_ms": 471,
"citations": []
}
]
}
pass_rate is the gate against --floor. mean_score gives continuous quality where the judge returns fractional scores. Citations and the per-question error string are the debugging surface. The Go LME runner adds judge_verdict (correct, abstain_correct, etc.) and cost_accounting.total_usd for replay runs.
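The headline numbers can be recomputed from the questions array using the runner's score >= 0.5 pass threshold. This summarize helper is a sketch of the aggregation, not the runner's actual code.

```python
def summarize(questions: list[dict], floor: float = 0.90) -> dict:
    """Recompute pass_rate and mean_score from per-question records,
    applying the runner-level score >= 0.5 pass threshold."""
    total = len(questions)
    passed = sum(1 for q in questions if q["score"] >= 0.5)
    pass_rate = passed / total if total else 0.0
    return {
        "total": total,
        "passed": passed,
        "pass_rate": pass_rate,
        "mean_score": sum(q["score"] for q in questions) / total if total else 0.0,
        "gate_ok": pass_rate >= floor,  # the --floor gate
    }
```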
Cost and rate limits
Running full LongMemEval with gpt-4o-mini as reader and judge across three SDKs on 500 questions lands at roughly $3–$5 per day, or $100–$150 per month if you run nightly. The Go replay path takes --max-cost-usd 20 by default and aborts when the cumulative spend exceeds it.
Judge calls are serial in the Python runner, which keeps judge throughput well under the OpenAI rate limit. Tri-SDK replay concurrency defaults to 16.
Cross-SDK smoke results
Recent tri-SDK smoke run against gemma3:latest on Ollama: TypeScript, Go, and Python all at 19/20 (95%), with p50 latencies 407–471 ms and p95 630–836 ms. Full write-ups live under eval/results/cross-sdk/ in the repo.