LongMemEval
The cross-SDK benchmark harness that drives every memory daemon through the same question set and scores the results.
LongMemEval is the long-horizon memory benchmark jeffs-brain adopts to verify that its TypeScript, Go, and Python SDKs answer retrieval-grounded questions at parity. The eval/ harness in the repo spawns each SDK’s memory serve daemon, drives a JSONL of questions through the shared POST /v1/brains/{brainId}/ask SSE contract, scores the streamed answer, and writes a per-SDK result matrix.
The target is a 93.4% pass rate against the TypeScript baseline. The floor below which the harness fails a run is configurable and defaults to 0.90.
Layout
- eval/runner.py: the Click CLI entry point. Loads a dataset, starts the chosen SDK's daemon, POSTs each question to /v1/brains/{brain}/ask, folds answer_delta/token events into the final answer, collects citation events, scores the result, and writes results/<date>/<sdk>.json.
- eval/datasets/: JSONL fixtures plus the full 500-question LongMemEval-S binary.
- eval/scorer/: two scorers: ExactScorer (deterministic substring match) and JudgeScorer (OpenAI LLM-as-judge with a USD budget guard).
- eval/scripts/run_tri_lme.sh: end-to-end tri-SDK orchestration: extract once, spawn all three daemons on ports 18850–18852, run memory eval lme run per SDK in parallel, and write a summary.
- eval/sdks/{ts,go,py}.py: per-SDK runner registry. Implement SdkRunner.build_command to plug in another SDK.
Install
cd ~/code/jeffs-brain/memory/eval
uv sync
Plain pip install -e .[dev] works too; the backend is hatchling.
Environment
| Variable | Purpose |
|---|---|
| JB_LLM_PROVIDER | Pin the daemon’s provider: openai, anthropic, ollama, fake. |
| JB_LLM_MODEL | Pin the reader model. |
| OPENAI_API_KEY | Required for OpenAI readers and the default judge. |
| JB_EVAL_JUDGE_MODEL | Override the judge (default gpt-4o). |
| JB_EVAL_BUDGET_USD | Fail-fast spend cap enforced in the judge scorer. |
| OLLAMA_HOST | Default http://localhost:11434. |
| ANTHROPIC_API_KEY | When the daemon runs Anthropic. |
| JB_LME_JUDGE_MODEL | Go LME-specific judge override. |
| JB_LME_ACTOR_MODEL | Go LME-specific actor override. |
Running a smoke benchmark
The smoke.jsonl fixture is provider-agnostic, runs against an empty brain, and grades the LLM’s direct answer — no API cost when paired with Ollama.
uv run python runner.py --sdk ts --dataset datasets/smoke.jsonl --scorer exact
uv run python runner.py --sdk go --dataset datasets/smoke.jsonl --scorer exact
uv run python runner.py --sdk py --dataset datasets/smoke.jsonl --scorer exact
Each run prints a one-line summary and a result path:
ts: 19/20 pass_rate=0.95 mean_score=0.95 -> results/2026-04-19/ts.json
If pass_rate drops below --floor, the process exits 1.
Running the judge benchmark
OPENAI_API_KEY=sk-... \
uv run python runner.py --sdk ts --dataset datasets/lme.jsonl --scorer judge
Every flag:
| Flag | Default | Purpose |
|---|---|---|
| --sdk {ts,go,py} | required | Which SDK daemon to drive. |
| --mode {direct,agentic} | direct | Ask-flow mode. |
| --dataset | required | Path to the JSONL fixture. |
| --scorer {exact,judge} | exact | Scorer. |
| --limit | none | Stop after N questions (useful for smoke). |
| --output | auto | Override the output path. |
| --port | 0 | Daemon port (0 picks a free port). |
| --floor | 0.90 | Minimum pass rate; below this the run fails. |
| --brain | eval | Brain id the daemon reads from. |
| --top-k | 8 | Top-k passed to the ask endpoint. |
Full LongMemEval replay (Go)
The 500-question replay lives in the Go SDK:
memory eval lme run \
--dataset longmemeval_s.json \
--ingest-mode replay \
--concurrency 8 \
--judge claude-haiku-4-5 \
--actor gpt-4o \
--max-cost-usd 20 \
--output lme-go.json
Replay mode reconstructs the corpus by replaying each session through the SDK’s extract stage, lets the agentic or direct ask loop answer every question, then scores with the configured judge.
Plugging in your own SDK
The harness is plug-and-play. Subclass SdkRunner:
- Drop in sdks/<name>.py implementing build_command(port), workdir, and the daemon’s /healthz expectations.
- Register it in sdks/__init__.py:get_runner.
- Add the value to the --sdk Click choice in runner.py.
- Add a matrix entry to .github/workflows/eval-nightly.yml.
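As an illustration, a hypothetical sdks/rust.py runner might look like the sketch below. The RustRunner name, repo paths, and cargo command are all invented for this example; only build_command(port) and workdir come from the harness contract described above.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RustRunner:
    """Hypothetical runner for a Rust SDK (illustrative only)."""
    repo_root: Path = field(default_factory=lambda: Path.home() / "code" / "jeffs-brain")

    @property
    def workdir(self) -> Path:
        # Directory the daemon process is launched from.
        return self.repo_root / "memory" / "rust"

    def build_command(self, port: int) -> list[str]:
        # Command that starts this SDK's serve daemon on the given port.
        return ["cargo", "run", "--release", "--", "serve", "--port", str(port)]
```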
The wire contract the daemon must honour: POST /v1/brains/{brainId}/ask returning text/event-stream with retrieve, answer_delta, citation, done, and error events.
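A minimal sketch of how a client might fold that stream into a final answer and citation list. It assumes standard SSE event:/data: framing and a text field on answer_delta payloads; both are assumptions, not details confirmed by the contract above.

```python
import json


def fold_sse(stream_text: str) -> tuple[str, list[dict]]:
    """Fold a text/event-stream body into (final_answer, citations).

    Assumes `event:` / `data:` lines separated by blank lines, and a
    hypothetical `text` field on answer_delta payloads."""
    answer_parts: list[str] = []
    citations: list[dict] = []
    event, data_lines = None, []
    for line in stream_text.splitlines() + [""]:
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data_lines.append(line.split(":", 1)[1].strip())
        elif line == "" and event is not None:
            # Blank line terminates the event; dispatch it.
            payload = json.loads("\n".join(data_lines)) if data_lines else {}
            if event == "answer_delta":
                answer_parts.append(payload.get("text", ""))
            elif event == "citation":
                citations.append(payload)
            elif event == "error":
                raise RuntimeError(payload)
            event, data_lines = None, []
    return "".join(answer_parts), citations
```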
Scorers
ExactScorer reads item["expected_substrings"] and returns 1.0 if any expected substring appears in the answer (case-insensitive by default), else 0.0. No network traffic.
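A minimal sketch consistent with that description (not the actual ExactScorer implementation):

```python
def exact_score(item: dict, answer: str, case_sensitive: bool = False) -> float:
    """Return 1.0 if any expected substring occurs in the answer, else 0.0.

    Sketch of the ExactScorer contract described above: case-insensitive
    by default, no network traffic."""
    haystack = answer if case_sensitive else answer.lower()
    for needle in item["expected_substrings"]:
        if (needle if case_sensitive else needle.lower()) in haystack:
            return 1.0
    return 0.0
```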
JudgeScorer sends {question, reference_answer, candidate} to OpenAI Chat Completions in strict JSON-object mode at temperature=0.0. The default model is gpt-4o ($2.50 input / $10 output per 1M tokens). The budget is enforced through JB_EVAL_BUDGET_USD; exceeding it raises BudgetExceededError and halts the run.
Runner-level pass threshold is score >= 0.5, independent of the scorer’s granularity.
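The budget guard can be sketched as a running-cost accumulator. The per-token prices below are the gpt-4o figures quoted above; the BudgetGuard class and its charge API are invented for illustration, not the scorer's actual code.

```python
class BudgetExceededError(RuntimeError):
    """Raised when cumulative judge spend exceeds the USD cap."""


class BudgetGuard:
    """Fail-fast spend cap, sketching the JB_EVAL_BUDGET_USD behaviour."""

    def __init__(self, cap_usd: float,
                 in_per_m: float = 2.50, out_per_m: float = 10.0):
        self.cap_usd, self.spent_usd = cap_usd, 0.0
        self.in_per_m, self.out_per_m = in_per_m, out_per_m

    def charge(self, prompt_tokens: int, completion_tokens: int) -> None:
        # Accumulate USD cost from token counts, then enforce the cap.
        self.spent_usd += (prompt_tokens * self.in_per_m
                           + completion_tokens * self.out_per_m) / 1_000_000
        if self.spent_usd > self.cap_usd:
            raise BudgetExceededError(
                f"spent ${self.spent_usd:.4f} > cap ${self.cap_usd:.2f}")
```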
Datasets
Every dataset line is a JSON object with id, question, expected_substrings (required for exact), reference_answer (required for judge), optional tags. Blank lines and lines starting with # are skipped.
| Dataset | Purpose |
|---|---|
| smoke.jsonl | 20 provider-agnostic factual questions. Fast, no API cost with Ollama. |
| lme.jsonl | 100-question benchmark spanning facts, definitions, temporal, procedural, and memory-retrieval concepts. Ollama-friendly. |
| longmemeval_s.json | Upstream LongMemEval-S, 500 questions. Run via the Go replay path. |
Add a new dataset by dropping <name>.jsonl into datasets/ and passing --dataset datasets/<name>.jsonl. No code change.
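The per-line rules can be sketched as a small loader. The helper name and validation details are illustrative, not the harness's actual code; the skip rules and required fields come from the description above.

```python
import json

# Field each scorer requires on every dataset line (from the rules above).
REQUIRED_BY_SCORER = {"exact": "expected_substrings", "judge": "reference_answer"}


def load_dataset(path: str, scorer: str = "exact") -> list[dict]:
    """Load a JSONL dataset, skipping blank lines and `#` comments, and
    check the field the chosen scorer requires."""
    items = []
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # blank lines and comments are skipped
            item = json.loads(line)
            for key in ("id", "question", REQUIRED_BY_SCORER[scorer]):
                if key not in item:
                    raise ValueError(f"{item.get('id', '?')}: missing {key}")
            items.append(item)
    return items
```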
Interpreting results
runner.py writes an EvalScore JSON at results/<date>/<sdk>.json:
{
"sdk": "ts",
"scorer": "exact",
"total": 20,
"passed": 19,
"pass_rate": 0.95,
"mean_score": 0.95,
"started_at": "2026-04-19T09:00:00Z",
"finished_at": "2026-04-19T09:00:17Z",
"brain": "eval",
"questions": [
{
"id": "q-001",
"question": "...",
"answer": "...",
"score": 1.0,
"passed": true,
"latency_ms": 471,
"citations": []
}
]
}
pass_rate is the gate against --floor. mean_score gives continuous quality where the judge returns fractional scores. Citations and the per-question error string are the debugging surface. The Go LME runner adds judge_verdict (correct, abstain_correct, etc.) and cost_accounting.total_usd for replay runs.
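The headline numbers can be recomputed from the questions array using the runner's score >= 0.5 pass threshold. This summarize helper is a sketch of the aggregation, not the runner's actual code.

```python
def summarize(questions: list[dict], floor: float = 0.90) -> dict:
    """Recompute pass_rate and mean_score from per-question records,
    applying the runner-level score >= 0.5 pass threshold."""
    total = len(questions)
    passed = sum(1 for q in questions if q["score"] >= 0.5)
    pass_rate = passed / total if total else 0.0
    return {
        "total": total,
        "passed": passed,
        "pass_rate": pass_rate,
        "mean_score": sum(q["score"] for q in questions) / total if total else 0.0,
        "gate_ok": pass_rate >= floor,  # the --floor gate
    }
```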
Cost and rate limits
Running full LongMemEval with gpt-4o-mini as reader and judge across three SDKs on 500 questions lands at roughly $3–$5 per day, or $100–$150 per month if you run nightly. The Go replay path takes --max-cost-usd 20 by default and aborts when the cumulative spend exceeds it.
Judge calls are serial in the Python runner, which keeps judge throughput well under the OpenAI rate limit. Tri-SDK replay concurrency defaults to 16.
Cross-SDK smoke results
Recent tri-SDK smoke run against gemma3:latest on Ollama: TypeScript, Go, and Python all at 19/20 (95%), with p50 latencies 407–471 ms and p95 630–836 ms. Full write-ups live under eval/results/cross-sdk/ in the repo.