LongMemEval
The cross-SDK harness that drives every memory daemon through the same question set.
LongMemEval is the long-horizon memory benchmark jeffs-brain adopts to verify that its TypeScript, Go, and Python SDKs answer retrieval-grounded questions at parity.
There are two distinct benchmark surfaces:
- eval/runner.py drives the three shared daemon scenarios: ask-basic, ask-augmented, and search-retrieve-only.
- eval/scripts/run_tri_lme.sh runs the replay-backed tri-SDK benchmark in search-retrieve-only mode only. Extraction, evidence rendering, the augmented reader, judging, and manifests stay in the Go runner so daemon retrieval is the only SDK variable.
- Native memory eval lme commands remain SDK-local. Go is the reference replay runner, TypeScript supports single-SDK native flows, and Python participates through memory serve.
The pass-rate floor below which the harness fails a run is configurable through --floor and defaults to 0.90.
Layout
- eval/runner.py: the Click CLI entry point. Loads a dataset, starts the chosen SDK’s daemon, drives one shared scenario (ask-basic, ask-augmented, or search-retrieve-only), scores the result, and writes results/<date>/<sdk>.json.
- eval/datasets/: JSONL fixtures plus the full 500-question LongMemEval-S binary.
- eval/scorer/: two scorers, ExactScorer (deterministic substring match) and JudgeScorer (OpenAI LLM-as-judge with a USD budget guard).
- eval/scripts/run_tri_lme.sh: replay-backed tri-SDK orchestration. Extract once, spawn all three daemons on ports 18850-18852, run the Go LME runner against each daemon in retrieve-only mode, and write a summary.
- eval/sdks/{ts,go,py}.py: per-SDK runner registry. Implement SdkRunner.build_command to plug in another SDK.
Install
cd ~/code/jeffs-brain/memory/eval
uv sync
A plain pip install -e .[dev] works too; the build backend is hatchling.
Environment
| Variable | Purpose |
|---|---|
| JB_LLM_PROVIDER | Pin the daemon’s provider: openai, anthropic, ollama, fake. |
| JB_LLM_MODEL | Pin the reader model. |
| OPENAI_API_KEY | Required for OpenAI readers and the default judge. |
| JB_EVAL_JUDGE_MODEL | Override the judge (default gpt-4o). |
| JB_EVAL_BUDGET_USD | Fail-fast spend cap enforced in the judge scorer. |
| OLLAMA_HOST | Default http://localhost:11434. |
| ANTHROPIC_API_KEY | When the daemon runs Anthropic. |
| JB_LME_JUDGE_MODEL | Go LME-specific judge override. |
| JB_LME_ACTOR_MODEL | Go LME-specific actor override. |
Running a smoke benchmark
The smoke.jsonl fixture is provider-agnostic, runs against an empty brain,
and grades the daemon answer path. There is no judge API cost when paired
with the exact scorer.
uv run python runner.py --sdk ts --dataset datasets/smoke.jsonl --scorer exact
uv run python runner.py --sdk go --dataset datasets/smoke.jsonl --scorer exact
uv run python runner.py --sdk py --dataset datasets/smoke.jsonl --scorer exact
Each run prints a one-line summary and a result path:
<sdk>: <passed>/<total> pass_rate=<rate> mean_score=<score> -> results/<date>/<sdk>.json
If pass_rate drops below --floor, the process exits 1.
Running the judge benchmark
OPENAI_API_KEY=sk-... \
uv run python runner.py --sdk ts --dataset datasets/lme.jsonl --scorer judge --scenario ask-augmented
Every flag:
| Flag | Default | Purpose |
|---|---|---|
| --sdk {ts,go,py} | required | Which SDK daemon to drive. |
| --scenario {ask-basic,ask-augmented,search-retrieve-only} | ask-basic | Shared daemon scenario to exercise. |
| --mode {auto,hybrid,hybrid-rerank,bm25,semantic} | auto | Retrieval mode forwarded unchanged to /ask or /search. |
| --dataset | required | Path to the JSONL fixture. |
| --scorer {exact,judge} | judge | Scorer. |
| --limit | none | Stop after N questions. |
| --output | results/ | Override the output root. |
| --port | 0 | Daemon port (0 picks a free port). |
| --floor | 0.90 | Minimum pass rate; below this the run fails. |
| --brain | eval | Brain id the daemon reads from. |
| --top-k | 5 | Top-k passed to /ask or /search. |
| --candidate-k | 0 | Retrieve-only only. 0 defers to the daemon default. |
| --rerank-top-n | 0 | Retrieve-only only. 0 defers to the daemon default. |
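The table says --port 0 picks a free port. How the harness does this is not shown here; a common approach, sketched below as an assumption, is to bind a socket to port 0 and let the operating system choose. The helper names pick_free_port and resolve_port are illustrative and are not taken from runner.py.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() reports the port the OS actually assigned.
        return sock.getsockname()[1]


def resolve_port(requested: int) -> int:
    """Mirror the documented --port semantics: 0 means 'pick one for me'."""
    return requested if requested != 0 else pick_free_port()
```

The socket is closed before the daemon starts, so another process could in principle grab the port in between; for a local eval run that race is usually acceptable.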
Full LongMemEval replay (Go)
The 500-question replay lives in the Go SDK:
memory eval lme run \
--dataset longmemeval_s.json \
--ingest-mode replay \
--concurrency 8 \
--judge claude-haiku-4-5 \
--actor gpt-4o \
--max-cost-usd 20 \
--output lme-go.json
Replay mode reconstructs the corpus by replaying each session through the SDK’s extract stage, lets the agentic or direct ask loop answer every question, then scores with the configured judge.
The stratified Go sampler now fills the requested sample size exactly. A --sample-size 50 run processes 50 questions rather than a floored per-category subset.
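The Go sampler itself is not reproduced here. As an illustration of why flooring per category undershoots, and of one way to fill the requested size exactly, the Python sketch below uses largest-remainder allocation. The function name, the seed parameter, and the choice of algorithm are assumptions, not a description of the Go code.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, sample_size, seed=0):
    """Pick exactly min(sample_size, len(items)) items, proportionally per category."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    total = len(items)
    target = min(sample_size, total)
    if target <= 0:
        return []

    # Ideal fractional share for each category.
    shares = {cat: target * len(members) / total for cat, members in groups.items()}
    # Flooring alone can undershoot: 7 categories at 7.2 each give 49, not 50.
    quota = {cat: int(share) for cat, share in shares.items()}
    # Hand the leftover slots to the categories with the largest remainders.
    leftover = target - sum(quota.values())
    by_remainder = sorted(groups, key=lambda cat: shares[cat] - quota[cat], reverse=True)
    for cat in by_remainder[:leftover]:
        quota[cat] += 1

    rng = random.Random(seed)  # seeded so repeated runs pick the same questions
    picked = []
    for cat, members in groups.items():
        picked.extend(rng.sample(members, quota[cat]))
    return picked
```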
Plugging in your own SDK
The harness is plug-and-play. Subclass SdkRunner:
- Drop sdks/<name>.py that implements build_command(port), workdir, and the daemon’s /healthz expectations.
- Register it in sdks/__init__.py:get_runner.
- Add the value to the --sdk Click choice in runner.py.
- Add a matrix entry to .github/workflows/eval-nightly.yml.
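The first step might look roughly like the sketch below. The real SdkRunner base class is not shown in this document, so the block defines a stand-in with only the two members named above; the "rs" SDK, its working directory, the binary name, and its flags are all hypothetical.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SdkRunner(ABC):
    """Stand-in for the harness base class; the real interface may differ."""

    name: str

    @property
    @abstractmethod
    def workdir(self) -> Path:
        """Directory the daemon process is started from."""

    @abstractmethod
    def build_command(self, port: int) -> list[str]:
        """Return the argv that starts the daemon on the given port."""


class RustRunner(SdkRunner):
    """Hypothetical fourth SDK, shown only to illustrate the shape."""

    name = "rs"

    @property
    def workdir(self) -> Path:
        return Path("sdks-src/rs")  # hypothetical checkout location

    def build_command(self, port: int) -> list[str]:
        # Hypothetical binary and flags; use whatever starts your daemon.
        return ["my-memory-daemon", "serve", "--port", str(port)]
```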
The wire contract the daemon must honour: POST /v1/brains/{brainId}/ask returning text/event-stream with retrieve, answer_delta, citation, done, and error events.
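A client for this contract has to parse the event stream and react to each event name. The sketch below follows the standard text/event-stream framing. The payload shapes are not specified in this document, so treating each data field as JSON and reading the answer fragment from a "text" key are assumptions.

```python
import json
from typing import Iterable, Iterator


def _field_value(line: str, name: str) -> str:
    """Return the value after 'name:', dropping one optional leading space."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event, data) pairs from text/event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates one event; events without data are dropped.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = _field_value(line, "event")
        elif line.startswith("data:"):
            data.append(_field_value(line, "data"))


def collect_answer(lines: Iterable[str]) -> tuple[str, list[dict]]:
    """Accumulate answer text and citations until 'done' or 'error'."""
    answer, citations = [], []
    for event, data in iter_sse_events(lines):
        payload = json.loads(data)  # assumes every event carries JSON data
        if event == "answer_delta":
            answer.append(payload.get("text", ""))  # "text" is an assumed key
        elif event == "citation":
            citations.append(payload)
        elif event == "error":
            raise RuntimeError(f"daemon reported an error: {payload}")
        elif event == "done":
            break
        # "retrieve" events are ignored here; a scorer could log them instead.
    return "".join(answer), citations
```

Any iterable of decoded lines works as input, for example the line iterator of a streaming HTTP response.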
Scorers
ExactScorer reads item["expected_substrings"] and returns 1.0 if any expected substring appears in the answer (case-insensitive by default), else 0.0. No network traffic.
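The rule described above fits in a few lines. This is a minimal sketch of the same behaviour as a plain function, not the ExactScorer source; the case_sensitive parameter name is an assumption.

```python
def exact_score(item: dict, answer: str, case_sensitive: bool = False) -> float:
    """Return 1.0 if any expected substring occurs in the answer, else 0.0."""
    haystack = answer if case_sensitive else answer.lower()
    for needle in item["expected_substrings"]:
        probe = needle if case_sensitive else needle.lower()
        if probe in haystack:
            return 1.0
    return 0.0
```

For example, an item with expected_substrings of ["Paris"] scores 1.0 against the answer "The capital is paris." and 0.0 against "I do not know."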
JudgeScorer sends {question, reference_answer, candidate} to OpenAI Chat Completions in strict JSON-object mode at temperature=0.0. Default model gpt-4o ($2.50 / $10 per 1M tokens). Budget enforced through JB_EVAL_BUDGET_USD; exceeding it raises BudgetExceededError and halts the run.
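The budget guard can be pictured as a running cost counter. The sketch below assumes the two quoted prices are input and output rates per 1M tokens, and that an unset JB_EVAL_BUDGET_USD means no cap; both are assumptions, as is the BudgetGuard class itself. Only the BudgetExceededError name comes from the text above.

```python
import os
from typing import Optional


class BudgetExceededError(RuntimeError):
    """Raised when cumulative judge spend passes the configured cap."""


class BudgetGuard:
    def __init__(self, budget_usd: Optional[float],
                 input_usd_per_m: float = 2.50,
                 output_usd_per_m: float = 10.00) -> None:
        self.budget_usd = budget_usd
        self.input_usd_per_m = input_usd_per_m
        self.output_usd_per_m = output_usd_per_m
        self.spent_usd = 0.0

    @classmethod
    def from_env(cls) -> "BudgetGuard":
        raw = os.environ.get("JB_EVAL_BUDGET_USD")
        # Assumption: no variable means no cap.
        return cls(float(raw) if raw else None)

    def charge(self, prompt_tokens: int, completion_tokens: int) -> float:
        """Record one judge call and raise if the cap is now exceeded."""
        cost = (prompt_tokens / 1_000_000 * self.input_usd_per_m
                + completion_tokens / 1_000_000 * self.output_usd_per_m)
        self.spent_usd += cost
        if self.budget_usd is not None and self.spent_usd > self.budget_usd:
            raise BudgetExceededError(
                f"spent ${self.spent_usd:.4f} of ${self.budget_usd:.2f} budget")
        return cost
```

At these rates a judge call with 1,000 prompt tokens and 100 completion tokens costs $0.0025 + $0.0010 = $0.0035.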
Runner-level pass threshold is score >= 0.5, independent of the scorer’s granularity.
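Put together with the --floor gate, the aggregation might look like the sketch below. The function name and the handling of an empty run are assumptions; the 0.5 threshold, the 0.90 default floor, and exit code 1 come from this document.

```python
from statistics import mean

PASS_THRESHOLD = 0.5  # runner-level threshold, independent of the scorer


def summarise(scores: list[float], floor: float = 0.90) -> dict:
    """Aggregate per-question scores and decide the process exit code."""
    total = len(scores)
    passed = sum(1 for score in scores if score >= PASS_THRESHOLD)
    # Assumption: an empty run counts as a pass rate of 0.0 and therefore fails.
    pass_rate = passed / total if total else 0.0
    return {
        "total": total,
        "passed": passed,
        "pass_rate": pass_rate,
        "mean_score": mean(scores) if scores else 0.0,
        "exit_code": 0 if pass_rate >= floor else 1,
    }
```

A judge score of exactly 0.5 passes, and a pass rate exactly equal to the floor does not fail the run, since only a rate below the floor does.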
Datasets
Every dataset line is a JSON object with id, question, expected_substrings (required for exact), reference_answer (required for judge), optional tags. Blank lines and lines starting with # are skipped.
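A loader that follows these rules could be sketched as below. Whether the harness validates required fields at load time is not stated, so the validation step and the load_dataset name are assumptions.

```python
import json
from typing import Iterable


def load_dataset(lines: Iterable[str], scorer: str) -> list[dict]:
    """Parse JSONL dataset lines, skipping blanks and '#' comments."""
    required = {"id", "question"}
    # The scorer decides which answer field must be present.
    required.add("expected_substrings" if scorer == "exact" else "reference_answer")

    items = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        item = json.loads(line)
        missing = required - item.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        items.append(item)
    return items
```

An open file handle can be passed directly as lines, since iterating a text file yields one line at a time.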
| Dataset | Purpose |
|---|---|
| smoke.jsonl | 20 provider-agnostic factual questions. Fast, no API cost with Ollama. |
| lme.jsonl | 100-question shared daemon benchmark used for ask-augmented and search-retrieve-only once the eval brain is populated. |
| longmemeval_s.json | Upstream LongMemEval-S JSON array, 500 questions. Run via the native Go replay path or the replay-backed tri-SDK script, not runner.py. |
Add a new dataset by dropping <name>.jsonl into datasets/ and passing --dataset datasets/<name>.jsonl. No code change.
Interpreting results
runner.py writes an EvalScore JSON at results/<date>/<sdk>.json:
{
"sdk": "<sdk>",
"scorer": "<scorer>",
"total": 0,
"passed": 0,
"pass_rate": 0,
"mean_score": 0,
"started_at": "<iso-8601>",
"finished_at": "<iso-8601>",
"brain": "<brain-id>",
"questions": [
{
"id": "q-001",
"question": "...",
"answer": "...",
"score": 0,
"passed": false,
"latency_ms": 0,
"citations": []
}
]
}
pass_rate is the gate against --floor. mean_score gives continuous quality where the judge returns fractional scores. Citations and the per-question error string are the debugging surface. The Go LME runner adds judge_verdict (correct, abstain_correct, etc.) and cost_accounting.total_usd for replay runs.
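When a run falls below the floor, the per-question entries are the quickest place to look. The sketch below lists failed questions, slowest first, using only the fields shown in the JSON above plus the per-question error string, which is read as an optional "error" key (an assumed name). The helper names are illustrative.

```python
import json
from pathlib import Path


def load_result(path: str) -> dict:
    """Read one results/<date>/<sdk>.json file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def failed_questions(result: dict) -> list[dict]:
    """Return the questions that did not pass, slowest first."""
    failures = [q for q in result["questions"] if not q["passed"]]
    return sorted(failures, key=lambda q: q["latency_ms"], reverse=True)


def describe(question: dict) -> str:
    """One-line summary for a failed question."""
    error = question.get("error") or "no error recorded"  # assumed key name
    return (f'{question["id"]}: score={question["score"]} '
            f'latency={question["latency_ms"]}ms '
            f'citations={len(question["citations"])} ({error})')
```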
Cost and rate limits
Running full LongMemEval with gpt-4o-mini as reader and judge across three
SDKs on 500 questions lands at roughly $3-$5 per day, or $100-$150 per
month if you run nightly. The Go replay path takes --max-cost-usd 20 by
default and aborts when the cumulative spend exceeds it.
Judge calls are serial in the Python runner, which caps judge QPS at the OpenAI rate limit. Tri-SDK replay concurrency defaults to 16.