
LongMemEval

The cross-SDK benchmark harness that drives every memory daemon through the same question set and scores the results.

LongMemEval is the long-horizon memory benchmark jeffs-brain adopts to verify that its TypeScript, Go, and Python SDKs answer retrieval-grounded questions at parity. The eval/ harness in the repo spawns each SDK’s memory serve daemon, drives a JSONL file of questions through the shared POST /v1/brains/{brainId}/ask SSE contract, scores each streamed answer, and writes a per-SDK result matrix.

The target is a 93.4% pass rate against the TypeScript baseline. The floor at which the harness fails a run is configurable (default 0.90).

Layout

  • eval/runner.py — the Click CLI entry point. Loads a dataset, starts the chosen SDK’s daemon, POSTs each question to /v1/brains/{brain}/ask, folds answer_delta / token events into the final answer, collects citation events, scores the result, and writes results/<date>/<sdk>.json.
  • eval/datasets/ — JSONL fixtures plus the full 500-question LongMemEval-S binary.
  • eval/scorer/ — two scorers: ExactScorer (deterministic substring match) and JudgeScorer (OpenAI LLM-as-judge with a USD budget guard).
  • eval/scripts/run_tri_lme.sh — end-to-end tri-SDK orchestration: extract once, spawn all three daemons on ports 18850–18852, run memory eval lme run per SDK in parallel, and write a summary.
  • eval/sdks/{ts,go,py}.py — per-SDK runner registry. Implement SdkRunner.build_command to plug in another SDK.

Install

cd ~/code/jeffs-brain/memory/eval
uv sync

Plain pip install -e .[dev] works too; the backend is hatchling.

Environment

  • JB_LLM_PROVIDER — Pin the daemon’s provider: openai, anthropic, ollama, fake.
  • JB_LLM_MODEL — Pin the reader model.
  • OPENAI_API_KEY — Required for OpenAI readers and the default judge.
  • JB_EVAL_JUDGE_MODEL — Override the judge (default gpt-4o).
  • JB_EVAL_BUDGET_USD — Fail-fast spend cap enforced in the judge scorer.
  • OLLAMA_HOST — Default http://localhost:11434.
  • ANTHROPIC_API_KEY — Required when the daemon runs Anthropic.
  • JB_LME_JUDGE_MODEL — Go LME-specific judge override.
  • JB_LME_ACTOR_MODEL — Go LME-specific actor override.
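
A typical zero-cost local setup exports only the Ollama variables. The model name below is illustrative, not a harness default:

```shell
# Pin the daemon to a local Ollama model so smoke runs cost nothing.
export JB_LLM_PROVIDER=ollama
export JB_LLM_MODEL=gemma3:latest          # any locally pulled model works
export OLLAMA_HOST=http://localhost:11434  # the documented default
export JB_EVAL_BUDGET_USD=5                # only consulted by the judge scorer
```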

Running a smoke benchmark

The smoke.jsonl fixture is provider-agnostic, runs against an empty brain, and grades the LLM’s direct answer — no API cost when paired with Ollama.

uv run python runner.py --sdk ts --dataset datasets/smoke.jsonl --scorer exact
uv run python runner.py --sdk go --dataset datasets/smoke.jsonl --scorer exact
uv run python runner.py --sdk py --dataset datasets/smoke.jsonl --scorer exact

Each run prints a one-line summary and a result path:

ts: 19/20 pass_rate=0.95 mean_score=0.95 -> results/2026-04-19/ts.json

If pass_rate drops below --floor, the process exits 1.

Running the judge benchmark

OPENAI_API_KEY=sk-... \
  uv run python runner.py --sdk ts --dataset datasets/lme.jsonl --scorer judge

Every flag:

  • --sdk {ts,go,py} (required) — which SDK daemon to drive.
  • --mode {direct,agentic} (default direct) — ask-flow mode.
  • --dataset (required) — path to the JSONL fixture.
  • --scorer {exact,judge} (default exact) — scorer.
  • --limit (default none) — stop after N questions (useful for smoke).
  • --output (default auto) — override the output path.
  • --port (default 0) — daemon port; 0 picks a free port.
  • --floor (default 0.90) — minimum pass rate; below this the run fails.
  • --brain (default eval) — brain id the daemon reads from.
  • --top-k (default 8) — top-k passed to the ask endpoint.

Full LongMemEval replay (Go)

The 500-question replay lives in the Go SDK:

memory eval lme run \
  --dataset longmemeval_s.json \
  --ingest-mode replay \
  --concurrency 8 \
  --judge claude-haiku-4-5 \
  --actor gpt-4o \
  --max-cost-usd 20 \
  --output lme-go.json

Replay mode reconstructs the corpus by replaying each session through the SDK’s extract stage, lets the agentic or direct ask loop answer every question, then scores with the configured judge.

Plugging in your own SDK

The harness is plug-and-play. Subclass SdkRunner:

  1. Drop sdks/<name>.py that implements build_command(port), workdir, and the daemon’s /healthz expectations.
  2. Register it in sdks/__init__.py:get_runner.
  3. Add the value to the --sdk Click choice in runner.py.
  4. Add a matrix entry to .github/workflows/eval-nightly.yml.
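
A minimal sketch of step 1. Only build_command is documented here, so the base-class shape, the hypothetical RustRunner, and its cargo invocation are illustrative assumptions, not code from the repo:

```python
from pathlib import Path


class SdkRunner:
    """Assumed base shape: the docs name build_command and workdir."""

    workdir: Path

    def build_command(self, port: int) -> list[str]:
        raise NotImplementedError


class RustRunner(SdkRunner):
    """Hypothetical sdks/rust.py runner for an imagined Rust SDK daemon."""

    workdir = Path("../sdk-rust")  # assumed layout, relative to eval/

    def build_command(self, port: int) -> list[str]:
        # The spawned daemon must serve the shared ask contract and /healthz.
        return ["cargo", "run", "--release", "--", "serve", "--port", str(port)]
```

After registering the class in sdks/__init__.py:get_runner and extending the --sdk choice, the harness drives it exactly like the built-in runners.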

The wire contract the daemon must honour: POST /v1/brains/{brainId}/ask returning text/event-stream with retrieve, answer_delta, citation, done, and error events.
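
Folding that event stream into a final answer can be sketched as below. The SSE framing is standard; the payload field names (text on answer_delta) are assumptions, since the contract here only names the event types:

```python
import json


def fold_sse_answer(lines):
    """Fold SSE lines into (answer, citations).

    Assumes the documented event names: answer_delta chunks are concatenated,
    citation payloads are collected, error aborts, done ends the stream.
    """
    answer_parts, citations = [], []
    event = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = json.loads(line[len("data:"):].strip())
            if event == "answer_delta":
                answer_parts.append(data.get("text", ""))  # field name assumed
            elif event == "citation":
                citations.append(data)
            elif event == "error":
                raise RuntimeError(data)
            elif event == "done":
                break
    return "".join(answer_parts), citations
```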

Scorers

ExactScorer reads item["expected_substrings"] and returns 1.0 if any expected substring appears in the answer (case-insensitive by default), else 0.0. No network traffic.
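
The documented behaviour reduces to a few lines; this is a sketch of the logic, not the repo's ExactScorer class:

```python
def exact_score(item, answer):
    """1.0 if any expected substring appears in the answer
    (case-insensitive, per the docs), else 0.0."""
    haystack = answer.lower()
    return 1.0 if any(s.lower() in haystack
                      for s in item["expected_substrings"]) else 0.0
```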

JudgeScorer sends {question, reference_answer, candidate} to OpenAI Chat Completions in strict JSON-object mode at temperature=0.0. Default model gpt-4o ($2.50 / $10 per 1M tokens). Budget enforced through JB_EVAL_BUDGET_USD; exceeding it raises BudgetExceededError and halts the run.
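
The budget guard's accounting is internal to the repo; a sketch of the idea, using the gpt-4o prices quoted above and an assumed per-call token count interface:

```python
class BudgetExceededError(RuntimeError):
    pass


class BudgetGuard:
    """Track cumulative judge spend against a USD cap (JB_EVAL_BUDGET_USD).

    Prices are USD per 1M tokens; defaults are the documented gpt-4o rates.
    """

    def __init__(self, budget_usd, in_per_m=2.50, out_per_m=10.0):
        self.budget = budget_usd
        self.spent = 0.0
        self.in_per_m = in_per_m
        self.out_per_m = out_per_m

    def charge(self, prompt_tokens, completion_tokens):
        self.spent += (prompt_tokens * self.in_per_m
                       + completion_tokens * self.out_per_m) / 1_000_000
        if self.spent > self.budget:
            raise BudgetExceededError(
                f"spent ${self.spent:.4f} exceeds budget ${self.budget}")
```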

Runner-level pass threshold is score >= 0.5, independent of the scorer’s granularity.

Datasets

Every dataset line is a JSON object with id, question, expected_substrings (required for exact), reference_answer (required for judge), optional tags. Blank lines and lines starting with # are skipped.
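
The parsing rules above can be sketched as a small loader; this mirrors the documented behaviour (blank lines and # comments skipped) rather than quoting the repo's code:

```python
import json


def load_dataset(lines):
    """Parse JSONL dataset lines into item dicts, skipping blanks and
    lines starting with #, as the dataset format specifies."""
    items = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        items.append(json.loads(stripped))
    return items
```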

  • smoke.jsonl — 20 provider-agnostic factual questions. Fast, no API cost with Ollama.
  • lme.jsonl — 100-question benchmark spanning facts, definitions, temporal, procedural, and memory-retrieval concepts. Ollama-friendly.
  • longmemeval_s.json — upstream LongMemEval-S, 500 questions. Run via the Go replay path.

Add a new dataset by dropping <name>.jsonl into datasets/ and passing --dataset datasets/<name>.jsonl. No code change.

Interpreting results

runner.py writes an EvalScore JSON at results/<date>/<sdk>.json:

{
  "sdk": "ts",
  "scorer": "exact",
  "total": 20,
  "passed": 19,
  "pass_rate": 0.95,
  "mean_score": 0.95,
  "started_at": "2026-04-19T09:00:00Z",
  "finished_at": "2026-04-19T09:00:17Z",
  "brain": "eval",
  "questions": [
    {
      "id": "q-001",
      "question": "...",
      "answer": "...",
      "score": 1.0,
      "passed": true,
      "latency_ms": 471,
      "citations": []
    }
  ]
}

pass_rate is the gate against --floor. mean_score gives continuous quality where the judge returns fractional scores. Citations and the per-question error string are the debugging surface. The Go LME runner adds judge_verdict (correct, abstain_correct, etc.) and cost_accounting.total_usd for replay runs.
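
The aggregation and the --floor gate can be sketched as follows; the field names match the EvalScore JSON above, while the function itself is illustrative, not runner.py's code:

```python
def summarize(questions, floor=0.90):
    """Fold per-question results into the pass_rate / mean_score fields
    and apply the --floor gate, as described for runner.py."""
    total = len(questions)
    passed = sum(1 for q in questions if q["passed"])
    pass_rate = passed / total if total else 0.0
    mean_score = sum(q["score"] for q in questions) / total if total else 0.0
    return {
        "total": total,
        "passed": passed,
        "pass_rate": pass_rate,
        "mean_score": mean_score,
        "gate_ok": pass_rate >= floor,  # False => the runner exits 1
    }
```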

Cost and rate limits

Running full LongMemEval with gpt-4o-mini as reader and judge across three SDKs on 500 questions lands at roughly $3–$5 per day, or $100–$150 per month if you run nightly. The Go replay path takes --max-cost-usd 20 by default and aborts when the cumulative spend exceeds it.

Judge calls are serial in the Python runner, so judge throughput stays comfortably below OpenAI rate limits. Tri-SDK replay concurrency defaults to 16.

Cross-SDK smoke results

Recent tri-SDK smoke run against gemma3:latest on Ollama: TypeScript, Go, and Python all at 19/20 (95%), with p50 latencies 407–471 ms and p95 630–836 ms. Full write-ups live under eval/results/cross-sdk/ in the repo.