In-House Training Substrate¶
CONCEPT:AHE-3.1 — Training Substrate (reward decomposition / distillation) · CONCEPT:KG-2.22 — Rust-Native Data Science · CONCEPT:ML-001…007 — high-caliber LLM trainer (cross-repo) Spans:
agent-utilities(reward spine, memory, training personas) ·data-science-mcp(corpora + trainers + curation + pretrain) ·universal-skills(train_modelworkflow) ·epistemic-graph(Rust kernels)
Overview¶
The framework can fine-tune its own open-weight models end-to-end without leaving the ecosystem. The substrate is layered so that everything except the GPU fine-tune runs is deterministic, CPU-testable, and shippable today; the actual runs (Wave D) execute on the GB10 Grace-Blackwell box. The design split is "build now, run later".
flowchart TD
subgraph AU["agent-utilities (deterministic reward spine)"]
TS["graph/training_signals.py<br/>advantage · failure-point · composite-reward · difficulty-floor"]
EM["harness/evolving_memory.py<br/>insight bank (merge-generalize)"]
RB["harness/replay_buffer.py<br/>prioritized replay (decisive states)"]
end
subgraph DSM["data-science-mcp (corpora + gradient trainers)"]
TD["training_data.py<br/>SFT/DPO/GRPO corpus builders + reward"]
OBJ["trainers/objectives.py<br/>torch loss kernels"]
TR["trainers/{sft,dpo,grpo}_trainer.py<br/>(impl training_data.Trainer)"]
PM["peft_manager.py · tokenizer_registry.py · rollout_buffer.py"]
EH["trainers/eval_hooks.py → AHE-3.1 reliability suite"]
end
subgraph EG["epistemic-graph (Rust performance path)"]
K["src/datascience/training.rs<br/>softmax · CE · DPO · GRPO · KL · Adam/SGD"]
end
TS --> TD
TD --> TR
OBJ --> TR
PM --> TR
TR --> EH
K -. "perf path — same math batched over the wire" .- OBJ
EH --> DEPLOY["Deploy seam: register checkpoint → model-registry role"]
Layers¶
1. Deterministic reward / data engine (no GPU)¶
agent_utilities/graph/training_signals.py— the reward spine: batch-normalized advantage, failure-point attribution, composite conditionally-gated reward, difficulty-floor filtering.data_science_mcp/training_data.py— turns execution traces into SFT ({prompt, completion}), DPO ({prompt, chosen, rejected, failure_point}), and GRPO (group-normalized advantages) corpora, reusing the spine. MCP toolsbuild_training_dataset/compose_reward. Defines theTrainerProtocol seam.
2. Gradient trainers (torch/PEFT — data-science-mcp[training])¶
trainers/objectives.py— torch loss kernels: masked cross-entropy, sequence log-prob, Bradley-Terrydpo_loss, group-relativegrpo_surrogate(+ token-masked LA-GRPO), Schulman-k3approx_kl.trainers/base.py—TrainConfig+TrainerBase(pureplan(), dependency-injectable model/tokenizer so the loop is CPU-smoke-testable on a toy model with no GPU/HF download).trainers/{sft,dpo,grpo}_trainer.py— concrete trainers implementing thetraining_data.TrainerProtocol.peft_manager.py—LoraSpec/PeftManager(lazy peft/QLoRA) + pure-numpyties_merge(MeMo multi-adapter merge).tokenizer_registry.py— special/functional-token injection + embedding resize (ATLAS/SDAR).rollout_buffer.py— prompt→generation→logprob→reward staging with aVLLMRolloutClient(generations served by the running vLLM) and GRPO export.trainers/eval_hooks.py— bridges a checkpoint into the AHE-3.1 reliability suite (faithfulness/safety/tool-necessity/…): did fine-tuning internalize the behavior without regressing grounding/safety?- MCP tools:
train_sft/train_dpo/train_grpo/merge_adapters_ties(plan-by-default,execute=trueto run).
3. Rust performance path (epistemic-graph, CONCEPT:KG-2.22)¶
src/datascience/training.rs re-implements the loss/optimizer kernels in
pure Rust (no candle — matching the repo's style): softmax/log_softmax,
cross_entropy (+grad), dpo_loss (+grads), grpo_surrogate (+grad with
zero-grad clip region), kl_divergence (k3), adam_step/sgd_step. Exposed over
the MessagePack/UDS protocol as client.datascience.*, so a trainer can batch a
step over the wire in one round-trip instead of marshalling per element. Same math
as the torch kernels; the torch path is the default and the Rust path is the
optimization.
Deploy seam — a checkpoint goes live with no hot-path edit¶
A trained checkpoint is registered as a ModelDefinition and bound to a role
(e.g. an rlm-* role). Every consumer that calls create_model(role=…) resolves
through model_registry.resolve_role, so the new model goes live the moment the
binding is updated — no orchestration/RLM code change. Serve it via the running
vLLM.
Build-now / run-later boundary¶
| Layer | Status | Where it runs |
|---|---|---|
| Reward/data engine | ✅ built | CPU, anywhere |
| C2 torch trainers | ✅ built, CPU-smoke-tested on toy model | CPU now / GB10 for real fine-tunes |
| C1 Rust kernels | ✅ built, Rust + Python round-trip tested | CPU |
| Deploy seam | ✅ exists | — |
| Wave D fine-tune runs | ⛔ GPU-gated | GB10 (pin Blackwell peft/bitsandbytes/vllm) |
First run is OpenSeeker SFT (Qwen2.5-1.5B LoRA) — SFT-only, no rollouts, fast,
and validates the whole path. See WAVE_C_INFRA.md
for per-paper GB10 requirements.
Wave D — the end-to-end run pipeline¶
data-science-mcp training_pipeline.py is the single runnable flow that
sequences the layers above:
traces → build SFT corpus → plan → train → reliability-eval (eval_hooks)
→ save checkpoint → register_checkpoint(role) → live via pick_for_role
run_sft_pipeline(config, traces=…, eval_cases=…, registry=…, deploy=DeploymentTarget(role=…))
returns a structured report and, when a registry + deploy target are given, binds
the trained checkpoint to a role so it goes live with no hot-path edit. It is
CPU-smoke-tested end-to-end on a toy model; on the GB10 the only deltas are real
deps, a real base model, and the GPU. See data-science-mcp docs/training.md for
the OpenSeeker recipe.
LLM trainer expansion (CONCEPT:ML-001…007)¶
The substrate above was hardened into a full LLM trainer — create, pretrain from
random init, and fine-tune models, robustly and at scale, driven by agents. The
CONCEPT:ML-* family is a deliberate cross-repo id family (it spans the three
repos below) rather than a single-pillar one; it expands AHE-3.1 and
DSCI-004. SDD spec: .specify/specs/llm-model-trainer/.
| Concept | Capability | Where |
|---|---|---|
ML-001 |
Trainer hardening — shared run_loop (precision/accum/clip/scheduler/checkpoint+resume) |
data-science-mcp/trainers/loop.py |
ML-002 |
Corpus curation — stream/dedup/decontaminate/quality-filter/pack/lineage (epistemic-graph HNSW accel.) | data-science-mcp/data_engine.py |
ML-003 |
Pretrain from random init — BPE tokenizer + AutoConfig→from_config |
data-science-mcp/{tokenizer_trainer,trainers/pretrain_trainer}.py |
ML-004 |
Experiment tracking — MLflow + epistemic-graph TrainingRun mirror |
data-science-mcp/tracking.py |
ML-005 |
Scale-out — FSDP and DeepSpeed ZeRO-3 peers + launcher | data-science-mcp/{trainers/accelerate_launch.py,launch/} |
ML-006 |
Benchmark eval — lm-eval beside the AHE-3.1 reliability suite | data-science-mcp/trainers/eval_hooks.py |
ML-007 |
Agent-driven training — personas + workflow (below) | agent-utilities + universal-skills |
Agent layer (CONCEPT:ML-007)¶
The whole loop is exposed as an agent workflow. Four prompt personas live in
agent_utilities/prompts/ (auto-discovered by load_specialized_prompts):
| Persona | Role | Key tools |
|---|---|---|
data_curator |
build / quality-filter / dedup / decontaminate corpus + lineage | curate_corpus, dedup_corpus, decontaminate_corpus, dataset_lineage |
training_engineer |
plan & launch SFT/DPO/GRPO + from-scratch pretrain | train_sft, train_dpo, train_grpo, pretrain_model, train_tokenizer, merge_adapters_ties |
eval_judge |
reliability + benchmark scoring; gate advance/repeat/abort | run_interpretability_suite, grade_response, evaluate_model |
ml_orchestrator |
own the DAG; delegate; branch on gates; register checkpoint | graph_orchestrate |
They are bound into the model_training_team and driven by the train_model
workflow skill (universal-skills/.../workflows/ml/train_model/), a 12-step DAG:
prepare_corpus → curate/dedup/decontaminate → (train_tokenizer) →
train(sft|pretrain) → eval → GATE → align(dpo|grpo) → eval → GATE →
merge_adapters → final_eval → register_model
Run it with graph_orchestrate(action="execute_workflow", name="train_model", task=…).
Install dependencies per capability: see data-science-mcp docs/installation.md.
Related¶
- data-science-mcp:
docs/training.md(trainer usage + GPU recipes),docs/installation.md(capability→dependency matrix),docs/concepts.md(CONCEPT:ML-*registry). - universal-skills:
workflows/ml/train_model/SKILL.md(the agent workflow). - epistemic-graph:
docs/RUST_COMPUTE_GUIDE.md(kernel pattern). - SDD:
.specify/specs/llm-model-trainer/(spec/plan/tasks).