In-House Training Substrate

CONCEPT:AHE-3.1 — Training Substrate (reward decomposition / distillation) · CONCEPT:KG-2.22 — Rust-Native Data Science · CONCEPT:ML-001…007 — high-caliber LLM trainer (cross-repo)

Spans: agent-utilities (reward spine, memory, training personas) · data-science-mcp (corpora + trainers + curation + pretrain) · universal-skills (train_model workflow) · epistemic-graph (Rust kernels)

Overview

The framework can fine-tune its own open-weight models end-to-end without leaving the ecosystem. The substrate is layered so that everything except the GPU fine-tune runs is deterministic, CPU-testable, and shippable today; the actual runs (Wave D) execute on the GB10 Grace-Blackwell box. The design split is "build now, run later".

flowchart TD
    subgraph AU["agent-utilities (deterministic reward spine)"]
        TS["graph/training_signals.py<br/>advantage · failure-point · composite-reward · difficulty-floor"]
        EM["harness/evolving_memory.py<br/>insight bank (merge-generalize)"]
        RB["harness/replay_buffer.py<br/>prioritized replay (decisive states)"]
    end
    subgraph DSM["data-science-mcp (corpora + gradient trainers)"]
        TD["training_data.py<br/>SFT/DPO/GRPO corpus builders + reward"]
        OBJ["trainers/objectives.py<br/>torch loss kernels"]
        TR["trainers/{sft,dpo,grpo}_trainer.py<br/>(impl training_data.Trainer)"]
        PM["peft_manager.py · tokenizer_registry.py · rollout_buffer.py"]
        EH["trainers/eval_hooks.py → AHE-3.1 reliability suite"]
    end
    subgraph EG["epistemic-graph (Rust performance path)"]
        K["src/datascience/training.rs<br/>softmax · CE · DPO · GRPO · KL · Adam/SGD"]
    end
    TS --> TD
    TD --> TR
    OBJ --> TR
    PM --> TR
    TR --> EH
    K -. "perf path — same math batched over the wire" .- OBJ
    EH --> DEPLOY["Deploy seam: register checkpoint → model-registry role"]

Layers

1. Deterministic reward / data engine (no GPU)

  • agent_utilities/graph/training_signals.py — the reward spine: batch-normalized advantage, failure-point attribution, composite conditionally-gated reward, difficulty-floor filtering.
  • data_science_mcp/training_data.py — turns execution traces into SFT ({prompt, completion}), DPO ({prompt, chosen, rejected, failure_point}), and GRPO (group-normalized advantages) corpora, reusing the spine. MCP tools build_training_dataset / compose_reward. Defines the Trainer Protocol seam.
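
The group-normalized advantage used for the GRPO corpora can be illustrated with a minimal sketch. This is not the code in training_signals.py; the function name, the `eps` parameter, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize one group of rollout rewards to (r - mean) / (std + eps).

    Hypothetical helper: the real reward spine may use a different
    estimator or epsilon. A group with identical rewards yields all
    zeros, so it contributes no learning signal.
    """
    if not rewards:
        return []
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for the same prompt: two succeeded, two failed.
    print(group_normalized_advantages([1.0, 0.0, 1.0, 0.0]))
```

Rollouts that beat their group mean get a positive advantage and the rest a negative one, which is what lets GRPO work without a learned value model.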

2. Gradient trainers (torch/PEFT — data-science-mcp[training])

  • trainers/objectives.py — torch loss kernels: masked cross-entropy, sequence log-prob, Bradley-Terry dpo_loss, group-relative grpo_surrogate (+ token-masked LA-GRPO), Schulman-k3 approx_kl.
  • trainers/base.py: TrainConfig + TrainerBase (pure plan(); the model and tokenizer are dependency-injectable, so the loop is CPU-smoke-testable on a toy model with no GPU or HF download).
  • trainers/{sft,dpo,grpo}_trainer.py — concrete trainers implementing the training_data.Trainer Protocol.
  • peft_manager.py: LoraSpec/PeftManager (lazy peft/QLoRA) + pure-numpy ties_merge (MeMo multi-adapter merge).
  • tokenizer_registry.py — special/functional-token injection + embedding resize (ATLAS/SDAR).
  • rollout_buffer.py — prompt→generation→logprob→reward staging with a VLLMRolloutClient (generations served by the running vLLM) and GRPO export.
  • trainers/eval_hooks.py — bridges a checkpoint into the AHE-3.1 reliability suite (faithfulness/safety/tool-necessity/…): did fine-tuning internalize the behavior without regressing grounding/safety?
  • MCP tools: train_sft / train_dpo / train_grpo / merge_adapters_ties (plan-by-default, execute=true to run).
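
Two of the objectives listed above have compact scalar forms, sketched here in plain Python rather than torch. The function names mirror the kernel names for readability, but the signatures and the default `beta` are assumptions; the kernels in trainers/objectives.py operate on tensors and may differ in detail.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Bradley-Terry DPO loss for one preference pair.

    Inputs are sequence log-probs under the policy (pi_*) and the frozen
    reference (ref_*). Loss = -log sigmoid(beta * margin).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return _softplus(-beta * margin)


def approx_kl_k3(logp_policy, logp_ref):
    """Schulman k3 estimator, averaged over tokens sampled from the policy.

    With log_ratio = log(p_ref / p_policy), each term is
    exp(log_ratio) - log_ratio - 1, which is always >= 0.
    """
    if not logp_policy:
        return 0.0
    total = 0.0
    for lp, lr in zip(logp_policy, logp_ref):
        log_ratio = lr - lp
        total += math.exp(log_ratio) - log_ratio - 1.0
    return total / len(logp_policy)
```

When the policy equals the reference, the DPO loss is log 2 and the k3 estimate is 0, which makes both convenient sanity checks for a CPU smoke test.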

3. Rust performance path (epistemic-graph, CONCEPT:KG-2.22)

src/datascience/training.rs re-implements the loss/optimizer kernels in pure Rust (no candle — matching the repo's style): softmax/log_softmax, cross_entropy (+grad), dpo_loss (+grads), grpo_surrogate (+grad with zero-grad clip region), kl_divergence (k3), adam_step/sgd_step. Exposed over the MessagePack/UDS protocol as client.datascience.*, so a trainer can batch a step over the wire in one round-trip instead of marshalling per element. Same math as the torch kernels; the torch path is the default and the Rust path is the optimization.
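
The "zero-grad clip region" can be shown with a per-sample sketch of a PPO-style clipped surrogate, which is the form GRPO commonly uses. It is written in Python for readability and is not the Rust kernel; the function name, the `clip_eps` default, and the return convention (loss and its gradient with respect to the new log-prob) are assumptions.

```python
import math


def clipped_surrogate_with_grad(logp_new, logp_old, advantage, clip_eps=0.2):
    """Return (loss, d_loss/d_logp_new) for one sample.

    loss = -min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)
    When the clipped branch is the active minimum, the loss no longer
    depends on logp_new, so the gradient is exactly zero.
    """
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps) * advantage
    if clipped < unclipped:
        # Clip region: constant with respect to logp_new.
        return -clipped, 0.0
    # d(ratio)/d(logp_new) = ratio, so d(-ratio * A)/d(logp_new) = -ratio * A.
    return -unclipped, -ratio * advantage


if __name__ == "__main__":
    # A positive-advantage sample whose ratio has already grown past 1 + eps.
    print(clipped_surrogate_with_grad(math.log(2.0), 0.0, 1.0))
```

Returning the gradient alongside the loss matches the "(+grad)" pattern described for the kernels, where a batched call yields both in one round-trip.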

Deploy seam — a checkpoint goes live with no hot-path edit

model_registry.resolve_role  ←  rlm/roles  ←  create_model(role=…)

A trained checkpoint is registered as a ModelDefinition and bound to a role (e.g. an rlm-* role). Every consumer that calls create_model(role=…) resolves through model_registry.resolve_role, so the new model goes live the moment the binding is updated — no orchestration/RLM code change. Serve it via the running vLLM.
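
The role indirection can be sketched with a toy in-memory registry. Every class and function here is a hypothetical stand-in (the real ModelDefinition, model_registry, and create_model have their own signatures), and the role name "rlm-worker" and the checkpoint paths are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyModelDefinition:
    """Hypothetical stand-in for a registered model entry."""
    name: str
    checkpoint_path: str


class ToyModelRegistry:
    """Hypothetical registry: models by name, roles bound to model names."""

    def __init__(self):
        self._models = {}
        self._roles = {}

    def register(self, model):
        self._models[model.name] = model

    def bind_role(self, role, model_name):
        if model_name not in self._models:
            raise KeyError(f"unknown model: {model_name}")
        self._roles[role] = model_name

    def resolve_role(self, role):
        return self._models[self._roles[role]]


def create_model(registry, role):
    """Consumers ask for a role and never name a checkpoint directly."""
    return registry.resolve_role(role)


if __name__ == "__main__":
    registry = ToyModelRegistry()
    registry.register(ToyModelDefinition("base", "/models/base"))
    registry.register(ToyModelDefinition("tuned", "/models/tuned"))
    registry.bind_role("rlm-worker", "base")
    print(create_model(registry, "rlm-worker").name)
    registry.bind_role("rlm-worker", "tuned")  # the only change needed
    print(create_model(registry, "rlm-worker").name)
```

Because callers depend only on the role, rebinding it is the entire deployment step in this sketch.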

Build-now / run-later boundary

| Layer | Status | Where it runs |
| --- | --- | --- |
| Reward/data engine | ✅ built | CPU, anywhere |
| C2 torch trainers | ✅ built, CPU-smoke-tested on toy model | CPU now / GB10 for real fine-tunes |
| C1 Rust kernels | ✅ built, Rust + Python round-trip tested | CPU |
| Deploy seam | ✅ exists | |
| Wave D fine-tune runs | ⛔ GPU-gated | GB10 (pin Blackwell peft/bitsandbytes/vllm) |

The first run is OpenSeeker SFT (Qwen2.5-1.5B LoRA): it is SFT-only, needs no rollouts, runs fast, and validates the whole path. See WAVE_C_INFRA.md for per-paper GB10 requirements.

Wave D — the end-to-end run pipeline

data-science-mcp training_pipeline.py is the single runnable flow that sequences the layers above:

traces → build SFT corpus → plan → train → reliability-eval (eval_hooks)
       → save checkpoint → register_checkpoint(role) → live via pick_for_role

run_sft_pipeline(config, traces=…, eval_cases=…, registry=…, deploy=DeploymentTarget(role=…)) returns a structured report and, when a registry + deploy target are given, binds the trained checkpoint to a role so it goes live with no hot-path edit. It is CPU-smoke-tested end-to-end on a toy model; on the GB10 the only deltas are real deps, a real base model, and the GPU. See data-science-mcp docs/training.md for the OpenSeeker recipe.

LLM trainer expansion (CONCEPT:ML-001…007)

The substrate above was hardened into a full LLM trainer — create, pretrain from random init, and fine-tune models, robustly and at scale, driven by agents. The CONCEPT:ML-* family is a deliberate cross-repo id family (it spans the three repos below) rather than a single-pillar one; it expands AHE-3.1 and DSCI-004. SDD spec: .specify/specs/llm-model-trainer/.

| Concept | Capability | Where |
| --- | --- | --- |
| ML-001 | Trainer hardening — shared run_loop (precision/accum/clip/scheduler/checkpoint+resume) | data-science-mcp/trainers/loop.py |
| ML-002 | Corpus curation — stream/dedup/decontaminate/quality-filter/pack/lineage (epistemic-graph HNSW accel.) | data-science-mcp/data_engine.py |
| ML-003 | Pretrain from random init — BPE tokenizer + AutoConfig → from_config | data-science-mcp/{tokenizer_trainer,trainers/pretrain_trainer}.py |
| ML-004 | Experiment tracking — MLflow + epistemic-graph TrainingRun mirror | data-science-mcp/tracking.py |
| ML-005 | Scale-out — FSDP and DeepSpeed ZeRO-3 peers + launcher | data-science-mcp/{trainers/accelerate_launch.py,launch/} |
| ML-006 | Benchmark eval — lm-eval beside the AHE-3.1 reliability suite | data-science-mcp/trainers/eval_hooks.py |
| ML-007 | Agent-driven training — personas + workflow (below) | agent-utilities + universal-skills |
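
Two of the ML-002 curation steps, dedup and decontaminate, can be sketched in a few lines. This is a deliberately simple illustration (exact-match dedup on normalized text, n-gram overlap against benchmark items), not the data_engine.py implementation, which is described as streaming and HNSW-accelerated; all names and the default `n` are assumptions.

```python
import hashlib


def _normalize(text):
    """Lowercase and collapse whitespace so trivial variants collide."""
    return " ".join(text.lower().split())


def dedup_exact(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(_normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def _ngrams(text, n):
    tokens = _normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(docs, benchmark_items, n=8):
    """Drop any document sharing at least one word n-gram with a benchmark item."""
    banned = set()
    for item in benchmark_items:
        banned |= _ngrams(item, n)
    return [doc for doc in docs if not (_ngrams(doc, n) & banned)]
```

A smaller `n` removes more aggressively at the cost of false positives; the right value depends on the benchmark and tokenization.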

Agent layer (CONCEPT:ML-007)

The whole loop is exposed as an agent workflow. Four prompt personas live in agent_utilities/prompts/ (auto-discovered by load_specialized_prompts):

| Persona | Role | Key tools |
| --- | --- | --- |
| data_curator | build / quality-filter / dedup / decontaminate corpus + lineage | curate_corpus, dedup_corpus, decontaminate_corpus, dataset_lineage |
| training_engineer | plan & launch SFT/DPO/GRPO + from-scratch pretrain | train_sft, train_dpo, train_grpo, pretrain_model, train_tokenizer, merge_adapters_ties |
| eval_judge | reliability + benchmark scoring; gate advance/repeat/abort | run_interpretability_suite, grade_response, evaluate_model |
| ml_orchestrator | own the DAG; delegate; branch on gates; register checkpoint | graph_orchestrate |

They are bound into the model_training_team and driven by the train_model workflow skill (universal-skills/.../workflows/ml/train_model/), a 12-step DAG:

prepare_corpus → curate/dedup/decontaminate → (train_tokenizer) →
  train(sft|pretrain) → eval → GATE → align(dpo|grpo) → eval → GATE →
  merge_adapters → final_eval → register_model

Run it with graph_orchestrate(action="execute_workflow", name="train_model", task=…). Install dependencies per capability: see data-science-mcp docs/installation.md.
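
The GATE steps in the DAG resolve to one of the three outcomes the eval_judge persona can return: advance, repeat, or abort. A minimal sketch of such a decision follows. The rule itself (compare an eval score to a baseline, retry a bounded number of times) and every name and default are assumptions, not the workflow's actual policy.

```python
from enum import Enum


class GateDecision(Enum):
    ADVANCE = "advance"
    REPEAT = "repeat"
    ABORT = "abort"


def decide_gate(score, baseline, attempt, max_attempts=2, tolerance=0.0):
    """Hypothetical gate rule.

    Advance when the score has not regressed past the baseline (within
    tolerance); otherwise repeat the stage until max_attempts is reached,
    then abort.
    """
    if score + tolerance >= baseline:
        return GateDecision.ADVANCE
    if attempt < max_attempts:
        return GateDecision.REPEAT
    return GateDecision.ABORT


if __name__ == "__main__":
    print(decide_gate(score=0.78, baseline=0.80, attempt=1))
```

Bounding the retries keeps a regressing run from looping indefinitely on GPU time.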

  • data-science-mcp: docs/training.md (trainer usage + GPU recipes), docs/installation.md (capability→dependency matrix), docs/concepts.md (CONCEPT:ML-* registry).
  • universal-skills: workflows/ml/train_model/SKILL.md (the agent workflow).
  • epistemic-graph: docs/RUST_COMPUTE_GUIDE.md (kernel pattern).
  • SDD: .specify/specs/llm-model-trainer/ (spec/plan/tasks).