Content-Aware Ingestion (ArchiveBox · crawl4ai · scholarx)¶
CONCEPT:KG-2.7 — extends the unified Ingestion Engine so that fetching a web page is pluggable and what we do with it depends on the content. Throw a document, a URL, a PDF, or a research-roundup blog at the engine and it ingests, enriches, vectorizes — and for a page that points at research papers it also downloads and ingests those papers automatically.
Why¶
The Ingestion Engine (knowledge_graph/ingestion/engine.py) already turns any
ContentType into Document/Chunk/Concept nodes with one enrichment pass and
ConceptMatcher cross-linking (KG-2.75). Three gaps remained: web fetching was a bare
requests.get (no crawl4ai, no ArchiveBox); a research roundup was ingested as a
flat page (its paper links ignored); and the deployed ArchiveBox instance — which
perfectly preserves pages — was unused. This change closes all three on top of the
existing spine.
Components¶
- Unified web-fetch resolver —
ingestion/web_fetch.py:resolve_web_fetch(url). One front door, precedence ArchiveBox → crawl4ai → requests+markitdown. ArchiveBox serves the preserved snapshot via thearchivebox-apiMCP server (archive-on-miss); crawl4ai renders JS; requests is the zero-dep floor. Both theDOCUMENTURL adaptor and the skill-graph pipeline call it. - Research acquisition —
ingestion/paper_links.pyextracts arXiv/DOI/PDF refs and detects a "roundup" (≥3 distinct scholarly links);ingestion/research_acquisition.pydownloads them viascholarx-mcp(sx_storage download_url, direct fallback) intoresearch/papers. TheDOCUMENTadaptor, after ingesting a roundup page, downloads + ingests each paper and writes aMENTIONSedge page→paper. Default ON for roundups; force/disable withmetadata["extract_papers"]. - ArchiveBox source — an
archiveboxmcp_toolsource preset (KG-2.59) lists snapshots;core/source_sync.py:_sync_archiveboxingests each archived URL through theDOCUMENTpath (so the body is ArchiveBox-served and roundups still acquire their papers); theLoopControllerruns it as a default-on background stage whenARCHIVEBOX_URLis set. The skill-graph pipeline gains anarchiveboxsource kind to build a corpus from snapshots under a tag. - Surfaces —
graph_ingestactionsingest_urlandarchivebox_sync(with the generic action-routed REST twin).
Flow¶
flowchart TB
In["ingest(manifest) — URL / PDF / dir / blog"] --> CT["ContentType.classify"]
CT --> Doc["_ingest_document (DOCUMENT)"]
Doc -->|URL| Fetch["resolve_web_fetch(url)\nArchiveBox → crawl4ai → requests"]
Fetch --> Unit["_ingest_document_file → Document + Chunks + Concept"]
Unit --> Profile["extract_paper_links / is_research_roundup"]
Profile -->|roundup| Papers["acquire_papers (scholarx) → research/papers\n→ recursive ingest_batch(PDFs)"]
Papers --> Match["ConceptMatcher (KG-2.75)\nArticle/ResearchPaper RELATES_TO Concept"]
Papers --> Link["MENTIONS edges: page → each paper"]
Unit --> Match
subgraph Background tick / on demand
Loop["LoopController.run_one_cycle"] --> ABsync["source_sync: _sync_archivebox\n(delta / pull-all)"]
ABsync --> In
SkillGraph["skill_graph_pipeline._acquire_archivebox\n(corpus from archived tag)"] --> ABpreset["archivebox mcp_tool preset"]
end
Configuration¶
ARCHIVEBOX_URL— presence is the on-signal that prefers ArchiveBox for fetching and enables the background intake stage. The credential lives with thearchivebox-apiMCP server, not here.- crawl4ai is auto-detected (
SKILL_GRAPH_CRAWLER/ installedcrawl4ai); no flag. metadata["extract_papers"]—Trueforces paper acquisition for any page,Falsedisables it; absent → auto-detect roundups.
Usage¶
# Ingest a research roundup → page + its papers, cross-linked
graph_ingest action=ingest_url target_path=https://www.turingpost.com/p/... # auto
graph_ingest action=ingest_url target_path=<url> description=extract_papers # force
# Pull everything preserved in ArchiveBox (or delta)
graph_ingest action=archivebox_sync corpus_name=full
# Build a skill-graph from archived snapshots tagged "research"
graph_ingest action=build_skill_graph corpus_name=research-archive base_path='[{"kind":"archivebox","uri":"research"}]'