Ontology Document Processing

Source: https://www.palantir.com/docs/foundry/ontology/document-processing/ (captured during the ontology-parity effort). Concrete feature taxonomy only.

Pipeline stages

Media import & reference extraction — PDFs uploaded as media sets; a "Get Media References" board retrieves references from media-set datasets, producing structured media-reference objects.
Text extraction — a "Text Extraction" board converts document content to raw text, enabling semantic operations on previously inaccessible unstructured content.

Chunking strategy

Breaking larger passages into smaller units serves two purposes: it keeps each input within the embedding model's maximum token limit, and it yields segments that are more semantically distinct from one another during retrieval.

Configuration (Chunk String board)

Target size — configurable character threshold (e.g. ~256 chars).
Separators — multi-priority hierarchy for intelligent split points.
Overlap — configurable context preservation between adjacent chunks (e.g. 20 chars).
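
The three settings above can be illustrated with a minimal Python sketch. This is not the Chunk String board's implementation: the function name `chunk_string`, the default separator list, and the rule "split at the last occurrence of the highest-priority separator inside the window" are all assumptions made for illustration.

```python
from typing import List, Sequence


def chunk_string(
    text: str,
    target_size: int = 256,
    overlap: int = 20,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),  # assumed priority order
) -> List[str]:
    """Split text into chunks of at most target_size characters.

    Hypothetical helper: for each window, the highest-priority separator
    found inside it is used as the split point; if none is found, the
    window is cut at target_size. Each following chunk starts `overlap`
    characters before the previous chunk's end.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    if not 0 <= overlap < target_size:
        raise ValueError("overlap must satisfy 0 <= overlap < target_size")

    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + target_size, len(text))
        if end < len(text):
            # Try separators in priority order. Only split points past the
            # overlap region are accepted, so every iteration advances.
            for sep in separators:
                cut = text.rfind(sep, start + overlap + 1, end)
                if cut != -1:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap
    return chunks
```

With this sketch, every chunk after the first begins with the last `overlap` characters of its predecessor, so dropping that prefix from each later chunk and concatenating reproduces the input.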

Non-code chunking pipeline

Stage	Board	Transformation
1	Chunk String	Text array with overlap
2	Explode Array with Position	Array elements → individual rows + position metadata
3	Field Extraction	Isolate position + chunk values into columns
4	String Concatenation	Unique chunk ids (`original_id` + position)
5	Column Pruning	Remove intermediate columns
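
As a rough Python analogue of the five stages in the table (a sketch only, since the boards themselves are configured in Pipeline Builder rather than written as code): `simple_chunks` is a hypothetical stand-in for stage 1 that ignores separators, the input row keys `original_id` and `text` are assumed, and the `_` between id and position is an assumed delimiter.

```python
from typing import Dict, Iterable, List


def simple_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Stage 1 stand-in: fixed-size windows with overlap, no separator logic."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def explode_chunks(
    rows: Iterable[Dict[str, str]], size: int = 256, overlap: int = 20
) -> List[Dict[str, str]]:
    """Turn one row per document into one row per chunk."""
    out: List[Dict[str, str]] = []
    for row in rows:
        chunks = simple_chunks(row["text"], size, overlap)      # stage 1
        for position, chunk in enumerate(chunks):               # stage 2: explode with position
            out.append({
                "object_id": row["original_id"],                # link back to the source document
                "chunk_id": f'{row["original_id"]}_{position}', # stage 4: id + position
                "chunk": chunk,                                 # stage 3: chunk value as a column
            })                                                  # stage 5: no intermediate columns kept
    return out
```

For example, `explode_chunks([{"original_id": "doc1", "text": "abcdefghij"}], size=4, overlap=1)` produces three rows with chunk ids `doc1_0`, `doc1_1`, and `doc1_2`.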

Output structure

Each chunk row carries object_id (source-document ref for linking), chunk_id, chunk (text), and embedding (vector, generated downstream).
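
One way to picture this row shape is a small Python dataclass. The class name `ChunkRow` and the field types are assumptions; only the four field names come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChunkRow:
    object_id: str                           # source-document reference used for linking
    chunk_id: str                            # unique per chunk
    chunk: str                               # chunk text
    embedding: Optional[List[float]] = None  # filled in downstream by the embedding step
```

Leaving `embedding` optional reflects that the vector is generated after chunking rather than during it.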

Integration with semantic search

Embedding generation — chunked text vectors created via the semantic-search pipeline.
Object materialization — chunks become Ontology objects with bidirectional link types to the source document, embedding + text properties, and full-text/semantic indexing.
Presentation — search results surface alongside the rendered source PDF for source-of-truth cross-validation.
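
To make the retrieval step concrete, here is a toy Python sketch of ranking chunk rows by cosine similarity to a query vector. It is an illustration under stated assumptions, not how the platform's semantic index works: the helper names `cosine` and `top_k` are hypothetical, the rows are plain dicts, and the embeddings are hand-written two-dimensional vectors rather than model output.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], chunk_rows: List[Dict], k: int = 3) -> List[Dict]:
    """Return the k chunk rows whose embeddings are most similar to the query."""
    ranked = sorted(
        chunk_rows,
        key=lambda row: cosine(query_vec, row["embedding"]),
        reverse=True,
    )
    return ranked[:k]
```

Because each returned row still carries its `object_id`, a result can be traced back to the source document for display next to the rendered PDF.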

Advanced processing

For sophisticated chunking (sliding windows, semantic boundaries, hierarchical chunking) use Python/TypeScript functions in code repositories rather than Pipeline Builder boards.
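
As a hedged example of what such a function might look like, the Python sketch below implements a sentence-level sliding window. The function name `sentence_windows`, the regex-based sentence splitter, and the default window and stride are assumptions for illustration; a code-repository function would also need the platform's own function or transform wrappers, which are omitted here.

```python
import re
from typing import List


def sentence_windows(text: str, window: int = 3, stride: int = 2) -> List[str]:
    """Group sentences into overlapping windows.

    Each window holds up to `window` sentences and starts `stride`
    sentences after the previous one, so consecutive windows share
    `window - stride` sentences when stride < window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")

    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    windows: List[str] = []
    for start in range(0, len(sentences), stride):
        windows.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last sentence is already covered
    return windows
```

Splitting on sentence boundaries rather than raw character counts keeps each chunk readable on its own, at the cost of less predictable chunk lengths.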

Design principle

Documents are treated as decomposable knowledge units, and chunk granularity directly affects downstream retrieval precision and recall.