Document Processing

Grampus ships three built-in tools that let agents ingest PDF, Word (.docx), and Excel (.xlsx) files as structured text chunks, ready to store in EpisodicMemory or feed directly into a RAG pipeline.

Installation

Document processing requires optional extras:

pip install 'grampus-ai[documents]'

This installs: pymupdf4llm (PDF primary), pypdf (PDF fallback), python-docx (Word), openpyxl (Excel).

The core Grampus install is unaffected if you skip [documents]. An agent that calls a document tool without the extras receives an error result with code="MISSING_DEPENDENCY".

Quick Start

from grampus.tools.library import LIBRARY_REGISTRY

registry = LIBRARY_REGISTRY

# Call via the registry (the tool is a coroutine, so await it inside an async function)
result = await registry.get_or_raise("read_pdf").fn(
    path="/data/paper.pdf",
    chunk_size=512,
)

if result["ok"]:
    for chunk in result["chunks"]:
        print(chunk["context_header"], "—", chunk["content"][:80])

Or import the functions directly:

from grampus.tools.library.document_tools import read_pdf_tool, read_docx_tool, read_excel_tool

result = await read_pdf_tool(path="/data/paper.pdf")

Chunking Strategies

| Strategy | How it splits | When to use |
| --- | --- | --- |
| `recursive` (default) | Paragraph → sentence → word boundaries | Prose, reports, research papers |
| `fixed` | Fixed word-count windows with 10% overlap | Dense tables, logs, structured data |

Recursive was the top-scoring strategy in the 2026 benchmark (69% end-to-end accuracy across 50 papers). It avoids breaking mid-sentence wherever possible, which keeps each chunk's embedding semantically coherent.
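To make the paragraph → sentence → word cascade concrete, here is a minimal sketch of recursive splitting. It is not Grampus's implementation: the names `recursive_split` and `merge`, the word-count budget, and the regex-based sentence splitter are all assumptions made for illustration.

```python
import re


def recursive_split(text: str, max_words: int) -> list[str]:
    """Illustrative only: split on paragraphs, then sentences, then words."""
    # Separators tried in order: blank-line paragraph breaks, then sentence ends.
    separators = [r"\n\s*\n", r"(?<=[.!?])\s+"]

    def split(piece: str, level: int) -> list[str]:
        stripped = piece.strip()
        if len(stripped.split()) <= max_words:
            return [stripped] if stripped else []
        if level < len(separators):
            out: list[str] = []
            for part in re.split(separators[level], piece):
                out.extend(split(part, level + 1))
            return out
        # Last resort: a single sentence is too long, so cut on word boundaries.
        words = piece.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    return merge(split(text, 0), max_words)


def merge(pieces: list[str], max_words: int) -> list[str]:
    """Greedily pack adjacent pieces into chunks of at most max_words words."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for piece in pieces:
        n = len(piece.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(piece)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In this sketch a sentence is only cut when it alone exceeds the budget, which mirrors the "never mid-sentence when avoidable" behaviour described above.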

The fixed strategy is the better choice when you need deterministic, overlapping windows, for example when matching short query phrases against a code listing or a financial statement.

# Fixed strategy: 256-word windows with the built-in 10% overlap
result = await read_pdf_tool(
    path="/data/report.pdf",
    chunk_size=256,
    chunk_strategy="fixed",
)
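The windowing behind the fixed strategy can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name `fixed_windows` and the `overlap_ratio` parameter are hypothetical, and `chunk_size` is taken to be a word count, as the strategy table suggests.

```python
def fixed_windows(text: str, chunk_size: int, overlap_ratio: float = 0.10) -> list[str]:
    """Illustrative only: fixed word-count windows that share a small overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    overlap = int(chunk_size * overlap_ratio)   # words shared by neighbouring windows
    step = max(1, chunk_size - overlap)         # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                               # the last window reached the end
    return chunks
```

Because the window boundaries depend only on word positions, the same input always yields the same chunks, which is the determinism the fixed strategy is chosen for.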

Contextual Retrieval

Every chunk carries a context_header field: the breadcrumb of headings above that chunk's text, for example:

"Annual Report 2024 > Financial Results > Q3 Revenue"

The context_header is not included in content. Concatenate the two before embedding so that each chunk remains self-contained when it is retrieved in isolation:

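# embed() and memory_manager are placeholders for your own embedding function and memory manager.
for chunk in result["chunks"]:
    # Embed header and content together, but store only the content.
    embedding_text = chunk["context_header"] + "\n\n" + chunk["content"]
    vector = await embed(embedding_text)
    await memory_manager.remember(chunk["content"], embedding=vector)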

The context_header falls back to metadata.title if no headings are present, and to the file's basename if neither is set.
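The fallback order described above can be pictured with a small helper. This is a sketch, assuming the heading breadcrumb arrives as a list of strings; `resolve_context_header` and its parameters are hypothetical names, not part of the Grampus API.

```python
import os
from typing import Optional


def resolve_context_header(headings: list[str], title: Optional[str], path: str) -> str:
    """Illustrative only: headings first, then the document title, then the file name."""
    if headings:
        return " > ".join(headings)      # breadcrumb, outermost heading first
    if title:
        return title                     # stands in for metadata.title
    return os.path.basename(path)        # last resort: the file's basename


print(resolve_context_header(["Annual Report 2024", "Financial Results"], None, "/data/report.pdf"))
print(resolve_context_header([], None, "/data/report.pdf"))
```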

Excel Specifics

Each sheet is processed independently; chunk["sheet"] identifies the source sheet.
The first row is treated as the header.
Sheets with more than 1000 rows are truncated, and a note is appended to the chunk content.
Each sheet is rendered as a Markdown table before chunking.

result = await read_excel_tool(path="/data/financials.xlsx")
for chunk in result["chunks"]:
    print(f"Sheet: {chunk['sheet']}, tokens: {chunk['token_estimate']}")
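For intuition, the header handling, row cap, and Markdown rendering listed above might look roughly like this. The sketch works on plain Python lists rather than openpyxl objects; `sheet_to_markdown`, the wording of the truncation note, and the choice to count only data rows against the cap are assumptions, not the library's behaviour.

```python
def sheet_to_markdown(rows: list[list[object]], max_rows: int = 1000) -> str:
    """Illustrative only: first row is the header, the rest become table rows."""
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    truncated = len(body) > max_rows     # assumption: the cap applies to data rows
    body = body[:max_rows]

    def fmt(row: list[object]) -> str:
        # Empty cells (None) are rendered as blank table cells.
        return "| " + " | ".join("" if cell is None else str(cell) for cell in row) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    if truncated:
        lines.append(f"(truncated to the first {max_rows} rows)")  # hypothetical note text
    return "\n".join(lines)
```

Rendering to Markdown before chunking keeps the column names attached to the text, so a chunker that works on plain text still sees labelled cells.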

Integrating with Memory

from grampus.tools.library.document_tools import read_pdf_tool

result = await read_pdf_tool(path="/data/paper.pdf")
if not result["ok"]:
    raise RuntimeError(result["error"])

for chunk in result["chunks"]:
    await memory_manager.remember(
        content=chunk["content"],
        metadata={
            "source": chunk["metadata"]["source"],
            "page": chunk["page"],
            "section": chunk["context_header"],
        },
    )

File Size Limits

The default limit is 50 MB per file. Files larger than this return:

{"ok": false, "code": "FILE_TOO_LARGE", "error": "File exceeds 50 MB size limit: ..."}
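If you would rather reject oversized files before calling a tool, a client-side pre-check could look like the sketch below. `check_file_size` is a hypothetical helper, the error messages are illustrative rather than the library's exact text, and reading "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
import os

MAX_BYTES = 50 * 1024 * 1024  # assumption: 1 MB is taken as 2**20 bytes


def check_file_size(path: str, max_bytes: int = MAX_BYTES) -> dict:
    """Illustrative only: return a result dict shaped like the tool results."""
    if not os.path.isfile(path):
        return {"ok": False, "code": "FILE_NOT_FOUND", "error": f"No such file: {path}"}
    size = os.path.getsize(path)
    if size > max_bytes:
        return {
            "ok": False,
            "code": "FILE_TOO_LARGE",
            "error": f"{path} is {size} bytes, over the {max_bytes}-byte limit",
        }
    return {"ok": True}
```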

Error Codes

| Code | Cause |
| --- | --- |
| `MISSING_DEPENDENCY` | Required library not installed; run `pip install 'grampus-ai[documents]'` |
| `FILE_NOT_FOUND` | Path does not exist |
| `UNSUPPORTED_FORMAT` | File extension does not match the tool (e.g. passing `.txt` to `read_pdf`) |
| `FILE_TOO_LARGE` | File exceeds the 50 MB size limit |
| `PARSE_ERROR` | Library raised an unexpected error while reading the file |
| `INVALID_STRATEGY` | `chunk_strategy` is not `'recursive'` or `'fixed'` |
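One way to act on these codes is to convert a failed result into an exception that carries the code and a suggested next step. The sketch below assumes only the result shape shown on this page ("ok", "code", "error", "chunks"); `DocumentToolError`, `unwrap_chunks`, and the hint texts are hypothetical and not part of Grampus.

```python
# Hypothetical remediation hints, keyed by the error codes listed above.
HINTS = {
    "MISSING_DEPENDENCY": "run: pip install 'grampus-ai[documents]'",
    "FILE_NOT_FOUND": "check that the path exists",
    "UNSUPPORTED_FORMAT": "use the tool that matches the file extension",
    "FILE_TOO_LARGE": "keep the file under the 50 MB limit",
    "PARSE_ERROR": "inspect the error text from the underlying library",
    "INVALID_STRATEGY": "use chunk_strategy='recursive' or 'fixed'",
}


class DocumentToolError(Exception):
    """Illustrative exception that keeps the machine-readable code."""

    def __init__(self, code: str, message: str):
        hint = HINTS.get(code, "no hint available")
        super().__init__(f"{code}: {message} ({hint})")
        self.code = code
        self.hint = hint


def unwrap_chunks(result: dict) -> list:
    """Return the chunks of a successful result, or raise DocumentToolError."""
    if result.get("ok"):
        return result["chunks"]
    raise DocumentToolError(result.get("code", "UNKNOWN"), result.get("error", ""))
```

Callers can then branch on `exc.code` instead of parsing the error string.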