How It Works

DocuLayer has no ML inference, no vector store, no embeddings. It is a deterministic fetch-parse-search pipeline.

Pipeline

1. resolve_identifier("pydantic")
        │
        ▼  shortcut table hit → https://docs.pydantic.dev

2. discover_llms_txt("https://docs.pydantic.dev")
        │
        ▼  fetches /llms.txt → 88 indexed entries

3. _candidate_urls("field validators")
        │
        ▼  keyword-score each entry → top 3 pages
           ["concepts/validators/", "api/validators/", "concepts/models/"]

4. DocFetcher.fetch(url) × 3   (parallel, TTLCache checked first)
        │
        ▼  raw HTML → markdownify → DocParser → list[DocSection]

5. DocSearcher(all_sections).search("field validators", top_k=5)
        │
        ▼  BM25Okapi scores → ranked SearchResult list

6. Return verbatim section content + source URL + fetch timestamp

No embeddings. No vector store. No ML inference. No generated text.

Components

`resolve_identifier`

Turns any identifier string into a canonical documentation URL. Resolution order:

1. Shortcut table (hardcoded for 15 popular packages)
2. PyPI JSON API
3. npm registry
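The fallback order above can be sketched as a short chain of lookups. This is a minimal illustration, not DocuLayer's code: the function name `resolve_with_fallbacks`, the one-entry `SHORTCUTS` table, and the idea of passing the registry lookups in as callables are all assumptions made so the example runs without network access. A real lookup would query, for instance, the PyPI JSON endpoint (`https://pypi.org/pypi/<name>/json`) or the npm registry (`https://registry.npmjs.org/<name>`).

```python
from typing import Callable, Optional, Sequence

# Hypothetical shortcut table; the real one is said to cover 15 packages.
SHORTCUTS = {"pydantic": "https://docs.pydantic.dev"}

# A lookup takes a normalized package name and returns a docs URL or None.
Lookup = Callable[[str], Optional[str]]


def resolve_with_fallbacks(name: str, lookups: Sequence[Lookup]) -> Optional[str]:
    """Return the first documentation URL found, trying the shortcut table first."""
    key = name.strip().lower()
    if key in SHORTCUTS:
        return SHORTCUTS[key]
    # Try each registry lookup in order (e.g. PyPI, then npm).
    for lookup in lookups:
        url = lookup(key)
        if url:
            return url
    return None
```

Because the shortcut table is checked before any lookup runs, a hit there costs no network request at all.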

`discover_llms_txt`

Probes {root_url}/llms.txt. If found, parses the entry list (title + URL per line). If not found, falls back to the root HTML page.
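As a rough sketch of the parsing step, the snippet below extracts title and URL pairs from llms.txt text. It assumes entries follow the common llms.txt convention of Markdown list links (`- [Title](url): optional description`); the `Entry` type, the `parse_llms_txt` name, and the sample URLs are illustrative, not taken from DocuLayer.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    title: str
    url: str


# Matches Markdown list links such as "- [Validators](https://...): description".
LINK = re.compile(r"^\s*[-*]\s*\[([^\]]+)\]\(([^)\s]+)\)")


def parse_llms_txt(text: str) -> list[Entry]:
    """Collect one Entry per link line; headings and prose lines are ignored."""
    entries = []
    for line in text.splitlines():
        match = LINK.match(line)
        if match:
            entries.append(Entry(match.group(1).strip(), match.group(2)))
    return entries
```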

`_candidate_urls`

Scores every llms.txt entry against the query by simple term overlap, then returns the top 1–3 page URLs to fetch. Pages that share no terms with the query are never fetched, which keeps network requests to a minimum.
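Term-overlap scoring can be illustrated in a few lines. The sketch below assumes entries are `(title, url)` tuples and that both the title and the URL path contribute terms; the tokenizer, the `limit` parameter, and the tie-breaking rule (index order) are assumptions rather than documented behavior.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def candidate_urls(query: str, entries: list[tuple[str, str]], limit: int = 3) -> list[str]:
    """Return up to `limit` URLs whose title or path shares terms with the query."""
    query_terms = tokenize(query)
    scored = []
    for title, url in entries:
        overlap = len(query_terms & tokenize(title + " " + url))
        if overlap:  # entries with no shared terms are skipped entirely
            scored.append((overlap, url))
    # Python's sort is stable, so ties keep their original index order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored[:limit]]
```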

`DocFetcher`

Fetches URLs via httpx with configurable timeout
Checks TTLCache first; a cache hit skips the network entirely
Converts raw HTML → Markdown via markdownify
Passes Markdown to DocParser

`DocParser`

Splits a Markdown document into DocSection objects by heading level (h1–h6). Each section carries:

title — heading text
content — verbatim Markdown body beneath the heading
source_url — the page URL
depth — heading level
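A heading-based splitter along these lines might look as follows. The `DocSection` fields mirror the list above, but the `split_sections` function, the ATX-heading regex, and the choices to skip `#` lines inside fenced code and to drop text before the first heading are assumptions of this sketch.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class DocSection:
    title: str
    content: str
    source_url: str
    depth: int


def split_sections(markdown: str, source_url: str) -> list[DocSection]:
    """Split Markdown into sections, one per heading (h1 to h6)."""
    sections: list[DocSection] = []
    title, depth, body = None, 0, []
    in_fence = False

    def flush() -> None:
        # Emit the section collected so far, if a heading has been seen.
        if title is not None:
            sections.append(DocSection(title, "\n".join(body).strip(), source_url, depth))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, depth, body = match.group(2), len(match.group(1)), []
        elif title is not None:
            body.append(line)  # text before the first heading is dropped
    flush()
    return sections
```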

`DocSearcher`

Runs BM25Okapi (from rank-bm25) over the section corpus. Returns the top-k sections by BM25 score. No ML, no internet, no state.
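To show what the ranking step computes, here is a pure-Python sketch of Okapi BM25 scoring over pre-tokenized sections. It is a stand-in for illustration only: DocuLayer uses `BM25Okapi` from rank-bm25, and the IDF form below is one common non-negative variant that may differ in detail from the library's. The function names and the `k1 = 1.5`, `b = 0.75` defaults are assumptions of this sketch.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], corpus: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document in `corpus` against the tokenized query."""
    if not corpus:
        return []
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query: list[str], corpus: list[list[str]], k: int = 5) -> list[int]:
    """Return the indices of the k highest-scoring documents."""
    scores = bm25_scores(query, corpus)
    order = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

The same inputs always produce the same ranking, which is the determinism the pipeline relies on.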

`TTLCache`

An in-memory LRU+TTL cache keyed by URL. Entries expire after DOCULAYER_CACHE_TTL seconds (default 3600, i.e. 1 hour). Nothing is written to disk.
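One way to combine LRU eviction with per-entry expiry is an `OrderedDict` that stores an expiry time next to each value. The class below is a sketch under assumptions: the name `SketchTTLCache`, the `max_entries` limit, and the injectable `clock` are hypothetical, and only the `DOCULAYER_CACHE_TTL` variable and its one-hour default come from the description above.

```python
import os
import time
from collections import OrderedDict
from typing import Any, Callable, Optional

# TTL in seconds, read from the environment; 3600 s is the stated 1-hour default.
DEFAULT_TTL = float(os.environ.get("DOCULAYER_CACHE_TTL", "3600"))


class SketchTTLCache:
    def __init__(self, ttl: float = DEFAULT_TTL, max_entries: int = 128,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._max = max_entries  # hypothetical size limit for LRU eviction
        self._clock = clock
        self._items: "OrderedDict[str, tuple[float, Any]]" = OrderedDict()

    def get(self, url: str) -> Optional[Any]:
        item = self._items.get(url)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._items[url]  # expired entries are dropped on access
            return None
        self._items.move_to_end(url)  # mark as most recently used
        return value

    def set(self, url: str, value: Any) -> None:
        self._items[url] = (self._clock() + self._ttl, value)
        self._items.move_to_end(url)
        while len(self._items) > self._max:
            self._items.popitem(last=False)  # evict the least recently used
```

Everything lives in a plain dictionary, so the cache disappears with the process and leaves no files behind.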

Design Decisions

Why BM25 and not embeddings?

Embeddings require an ML model (runtime, memory, latency), a vector index, and periodic re-embedding as docs change. BM25 is deterministic, needs no model or index service (only the lightweight rank-bm25 package), and works well for exact API name lookup, the dominant use case.

Why llms.txt first?

Without llms.txt, answering a query about, say, the httpx docs means crawling potentially hundreds of pages. With llms.txt, DocuLayer fetches only 1–3 pages, selected from the index the package maintainer published. This turns a multi-page crawl into a handful of requests, which cuts latency sharply.

Why in-memory TTL cache and no disk?

Disk writes create state that outlasts the process — stale cache files, permissions issues, path conflicts in containerized environments. Memory is simpler, safer, and the TTL guarantees freshness without a separate invalidation mechanism.
