How It Works

DocuLayer has no ML inference, no vector store, no embeddings. It is a deterministic fetch-parse-search pipeline.

Pipeline

1. resolve_identifier("pydantic")

▼ shortcut table hit → https://docs.pydantic.dev

2. discover_llms_txt("https://docs.pydantic.dev")

▼ fetches /llms.txt → 88 indexed entries

3. _candidate_urls("field validators")

▼ keyword-score each entry → top 3 pages
["concepts/validators/", "api/validators/", "concepts/models/"]

4. DocFetcher.fetch(url) × 3 (parallel, TTLCache checked first)

▼ raw HTML → markdownify → DocParser → list[DocSection]

5. DocSearcher(all_sections).search("field validators", top_k=5)

▼ BM25Okapi scores → ranked SearchResult list

6. Return verbatim section content + source URL + fetch timestamp

No embeddings. No vector store. No ML inference. No generated text.


Components

resolve_identifier

Turns any identifier string into a canonical documentation URL. Resolution order:

  1. Shortcut table (hardcoded for 15 popular packages)
  2. PyPI JSON API
  3. npm registry
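
The resolution order above can be sketched as a chain of lookups that stops at the first hit. This is an illustration only, not DocuLayer's code: the function name `resolve`, the one-entry `SHORTCUTS` table, and the two injected lookup callables (standing in for the PyPI and npm queries) are all assumptions.

```python
from typing import Callable, Optional

# Hypothetical shortcut table; the real one is said to cover 15 packages.
SHORTCUTS = {"pydantic": "https://docs.pydantic.dev"}

# A lookup takes a package name and returns a docs URL, or None if unknown.
Lookup = Callable[[str], Optional[str]]


def resolve(identifier: str, pypi_lookup: Lookup, npm_lookup: Lookup) -> Optional[str]:
    """Return the first documentation URL found, trying each source in order."""
    key = identifier.strip().lower()
    if key in SHORTCUTS:
        return SHORTCUTS[key]
    # Fall through to the registries only when the shortcut table misses.
    for lookup in (pypi_lookup, npm_lookup):
        url = lookup(key)
        if url:
            return url
    return None
```

Passing the registry lookups in as callables keeps the ordering logic testable without network access.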

discover_llms_txt

Probes {root_url}/llms.txt. If found, parses the entry list (title + URL per line). If not found, falls back to the root HTML page.
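
As a rough sketch of the parsing step, assuming entries are written as Markdown links (the form commonly used in llms.txt files), one title and URL can be pulled from each line with a regular expression. The names `Entry` and `parse_llms_txt` are hypothetical, and the URLs in any sample input are placeholders.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    title: str
    url: str


# Matches a Markdown link such as "- [Validators](https://host/path/): notes".
_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def parse_llms_txt(text: str) -> list[Entry]:
    """Collect the first Markdown link on each line; other lines are ignored."""
    entries = []
    for line in text.splitlines():
        match = _LINK.search(line)
        if match:
            entries.append(Entry(match.group(1).strip(), match.group(2)))
    return entries
```

Headings, blank lines, and free text contain no link and are skipped, so the result holds index entries only.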

_candidate_urls

Scores every llms.txt entry against the query by simple term overlap, then returns the top 1–3 page URLs to fetch. Pages outside that set are never fetched, which keeps network requests to a minimum.
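
One plausible reading of "simple term overlap" is to count the query terms that also appear in an entry's title or URL. The sketch below makes that assumption; the tokenizer, the function name `candidate_urls`, and the rule that zero-overlap entries are dropped are illustrative choices, not confirmed behavior.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def candidate_urls(query: str, entries: list[tuple[str, str]], limit: int = 3) -> list[str]:
    """Rank (title, url) entries by shared terms and return up to `limit` URLs."""
    query_terms = tokenize(query)
    scored = []
    for title, url in entries:
        overlap = len(query_terms & tokenize(title + " " + url))
        if overlap > 0:  # entries sharing no term with the query are skipped
            scored.append((overlap, url))
    # The sort is stable, so entries with equal overlap keep their llms.txt order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored[:limit]]
```

There is no stemming here, so "validator" and "validators" count as different terms; a real implementation might normalize them.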

DocFetcher

  • Fetches URLs via httpx with configurable timeout
  • Checks TTLCache first; a cache hit skips the network entirely
  • Converts raw HTML → Markdown via markdownify
  • Passes Markdown to DocParser
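
The cache-first, parallel behavior can be illustrated with asyncio. This sketch makes several assumptions: a plain dict stands in for TTLCache, `fetch` is an injected coroutine (in practice it might wrap an httpx request followed by markdownify), and `fetch_all` is a hypothetical name.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_all(
    urls: list[str],
    cache: dict[str, str],
    fetch: Callable[[str], Awaitable[str]],
) -> list[str]:
    """Fetch pages concurrently, consulting the cache before the network."""

    async def fetch_one(url: str) -> str:
        if url in cache:         # cache hit: no network call
            return cache[url]
        body = await fetch(url)  # cache miss: delegate to the injected fetcher
        cache[url] = body
        return body

    # gather() returns results in the same order as the input URLs.
    return list(await asyncio.gather(*(fetch_one(u) for u in urls)))
```

With a real HTTP client the misses would overlap in time, while hits return immediately.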

DocParser

Splits a Markdown document into DocSection objects by heading level (h1–h6). Each section carries:

  • title — heading text
  • content — verbatim Markdown body beneath the heading
  • source_url — the page URL
  • depth — heading level
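
A minimal heading-based splitter along these lines might look as follows. It assumes ATX headings (`#` through `######`), drops any text before the first heading, and does not special-case `#` lines inside fenced code blocks. The `Section` dataclass mirrors the four fields listed above but is not the actual DocSection class.

```python
import re
from dataclasses import dataclass


@dataclass
class Section:
    title: str
    content: str
    source_url: str
    depth: int


# One to six '#' characters, whitespace, then the heading text.
_HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_sections(markdown: str, source_url: str) -> list[Section]:
    """Start a new section at every heading; body lines attach to the latest one."""
    sections: list[Section] = []
    current = None
    body: list[str] = []
    for line in markdown.splitlines():
        match = _HEADING.match(line)
        if match:
            if current is not None:
                # Close the previous section; only outer blank lines are trimmed.
                current.content = "\n".join(body).strip()
                sections.append(current)
            current = Section(match.group(2), "", source_url, len(match.group(1)))
            body = []
        elif current is not None:
            body.append(line)
    if current is not None:
        current.content = "\n".join(body).strip()
        sections.append(current)
    return sections
```

Because each heading opens its own section, a subsection's text is not repeated inside its parent section.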

DocSearcher

Runs BM25Okapi (from rank-bm25) over the section corpus. Returns the top-k sections by BM25 score. No ML, no internet, no state.
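
To show what the ranking computes, here is a small pure-Python version of Okapi BM25 scoring. It is a stand-in for illustration, not the rank-bm25 implementation: the tokenizer, the parameter values k1 = 1.5 and b = 0.75, and the always-positive IDF variant log((N - n + 0.5) / (n + 0.5) + 1) are assumptions made for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-word characters; underscores stay in names."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query: str, docs: list[str], top_k: int = 5,
              k1: float = 1.5, b: float = 0.75) -> list[tuple[int, float]]:
    """Return (doc_index, score) pairs for the top_k documents, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for index, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            freq = counts[term]
            if freq == 0:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            # Length normalization: long sections need more matches to score well.
            norm = freq + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append((index, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

The same inputs always produce the same ranking, which is the determinism the pipeline relies on.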

TTLCache

An in-memory LRU+TTL cache keyed by URL. Entries expire after DOCULAYER_CACHE_TTL seconds (default 1 hour). Nothing is written to disk.
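
An LRU+TTL cache of this kind can be sketched with an OrderedDict. The class name `SimpleTTLCache`, the `max_entries` limit, and the injectable `clock` are assumptions for illustration; the 3600-second default corresponds to the one-hour TTL mentioned above.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class SimpleTTLCache:
    """In-memory cache with per-entry expiry and least-recently-used eviction."""

    def __init__(self, ttl: float = 3600.0, max_entries: int = 128,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._max = max_entries
        self._clock = clock
        self._data: "OrderedDict[str, tuple[float, str]]" = OrderedDict()

    def get(self, url: str) -> Optional[str]:
        item = self._data.get(url)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[url]      # expired: drop it and report a miss
            return None
        self._data.move_to_end(url)  # mark as most recently used
        return value

    def set(self, url: str, value: str) -> None:
        self._data[url] = (self._clock() + self._ttl, value)
        self._data.move_to_end(url)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry
```

Expired entries are removed lazily on lookup, so no background thread or timer is needed.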


Design Decisions

Why BM25 and not embeddings?

Embeddings require an ML model (runtime, memory, latency), a vector index, and periodic re-embedding as docs change. BM25 is deterministic, needs no model or index infrastructure (only the small rank-bm25 package), and is well suited to exact API-name lookup, the dominant use case.

Why llms.txt first?

Without llms.txt, answering a query about a library such as httpx could require crawling potentially hundreds of pages. With llms.txt, DocuLayer fetches only the 1–3 best-matching pages from an index the package maintainer published. Replacing a crawl with at most three requests is where most of the latency saving comes from.

Why in-memory TTL cache and no disk?

Disk writes create state that outlasts the process: stale cache files, permissions issues, and path conflicts in containerized environments. Memory is simpler and safer, and the TTL bounds how stale an entry can get without a separate invalidation mechanism.