PhyslibSearch

Documentation

PhyslibSearch — semantic search for the PhysLib formal physics library.

Overview

PhyslibSearch is a semantic search engine for PhysLib, a formal Lean 4 physics library. It lets you find theorems, definitions, and instances using natural language — e.g. “Schrödinger equation” or “conservation of angular momentum” — without knowing the exact Lean name.

The index covers 9,663 visible declarations across all PhysLib modules, each with a human-readable informal description generated by an LLM and embedded into a high-dimensional vector space for similarity search.

How it works

The pipeline has three stages:

  1. Informalization. Each Lean 4 declaration is translated into an informal mathematical description (name + explanation) by Gemini. Context such as neighboring declarations in the same module and type dependencies is included in the prompt to improve quality.
  2. Embedding. The informal description is embedded using gemini-embedding-2-preview (3072-dimensional cosine space) and stored in ChromaDB.
  3. Search.At query time the user's query is optionally augmented with HyDE (Hypothetical Document Embeddings): an LLM generates a plausible Lean 4 declaration that would answer the query, and that hypothetical declaration is embedded instead of the raw query. This aligns the query embedding with the document embedding space.

API Reference

Base URL: http://physlibsearch.net. All endpoints accept and return JSON. Rate limits apply: 1 req/s on search, 15 req/min on expand.

POST/search

Search declarations using natural language. Accepts a batch of queries and returns a ranked list of results per query.

Returns list[list[QueryResult]] — one inner list per input query. Sets a session cookie for feedback tracking.

Request body

{
  "query": ["Newton's second law"],
  "num_results": 10          // 1–150, default 10
}

Response

[
  [
    {
      "result": {
        "name": ["UnitExamples", "newtonsSecondWithDim_isDimensionallyCorrect"],
        "kind": "theorem",
        "module_name": ["UnitExamples"],
        "signature": "theorem UnitExamples.newtonsSecondWithDim_isDimensionallyCorrect ...",
        "type": "IsDimensionallyCorrect UnitExamples.NewtonsSecondWithDim",
        "informal_name": "Newton's second law F = ma is dimensionally correct",
        "informal_description": "The proposition representing Newton's second law ...",
        "docstring": null,
        "value": null
      },
      "distance": 0.276
    }
  ]
]
POST/expand

Expand a natural language query using HyDE (Hypothetical Document Embeddings). Gemini generates a plausible Lean 4 declaration that would answer the query. Pass the result to /search for better embedding alignment.

Request body is a raw JSON string (quoted). Response is a raw JSON string.

Request body

"conservation of angular momentum"

Response

"theorem angularMomentumConservation {J : AngularMomentum} (h : NoExternalTorque J) : IsConserved J := ..."
GET/modules

List all PhysLib modules with their declaration counts.

Response

[
  { "name": ["Physlib", "ClassicalMechanics", "EulerLagrange"], "count": 4 },
  { "name": ["Physlib", "Mathematics", "LinearMaps"],            "count": 38 },
  ...
]
POST/modules/declarations

Fetch all visible declarations in a given module, in source order.

Request body

["Physlib", "ClassicalMechanics", "EulerLagrange"]

Response

[
  {
    "name": ["Physlib", "ClassicalMechanics", "EulerLagrange", "eulerLagrangeEq"],
    "kind": "theorem",
    "module_name": ["Physlib", "ClassicalMechanics", "EulerLagrange"],
    "signature": "theorem Physlib.ClassicalMechanics.EulerLagrange.eulerLagrangeEq ...",
    "informal_name": "Euler–Lagrange equation",
    "informal_description": "...",
    ...
  }
]
POST/fetch

Fetch full declaration records by Lean name. Useful for retrieving specific declarations you already know the name of.

Request body

[
  ["UnitExamples", "newtonsSecondWithDim_isDimensionallyCorrect"],
  ["Physlib", "ClassicalMechanics", "EulerLagrange", "eulerLagrangeEq"]
]

Response

[ { "name": [...], "kind": "theorem", ... }, ... ]

Example with curl:

# Search
curl -s -X POST http://physlibsearch.net/search \
  -H "Content-Type: application/json" \
  -d '{"query": ["Schrödinger equation"], "num_results": 5}' | jq .

# Expand query with HyDE, then search
AUGMENTED=$(curl -s -X POST http://physlibsearch.net/expand \
  -H "Content-Type: application/json" \
  -d '"conservation of energy"')

curl -s -X POST http://physlibsearch.net/search \
  -H "Content-Type: application/json" \
  -d "{"query": [$AUGMENTED], "num_results": 5}" | jq .

Self-hosting

PhyslibSearch runs on three services:

  • PostgreSQL — stores declarations, informal descriptions, and query logs
  • ChromaDB — local file-based vector store (no separate server needed)
  • FastAPI — REST API server (server.py)

Required environment variables (see .env.example):

GEMINI_API_KEY=...          # Google AI Studio key for embeddings & LLM
CONNECTION_STRING=...       # PostgreSQL connection string
CHROMA_PATH=chroma          # Path to ChromaDB files
GEMINI_MODEL=...            # LLM model for informalization (one-time)
GEMINI_FAST_MODEL=...       # LLM model for HyDE augmentation (per query)
GEMINI_EMBEDDING_MODEL=gemini-embedding-2-preview
# 1. Build the index (one-time)
python -m database schema       # create PostgreSQL schema
python -m database jixia        # extract declarations with jixia
python -m database informal     # informalize with LLM
python -m database vector-db    # compute embeddings

# 2. Run the API
uvicorn server:app --host 0.0.0.0 --port 8000

Next steps

  • Continuous integration. As PhysLib grows, the database should update automatically when new declarations are merged. The indexing pipeline is designed to be incremental — each SQL step uses ON CONFLICT DO NOTHING, and the informalization and embedding steps skip already-processed entries — but this needs to be verified end-to-end for every stage before wiring up a CI workflow that triggers on new PhysLib commits.
  • Index QuantumInfo. The QuantumInfo module has not been indexed yet due to integration issues between PhysLib's QuantumInfo dependencies and jixia. Once resolved, re-running python -m database jixia should pick it up incrementally without reprocessing the rest of the library.

Open source

PhyslibSearch is open source, built by Kernel Science. View on GitHub ↗

The repository contains the full indexing pipeline, API server, and this frontend.

  • database/ — indexing pipeline (extraction, informalization, embedding)
  • server.py — FastAPI REST API
  • engine.py — ChromaDB + PostgreSQL retrieval
  • query_expansion.py — HyDE query expansion
  • frontend/ — this Next.js interface