Mini Legal Casebase Search Engine: Design

The Problem

Why Legal Search Is Hard to Get Right

Legal practitioners search differently from general web users. A query like "negligence" isn't a Google search; it's the start of a research trail that needs to surface the right cases, with enough structure to evaluate each result before clicking into it. Three failure modes define the problem:

⊘

Full-text keyword search

Returns any document containing the word "negligence." In a 500-case corpus that's mostly noise. It ignores courts, jurisdictions, and citation relationships, and can't distinguish a case about negligence from one that mentions it once in passing.

⊘

LLM chatbot ("ask me anything")

Synthesises an answer from retrieved chunks. The citation it quotes may not exist. The rule it states may blend two different cases. You can't easily trace which source supports which claim. Hallucination risk is highest exactly where precision matters most.

✓

A casebase

Returns ranked, intact cases with metadata. You see the actual judgment, not a synthesis of it. Relevance is transparent: score breakdown plus highlighted matches. Navigation is structured: filter by court, browse by citation, follow cited-by links through the citation graph.

Architecture

System Overview

Five layers. Ingestion is a one-time offline job: reads raw judgment files, normalises them into the Case schema, extracts citations, embeds each paragraph, writes to three stores. Search is a live FastAPI server that reads those stores at query time.

Input

Corpus: JSON with metadata · Plain text + YAML header · AustLII HTML

Ingestion

Pipeline: Parse & normalise · Field extraction · Citation detection · Paragraph chunking · Embedding (MiniLM)

Storage

Whoosh BM25 Index: Field-weighted · Per-field analyzers

Vector Store (.npy): 384-dim embeddings · Para-level granularity

PostgreSQL: Case metadata · Citation graph edges

API

FastAPI: Query router · Hybrid ranker · Facet filter · Snippet generator · Citation lookup

UI

React SPA: Search bar · Facet panel · Results list · Case detail view

Why three stores? Each one fits a different access pattern. Whoosh handles full-text BM25 with field weighting. The NumPy file handles vector similarity with no overhead, just a matrix multiply. PostgreSQL handles relational queries: facet counts, citation graph traversal, metadata joins.

Why offline ingestion? Embedding 50 paragraphs per case with all-MiniLM-L6-v2 takes about 2 seconds on CPU. Doing that for every case at query time is a non-starter. Running it once at index time leaves only the query string to embed per request, so query latency is dominated by retrieval, typically under 50ms.

Data Model

The Case Schema

Every judgment is normalised into a Case dataclass before indexing. The schema is flat at the top level, with nesting only where the data is genuinely hierarchical (paragraphs with their embeddings).

python

from dataclasses import dataclass, field
from datetime import date

@dataclass
class Paragraph:
    number: int          # Paragraph number in judgment
    text: str            # Raw paragraph text
    embedding: list[float]  # 384-dim vector (all-MiniLM-L6-v2)

@dataclass
class Case:
    id: str              # SHA-256 of normalised citation (stable, unique)
    name: str            # "Donoghue v Stevenson"
    citation: str        # "[1932] AC 562"
    court: str           # "House of Lords"
    jurisdiction: str    # "United Kingdom"
    date: date           # Date of judgment
    judges: list[str]    # Presiding judges
    catchwords: list[str]   # Issue labels (manual or NLP-extracted)
    full_text: str       # Complete judgment text (UTF-8)
    paragraphs: list[Paragraph]  # Segmented, each independently embedded
    citations_out: list[str]  # Case IDs this case cites
    citations_in: list[str]   # Case IDs that cite this one (reverse-indexed)

id

SHA-256 of the normalised citation string. Stable across re-ingestion, unique per case. Primary key in PostgreSQL and the Whoosh document ID.

paragraphs

Split by paragraph number from the source HTML or text. Each paragraph carries its own 384-dim embedding, so vector search can match specific passages inside long judgments.

citations_out / citations_in

citations_out is extracted from the text during ingestion. citations_in is derived at index time by reversing all citations_out edges across the corpus.
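The reversal step can be sketched in a few lines. This is a minimal illustration, assuming citations_out has already been resolved to case IDs; the function name and the short IDs in the example are hypothetical.

```python
from collections import defaultdict

def build_citations_in(citations_out: dict[str, list[str]]) -> dict[str, list[str]]:
    """Reverse every (citing -> cited) edge to get (cited -> citing)."""
    reverse: dict[str, list[str]] = defaultdict(list)
    for src_id, targets in citations_out.items():
        for tgt_id in targets:
            if tgt_id in citations_out:      # keep only targets that are in the corpus
                reverse[tgt_id].append(src_id)
    # Every case gets an entry, even if nothing cites it.
    return {case_id: sorted(reverse.get(case_id, [])) for case_id in citations_out}

# Hypothetical three-case corpus keyed by short IDs instead of SHA-256 digests.
out_edges = {
    "donoghue": [],
    "caparo": ["donoghue"],
    "sullivan": ["donoghue", "caparo", "not-in-corpus"],
}
in_edges = build_citations_in(out_edges)
```

Because the reverse map is rebuilt from the full set of outbound edges, it does not depend on the order in which cases were ingested.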

Ingestion

From Raw Judgment to Indexed Record

The ingestion script (python ingest.py corpus/) walks a directory of judgment files and processes each through five stages. It's idempotent: re-running it skips cases whose SHA-256 ID is already in the index.

01

Parse

Detects input format by file extension and magic bytes. AustLII HTML is scraped with BeautifulSoup. Plain text files expect a YAML frontmatter block (--- at the top) with name, citation, court, and date. JSON files are loaded directly. Output is a normalised dict regardless of input format.
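A rough sketch of the detection and frontmatter steps, using only the standard library. The function names are hypothetical, the "magic bytes" check is reduced to looking at the first characters, and the header parser handles flat key: value lines only; a real pipeline would likely use a YAML library.

```python
from pathlib import Path

def detect_format(path: str, head: str) -> str:
    """Guess the input format from the extension, then from the first characters."""
    ext = Path(path).suffix.lower()
    start = head.lstrip().lower()
    if ext == ".json" or start.startswith("{"):
        return "json"
    if ext in {".html", ".htm"} or start.startswith(("<!doctype", "<html")):
        return "html"
    return "text"

def split_frontmatter(raw: str) -> tuple[dict[str, str], str]:
    """Split a '---' delimited header from the body."""
    lines = raw.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, raw
    try:
        end = lines.index("---", 1)
    except ValueError:
        return {}, raw                      # no closing delimiter: treat as plain body
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return meta, "\n".join(lines[end + 1:]).lstrip("\n")

raw = (
    "---\n"
    "name: Donoghue v Stevenson\n"
    'citation: "[1932] AC 562"\n'
    "court: House of Lords\n"
    "date: 1932-05-26\n"
    "---\n"
    "\n"
    "My Lords, ..."
)
meta, body = split_frontmatter(raw)
```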

02

Normalise

Date strings are parsed into Python date objects. Citation strings are converted to a canonical form ("(1932) AC 562" → "[1932] AC 562"). Court names are mapped to a controlled vocabulary. Unicode is NFC-normalised. The output is a validated Python dataclass.
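The normalisation rules might look like the sketch below. The helper names and the list of date formats are assumptions. The bracket rewrite is deliberately narrow: it only fires when the year is followed directly by a series abbreviation, so volume-numbered reports such as "(2001) 207 CLR 562" keep their round brackets.

```python
import hashlib
import re
import unicodedata
from datetime import date, datetime

def normalise_citation(raw: str) -> str:
    """Canonicalise Unicode form, whitespace and year brackets."""
    text = unicodedata.normalize("NFC", raw)
    text = re.sub(r"\s+", " ", text).strip()
    # "(1932) AC 562" -> "[1932] AC 562"; leave "(2001) 207 CLR 562" untouched.
    return re.sub(r"^\((\d{4})\)(?= [A-Za-z])", r"[\1]", text)

def case_id(citation: str) -> str:
    """SHA-256 hex digest of the normalised citation, used as the stable case ID."""
    return hashlib.sha256(normalise_citation(citation).encode("utf-8")).hexdigest()

def parse_date(raw: str) -> date:
    """Try a couple of common formats; the real list would be longer."""
    for fmt in ("%Y-%m-%d", "%d %B %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {raw!r}")
```

Hashing the normalised form rather than the raw string is what makes the ID stable across re-ingestion: two spellings of the same citation map to the same digest.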

03

Extract citations

The full text is scanned with jurisdiction-aware regex patterns covering Australian neutral citations, English series citations, and CLR reports. Each matched citation is normalised and looked up against the in-progress index; if the target case is already indexed, the edge is stored as a live in-corpus link.

04

Chunk and embed

The judgment is split by paragraph number. Each paragraph is encoded with all-MiniLM-L6-v2 into a 384-dim float32 vector. Paragraphs shorter than 30 words are merged with their neighbour before embedding to avoid noise from structural fragments like headings, dates, and signatures.
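One way to implement the merge rule, as a sketch. The text does not say which neighbour a short paragraph joins; this version assumes it is attached to the paragraph that follows (headings usually precede their content), with a trailing fragment folded back into the previous chunk.

```python
def merge_short_paragraphs(paragraphs: list[str], min_words: int = 30) -> list[str]:
    merged: list[str] = []
    carry = ""                      # short text waiting to be attached to the next paragraph
    for para in paragraphs:
        text = f"{carry} {para}".strip() if carry else para
        if len(text.split()) < min_words:
            carry = text            # still too short: keep accumulating
        else:
            merged.append(text)
            carry = ""
    if carry:                       # trailing fragment: attach to the previous chunk
        if merged:
            merged[-1] = f"{merged[-1]} {carry}"
        else:
            merged.append(carry)
    return merged

long_para = " ".join(["word"] * 40)
chunks = merge_short_paragraphs(["JUDGMENT", long_para, "Orders accordingly."])
```

A production version would also need to record which original paragraph numbers ended up in each merged chunk, so that a vector hit can still be mapped back to a location in the judgment.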

05

Write to stores

Case metadata and full text are written to Whoosh. Paragraph embeddings are added to the NumPy store, a memory-mapped .npy file that is rewritten in full and swapped in atomically on each ingest run. The case record and citation edges are inserted into PostgreSQL in a single transaction: the write either completes fully or rolls back.
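The atomic swap of the vector file can be sketched with the standard library alone. This assumes the payload is the already-serialised array (for example the bytes produced by saving the matrix to an in-memory buffer); the function name is hypothetical.

```python
import os
import tempfile

def atomic_write(path: str, payload: bytes) -> None:
    """Write payload to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())      # make sure the bytes are on disk before the rename
        os.replace(tmp_path, path)      # readers see either the old file or the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "vectors.npy")
atomic_write(demo_path, b"first")
atomic_write(demo_path, b"second")
```

Writing the temp file into the same directory matters: a rename is only atomic when source and destination are on the same filesystem.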

BM25 Index

Whoosh: Field-Weighted Keyword Search

The keyword index is built with Whoosh, a pure-Python full-text search library. Each case is a single document with five searchable fields. Field boosts give citation and name matches more weight than body-text matches, which reflects how legal practitioners actually search.

python

from whoosh.fields import Schema, TEXT, ID, STORED, KEYWORD, DATETIME
from whoosh.analysis import StemmingAnalyzer, KeywordAnalyzer

# Field boosts express the relative importance of a match in each field.
# A citation hit is worth twice a body hit; a name hit worth 1.5×.
case_schema = Schema(
    id=ID(stored=True, unique=True),
    name=TEXT(
        stored=True,
        analyzer=StemmingAnalyzer(),
        field_boost=1.5,    # "Donoghue" matches are highly relevant
    ),
    citation=TEXT(
        stored=True,
        analyzer=KeywordAnalyzer(),   # Never stem "[1932] AC 562"
        field_boost=2.0,    # Exact citation lookup is the most precise query
    ),
    court=TEXT(stored=True, field_boost=1.2),
    catchwords=TEXT(
        stored=True,
        analyzer=StemmingAnalyzer(),
        field_boost=1.3,    # Issue labels are denser signal than body text
    ),
    jurisdiction=KEYWORD(stored=True, commas=True),
    date=DATETIME(stored=True),
    full_text=TEXT(analyzer=StemmingAnalyzer(), field_boost=1.0),
)

# Boolean / phrase search is parsed with Whoosh's MultifieldParser:
#   negligence AND "duty of care"     → AND operator
#   Donoghue OR Stevenson             → OR operator
#   "neighbour principle"             → phrase search
#   court:Lords negligence            → field-scoped term

Field Weights

Field	Analyzer	Boost	Rationale
`citation`	`KeywordAnalyzer`	2.0×	Exact citation lookup is the most precise query; never stem
`name`	`StemmingAnalyzer`	1.5×	Case name is a strong relevance signal
`catchwords`	`StemmingAnalyzer`	1.3×	Curated issue labels are denser signal than body prose
`court`	`StandardAnalyzer`	1.2×	Court name is a common filter anchor
`full_text`	`StemmingAnalyzer`	1.0×	Baseline body-text scoring

Whoosh's MultifieldParser searches all fields at once unless a term is scoped to a specific field (court:Lords, citation:"AC 562"). Boolean operators work natively: negligence AND "duty of care", Donoghue OR Stevenson, phrase search with quotes. BM25F (the variant with per-field boosts) is Whoosh's default scorer, so no custom weighting model is needed.

Vector Index

Paragraph-Level Semantic Embeddings

The vector index enables conceptual search: finding cases about a legal principle even when the exact keywords don't appear in the text. It runs at paragraph granularity rather than document granularity. That's the critical choice.

Model: all-MiniLM-L6-v2

A 22M-parameter SBERT model producing 384-dim embeddings. It is fast enough to embed a query on CPU in under 100ms, it was trained on diverse sentence pairs and so gives useful similarity on legal prose out of the box, and the float32 vectors for 1,500 paragraphs fit entirely in RAM at about 2.3MB (1,500 × 384 × 4 bytes).

Why paragraph-level?

A 50-page judgment's document embedding is the centroid of hundreds of paragraph embeddings, an average that's hard to distinguish from other long judgments on similar topics. A query about "the neighbour principle in negligence" should match the paragraph where Lord Atkin articulates it, not the judgment's global average.

Brute-force cosine, not HNSW

At 50 cases × 30 paragraphs = 1,500 vectors of 384 floats, a full NumPy cosine scan takes under 2ms. FAISS HNSW needs a 50ms index build and gives no query speedup at this scale. ANN starts paying off around 50,000 vectors, roughly 1,700 cases at 30 paragraphs each. Below that, brute force wins.

Case score = max paragraph

A case's vector score is the highest cosine similarity across all its paragraphs. Averaging would penalise long judgments with many off-topic paragraphs. Max pooling rewards cases that contain at least one paragraph highly relevant to the query, which is the right criterion for retrieval.
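The brute-force scan and max pooling described above can be sketched as a single matrix-vector product followed by a per-case maximum. The function and variable names are hypothetical, and toy 3-dim vectors stand in for the 384-dim MiniLM embeddings.

```python
import numpy as np

def case_scores(query_vec: np.ndarray, para_matrix: np.ndarray,
                para_case_ids: list[str]) -> dict[str, float]:
    """Cosine similarity of the query against every paragraph, max-pooled per case."""
    # L2-normalise both sides so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    m = para_matrix / np.linalg.norm(para_matrix, axis=1, keepdims=True)
    sims = m @ q                                   # shape: (n_paragraphs,)

    scores: dict[str, float] = {}
    for case_id, sim in zip(para_case_ids, sims):
        # Keep the best paragraph seen so far for this case.
        scores[case_id] = max(scores.get(case_id, -1.0), float(sim))
    return scores

matrix = np.array([
    [1.0, 0.0, 0.0],   # case A, paragraph 1
    [0.0, 1.0, 0.0],   # case A, paragraph 2
    [0.6, 0.8, 0.0],   # case B, paragraph 1
], dtype=np.float32)
owners = ["A", "A", "B"]
scores = case_scores(np.array([1.0, 0.0, 0.0], dtype=np.float32), matrix, owners)
```

Case A wins on the strength of its first paragraph alone; its unrelated second paragraph does not drag the score down, which is the behaviour averaging would not give.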

Hybrid Ranking

Combining BM25 and Cosine Similarity

Neither BM25 nor vector search is always better. BM25 wins on citation lookups and case name searches. Vector wins on conceptual queries. The two are blended with a normalised linear combination, with a confidence check that lets high-precision keyword queries skip the vector path entirely.

score = α · norm(BM25) + (1 − α) · norm(cosine)

Default α = 0.7 · both distributions normalised to [0, 1] before combining

python

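from scipy.spatial.distance import cosine   # cosine *distance*; similarity = 1 - distance

# ix, qp, model, paragraph_store, HIGH_CONFIDENCE_THRESHOLD and build_results
# are assumed to be module-level objects set up when the server starts.

def search(query: str, alpha: float = 0.7, top_k: int = 10):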
    # ── 1. BM25 via Whoosh ────────────────────────────────────────────
    with ix.searcher() as searcher:
        hits = searcher.search(qp.parse(query), limit=top_k * 2)
        bm25_raw = {h["id"]: h.score for h in hits}

    # ── 2. Query routing ──────────────────────────────────────────────
    # If BM25 is already confident (exact citation or case name match),
    # bypass vector search entirely — keyword wins.
    if bm25_raw and max(bm25_raw.values()) > HIGH_CONFIDENCE_THRESHOLD:
        return build_results(bm25_raw, mode="keyword")

    # ── 3. Vector search (paragraph-level) ────────────────────────────
    # Embed the query using the same model used at index time.
    q_vec = model.encode(query)          # shape: (384,)

    cosine_raw = {}
    for case_id, paragraphs in paragraph_store.items():
        # A case's score = the best paragraph match (max pooling).
        # Paragraph-level search finds relevant passages inside long judgments.
        scores = [1 - cosine(q_vec, p.embedding) for p in paragraphs]
        cosine_raw[case_id] = max(scores)

    # ── 4. Normalise both distributions to [0, 1] ──────────────────────
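    def norm(d):
        if not d:                        # no hits on this path: nothing to scale
            return {}
        lo, hi = min(d.values()), max(d.values())
        if hi == lo:                     # single hit or all-equal scores
            return {k: 1.0 for k in d}
        return {k: (v - lo) / (hi - lo) for k, v in d.items()}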

    bm25_n   = norm(bm25_raw)
    cosine_n = norm(cosine_raw)

    # ── 5. Weighted linear combination ────────────────────────────────
    #   score = α · norm(BM25) + (1 − α) · norm(cosine)
    #   Default α = 0.7 favours keyword precision;
    #   lower α shifts weight toward semantic recall.
    all_ids = set(bm25_n) | set(cosine_n)
    hybrid = {
        cid: alpha * bm25_n.get(cid, 0) + (1 - alpha) * cosine_n.get(cid, 0)
        for cid in all_ids
    }
    ranked = sorted(hybrid, key=hybrid.get, reverse=True)[:top_k]
    return build_results({cid: hybrid[cid] for cid in ranked}, mode="hybrid")

Effect of α on search behaviour

α = 1.0: Pure BM25. Maximum precision, no semantic recall.

α = 0.7: Default. Strong keyword preference; vector covers conceptual queries that BM25 misses.

α = 0.5: Equal weight. Good when the corpus is small and both signals are equally reliable.

α = 0.0: Pure semantic. Maximum recall; can surface stylistically similar cases that aren't legally relevant.
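A small worked example makes the effect of α concrete. The raw scores below are made up for illustration, and the helper names are hypothetical; the blend itself is the same normalised linear combination given above.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1]; all-equal inputs map to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_rank(bm25: dict[str, float], cosine: dict[str, float], alpha: float) -> list[str]:
    b, c = min_max(bm25), min_max(cosine)
    ids = set(b) | set(c)
    combined = {i: alpha * b.get(i, 0.0) + (1 - alpha) * c.get(i, 0.0) for i in ids}
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(combined, key=lambda i: (-combined[i], i))

# case_a is the strongest keyword match, case_c the strongest semantic match.
bm25_raw = {"case_a": 12.0, "case_b": 6.0, "case_c": 2.0}
cosine_raw = {"case_a": 0.30, "case_b": 0.40, "case_c": 0.80}

keyword_heavy = hybrid_rank(bm25_raw, cosine_raw, alpha=0.7)
semantic_heavy = hybrid_rank(bm25_raw, cosine_raw, alpha=0.2)
```

At α = 0.7 the blended scores are roughly 0.70, 0.34 and 0.30, so case_a leads. At α = 0.2 they are roughly 0.20, 0.24 and 0.80, and the order reverses.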

Search Pipeline

End-to-End Query Flow

Each search request follows this path. The router is the key decision point: high-confidence keyword queries skip the vector path, keeping latency under 10ms for citation lookups while still offering semantic recall for open-ended queries.

User Query

"negligence duty of care"

↓

Query Router

Check BM25 confidence > threshold?

↓

High confidence

BM25 Only

Whoosh field-weighted scoring

or

Low confidence

BM25 + Vector

Cosine sim over paragraph embeddings

↓

Hybrid Ranker

α · norm(BM25) + (1 − α) · norm(cosine)

↓

Facet Filter

Court · Year · Jurisdiction applied post-rank

↓

Snippet Generator

Sliding-window → sentence boundary → highlight

↓

Ranked Results

JSON → React UI
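The facet stage of this flow could look something like the following sketch. The Hit type and both functions are hypothetical, and the counts here are computed in memory over the ranked hits as a simplification rather than through the index.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Hit:
    case_id: str
    court: str
    jurisdiction: str
    year: int
    score: float

def apply_facets(hits, courts=(), jurisdictions=(), year_from=None, year_to=None):
    """Keep hits matching every active facet; an empty facet means 'no restriction'."""
    def keep(h: Hit) -> bool:
        return (
            (not courts or h.court in courts)
            and (not jurisdictions or h.jurisdiction in jurisdictions)
            and (year_from is None or h.year >= year_from)
            and (year_to is None or h.year <= year_to)
        )
    return [h for h in hits if keep(h)]          # ranking order is preserved

def facet_counts(hits) -> dict[str, dict[str, int]]:
    return {
        "court": dict(Counter(h.court for h in hits)),
        "jurisdiction": dict(Counter(h.jurisdiction for h in hits)),
    }

hits = [
    Hit("a", "House of Lords", "United Kingdom", 1932, 0.91),
    Hit("b", "House of Lords", "United Kingdom", 1990, 0.74),
    Hit("c", "High Court of Australia", "Australia", 2001, 0.52),
]
counts = facet_counts(hits)
filtered = apply_facets(hits, courts=["House of Lords"], year_from=1950)
```

Filtering after ranking keeps the relative order intact, but it also means the ranker should hand over more than one page of hits, otherwise a narrow facet can empty the result list.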

Snippet Generation

Surfacing the Right Passage

A snippet shows where in the judgment the query matched. Taking the first 200 characters is wrong; the relevant passage might be on page 40. A sliding-window algorithm finds the densest region of query-term hits, then expands it to sentence boundaries.

python

import re

def generate_snippet(text: str, terms: list[str], window: int = 50) -> str:
    words = text.split()
    term_set = {t.lower() for t in terms}

    # Slide a window across the text, scoring each position by term density.
    # Higher density = more query terms in this region = more relevant excerpt.
    best_start, best_score = 0, 0
    for i in range(max(1, len(words) - window + 1)):   # +1 so the final window is scored
        score = sum(
            1 for w in words[i : i + window]
            if w.lower().strip(".,;:()") in term_set
        )
        if score > best_score:
            best_start, best_score = i, score

    # Expand the window slightly on each side.
    start = max(0, best_start - 8)
    end = min(len(words), best_start + window + 8)
    snippet = " ".join(words[start:end])

    # Snap to sentence boundaries so the snippet reads coherently.
    if start > 0:
        # Drop the leading partial sentence, unless that would leave nothing.
        trimmed = re.sub(r'^[^.!?]*[.!?]\s*', '', snippet)
        if trimmed:
            snippet = trimmed
    if end < len(words):
        m = re.search(r'[.!?]', snippet[::-1])   # last sentence end, searching backwards
        if m:
            snippet = snippet[: len(snippet) - m.start()]

    # Highlight matched terms with <mark> so the UI can render them.
    for term in terms:
        snippet = re.sub(
            rf'\b({re.escape(term)})\b', r'<mark>\1</mark>',
            snippet, flags=re.IGNORECASE
        )
    return f"…{snippet}…"

Citation Graph

Parsing and Indexing Case References

Legal judgments are dense with citations to prior cases. Detecting and linking them turns the corpus into a navigable citation graph: "cited by", similar cases by shared citation, and direct in-corpus links.

python

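import re

# Citations appear in multiple jurisdiction-specific formats.
# The patterns below cover the major Australian and UK formats; each is tried in turn.
# normalise() is assumed to be the citation canonicaliser from the Normalise stage.
PATTERNS = [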
    # Australian neutral citations   → [2001] HCA 14
    r'\[(?P<year>\d{4})\]\s*(?P<court>HCA|FCA|FCAFC|NSWCA|VSCA|QCA)\s*(?P<num>\d+)',

    # English series citations        → [1990] 2 AC 605  /  (1932) AC 562
    r'[\[\(](?P<year>\d{4})[\]\)]\s*(?:\d+\s*)?'
    r'(?P<series>AC|QB|Ch|WLR|All ER|EWCA Civ|EWCA Crim|UKSC)\s*(?P<page>\d+)',

    # Commonwealth Law Reports        → (2001) 207 CLR 562
    r'\((?P<year>\d{4})\)\s*(?P<vol>\d+)\s*CLR\s*(?P<page>\d+)',
]

def extract_citations(text: str) -> list[str]:
    found = []
    for pat in PATTERNS:
        for m in re.finditer(pat, text):
            found.append(normalise(m.group(0)))
    return list(dict.fromkeys(found))   # deduplicate, preserve order

# Edges are stored in Postgres; an index on tgt_id keeps reverse lookup ("cited by") fast:
# CREATE TABLE citation_edges (
#     src_id TEXT REFERENCES cases(id),
#     tgt_citation TEXT,          -- normalised citation string
#     tgt_id TEXT,                -- NULL if target not in corpus
#     PRIMARY KEY (src_id, tgt_citation)
# );
# CREATE INDEX citation_edges_tgt_idx ON citation_edges (tgt_id);

In-corpus vs. out-of-corpus citations. When a cited case is in the index, the link is a live navigation link. When it isn't, the citation string still appears as plain text; you can see what the case references even if those cases aren't in the corpus.
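The in-corpus versus out-of-corpus distinction can be sketched as a lookup against a citation-to-ID map. The Edge type, the function names, and the example IDs are hypothetical; the shape mirrors the citation_edges table, where tgt_id is NULL for targets outside the corpus.

```python
from typing import NamedTuple, Optional

class Edge(NamedTuple):
    src_id: str
    tgt_citation: str
    tgt_id: Optional[str]        # None when the cited case is not in the corpus

def resolve_edges(src_id: str, cited: list[str], index: dict[str, str]) -> list[Edge]:
    """index maps normalised citation string -> case ID for every indexed case."""
    return [Edge(src_id, c, index.get(c)) for c in cited]

def render(edge: Edge) -> str:
    if edge.tgt_id is None:
        return edge.tgt_citation                                          # plain text
    return f'<a href="/cases/{edge.tgt_id}">{edge.tgt_citation}</a>'      # live link

index = {"[1932] AC 562": "id-donoghue"}
edges = resolve_edges("id-x", ["[1932] AC 562", "[1990] 2 AC 605"], index)
rendered = [render(e) for e in edges]
```

Unresolved edges are worth keeping rather than discarding: if the cited case is ingested later, a single update on tgt_citation turns every plain-text reference into a live link.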

API Design

FastAPI Search Endpoint

Two endpoints: POST /search for ranked retrieval and GET /cases/{id} for full case detail. The score_breakdown field on each result shows how much of the score came from BM25 vs. the vector path, making the ranking auditable rather than opaque.

python

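from fastapi import FastAPI
from pydantic import BaseModel
from typing import Literal

# engine, db, paginate and the FullCase response model are assumed to be defined elsewhere.
app = FastAPI()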

class SearchRequest(BaseModel):
    q: str
    mode: Literal["keyword", "semantic", "hybrid"] = "hybrid"
    courts: list[str] = []
    jurisdictions: list[str] = []
    year_from: int | None = None
    year_to: int | None = None
    page: int = 1
    per_page: int = 10

class CaseResult(BaseModel):
    id: str
    name: str
    citation: str
    court: str
    jurisdiction: str
    date: str
    snippet: str           # HTML with <mark> highlights
    score: float           # Normalised to [0, 1]
    score_breakdown: dict  # {"bm25": 0.82, "vector": 0.71, "final": 0.78}

class SearchResponse(BaseModel):
    total: int
    page: int
    results: list[CaseResult]
    facet_counts: dict     # {"court": {"House of Lords": 8, ...}, ...}
    mode_used: str         # "keyword" | "semantic" | "hybrid": which path fired

@app.post("/search", response_model=SearchResponse)
async def search(req: SearchRequest):
    results = engine.search(
        query=req.q,
        mode=req.mode,
        filters={"courts": req.courts, "jurisdictions": req.jurisdictions,
                 "year_range": (req.year_from, req.year_to)},
    )
    return paginate(results, req.page, req.per_page)

@app.get("/cases/{case_id}")
async def get_case(case_id: str) -> FullCase:
    # Returns full text + metadata + citation graph for one case.
    return db.get_case(case_id)
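The paginate helper called by the search endpoint is not shown. A minimal sketch of the slicing logic might look like this; the Page type is a hypothetical stand-in for SearchResponse and omits facet_counts and mode_used.

```python
from dataclasses import dataclass

@dataclass
class Page:
    total: int
    page: int
    results: list

def paginate(results: list, page: int = 1, per_page: int = 10) -> Page:
    """Return one 1-indexed page of results plus the total hit count."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    start = (page - 1) * per_page
    return Page(total=len(results), page=page, results=results[start : start + per_page])

third = paginate(list(range(25)), page=3, per_page=10)
beyond = paginate(list(range(25)), page=4, per_page=10)
```

Requesting a page past the end returns an empty result list with the correct total, which lets the UI render "no more results" without a special error path.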

Interface Design

Proposed UI: Search Results

The results page is deliberately not a chatbot. No message thread, no "ask me anything" prompt, no assistant persona. A search bar, a facet panel, and a ranked list. Each result card shows metadata, a relevance score, and a snippet with highlighted query terms. It reads like a library catalogue, not a chat window.

Proposed UI: Case Detail

Three zones: a structured metadata block at the top (always visible), the full judgment with query terms highlighted in the passage the snippet generator found, and a citation sidebar showing outbound references and reverse links to in-corpus cases that cite this one.

localhost:5173/cases/donoghue-v-stevenson-1932-ac-562

Case: Donoghue v Stevenson

Citation: [1932] AC 562

Court: House of Lords

Date: 26 May 1932

Jurisdiction: United Kingdom

Judges: Lord Atkin, Lord Thankerton, Lord Macmillan (majority)

Catchwords

Negligence · Duty of care · Manufacturer liability · Neighbour principle · Product liability

Judgment: Lord Atkin

My Lords, the sole question for determination in this case is legal: whether, as a matter of law in the circumstances alleged, the defender owed any duty to the pursuer to take care with regard to the security of the bottle of ginger beer. The law of negligence, whether you style it such or treat it as in other systems as a species of culpa, is no doubt based upon a general public sentiment of moral wrongdoing for which the offender must pay. But acts or omissions which any moral code would censure cannot in a practical world be treated so as to give a right to every person injured by them to demand relief…

The rule that you are to love your neighbour becomes in law, you must not injure your neighbour; and the lawyer's question, Who is my neighbour? receives a restricted reply. You must take reasonable care to avoid acts or omissions which you can reasonably foresee would be likely to injure your neighbour. Who, then, in law is my neighbour? The answer seems to be: persons who are so closely and directly affected by my act that I ought reasonably to have them in contemplation as being so affected when I am directing my mind to the acts or omissions which are called in question…

…The question is whether the manufacturer of an article of food, medicine, or the like, sold by him to a distributor in circumstances which prevent the distributor or the ultimate purchaser or consumer from discovering by inspection any defect, is under a legal duty to the ultimate purchaser or consumer to take reasonable care that the article is free from defect likely to cause injury to health. I do not think so ill of our jurisprudence as to suppose that its principles are so remote from the ordinary needs of civilised society…

Design Rationale

Retrieval vs RAG: The Critical Distinction

RAG retrieves chunks of text and feeds them to a language model, which synthesises a new answer. This system does not do that. The retrieval step is the output. No language model touches the results. That's not a limitation; it's the point.

Legal research requires verifiability. A practitioner citing a case in a submission needs to know what the case actually says, not a model's paraphrase of it. This system returns the source and lets the practitioner read it.

Aspect	This Casebase (pure retrieval)	Chatbot / RAG (retrieval + generation)
Query input	Keyword or conceptual query	Natural language question
Output	Ranked list of original case texts	LLM-generated prose answer
Citations	Exact and verified; the source is the result	May be synthesised or hallucinated
Determinism	Same query → same ranked list	Non-deterministic; temperature-dependent
Explainability	Score + matched fields visible	Black-box; hard to audit
Corpus fidelity	Each case is a discrete, intact record	Model may blend text across cases
Legal liability	System returns sources; user interprets	System implicitly interprets sources
Hallucination risk	None; only indexed text is returned	Inherent to all generative models

Design Decisions

Key Tradeoffs

Each decision below was a real fork. The choice made is documented with its rationale and the conditions under which the alternative would have been right.

Paragraph-level embeddings, not document-level

Why

Long judgments (50+ pages) produce document embeddings that average out into noise. A query about "duty of care" should match the paragraphs where Lord Atkin articulates that principle, not the judgment's global centroid.

Tradeoff

30× more vectors, and case scoring requires max-pooling across paragraphs. Fine for a 5–50 case corpus.

BM25 first, vector as fallback

Why

Legal searches are often specific: citation lookups, case name searches, court-scoped queries. BM25 handles these with high precision. Vector search adds recall for conceptual queries where exact keywords don't appear.

Tradeoff

The routing threshold is a tunable hyperparameter. Too low and keyword-only routing fires too often, skipping the vector path and degrading recall; too high and you waste vector compute on queries BM25 would get right anyway.

Whoosh over Elasticsearch for this scale

Why

Elasticsearch requires a JVM, cluster config, and index management, none of which is justified for a 5–50 case corpus. Whoosh is pure Python, embedded, zero-config. The calculus only flips above roughly 10,000 documents.

Tradeoff

Whoosh is single-writer; concurrent indexing needs a file lock. Not a concern at this scale, but it would be the first bottleneck to hit.

Brute-force cosine over ANN index (HNSW/FAISS)

Why

ANN indexes start paying off at roughly 50,000 vectors. At 50 cases × 30 paragraphs = ~1,500 vectors of 384 floats, a full cosine scan takes under 2ms. FAISS adds 50ms of index build time and gives back nothing on queries at this scale.

Tradeoff

Brute-force scales linearly. At ~5,000 paragraphs (≈170 cases) a query still takes under 10ms. Well beyond that, as the corpus approaches the ANN break-even point, switch to FAISS.

Facet counts as real document frequencies

Why

The number next to each facet (e.g. 'House of Lords (8)') is an exact count of matching documents from a faceted Whoosh query, not an estimate. Approximate counts mislead legal researchers who need to know exactly how many matching cases exist.

Tradeoff

Exact facet counts require one extra Whoosh query per facet dimension. At this scale that's negligible.

Citation graph in Postgres, not the search index

Why

Citation relationships are a graph problem: reverse lookups ("who cites this case?"), multi-hop traversal ("find cases two steps from Donoghue"). A SQL join handles this cleanly. Whoosh stored fields don't do graph queries.

Tradeoff

Adds a Postgres dependency. SQLite with the same schema would work just as well at this corpus size.
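To illustrate the SQLite option, here is a sketch of the same citation_edges schema in an in-memory database, with a "cited by" lookup and a multi-hop traversal via a recursive CTE. The case rows, the index name, and the function names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cases (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE citation_edges (
        src_id       TEXT NOT NULL REFERENCES cases(id),
        tgt_citation TEXT NOT NULL,
        tgt_id       TEXT,                    -- NULL if target not in corpus
        PRIMARY KEY (src_id, tgt_citation)
    );
    CREATE INDEX idx_edges_tgt ON citation_edges (tgt_id);  -- speeds up "cited by"
""")
conn.executemany("INSERT INTO cases VALUES (?, ?)",
                 [("a", "Case A"), ("b", "Case B"), ("c", "Case C")])
# b cites a; c cites b; c also cites something outside the corpus.
conn.executemany("INSERT INTO citation_edges VALUES (?, ?, ?)", [
    ("b", "[1932] AC 562", "a"),
    ("c", "[1990] 2 AC 605", "b"),
    ("c", "(2001) 207 CLR 562", None),
])

def cited_by(case_id: str) -> list[str]:
    rows = conn.execute(
        "SELECT src_id FROM citation_edges WHERE tgt_id = ? ORDER BY src_id", (case_id,))
    return [r[0] for r in rows]

def within_hops(case_id: str, max_hops: int) -> list[tuple[str, int]]:
    """Cases that reach case_id through at most max_hops citing steps."""
    rows = conn.execute("""
        WITH RECURSIVE reach(id, hops) AS (
            SELECT src_id, 1 FROM citation_edges WHERE tgt_id = ?
            UNION
            SELECT e.src_id, r.hops + 1
            FROM citation_edges e JOIN reach r ON e.tgt_id = r.id
            WHERE r.hops < ?
        )
        SELECT id, MIN(hops) FROM reach GROUP BY id ORDER BY 2, 1
    """, (case_id, max_hops))
    return [(r[0], r[1]) for r in rows]
```

The hop limit in the recursive step also guards against cycles in the citation graph, since no path longer than max_hops is ever expanded.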