Mini Legal Casebase Search Engine
A search engine for legal judgments, not a chatbot. You put in a query, you get back ranked cases from the actual corpus: metadata, highlighted passages, citation links. The lawyer does the reading.
Why Legal Search Is Hard to Get Right
Legal practitioners search differently from general web users. A query like "negligence" isn't a Google search; it's the start of a research trail that needs to surface the right cases, with enough structure to evaluate each result before clicking into it. Three failure modes define the problem:
Full-text keyword search
Returns any document containing the word "negligence." In a 500-case corpus that's mostly noise. It ignores courts, jurisdictions, and citation relationships, and can't distinguish a case about negligence from one that mentions it once in passing.
LLM chatbot ("ask me anything")
Synthesises an answer from retrieved chunks. The citation it quotes may not exist. The rule it states may blend two different cases. You can't easily trace which source supports which claim. Hallucination risk is highest exactly where precision matters most.
A casebase
Returns ranked, intact cases with metadata. You see the actual judgment, not a synthesis of it. Relevance is transparent: score breakdown plus highlighted matches. Navigation follows the citation graph: filter by court, browse by citation, follow cited-by links.
System Overview
Five layers. Ingestion is a one-time offline job: reads raw judgment files, normalises them into the Case schema, extracts citations, embeds each paragraph, writes to three stores. Search is a live FastAPI server that reads those stores at query time.
The Case Schema
Every judgment is normalised into a Case dataclass before indexing. The schema is flat at the top level, with nesting only where the data is genuinely hierarchical (paragraphs with their embeddings).
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Paragraph:
    number: int                  # Paragraph number in judgment
    text: str                    # Raw paragraph text
    embedding: list[float]       # 384-dim vector (all-MiniLM-L6-v2)

@dataclass
class Case:
    id: str                      # SHA-256 of normalised citation (stable, unique)
    name: str                    # "Donoghue v Stevenson"
    citation: str                # "[1932] AC 562"
    court: str                   # "House of Lords"
    jurisdiction: str            # "United Kingdom"
    date: date                   # Date of judgment
    judges: list[str]            # Presiding judges
    catchwords: list[str]        # Issue labels (manual or NLP-extracted)
    full_text: str               # Complete judgment text (UTF-8)
    paragraphs: list[Paragraph]  # Segmented, each independently embedded
    citations_out: list[str]     # Case IDs this case cites
    citations_in: list[str]      # Case IDs that cite this one (reverse-indexed)

id: SHA-256 of the normalised citation string. Stable across re-ingestion, unique per case. Primary key in PostgreSQL and the Whoosh document ID.
paragraphs: Split by paragraph number from the source HTML or text. Each paragraph carries its own 384-dim embedding, so vector search can match specific passages inside long judgments.
citations_out / citations_in: citations_out is extracted from the text during ingestion. citations_in is derived at index time by reversing all citations_out edges across the corpus.
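A minimal sketch of how id and citations_in could be derived. It assumes normalisation is a plain whitespace collapse, which is simpler than the real normalisation step, and the names case_id and reverse_edges are illustrative rather than taken from the codebase.

```python
import hashlib
from collections import defaultdict


def case_id(citation: str) -> str:
    """Stable ID: SHA-256 hex digest of the whitespace-collapsed citation."""
    canonical = " ".join(citation.split())  # assumed normalisation, illustrative only
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reverse_edges(citations_out: dict[str, list[str]]) -> dict[str, list[str]]:
    """Derive citations_in by flipping every src -> tgt edge."""
    citations_in: dict[str, list[str]] = defaultdict(list)
    for src, targets in citations_out.items():
        for tgt in targets:
            citations_in[tgt].append(src)
    return dict(citations_in)


donoghue = case_id("[1932] AC 562")
caparo = case_id("[1990] 2 AC 605")

# The second case cites the first, so the first case's citations_in lists it.
out_edges = {caparo: [donoghue], donoghue: []}
in_edges = reverse_edges(out_edges)
```

Because the ID depends only on the citation string, re-ingesting the same judgment produces the same key, which is what makes the skip-if-present check possible.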
From Raw Judgment to Indexed Record
The ingestion script (python ingest.py corpus/) walks a directory of judgment files and processes each through five stages. It's idempotent: re-running it skips cases whose SHA-256 ID is already in the index.
The first stage detects the input format by file extension and magic bytes. AustLII HTML is scraped with BeautifulSoup. Plain text files expect a YAML frontmatter block (--- at the top) with name, citation, court, and date. JSON files are loaded directly. Output is a normalised dict regardless of input format.
Date strings are parsed into Python date objects. Citation strings are converted to a canonical form ("(1932) AC 562" → "[1932] AC 562"). Court names are mapped to a controlled vocabulary. Unicode is NFC-normalised. The output is a validated Python dataclass.
The full text is scanned with jurisdiction-aware regex patterns covering Australian neutral citations, English series citations, and CLR reports. Each matched citation is normalised and looked up against the in-progress index; if the target case is already indexed, the edge is stored as a live in-corpus link.
The judgment is split by paragraph number. Each paragraph is encoded with all-MiniLM-L6-v2 into a 384-dim float32 vector. Paragraphs shorter than 30 words are merged with their neighbour before embedding to avoid noise from structural fragments like headings, dates, and signatures.
Case metadata and full text are written to Whoosh. Paragraph embeddings are appended to the NumPy store, a memory-mapped .npy file rebuilt atomically on each ingest run. The case record and citation edges are inserted into PostgreSQL in a single transaction: the write either completes fully or rolls back.
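The merge rule in the segmentation stage does not say which neighbour a short paragraph joins. The sketch below assumes it joins the preceding paragraph, or the following one when nothing precedes it. merge_short is a hypothetical helper, not the project's function.

```python
def merge_short(paragraphs: list[str], min_words: int = 30) -> list[str]:
    """Fold paragraphs under min_words into a neighbouring paragraph."""
    merged: list[str] = []
    carry = ""  # short leading text waiting for a following paragraph
    for para in paragraphs:
        text = f"{carry} {para}".strip()
        carry = ""
        if len(text.split()) >= min_words:
            merged.append(text)                   # long enough to stand alone
        elif merged:
            merged[-1] = f"{merged[-1]} {text}"   # append to the previous paragraph
        else:
            carry = text                          # nothing before it yet
    if carry:                                     # the whole input was short
        merged.append(carry)
    return merged


# A heading and a one-line order are absorbed into the substantive paragraph.
parts = [
    "JUDGMENT",
    "The appellant drank ginger beer bought by a friend.",
    "Appeal allowed.",
]
demo = merge_short(parts, min_words=5)
```

With min_words=5 the three fragments collapse into a single paragraph, so only one vector is embedded instead of three.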
Whoosh: Field-Weighted Keyword Search
The keyword index is built with Whoosh, a pure-Python full-text search library. Each case is a single document with five searchable fields. Field boosts give citation and name matches more weight than body-text matches, which reflects how legal practitioners actually search.
from whoosh.fields import Schema, TEXT, ID, STORED, KEYWORD, DATETIME
from whoosh.analysis import StemmingAnalyzer, KeywordAnalyzer

# Field boosts express the relative importance of a match in each field.
# A citation hit is worth twice a body hit; a name hit worth 1.5×.
case_schema = Schema(
    id=ID(stored=True, unique=True),
    name=TEXT(
        stored=True,
        analyzer=StemmingAnalyzer(),
        field_boost=1.5,             # "Donoghue" matches are highly relevant
    ),
    citation=TEXT(
        stored=True,
        analyzer=KeywordAnalyzer(),  # Never stem "[1932] AC 562"
        field_boost=2.0,             # Exact citation lookup is the most precise query
    ),
    court=TEXT(stored=True, field_boost=1.2),
    catchwords=TEXT(
        stored=True,
        analyzer=StemmingAnalyzer(),
        field_boost=1.3,             # Issue labels are denser signal than body text
    ),
    jurisdiction=KEYWORD(stored=True, commas=True),
    date=DATETIME(stored=True),
    full_text=TEXT(analyzer=StemmingAnalyzer(), field_boost=1.0),
)

# Boolean / phrase search is parsed with Whoosh's MultifieldParser:
#   negligence AND "duty of care"   → AND operator
#   Donoghue OR Stevenson           → OR operator
#   "neighbour principle"           → phrase search
#   court:Lords negligence          → field-scoped term

Field Weights
| Field | Analyzer | Boost | Rationale |
|---|---|---|---|
| citation | KeywordAnalyzer | 2.0× | Exact citation lookup is the most precise query; never stem |
| name | StemmingAnalyzer | 1.5× | Case name is a strong relevance signal |
| catchwords | StemmingAnalyzer | 1.3× | Curated issue labels are denser signal than body prose |
| court | StandardAnalyzer | 1.2× | Court name is a common filter anchor |
| full_text | StemmingAnalyzer | 1.0× | Baseline body-text scoring |
Whoosh's MultifieldParser searches all configured fields at once unless a term is scoped to a specific field (court:Lords, citation:"AC 562"). Boolean operators work natively: negligence AND "duty of care", Donoghue OR Stevenson, phrase search with quotes. BM25F (the variant with per-field boosts) is Whoosh's default scorer.
Paragraph-Level Semantic Embeddings
The vector index enables conceptual search: finding cases about a legal principle even when the exact keywords don't appear in the text. It runs at paragraph granularity rather than document granularity. That's the critical choice.
all-MiniLM-L6-v2 is a 22M-parameter SBERT model producing 384-dim embeddings. It is fast enough to run on CPU in under 100ms per query, and it was trained on diverse sentence pairs, so it gives useful similarity scores on legal prose out of the box. The float32 vectors for 1,500 paragraphs fit entirely in RAM at about 2.2MB.
A 50-page judgment's document embedding is the centroid of hundreds of paragraph embeddings, an average that's hard to distinguish from other long judgments on similar topics. A query about "the neighbour principle in negligence" should match the paragraph where Lord Atkin articulates it, not the judgment's global average.
At 50 cases × 30 paragraphs = 1,500 vectors of 384 floats, a full NumPy cosine scan takes under 2ms. FAISS HNSW needs a 50ms index build and gives no query speedup at this scale. ANN starts paying off around 50,000 vectors, roughly 1,600 cases. Below that, brute force wins.
A case's vector score is the highest cosine similarity across all its paragraphs. Averaging would penalise long judgments with many off-topic paragraphs. Max pooling rewards cases that contain at least one paragraph highly relevant to the query, which is the right criterion for retrieval.
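The difference between max pooling and averaging is easiest to see on toy numbers. The sketch below uses hand-picked 2-dim vectors in place of real 384-dim embeddings; cosine_sim, case_score and mean are illustrative helpers, not the project's code.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


def case_score(query: list[float], paragraphs: list[list[float]], pool=max) -> float:
    """Pool paragraph-level similarities into one case-level score."""
    return pool(cosine_sim(query, p) for p in paragraphs)


query = [1.0, 0.0]
# One on-point paragraph buried among three unrelated ones.
long_judgment = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
# Every paragraph is moderately related, none is on point.
short_judgment = [[0.6, 0.8], [0.6, 0.8]]

max_long = case_score(query, long_judgment)            # 1.0
max_short = case_score(query, short_judgment)          # about 0.6
avg_long = case_score(query, long_judgment, mean)      # 0.25
avg_short = case_score(query, short_judgment, mean)    # about 0.6
```

Under max pooling the long judgment ranks first because it contains the passage the query is after. Under averaging it drops below the short one, which is the penalty on long judgments described above.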
Combining BM25 and Cosine Similarity
Neither BM25 nor vector search is always better. BM25 wins on citation lookups and case name searches. Vector wins on conceptual queries. The two are blended with a normalised linear combination, with a confidence check that lets high-precision keyword queries skip the vector path entirely.
from scipy.spatial.distance import cosine  # cosine *distance*; similarity = 1 - distance

# ix, qp, model, paragraph_store, build_results and HIGH_CONFIDENCE_THRESHOLD
# are module-level objects created when the server starts.

def search(query: str, alpha: float = 0.7, top_k: int = 10):
    # ── 1. BM25 via Whoosh ────────────────────────────────────────────
    with ix.searcher() as searcher:
        hits = searcher.search(qp.parse(query), limit=top_k * 2)
        # Read the scores while the searcher is still open.
        bm25_raw = {h["id"]: h.score for h in hits}

    # ── 2. Query routing ──────────────────────────────────────────────
    # If BM25 is already confident (exact citation or case name match),
    # bypass vector search entirely; keyword wins.
    if bm25_raw and max(bm25_raw.values()) > HIGH_CONFIDENCE_THRESHOLD:
        return build_results(bm25_raw, mode="keyword")

    # ── 3. Vector search (paragraph-level) ────────────────────────────
    # Embed the query using the same model used at index time.
    q_vec = model.encode(query)  # shape: (384,)
    cosine_raw = {}
    for case_id, paragraphs in paragraph_store.items():
        if not paragraphs:
            continue
        # A case's score = the best paragraph match (max pooling).
        # Paragraph-level search finds relevant passages inside long judgments.
        scores = [1 - cosine(q_vec, p.embedding) for p in paragraphs]
        cosine_raw[case_id] = max(scores)

    # ── 4. Normalise both distributions to [0, 1] ──────────────────────
    def norm(d):
        if not d:  # e.g. a conceptual query with no keyword hits at all
            return {}
        lo, hi = min(d.values()), max(d.values())
        return {k: (v - lo) / (hi - lo + 1e-9) for k, v in d.items()}

    bm25_n = norm(bm25_raw)
    cosine_n = norm(cosine_raw)

    # ── 5. Weighted linear combination ────────────────────────────────
    # score = α · norm(BM25) + (1 − α) · norm(cosine)
    # Default α = 0.7 favours keyword precision;
    # lower α shifts weight toward semantic recall.
    all_ids = set(bm25_n) | set(cosine_n)
    hybrid = {
        cid: alpha * bm25_n.get(cid, 0) + (1 - alpha) * cosine_n.get(cid, 0)
        for cid in all_ids
    }
    ranked = sorted(hybrid, key=hybrid.get, reverse=True)[:top_k]
    return build_results({cid: hybrid[cid] for cid in ranked}, mode="hybrid")

α = 1.0: Pure BM25. Maximum precision, no semantic recall.
α = 0.7: Default. Strong keyword preference; vector covers conceptual queries that BM25 misses.
α = 0.5: Equal weight. Good when the corpus is small and both signals are equally reliable.
α = 0.0: Pure semantic. Maximum recall; can surface stylistically similar cases that aren't legally relevant.

End-to-End Query Flow
Each search request follows this path. The router is the key decision point: high-confidence keyword queries skip the vector path, keeping latency under 10ms for citation lookups while still offering semantic recall for open-ended queries.
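To make the two paths concrete, here is a toy version of the router and the blend, run on hand-written scores. The threshold value, the case IDs and the function names are all hypothetical. Real BM25 scores depend on the corpus, so the numbers only illustrate the mechanics.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; an empty dict stays empty."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo + 1e-9) for k, v in scores.items()}


def route_and_rank(bm25: dict[str, float], cosine: dict[str, float],
                   threshold: float, alpha: float = 0.7) -> tuple[str, list[str]]:
    """Return (mode, ranked case IDs) for one query."""
    if bm25 and max(bm25.values()) > threshold:
        # Confident keyword match: skip the vector path.
        return "keyword", sorted(bm25, key=bm25.get, reverse=True)
    b, c = min_max(bm25), min_max(cosine)
    hybrid = {cid: alpha * b.get(cid, 0.0) + (1 - alpha) * c.get(cid, 0.0)
              for cid in set(b) | set(c)}
    return "hybrid", sorted(hybrid, key=hybrid.get, reverse=True)


THRESHOLD = 12.0  # hypothetical; the right value depends on the corpus

# Citation lookup: one very strong BM25 hit, so the vector path never runs.
mode_a, ranked_a = route_and_rank({"case_a": 18.4, "case_b": 3.1}, {}, THRESHOLD)

# Conceptual query: weak keyword scores, so both signals are blended.
bm25 = {"case_a": 4.0, "case_b": 2.0}
cosine = {"case_a": 0.40, "case_b": 0.80, "case_c": 0.60}
mode_b, ranked_b = route_and_rank(bm25, cosine, THRESHOLD)             # alpha = 0.7
mode_c, ranked_c = route_and_rank(bm25, cosine, THRESHOLD, alpha=0.3)
```

At α = 0.7 the keyword leader case_a stays on top. At α = 0.3 the order flips to case_b, case_c, case_a, and case_c, which BM25 never returned, moves ahead of the keyword leader.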
Surfacing the Right Passage
A snippet shows where in the judgment the query matched. Taking the first 200 characters is wrong; the relevant passage might be on page 40. A sliding-window algorithm finds the densest region of query-term hits, then expands it to sentence boundaries.
import re

def generate_snippet(text: str, terms: list[str], window: int = 50) -> str:
    words = text.split()
    term_set = {t.lower() for t in terms}

    # Slide a window across the text, scoring each position by term density.
    # Higher density = more query terms in this region = more relevant excerpt.
    best_start, best_score = 0, 0
    for i in range(max(1, len(words) - window + 1)):  # +1 so the final window is scored
        score = sum(
            1 for w in words[i : i + window]
            if w.lower().strip(".,;:()") in term_set
        )
        if score > best_score:
            best_start, best_score = i, score

    # Expand the window slightly and snap to sentence boundaries.
    start = max(0, best_start - 8)
    snippet = " ".join(words[start : best_start + window + 8])

    # Trim partial sentences at each end to keep the snippet coherent.
    if start > 0:  # at the very start of the text the first sentence is already whole
        trimmed = re.sub(r'^[^.!?]*[.!?]\s*', '', snippet)  # trim leading partial sentence
        snippet = trimmed or snippet  # never trim the snippet down to nothing
    m = re.search(r'[.!?]', snippet[::-1])  # find last sentence end
    if m:
        snippet = snippet[: len(snippet) - m.start()]

    # Highlight matched terms with <mark> so the UI can render them.
    for term in terms:
        snippet = re.sub(
            rf'\b({re.escape(term)})\b', r'<mark>\1</mark>',
            snippet, flags=re.IGNORECASE
        )
    return f"…{snippet}…"

Parsing and Indexing Case References
Legal judgments are dense with citations to prior cases. Detecting and linking them turns the corpus into a navigable citation graph: "cited by", similar cases by shared citation, and direct in-corpus links.
import re

# Citations appear in multiple jurisdiction-specific formats.
# The regex union below covers the major Australian and UK patterns.
PATTERNS = [
    # Australian neutral citations → [2001] HCA 14
    r'\[(?P<year>\d{4})\]\s*(?P<court>HCA|FCA|FCAFC|NSWCA|VSCA|QCA)\s*(?P<num>\d+)',
    # English series citations → [1990] 2 AC 605 / (1932) AC 562
    r'[\[\(](?P<year>\d{4})[\]\)]\s*(?:\d+\s*)?'
    r'(?P<series>AC|QB|Ch|WLR|All ER|EWCA Civ|EWCA Crim|UKSC)\s*(?P<page>\d+)',
    # Commonwealth Law Reports → (2001) 207 CLR 562
    r'\((?P<year>\d{4})\)\s*(?P<vol>\d+)\s*CLR\s*(?P<page>\d+)',
]

def extract_citations(text: str) -> list[str]:
    found = []
    for pat in PATTERNS:
        for m in re.finditer(pat, text):
            # normalise() is the canonical-form helper from the normalisation stage.
            found.append(normalise(m.group(0)))
    return list(dict.fromkeys(found))  # deduplicate, preserve order

# Edges are stored in Postgres so that "cited by" is a single lookup on tgt_id:
# CREATE TABLE citation_edges (
#     src_id       TEXT REFERENCES cases(id),
#     tgt_citation TEXT,  -- raw citation string
#     tgt_id       TEXT,  -- NULL if target not in corpus
#     PRIMARY KEY (src_id, tgt_citation)
# );
# CREATE INDEX ON citation_edges (tgt_id);  -- the primary key alone does not cover reverse lookups

FastAPI Search Endpoint
Two endpoints: POST /search for ranked retrieval and GET /cases/{id} for full case detail. The score_breakdown field on each result shows how much of the score came from BM25 vs. the vector path, making the ranking auditable rather than opaque.
from fastapi import FastAPI
from pydantic import BaseModel
from typing import Literal

# engine, db, paginate and FullCase are defined elsewhere in the project.
app = FastAPI()

class SearchRequest(BaseModel):
    q: str
    mode: Literal["keyword", "semantic", "hybrid"] = "hybrid"
    courts: list[str] = []
    jurisdictions: list[str] = []
    year_from: int | None = None
    year_to: int | None = None
    page: int = 1
    per_page: int = 10

class CaseResult(BaseModel):
    id: str
    name: str
    citation: str
    court: str
    jurisdiction: str
    date: str
    snippet: str           # HTML with <mark> highlights
    score: float           # Normalised to [0, 1]
    score_breakdown: dict  # {"bm25": 0.82, "vector": 0.71, "final": 0.78}

class SearchResponse(BaseModel):
    total: int
    page: int
    results: list[CaseResult]
    facet_counts: dict     # {"court": {"House of Lords": 8, ...}, ...}
    mode_used: str         # "keyword" | "semantic" | "hybrid": which path fired

@app.post("/search", response_model=SearchResponse)
async def search(req: SearchRequest):
    results = engine.search(
        query=req.q,
        mode=req.mode,
        filters={"courts": req.courts, "jurisdictions": req.jurisdictions,
                 "year_range": (req.year_from, req.year_to)},
    )
    return paginate(results, req.page, req.per_page)

@app.get("/cases/{case_id}")
async def get_case(case_id: str) -> FullCase:
    # Returns full text + metadata + citation graph for one case.
    return db.get_case(case_id)

Proposed UI: Search Results
The results page is deliberately not a chatbot. No message thread, no "ask me anything" prompt, no assistant persona. A search bar, a facet panel, and a ranked list. Each result card shows metadata, a relevance score, and a snippet with highlighted query terms. It reads like a library catalogue, not a chat window.
…the manufacturer of an article of food, medicine, or the like, sold by him to a distributor in circumstances which prevent the distributor or the ultimate purchaser or consumer from discovering by inspection any defect, is under a legal duty to the ultimate purchaser or consumer to take reasonable care that the article is free from defect likely to cause injury to health. The decision of the majority… recognised a general principle of negligence…
View full judgment →
…in order to establish that a duty of care arises in a particular situation, it is not sufficient to ask simply whether there exists between the plaintiff and defendant a sufficiently close relationship of proximity or neighbourhood such that, in the reasonable contemplation of the former, carelessness on his part might be likely to cause damage to the latter. Three criteria must be satisfied: the damage must be foreseeable, there must be proximity, and it must be fair and reasonable to impose a duty…
View full judgment →
…it does not follow that because a function is being exercised in relation to individuals who may be affected by its exercise, there is a duty of care owed to those individuals. The law of negligence does not impose a duty to take care in the exercise of a statutory power merely because the power was conferred for the protection of a class of persons…
View full judgment →
Proposed UI: Case Detail
Three zones: a structured metadata block at the top (always visible), the full judgment with query terms highlighted in the passage the snippet generator found, and a citation sidebar showing outbound references and reverse links to in-corpus cases that cite this one.
My Lords, the sole question for determination in this case is legal: whether, as a matter of law in the circumstances alleged, the defender owed any duty to the pursuer to take care with regard to the security of the bottle of ginger beer. The law of negligence, whether you style it such or treat it as in other systems as a species of culpa, is no doubt based upon a general public sentiment of moral wrongdoing for which the offender must pay. But acts or omissions which any moral code would censure cannot in a practical world be treated so as to give a right to every person injured by them to demand relief…
The rule that you are to love your neighbour becomes in law, you must not injure your neighbour; and the lawyer's question, Who is my neighbour? receives a restricted reply. You must take reasonable care to avoid acts or omissions which you can reasonably foresee would be likely to injure your neighbour. Who, then, in law is my neighbour? The answer seems to be: persons who are so closely and directly affected by my act that I ought reasonably to have them in contemplation as being so affected when I am directing my mind to the acts or omissions which are called in question…
…The question is whether the manufacturer of an article of food, medicine, or the like, sold by him to a distributor in circumstances which prevent the distributor or the ultimate purchaser or consumer from discovering by inspection any defect, is under a legal duty to the ultimate purchaser or consumer to take reasonable care that the article is free from defect likely to cause injury to health. I do not think so ill of our jurisprudence as to suppose that its principles are so remote from the ordinary needs of civilised society…
Retrieval vs RAG: The Critical Distinction
RAG retrieves chunks of text and feeds them to a language model, which synthesises a new answer. This system does not do that. The retrieval step is the output. No language model touches the results. That's not a limitation; it's the point.
Legal research requires verifiability. A practitioner citing a case in a submission needs to know what the case actually says, not a model's paraphrase of it. This system returns the source and lets the practitioner read it.
| Aspect | This Casebase (pure retrieval) | Chatbot / RAG (retrieval + generation) |
|---|---|---|
| Query input | Keyword or conceptual query | Natural language question |
| Output | Ranked list of original case texts | LLM-generated prose answer |
| Citations | Exact and verified; the source is the result | May be synthesised or hallucinated |
| Determinism | Same query → same ranked list | Non-deterministic; temperature-dependent |
| Explainability | Score + matched fields visible | Black-box; hard to audit |
| Corpus fidelity | Each case is a discrete, intact record | Model may blend text across cases |
| Legal liability | System returns sources; user interprets | System implicitly interprets sources |
| Hallucination risk | None; only indexed text is returned | Inherent to all generative models |
Key Tradeoffs
Each decision below was a real fork. The choice made is documented with its rationale and the conditions under which the alternative would have been right.
Long judgments (50+ pages) produce document embeddings that average out into noise. A query about "duty of care" should match the paragraphs where Lord Atkin articulates that principle, not the judgment's global centroid.
30× more vectors, and case scoring requires max-pooling across paragraphs. Fine for a 5–50 case corpus.
Legal searches are often specific: citation lookups, case name searches, court-scoped queries. BM25 handles these with high precision. Vector search adds recall for conceptual queries where exact keywords don't appear.
The routing threshold is a tunable hyperparameter. Too low and conceptual queries get routed to keyword-only results, which degrades recall; too high and you spend vector compute on queries BM25 would get right anyway.
Elasticsearch requires a JVM, cluster config, and index management, none of which is justified for a 5–50 case corpus. Whoosh is pure Python, embedded, zero-config. The calculus only flips above roughly 10,000 documents.
Whoosh is single-writer; concurrent indexing needs a file lock. Not a concern at this scale, but it would be the first bottleneck to hit.
ANN indexes start to pay off at around 50,000 vectors. At 50 cases × 30 paragraphs = ~1,500 vectors of 384 floats, a full cosine scan takes under 2ms. FAISS adds 50ms of index build time and gives back nothing on queries at this scale.
Brute force scales linearly with the number of vectors. At ~5,000 paragraphs (roughly 170 cases at 30 paragraphs each) a query still takes under 10ms. Re-benchmark as the corpus grows past that, and switch to FAISS once scan latency starts to matter.
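The storage and scan figures follow from simple arithmetic, sketched below. It assumes dense float32 storage and counts one 384-dim dot product per stored vector; the function names are illustrative, and the numbers say nothing about wall-clock time on any particular machine.

```python
BYTES_PER_FLOAT32 = 4
DIMS = 384


def store_size_mib(n_vectors: int, dims: int = DIMS) -> float:
    """Size of a dense float32 matrix in MiB."""
    return n_vectors * dims * BYTES_PER_FLOAT32 / 2**20


def scan_multiply_adds(n_vectors: int, dims: int = DIMS) -> int:
    """Multiply-adds for one brute-force query (one dot product per vector)."""
    return n_vectors * dims


for n in (1_500, 5_000, 50_000):
    print(f"{n:>6} vectors: {store_size_mib(n):6.1f} MiB, "
          f"{scan_multiply_adds(n):>10,} multiply-adds per query")
```

At 1,500 vectors the store is about 2.2 MiB and a query costs 576,000 multiply-adds. Both grow in direct proportion to the vector count, which is why brute force stays cheap at this corpus size and gets steadily more expensive as the corpus grows.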
The number next to each facet (e.g. 'House of Lords (8)') is an exact count from a real faceted query over the matching cases, not an estimate. Approximate counts mislead legal researchers who need to know exactly how many matching cases exist.
Exact facet counts require one extra Whoosh query per facet dimension. At this scale that's negligible.
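A minimal sketch of exact facet counting. It uses a plain list of dicts as a stand-in for the full match set and does not use Whoosh's own faceting API. The match data is hypothetical, and the output shape mirrors the facet_counts field of the search response.

```python
from collections import Counter


def facet_counts(matches: list[dict], dimensions: tuple[str, ...]) -> dict[str, dict[str, int]]:
    """Exact counts per facet value over the full match set, not just one page."""
    return {dim: dict(Counter(m[dim] for m in matches)) for dim in dimensions}


# Hypothetical match set for one query; only the facet fields are shown.
matches = [
    {"court": "House of Lords", "jurisdiction": "United Kingdom"},
    {"court": "House of Lords", "jurisdiction": "United Kingdom"},
    {"court": "High Court of Australia", "jurisdiction": "Australia"},
]

counts = facet_counts(matches, ("court", "jurisdiction"))
```

The counts must be taken before pagination. Counting only the current page would undercount every facet as soon as the result set spans more than one page.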
Citation relationships are a graph problem: reverse lookups ("who cites this case?"), multi-hop traversal ("find cases two steps from Donoghue"). A SQL join handles this cleanly. Whoosh stored fields don't do graph queries.
Adds a Postgres dependency. SQLite with the same schema would work just as well at this corpus size.
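The sketch below tries the SQLite option using Python's built-in sqlite3 module and the same citation_edges layout. It shows a "cited by" lookup and a multi-hop traversal with a recursive query. The case IDs, the citation strings, the index name and the helper functions are all placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cases (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE citation_edges (
        src_id       TEXT REFERENCES cases(id),
        tgt_citation TEXT,
        tgt_id       TEXT,              -- NULL if target not in corpus
        PRIMARY KEY (src_id, tgt_citation)
    );
    CREATE INDEX idx_edges_tgt ON citation_edges (tgt_id);
""")
conn.executemany("INSERT INTO cases VALUES (?, ?)",
                 [("a", "Case A"), ("b", "Case B"), ("c", "Case C")])
# b cites a, c cites b, and a cites something outside the corpus.
conn.executemany("INSERT INTO citation_edges VALUES (?, ?, ?)", [
    ("b", "[cit-a]", "a"), ("c", "[cit-b]", "b"), ("a", "[cit-x]", None)])


def cited_by(case_id: str) -> list[str]:
    """In-corpus cases that cite case_id directly."""
    rows = conn.execute(
        "SELECT src_id FROM citation_edges WHERE tgt_id = ? ORDER BY src_id",
        (case_id,))
    return [r[0] for r in rows]


def within_hops(case_id: str, max_hops: int) -> list[str]:
    """Cases that reach case_id through at most max_hops citation edges."""
    rows = conn.execute("""
        WITH RECURSIVE reach(id, depth) AS (
            SELECT src_id, 1 FROM citation_edges WHERE tgt_id = ?
            UNION
            SELECT e.src_id, r.depth + 1
            FROM citation_edges e JOIN reach r ON e.tgt_id = r.id
            WHERE r.depth < ?
        )
        SELECT DISTINCT id FROM reach ORDER BY id
    """, (case_id, max_hops))
    return [r[0] for r in rows]
```

Here cited_by("a") returns ["b"] and within_hops("a", 2) returns ["b", "c"]. Edges whose target is outside the corpus have a NULL tgt_id, so they never join and drop out of both queries.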