I’ll describe an end-to-end LLM solution I architected and built for a financial services client: a regulatory filing QA system that ingested thousands of SEC 10-K/10-Q documents and allowed analysts to query them conversationally with cited sources.
1. Problem Definition & Scoping
- Users: Equity research analysts who spent 60%+ of their time reading filings to extract specific numbers (revenue breakdowns, risk factors, MD&A trends).
- Success criteria:
- Answer factual questions with <5% hallucination (validated against human-annotated test set).
- Return exact source citations (page, paragraph, and sentence).
- Latency < 3 seconds per query.
- Handle 10,000+ documents (each 50–200 pages).
2. Data Pipeline & Ingestion
- Source: EDGAR API + S3 bulk downloads (HTML and XML).
- Parsing: Used
pypdfandbeautifulsoupto extract text, but tables were messy → built a custom table-aware chunker usingcamelot+pandasto preserve row/column structure as markdown tables. - Chunking strategy:
- Semantic overlap sliding window (256 tokens overlap, 1024 token chunks).
- Metadata injection: Pre-pended each chunk with
[DOC_ID: | YEAR: | SECTION: (Item 1, 1A, 7, etc.)]to improve retrieval filtering.
- Storage: Chunks + embeddings stored in Pinecone (chose vector DB for hybrid search later).
3. Retrieval Architecture (Hybrid)
- Embedding model:
text-embedding-3-large(OpenAI) for dense retrieval. - Sparse retrieval: Built a BM25 index (using
rank_bm25) on chunk text + metadata to catch exact phrase matches (e.g., “revenue from North America”). - Fusion: Used Reciprocal Rank Fusion (RRF) to combine top-20 from dense + top-20 from sparse → returned top-10 chunks.
- Re-ranking: Added a cross-encoder (
ms-marco-MiniLM-L-6-v2) to re-score the top-10 against the user query → final top-5 chunks passed to LLM.
4. LLM Generation Layer
- Base model: GPT-4 (at the time) — but fine-tuned a Llama-3-70B variant on synthetic Q&A pairs for cost/performance tradeoff (hosted on Azure ML).
- Prompt engineering:
- System prompt with strict instructions: “If the answer is not in the provided chunks, say ‘I don’t have enough information.’ Do not infer. Cite every factual claim with {chunk_id}.”
- Few-shot examples showing good vs bad citations.
- Dynamic context window: Only passed top-5 chunks (~5000 tokens) + query. If query needed multi-year comparisons, we’d loop retrieval per year and aggregate.
5. Hallucination Mitigation & Guardrails
- Citation verification: After LLM generated the answer, a second lightweight model (GPT-3.5-turbo) was prompted to extract all cited
{chunk_id}references and verify that the quoted text actually existed in that chunk. If mismatch → reject and fallback to “I cannot confirm.” - Numerical consistency: For quantitative answers, we ran a deterministic regex parser on the retrieved chunks to extract raw numbers and compared against LLM’s stated numbers. If difference > 2% → flag for human review.
- Toxicity/PII filter: Used Azure Content Safety + custom regex for SSNs/emails.
6. Evaluation & Testing
- Gold dataset: 500 human-annotated Q&A pairs with source spans.
- Metrics tracked:
- Answer correctness (BERTScore + human eval on 20%).
- Citation accuracy (% of citations that exactly match source text).
- Hallucination rate (answers with unsupported claims) — started at 12%, dropped to 3.2% after cross-encoder + citation verifier.
- Latency: p95 = 2.1s (retrieval 400ms, re-rank 300ms, generation 1.4s).
7. Deployment & Monitoring
- Orchestration: Kubeflow pipelines for daily re-embedding of new filings.
- API: FastAPI service with semantic caching (Redis + vector similarity) — if identical query came in within 1 hour, served cached response.
- Monitoring:
- Prometheus metrics for token usage, latency, and hallucination flag rate.
- Daily drift detection on embedding distribution (retraining embeddings quarterly).
- User feedback loop: analysts could thumbs-up/down → flagged low-rated Q&As were sent to a labeling queue for retraining the re-ranker.
8. Business Impact
- Reduced average research time per filing from 45 min → 8 min.
- Adoption: 90% of the 40-person team used it weekly.
- Cost: ~$0.08 per query (vs. $0.50 for pure GPT-4) due to Llama-3-70B hosting + hybrid retrieval.
9. Key Lessons / Failures
- Initial mistake: Used naive RAG without re-ranking → top chunks often missed niche terms. Fixed with BM25.
- Over-chunking: 512-token chunks broke financial tables → moved to 1024 with overlap.
- Hallucination persisted on “compare year-over-year” queries → solved by explicitly retrieving per year and asking LLM to format as a table with per-year sources.
- Fine-tuning Llama was expensive and only gave +2% accuracy over GPT-4 with better prompting — in hindsight, I’d start with GPT-4 and switch to smaller model only after scale.
10. What I’d Do Differently Now
- Use ColBERT (late interaction) instead of separate dense+sparse for better multi-vector retrieval.
- Replace citation verifier with a structured output (JSON with
answer+citations) via constrained decoding (guidance/outlines) to reduce verification overhead. - Add active learning to auto-generate training data from user corrections.
This system ran in production for 18 months, processing ~15k queries/month, with <4% hallucination rate and zero critical failures. It taught me that retrieval quality > model size and that post-hoc verification is non-negotiable for regulated domains.


