Describe an AI/LLM solution you built end-to-end : S3

I’ll describe an end-to-end LLM solution I architected and built for a financial services client: a regulatory filing QA system that ingested thousands of SEC 10-K/10-Q documents and allowed analysts to query them conversationally with cited sources.

1. Problem Definition & Scoping

Users: Equity research analysts who spent 60%+ of their time reading filings to extract specific figures and disclosures (revenue breakdowns, risk factors, MD&A trends).
Success criteria:
- Answer factual questions with <5% hallucination (validated against a human-annotated test set).
- Return exact source citations (page, paragraph, and sentence).
- Latency < 3 seconds per query.
- Handle 10,000+ documents (each 50–200 pages).

2. Data Pipeline & Ingestion

Source: EDGAR API + S3 bulk downloads (HTML and XML).
Parsing: Used pypdf and beautifulsoup to extract text. Tables came out garbled, so I built a custom table-aware chunker with camelot + pandas that preserves row/column structure as markdown tables.
Chunking strategy:
- Sliding window with overlap: 1024-token chunks, with 256 tokens shared between consecutive chunks.
- Metadata injection: Prepended each chunk with [DOC_ID: | YEAR: | SECTION: (Item 1, 1A, 7, etc.)] to improve retrieval filtering.
Storage: Chunks + embeddings stored in Pinecone (chose a vector DB so hybrid search could be layered on later).
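
The chunking strategy above can be sketched roughly as follows. This is a minimal illustration, not the production chunker: it assumes the text is already tokenized into a list (a whitespace split stands in for the real tokenizer), and the function names `chunk_tokens` and `with_metadata` are hypothetical.

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], size: int = 1024, overlap: int = 256) -> List[List[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def with_metadata(chunk: List[str], meta: Dict[str, str]) -> str:
    """Prepend a [DOC_ID | YEAR | SECTION] header to the chunk text."""
    header = "[DOC_ID: {doc_id} | YEAR: {year} | SECTION: {section}]".format(**meta)
    return header + "\n" + " ".join(chunk)
```

With the defaults, each window starts 768 tokens after the previous one, so a table that straddles a boundary is likely to appear whole in at least one chunk.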

3. Retrieval Architecture (Hybrid)

Embedding model: text-embedding-3-large (OpenAI) for dense retrieval.
Sparse retrieval: Built a BM25 index (using rank_bm25) on chunk text + metadata to catch exact phrase matches (e.g., “revenue from North America”).
Fusion: Used Reciprocal Rank Fusion (RRF) to combine top-20 from dense + top-20 from sparse → returned top-10 chunks.
Re-ranking: Added a cross-encoder (ms-marco-MiniLM-L-6-v2) to re-score the top-10 against the user query → final top-5 chunks passed to LLM.
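
For reference, the fusion step can be sketched as below. This is a generic Reciprocal Rank Fusion implementation rather than the exact production code. The smoothing constant `k = 60` is the value commonly used with RRF and is an assumption here, as is the function name.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Fuse several ranked lists of chunk ids into one list of at most top_n ids."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]
```

A call would look like `reciprocal_rank_fusion([dense_top20, sparse_top20])`. Because RRF uses only ranks, the BM25 and cosine scores never need to be normalized against each other.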

4. LLM Generation Layer

Base model: GPT-4 (at the time). For the cost/performance tradeoff, I also fine-tuned a Llama-3-70B variant on synthetic Q&A pairs (hosted on Azure ML).
Prompt engineering:
- System prompt with strict instructions: “If the answer is not in the provided chunks, say ‘I don’t have enough information.’ Do not infer. Cite every factual claim with {chunk_id}.”
- Few-shot examples showing good vs bad citations.
Dynamic context window: Only passed the top-5 chunks (~5000 tokens) + the query. If a query needed multi-year comparisons, we’d loop retrieval per year and aggregate the results.
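
A rough sketch of how the context could be assembled, including the per-year loop for comparison queries. The `retrieve` callable, the chunk dictionary keys (`chunk_id`, `text`), and both function names are assumptions made for illustration. The system prompt is the one quoted above.

```python
from typing import Callable, Dict, List

Chunk = Dict[str, str]

SYSTEM_PROMPT = (
    "If the answer is not in the provided chunks, say "
    "'I don't have enough information.' Do not infer. "
    "Cite every factual claim with {chunk_id}."
)


def build_context(chunks: List[Chunk]) -> str:
    """Render chunks as '{chunk_id}' followed by the chunk text, blank-line separated."""
    return "\n\n".join("{" + c["chunk_id"] + "}\n" + c["text"] for c in chunks)


def retrieve_per_year(
    query: str,
    years: List[int],
    retrieve: Callable[[str, int], List[Chunk]],
    per_year: int = 5,
) -> List[Chunk]:
    """Run retrieval once per fiscal year and concatenate the top results."""
    context: List[Chunk] = []
    for year in years:
        context.extend(retrieve(query, year)[:per_year])
    return context
```

Retrieving per year keeps each year's evidence in the context, so the model does not have to fill in a missing year from memory.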

5. Hallucination Mitigation & Guardrails

Citation verification: After the LLM generated the answer, a second lightweight model (GPT-3.5-turbo) was prompted to extract all cited {chunk_id} references and verify that the quoted text actually existed in that chunk. On a mismatch, the answer was rejected and replaced with the fallback “I cannot confirm.”
Numerical consistency: For quantitative answers, we ran a deterministic regex parser on the retrieved chunks to extract raw numbers and compared them against the numbers the LLM stated. If the difference exceeded 2%, the answer was flagged for human review.
Toxicity/PII filter: Used Azure Content Safety + custom regex for SSNs/emails.
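
The numerical consistency check can be sketched as follows. This is a simplified version under stated assumptions: the regex only handles plain and comma-grouped decimals, it ignores signs and units (millions vs. billions), and the function names are hypothetical. The 2% tolerance comes from the description above.

```python
import re
from typing import List

# Matches unsigned numbers such as 42, 1,234 or 1,234.5.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> List[float]:
    """Pull every number out of the text, dropping thousands separators."""
    return [float(match.replace(",", "")) for match in NUMBER_RE.findall(text)]


def unsupported_numbers(answer: str, chunks: List[str], tolerance: float = 0.02) -> List[float]:
    """Return numbers in the answer that are not within tolerance of any source number."""
    source = [n for chunk in chunks for n in extract_numbers(chunk)]
    flagged: List[float] = []
    for value in extract_numbers(answer):
        supported = any(abs(value - s) <= tolerance * abs(s) for s in source)
        if not supported:
            flagged.append(value)
    return flagged
```

A non-empty result would route the answer to human review instead of returning it to the analyst.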

6. Evaluation & Testing

Gold dataset: 500 human-annotated Q&A pairs with source spans.
Metrics tracked:
- Answer correctness (BERTScore + human eval on 20%).
- Citation accuracy (% of citations that exactly match source text).
- Hallucination rate (answers with unsupported claims) — started at 12%, dropped to 3.2% after cross-encoder + citation verifier.
- Latency: p95 = 2.1s (retrieval 400ms, re-rank 300ms, generation 1.4s).
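
As an illustration of the citation accuracy metric, here is a minimal sketch that counts a citation as correct only when its quoted text appears verbatim in the cited chunk. The `Citation` type and the function name are assumptions, not the evaluation harness actually used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Citation:
    chunk_id: str  # id of the chunk the answer points to
    quote: str     # text the answer claims to quote from that chunk


def citation_accuracy(citations: List[Citation], chunks: Dict[str, str]) -> float:
    """Fraction of citations whose quote appears verbatim in the cited chunk."""
    if not citations:
        return 0.0
    hits = sum(
        1
        for c in citations
        if c.quote and c.quote in chunks.get(c.chunk_id, "")
    )
    return hits / len(citations)
```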

7. Deployment & Monitoring

Orchestration: Kubeflow pipelines for daily re-embedding of new filings.
API: FastAPI service with semantic caching (Redis + vector similarity): if an identical or near-identical query came in within 1 hour, the cached response was served.
Monitoring:
- Prometheus metrics for token usage, latency, and hallucination flag rate.
- Daily drift detection on embedding distribution (retraining embeddings quarterly).
- User feedback loop: analysts could thumbs-up/down → flagged low-rated Q&As were sent to a labeling queue for retraining the re-ranker.
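
The semantic cache in the API layer can be sketched as below. This is an in-memory stand-in rather than the Redis-backed version: the class name, the 0.95 similarity threshold, and the linear scan are assumptions. The one-hour TTL matches the description above.

```python
import math
import time
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95, ttl_seconds: float = 3600.0) -> None:
        self.threshold = threshold
        self.ttl = ttl_seconds
        # Each entry is (query embedding, cached response, insertion time).
        self._entries: List[Tuple[List[float], str, float]] = []

    def put(self, embedding: Sequence[float], response: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._entries.append((list(embedding), response, now))

    def get(self, embedding: Sequence[float], now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        # Drop expired entries, then look for the closest remaining query.
        self._entries = [e for e in self._entries if now - e[2] < self.ttl]
        best = max(self._entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None
```

The threshold is the main tuning knob: too low and analysts get answers to a different question, too high and the cache only ever hits on exact repeats.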

8. Business Impact

Reduced average research time per filing from 45 min to 8 min.
Adoption: 90% of the 40-person team used it weekly.
Cost: ~$0.08 per query (vs. $0.50 for pure GPT-4) due to Llama-3-70B hosting + hybrid retrieval.

9. Key Lessons / Failures

Initial mistake: Used naive dense-only RAG without sparse retrieval or re-ranking → top chunks often missed niche terms. Fixed by adding BM25 and, later, the cross-encoder re-ranker.
Over-chunking: 512-token chunks broke financial tables → moved to 1024 with overlap.
Hallucination persisted on “compare year-over-year” queries → solved by explicitly retrieving per year and asking LLM to format as a table with per-year sources.
Fine-tuning Llama was expensive and gave only +2% accuracy over GPT-4 with better prompting. In hindsight, I’d start with GPT-4 and switch to a smaller model only once query volume justified the cost.

10. What I’d Do Differently Now

Use ColBERT (late interaction) instead of separate dense+sparse for better multi-vector retrieval.
Replace citation verifier with a structured output (JSON with answer + citations) via constrained decoding (guidance/outlines) to reduce verification overhead.
Add active learning to auto-generate training data from user corrections.
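
To make the structured-output idea concrete, here is a sketch of a deterministic validator for a JSON answer of the form `{"answer": ..., "citations": [{"chunk_id": ..., "quote": ...}]}`. The schema and the function name are assumptions. In practice a constrained-decoding library would enforce the JSON shape, and a check like this would only confirm that each quote exists in its chunk.

```python
import json
from typing import Dict, List


def validate_structured_answer(raw: str, chunks: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means every citation checks out."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return ["missing 'answer' string"]
    citations = payload.get("citations")
    if not isinstance(citations, list):
        return ["missing 'citations' list"]

    problems: List[str] = []
    for i, cit in enumerate(citations):
        if not isinstance(cit, dict):
            problems.append(f"citation {i}: not an object")
            continue
        chunk_id, quote = cit.get("chunk_id"), cit.get("quote")
        if not isinstance(chunk_id, str) or chunk_id not in chunks:
            problems.append(f"citation {i}: unknown chunk_id {chunk_id!r}")
        elif not isinstance(quote, str) or not quote or quote not in chunks[chunk_id]:
            problems.append(f"citation {i}: quote not found in {chunk_id}")
    return problems
```

Compared with the GPT-3.5-turbo verifier, this removes one model call per query, at the cost of requiring verbatim quotes.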

This system ran in production for 18 months, processing ~15k queries/month, with <4% hallucination rate and zero critical failures. It taught me that retrieval quality > model size and that post-hoc verification is non-negotiable for regulated domains.