Describe an AI/LLM solution you built end-to-end : S3

Describe an AI/LLM solution you built end-to-end : S3

I’ll describe an end-to-end LLM solution I architected and built for a financial services client: a regulatory filing QA system that ingested thousands of SEC 10-K/10-Q documents and allowed analysts to query them conversationally with cited sources.

1. Problem Definition & Scoping

  • Users: Equity research analysts who spent 60%+ of their time reading filings to extract specific numbers (revenue breakdowns, risk factors, MD&A trends).
  • Success criteria:
    • Answer factual questions with <5% hallucination (validated against human-annotated test set).
    • Return exact source citations (page, paragraph, and sentence).
    • Latency < 3 seconds per query.
    • Handle 10,000+ documents (each 50–200 pages).

2. Data Pipeline & Ingestion

  • Source: EDGAR API + S3 bulk downloads (HTML and XML).
  • Parsing: Used pypdf and beautifulsoup to extract text, but tables were messy → built a custom table-aware chunker using camelot + pandas to preserve row/column structure as markdown tables.
  • Chunking strategy:
    • Semantic overlap sliding window (256 tokens overlap, 1024 token chunks).
    • Metadata injection: Pre-pended each chunk with [DOC_ID: | YEAR: | SECTION: (Item 1, 1A, 7, etc.)] to improve retrieval filtering.
  • Storage: Chunks + embeddings stored in Pinecone (chose vector DB for hybrid search later).

3. Retrieval Architecture (Hybrid)

  • Embedding model: text-embedding-3-large (OpenAI) for dense retrieval.
  • Sparse retrieval: Built a BM25 index (using rank_bm25) on chunk text + metadata to catch exact phrase matches (e.g., “revenue from North America”).
  • Fusion: Used Reciprocal Rank Fusion (RRF) to combine top-20 from dense + top-20 from sparse → returned top-10 chunks.
  • Re-ranking: Added a cross-encoder (ms-marco-MiniLM-L-6-v2) to re-score the top-10 against the user query → final top-5 chunks passed to LLM.

4. LLM Generation Layer

  • Base model: GPT-4 (at the time) — but fine-tuned a Llama-3-70B variant on synthetic Q&A pairs for cost/performance tradeoff (hosted on Azure ML).
  • Prompt engineering:
    • System prompt with strict instructions: “If the answer is not in the provided chunks, say ‘I don’t have enough information.’ Do not infer. Cite every factual claim with {chunk_id}.”
    • Few-shot examples showing good vs bad citations.
  • Dynamic context window: Only passed top-5 chunks (~5000 tokens) + query. If query needed multi-year comparisons, we’d loop retrieval per year and aggregate.

5. Hallucination Mitigation & Guardrails

  • Citation verification: After LLM generated the answer, a second lightweight model (GPT-3.5-turbo) was prompted to extract all cited {chunk_id} references and verify that the quoted text actually existed in that chunk. If mismatch → reject and fallback to “I cannot confirm.”
  • Numerical consistency: For quantitative answers, we ran a deterministic regex parser on the retrieved chunks to extract raw numbers and compared against LLM’s stated numbers. If difference > 2% → flag for human review.
  • Toxicity/PII filter: Used Azure Content Safety + custom regex for SSNs/emails.

6. Evaluation & Testing

  • Gold dataset: 500 human-annotated Q&A pairs with source spans.
  • Metrics tracked:
    • Answer correctness (BERTScore + human eval on 20%).
    • Citation accuracy (% of citations that exactly match source text).
    • Hallucination rate (answers with unsupported claims) — started at 12%, dropped to 3.2% after cross-encoder + citation verifier.
    • Latency: p95 = 2.1s (retrieval 400ms, re-rank 300ms, generation 1.4s).

7. Deployment & Monitoring

  • Orchestration: Kubeflow pipelines for daily re-embedding of new filings.
  • API: FastAPI service with semantic caching (Redis + vector similarity) — if identical query came in within 1 hour, served cached response.
  • Monitoring:
    • Prometheus metrics for token usage, latency, and hallucination flag rate.
    • Daily drift detection on embedding distribution (retraining embeddings quarterly).
    • User feedback loop: analysts could thumbs-up/down → flagged low-rated Q&As were sent to a labeling queue for retraining the re-ranker.

8. Business Impact

  • Reduced average research time per filing from 45 min → 8 min.
  • Adoption: 90% of the 40-person team used it weekly.
  • Cost: ~$0.08 per query (vs. $0.50 for pure GPT-4) due to Llama-3-70B hosting + hybrid retrieval.

9. Key Lessons / Failures

  • Initial mistake: Used naive RAG without re-ranking → top chunks often missed niche terms. Fixed with BM25.
  • Over-chunking: 512-token chunks broke financial tables → moved to 1024 with overlap.
  • Hallucination persisted on “compare year-over-year” queries → solved by explicitly retrieving per year and asking LLM to format as a table with per-year sources.
  • Fine-tuning Llama was expensive and only gave +2% accuracy over GPT-4 with better prompting — in hindsight, I’d start with GPT-4 and switch to smaller model only after scale.

10. What I’d Do Differently Now

  • Use ColBERT (late interaction) instead of separate dense+sparse for better multi-vector retrieval.
  • Replace citation verifier with a structured output (JSON with answer + citations) via constrained decoding (guidance/outlines) to reduce verification overhead.
  • Add active learning to auto-generate training data from user corrections.

This system ran in production for 18 months, processing ~15k queries/month, with <4% hallucination rate and zero critical failures. It taught me that retrieval quality > model size and that post-hoc verification is non-negotiable for regulated domains.

🤞 Sign up for our newsletter!

We don’t spam! Read more in our privacy policy

Scroll to Top