GenAI/LLM Development Experience – Complete Interview Questions & Answers (2026 Edition)

GenAI/LLM Development Experience

This is one of the most frequently asked interview topics for AI Architects, AI Engineers, Applied AI Engineers, GenAI Developers, and Solution Architects.

1. Tell me about your GenAI/LLM development experience.

Sample Answer

I have hands-on experience designing and developing enterprise Generative AI applications using Large Language Models. My work includes building RAG systems, AI assistants, document intelligence platforms, enterprise search, workflow automation, prompt engineering, and multi-agent AI systems.

I have worked with models including GPT-4/5, Claude, Llama, Mistral, Amazon Nova, and Amazon Titan through Amazon Bedrock and OpenAI APIs.

My responsibilities include:

  • Requirement gathering
  • Architecture design
  • Prompt engineering
  • RAG implementation
  • Vector database integration
  • Function calling
  • Agent orchestration
  • Evaluation framework
  • Guardrails
  • Deployment
  • Monitoring
  • Cost optimization
  • Responsible AI

I primarily develop solutions using Python, LangChain, LangGraph, FastAPI, Amazon Bedrock, Lambda, ECS/EKS, DynamoDB, S3, OpenSearch, Pinecone, and GitHub Actions.

2. What GenAI projects have you built?

Example Projects:

Enterprise Knowledge Assistant

Features

  • RAG
  • PDF ingestion
  • SharePoint documents
  • SQL database
  • Citation support
  • Chat interface

Technology

  • GPT-4
  • Bedrock
  • LangChain
  • Pinecone
  • FastAPI

Healthcare Assistant

Features

  • Medical guideline search
  • Clinical document summarization
  • ICD code lookup
  • Drug interaction explanation

Customer Support Chatbot

Features

  • Ticket summarization
  • Response generation
  • Knowledge search
  • CRM integration

Contract Review Assistant

Features

  • Clause extraction
  • Risk detection
  • Obligation identification
  • Compliance checking

Financial Document Analyzer

Features

  • SEC filings
  • Earnings reports
  • Risk summarization
  • KPI extraction

3. Explain your GenAI application architecture.

User → Frontend (React) → FastAPI → Prompt Builder → Retriever → Vector Database → LLM → Guardrails → Response Formatter → User

4. Which LLMs have you used?

Possible Answer

I have worked with

  • GPT-4
  • GPT-4 Turbo
  • GPT-5
  • Claude
  • Llama 2
  • Llama 3
  • Mistral
  • Mixtral
  • Amazon Titan
  • Amazon Nova
  • Cohere Command
  • Jurassic
  • Gemini

5. Which platforms have you used?

Answer

  • OpenAI
  • Amazon Bedrock
  • Azure OpenAI
  • Google Vertex AI
  • Hugging Face
  • Ollama
  • Together AI

6. Why use Amazon Bedrock instead of OpenAI API?

Answer

Benefits

  • Managed service
  • Multiple foundation models
  • IAM integration
  • VPC support
  • Private networking
  • Guardrails
  • Knowledge Bases
  • Enterprise security
  • No infrastructure management

7. Explain a complete GenAI workflow.

User Question → API Gateway → FastAPI → Authentication → Prompt Builder → Retriever → Embedding Model → Vector Search → Context → LLM → Output Parser → Guardrails → Frontend

8. Explain Prompt Engineering.

Prompt engineering means designing prompts that consistently guide an LLM to produce accurate, relevant, and safe outputs.

Techniques

  • Zero-shot prompting
  • One-shot prompting
  • Few-shot prompting
  • Chain-of-thought (used carefully and generally not exposed in production)
  • Role prompting
  • Persona prompting
  • Delimiter-based prompting
  • XML/JSON structured prompts
  • Output schema enforcement
  • Self-consistency
  • ReAct prompting

9. How do you reduce hallucinations?

Answer

Multiple approaches:

  • RAG
  • Grounding
  • Better prompts
  • Temperature reduction
  • Output validation
  • Citation generation
  • Knowledge base lookup
  • Human review
  • Guardrails
  • Confidence scoring

10. What is RAG?

Answer

Retrieval-Augmented Generation combines information retrieval with language model generation.

Flow

Question → Embedding → Vector Search → Relevant Documents → Prompt → LLM → Answer

Benefits

  • Current knowledge
  • Lower hallucination
  • Domain-specific answers
  • No model retraining

11. Which embedding models have you used?

Examples

  • OpenAI text-embedding-3-large
  • Titan Embeddings
  • BGE
  • E5
  • Instructor XL
  • MiniLM

12. Which vector databases have you used?

  • Pinecone
  • OpenSearch
  • FAISS
  • Chroma
  • Weaviate
  • Milvus
  • Qdrant
  • pgvector

13. Explain chunking strategies.

Methods

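  • Fixed-size chunking: split the text every N characters (for example, 1000 characters).
  • Sliding window: consecutive chunks overlap, so the end of Chunk 1 is repeated at the start of Chunk 2, and the end of Chunk 2 at the start of Chunk 3.
  • Semantic chunking
  • Recursive chunking
  • Document-aware chunking
  • Section-based chunking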
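
Fixed-size chunking with a sliding-window overlap can be sketched as below. The 1000-character size comes from the example above; the 200-character overlap and the function name `chunk_text` are assumptions for illustration.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character chunks; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

# 2500 characters of dummy text.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1000, overlap=200)
print([len(c) for c in chunks])
```

Character-based splitting is the simplest option. Token-based or structure-aware splitting usually needs a tokenizer or a document parser, which this sketch leaves out.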

14. What is semantic search?

  • Traditional search: keyword matching.
  • Semantic search: meaning matching. The query and the documents are converted to embeddings, and results are ranked by vector similarity.

15. Explain function calling.

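The LLM decides whether to call an external tool and, if so, emits a structured request (tool name and arguments). The application executes the call and passes the result back to the model.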

Example

User: “Book a meeting.” → LLM → Calendar API → Meeting booked → Response
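
The application side of this exchange can be sketched as a small dispatcher. This assumes the model emits a JSON object with `tool` and `arguments` keys. That shape, the `book_meeting` function, and the `TOOLS` registry are all hypothetical. Real provider APIs define their own tool-call formats.

```python
import json

def book_meeting(title, start, duration_minutes=30):
    """Hypothetical calendar tool; a real one would call a calendar API."""
    return {"status": "booked", "title": title, "start": start,
            "duration_minutes": duration_minutes}

# Registry of tools the application is willing to execute.
TOOLS = {"book_meeting": book_meeting}

def dispatch(model_output):
    """Run the tool requested in the model's JSON output, or return plain text as-is."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return {"type": "text", "content": model_output}
    if not isinstance(call, dict) or call.get("tool") not in TOOLS:
        return {"type": "error", "content": "unknown or malformed tool call"}
    result = TOOLS[call["tool"]](**call.get("arguments", {}))
    return {"type": "tool_result", "content": result}

# Stand-in for what a model might emit after the user says "Book a meeting."
model_output = json.dumps({
    "tool": "book_meeting",
    "arguments": {"title": "Project sync", "start": "2026-01-15T10:00:00Z"},
})
outcome = dispatch(model_output)
print(outcome)
```

Only tools in the registry are ever executed, so the model cannot invoke anything the application has not explicitly exposed.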

16. What tools have you integrated?

Examples

  • SQL databases
  • REST APIs
  • Salesforce
  • Jira
  • ServiceNow
  • SAP
  • SharePoint
  • Outlook
  • Gmail
  • Slack
  • Microsoft Teams

17. Explain AI Agents.

Agents

  • Reason
  • Plan
  • Decide
  • Use tools
  • Execute tasks
  • Iterate
  • Return results
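
The loop behind these steps can be sketched as follows. A scripted function stands in for the LLM here, so `scripted_policy`, the two tools, and the action format are assumptions for illustration rather than any framework's API.

```python
def run_agent(policy, tools, goal, max_steps=5):
    """Minimal agent loop: ask the policy for an action, run the tool, feed back the observation."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        action = policy(history)
        if action["type"] == "finish":
            return action["answer"], history
        tool = tools.get(action["tool"])
        try:
            if tool is None:
                observation = f"error: unknown tool {action['tool']}"
            else:
                observation = tool(**action["args"])
        except Exception as exc:  # surface tool failures to the policy instead of crashing
            observation = f"error: {exc}"
        history.append((action["tool"], observation))
    return None, history  # step budget exhausted

def scripted_policy(history):
    """Stand-in for an LLM: look up a price, multiply it, then finish."""
    if len(history) == 1:
        return {"type": "tool", "tool": "get_price", "args": {"item": "widget"}}
    if len(history) == 2:
        return {"type": "tool", "tool": "multiply", "args": {"a": history[-1][1], "b": 3}}
    return {"type": "finish", "answer": f"Three widgets cost {history[-1][1]}"}

tools = {"get_price": lambda item: 4, "multiply": lambda a, b: a * b}
answer, history = run_agent(scripted_policy, tools, "Price of three widgets?")
print(answer)
```

The `max_steps` cap is the simplest guard against an agent that loops without making progress.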

18. Difference between chatbot and AI Agent?

Chatbot            | AI Agent
Answers            | Acts
Single response    | Multi-step
No planning        | Planning
No tools           | Tool usage
Stateless          | Stateful
Limited reasoning  | Autonomous workflows

19. Have you built multi-agent systems?

Example

  • Research Agent
  • Planning Agent
  • Coding Agent
  • Review Agent
  • Reporting Agent

20. What frameworks have you used?

  • LangChain
  • LangGraph
  • AutoGen
  • CrewAI
  • Semantic Kernel
  • LlamaIndex
  • Haystack

21. Explain LangChain.

Features

  • Chains
  • Agents
  • Memory
  • Retrieval
  • Tools
  • Prompt templates
  • Output parsers

22. Explain LangGraph.

Advantages

  • Stateful workflows
  • Cyclic execution
  • Human approval
  • Checkpointing
  • Multi-agent orchestration
  • Durable execution

23. What is MCP?

Model Context Protocol (MCP) is an open protocol that standardizes how LLMs connect to external tools, data sources, and services. Instead of writing custom integrations for every application, MCP provides a consistent interface for discovering and invoking tools.

Benefits:

  • Standardized tool integration
  • Easier interoperability
  • Reusable connectors
  • Improved security and governance

24. How do you secure GenAI applications?

Security measures:

  • Authentication and authorization
  • Role-based access control (RBAC)
  • IAM policies
  • Encryption in transit and at rest
  • Secrets management
  • Prompt injection protection
  • Input/output validation
  • Data masking
  • Audit logging
  • Content filtering
  • Network isolation (VPC/private endpoints)

25. How do you evaluate LLM performance?

Common evaluation metrics:

  • Answer correctness
  • Groundedness
  • Faithfulness
  • Relevance
  • Context precision and recall
  • Hallucination rate
  • Toxicity
  • Latency
  • Cost per request
  • User satisfaction
  • Task completion rate

26. How do you optimize GenAI costs?

Strategies:

  • Select the smallest suitable model
  • Cache frequent responses
  • Optimize prompts
  • Limit output tokens
  • Use RAG instead of larger context windows
  • Batch embedding jobs
  • Stream responses
  • Monitor token usage
  • Route requests to different models based on complexity
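
Two of these strategies, response caching and complexity-based routing, can be sketched together. The model names, prices, and the word-count heuristic below are placeholders chosen for illustration, not real pricing or a recommended classifier.

```python
import hashlib

# Hypothetical model tiers and per-1K-token prices, for illustration only.
MODELS = {
    "small": {"price_per_1k_tokens": 0.0005},
    "large": {"price_per_1k_tokens": 0.01},
}

def choose_model(prompt, word_threshold=50):
    """Naive complexity heuristic: long or multi-question prompts go to the large model."""
    is_complex = len(prompt.split()) > word_threshold or prompt.count("?") > 1
    return "large" if is_complex else "small"

def estimate_cost(model, total_tokens):
    """Estimated spend for one request, given the total token count."""
    return MODELS[model]["price_per_1k_tokens"] * total_tokens / 1000

_cache = {}

def cached_answer(prompt, generate):
    """Return a cached response for repeated prompts; call `generate` only on a miss."""
    key = hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = generate(prompt)
    return _cache[key]

print(choose_model("What is our refund policy?"))
print(estimate_cost("large", 2000))
```

An exact-match cache only helps with repeated prompts. Matching paraphrased questions would need a semantic cache keyed on embeddings, which is not shown here.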

27. How do you monitor GenAI applications?

Monitor:

  • Request volume
  • Token usage
  • Latency
  • Error rates
  • Model availability
  • Hallucination trends
  • User feedback
  • Prompt failures
  • Retrieval quality
  • Infrastructure health

Tools include cloud monitoring platforms, application observability tools, and LLM-specific tracing frameworks.

28. What are common challenges in production GenAI systems?

  • Hallucinations
  • Prompt injection
  • Retrieval failures
  • Context window limitations
  • High latency
  • Cost overruns
  • Data privacy concerns
  • Model version changes
  • Evaluation complexity
  • Scaling concurrent requests

29. Describe an end-to-end GenAI project.

Sample Answer:

“I built an enterprise knowledge assistant that allows employees to query internal documentation using natural language. Documents from SharePoint, PDFs, and S3 were ingested, cleaned, chunked, embedded, and stored in a vector database. A FastAPI backend handled authentication, retrieval, prompt construction, and LLM inference through Amazon Bedrock. We implemented citation-based responses, guardrails, logging, and monitoring. The solution reduced support ticket volume by approximately 40% and reduced average document search time from several minutes to a few seconds.”

30. What interview follow-up questions should you expect?

Be prepared to answer:

  • Why did you choose RAG over fine-tuning?
  • How do you evaluate retrieval quality?
  • How do you handle conflicting documents?
  • How do you implement hybrid search?
  • How do you optimize chunk size?
  • How do you prevent prompt injection attacks?
  • How do you design an AI agent architecture?
  • How do you manage conversation memory?
  • How do you deploy LLM applications on AWS?
  • How do you build multi-tenant GenAI applications?
  • How do you implement Responsible AI and governance?
  • How do you support human-in-the-loop workflows?
  • How do you version prompts and models?
  • How do you perform A/B testing across LLMs?
  • How do you select between GPT, Claude, Llama, and other models?
  • How do you debug poor LLM responses?

Interview Tips for Senior AI Architect Roles

For senior-level interviews, emphasize:

  • End-to-end architecture design rather than only prompt writing.
  • Business outcomes (cost savings, productivity gains, user adoption).
  • Production readiness, including CI/CD, monitoring, observability, and rollback strategies.
  • Security, governance, and compliance (especially in regulated industries such as healthcare and finance).
  • Trade-off analysis between model choice, latency, accuracy, and cost.
  • Experience with agentic AI, RAG, tool calling, MCP, evaluation frameworks, and Responsible AI.

These topics are the core areas interviewers typically assess when evaluating candidates for Senior AI Engineer, AI Architect, Principal AI Engineer, Applied AI Engineer, and GenAI Solution Architect roles.

GenAI/LLM development experience interview questions typically assess fundamentals, practical implementation (RAG, fine-tuning, agents), production/deployment, evaluation, safety, and real-world problem-solving. Interviewers prioritize hands-on experience over theory—be ready to discuss projects, trade-offs, failures, and metrics.

I’ve organized 50+ common questions (drawn from frequent 2025–2026 interview patterns) into categories with concise, interview-ready answers. Tailor responses to your experience.

1. LLM Fundamentals & Architecture

Q: What is the difference between a base model and an instruction-tuned model? A base model is trained on next-token prediction over large corpora for text completion. An instruction-tuned model undergoes further supervised fine-tuning (SFT) on instruction-response pairs, often with RLHF/RLAIF or DPO, to follow user intent, be helpful, and safe. Use instruction-tuned models for most apps unless doing heavy custom fine-tuning.

Q: Explain the (scaled dot-product) attention mechanism in Transformers and why scaling matters. Self-attention lets each token attend to all others via Query-Key dot products, producing weights for a weighted sum of Value vectors: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The sqrt(d_k) scaling prevents large dot products (as d_k grows) from saturating softmax, preserving gradients and training stability. This parallel processing of long-range dependencies replaced RNNs. Multi-head attention captures different relations.
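
A minimal sketch of scaled dot-product attention in plain Python, assuming a single head, no masking, and small list-of-lists matrices. Real implementations use tensor libraries and batched matrix multiplies.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors (no masking, single head)."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Query-Key dot products, scaled by sqrt(d_k) before the softmax.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the Value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return output

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
print(out)
```

Each output row is a weighted average of the rows of V, pulled toward the value whose key best matches the query.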

Q: What are KV cache, GQA/MQA, and their memory implications? During autoregressive generation, the KV cache stores prior Key/Value tensors to avoid recomputation. Memory: ~2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value, where the leading 2 covers keys and values. GQA (Grouped-Query Attention, e.g., Llama 3) and MQA (Multi-Query Attention) share KV heads across query heads, reducing cache size by 4x or more with minimal quality loss. Critical for long contexts (128k+).
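
The memory formula can be checked with a short calculation. The configuration below (32 layers, head dimension 128, 32 query heads versus 8 KV heads, FP16 values) is an assumed example in the style of an 8B-class model, not a figure taken from this guide.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """KV cache size: one K and one V tensor per layer, per KV head, per position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

GIB = 1024 ** 3

# Multi-head attention: every query head has its own KV head.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192)
# Grouped-query attention: 8 KV heads shared across the 32 query heads.
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

print(mha / GIB, gqa / GIB)
```

With these assumed numbers, sharing KV heads four ways shrinks the cache for one 8192-token sequence from 4 GiB to 1 GiB.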

Q: Explain positional encodings and evolutions (RoPE, etc.). Original: sinusoidal (fixed). RoPE (Rotary Position Embeddings) rotates Q/K vectors—better extrapolation to longer contexts, compatible with optimizations. Common in modern models.

Q: What are tokens, embeddings, BPE, and common issues? Tokens are subword units (BPE merges frequent pairs). Embeddings are vector representations capturing semantics. Issues: whitespace sensitivity, number fragmentation, non-Latin scripts.

Q: Pre-training vs. SFT vs. RLHF/DPO? Pre-training: unsupervised next-token on massive data (knowledge). SFT: instruction pairs (behavior). RLHF: reward model + PPO (alignment). DPO: simpler preference optimization, often replaces PPO (more stable).

Other common: Context window & “lost in the middle”; temperature/top-p/top-k sampling; Chinchilla scaling laws (optimal tokens ~20x parameters); MoE (Mixture of Experts) for capacity vs. active params; FlashAttention.

2. Prompt Engineering & In-Context Learning

Q: Zero-shot, few-shot, Chain-of-Thought (CoT)? When does CoT help? Zero-shot: task description only. Few-shot: examples. CoT: “think step by step”—boosts reasoning in larger models on arithmetic/multi-step tasks. Use verifiable CoT.

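Q: Prompt injection and defenses? Untrusted input (a user message or retrieved content) carries instructions that try to override the system prompt. Defend with XML delimiters around untrusted text, input sanitization/classification, and output validation/guardrails.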

Q: How do you choose prompting vs. RAG vs. fine-tuning? Prompting: quick, cheap (start here). RAG: external/up-to-date knowledge + citations. Fine-tuning: style, format, domain behavior (consistent output). Hybrid often best.

3. Retrieval-Augmented Generation (RAG)

Q: How does RAG work? Core components and evaluation? Ingestion (chunking + embedding) → Vector store (similarity search) → Retrieve + augment prompt → Generate (with citations). Eval (RAGAS): Faithfulness, Answer Relevance, Context Precision/Recall. Hybrid search (vector + BM25), rerankers (cross-encoders).
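
Hybrid search needs a way to merge the vector ranking with the BM25 ranking. One common option, not named in this guide, is reciprocal rank fusion (RRF). The sketch below assumes each retriever has already returned an ordered list of document ids, and uses the conventional constant k=60.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Made-up result lists from a vector retriever and a BM25 retriever.
vector_hits = ["d2", "d1", "d4"]
bm25_hits = ["d1", "d3", "d2"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

RRF uses only ranks, so the two retrievers' scores never need to be put on a common scale. A cross-encoder reranker can then reorder the top fused results.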

Q: Chunking strategies? Lost in the middle? Fixed-size w/ overlap, semantic/hierarchical. Lost in the middle: models ignore middle context—mitigate by reranking (most relevant first) or fewer chunks.

Q: Failure modes and mitigations in production RAG? Irrelevant retrieval, chunk mismatch, embedding drift, no guardrails, hallucinations. Mitigations: hybrid search, rerankers, metadata filters, faithfulness checks, monitoring.

Q: RAG vs. fine-tuning? RAG: dynamic knowledge, citations, low cost to update. Fine-tuning: internalizes style/behavior (not ideal for facts). Often combine (fine-tune for format + RAG for data).

4. Fine-Tuning & Adaptation

Q: When and how to fine-tune? PEFT methods like LoRA/QLoRA? Use for domain style, format, or consistent behavior. LoRA: low-rank adapters (train small % of params). QLoRA: quantized for efficiency on consumer hardware. Trade-offs: catastrophic forgetting, data quality.

Q: Full fine-tuning vs. PEFT? Dataset prep? PEFT for efficiency. Prep: high-quality, diverse, formatted instruction pairs; dedup, clean. Hyperparams: learning rate, epochs, batch. Eval for regressions.

5. Agents & Advanced Architectures

Q: ReAct vs. Plan-and-Execute? Multi-agent systems? ReAct: interleave reasoning + tool use. Plan-and-Execute: upfront plan then act. Multi-agent: specialized roles (researcher + critic) for complex tasks. Use frameworks like LangGraph/CrewAI. Challenges: loops, cost, coordination.

Q: Tool use, function calling, agentic workflows? Give LLMs tools (APIs, code interpreter). Agent loops need error handling, memory, safety. Agentic > simple chatbot for multi-step goals.

6. Evaluation, Safety & Ethics

Q: How to evaluate LLMs/apps? LLM-as-Judge? Automated: faithfulness, ROUGE/BLEU (limited), RAGAS. Human prefs. LLM-as-Judge (stronger model scores outputs). Track hallucinations, bias, toxicity.

Q: Hallucination mitigation? RAG grounding, CoT, self-consistency, guardrails, output validation, citations, “I don’t know” prompts.

Q: Safety/alignment approaches? Prompt injection, bias? RLHF/DPO/Constitutional AI. Guardrails, moderation, red-teaming, input/output filters. Ethical: deepfakes, IP, fairness, transparency.

7. Deployment, Inference & Production

Q: Inference optimization (quantization, batching, vLLM)? Quantization (INT8/4), KV cache management, continuous/paged batching (vLLM + PagedAttention), speculative decoding, prefix caching. Trade-offs: latency vs. throughput, quality.

Q: Design a production RAG/LLM system (scalability, cost, monitoring)? Ingestion pipeline, vector DB (with hybrid), reranking, caching, load balancing, observability (latency per stage, faithfulness, token usage), auto-scaling, guardrails. Cost: prompt caching, smaller models + RAG.

Q: Deployment challenges? Latency, cost at scale, reliability, versioning, compliance. Tools: Docker/K8s, FastAPI, cloud (vLLM/TGI), MLOps.

8. Behavioral & Experience Questions

Q: Describe a challenging GenAI/LLM project. Challenges and solutions? (Use STAR.) E.g., “Built RAG for enterprise docs: hallucination issues → added reranker + faithfulness eval; scaled retrieval with hybrid search.” Discuss metrics improved, trade-offs.

Q: How do you handle production issues (e.g., drift, cost overruns)? Monitoring dashboards, periodic evals, A/B testing, fallback mechanisms, budget alerts.

Other common: Experience with Hugging Face, LangChain/LlamaIndex, vLLM, specific models (Llama, Mistral, GPT); multi-modal; future trends (agents, longer context, efficiency).

Preparation Tips:

  • Build projects: RAG app, fine-tune with QLoRA, simple agent.
  • Know trade-offs deeply (RAG vs. fine-tune, prompting strategies).
  • Practice system design and debugging scenarios.
  • Quantify impact (e.g., “reduced hallucinations by 40% via…”).

This covers the vast majority of questions.

This is a comprehensive guide to interview questions for roles involving GenAI/LLM Development (e.g., ML Engineer, AI Engineer, Applied Scientist).

I have organized this into tiers of difficulty and categories, complete with “Good” vs. “Great” answer frameworks.

Tier 1: The Fundamentals (Must-Know)

These are screening round questions. If you stumble here, you won’t progress.

Q1: Explain the Transformer architecture in 2 minutes.

  • Good Answer: It uses an encoder-decoder structure with self-attention mechanisms. It processes all tokens simultaneously rather than sequentially (like RNNs), using positional encodings to understand order. Multi-head attention allows the model to focus on different parts of the input simultaneously.
  • Great Answer: The core innovation is Scaled Dot-Product Attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The scaling by sqrt(d_k) prevents the softmax from entering regions with extremely small gradients. Architecturally, it relies on Residual connections and Layer Normalization to stabilize training for very deep networks. I’d also point out that modern LLMs (like GPT) use only the Decoder stack with a causal mask, whereas BERT uses the Encoder stack.

Q2: What is the difference between GPT and BERT?

  • Good Answer: BERT is encoder-only and bidirectional (uses masked language modeling), making it great for understanding tasks (classification, NER). GPT is decoder-only and autoregressive (predicts the next token left-to-right), making it great for generation.
  • Great Answer: Beyond the architecture, the training objectives dictate their use cases. BERT’s MLM pretraining allows it to see “future” context, which is powerful for embedding search. GPT’s causal LM objective makes it ideal for in-context learning (few-shot prompting). In production, GPT-style generation is harder to serve for low-latency tasks because decoding is sequential (one forward pass per generated token, so latency grows with output length), whereas BERT inference is a single forward pass.

Q3: What is Temperature, Top-K, and Top-P (Nucleus Sampling)?

  • Good Answer: They control randomness in generation. Temperature scales the logits before softmax (high temp = more random). Top-K samples from the K most likely tokens. Top-P samples from the smallest set of tokens whose cumulative probability exceeds P.
  • Great Answer: Temperature changes the shape of the probability distribution (it doesn’t truncate). Top-K is brittle because the number of plausible tokens varies by context (e.g., the first word of a sentence has many options, but the last word of “The capital of France is…” has few). Top-P is superior because it dynamically adjusts the vocabulary size based on confidence. In practice, I usually set Temperature=0.1 or 0.2 for near-deterministic output (such as code) and Top-P=0.9, keeping Top-K=0 (disabled).
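
Temperature scaling and nucleus (Top-P) filtering can be sketched in plain Python. The logits are made up and the function names are illustrative. Production decoders apply the same steps on tensors over the full vocabulary.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Divide logits by temperature, then softmax. Lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(logits, temperature=0.2, top_p=0.9, rng=random):
    """Sample one token index after temperature scaling and nucleus filtering."""
    nucleus = top_p_filter(apply_temperature(logits, temperature), top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
sharp = apply_temperature(logits, 0.2)
flat = apply_temperature(logits, 2.0)
nucleus = top_p_filter(apply_temperature(logits, 1.0), 0.9)
print(sharp, flat, sorted(nucleus))
```

At temperature 0.2 the top token takes nearly all the probability mass, so the nucleus collapses to a single token. At temperature 1.0 the same Top-P threshold keeps three tokens.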

Tier 2: The Development Workflow (Hands-On)

These test your actual engineering experience building with these models.

Q4: Walk me through your typical RAG (Retrieval-Augmented Generation) pipeline. Where did you encounter bottlenecks?

  • Good Answer: We ingest documents, chunk them, embed them with an embedding model, store them in a vector DB (like Pinecone), and at query time, we retrieve similar chunks and stuff them into the context window of the LLM.
  • Great Answer: The bottlenecks were two-fold:
    1. Chunking strategy: Fixed-size chunking broke semantic meaning. We implemented RecursiveCharacterTextSplitter with semantic overlap, and later moved to Document Summary Indexing (indexing the summary, retrieving the full doc).
    2. Retrieval quality: Naive embedding often retrieved irrelevant chunks. We implemented HyDE (Hypothetical Document Embeddings) to generate a fake answer first and embed that for retrieval, improving recall by 15%. We also added a Re-rank stage (using Cohere/Cross-encoders) after initial vector search to filter out false positives before feeding tokens to the LLM.

Q5: How do you evaluate an LLM application (RAG/Agent) offline?

  • Good Answer: We use held-out test sets and compare outputs using metrics like BLEU, ROUGE, or BERTScore.
  • Great Answer: BLEU/ROUGE are terrible for LLMs because they punish rephrasing. We use a three-pronged approach:
    1. Component-wise metrics: Hit-rate and Mean Reciprocal Rank (MRR) for the retriever.
    2. LLM-as-a-Judge: Using a strong model (e.g., GPT-4) to score outputs on correctness (factuality) and completeness against a golden answer. We run these on a dataset of ~500 diverse queries.
    3. RAGAS (RAG Assessment): Specifically tracking Faithfulness (does the answer stay grounded in the context?) and Answer Relevancy. We built a small internal labeling tool to spot-check 50 of the worst-performing “LLM-as-a-Judge” scores to catch biases.
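
The retriever metrics in step 1 can be computed directly from ranked results and a labeled set of relevant documents. The query and document ids below are made up for illustration.

```python
def hit_rate(results, relevant, k=5):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(1 for q, ranked in results.items() if set(ranked[:k]) & relevant[q])
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant document (0 when none is retrieved)."""
    total = 0.0
    for q, ranked in results.items():
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(results)

# Ranked retriever output per query, and the gold relevant ids per query.
results = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}}

print(hit_rate(results, relevant, k=3), mean_reciprocal_rank(results, relevant))
```

Hit rate shows whether the right document was retrieved at all. MRR also rewards ranking it near the top, which matters when only a few chunks fit in the prompt.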

Q6: How do you handle context window limitations when processing long documents (e.g., a 500-page PDF)?

  • Good Answer: We split the document into chunks and only feed the relevant chunks to the LLM.
  • Great Answer: Simply stuffing chunks fails for global reasoning. We use a Hierarchical Indexing strategy:
    • We index the document structure (headings/chapters).
    • For summarization, we use “Map-Reduce” (LangChain): summarizing each chunk (Map), then summarizing the summaries (Reduce).
    • For Q&A over huge docs, we use Parent Document Retrieval: we embed only the smaller child chunks for high accuracy matching, but retrieve the larger parent chunk (including surrounding context) to give the LLM enough detail to reason with.

Tier 3: Advanced Engineering & Optimization (Senior Level)

These differentiate a “prompt engineer” from a “production ML engineer.”

Q7: How do you reduce hallucinations in a production system?

  • Good Answer: Use RAG with high-quality data, and prompt engineer to say “If you don’t know, say you don’t know.”
  • Great Answer: We use a multi-layer defense:
    1. Prevention: We use Self-consistency (sampling multiple times and taking the majority answer) for math/logic tasks.
    2. Mitigation: We implement a fact-checking step where we pass the generated answer + retrieved chunks to a smaller, fine-tuned NLI (Natural Language Inference) model to check if the answer is entailed by the context.
    3. Systemic: We built an evals pipeline that specifically tracks “Contradiction” rates over time. If hallucinations spike, we trigger a rollback to the previous model version or log the offending queries for dataset augmentation.
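
The self-consistency step in item 1 reduces to a majority vote over sampled answers. In this sketch a canned iterator stands in for repeated LLM calls at a non-zero temperature, and `self_consistent_answer` is an illustrative name.

```python
from collections import Counter

def self_consistent_answer(sample_fn, prompt, n=5):
    """Sample n answers and return the most common one with its vote share."""
    answers = [sample_fn(prompt).strip() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n

# Stand-in for five sampled completions of the same prompt.
canned = iter(["42", "41", "42", "42 ", "40"])
answer, agreement = self_consistent_answer(lambda _prompt: next(canned), "What is 6 * 7?")
print(answer, agreement)
```

The vote share doubles as a rough confidence signal: low agreement can route the query to human review or a stronger model.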

Q8: You have to serve an LLM (e.g., Llama 3 70B) with a $200/month budget. How do you do it?

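  • Trick question: Llama 3 70B needs ~140GB of VRAM for the weights alone in FP16 (70B parameters × 2 bytes). $200/month covers roughly one small cloud GPU, such as a T4 (16GB) or at best a 24GB-class card. An A100 ships with 40GB or 80GB and costs far more.
  • Great Answer:
    • Impossible in FP16 on that hardware. I would use Quantization (load in 4-bit via bitsandbytes) to drop the weights to ~35GB, or ~40GB once overhead and KV cache are included.
    • Even then, a T4 (16GB) won’t cut it, and 2x T4s give only 32GB total. I would use vLLM for PagedAttention and serve across enough GPUs with tensor parallelism (for example 4x T4s, or a single 48GB card), and I would say plainly that running this around the clock is unlikely to fit $200/month.
    • To save compute, I’d implement a “router” using a much smaller model (e.g., a BERT classifier) that classifies the user intent. 80% of simple queries get routed to a cheap 7B model (e.g., Mistral), and only complex queries hit the 70B quantized model. This keeps the 70B model off the hot path for most traffic, which is what makes the budget approachable.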
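
The memory figures in this answer follow from simple arithmetic on parameter count and precision. The sketch below counts weights only. KV cache, activations, and quantization metadata come on top, which is why practical estimates for 4-bit land somewhat above the raw number.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Memory for model weights only, in GB (10^9 bytes)."""
    return params_billions * bits_per_param / 8

fp16 = weight_memory_gb(70, 16)  # 16 bits = 2 bytes per parameter
int4 = weight_memory_gb(70, 4)   # 4 bits = 0.5 bytes per parameter

print(fp16, int4)
```

Comparing these totals with per-GPU memory (16GB for a T4) shows quickly how many cards a given precision requires.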

Q9: What is PagedAttention and why is it revolutionary?

  • Good Answer: It’s a technique used in vLLM that manages the KV cache of the transformer to reduce memory fragmentation.
  • Great Answer: In autoregressive generation, the memory used for keys/values grows linearly with sequence length. Traditional systems pre-allocate contiguous memory for a max sequence length (wasting 60-80% of memory). PagedAttention partitions the KV cache into fixed-size blocks (“pages”) that are stored non-contiguously, similar to virtual memory in an OS, so waste is limited to the last partially filled block of each sequence. Blocks can also be shared across multiple generation sequences (useful for parallel sampling and beam search), which cuts memory use further and increases throughput significantly.
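
The difference between contiguous pre-allocation and on-demand paging can be illustrated with a small calculation. The block size of 16 tokens, the 2048-token maximum, and the 300-token sequence are assumptions chosen for the example.

```python
import math

def preallocated_waste(num_tokens, max_seq_len):
    """Fraction of a contiguous max-length reservation left unused."""
    return 1 - num_tokens / max_seq_len

def paged_waste(num_tokens, block_size=16):
    """Fraction unused when fixed-size blocks are allocated on demand (only the last block is partial)."""
    if num_tokens == 0:
        return 0.0
    blocks = math.ceil(num_tokens / block_size)
    return 1 - num_tokens / (blocks * block_size)

contiguous = preallocated_waste(300, 2048)
paged = paged_waste(300)

print(round(contiguous, 3), round(paged, 3))
```

For a 300-token sequence, reserving 2048 slots leaves about 85% unused, while paging wastes only the 4 empty slots in the final block.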

Q10: You need to fine-tune a 7B parameter model but only have a single 24GB GPU. What do you do?

  • Good Answer: Use LoRA (Low-Rank Adaptation).
  • Great Answer:
    1. Use QLoRA (quantized 4-bit base model + LoRA adapters) which fits easily into 24GB.
    2. Implement Gradient Checkpointing to trade compute for memory.
    3. Use a constant batch size of 1 with Gradient Accumulation over 4-8 steps to simulate a larger batch without increasing memory.
    4. Use the AdamW 8-bit optimizer.
      Crucially, I would ensure the dataset is formatted using a chat template (e.g., ChatML) and that the loss is calculated only on the assistant’s response tokens (by setting labels = -100 for user tokens) to prevent the model from learning to imitate the user’s questions.
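
The label-masking step can be sketched without any training framework. The token ids and role tags below are made up. In practice the roles come from applying the chat template and tracking which spans belong to the assistant.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default

def mask_non_assistant_labels(token_ids, roles):
    """Copy token ids to labels, replacing every non-assistant position with IGNORE_INDEX."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

# Made-up ids: four user tokens followed by five assistant tokens.
token_ids = [101, 7592, 2129, 102, 2023, 2003, 1996, 3437, 102]
roles = ["user"] * 4 + ["assistant"] * 5
labels = mask_non_assistant_labels(token_ids, roles)
print(labels)
```

Positions labeled -100 contribute nothing to the loss, so gradients come only from the assistant's tokens.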

Tier 4: Situational & Behavioral (The “Firefighting” Questions)

Q11: Your prompt works perfectly in a notebook but fails when deployed to production. Why?

  • Great Answer:
    1. Version drift: The production environment likely has a different transformers version, or the model was never switched to eval() mode, so dropout is still active at inference.
    2. System prompt leakage: The deployed chat template might have a different system prompt than my notebook.
    3. Determinism: I set torch.manual_seed() in the notebook but not in the deployment script, leading to slightly different sampling paths (even with temperature=0, GPU floating-point non-determinism exists).
    4. Tokenization: The production code might not be using add_special_tokens=True or truncation correctly, altering the input meaning.

Q12: A business stakeholder asks: “Why did the AI answer this question incorrectly? Can we just add it to the prompt?”

  • Great Answer: I would respond, “We can add a rule to the system prompt today as a hotfix (e.g., ‘Always say X when asked about Y’). However, prompt engineering is fragile and doesn’t scale. The root cause is likely a lack of contextual data in the RAG pipeline. The correct engineering fix is to improve the retrieval step for this specific query type or add this specific Q&A pair to a fine-tuning dataset for the next release. Let’s fix it with prompt engineering now, but we need an engineering ticket to improve the data pipeline.”

Tier 5: The “Whiteboard” Prompt Engineering Challenge

Q13: Write a prompt for a “Data Analyst Agent” that writes SQL. The prompt must handle ambiguous column names.

  • Great Answer Structure:

    SYSTEM: You are a SQL expert. You have access to the following Postgres schema:
    [SCHEMA]: {schema}
    [RULES]:
    1. Never use SELECT *. Always specify columns.
    2. If a user asks for "sales" and there are multiple columns (gross_sales, net_sales), ALWAYS ask a clarifying question before generating SQL. Do not guess.
    3. Output ONLY valid JSON with keys: "sql_query", "clarification_needed" (bool), and "message".
    4. If using a date filter, assume UTC timezone.
    USER: {user_query}

  • Why this is great: It builds guardrails (no SELECT *), forces ambiguity resolution (don’t guess), and defines a strict output format (JSON) for easy parsing in production.
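
On the application side, the strict JSON format is only useful if it is checked. The sketch below validates a reply against the three keys named in the prompt and re-checks the SELECT * rule. It assumes `sql_query` is an empty string when clarification is needed, and `parse_agent_reply` is an illustrative name.

```python
import json

REQUIRED_KEYS = {"sql_query": str, "clarification_needed": bool, "message": str}

def parse_agent_reply(raw):
    """Validate the model's JSON reply against the keys the prompt asks for."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or invalid key: {key}")
    # Enforce rule 1 in code as well, since the model may ignore the prompt.
    if not data["clarification_needed"] and "select *" in data["sql_query"].lower():
        raise ValueError("SELECT * is not allowed")
    return data

reply = parse_agent_reply(
    '{"sql_query": "", "clarification_needed": true, '
    '"message": "Do you mean gross_sales or net_sales?"}'
)
print(reply["message"])
```

A reply that fails validation can be retried with the error message appended, rather than passed on to the database.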

Summary Cheat Sheet: Key Buzzwords to Drop

  • Training: LoRA, QLoRA, DeepSpeed, ZeRO-3, Flash Attention 2, Gradient Checkpointing, Dataset Curation.
  • Inference: vLLM, Tensor Parallelism, Continuous Batching, PagedAttention, KV Cache, Speculative Decoding.
  • RAG: HyDE, Re-ranking (Cross-encoders), Multi-vector retrieval, Parent Document Retriever, RAPTOR (hierarchical summarization).
  • Evals: LLM-as-a-Judge, RAGAS, Faithfulness, Answer Relevancy, Context Precision, BLEURT.
