Audience: Engineers with AWS architecture, API/microservices, and basic AI/ML background
Level: Beginner → Advanced | Current as of: June 2026

How to use this guide: read Sections 1–5 once for depth, drill Section 2 Q&A out loud, then use Section 6 (revision sheet) the morning of the interview. Section 7 is what separates a “pass” from a “strong hire.”
1. FUNDAMENTALS
1.1 What is AWS Bedrock (in plain English)
- A fully managed, serverless service that gives you access to many foundation models (FMs) from multiple AI providers through one API.
- You don’t provision GPUs, manage clusters, or host weights — you call a model like an API endpoint.
- Think of it as the “single front door” to GenAI on AWS, with AWS-native security (IAM), networking (VPC/PrivateLink), and observability (CloudWatch/CloudTrail) built in.
- One-line interview definition: “Bedrock is AWS’s managed, multi-provider foundation-model platform — you consume models via a unified API and add RAG, agents, guardrails, fine-tuning, and evaluation as managed building blocks, without managing any inference infrastructure.”
1.2 Key features & components
- Model access — single API to 100+ FMs across ~13 providers (see 1.4).
- Inference APIs — InvokeModel / InvokeModelWithResponseStream (provider-specific bodies) and Converse / ConverseStream (model-agnostic, recommended for new builds). OpenAI-style Chat Completions / Responses and Anthropic Messages formats are also supported for specific models.
- Knowledge Bases — managed RAG: ingestion, chunking, embedding, vector indexing, retrieval — all handled for you.
- Bedrock Agents — config-first, fully managed agents that run a Reason-Act (ReAct) loop, call tools (Lambda/APIs), and query Knowledge Bases.
- AgentCore (GA Oct 2025) — modular, framework- and model-agnostic production agent runtime (more in 1.5).
- Guardrails — content filters, denied topics, PII detection/redaction, word filters, and contextual grounding/hallucination checks.
- Customization — fine-tuning, continued pre-training, and (newer) reinforcement fine-tuning; plus custom model import.
- Model Evaluation — automated + human eval to compare models on cost/accuracy/latency.
- Prompt Management & Flows — versioned prompts and low-code orchestration.
1.3 Architecture (conceptual)
         ┌─────────────────────────────────────────────┐
Client → │ App / API Gateway / Lambda / ECS / EKS      │
         └───────────────┬─────────────────────────────┘
                         │ IAM-signed (SigV4) calls
                         ▼
         ┌─────────────────────────────────────────────┐
         │               AMAZON BEDROCK                │
         │ Converse / InvokeModel | Guardrails         │
         │ Knowledge Bases (RAG)  | Agents/AgentCore   │
         │ Customization          | Evaluation         │
         └───────┬───────────────────────────┬─────────┘
                 │                           │
     ┌───────────▼─────────┐      ┌──────────▼───────────┐
     │ Foundation Models   │      │ Vector store         │
     │ (Anthropic, Nova,   │      │ (OpenSearch Svrless, │
     │ Llama, Mistral...)  │      │ Aurora pgvector,     │
     └─────────────────────┘      │ Pinecone, Redis)     │
                                  └──────────────────────┘
Cross-cutting: VPC endpoints (PrivateLink), KMS, CloudWatch, CloudTrail
1.4 Supported foundation models (2026)
Bedrock’s differentiator is model diversity through one API. Current providers include:
- Anthropic — Claude family (e.g., Opus, Sonnet, Haiku tiers) — strong reasoning, long context, tool use.
- Amazon — Nova (Micro, Lite, Pro, Premier; Nova 2) and Titan (text, embeddings, image). Nova Pro offers large context (~300K) and multimodal input; deep AWS integration.
- Meta — Llama family (cost-effective, open-weight, good for batch).
- Mistral AI — Large/Small (efficient, multilingual).
- Cohere — Command (text) + Embed (embeddings/RAG).
- AI21 Labs — Jamba.
- Stability AI — image generation.
- Others — DeepSeek, Qwen, Writer, Luma AI (video), TwelveLabs (video understanding), OpenAI (select models).
Interview tip: you don’t need exact model IDs. You DO need to say which model class fits which job (reasoning vs cheap-bulk vs embeddings vs image/video) and that you can swap models without rewriting code when using Converse.
1.5 Core concepts you must be able to define
- Prompt Engineering — crafting inputs (instructions, context, few-shot examples, output format) to steer model behavior without changing weights. Cheapest lever.
- RAG (Retrieval-Augmented Generation) — inject relevant, up-to-date private data into the prompt at query time via a vector search, so the model answers from your knowledge. Reduces hallucination, no retraining, data stays current.
- Fine-tuning — adapt a base model to your domain/style using labeled examples. Higher effort/cost; note: custom fine-tuned text models on Bedrock require Provisioned Throughput to serve.
- Agents — LLM + reasoning loop that can plan, call tools/APIs, and use memory to complete multi-step tasks autonomously.
- AgentCore vs Bedrock Agents:
- Bedrock Agents — fully managed, configuration-first, AWS runs the ReAct loop. Fast to ship, less control. Best for straightforward use cases.
- AgentCore — you write the agent (LangGraph, CrewAI, LlamaIndex, Strands, or custom) and deploy on managed infra. Framework- & model-agnostic. Components: Runtime (per-session microVM, strong isolation, up to ~8 hr), Memory (short/long-term), Identity (OAuth/IAM for calling 3rd-party APIs on a user’s behalf), Gateway (turns Lambdas/APIs into MCP tools), Browser, Code Interpreter, Policy (Cedar), Observability. Best for complex/multi-agent production systems.
2. INTERVIEW QUESTIONS & ANSWERS
A. BEGINNER (15 Q&A)
1. What problem does Bedrock solve?
Lets teams build GenAI apps without hosting models or managing GPUs, using one API across many providers, with AWS security and billing built in.
2. Is Bedrock serverless?
Yes — no infrastructure to provision for on-demand inference. You pay per token / per request.
3. What is a foundation model?
A large model pre-trained on broad data, adaptable to many tasks via prompting, RAG, or fine-tuning.
4. Name three model providers on Bedrock.
Anthropic (Claude), Amazon (Nova/Titan), Meta (Llama). Also Mistral, Cohere, AI21, Stability, etc.
5. How do you call a model?
Via the Bedrock Runtime API — InvokeModel (provider-specific body) or the model-agnostic Converse API; streaming variants exist for token-by-token output.
6. InvokeModel vs Converse — why does Converse matter?
Converse gives a uniform request/response schema across models, so you can swap models with no code change. InvokeModel requires each provider’s specific JSON body.
7. What are tokens?
Chunks of text (~¾ of a word in English). You’re billed per input and output token; context windows are measured in tokens.
8. What is a context window?
Max tokens (prompt + response) a model can consider at once. Larger windows allow more documents/history.
9. What is RAG in one sentence?
Retrieve relevant private data with a vector search and feed it into the prompt so the model answers from your data.
10. RAG vs fine-tuning — when use which?
RAG for fresh/changing facts and knowledge; fine-tuning for style, format, or domain behavior. Often combined.
11. What is a Knowledge Base in Bedrock?
A managed RAG pipeline: it ingests docs (e.g., from S3), chunks + embeds them, stores vectors, and retrieves on query.
12. What are Guardrails?
A safety layer: block harmful content, deny topics, redact PII, and check answers are grounded in source data (anti-hallucination).
13. Give a common Bedrock use case.
Enterprise Q&A chatbot over internal docs (RAG), document summarization, code assist, content generation, classification/extraction.
14. How is Bedrock secured by default?
IAM (SigV4) auth, no customer data used to train base models, encryption with KMS, VPC/PrivateLink for private access, CloudTrail audit logs.
15. Is there a free tier?
No permanent free tier — pay-as-you-go from the first call. New accounts may get limited promotional AWS credits.
B. INTERMEDIATE (18 Q&A)
1. Walk through Bedrock’s pricing modes.
- On-Demand — pay per input/output token, no commitment. Best default / variable load.
- Batch — async bulk jobs, results to S3, ~50% cheaper than on-demand.
- Provisioned Throughput — reserved capacity in Model Units, hourly billing, 1- or 6-month commitments. For steady high volume or latency guarantees (and required to serve custom fine-tuned text models).
- Prompt Caching — cache static prompt prefixes (system prompt, RAG context, few-shot) for up to ~90% savings on repeated input tokens.
- Customization — fine-tuning training (per token) + weight storage (monthly/GB) + provisioned inference.
2. On-demand vs provisioned — how do you decide?
Compare sustained utilization. Provisioned wins roughly above ~80–85% steady utilization or when you need guaranteed latency/no throttling; below that, on-demand/batch is cheaper.
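The break-even comparison can be sketched as simple arithmetic. This is a minimal illustration under stated assumptions: the per-token prices, the hourly Model Unit rate, and the monthly token capacity are made-up placeholders, not published Bedrock rates, and the function names are hypothetical.

```python
def on_demand_monthly_cost(tokens_in, tokens_out, price_in_per_1k, price_out_per_1k):
    """Monthly on-demand spend for the given token volumes."""
    return tokens_in / 1000 * price_in_per_1k + tokens_out / 1000 * price_out_per_1k


def provisioned_monthly_cost(model_units, hourly_rate_per_mu, hours=730):
    """Provisioned Throughput is billed per Model Unit per hour, used or not."""
    return model_units * hourly_rate_per_mu * hours


def break_even_utilization(provisioned_cost, on_demand_cost_at_full_capacity):
    """Fraction of provisioned capacity you must actually use before it beats on-demand."""
    return provisioned_cost / on_demand_cost_at_full_capacity


# Hypothetical numbers for illustration only.
full_capacity_on_demand = on_demand_monthly_cost(
    tokens_in=10_000_000_000, tokens_out=1_000_000_000,
    price_in_per_1k=0.003, price_out_per_1k=0.015,
)
provisioned = provisioned_monthly_cost(model_units=1, hourly_rate_per_mu=50.0)
utilization = break_even_utilization(provisioned, full_capacity_on_demand)
print(f"break-even utilization: {utilization:.0%}")
```

With these placeholder inputs the break-even lands a little above 80%, which is the kind of threshold the rule of thumb describes; real numbers depend entirely on the model and commitment term.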
3. What IAM controls matter for Bedrock?
- Restrict bedrock:InvokeModel to specific model ARNs (control which models are usable).
- Separate roles for invoke vs admin (creating KBs, guardrails, fine-tunes).
- Use resource policies, condition keys, and least privilege; attach Guardrail ID enforcement where supported.
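A minimal sketch of the model-ARN scoping idea, building the policy document as a Python dict. The helper name and the Sid are illustrative; the ARN layout shown is the foundation-model form, and other resource types (inference profiles, custom models) use different ARN shapes that are not covered here.

```python
import json


def invoke_only_policy(region, model_ids):
    """Build an identity policy that allows inference on an explicit list of models only."""
    resources = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for model_id in model_ids
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeApprovedModelsOnly",  # illustrative name
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": resources,
            }
        ],
    }


policy = invoke_only_policy(
    "us-east-1", ["anthropic.claude-3-5-sonnet-20240620-v1:0"]
)
print(json.dumps(policy, indent=2))
```

Anything not listed in Resource is implicitly denied, so adding a model becomes a reviewed policy change rather than a code change.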
4. How do you keep traffic off the public internet?
VPC interface endpoints (PrivateLink) for Bedrock runtime + agent endpoints, so calls stay on the AWS network. Use SCPs to deny non-VPCE access.
5. How is data privacy handled?
Your prompts/outputs are not used to train the base FMs; data is encrypted in transit and at rest (KMS). Fine-tuning uses a private copy of the model.
6. What are the main inference APIs and endpoints?
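APIs: InvokeModel, Converse, plus model-specific Messages / Chat Completions / Responses. Served from the bedrock-runtime endpoint. Control-plane operations such as customization jobs, guardrails, and provisioned throughput use the bedrock endpoint; Knowledge Bases and Agents are built through bedrock-agent and queried through bedrock-agent-runtime.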
7. How do streaming responses work and why use them?
ConverseStream / InvokeModelWithResponseStream return tokens incrementally → lower perceived latency (time-to-first-token) for chat UIs.
8. How does cross-region inference help?
Cross-region inference profiles route requests across a geography to improve availability and throughput / reduce throttling. Routing is determined by the profile you call; check the pricing of the specific profile you use and expect some latency variation vs single-region, since a request may be served from a more distant region.
9. How do you integrate Bedrock into a microservices app?
Put a thin service (Lambda/ECS/EKS) behind API Gateway; it assumes an IAM role, calls Converse, applies a Guardrail, logs to CloudWatch. Keep model choice configurable (env var / parameter store) to swap models.
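One way to keep the model choice configurable is to assemble the Converse arguments from configuration in a single function. This is a sketch, not a prescribed layout: the environment variable names are hypothetical, and the default model ID is simply the one used elsewhere in this guide.

```python
import os


def build_converse_request(user_text, system_text=None, env=os.environ):
    """Assemble keyword arguments for bedrock-runtime `converse` from configuration."""
    request = {
        # Hypothetical variable names; swap models by changing config, not code.
        "modelId": env.get("BEDROCK_MODEL_ID", "anthropic.claude-3-5-sonnet-20240620-v1:0"),
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {
            "maxTokens": int(env.get("BEDROCK_MAX_TOKENS", "500")),
            "temperature": float(env.get("BEDROCK_TEMPERATURE", "0.2")),
        },
    }
    if system_text:
        request["system"] = [{"text": system_text}]
    guardrail_id = env.get("BEDROCK_GUARDRAIL_ID")
    if guardrail_id:
        request["guardrailConfig"] = {
            "guardrailIdentifier": guardrail_id,
            "guardrailVersion": env.get("BEDROCK_GUARDRAIL_VERSION", "1"),
        }
    return request
```

In the service handler this would be used as `rt.converse(**build_converse_request(text))`, where `rt` is a `bedrock-runtime` client. Keeping the builder free of network calls also makes it easy to unit test.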
10. How does a Bedrock Knowledge Base ingest data?
Source (e.g., S3) → chunking → embeddings (e.g., Titan/Cohere embed) → vector store (OpenSearch Serverless, Aurora pgvector, Pinecone, Redis, etc.) → retrieval API (Retrieve / RetrieveAndGenerate).
11. Which vector stores does Bedrock support for KBs?
OpenSearch Serverless (default/managed), Aurora PostgreSQL pgvector, Pinecone, Redis Enterprise, MongoDB Atlas, and Neptune Analytics — among others.
12. What’s RetrieveAndGenerate vs Retrieve?
Retrieve returns matching chunks (you build the prompt). RetrieveAndGenerate does retrieval and the LLM answer in one managed call, with citations.
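When you use Retrieve and build the prompt yourself, the assembly step might look like the sketch below. It assumes a response dict with a `retrievalResults` list whose items carry `content.text` and `score`; the sample data and the prompt wording are invented for illustration.

```python
def build_grounded_prompt(question, retrieve_response, max_chunks=3):
    """Turn a Retrieve-style response into a numbered-context prompt."""
    results = retrieve_response.get("retrievalResults", [])
    ranked = sorted(results, key=lambda r: r.get("score", 0.0), reverse=True)[:max_chunks]
    context_lines = [
        f"[{i}] {r['content']['text']}" for i, r in enumerate(ranked, start=1)
    ]
    return (
        "Answer using ONLY the numbered context. Cite sources like [1].\n"
        "If the answer is not in the context, say you don't know.\n\n"
        "Context:\n" + "\n".join(context_lines) + f"\n\nQuestion: {question}"
    )


# Invented sample shaped like a retrieval response.
sample = {
    "retrievalResults": [
        {"content": {"text": "Refunds are issued within 14 days."}, "score": 0.62},
        {"content": {"text": "Returns require a receipt."}, "score": 0.81},
    ]
}
print(build_grounded_prompt("How long do refunds take?", sample))
```

The resulting string goes into a Converse user message. RetrieveAndGenerate hides this step, which is convenient until you need control over ranking, truncation, or citation format.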
13. How do Bedrock Agents call your systems?
Via action groups mapped to Lambda functions or OpenAPI schemas; the agent plans, calls the tool, reads results, and continues the ReAct loop. It can also query attached Knowledge Bases.
14. How do you observe/monitor Bedrock?
CloudWatch metrics (invocations, latency, token counts, throttles) + CloudTrail for API audit; model invocation logging to S3/CloudWatch; for agents, AgentCore Observability/traces.
15. How do you handle throttling / quotas?
Request quota increases; use provisioned throughput or cross-region profiles; implement exponential backoff + jitter; queue + batch where latency allows.
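Exponential backoff with jitter can be sketched in a few lines. This version uses "full jitter" (a random wait between zero and the current cap); `ThrottledError` is a stand-in exception defined here for illustration, and the delay values are arbitrary defaults.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error from the service (hypothetical name)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `fn` on ThrottledError using exponential backoff with full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)  # full jitter: wait a random time in [0, cap)
```

With boto3 you would translate the SDK's throttling error into this retry path (or rely on the SDK's built-in retry modes); the jitter matters because it stops many clients from retrying in lockstep.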
16. What is a Model Unit (MU)?
The unit of provisioned throughput capacity (defined input/output tokens-per-minute for a model), billed hourly with a commitment.
17. How do Guardrails reduce hallucination specifically?
Contextual grounding checks compare the answer to the retrieved source context and score grounding/relevance; low-grounding responses can be blocked or flagged.
18. Embeddings — what role do they play?
Convert text to vectors for semantic similarity search in RAG. Choice of embedding model (dimensionality, cost, multilingual) affects retrieval quality and vector-store cost.
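To make "semantic similarity search" concrete, here is a toy top-k retrieval over hand-written vectors using cosine similarity. The 3-dimensional vectors are invented; a real embedding model returns far more dimensions, and a vector store would use an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=2):
    """Return the k (chunk_text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed_chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings" for illustration only.
index = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.0]),
    ("office holidays", [0.0, 0.1, 0.9]),
]
print(top_k([0.8, 0.2, 0.0], index, k=2))
```

Dimensionality shows up directly in this picture: every stored chunk costs one vector of that length, which is why the embedding model choice affects vector-store cost as well as quality.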
C. ADVANCED (18 Q&A)
1. Design considerations for a production RAG pipeline.
- Chunking strategy (size/overlap, semantic vs fixed) — biggest quality lever.
- Embedding model choice (quality vs cost vs dimensions vs multilingual).
- Vector store (managed OpenSearch Serverless for simplicity vs Aurora pgvector for cost/SQL vs Pinecone for scale).
- Retrieval tuning — top-k, hybrid (keyword + vector), re-ranking, metadata filters.
- Grounding + citations via Guardrails and RetrieveAndGenerate.
- Freshness — incremental re-ingestion / sync jobs.
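Since chunking is called out as the biggest quality lever, a fixed-size chunker with overlap is worth being able to write from memory. This sketch splits on whitespace and counts words rather than tokens, which is a simplification; the sizes are arbitrary defaults.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks; each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk. Semantic chunking replaces the fixed `step` with boundaries at headings or paragraph breaks.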
2. When would you NOT use a managed Knowledge Base?
When you need custom chunking/re-ranking logic, multi-stage retrieval, or an existing vector platform — then orchestrate retrieval yourself and call Converse directly (more control, more ops).
3. How do you cut latency on Bedrock?
- Stream tokens (reduce time-to-first-token).
- Prompt caching for static prefixes.
- Smaller/faster model tier for simple tasks (model routing).
- Provisioned throughput to remove cold-start/throttle variance.
- Reduce prompt size (trim context, tighter top-k), region proximity, parallelize independent calls.
4. Multi-model strategy — explain.
Route each request to the cheapest model that meets quality — e.g., small/Haiku-class for classification & simple chat, mid for general, large/Opus-class for hard reasoning. Implement with a router (rules or a small classifier). Converse makes swapping trivial. Typical savings: 40–60%.
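A rule-based router is the simplest way to demonstrate this. Everything in the sketch is an assumption made for illustration: the tier names, the placeholder model IDs, the keyword hints, and the word-count thresholds would all be tuned against an evaluation set in practice.

```python
# Tier names, model IDs, and thresholds are illustrative assumptions, not Bedrock settings.
MODEL_TIERS = {
    "small": "small-fast-model-id",
    "mid": "general-purpose-model-id",
    "large": "strong-reasoning-model-id",
}

REASONING_HINTS = ("why", "prove", "derive", "trade-off", "step by step", "compare")


def pick_tier(prompt, task=None):
    """Rule-based routing: choose the cheapest tier that plausibly meets quality."""
    text = prompt.lower()
    word_count = len(text.split())
    if task in {"classification", "extraction"}:
        return "small"
    if any(hint in text for hint in REASONING_HINTS) or word_count > 400:
        return "large"
    if word_count <= 30:
        return "small"
    return "mid"


def route(prompt, task=None):
    """Return the model ID to pass as `modelId` for this request."""
    return MODEL_TIERS[pick_tier(prompt, task)]
```

Because Converse uses one request schema, the router's output can be dropped straight into `modelId`. Logging the chosen tier per request is what lets you measure the savings afterwards.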
5. Real-time vs batch inference — trade-offs.
- Real-time (on-demand/provisioned): low latency, interactive; higher per-token cost.
- Batch: async, ~50% cheaper, results to S3; for bulk classification, backfills, evaluations, summarizing archives. Choose by latency tolerance.
6. How do you control cost at scale (top levers)?
Model right-sizing/routing → prompt caching → batch for non-interactive jobs → prompt trimming → provisioned throughput only above break-even → watch hidden costs (vector store minimums, KB embeddings, Guardrails, data transfer).
7. What are the “hidden” costs of a Bedrock RAG/agent stack?
Vector store baseline (e.g., OpenSearch Serverless minimum capacity units can dominate small workloads), embedding ingestion, Guardrail evaluation per request, Agent/AgentCore runtime + memory, cross-region premium, and egress.
8. How do you design for high availability and throttle resilience?
Cross-region inference profiles, provisioned capacity for the critical path, backoff+jitter, a fallback model, request queue (SQS) for spiky bursts, and circuit breakers.
9. How do Guardrails fit a defense-in-depth design?
Apply at input (block jailbreak/PII) and output (block harmful content, redact PII, grounding check). Centralize one Guardrail policy ID; enforce via IAM so apps can’t bypass it.
10. How do you evaluate which model to ship?
Use Bedrock Model Evaluation (automatic metrics + human/LLM-as-judge) on a representative eval set scoring accuracy, latency, and cost; re-run when models update.
11. Fine-tuning vs RAG vs prompt engineering — decision framework.
Start with prompting → add RAG for private/fresh knowledge → fine-tune only for persistent style/format/domain behavior that prompting can’t achieve. Remember fine-tuned text models need provisioned throughput.
12. How would you secure an enterprise, multi-tenant Bedrock app?
Per-tenant IAM roles/scoping, metadata filters so RAG only returns a tenant’s docs, KMS keys per tenant where needed, Guardrails enforced, VPC endpoints, CloudTrail per-tenant audit, and AgentCore session isolation (per-session microVM) for agents.
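The metadata-filter part of tenant isolation can be sketched as a request builder for the Knowledge Base Retrieve call. The `tenant_id` metadata key is an assumption about how documents were tagged at ingestion; the function name is hypothetical.

```python
def tenant_retrieve_request(knowledge_base_id, tenant_id, question, top_k=5):
    """Keyword arguments for bedrock-agent-runtime `retrieve`, scoped to one tenant."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                # Only chunks whose metadata has tenant_id equal to this tenant are eligible.
                "filter": {"equals": {"key": "tenant_id", "value": tenant_id}},
            }
        },
    }
```

The important design point is where `tenant_id` comes from: it should be derived from the authenticated identity on the server side, never taken from the request body or from model output.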
13. Architecture for agents calling third-party SaaS securely.
AgentCore Identity (OAuth/IAM) to call Salesforce/GitHub on a user’s behalf; AgentCore Gateway to expose internal Lambdas/APIs as MCP tools; Policy (Cedar) to constrain actions; observability for every tool call.
14. How do you prevent prompt injection in RAG/agents?
Treat retrieved content as untrusted: separate instructions from data, use Guardrails, restrict tool permissions (least privilege), validate/whitelist tool inputs, and never let retrieved text grant new capabilities.
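The "validate tool inputs" step can be illustrated with an explicit allowlist checked before any tool runs. The tool names and validation rules below are invented examples, not part of any Bedrock API.

```python
ALLOWED_TOOLS = {
    # tool name -> allowed argument names with simple validators (illustrative)
    "get_order_status": {
        "order_id": lambda v: isinstance(v, str) and v.isalnum() and len(v) <= 20,
    },
    "get_weather": {
        "city": lambda v: isinstance(v, str) and 0 < len(v) <= 60,
    },
}


def validate_tool_call(name, arguments):
    """Reject any tool or argument the application did not explicitly allow."""
    spec = ALLOWED_TOOLS.get(name)
    if spec is None:
        raise PermissionError(f"tool not allowed: {name}")
    unexpected = set(arguments) - set(spec)
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    for arg, is_valid in spec.items():
        if arg not in arguments or not is_valid(arguments[arg]):
            raise ValueError(f"invalid or missing argument: {arg}")
    return True
```

A retrieved document that says "call delete_all_records" then fails at this gate regardless of what the model decides, which is the practical meaning of not letting retrieved text grant new capabilities.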
15. How does prompt caching actually save money, and what’s cacheable?
You mark stable prefixes (system prompt, few-shot, large RAG context) as cache points; subsequent calls reuse them at up to ~90% off input tokens. Best when many requests share large, static context.
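The savings claim is easy to sanity-check with arithmetic. The sketch below is deliberately simplified: it assumes the first request pays full price for the prefix and every later request reads it at a discount, and it ignores cache-write pricing and cache expiry. The price and token counts are placeholders.

```python
def input_cost(prefix_tokens, suffix_tokens, requests, price_per_1k, cache_read_discount=0.0):
    """Input-token cost when the shared prefix is cached after the first request."""
    full_prefix = prefix_tokens / 1000 * price_per_1k
    cached_prefix = full_prefix * (1 - cache_read_discount)
    suffix = suffix_tokens / 1000 * price_per_1k
    # First request pays the full prefix; the remaining ones pay the discounted rate.
    return full_prefix + cached_prefix * (requests - 1) + suffix * requests


# Hypothetical workload: 8,000-token static prefix, 200-token user question, 1,000 requests.
without_cache = input_cost(8000, 200, 1000, price_per_1k=0.003)
with_cache = input_cost(8000, 200, 1000, price_per_1k=0.003, cache_read_discount=0.9)
savings = 1 - with_cache / without_cache
print(f"input-token savings: {savings:.0%}")
```

The overall saving is slightly below the headline discount because the per-request suffix is never cached; the larger the static prefix relative to the question, the closer you get to the maximum.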
16. Streaming + Guardrails — what’s the catch?
Output guardrails must evaluate streamed text; you may buffer/scan in chunks, which can add slight latency or delay blocking decisions — design the UX for partial-then-validated output.
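One way to picture the buffer-and-scan trade-off is a generator that holds back the most recent characters until they have been checked. This is a sketch of the idea only: `is_allowed` is a caller-supplied check (a hypothetical stand-in for a guardrail evaluation), and the approach only protects against phrases no longer than `window`.

```python
def guarded_stream(text_chunks, is_allowed, window=80):
    """Yield streamed text only after the buffered window has passed `is_allowed`.

    Holds back up to `window` characters so a blocked phrase split across
    chunks is checked before any part of it is released.
    """
    buffer = ""
    for chunk in text_chunks:
        buffer += chunk
        if not is_allowed(buffer):
            yield "[response blocked]"
            return
        if len(buffer) > window:
            # Release everything except the trailing window, which stays under review.
            release, buffer = buffer[:-window], buffer[-window:]
            yield release
    if buffer:
        yield buffer
```

A larger window is safer but delays what the user sees, which is exactly the latency-versus-safety choice the answer describes. Text already released cannot be recalled, so the UI still needs a way to show that a response was cut off.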
17. How do you scale a chatbot to high concurrent traffic?
API Gateway + autoscaling compute, provisioned throughput on the hot path, cross-region profiles, caching of common answers, async/batch for non-interactive parts, SQS buffering, and graceful model fallback under throttle.
18. Bedrock Agents vs AgentCore — when do you pick each in design?
Agents = quick, managed, simple flows, minimal code. AgentCore = production multi-agent, custom frameworks, non-Bedrock models, strict isolation, custom memory/identity. You can also start on Agents and graduate to AgentCore.
3. SCENARIO-BASED QUESTIONS (10)
S1. Design an internal HR/policy chatbot.
- API Gateway → Lambda (auth via Cognito/IAM) → Bedrock RetrieveAndGenerate over a Knowledge Base built from S3 HR docs (OpenSearch Serverless).
- Guardrails: PII redaction + grounding check + denied topics. Stream responses. Log to CloudWatch.
- Model: mid-tier for cost; escalate to a stronger model only for complex queries.
S2. Build a RAG pipeline with S3 + OpenSearch (managed path).
- Ingest: S3 → KB ingestion job → chunk → Titan/Cohere embeddings → OpenSearch Serverless vector index.
- Query: Retrieve (top-k + metadata filter) → assemble prompt → Converse; or one-shot RetrieveAndGenerate with citations.
- Ops: scheduled sync on doc updates; monitor retrieval relevance.
S3. RAG with Pinecone (existing vector platform).
- Use Pinecone as the KB vector store (or self-managed retrieval). Embed with a Bedrock embedding model, upsert to Pinecone, retrieve top-k, hybrid + re-rank, then Converse. Choose this when you already run Pinecone or need its scaling/filtering.
S4. Secure enterprise AI solution with IAM.
- Least-privilege roles: app role limited to bedrock:InvokeModel on specific model ARNs + specific KB/Guardrail ARNs.
- VPC endpoints (PrivateLink) only; SCP denies public access. KMS encryption. CloudTrail audit. Enforce a single Guardrail ID. No cross-account model access without explicit policy.
S5. Handle high-traffic GenAI workload (e.g., 10k+ rps spikes).
- Provisioned throughput on the critical path + on-demand burst; cross-region inference profiles; SQS queue for non-interactive work; response caching for FAQs; backoff+jitter; model routing to cheaper tiers; autoscaling compute.
S6. Cut a runaway Bedrock bill by ~50%.
- Audit token usage → route 40–60% of traffic to a cheaper model tier → enable prompt caching on static context → move bulk jobs to Batch (50% off) → trim prompts → reconsider provisioned vs on-demand at actual utilization → review vector-store minimums.
S7. Reduce hallucinations in a customer-facing assistant.
- Strong RAG (better chunking/re-ranking, metadata filters) + contextual grounding Guardrail + require citations + lower temperature + “answer only from context, else say you don’t know” instruction + eval harness to track grounding score.
S8. Multi-step agent that books travel across internal + external APIs.
- AgentCore: Runtime (isolated session) + Gateway (internal booking Lambda as MCP tool) + Identity (OAuth to external airline/SaaS on user’s behalf) + Memory (preferences) + Policy (Cedar limits on spend/actions) + Guardrails + observability/traces.
S9. Batch-classify millions of support tickets nightly.
- Bedrock Batch inference: assemble JSONL input in S3 → batch job with a small/cheap model → results to S3 → downstream analytics. ~50% cost savings, no latency pressure.
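The "assemble JSONL input" step might look like the sketch below. It assumes the batch input format of one JSON object per line with a `recordId` and a `modelInput` that mirrors the provider-specific InvokeModel body; the ticket data, the classification labels, and the prompt wording are invented, and the exact record schema should be confirmed against the current batch inference documentation.

```python
import json


def to_batch_records(tickets, max_tokens=50):
    """Yield one JSONL line per ticket in a recordId/modelInput shape."""
    for ticket_id, text in tickets:
        model_input = {
            # Anthropic-style body, matching the InvokeModel example in this guide.
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{
                "role": "user",
                "content": "Classify this support ticket as billing, bug, or other. "
                           "Reply with one word.\n\n" + text,
            }],
        }
        yield json.dumps({"recordId": ticket_id, "modelInput": model_input})


lines = list(to_batch_records([
    ("T-0001", "I was charged twice."),
    ("T-0002", "App crashes on login."),
]))
print("\n".join(lines))
```

Carrying your own ticket ID in `recordId` is what lets the downstream job join results in S3 back to the source rows.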
S10. Domain-specific tone/format the prompt can’t reliably produce.
- Fine-tune a base model on curated examples (style/format), serve via provisioned throughput, keep RAG for facts, and A/B against the prompt-only baseline with Model Evaluation before rollout.
4. HANDS-ON / PRACTICAL
4.1 Reference architecture (text) — enterprise RAG chatbot
User → CloudFront → API Gateway → Lambda (Cognito auth, assumes IAM role)
                                    │
                                    ├─→ Bedrock Guardrail (input check)
                                    ├─→ Bedrock Knowledge Base → OpenSearch Serverless (vectors)
                                    │        ▲ ingestion: S3 docs → chunk → Titan Embeddings
                                    ├─→ Bedrock Converse (Claude/Nova) ← retrieved context
                                    └─→ Bedrock Guardrail (output: PII redact + grounding) → stream to user
Cross-cutting: VPC endpoints (PrivateLink), KMS, CloudWatch metrics, CloudTrail audit
4.2 Example API flow — InvokeModel / Converse
import json

import boto3

rt = boto3.client("bedrock-runtime", region_name="us-east-1")

# --- Model-agnostic (recommended): Converse ---
resp = rt.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # swap freely
    messages=[{"role": "user", "content": [{"text": "Summarize this contract clause: ..."}]}],
    inferenceConfig={"maxTokens": 500, "temperature": 0.2},
    # guardrailConfig={"guardrailIdentifier": "gr-xxxx", "guardrailVersion": "1"},
)
print(resp["output"]["message"]["content"][0]["text"])

# --- Provider-specific: InvokeModel (Anthropic body) ---
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 500,
    "messages": [{"role": "user", "content": "Explain RAG in 2 lines."}],
}
out = rt.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps(body),
)
print(json.loads(out["body"].read())["content"][0]["text"])
Flow: client → IAM SigV4 auth → Bedrock runtime → model → (optional Guardrail) → response (or token stream via converse_stream).
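For the streaming path, the client-side work is mostly stitching text deltas together as they arrive. The sketch below assumes ConverseStream-style events, where text arrives in `contentBlockDelta` events and token usage in a final `metadata` event; the helper name and the `on_text` callback are illustrative.

```python
def collect_stream_text(events, on_text=None):
    """Join text deltas from ConverseStream-style events and return (text, usage)."""
    parts, usage = [], None
    for event in events:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"].get("text", "")
            parts.append(delta)
            if on_text:
                on_text(delta)  # e.g. push the fragment to the client immediately
        elif "metadata" in event:
            usage = event["metadata"].get("usage")
    return "".join(parts), usage
```

With boto3 this would be fed the `stream` member of the `converse_stream` response, for example `collect_stream_text(rt.converse_stream(...)["stream"], on_text=send_to_client)`, where `send_to_client` is whatever your transport provides.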
4.3 Prompt examples — good vs bad
| | Bad prompt | Good prompt |
|---|---|---|
| Clarity | “Tell me about our refund policy.” | “You are a support assistant. Using ONLY the context below, answer the refund question. If the answer isn’t in the context, say ‘I don’t have that information.’ Context: {{retrieved_chunks}} Question: {{user_q}}” |
| Format | “Give me the data.” | “Return JSON with keys: summary (string), risk_level (low/med/high). No prose.” |
| Grounding | “What’s the best plan?” | “Recommend a plan based only on the pricing table provided; cite the row you used.” |
Principles: role + task + constraints + only-from-context instruction + explicit output format + few-shot examples + low temperature for factual tasks.
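An explicit output format only pays off if the application checks it. Here is a small sketch that validates the JSON contract from the "Format" row above (keys `summary` and `risk_level`); the function name is hypothetical and the error handling is intentionally strict.

```python
import json

ALLOWED_RISK = {"low", "med", "high"}


def parse_risk_summary(model_text):
    """Parse and validate the JSON contract: summary (string), risk_level (low/med/high)."""
    data = json.loads(model_text)  # raises ValueError on non-JSON output
    if not isinstance(data, dict) or set(data) != {"summary", "risk_level"}:
        raise ValueError("expected exactly the keys: summary, risk_level")
    if not isinstance(data["summary"], str) or data["risk_level"] not in ALLOWED_RISK:
        raise ValueError("summary must be a string and risk_level one of low/med/high")
    return data
```

On a validation failure a common pattern is to retry once with the error message appended to the prompt, then fall back to a safe default rather than passing malformed data downstream.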
4.4 Mini-project ideas (portfolio-ready)
- Docs Q&A bot over your own PDFs (S3 + KB + Guardrails + Streamlit/React).
- Model router that picks Haiku/Sonnet/Opus-class by query complexity and logs cost saved.
- Batch summarizer for a public dataset using Batch inference (show 50% cost win).
- Tool-using agent (AgentCore or Agents) that queries a weather API + your Lambda.
- Eval harness comparing 3 models on accuracy/latency/cost with Model Evaluation.
5. COMPARISONS (high-yield for interviews)
5.1 Bedrock vs SageMaker
| | Bedrock | SageMaker |
|---|---|---|
| Purpose | Consume/host FMs via API, build GenAI apps | Build/train/deploy custom ML at the infra level |
| Control | Managed, serverless, less control | Full control of training + infrastructure |
| Effort | Low — “call a model” | High — pipelines, endpoints, tuning |
| Best for | RAG, chatbots, agents, GenAI features | Custom models, classic ML, deep customization |
| Together | Fine-tune on SageMaker → import into Bedrock | — |
5.2 Bedrock vs Azure OpenAI
| | Bedrock | Azure OpenAI |
|---|---|---|
| Models | Many providers (Anthropic, Nova, Llama, Mistral, Cohere, OpenAI…) | Primarily OpenAI models |
| Lock-in/choice | Swap models via one API | Centered on OpenAI flagship |
| Ecosystem | AWS-native (IAM, VPC, CloudWatch) | Azure-native (Entra ID, M365 integration) |
| Pick when | You want model diversity / already on AWS | You’re deep in Microsoft/M365 + want GPT models |
5.3 Bedrock vs Google Vertex AI
| | Bedrock | Vertex AI |
|---|---|---|
| Models | Multi-provider marketplace | Centered on Gemini + Model Garden |
| Strength | Model diversity, one secure API | Tight Google Cloud + data/BigQuery integration |
| ML depth | GenAI-focused (pair with SageMaker for full ML) | End-to-end ML platform (training + serving) |
| Pick when | Multi-model strategy / AWS shop | GCP shop / Gemini-first / heavy data-on-GCP |
One-liner: “Azure and Vertex steer you to their in-house flagship (GPT / Gemini). Bedrock’s edge is using Claude, Llama, Nova, etc. behind one API and one set of IAM controls — but that edge only matters if you’re already on AWS.”
6. QUICK REVISION SHEET (1–2 line answers)
- Bedrock = managed, multi-provider FM API; serverless inference, no GPUs to run.
- APIs = InvokeModel (provider-specific) vs Converse (model-agnostic, swap models); streaming variants for low TTFT.
- Pricing modes = On-Demand · Batch (-50%) · Provisioned Throughput (hourly, commitment) · Prompt Caching (-up to 90% input) · Customization.
- Provisioned wins above ~80–85% sustained utilization or for latency guarantees; required for custom fine-tuned text models.
- RAG = retrieve private data via vector search → inject into prompt. Fixes stale facts + hallucination, no retraining.
- Knowledge Bases = managed RAG (chunk → embed → vector store → retrieve); RetrieveAndGenerate does retrieval+answer with citations.
- Vector stores = OpenSearch Serverless (default), Aurora pgvector, Pinecone, Redis, MongoDB, Neptune.
- Guardrails = content filters + denied topics + PII redaction + contextual grounding (anti-hallucination); enforce one ID via IAM.
- Agents = managed ReAct loop, action groups (Lambda/OpenAPI), KB access. AgentCore = framework/model-agnostic prod runtime (Runtime microVM, Memory, Identity, Gateway/MCP, Browser, Code Interpreter, Policy/Cedar, Observability).
- Security = IAM/SigV4, model-ARN scoping, VPC PrivateLink, KMS, CloudTrail; base models not trained on your data.
- Latency levers = stream · cache · smaller model · provisioned · trim prompt · region proximity.
- Cost levers = model routing · prompt caching · batch · prompt trimming · right provisioning · watch vector-store minimums.
- Fine-tune vs RAG vs prompt = prompt first → RAG for knowledge → fine-tune for style/format.
- Cross-region inference profiles = better availability/throughput, less throttling.
- Bedrock vs SageMaker = call a model vs build a model (fine-tune on SM → import to Bedrock).
- vs Azure/Vertex = Bedrock = model diversity; Azure = OpenAI; Vertex = Gemini.
7. FOLLOW-UP / TRICKY CROSS-QUESTIONS
- “You said RAG fixes hallucination — but does it eliminate it?” → No. It reduces it; the model can still misread or ignore context. Add grounding checks + citations + “answer only from context.”
- “If on-demand pricing matches the provider’s direct API, why use Bedrock?” → Operational value: one API for many providers, IAM/SigV4, VPC, CloudWatch/CloudTrail, managed RAG/Guardrails/Agents — not raw price.
- “Why not just fine-tune instead of RAG?” → Fine-tuning doesn’t add fresh facts, is costly, and (for text) needs provisioned throughput; RAG keeps data current without retraining.
- “What breaks first under load?” → Throttling/quota limits → mitigate with provisioned throughput, cross-region profiles, backoff+jitter, queueing.
- “Where do most RAG costs hide?” → Vector store baseline (OpenSearch Serverless minimums), embedding ingestion, per-request Guardrail eval — often more than inference at low volume.
- “Converse vs InvokeModel — any reason to still use InvokeModel?” → For provider-specific params/features not yet surfaced in Converse, or legacy code; otherwise prefer Converse.
- “How do you stop prompt injection from retrieved docs?” → Treat retrieved text as untrusted data, separate from instructions; least-privilege tools; Guardrails; never let context grant new capabilities.
- “Agents vs AgentCore — isn’t AgentCore always better?” → No. Agents ship faster with near-zero code for simple flows; AgentCore is for complex/custom/multi-agent prod needs and adds infra cost/complexity.
- “Streaming with output Guardrails — what’s the risk?” → A blocked phrase may stream before evaluation completes; you buffer/scan chunks, trading some latency for safety.
- “Multi-region for compliance vs cross-region inference — same thing?” → No. Cross-region inference profiles are for throughput/availability; data-residency/compliance needs deliberate region selection and may conflict with cross-region routing.
- “Temperature 0 guarantees the same answer?” → Lowers variability, not a hard guarantee; outputs can still differ slightly across model versions/runs.
- “Can a fine-tuned text model run on-demand?” → Generally no — custom text models require provisioned throughput; factor that into cost.
TOP 10 MUST-REMEMBER CONCEPTS BEFORE THE INTERVIEW
- Bedrock = serverless, multi-provider FM API — consume models, don’t host them; AWS security baked in.
- Converse > InvokeModel for new builds — model-agnostic schema lets you swap models with no code change.
- Five pricing modes — On-Demand, Batch (-50%), Provisioned Throughput, Prompt Caching (-up to 90%), Customization; provisioned wins above ~80–85% utilization.
- RAG via Knowledge Bases — chunk → embed → vector store → retrieve; cures stale facts without retraining; use RetrieveAndGenerate with citations.
- Prompt → RAG → Fine-tune decision order; fine-tuned text models need provisioned throughput.
- Guardrails — filters + denied topics + PII redaction + contextual grounding (anti-hallucination); enforce via IAM.
- Agents vs AgentCore — managed ReAct loop vs framework/model-agnostic production runtime (Runtime, Memory, Identity, Gateway/MCP, Policy).
- Security model — IAM model-ARN scoping, VPC PrivateLink, KMS, CloudTrail; base models not trained on your data.
- Latency & cost levers — stream, cache, route to smaller models, batch, provision, trim prompts; watch hidden vector-store costs.
- Bedrock vs SageMaker vs Azure/Vertex — call vs build; model diversity vs OpenAI/Gemini-centric — and pick based on the rest of your stack.
Tip: rehearse Sections 3 and 7 aloud — scenario design and crisp answers to cross-questions are what land senior roles.

