This guide covers Databricks, Apache Spark, Delta Lake, Data Engineering, Data Architecture, AWS Integration, Security, Performance Optimization, Streaming, DevOps, and Real-World Scenarios.

1. Databricks Fundamentals

Q1. What is Databricks?

Answer

Databricks is a cloud-based unified analytics platform built on Apache Spark that provides:

Data Engineering
Data Science
Machine Learning
Data Warehousing
AI/GenAI workloads

Key Components:

Databricks Workspace
Clusters
Notebooks
Delta Lake
Unity Catalog
Databricks SQL
MLflow

Benefits:

Auto-scaling
High performance
Collaborative development
Managed Spark environment

Q2. Why use Databricks instead of traditional Spark?

Answer

Traditional Spark Challenges:

Cluster management
Dependency management
Scaling complexity

Databricks Advantages:

Managed Spark
Auto-scaling clusters
Delta Lake support
Collaborative notebooks
Optimized runtime
Security integration

Q3. What are Databricks Workspaces?

Answer

Workspace is the collaborative environment where users create:

Notebooks
Dashboards
Libraries
Jobs

Functions:

Code development
Data exploration
Collaboration
Pipeline management

Q4. What languages are supported in Databricks?

Answer

Supported Languages:

Python
SQL
Scala
R

Example:

df = spark.read.csv("/data/file.csv")
display(df)

Q5. What are Databricks Clusters?

Answer

Clusters are compute resources used to run workloads.

Types:

All-purpose clusters
Job clusters

Components:

Driver Node
Worker Nodes

2. Databricks Architecture

Q6. Explain Databricks Architecture.

Answer

Architecture Layers:

Control Plane

Managed by Databricks:

Notebook services
Job scheduler
Cluster manager

Data Plane

Managed in the customer's AWS account:

Spark clusters (EC2 instances)
S3 storage

Flow:

Users → Workspace → Spark Cluster → S3 Storage

Q7. What is Driver Node?

Answer

Driver Node:

Runs the SparkContext
Schedules tasks
Maintains metadata

Responsibilities:

DAG creation
Job execution planning
Task coordination

Q8. What are Worker Nodes?

Answer

Worker nodes perform:

Data processing
Task execution
Shuffle operations

Each worker contains:

Executors
CPU
Memory

Q9. What is DBFS?

Answer

Databricks File System (DBFS) is a distributed file system abstraction.

Example:

dbutils.fs.ls("/mnt/raw-data")

Use Cases:

Store files
Mount S3
Temporary data

Q10. What is Databricks Runtime?

Answer

Optimized Spark runtime provided by Databricks.

Includes:

Spark
Delta Lake
ML libraries
Performance optimizations

3. Apache Spark Concepts

Q11. What is Apache Spark?

Answer

Open-source distributed processing framework.

Features:

In-memory processing
Fault tolerance
Scalability

Modules:

Spark Core
SQL
Streaming
MLlib
GraphX

Q12. What is RDD?

Answer

RDD (Resilient Distributed Dataset)

Characteristics:

Immutable
Distributed
Fault tolerant

Example:

rdd = spark.sparkContext.parallelize([1,2,3])

Q13. What is DataFrame?

Answer

Distributed table-like structure.

Benefits:

Optimized execution
Catalyst optimizer
Easier development

Example:

df = spark.read.parquet("/data")

Q14. DataFrame vs RDD?

Feature	RDD	DataFrame
Optimization	No	Yes
Schema	No	Yes
Performance	Lower	Higher
Ease of use	Complex	Easy

Q15. What is Spark DAG?

Answer

DAG = Directed Acyclic Graph

Spark converts transformations into a DAG before execution.

Execution flow:

Transformations
DAG
Stages
Tasks

4. Delta Lake

Q16. What is Delta Lake?

Answer

Open-source storage layer providing:

ACID transactions
Schema enforcement
Time travel
Data versioning

Q17. Benefits of Delta Lake?

Answer

Major Benefits:

Reliable pipelines
Faster queries
Data consistency
Streaming support

Q18. What is Time Travel?

Answer

Query historical versions.

Example:

SELECT * FROM sales VERSION AS OF 10;

Use Cases:

Auditing
Rollback
Recovery

Q19. What is Schema Enforcement?

Answer

Prevents bad data insertion.

Example:

If a column expects an integer and a string arrives, the write is rejected.
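
As a rough illustration of that rule, the plain-Python sketch below checks incoming records against an expected schema and rejects mismatches. It is not Delta Lake's implementation; `EXPECTED_SCHEMA`, `validate`, and `is_valid` are hypothetical names introduced only for this example.

```python
# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"id": int, "amount": float, "country": str}

def validate(record: dict) -> None:
    """Raise ValueError if the record does not match EXPECTED_SCHEMA."""
    extra = set(record) - set(EXPECTED_SCHEMA)
    if extra:
        raise ValueError(f"unexpected columns: {sorted(extra)}")
    for column, expected_type in EXPECTED_SCHEMA.items():
        value = record.get(column)
        # None is treated as a nullable value in this sketch.
        if value is not None and not isinstance(value, expected_type):
            raise ValueError(
                f"column {column!r} expects {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )

def is_valid(record: dict) -> bool:
    """Return True when the record passes validation."""
    try:
        validate(record)
        return True
    except ValueError:
        return False
```

Rejecting the whole write, rather than silently coercing the value, is what keeps a table's downstream readers from seeing mixed types.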

Q20. What is Schema Evolution?

Answer

Allows schema updates.

Example:

.option("mergeSchema","true")

Q21. What is OPTIMIZE command?

Answer

Compacts small files.

OPTIMIZE sales;

Benefits:

Faster queries
Reduced metadata
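
The idea behind compaction can be sketched in plain Python as a greedy bin-packing of file sizes. This is only an illustration under simplifying assumptions; the real OPTIMIZE planner is more involved, and `plan_compaction` and the 128 MB target are hypothetical choices for the example.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group files so each group stays within target_mb.

    A single file larger than target_mb ends up in a group of its own.
    """
    bins = []
    current, current_size = [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

small_files = [8, 16, 4, 32, 64, 2, 2, 100]   # sizes in MB
plan = plan_compaction(small_files, target_mb=128)
# Each inner list would be rewritten as one larger file.
```

Eight input files become two output files here, which is the reduction in file count (and listing/metadata work) that the benefits above refer to.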

Q22. What is Z-Ordering?

Answer

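Co-locates rows with similar values of the chosen columns in the same files, so data skipping can prune more files at query time.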

OPTIMIZE sales
ZORDER BY(customer_id);

Reduces data scanned.
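
To show what a Z-order curve does, here is a small plain-Python sketch that interleaves the bits of two integer columns (a Morton code) and sorts by the result. It illustrates only the curve itself; a real engine typically first maps column values to integer ranks, and `interleave_bits` is a hypothetical helper.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) value for two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z

# Sort a 4x4 grid of (x, y) pairs along the Z-order curve.
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(*p))
```

Points that are close in both dimensions end up near each other in the sorted order, so a file holding a contiguous slice of that order covers a narrow range of both columns.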

Q23. What is VACUUM?

Answer

Removes data files that are no longer referenced by the table and are older than the retention threshold.

VACUUM sales RETAIN 168 HOURS;

Q24. What are ACID transactions?

Answer

Atomicity
Consistency
Isolation
Durability

Supported by Delta Lake.

5. AWS + Databricks Integration

Q25. How does Databricks integrate with AWS?

Answer

Services:

S3
IAM
Glue
Redshift
Kinesis
Lambda
SNS
SQS

Q26. Why use S3 with Databricks?

Answer

Data Lake Storage:

Benefits:

Unlimited scalability
Durable
Cost effective

Example:

spark.read.parquet("s3://bucket/path")

Q27. How does IAM work with Databricks?

Answer

IAM Roles provide secure access.

Example:

Cluster assumes IAM Role
Access S3 securely

No hardcoded credentials.

Q28. How do you connect Databricks to Redshift?

Answer

Methods:

JDBC
Spark Connector

Example:

(df.write.format("jdbc")
    .option("url", redshift_url)
    .option("dbtable", "public.sales")   # placeholder target table
    .mode("append")
    .save())

Q29. What is an Instance Profile?

Answer

AWS IAM Role attached to Databricks clusters.

Benefits:

Secure authentication
No secrets required

Q30. How would you secure S3 access?

Answer

Best Practices:

IAM roles
Bucket policies
KMS encryption
Private endpoints

6. Data Engineering

Q31. Explain ETL in Databricks.

Answer

ETL Flow:

Extract → Transform → Load

Example:

S3 → Databricks → Delta Lake

Q32. What is ELT?

Answer

Extract → Load → Transform

Preferred in cloud architectures.

Q33. How do you ingest JSON data?

df = spark.read.json("/input")

Q34. How do you ingest CSV files?

df = spark.read.option("header","true").csv("/data")

Q35. How do you handle bad records?

Answer

Options:

.option("mode","PERMISSIVE")

Modes:

PERMISSIVE
DROPMALFORMED
FAILFAST

Q36. How do you handle duplicate records?

Answer

df.dropDuplicates()

Q37. What is repartition?

Answer

Changes the number of partitions (up or down) with a full shuffle.

df.repartition(10)

Q38. What is coalesce?

Answer

Reduces the number of partitions without a full shuffle.

df.coalesce(5)

7. Performance Optimization

Q39. How do you optimize Spark jobs?

Answer

Methods:

Partitioning
Caching
Broadcast joins
AQE
Delta optimization

Q40. What is caching?

df.cache()

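Caches the DataFrame on first use; for DataFrames the default storage level keeps data in memory and spills to disk when memory is short.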

Q41. What is persistence?

Answer

Stores data:

Memory
Disk

df.persist()

Q42. What is Broadcast Join?

Answer

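The small table is copied to every executor so the large table does not need to be shuffled.

from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "id")  # "id" is a placeholder join key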

Improves join performance.
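
The mechanism can be sketched without Spark: build a hash map from the small side once, then let each scan of the large side probe it locally. This is a minimal plain-Python illustration, not Spark's implementation; `broadcast_hash_join` and the sample rows are hypothetical.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts by building a hash map of the small side."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    joined = []
    for row in large_rows:               # each "worker" scans only its own rows
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined

orders = [
    {"country": "DE", "amount": 10},
    {"country": "FR", "amount": 7},
    {"country": "XX", "amount": 1},      # no matching dimension row
]
countries = [
    {"country": "DE", "name": "Germany"},
    {"country": "FR", "name": "France"},
]
result = broadcast_hash_join(orders, countries, "country")
```

Because the lookup table travels to the data, the large side is never repartitioned by the join key, which is where the saving comes from.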

Q43. What is Data Skew?

Answer

Uneven data distribution.

Causes:

Slow tasks
Executor imbalance

Solutions:

Salting
Repartitioning
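
A small plain-Python simulation can show why salting helps. It assumes rows are assigned to partitions by hashing the key; `partition_of`, `max_partition_load`, and `SALT_BUCKETS` are hypothetical names, and `zlib.crc32` stands in for Spark's hash function.

```python
import zlib
from collections import Counter

def partition_of(key: str, num_partitions: int) -> int:
    # crc32 is used instead of hash() so results are stable across runs.
    return zlib.crc32(key.encode()) % num_partitions

def max_partition_load(keys, num_partitions):
    """Return the row count of the most loaded partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return max(counts.values())

# 900 rows share one hot key; 100 rows have distinct keys.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]

# Salting: append a salt in [0, SALT_BUCKETS) so the hot key becomes several keys.
SALT_BUCKETS = 8
salted = [f"{k}_{i % SALT_BUCKETS}" for i, k in enumerate(keys)]

unsalted_max = max_partition_load(keys, 8)
salted_max = max_partition_load(salted, 8)
```

In a real join the other table has to be expanded with every salt value so that matching rows still meet, which is the cost paid for the more even distribution.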

Q44. What is Adaptive Query Execution (AQE)?

Answer

Runtime optimization feature.

Benefits:

Dynamic partition sizing
Join optimization

Q45. How do you identify bottlenecks?

Answer

Use:

Spark UI
Cluster metrics (Ganglia on older runtimes)
Query Plan
Event logs

8. Streaming

Q46. What is Structured Streaming?

Answer

Spark’s streaming engine.

Supports:

Exactly-once processing
Fault tolerance

Q47. What are streaming sources?

Answer

Kafka
Kinesis
Delta Lake
S3

Q48. What is checkpointing?

Answer

Stores processing state.

.option("checkpointLocation","/checkpoint")

Q49. What is watermarking?

Answer

Handles late arriving data.

Example:

.withWatermark("timestamp","10 minutes")
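
The rule behind a watermark can be sketched in plain Python: track the largest event time seen so far and drop anything older than that maximum minus the allowed delay. This is a per-event simplification (Spark advances the watermark between micro-batches), and `apply_watermark` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def apply_watermark(events, delay: timedelta):
    """Split events into (accepted, dropped) using a max-event-time watermark."""
    max_event_time = None
    accepted, dropped = [], []
    for event_time, payload in events:          # events in arrival order
        if max_event_time is not None and event_time < max_event_time - delay:
            dropped.append(payload)              # older than the watermark
            continue
        accepted.append(payload)
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return accepted, dropped

t0 = datetime(2025, 1, 1, 12, 0)
events = [
    (t0, "a"),
    (t0 + timedelta(minutes=20), "b"),
    (t0 + timedelta(minutes=15), "c"),   # 5 minutes late: within the 10-minute delay
    (t0 + timedelta(minutes=5), "d"),    # 15 minutes late: beyond the delay
]
accepted, dropped = apply_watermark(events, timedelta(minutes=10))
```

The delay is a trade-off: a larger value tolerates later data but forces the engine to keep aggregation state around for longer.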

Q50. Difference between batch and streaming?

Batch	Streaming
Historical	Real-time
Scheduled	Continuous

9. Unity Catalog

Q51. What is Unity Catalog?

Answer

Centralized governance solution.

Features:

Data discovery
Access control
Lineage
Auditing

Q52. Benefits of Unity Catalog?

Answer

Central governance
Fine-grained access
Cross-workspace sharing

Q53. Explain hierarchy.

Answer

Metastore
 └ Catalog
    └ Schema
       └ Table

Q54. What is Data Lineage?

Answer

Tracks:

Source → Transformation → Destination

10. Security

Q55. How is Databricks secured?

Answer

Security Layers:

IAM
VPC
Encryption
Unity Catalog
Private Link

Q56. Encryption at Rest?

Answer

AWS KMS encrypts:

S3 data
Metadata

Q57. Encryption in Transit?

Answer

TLS/SSL.

Q58. What is Private Link?

Answer

Private AWS connectivity without internet exposure.

Q59. Explain Role-Based Access Control.

Answer

Access assigned through roles.

Examples:

Admin
Data Engineer
Analyst

Q60. How do you audit activities?

Answer

Using:

Audit Logs
CloudTrail
Unity Catalog

11. CI/CD & DevOps

Q61. How do you deploy Databricks code?

Answer

Tools:

GitHub
Azure DevOps
Jenkins
Terraform

Q62. What is Databricks Repos?

Answer

Git integration inside Databricks.

Q63. What is Terraform?

Answer

Infrastructure as Code tool.

Used to create:

Workspaces
Clusters
Jobs

Q64. What are Databricks Jobs?

Answer

Workflow automation service.

Q65. How do you schedule jobs?

Answer

Quartz cron expressions (seconds, minutes, hours, day of month, month, day of week).

Example (daily at midnight):

0 0 0 * * ?

12. Scenario-Based Questions

Q66. Small files problem?

Answer

Use:

OPTIMIZE table_name

Q67. Pipeline suddenly became slow. What would you check?

Answer

Spark UI
Data skew
Cluster sizing
Recent code changes
Shuffle volume

Q68. How would you process 10TB daily?

Answer

Partitioning
Delta Lake
Auto-scaling clusters
Parallel processing

Q69. How do you design a Bronze-Silver-Gold architecture?

Answer

Bronze:
Raw data

Silver:
Cleaned data

Gold:
Business-ready aggregates

Q70. How would you migrate on-prem Hadoop to Databricks?

Answer

Steps:

Move data to S3
Convert to Delta
Rebuild ETL
Optimize
Validate

13. Advanced Databricks Questions

Q71. Explain Photon Engine.

Answer

Vectorized query engine.

Benefits:

Faster SQL
Lower costs
Better performance

Q72. What is Delta Live Tables?

Answer

Managed ETL framework.

Features:

Quality checks
Lineage
Incremental processing

Q73. What is Auto Loader?

Answer

Incremental file ingestion.

Uses the cloudFiles source format.

Supports billions of files.

Q74. Difference between Auto Loader and Batch Ingestion?

Answer

Auto Loader:

Incremental
Event-driven

Batch:

Full scans

Q75. What is Change Data Feed?

Answer

Tracks inserts, updates, deletes.

Useful for CDC pipelines.

14. AWS Data Engineering Scenarios

Q76. Build a real-time pipeline using AWS and Databricks.

Answer

Architecture:

Kinesis → Databricks Streaming → Delta Lake → Power BI/Tableau

Q77. Design a Data Lakehouse.

Answer

AWS S3
↓
Bronze
↓
Silver
↓
Gold
↓
BI/ML

Q78. How do you handle GDPR deletion?

Answer

Delta delete
Vacuum
Audit logging

Q79. How do you implement CDC?

Answer

Tools:

AWS DMS
Debezium
Delta CDF

Q80. Explain Medallion Architecture.

Answer

Bronze → Silver → Gold

Most common Databricks architecture.

15. Expert-Level Questions

Q81. Explain Catalyst Optimizer.

Answer

Spark SQL optimization engine.

Stages:

Analysis
Logical Plan
Optimization
Physical Plan

Q82. Explain Tungsten.

Answer

Spark execution engine optimization.

Benefits:

Better memory management
CPU efficiency

Q83. What causes shuffle?

Answer

Join
GroupBy
OrderBy
Distinct

Q84. Explain Narrow vs Wide Transformations.

Answer

Narrow:

Map
Filter

Wide:

Join
GroupBy

Q85. Explain Executor Memory Tuning.

Answer

Optimize:

Executor count
Executor memory
Core allocation

Q86. How do you optimize joins?

Answer

Broadcast joins
Bucketing
Partition pruning

Q87. Explain Dynamic File Pruning.

Answer

Reduces file scans during joins.

Q88. What is Delta Log?

Answer

Transaction log.

Stored under:

_delta_log/

Q89. Explain Merge Operation.

Answer

UPSERT support.

MERGE INTO

Q90. Explain CDC Merge Pattern.

Answer

Insert, Update, Delete using MERGE.
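
The semantics of that pattern can be sketched in plain Python by applying change records to a table keyed by primary key. This is an illustration of the logic only, not Delta's MERGE; `apply_changes` and the `(op, key, row)` record shape are assumptions made for the example.

```python
def apply_changes(target: dict, changes: list) -> dict:
    """Apply CDC records (op, key, row) to a table keyed by primary key."""
    result = dict(target)
    for op, key, row in changes:
        if op == "delete":
            result.pop(key, None)            # matched and flagged as delete
        elif op in ("insert", "update"):
            result[key] = row                # matched -> update, not matched -> insert
        else:
            raise ValueError(f"unknown operation: {op}")
    return result

table = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
changes = [
    ("update", 1, {"name": "Anna"}),
    ("delete", 2, None),
    ("insert", 3, {"name": "Cy"}),
]
merged = apply_changes(table, changes)
```

Applying the same change set a second time leaves the result unchanged, which is why key-based merges are a common way to make CDC loads safe to retry.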

16. Leadership & Architect Questions

Q91. How would you reduce Databricks costs?

Answer

Spot instances
Auto termination
Photon
Optimize jobs

Q92. How would you design enterprise governance?

Answer

Unity Catalog
RBAC
Audit logs
Data lineage

Q93. How do you support multiple teams?

Answer

Separate catalogs
Shared governance
Workload isolation

Q94. How would you migrate 500TB to Databricks?

Answer

Phased migration:

Assessment
Data movement
Validation
Cutover

Q95. What KPIs would you monitor?

Answer

Job duration
Cluster utilization
Cost per workload
Data freshness

17. Frequently Asked Hands-On Coding Questions

Q96. Read Delta Table

df = spark.read.format("delta").load(path)

Q97. Write Delta Table

df.write.format("delta").save(path)

Q98. Merge Delta Table

MERGE INTO target t
USING source s
ON t.id=s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Q99. Optimize Table

OPTIMIZE sales;

Q100. Vacuum Table

VACUUM sales RETAIN 168 HOURS;

Top 20 Questions Most Frequently Asked in Senior Databricks + AWS Interviews

Explain Databricks Architecture.
Difference between Delta Lake and Data Lake.
Explain Medallion Architecture.
What is Unity Catalog?
What is Photon Engine?
Explain Auto Loader.
What is Delta Live Tables?
Explain CDC using Delta Lake.
How does Databricks integrate with AWS?
What is an Instance Profile?
Explain Spark DAG.
What is Data Skew?
Explain AQE.
Difference between Repartition and Coalesce.
Explain Broadcast Join.
Explain Catalyst Optimizer.
Explain MERGE INTO.
How do you optimize a slow Spark job?
Design a real-time AWS + Databricks pipeline.
Design a secure enterprise Lakehouse architecture.

For senior-level Databricks Engineer, Data Engineer, Solution Architect, and AWS Data Platform interviews (10–15+ years experience), candidates are also commonly tested on advanced Lakehouse architecture, Unity Catalog governance, Delta Live Tables, streaming design patterns, cost optimization, CI/CD, Terraform, and GenAI integration with Databricks.

This guide covers a comprehensive set of interview questions for Databricks Engineer roles, emphasizing candidates with AWS experience. Questions are grouped by category, progressing from foundational to advanced/scenario-based. Answers include key details, best practices, and AWS-specific integrations.

1. Core Databricks and Architecture

Q: What is Databricks, and how does it differ from standard Apache Spark? A: Databricks is a unified data analytics and AI platform built on Apache Spark. It provides a managed, collaborative environment with notebooks, optimized runtime (Databricks Runtime), and features like Delta Lake for reliability. Unlike open-source Spark (which requires manual cluster management), Databricks offers auto-scaling, built-in security (Unity Catalog), workflow orchestration, and a lakehouse architecture combining data lakes and warehouses.

Q: Explain the Databricks Lakehouse architecture. A: It layers Delta Lake (storage with ACID, schema enforcement, time travel) on cloud object storage, with compute via Spark clusters, governance via Unity Catalog, and tools for ETL, BI, and ML. On AWS, storage uses S3, compute uses EC2 instances managed by Databricks.

Q: What are the different cluster types in Databricks, and when do you use them? A:

All-Purpose Clusters: Interactive development/notebooks (multi-user).
Job Clusters: Automated, ephemeral jobs (cost-efficient, auto-terminate).
SQL Warehouses: For SQL queries and BI tools (serverless or pro).

On AWS, configure clusters with suitable instance types (e.g., i3 for storage-heavy workloads), spot instances for cost savings, and auto-scaling policies.

Q: How does Databricks integrate with AWS services? A:

S3: Primary storage (mount via DBFS or direct paths like s3://bucket/).
IAM: Instance profiles for secure access (least privilege roles).
VPC/PrivateLink: Secure networking without public internet.
CloudWatch: Monitoring metrics/logs.
Glue: Catalog integration (or use Unity Catalog).
EMR: Comparison point—Databricks is easier for Spark but has DBU costs.

2. Delta Lake and Data Storage

Q: What is Delta Lake, and how does it provide ACID transactions on S3? A: Delta Lake is an open storage layer adding reliability to data lakes. It uses transaction logs (_delta_log) for ACID properties: Atomicity (all-or-nothing), Consistency, Isolation (MVCC), Durability. S3 has provided strong read-after-write consistency since December 2020; Delta Lake layers atomic commits on top of it through the transaction log, optimistic concurrency control, and periodic log checkpoints.

Q: Explain Time Travel, VACUUM, and OPTIMIZE in Delta Lake. A:

Time Travel: SELECT * FROM table VERSION AS OF 5 or TIMESTAMP AS OF for auditing/recovery.
OPTIMIZE: Compacts small files (Z-Ordering for clustering).
VACUUM: Removes old files (default 7-day retention; use DRY RUN first).

Best practice: Schedule OPTIMIZE + ZORDER on frequently filtered columns.

Q: How do you handle schema evolution in Delta Lake? A: Use mergeSchema or overwriteSchema options. Set spark.databricks.delta.schema.autoMerge.enabled = true. Supports add columns; restrictive for removals (use replaceWhere carefully).

Q: Delta Lake vs. Parquet? A: Delta adds ACID, schema enforcement, versioning, and unified batch/streaming on top of Parquet files.

3. Spark and Performance Optimization

Q: How do you optimize Spark jobs in Databricks? A:

Partitioning and bucketing.
Caching/persistence (df.cache() or MEMORY_AND_DISK).
Broadcast joins for small tables (spark.sql.autoBroadcastJoinThreshold).
Adaptive Query Execution (AQE) — enabled by default in recent runtimes.
Photon acceleration (vectorized engine).
File compaction and Z-ordering.

Q: Explain small files problem and how to solve it on S3. A: Many small files increase metadata overhead and slow listing. Fix with OPTIMIZE, Auto Loader (with cloudFiles for incremental), or repartition before write. Monitor with Spark UI.

Q: Difference between transformations and actions? Lazy evaluation? A: Transformations (e.g., filter, select) build DAG lazily. Actions (e.g., count, write) trigger execution. This optimizes by avoiding unnecessary computation.
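
Lazy evaluation can be illustrated with a tiny plain-Python class that only records transformations and runs them when an action is called. It is a sketch of the idea, not of Spark's planner; `LazyDataset` and `double` are hypothetical names.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan                  # tuple of (kind, function) steps

    def filter(self, predicate):
        return LazyDataset(self._source, self._plan + (("filter", predicate),))

    def map(self, func):
        return LazyDataset(self._source, self._plan + (("map", func),))

    def collect(self):                     # action: executes the recorded plan
        rows = iter(self._source)
        for kind, func in self._plan:
            rows = filter(func, rows) if kind == "filter" else map(func, rows)
        return list(rows)

calls = []

def double(x):
    calls.append(x)                        # side effect to show when work happens
    return x * 2

ds = LazyDataset(range(6)).filter(lambda x: x % 2 == 0).map(double)
calls_before_action = list(calls)          # still empty: nothing has run yet
result = ds.collect()
```

Because the whole plan is known before anything runs, an engine has the chance to reorder or combine steps, and `double` is only ever called on rows that survive the filter.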

Q: How do you handle skew in joins? A: Salting (add random key), broadcast if possible, or AQE skew join optimization.

4. AWS-Specific and Integration

Q: How do you securely access S3 from Databricks? A: Use IAM instance profiles attached to clusters (assume roles). Avoid access keys. Enable S3 encryption (SSE-KMS). Use Unity Catalog for governance.

Q: Databricks vs. AWS EMR? A: Databricks offers better UX, Delta Lake, unified analytics/ML, and easier management but higher cost (DBUs + EC2). EMR is cheaper for pure Hadoop/Spark batch, deeper AWS-native, but requires more ops effort (e.g., no native notebooks).

Q: How would you design an ETL pipeline from S3/on-prem to Databricks? A:

Ingest with Auto Loader (cloudFiles).
Transform in notebooks or Delta Live Tables (DLT).
Orchestrate with Databricks Workflows (Jobs).
Load to Delta tables on S3.
Use AWS Glue/SNS for notifications or Lambda triggers.

Q: Explain networking and security best practices on AWS Databricks. A: VPC peering/PrivateLink, security groups, IP access lists, cluster policies (restrict instance types), secrets in Databricks Secret Scope or AWS Secrets Manager. Least privilege IAM.

5. Orchestration, Streaming, and Advanced Features

Q: What are Databricks Workflows and Delta Live Tables (DLT)? A: Workflows: Orchestration with tasks, dependencies, alerts. DLT: Declarative pipelines with expectations (quality), auto materialization, change data capture. Use DLT for reliable streaming/batch.

Q: How do you implement streaming in Databricks? A: Structured Streaming with Auto Loader for S3 sources. Use foreachBatch for merges. Watermarking for late data. Output to Delta tables.

Q: Unity Catalog — what is it and why use it? A: Centralized governance: 3-level namespace (catalog.schema.table), RBAC, data lineage, auditing. Replaces Hive metastore. Essential for multi-team/secure environments.

6. Scenario-Based and Troubleshooting (AWS Focus)

Q: A job reading from S3 is slow. How do you troubleshoot? A: Check Spark UI (stages, tasks), data skew, small files, network (cross-AZ), instance types. Enable S3A committer, increase parallelism, use predicate pushdown.

Q: How do you optimize costs in Databricks on AWS? A: Job clusters over all-purpose, auto-termination, spot instances, cluster policies, right-size (Photon for faster runs), monitor DBU usage, schedule with Workflows. Use serverless SQL where possible.

Q: Handle concurrent writes to the same Delta table? A: Delta’s optimistic concurrency handles it (retries on conflict). Use MERGE INTO for upserts.
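
The retry-on-conflict loop can be sketched in plain Python. This is a deliberately simplified model: it treats any concurrent commit as a conflict, whereas Delta only fails a commit when the concurrent changes actually overlap. `TableLog`, `CommitConflict`, and `commit_with_retry` are hypothetical names.

```python
class CommitConflict(Exception):
    """Raised when another writer committed first."""

class TableLog:
    """In-memory stand-in for a transaction log with numbered commits."""

    def __init__(self):
        self.commits = []                  # index in this list = version number

    def latest_version(self) -> int:
        return len(self.commits) - 1       # -1 means the table has no commits yet

    def try_commit(self, read_version: int, actions: list) -> int:
        # The commit succeeds only if nobody else committed since we read.
        if read_version != self.latest_version():
            raise CommitConflict(f"table moved past version {read_version}")
        self.commits.append(actions)
        return self.latest_version()

def commit_with_retry(log: TableLog, make_actions, max_attempts: int = 3) -> int:
    """Read the latest version, build actions, and retry if a conflict occurs."""
    for _ in range(max_attempts):
        read_version = log.latest_version()
        actions = make_actions(read_version)   # recompute against the new snapshot
        try:
            return log.try_commit(read_version, actions)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after repeated conflicts")
```

Writers never take a lock up front; they do their work against a snapshot and only the final commit step is contended, which suits workloads where conflicts are rare.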

Q: Design a solution for streaming + batch without resource contention. A: Separate clusters/pools or use serverless. Multi-task workflows with different cluster configs. DLT for unified pipelines.

Q: Experience with CI/CD for Databricks pipelines? A: Use Databricks CLI, Terraform for IaC (workspaces, clusters, jobs), Git integration for notebooks, dbx or Databricks Asset Bundles (DABs) for deployment.

7. Behavioral and Experience Questions

Describe a complex pipeline you built on Databricks + AWS.
How did you handle production issues (e.g., job failures due to S3 throttling)?
Experience with MLflow for model management?
How do you ensure data quality and governance?

Preparation Tips

Hands-on: Practice with Databricks Community Edition or AWS trial (Auto Loader, DLT, Workflows, Unity Catalog).
Know Spark UI deeply, Delta commands (DESCRIBE HISTORY, GENERATE symlink_format_manifest for Athena).
AWS: IAM roles, S3 best practices, VPC, CloudWatch.
Emphasize lakehouse benefits over traditional data lakes/warehouses.

This covers the most common and critical topics based on real interview patterns. Tailor answers to your experience, and be ready for live coding (PySpark/Delta SQL) or system design. Good luck!

Part 1: AWS + Databricks Integration (Core)

Q1: How does Databricks integrate with the AWS ecosystem?

Answer: Databricks runs on AWS in a customer’s VPC (or Databricks-managed VPC). Key integrations:

Data Lake: Uses S3 as primary storage (Delta Lake).
IAM Roles: Instance profiles for EC2 to access S3, Glue Catalog, etc.
Networking: VPC peering or PrivateLink for secure communication.
Metadata: Glue Metastore can be used as Hive metastore.
Monitoring: CloudWatch metrics, S3 access logs, CloudTrail.
Security: AWS KMS for encryption, Secrets Manager for credentials.

Q2: How do you configure Databricks to assume an IAM role to access S3?

Answer:

Create an IAM role with a policy allowing s3:GetObject, s3:PutObject, and s3:ListBucket.
Attach a trust policy allowing ec2.amazonaws.com and the Databricks account ID.
In Databricks, create an Instance Profile (upload role ARN).
Attach instance profile to cluster (in Advanced options → Instance Profile).
Access: spark.read.format("delta").load("s3://bucket/path")

Q3: How would you set up a secure Databricks workspace on AWS?

Answer:

Network: Deploy in your own VPC (no public IPs), enable PrivateLink for UI/API access.
Security Groups: Restrict ingress to corporate VPN/Databricks control plane.
Storage: Enable S3 bucket policies and KMS encryption.
Auth: SSO with AWS IAM Identity Center or SAML 2.0.
Secrets: Store credentials in Databricks secret scopes, or fetch them from AWS Secrets Manager at runtime.
Audit: Enable CloudTrail + Databricks audit logs.

Part 2: Delta Lake & Table Management

Q4: What is Delta Lake and how does it interact with S3?

Answer: Delta Lake is an ACID table storage layer on top of Parquet files.

Storage: Delta table = s3://bucket/path/table/ containing .parquet files + _delta_log/.
Write guarantees: ACID, time travel, schema enforcement.
Optimization: Z-ordering, vacuum, optimize (compaction).
AWS: S3 has been strongly consistent since December 2020; Delta Lake uses its commit log to make multi-file writes atomic.

Q5: Explain how to perform time travel on a Delta table stored in S3.

Answer:

sql

-- By version
SELECT * FROM my_table VERSION AS OF 5

-- By timestamp
SELECT * FROM my_table TIMESTAMP AS OF '2025-01-01'

-- Python
df = spark.read.format("delta") \
  .option("versionAsOf", 5) \
  .load("s3://path/table")

Under the hood: Delta log stores transaction history. Time travel reads the table state at that version.
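
A minimal plain-Python sketch of that replay: each commit lists `add` and `remove` actions, and the table state at a version is whatever files remain after replaying commits up to it. The commit history below is hypothetical and heavily simplified compared with real log entries; `files_at_version` is an illustrative helper.

```python
def files_at_version(commits, version):
    """Replay add/remove actions up to `version` and return the live files."""
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

# Hypothetical commit history; each commit is a list of actions.
commits = [
    [{"add": {"path": "part-0.parquet"}}],          # version 0
    [{"add": {"path": "part-1.parquet"}}],          # version 1
    [                                               # version 2: compaction
        {"remove": {"path": "part-0.parquet"}},
        {"remove": {"path": "part-1.parquet"}},
        {"add": {"path": "part-2.parquet"}},
    ],
]
v1_files = files_at_version(commits, 1)
v2_files = files_at_version(commits, 2)
```

Reading version 1 still needs `part-0.parquet` and `part-1.parquet` to exist physically, which is why time travel only reaches back as far as the files that have not yet been vacuumed.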

Q6: What is `VACUUM` in Delta Lake, and how do you manage it on AWS S3?

Answer: VACUUM removes old file versions (not needed for time travel).

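Default retention: 7 days (168 hours). A shorter retention is rejected unless the safety check spark.databricks.delta.retentionDurationCheck.enabled is turned off, which is not advisable in production.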
Command: VACUUM delta.`/mnt/table` RETAIN 168 HOURS
S3 cost impact: Reduces storage cost by deleting unreferenced files.
Caution: Don’t vacuum if you have concurrent readers on older versions.

Part 3: Performance Tuning on AWS

Q7: How do you handle data skew when joining large tables on Databricks/AWS?

Answer:

Salting: Add a random salt key to distribute skewed key.
Auto-optimize: spark.sql.adaptive.skewJoin.enabled=true (AQE).
Broadcast hint: For small tables (the default auto-broadcast threshold, spark.sql.autoBroadcastJoinThreshold, is 10 MB) → /*+ BROADCAST(small_table) */.
Cluster sizing: Size workers so the largest shuffle partitions fit in memory; reserve spot instances for non-critical workloads.
Partition pruning: Use partition columns (e.g., year/month/day) on S3 paths.

Q8: Explain how you would optimize a Spark job reading many small files from S3.

Answer:

Problem: S3 list + open overhead → many tasks.
Solution:
- OPTIMIZE (Delta) to coalesce small files.
- spark.sql.files.maxPartitionBytes=256MB
- spark.sql.files.openCostInBytes=4MB
- Use Auto Loader with file notification mode (SQS) to avoid listing.
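- Clustering: CLUSTER BY key (liquid clustering) keeps related rows together; Hive-style bucketing is not supported on Delta tables.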

Q9: What is Photon and when should you enable it on AWS?

Answer: Photon is Databricks’ native vectorized query engine (C++).

When to enable: Complex SQL aggregations, joins, window functions; Parquet/Delta scans.
Not beneficial: UDF-heavy workloads, row-by-row ops.
AWS: Works with all EC2 instances (optimized on i3/enhanced networking).
Toggle: spark.databricks.photon.enabled=true (SQL Analytics or DBR 9+).

Part 4: Data Engineering & ETL

Q10: How do you implement incremental ETL from S3 to Delta Lake?

Answer:

Auto Loader (structured streaming):

python

query = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "json") \
  .option("cloudFiles.schemaLocation", "s3://checkpoint/") \
  .load("s3://raw-bucket/") \
  .writeStream.format("delta") \
  .option("checkpointLocation", "s3://checkpoint/") \
  .toTable("bronze_table")  # toTable starts the stream and returns a StreamingQuery

For batch: Use merge (upsert) with last_modified timestamp.

Q11: Explain idempotent writes in Databricks on S3.

Answer: Idempotent = same result if run multiple times.

Streaming: checkpointLocation in S3 ensures exactly-once.
Batch: Use INSERT OVERWRITE with partition or MERGE with unique key.
Delta: MERGE INTO target USING updates ON key WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT *
S3 note: Avoid rename-based commits; Delta’s transaction log handles atomicity.

Q12: You are asked to reduce S3 GET/LIST request costs. How?

Answer:

Use Delta’s OPTIMIZE to reduce number of files.
Enable S3 inventory + partition pruning to limit scanned prefixes.
Use Auto Loader’s file notification mode (SQS) instead of directory listing.
Increase spark.sql.files.maxPartitionBytes to reduce task count.
Cache frequently accessed tables using spark.table(...).cache().

Part 5: Migration to Databricks on AWS

Q13: How would you migrate Hive tables from AWS EMR to Databricks?

Answer:

Metadata: Use Glue Metastore → attach same Glue catalog in Databricks.
Data: Leave the data in S3 and convert the tables to Delta format in place.
Convert to Delta:

sql

CONVERT TO DELTA parquet.`s3://old/table/path`

Performance: Run OPTIMIZE and ANALYZE TABLE.
Validation: Compare row counts, checksums.

Q14: What is a deep clone vs shallow clone in Delta and their AWS cost implications?

Answer:

Shallow clone: Copies only metadata (pointers to existing Parquet files). No extra S3 storage → cheap, fast.
Deep clone: Physically copies all data → new S3 objects, higher storage cost.
Use shallow clone for testing/branching; deep clone for archival or breaking dependency.

Part 6: Security & Governance

Q15: How do you implement column-level access control in Databricks on AWS?

Answer:

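Dynamic views: Create a view with masking logic (e.g., based on is_account_group_member()).
Unity Catalog (UC): Best approach.
- Create a Unity Catalog metastore at the Databricks account level and attach the workspace.
- Grant table-level privileges with GRANT SELECT ON TABLE ... TO principal, then restrict columns with column masks or dynamic views (UC grants are not column-scoped).
- Use storage credentials (IAM roles) and external locations for S3 access.
Legacy: Hive metastore table access control plus row/column filtering through views.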

Q16: How do you rotate AWS access keys used by Databricks jobs?

Answer:

Recommended: Use Instance Profiles (IAM role attached to EC2) → no keys to rotate.
If keys must be used: Store in Databricks Secrets + AWS Secrets Manager.
Rotate process:
1. Generate new keys in AWS IAM.
2. Update secret in AWS Secrets Manager.
3. Update the corresponding Databricks secret (CLI or REST API); Databricks-backed scopes do not pick up rotated keys automatically.
4. Avoid hardcoding.

Part 7: Cost Optimization

Q17: How do you reduce costs of S3 + Databricks?

Answer:

Storage:
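- Enable S3 lifecycle policies with care: Delta tables cannot read files archived to Glacier until they are restored, so archive only data that queries no longer touch.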
- Run VACUUM regularly.
- Use OPTIMIZE for fewer, larger files.
Compute:
- Use Spot Instances for non-critical tasks.
- Enable Cluster auto-scaling (min→max).
- Use SQL Serverless (pay per query, no cluster management).
- Terminate idle clusters (set auto-termination to 30 min).
Data transfer: Keep compute in the same AWS region as the S3 bucket (buckets are regional, not tied to an AZ).

Q18: Explain difference between `OPTIMIZE` and `ZORDER BY`.

Answer:

OPTIMIZE: Compacts small files into larger ones (improves read speed).
ZORDER BY: An optional clause of OPTIMIZE that clusters related data in the same files (improves data skipping for filters).
Example:

sql

OPTIMIZE my_table
ZORDER BY (event_date, user_id)

On AWS: Both generate new Parquet files in S3; old files removed after VACUUM.

Part 8: Monitoring & Troubleshooting

Q19: How do you debug a slow Spark job reading from S3?

Answer:

Spark UI: Look for:
- High task count → many small files.
- Skewed tasks → data skew.
- Large input read time → S3 latency.
S3 metrics: CloudWatch → high GET/LIST latency, 503 throttling.
Fix:
- Enable s3a.fast.upload and spark.hadoop.fs.s3a.connection.maximum=100.
- Use OPTIMIZE or REPARTITION.
- Increase spark.sql.shuffle.partitions dynamically.

Q20: How would you troubleshoot “S3 request rate exceeded” from Databricks?

Answer:

Cause: Too many parallel requests to the same S3 prefix.
Diagnose: Check Spark UI → S3 task retries; look for HTTP 503 SlowDown errors in driver/executor logs and 5xx error metrics in CloudWatch.
Fixes:
- Add partition columns (e.g., year=2025/month=02).
- Spread objects across more key prefixes (e.g., s3://bucket/prefix1/, s3://bucket/prefix2/); S3 request limits apply per prefix.
- Enable fs.s3a.attempts.maximum=20 and fs.s3a.retry.interval=500ms.
- Reduce spark.sql.shuffle.partitions if too high.
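
The retry settings above amount to backing off when S3 throttles. A minimal plain-Python sketch of exponential backoff with jitter follows; `ThrottledError` stands in for a 503 SlowDown response, and `call_with_backoff` with its parameters is a hypothetical helper, not part of any S3 client.

```python
import random
import time

class ThrottledError(Exception):
    """Stands in for an S3 503 SlowDown response in this sketch."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      sleep=None, rng=None):
    """Retry `operation` on ThrottledError with exponential backoff and jitter."""
    sleep = sleep or time.sleep
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(rng.uniform(0, base_delay * (2 ** attempt)))
```

The jitter matters as much as the exponent: without it, many tasks throttled at the same moment would all retry at the same moment and be throttled again.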

Part 9: Scenario-Based

Q21: Your AWS Databricks job fails with “java.net.SocketTimeoutException: Connect to s3.amazonaws.com:443”. Why?

Answer:

Possible causes:
- VPC/security group blocking outbound HTTPS to S3.
- S3 gateway endpoint missing from route tables.
- Instance profile missing S3 permissions.
- Network ACLs on subnet.
Fix: Add S3 VPC endpoint, verify IAM role, check security group egress.

Q22: Design a GDPR-compliant data pipeline using Databricks on AWS.

Answer:

S3 buckets: Encrypted at rest with KMS (customer-managed key).
Delta tables: Enable delta.enableChangeDataFeed = true for audit.
PII masking: Unity Catalog dynamic views / column masking.
Delete user data: Use MERGE or DELETE FROM + VACUUM to purge.
Logging: CloudTrail + Databricks audit logs to S3 (immutable).
Access: IAM + SCIM to enforce least privilege.

Part 10: Coding & SQL (Examples)

Q23: Write a Databricks notebook cell to read from Kinesis (AWS) and upsert into Delta.

Answer:

python

stream_df = spark.readStream \
  .format("kinesis") \
  .option("streamName", "my-stream") \
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com") \
  .option("awsAccessKey", dbutils.secrets.get("aws", "key")) \
  .option("awsSecretKey", dbutils.secrets.get("aws", "secret")) \
  .load()

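from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Assumes the Delta table exists and sequenceNumber uniquely identifies a record.
    target = DeltaTable.forPath(spark, "/mnt/delta/table")
    target.alias("t") \
      .merge(batch_df.dropDuplicates(["sequenceNumber"]).alias("s"),
             "t.sequenceNumber = s.sequenceNumber") \
      .whenMatchedUpdateAll() \
      .whenNotMatchedInsertAll() \
      .execute()

query = stream_df.writeStream \
  .foreachBatch(upsert_batch) \
  .option("checkpointLocation", "s3://checkpoint/kinesis/") \
  .start()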

Q24: Write SQL to find duplicate records in a Delta table and deduplicate keeping latest based on `updated_at`.

Answer:

sql

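-- Find ids that occur more than once
SELECT id, COUNT(*) AS cnt
FROM my_delta_table
GROUP BY id
HAVING COUNT(*) > 1;

-- Keep only the latest row per id by rewriting the table
INSERT OVERWRITE my_delta_table
SELECT * EXCEPT (rn) FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM my_delta_table
) WHERE rn = 1;
-- A MERGE-based rewrite is an alternative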

Final Tips for Interview

Area	Must-Know
AWS	S3 consistency, IAM roles, VPC endpoints, KMS, Glue Metastore
Databricks	Delta Lake, Unity Catalog, Photon, Auto Loader, Structured Streaming
Performance	Partitioning, Z-order, OPTIMIZE, AQE, bucketing
Security	Instance profile, secrets, private link, audit logs
Cost	Spot instances, auto-scaling, lifecycle policies, VACUUM