Spark SQL is Apache Spark’s module for working with structured data. It provides a familiar SQL interface (and the DataFrame/Dataset API) while leveraging Spark’s distributed, in-memory processing engine for high performance and scalability.
It unifies relational processing with Spark’s functional programming model, allowing seamless mixing of SQL queries with other Spark operations (e.g., MLlib, Streaming). The same execution engine is used regardless of whether you use SQL, DataFrames, or Datasets.
Key Concepts and Architecture
- DataFrames: A distributed collection of data organized into named columns (like a relational table or pandas DataFrame). They provide rich optimizations via the Catalyst optimizer and Tungsten execution engine. DataFrames are the primary abstraction in Spark SQL.
- Datasets: An extension of DataFrames (Scala/Java primarily) that adds type safety while retaining the performance benefits. In Python (PySpark), DataFrames provide most Dataset-like benefits due to dynamic typing.
- SparkSession: The unified entry point (introduced in Spark 2.0) for working with Spark SQL, DataFrames, etc. Example in PySpark:Python
from pyspark.sql import SparkSession spark = SparkSession.builder.appName("MyApp").getOrCreate() - Catalyst Optimizer: The extensible query optimizer at the heart of Spark SQL. It performs analysis, logical optimization (rule-based, e.g., predicate pushdown, constant folding), physical planning (cost-based), and code generation (Whole-Stage Code Generation via Tungsten). This enables significant performance gains over raw RDDs.
- Tungsten: Focuses on memory/CPU efficiency with off-heap memory management, cache-friendly data layouts, and bytecode generation.
- Execution Model: Queries are lazy (transformations build a DAG until an action triggers computation). Spark handles fault tolerance via lineage.
Comparison to Hive:
- Spark SQL is generally faster (in-memory, optimized execution) and supports mixing SQL with programmatic code.
- Hive (on MapReduce/Tez) is more batch-oriented; Spark SQL reuses Hive metastore and is largely compatible with HiveQL but adds many optimizations.
- Spark SQL excels in iterative workloads, ML integration, and interactive queries.
Vs. Presto/Trino: Presto is often faster for low-latency, interactive ad-hoc SQL analytics (no heavy shuffling for joins in some cases). Spark is better for unified batch + streaming + ML pipelines and complex ETL.
Core Features
- Unified Data Access: Read/write from Parquet, ORC, Avro, JSON, CSV, JDBC, Hive tables, etc., with schema inference or explicit schemas.
- SQL Support: ANSI SQL compliance (with extensions). Supports DDL, DML, queries, window functions, common table expressions (CTEs), etc.
- Integration: Hive metastore compatibility, UDFs/UDAFs (Scala/Java/Python), integration with Arrow for pandas.
- Data Sources: Built-in support for many formats; custom sources via APIs.
- Performance Features: Caching (in-memory columnar), broadcast joins, Adaptive Query Execution (AQE — dynamic re-optimization at runtime, enabled by default in recent versions), partition pruning, predicate pushdown, skew handling.
Getting Started (PySpark Example)
Python
# Create SparkSession
spark = SparkSession.builder.getOrCreate()
# Read data
df = spark.read.parquet("path/to/data")
df.createOrReplaceTempView("my_table")
# SQL query
result = spark.sql("""
SELECT category, COUNT(*) as cnt, AVG(sales) as avg_sales
FROM my_table
WHERE date >= '2025-01-01'
GROUP BY category
HAVING cnt > 100
""")
result.show()
result.write.parquet("output/")DataFrames API vs. SQL
You can freely mix them. DataFrame operations (e.g., df.filter(), df.groupBy().agg()) are often more idiomatic in code, while SQL is great for analysts or complex joins.
Advanced Topics
- Partitioning & Bucketing: Control data layout for faster joins/queries (e.g., df.write.bucketBy(…)).
- Window Functions: ROW_NUMBER(), RANK(), LAG(), etc., over partitions.
- User-Defined Functions (UDFs): Scalar, aggregate; Pandas UDFs (vectorized) for performance.
- Streaming: Structured Streaming for SQL on streams.
- Delta Lake / Lakehouse: Often used with Spark SQL for ACID transactions, time travel, etc. (common in Databricks).
Learning Path:
- Official docs: Spark SQL Programming Guide and SQL Reference.
- Hands-on: Databricks Community Edition, local Spark, or SparkByExamples.
- Books/Tutorials: “Spark: The Definitive Guide”, Edureka/Databricks blogs.
- Practice: ETL pipelines, joins on large datasets, performance tuning.
Performance Tuning & Best Practices
- Use DataFrames/Datasets over RDDs.
- Cache/persist judiciously (df.cache()).
- Tune partitions (spark.sql.shuffle.partitions, AQE for dynamic coalescing).
- Handle skew (salting, AQE skew join optimization).
- Use broadcast joins for small tables (spark.sql.autoBroadcastJoinThreshold).
- Columnar formats (Parquet/ORC) with predicate pushdown.
- Avoid wide tables, excessive withColumn in loops, unnecessary count() or collect().
- Monitor with Spark UI (stages, tasks, shuffle read/write, GC time).
Common Issues: OOM (tune memory, off-heap), shuffle spills, data skew, small file problems.
Spark SQL Interview Questions & Answers (Detailed)
Here is a curated selection of common and advanced questions (drawn from real interviews).
- What is Spark SQL and its advantages? Spark SQL provides SQL interface + DataFrame API on Spark’s engine. Advantages: unified processing, Catalyst optimizations, Hive compatibility, mixing SQL/code, fault tolerance, scalability.
- Difference between RDD, DataFrame, and Dataset?
- RDD: Low-level, no schema, full control, less optimized.
- DataFrame: Schema, Catalyst + Tungsten optimizations, SQL support.
- Dataset: Type-safe DataFrame (Scala/Java). DataFrames are preferred for most structured work.
- How do you create/register a temporary table/view? df.createOrReplaceTempView(“table_name”) or createOrReplaceGlobalTempView. Then spark.sql(“SELECT * FROM table_name”).
- Explain Catalyst Optimizer phases. Analysis → Logical Optimization → Physical Planning (cost-based) → Code Generation. It enables predicate pushdown, join reordering, etc.
- What is AQE (Adaptive Query Execution)? Runtime re-optimization using actual statistics: coalesces small partitions, handles skew, chooses join strategies dynamically. Enabled via spark.sql.adaptive.enabled.
- Broadcast Join vs. Shuffle Join? When to use? Broadcast: Small table sent to all executors (fast for small < ~10-100MB). Shuffle (Sort-Merge): Both sides shuffled. Use hints or config to control.
- How to handle data skew? Salting (add random suffix to key), AQE skew optimization, broadcast if possible, repartition by skewed column + random.
- repartition vs. coalesce? repartition(n): Full shuffle, can increase/decrease partitions. coalesce(n): Reduces partitions with minimal shuffle (merges locally).
- What is lazy evaluation? Benefits? Transformations build DAG; computation only on action. Allows whole-query optimization by Catalyst.
- How to optimize a slow join query? Check Spark UI for skew/shuffle; use broadcast/AQE; ensure proper partitioning; filter early; use Parquet; tune memory/shuffle partitions.
Other Common Questions:
- Explain SparkSession vs. SparkContext.
- Window functions vs. groupBy.
- UDF performance (Pandas UDFs better).
- Caching levels and when to unpersist.
- Reading partitioned data (partition pruning).
- Difference between saveAsTable and writing to filesystem.
- Handling schema evolution.
- Integration with Hive metastore.
- Scenario: Optimize a wide join on terabyte-scale data.
Advanced/Scenario-Based:
- Diagnose OOM or long-running tasks from UI/logs.
- Design a pipeline for incremental loads.
- Handle late-arriving data in Structured Streaming with Spark SQL.
- Compare Spark SQL to Trino/Presto for your use case.
Preparation Tips: Practice on large datasets, use EXPLAIN/EXPLAIN EXTENDED/EXPLAIN COST, understand Spark UI deeply, and work on real ETL projects. Review official performance tuning guide.
For the absolute latest (Spark 4.x+), always check the official documentation, as features like AQE and ANSI compliance continue to evolve. Hands-on practice is the best way to master it!
Here is a comprehensive guide to Spark SQL, covering everything from its core architecture to advanced performance tuning and common interview questions. Spark SQL is Apache Spark’s module for structured data processing, widely used for its ability to blend SQL queries with programmatic DataFrames and its high-performance execution engine .
📚 Part 1: In-Depth Learning Details for Spark SQL
1. Core Architecture: Catalyst & Tungsten
Spark SQL’s performance comes from two main components under the hood.
Catalyst Optimizer
This is the query optimizer. It takes your SQL or DataFrame code and transforms it through several phases :
- Parsing: Converts SQL into an “unresolved logical plan” (checks syntax).
- Analysis: Uses catalog (metadata) to resolve column/table names.
- Logical Optimization: Applies rule-based optimizations like Predicate Pushdown (filters move closer to data) and Column Pruning (only read needed columns).
- Physical Planning: Generates multiple physical execution plans and selects the best one using a cost model (CBO).
- Code Generation: Compiles parts of the query to Java bytecode (Whole-stage CodeGen) for speed.
Tungsten Engine
This is the execution engine focused on memory and CPU efficiency :
- Off-Heap Memory: Manages memory directly (outside JVM) to reduce Garbage Collection (GC) pauses.
- Cache Locality: Uses compact binary formats for better CPU cache efficiency.
- Vectorization: Processes data in batches (column by column) rather than row by row (used for Parquet/ORC).
2. Core Concepts: DataFrames, Datasets, and SQL
- DataFrame: A distributed collection of rows organized into named columns (like a table in Python Pandas or R). It is a
Dataset[Row]. - Dataset: A type-safe version available only in Java and Scala (e.g.,
Dataset<Person>). It combines RDD benefits (type safety) with Spark SQL optimizations . - SparkSession: The entry point for all Spark SQL functionality (
sparkavailable in the shell) .
3. Performance Tuning Guide
Optimizing Spark SQL requires focusing on data organization, query patterns, and configuration.
Storage Format & Partitions
- Use Columnar Formats: Always use Parquet or ORC. They support compression (Snappy/ZSTD) and predicate pushdown .
- Partitioning: Partition your data on columns used for filtering (e.g.,
date,region). Avoid creating too many tiny files (the “small file problem”) .
Join Strategies (Crucial for Speed)
Choosing the right join method is vital :
- Broadcast Hash Join (BHJ): Fastest. Use when one table is small (< 10 MB by default). Use hint:
SELECT /*+ BROADCAST(t) */ .... Avoid OOM by keeping broadcast table under 1GB. - Sort Merge Join (SMJ): Default for joining two large tables. Requires shuffling data. Expensive but robust.
- Shuffle Hash Join (SHJ): Replaced by SMJ usually (unless sortable keys are missing).
BROADCAST: Forces broadcast join (small table).MERGE: Forces sort-merge join (large tables).SHUFFLE_HASH: Forces shuffle hash join.
Advanced Feature: Adaptive Query Execution (AQE)
Enabled by default in Spark 3.x (spark.sql.adaptive.enabled=true). It fixes issues at runtime :
- Coalescing Partitions: Automatically reduces the number of shuffle partitions to avoid tiny tasks.
- Handling Skew: Automatically splits skewed partitions (e.g., a key with 90% of data).
- Switching Join Types: Converts a Sort-Merge join to Broadcast join dynamically if one side is small.
Data Skew Handling
If AQE doesn’t fix skew automatically (or you are on older Spark) :
- Salting: Add a random number to the skewed key to spread the load.sql– Concept: Join on concat(key, ‘_’, random(0,N))
4. SQL Features Deep Dive
Window Functions
Used to calculate over a group of rows (frame) without collapsing them .
- Ranking:
ROW_NUMBER(),RANK(),DENSE_RANK(). - Analytic:
LAG(),LEAD()(look at previous/next rows). - Syntax:
function() OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN ... AND ...)
🎤 Part 2: Interview Questions & Detailed Answers
Here are some critical Spark SQL interview questions categorized by difficulty.
Conceptual & Architecture
Q1: Explain the Catalyst Optimizer phases.
Answer:
The Catalyst optimizer transforms queries in four phases :
- Parse: SQL string → Abstract Syntax Tree (AST).
- Analyze: AST + Catalog (metadata) → Resolved Logical Plan.
- Optimize: Apply rules (Predicate Pushdown, Constant Folding) → Optimized Logical Plan.
- Physical Planning: Logical Plan → Multiple Physical Plans → Select best using Cost Model (CBO).
- CodeGen: Physical plan → Java Bytecode (using
WholeStageCodegen).
Q2: What is the difference between df.cache() and df.persist()?
Answer:
cache(): Defaults toMEMORY_AND_DISK. Stores data in memory, spills to disk if memory full.persist(): Allows you to specifyStorageLevel(e.g.,DISK_ONLY,MEMORY_ONLY,OFF_HEAP). Use when you need to control replication level or memory usage .
Q3: How does Adaptive Query Execution (AQE) resolve data skew?
Answer:
AQE handles skew during JOIN or GROUP BY . If Spark detects that a partition size exceeds a threshold (skewed factor) during shuffle, it splits that skewed partition into smaller sub-partitions. It then replicates the “Build” side to join these sub-partitions concurrently, preventing one straggling task from delaying the whole job.
Performance & Optimization
Q4: A Spark SQL join is running for hours. One task takes 30 mins while others finish in 5 seconds. Why? How to fix?
Answer:
This is Data Skew. One key (e.g., user_id = null or location = 'US') has disproportionally more data.
Fix:
- Enable AQE:
spark.sql.adaptive.skewJoin.enabled=true(handles it automatically in Spark 3.x). - Salting: Add a random suffix to the skewed key to split it across multiple tasks, then join.
- Isolate Skew: Filter out the skewed key, process it separately, and union back.
Q5: You are joining two huge tables. Explain SortMergeJoin vs BroadcastHashJoin.
Answer:
- BroadcastHashJoin: Works only if one table fits in executor memory (< 1GB recommended). It sends the small table to all nodes (minimal shuffle). Fastest but memory-intensive .
- SortMergeJoin: Handles any size. Shuffles both tables so that same keys land in same partition, sorts them, then merges. Involves heavy network I/O but is robust and memory-safe for big data.
Q6: How to avoid SELECT * in production?
Answer:SELECT * forces Spark to read all columns from Parquet/ORC. Column Pruning optimization fails. This results in higher I/O, more memory usage, and slower network transfers. Always explicitly list required columns (e.g., SELECT id, name FROM ...) .
Code & Logic
Q7: Write a query to find the second-highest salary per department.
Answer:
Use the DENSE_RANK() window function .
sql
SELECT department, employee_name, salary
FROM (
SELECT department, employee_name, salary,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank
FROM employees
) ranked
WHERE rank = 2;Q8: Difference between RANK() and ROW_NUMBER()?
Answer:
ROW_NUMBER(): Assigns a unique sequential number per row (even if ties).RANK(): Same rank for ties, leaves gaps (1,2,2,4).DENSE_RANK(): Same rank for ties, no gaps (1,2,2,3).
25 Additional Quick Q&A
| Question | Short Answer |
|---|---|
1. What is a SparkSession? | The entry point for reading data and executing SQL (replaces SQLContext/HiveContext) |
| 2. List four file formats. | Parquet, ORC, JSON, CSV. (Parquet/ORC recommended for analytics) |
3. What is a UDF? | User Defined Function. Allows custom logic in SQL/DataFrame |
| 4. Why is a Python UDF slow? | Involves serialization (Py4J) between JVM and Python process for every row |
| 5. How to use a SQL hint? | SELECT /*+ hint_name */ ... (e.g., BROADCAST, REPARTITION) |
6. Define EXPLAIN command. | Shows the logical and physical execution plan of a query |
| 7. Config for Auto Broadcast? | spark.sql.autoBroadcastJoinThreshold (default 10MB) |
| 8. What is Predicate Pushdown? | Filters are pushed to the data source (Parquet) to skip reading irrelevant data |
| 9. Partition column guidelines? | High cardinality? Yes. Avoid too many partitions (thousands) |
10. Difference: repartition vs coalesce? | repartition = Full Shuffle (increase/decrease partitions). coalesce = Avoids shuffle (decrease only) |
| 11. What is Tungsten? | Memory management engine using off-heap storage and code generation |
| 12. How to handle missing Parquet schema evolution? | Use mergeSchema option or read as String and cast |
| 13. What is Dynamic Partition Pruning (DPP)? | Optimizer automatically filters partitions in a large table based on a small table join condition |
| 14. Config to control Shuffle partitions? | spark.sql.shuffle.partitions (default 200) |
| 15. How to see Spark UI? | Port 4040 (default) |
| 16. Create a temporary view? | df.createOrReplaceTempView("my_view") |
17. What is a Skewed Join? | When one key has massively more data than others, causing one task to run forever |
| 18. Why are small files bad? | Causes too many tasks (overhead), inefficient predicate pushdown |
| 19. How to write a single output file? | df.coalesce(1).write... (Danger: causes driver bottleneck) |
20. What is FileNotFoundException in Shuffle? | Usually means a node died or disk failure; retry with checkpointing |
21. Difference: DROP vs TRUNCATE? | Drop removes table definition. Truncate removes data only |
22. What is a Broadcast Variable? | Read-only copy of data sent to all nodes (used in Broadcast Join) |
23. Can you run UPDATE or DELETE? | Not in standard Spark SQL (unless using Delta Lake/Hudi) |
24. What is Whole-Stage Codegen? | Compiles multiple operators (Filter + Project) into a single function to avoid virtual function calls |
| 25. Best config for memory? | spark.executor.memory = 16-32GB. spark.memory.offHeap.size = 8-16GB |
Spark SQL is one of the most important topics for Data Engineer, Big Data Engineer, Databricks Engineer, AWS Glue Developer, Azure Data Engineer, and Senior Spark Developer interviews.
1. What is Spark SQL?
Spark SQL is a module of Apache Spark that allows you to process structured and semi-structured data using SQL queries.
It provides:
- SQL Interface
- DataFrame API
- Dataset API
- Query Optimization
- Hive Integration
- JDBC Connectivity
Instead of writing complex RDD transformations, developers can use SQL-like syntax.
Example
df = spark.read.csv("employees.csv", header=True)
df.createOrReplaceTempView("employees")
result = spark.sql("""
SELECT department,
COUNT(*) as emp_count
FROM employees
GROUP BY department
""")Why Spark SQL?
Traditional SQL Databases:
- Limited scalability
- Vertical scaling
Spark SQL:
- Distributed processing
- Parallel execution
- In-memory computing
- Petabyte scale processing
Spark SQL Architecture
SQL Query
|
Catalyst Optimizer
|
Logical Plan
|
Optimized Logical Plan
|
Physical Plan
|
Tungsten Engine
|
ExecutionComponents of Spark SQL
1. DataFrame
A distributed collection of data organized into named columns.
Similar to:
Table
Excel Sheet
Pandas DataFrameExample:
df = spark.read.parquet("/data/sales")2. Dataset
Type-safe version of DataFrame.
Available mainly in Scala and Java.
Example:
case class Employee(id:Int,name:String)
val ds = spark.read.json("emp.json").as[Employee]3. SQL Engine
Allows execution of SQL queries.
spark.sql("""
SELECT * FROM employees
""")SparkSession
Entry point to Spark SQL.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("SparkSQL") \
.getOrCreate()Reading Data
CSV
df = spark.read \
.option("header","true") \
.csv("/data/file.csv")JSON
df = spark.read.json("/data/file.json")Parquet
df = spark.read.parquet("/data/file.parquet")ORC
df = spark.read.orc("/data/file.orc")Delta
df = spark.read.format("delta").load("/data")Writing Data
df.write.mode("overwrite").parquet("/output")Modes:
- append
- overwrite
- ignore
- errorifexists
Creating Temporary Views
Temp View
Session scoped.
df.createOrReplaceTempView("emp")spark.sql("SELECT * FROM emp")Global Temp View
Application scoped.
df.createGlobalTempView("emp")SELECT * FROM global_temp.empSpark SQL Data Types
Numeric
ByteType
ShortType
IntegerType
LongType
FloatType
DoubleType
DecimalTypeString
StringTypeDate & Time
DateType
TimestampTypeComplex Types
ArrayType
MapType
StructTypeExample:
{
"name":"John",
"skills":["Python","Spark"]
}Common SQL Operations
Select
SELECT * FROM employeesFilter
SELECT *
FROM employees
WHERE salary > 100000Group By
SELECT department,
COUNT(*)
FROM employees
GROUP BY departmentOrder By
SELECT *
FROM employees
ORDER BY salary DESCJoins in Spark SQL
Inner Join
SELECT *
FROM emp e
INNER JOIN dept d
ON e.dept_id=d.dept_idLeft Join
SELECT *
FROM emp e
LEFT JOIN dept d
ON e.dept_id=d.dept_idRight Join
SELECT *
FROM emp e
RIGHT JOIN dept d
ON e.dept_id=d.dept_idFull Join
SELECT *
FROM emp e
FULL OUTER JOIN dept d
ON e.dept_id=d.dept_idCross Join
SELECT *
FROM emp CROSS JOIN deptWindow Functions
Frequently asked in interviews.
Example:
SELECT *,
ROW_NUMBER() OVER
(
PARTITION BY dept
ORDER BY salary DESC
) rn
FROM employeesRank
RANK()Dense Rank
DENSE_RANK()Lead
LEAD()Lag
LAG()Common Built-In Functions
String Functions
UPPER()
LOWER()
TRIM()
CONCAT()
SUBSTRING()Date Functions
CURRENT_DATE()
DATE_ADD()
DATEDIFF()
MONTH()
YEAR()Aggregate Functions
SUM()
AVG()
COUNT()
MAX()
MIN()Catalyst Optimizer
Most important Spark SQL interview topic.
Catalyst performs:
Analysis
Validate query
Resolve columns
Resolve tablesLogical Optimization
Predicate Pushdown
Constant Folding
Column PruningPhysical Planning
Choose execution strategyCode Generation
Generate optimized JVM bytecodeExample of Predicate Pushdown
Bad:
SELECT *
FROM sales
WHERE amount > 1000Without pushdown:
Entire dataset read.
With pushdown:
Only required rows read.
Benefits:
- Less IO
- Faster execution
Tungsten Engine
Spark’s execution engine.
Features:
Off-Heap Memory
Reduces JVM GC overhead.
Binary Processing
More compact storage.
Whole Stage Code Generation
Generates native bytecode.
Benefits:
- Faster execution
- Lower memory usage
Partitioning
Critical interview topic.
Repartition
Creates full shuffle.
df = df.repartition(200)Coalesce
Avoids full shuffle.
df = df.coalesce(20)Caching
df.cache()or
df.persist()Storage Levels:
MEMORY_ONLY
MEMORY_AND_DISK
DISK_ONLYBroadcast Join
Small table sent to all executors.
from pyspark.sql.functions import broadcast
df.join(broadcast(dim), "id")Advantages:
- Avoids shuffle
- Faster joins
Adaptive Query Execution (AQE)
Introduced in Spark 3.
Features:
- Dynamic partition adjustment
- Join strategy switching
- Skew handling
Enable:
spark.conf.set(
"spark.sql.adaptive.enabled",
"true")Explain Plan
df.explain(True)Shows:
Parsed Plan
Logical Plan
Optimized Plan
Physical PlanSpark SQL Performance Optimization
Use Parquet
Good:
Parquet
ORC
DeltaBad:
CSV
JSONColumn Pruning
Good:
SELECT id,name
FROM employeesBad:
SELECT *
FROM employeesFilter Early
SELECT *
FROM sales
WHERE sale_date='2025-01-01'Broadcast Small Tables
broadcast(dim_table)Bucketing
CREATE TABLE sales
USING parquet
CLUSTERED BY (customer_id)
INTO 100 BUCKETSSpark SQL Interview Questions & Answers
Q1. What is Spark SQL?
Answer
Spark SQL is a Spark module used for structured data processing. It provides SQL queries, DataFrames, Datasets, Catalyst Optimizer, and integration with Hive.
Q2. Difference between RDD and DataFrame?
| RDD | DataFrame |
|---|---|
| Low-level | High-level |
| No optimization | Catalyst optimization |
| Slower | Faster |
| No schema | Schema based |
Q3. Difference between DataFrame and Dataset?
| DataFrame | Dataset |
|---|---|
| Untyped | Typed |
| Python Supported | Scala/Java Mainly |
| Faster Development | Compile-time checks |
Q4. What is Catalyst Optimizer?
Catalyst is Spark SQL’s query optimization framework.
Functions:
- Query analysis
- Logical optimization
- Physical optimization
- Code generation
Q5. What is Tungsten?
Execution engine that improves:
- Memory management
- CPU efficiency
- Serialization
Q6. Explain Predicate Pushdown.
Filters are pushed to storage layer.
Instead of:
Read 1 TB
Filter laterSpark reads only required rows.
Q7. What is Column Pruning?
Reading only required columns.
Example:
SELECT id,nameinstead of
SELECT *Q8. What is Broadcast Join?
Small table distributed to all executors.
Used when:
Dimension table < 10 MB (config dependent)Q9. What causes data skew?
When one partition contains significantly more data than others.
Example:
Customer A → 10M rows
Others → 100 rowsQ10. How do you fix data skew?
Methods:
- Salting
- AQE
- Broadcast Join
- Repartitioning
Q11. What is AQE?
Adaptive Query Execution dynamically optimizes query execution during runtime.
Q12. Difference Between Repartition and Coalesce?
| Repartition | Coalesce |
|---|---|
| Shuffle | Less shuffle |
| Increase/decrease partitions | Mostly decrease |
| Expensive | Cheaper |
Q13. Explain Window Functions.
Perform calculations across rows without collapsing records.
Examples:
ROW_NUMBER()
RANK()
DENSE_RANK()
LAG()
LEAD()Q14. What is a Temporary View?
Session-scoped SQL view.
df.createOrReplaceTempView("emp")Q15. What is a Global Temporary View?
Accessible across sessions.
global_temp.viewnameSenior-Level Spark SQL Interview Questions
How would you optimize a 5 TB Spark SQL job?
Answer:
- Use Parquet/Delta
- Partition correctly
- Enable AQE
- Broadcast dimensions
- Avoid SELECT *
- Tune shuffle partitions
- Cache reused datasets
- Analyze explain plan
- Handle skew
- Use Z-Ordering (Delta)
How does Spark SQL execute a query internally?
Answer:
SQL Query
→ Parser
→ Logical Plan
→ Catalyst Optimizer
→ Optimized Logical Plan
→ Physical Plan
→ Tungsten Engine
→ DAG Scheduler
→ Tasks
→ Executors
→ OutputMust-Know Real-Time Spark SQL Scenarios
Scenario 1: Slow Join Query
Problem:
Fact Table = 5 TB
Dimension = 20 MBSolution:
broadcast(dim)Scenario 2: Too Many Small Files
Solution:
df.coalesce(10)Scenario 3: Data Skew
Solution:
Salting
AQE
Skew Join OptimizationScenario 4: Memory Issues
Solution:
Persist MEMORY_AND_DISK
Increase executor memory
Avoid unnecessary cacheTop 25 Spark SQL Topics for Interviews
- Spark SQL Architecture
- SparkSession
- DataFrames
- Datasets
- Catalyst Optimizer
- Tungsten
- AQE
- Joins
- Broadcast Join
- Window Functions
- Partitioning
- Bucketing
- Caching
- Persist
- Explain Plan
- Predicate Pushdown
- Column Pruning
- Query Optimization
- Data Skew
- Shuffle
- Temp Views
- Hive Metastore
- Parquet
- Delta Lake Integration
- Performance Tuning
These topics cover roughly 90–95% of Spark SQL questions asked in Data Engineering, Databricks, AWS Glue, Azure Data Engineering, and Senior Big Data interviews.


