Apache Spark SQL – Complete Learning Guide + Interview Questions & Answers

SPARK SQL

Spark SQL is Apache Spark’s module for working with structured data. It provides a familiar SQL interface (and the DataFrame/Dataset API) while leveraging Spark’s distributed, in-memory processing engine for high performance and scalability.

It unifies relational processing with Spark’s functional programming model, allowing seamless mixing of SQL queries with other Spark operations (e.g., MLlib, Streaming). The same execution engine is used regardless of whether you use SQL, DataFrames, or Datasets.

Key Concepts and Architecture

  • DataFrames: A distributed collection of data organized into named columns (like a relational table or pandas DataFrame). They provide rich optimizations via the Catalyst optimizer and Tungsten execution engine. DataFrames are the primary abstraction in Spark SQL.
  • Datasets: An extension of the DataFrame API (Scala and Java only) that adds compile-time type safety while retaining the same optimizations. PySpark has no typed Dataset API; because Python is dynamically typed, the DataFrame is the only structured abstraction there.
  • SparkSession: The unified entry point (introduced in Spark 2.0) for working with Spark SQL, DataFrames, etc. Example in PySpark: from pyspark.sql import SparkSession; spark = SparkSession.builder.appName("MyApp").getOrCreate()
  • Catalyst Optimizer: The extensible query optimizer at the heart of Spark SQL. It performs analysis, logical optimization (rule-based, e.g., predicate pushdown, constant folding), physical planning (cost-based), and code generation (Whole-Stage Code Generation via Tungsten). This enables significant performance gains over raw RDDs.
  • Tungsten: Focuses on memory/CPU efficiency with off-heap memory management, cache-friendly data layouts, and bytecode generation.
  • Execution Model: Queries are lazy (transformations build a DAG until an action triggers computation). Spark handles fault tolerance via lineage.
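To make the lazy execution model concrete, here is a toy sketch in plain Python. It is not Spark's implementation: the class LazyDataset and its methods are hypothetical names chosen for illustration. Transformations only record steps in a plan, and nothing runs until an action (collect) is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded transformations; nothing has executed yet

    def filter(self, predicate):
        # Transformation: extend the plan and return a new dataset.
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def map(self, fn):
        # Transformation: extend the plan and return a new dataset.
        return LazyDataset(self._rows, self._plan + (("map", fn),))

    def collect(self):
        # Action: only now is the recorded plan executed, step by step.
        out = list(self._rows)
        for kind, fn in self._plan:
            if kind == "filter":
                out = [r for r in out if fn(r)]
            else:
                out = [fn(r) for r in out]
        return out


calls = []


def is_even(x):
    calls.append(x)  # side effect lets us observe when work actually happens
    return x % 2 == 0


pipeline = LazyDataset(range(6)).filter(is_even).map(lambda x: x * 10)
calls_before_action = len(calls)  # the plan exists, but no row was touched
result = pipeline.collect()       # the action triggers execution
```

Because the whole plan is known before anything runs, an optimizer gets the chance to rewrite it first, which is what Catalyst does for real queries.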

Comparison to Hive:

  • Spark SQL is generally faster (in-memory, optimized execution) and supports mixing SQL with programmatic code.
  • Hive (on MapReduce/Tez) is more batch-oriented; Spark SQL reuses Hive metastore and is largely compatible with HiveQL but adds many optimizations.
  • Spark SQL excels in iterative workloads, ML integration, and interactive queries.

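Vs. Presto/Trino: Trino (formerly PrestoSQL) is often faster for low-latency, interactive ad hoc SQL analytics because it streams data between stages in memory rather than writing shuffle files to disk. Spark is better for unified batch + streaming + ML pipelines and complex, long-running ETL, where fault tolerance matters more than latency.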

Core Features

  • Unified Data Access: Read/write from Parquet, ORC, Avro, JSON, CSV, JDBC, Hive tables, etc., with schema inference or explicit schemas.
  • SQL Support: Broad ANSI SQL coverage with extensions (strict ANSI semantics are controlled by spark.sql.ansi.enabled). Supports DDL, DML, queries, window functions, common table expressions (CTEs), etc.
  • Integration: Hive metastore compatibility, UDFs/UDAFs (Scala/Java/Python), integration with Arrow for pandas.
  • Data Sources: Built-in support for many formats; custom sources via APIs.
  • Performance Features: Caching (in-memory columnar), broadcast joins, Adaptive Query Execution (AQE: dynamic re-optimization at runtime, enabled by default since Spark 3.2), partition pruning, predicate pushdown, skew handling.

Getting Started (PySpark Example)

Python

# Create SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read data
df = spark.read.parquet("path/to/data")
df.createOrReplaceTempView("my_table")

# SQL query
result = spark.sql("""
  SELECT category, COUNT(*) as cnt, AVG(sales) as avg_sales
  FROM my_table
  WHERE date >= '2025-01-01'
  GROUP BY category
  HAVING cnt > 100
""")

result.show()
result.write.parquet("output/")

DataFrames API vs. SQL

You can freely mix them. DataFrame operations (e.g., df.filter(), df.groupBy().agg()) are often more idiomatic in code, while SQL is great for analysts or complex joins.

Advanced Topics

  • Partitioning & Bucketing: Control data layout for faster joins/queries (e.g., df.write.bucketBy(…)).
  • Window Functions: ROW_NUMBER(), RANK(), LAG(), etc., over partitions.
  • User-Defined Functions (UDFs): Scalar, aggregate; Pandas UDFs (vectorized) for performance.
  • Streaming: Structured Streaming for SQL on streams.
  • Delta Lake / Lakehouse: Often used with Spark SQL for ACID transactions, time travel, etc. (common in Databricks).

Learning Path:

  1. Official docs: Spark SQL Programming Guide and SQL Reference.
  2. Hands-on: Databricks Community Edition, local Spark, or SparkByExamples.
  3. Books/Tutorials: “Spark: The Definitive Guide”, Edureka/Databricks blogs.
  4. Practice: ETL pipelines, joins on large datasets, performance tuning.

Performance Tuning & Best Practices

  • Use DataFrames/Datasets over RDDs.
  • Cache/persist judiciously (df.cache()).
  • Tune partitions (spark.sql.shuffle.partitions, AQE for dynamic coalescing).
  • Handle skew (salting, AQE skew join optimization).
  • Use broadcast joins for small tables (spark.sql.autoBroadcastJoinThreshold).
  • Columnar formats (Parquet/ORC) with predicate pushdown.
  • Avoid wide tables, excessive withColumn in loops, unnecessary count() or collect().
  • Monitor with Spark UI (stages, tasks, shuffle read/write, GC time).

Common Issues: OOM (tune memory, off-heap), shuffle spills, data skew, small file problems.

Spark SQL Interview Questions & Answers (Detailed)

Here is a curated selection of common and advanced questions (drawn from real interviews).

  1. What is Spark SQL and its advantages? Spark SQL provides SQL interface + DataFrame API on Spark’s engine. Advantages: unified processing, Catalyst optimizations, Hive compatibility, mixing SQL/code, fault tolerance, scalability.
  2. Difference between RDD, DataFrame, and Dataset?
    • RDD: Low-level, no schema, full control, less optimized.
    • DataFrame: Schema, Catalyst + Tungsten optimizations, SQL support.
    • Dataset: Type-safe DataFrame (Scala/Java). DataFrames are preferred for most structured work.
  3. How do you create/register a temporary table/view? df.createOrReplaceTempView("table_name") or createOrReplaceGlobalTempView. Then spark.sql("SELECT * FROM table_name").
  4. Explain Catalyst Optimizer phases. Analysis → Logical Optimization → Physical Planning (cost-based) → Code Generation. It enables predicate pushdown, join reordering, etc.
  5. What is AQE (Adaptive Query Execution)? Runtime re-optimization using actual statistics: coalesces small partitions, handles skew, chooses join strategies dynamically. Enabled via spark.sql.adaptive.enabled.
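  6. Broadcast Join vs. Shuffle Join? When to use? Broadcast: the small table is copied to every executor, so the large side is not shuffled; suitable when one side is small (the automatic threshold is 10 MB by default, and tables up to roughly 100 MB are often broadcast via a hint). Shuffle (Sort-Merge): both sides are shuffled and sorted by join key; the default for two large tables. Use hints or config to control.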
  7. How to handle data skew? Salting (add random suffix to key), AQE skew optimization, broadcast if possible, repartition by skewed column + random.
  8. repartition vs. coalesce? repartition(n): Full shuffle, can increase/decrease partitions. coalesce(n): Reduces partitions with minimal shuffle (merges locally).
  9. What is lazy evaluation? Benefits? Transformations build DAG; computation only on action. Allows whole-query optimization by Catalyst.
  10. How to optimize a slow join query? Check Spark UI for skew/shuffle; use broadcast/AQE; ensure proper partitioning; filter early; use Parquet; tune memory/shuffle partitions.

Other Common Questions:

  • Explain SparkSession vs. SparkContext.
  • Window functions vs. groupBy.
  • UDF performance (Pandas UDFs better).
  • Caching levels and when to unpersist.
  • Reading partitioned data (partition pruning).
  • Difference between saveAsTable and writing to filesystem.
  • Handling schema evolution.
  • Integration with Hive metastore.
  • Scenario: Optimize a wide join on terabyte-scale data.

Advanced/Scenario-Based:

  • Diagnose OOM or long-running tasks from UI/logs.
  • Design a pipeline for incremental loads.
  • Handle late-arriving data in Structured Streaming with Spark SQL.
  • Compare Spark SQL to Trino/Presto for your use case.

Preparation Tips: Practice on large datasets, use EXPLAIN/EXPLAIN EXTENDED/EXPLAIN COST, understand Spark UI deeply, and work on real ETL projects. Review official performance tuning guide.

For the absolute latest (Spark 4.x+), always check the official documentation, as features like AQE and ANSI compliance continue to evolve. Hands-on practice is the best way to master it!

Here is a comprehensive guide to Spark SQL, covering everything from its core architecture to advanced performance tuning and common interview questions. Spark SQL is Apache Spark’s module for structured data processing, widely used for its ability to blend SQL queries with programmatic DataFrames and its high-performance execution engine.


📚 Part 1: In-Depth Learning Details for Spark SQL

1. Core Architecture: Catalyst & Tungsten

Spark SQL’s performance comes from two main components under the hood.

Catalyst Optimizer
This is the query optimizer. It takes your SQL or DataFrame code and transforms it through several phases:

  1. Parsing: Converts SQL into an “unresolved logical plan” (checks syntax).
  2. Analysis: Uses catalog (metadata) to resolve column/table names.
  3. Logical Optimization: Applies rule-based optimizations like Predicate Pushdown (filters move closer to data) and Column Pruning (only read needed columns).
  4. Physical Planning: Generates multiple physical execution plans and selects the best one using a cost model (CBO).
  5. Code Generation: Compiles parts of the query to Java bytecode (Whole-stage CodeGen) for speed.
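The logical optimization phase is easiest to picture as tree rewriting. The following is a toy sketch of one such rule, constant folding, written in plain Python. The tuple-based expression format and the function fold_constants are assumptions made for illustration; Catalyst itself is written in Scala and works on its own tree classes.

```python
import operator

# Expressions are nested tuples (a made-up format for this sketch):
#   ("lit", 3), ("col", "price"), ("add", left, right), ("mul", left, right)
OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(expr):
    """Bottom-up rewrite: replace an operator over two literals with one literal."""
    kind = expr[0]
    if kind in ("lit", "col"):
        return expr  # leaves are already as simple as they can be
    left = fold_constants(expr[1])
    right = fold_constants(expr[2])
    if left[0] == "lit" and right[0] == "lit":
        return ("lit", OPS[kind](left[1], right[1]))
    return (kind, left, right)


# price * (60 * 60 * 24) is rewritten to price * 86400 before execution.
plan = ("mul", ("col", "price"), ("mul", ("mul", ("lit", 60), ("lit", 60)), ("lit", 24)))
optimized = fold_constants(plan)
```

The rule runs once at planning time, so the multiplication of the three literals is never repeated per row.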

Tungsten Engine
This is the execution engine focused on memory and CPU efficiency:

  • Off-Heap Memory: Manages memory directly (outside JVM) to reduce Garbage Collection (GC) pauses.
  • Cache Locality: Uses compact binary formats for better CPU cache efficiency.
  • Vectorization: Processes data in batches (column by column) rather than row by row (used for Parquet/ORC).

2. Core Concepts: DataFrames, Datasets, and SQL

  • DataFrame: A distributed collection of rows organized into named columns (like a pandas DataFrame in Python or a data frame in R). In Scala, DataFrame is a type alias for Dataset[Row].
  • Dataset: A type-safe version available only in Java and Scala (e.g., Dataset<Person>). It combines the compile-time type safety of RDDs with Spark SQL optimizations.
  • SparkSession: The entry point for all Spark SQL functionality (available as spark in the shell).

3. Performance Tuning Guide

Optimizing Spark SQL requires focusing on data organization, query patterns, and configuration.

Storage Format & Partitions

  • Use Columnar Formats: Prefer Parquet or ORC for analytical data. They support compression (Snappy/ZSTD) and predicate pushdown.
  • Partitioning: Partition your data on columns used for filtering (e.g., date, region). Avoid creating too many tiny files (the “small file problem”).

Join Strategies (Crucial for Speed)
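Choosing the right join method is vital:

  • Broadcast Hash Join (BHJ): Usually the fastest. Chosen automatically when one side is estimated to be smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default), or requested with a hint: SELECT /*+ BROADCAST(t) */ .... As a rule of thumb, keep the broadcast table under about 1 GB, because it must fit in memory on the driver and on every executor.
  • Sort Merge Join (SMJ): Default for joining two large tables. Requires shuffling data. Expensive but robust.
  • Shuffle Hash Join (SHJ): Rarely chosen by default because Spark prefers SMJ (spark.sql.join.preferSortMergeJoin=true); used mainly when the join keys are not sortable or when requested with a hint.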

Join Hints Cheat Sheet

  • BROADCAST: Requests a broadcast join (small table).
  • MERGE: Requests a sort-merge join (large tables).
  • SHUFFLE_HASH: Requests a shuffle hash join.

Advanced Feature: Adaptive Query Execution (AQE)
Introduced in Spark 3.0 and enabled by default since Spark 3.2 (spark.sql.adaptive.enabled=true). It fixes issues at runtime:

  • Coalescing Partitions: Automatically reduces the number of shuffle partitions to avoid tiny tasks.
  • Handling Skew: Automatically splits skewed partitions (e.g., a key with 90% of data).
  • Switching Join Types: Converts a Sort-Merge join to Broadcast join dynamically if one side is small.
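Partition coalescing can be pictured as a greedy merge of neighboring shuffle partitions. The sketch below is a simplified model in plain Python, not Spark's actual rule: the function name coalesce_partitions is hypothetical, and target only loosely plays the role of spark.sql.adaptive.advisoryPartitionSizeInBytes. Spark's real logic takes further settings into account.

```python
def coalesce_partitions(sizes, target):
    """Greedily merge adjacent partitions; start a new group when the next
    partition would push the current group past the target size."""
    groups, current, current_size = [], [], 0
    for idx, size in enumerate(sizes):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Sizes (in MB) observed after a shuffle: many tiny partitions, two large ones.
shuffle_sizes_mb = [5, 8, 3, 60, 2, 2, 70, 10]
groups = coalesce_partitions(shuffle_sizes_mb, target=64)
```

Eight shuffle partitions become four tasks here. Only contiguous partitions are merged, which is why no extra shuffle is needed.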

Data Skew Handling
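If AQE doesn’t fix skew automatically (or you are on older Spark):

  • Salting: Add a random number to the skewed key to spread the load. Concept (SQL): on the skewed side, join on concat(key, '_', salt), where salt is a random integer in [0, N), for example floor(rand() * N); on the other side, replicate each row once per salt value so that every salted key still finds its match.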
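The effect of salting on partition sizes can be checked with a small simulation. This is plain Python rather than Spark, and the hash function, key names, and counts are all assumptions picked for the demo. It measures only how rows spread across partitions; in a real join the other side must also be expanded with every salt value.

```python
import random
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    # Deterministic hash (Python's built-in hash() of str varies between runs).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def max_partition_load(keys, num_partitions):
    # Size of the largest partition, i.e. the work of the slowest task.
    loads = Counter(partition_of(k, num_partitions) for k in keys)
    return max(loads.values())


rng = random.Random(42)
num_partitions, num_salts = 8, 16

# Skewed input: one hot key with 10,000 rows plus 100 keys with 10 rows each.
keys = ["hot"] * 10_000 + [f"k{i}" for i in range(100) for _ in range(10)]

# Salting: append a random suffix so the hot key maps to many partitions.
salted = [f"{k}_{rng.randrange(num_salts)}" for k in keys]

load_plain = max_partition_load(keys, num_partitions)
load_salted = max_partition_load(salted, num_partitions)
```

Without salting, all 10,000 hot rows land in one partition; with salting, the largest partition is much smaller because the hot key is split into up to 16 distinct keys.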

4. SQL Features Deep Dive

Window Functions
Used to calculate over a group of rows (frame) without collapsing them.

  • Ranking: ROW_NUMBER(), RANK(), DENSE_RANK().
  • Analytic: LAG(), LEAD() (look at previous/next rows).
  • Syntax: function() OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN ... AND ...)
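To see the OVER clause in action without a cluster, the sketch below runs a window query against SQLite from Python's standard library (window functions need SQLite 3.25 or newer). SQLite is a stand-in here, not Spark; the sales table and its values are made up. The LAG and SUM ... OVER syntax used is standard SQL that Spark SQL also accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 10), ("east", 2, 30), ("east", 3, 20), ("west", 1, 5), ("west", 2, 15)],
)

# Each input row is kept; the window adds the previous day's amount and a
# running total computed within the row's own region.
rows = conn.execute("""
    SELECT region, day, amount,
           LAG(amount) OVER (PARTITION BY region ORDER BY day) AS prev_amount,
           SUM(amount) OVER (
               PARTITION BY region ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM sales
    ORDER BY region, day
""").fetchall()
```

The first row of each region has no previous row, so LAG returns NULL (None in Python) there.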

🎤 Part 2: Interview Questions & Detailed Answers

Here are some critical Spark SQL interview questions categorized by difficulty.

Conceptual & Architecture

Q1: Explain the Catalyst Optimizer phases.
Answer:
The Catalyst optimizer transforms queries in five phases:

  1. Parse: SQL string → Abstract Syntax Tree (AST).
  2. Analyze: AST + Catalog (metadata) → Resolved Logical Plan.
  3. Optimize: Apply rules (Predicate Pushdown, Constant Folding) → Optimized Logical Plan.
  4. Physical Planning: Logical Plan → Multiple Physical Plans → Select best using Cost Model (CBO).
  5. CodeGen: Physical plan → Java Bytecode (using WholeStageCodegen).

Q2: What is the difference between df.cache() and df.persist()?
Answer:

  • cache(): Shorthand for persist() with the default storage level, which for DataFrames is MEMORY_AND_DISK (RDD.cache() defaults to MEMORY_ONLY). Stores data in memory and spills to disk if memory is full.
  • persist(): Allows you to specify a StorageLevel (e.g., DISK_ONLY, MEMORY_ONLY, OFF_HEAP). Use it when you need to control replication or memory usage.

Q3: How does Adaptive Query Execution (AQE) resolve data skew?
Answer:
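AQE handles skew in shuffle-based joins (sort-merge joins in particular); it does not rebalance a skewed GROUP BY. After the shuffle, Spark treats a partition as skewed when it is both much larger than the median partition (spark.sql.adaptive.skewJoin.skewedPartitionFactor) and larger than an absolute threshold (spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes). It splits that partition into smaller sub-partitions and replicates the matching partition from the other side of the join, so the sub-partitions are joined in parallel and one straggling task no longer delays the whole job.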

Performance & Optimization

Q4: A Spark SQL join is running for hours. One task takes 30 mins while others finish in 5 seconds. Why? How to fix?
Answer:
This is Data Skew. One key (e.g., user_id IS NULL or location = 'US') has disproportionately more data.
Fix:

  1. Enable AQE skew handling: spark.sql.adaptive.enabled=true together with spark.sql.adaptive.skewJoin.enabled=true (handles it automatically in Spark 3.x).
  2. Salting: Add a random suffix to the skewed key to split it across multiple tasks, then join.
  3. Isolate Skew: Filter out the skewed key, process it separately, and union back.

Q5: You are joining two huge tables. Explain SortMergeJoin vs BroadcastHashJoin.
Answer:

  • BroadcastHashJoin: Works only if one table fits in driver and executor memory (< 1GB recommended). It sends the small table to all nodes, so the large table is not shuffled. Fastest but memory-intensive.
  • SortMergeJoin: Handles any size. Shuffles both tables so that same keys land in same partition, sorts them, then merges. Involves heavy network I/O but is robust and memory-safe for big data.
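The merge step of a sort-merge join is worth tracing once by hand. Below is a single-machine sketch in plain Python; the function sort_merge_join and the sample rows are hypothetical, and the shuffle that Spark performs first (so that equal keys meet in the same partition) is left out.

```python
def sort_merge_join(left, right):
    """Inner join of (key, value) pairs by sorting both sides and merging."""
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key has no partner yet; advance left
        elif lk > rk:
            j += 1  # right key has no partner; advance right
        else:
            # Find the run of equal keys on the right, then pair it
            # with every left row that carries the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[r][1]) for r in range(j, j_end))
                i += 1
            j = j_end
    return out


orders = [(2, "o-b"), (1, "o-a"), (2, "o-c"), (4, "o-d")]
users = [(1, "ann"), (2, "bob"), (3, "cy")]
joined = sort_merge_join(orders, users)
```

Both inputs are read once in key order and neither has to fit in a hash table, which is why this strategy scales to two large tables.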

Q6: Why should you avoid SELECT * in production?
Answer:
SELECT * forces Spark to read every column from Parquet/ORC, so column pruning has nothing to eliminate. This results in higher I/O, more memory usage, and slower network transfers. Always explicitly list required columns (e.g., SELECT id, name FROM ...).

Code & Logic

Q7: Write a query to find the second-highest salary per department.
Answer:
Use the DENSE_RANK() window function.

sql

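-- The alias is salary_rank rather than rank to avoid confusion with the RANK() function.
-- Employees tied on the second-highest salary are all returned.
SELECT department, employee_name, salary
FROM (
    SELECT department, employee_name, salary,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
    FROM employees
) ranked
WHERE salary_rank = 2;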
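A quick way to sanity-check this kind of query is to run it on a handful of rows. The sketch below does so with SQLite from Python's standard library (SQLite 3.25 or newer), used here only as a stand-in for Spark SQL. The employees rows are made up, and the alias salary_rank is a naming choice for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, employee_name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("eng", "ana", 300), ("eng", "bo", 250), ("eng", "cal", 250), ("eng", "dee", 100),
        ("ops", "eli", 200), ("ops", "fay", 150),
    ],
)

# DENSE_RANK gives tied salaries the same rank without gaps, so rank 2 is
# always the second-highest distinct salary in each department.
second_highest = conn.execute("""
    SELECT department, employee_name, salary
    FROM (
        SELECT department, employee_name, salary,
               DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
        FROM employees
    ) ranked
    WHERE salary_rank = 2
    ORDER BY department, employee_name
""").fetchall()
```

Note that eng returns two rows because bo and cal tie at 250. If exactly one row per department is required, ROW_NUMBER() with a tie-breaking ORDER BY column is the usual alternative.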

Q8: Difference between RANK() and ROW_NUMBER()?
Answer:

  • ROW_NUMBER(): Assigns a unique sequential number per row (even if ties).
  • RANK(): Same rank for ties, leaves gaps (1,2,2,4).
  • DENSE_RANK(): Same rank for ties, no gaps (1,2,2,3).
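The three definitions can be written out in a few lines of plain Python. The helper ranking_columns is a hypothetical name, and it assumes a single partition whose values are already sorted in the window's ORDER BY order.

```python
def ranking_columns(sorted_values):
    """Return (row_number, rank, dense_rank) lists for already-sorted values."""
    row_number, rank, dense_rank = [], [], []
    for i, value in enumerate(sorted_values):
        row_number.append(i + 1)  # always unique and sequential
        if i > 0 and value == sorted_values[i - 1]:
            rank.append(rank[-1])              # tie: reuse the previous rank
            dense_rank.append(dense_rank[-1])
        else:
            rank.append(i + 1)                 # position-based, so gaps follow ties
            dense_rank.append(dense_rank[-1] + 1 if dense_rank else 1)
    return row_number, rank, dense_rank


salaries_desc = [300, 200, 200, 100]
row_number, rank, dense_rank = ranking_columns(salaries_desc)
```

For these four salaries the helper yields 1,2,3,4 for ROW_NUMBER, 1,2,2,4 for RANK, and 1,2,2,3 for DENSE_RANK, matching the patterns listed above.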

25 Additional Quick Q&A

Question | Short Answer
1. What is a SparkSession? | The entry point for reading data and executing SQL (replaces SQLContext/HiveContext).
2. List four file formats. | Parquet, ORC, JSON, CSV (Parquet/ORC recommended for analytics).
3. What is a UDF? | User Defined Function. Allows custom logic in SQL/DataFrame code.
4. Why is a Python UDF slow? | Every row is serialized between the JVM and a Python worker process and back; Pandas UDFs cut this cost by exchanging Arrow batches.
5. How to use a SQL hint? | SELECT /*+ hint_name */ ... (e.g., BROADCAST, REPARTITION).
6. Define EXPLAIN command. | Shows the execution plan of a query (physical plan by default; EXPLAIN EXTENDED adds the logical plans).
7. Config for Auto Broadcast? | spark.sql.autoBroadcastJoinThreshold (default 10MB).
8. What is Predicate Pushdown? | Filters are pushed to the data source (e.g., Parquet) to skip reading irrelevant data.
9. Partition column guidelines? | Choose low-cardinality columns that are frequently filtered on (e.g., date). High-cardinality columns create thousands of partitions and many small files.
10. Difference: repartition vs coalesce? | repartition = full shuffle (increase/decrease partitions). coalesce = avoids a full shuffle (decrease only).
11. What is Tungsten? | Memory management and execution engine using off-heap storage and code generation.
12. How to handle Parquet schema evolution? | Use the mergeSchema option, or read as String and cast.
13. What is Dynamic Partition Pruning (DPP)? | The optimizer automatically filters partitions of a large table based on the join condition with a small, filtered table.
14. Config to control Shuffle partitions? | spark.sql.shuffle.partitions (default 200).
15. How to see Spark UI? | Port 4040 on the driver (default).
16. Create a temporary view? | df.createOrReplaceTempView("my_view")
17. What is a Skewed Join? | When one key has massively more data than others, causing one task to run far longer than the rest.
18. Why are small files bad? | Each file adds task-scheduling and file-listing overhead, and tiny files make columnar reads and predicate pushdown less effective.
19. How to write a single output file? | df.coalesce(1).write... (Danger: all data flows through a single task, which becomes a bottleneck).
20. What is FileNotFoundException in Shuffle? | Usually means shuffle files were lost because an executor or node died or a disk failed; Spark retries the affected stage, and checkpointing can shorten the recomputation.
21. Difference: DROP vs TRUNCATE? | DROP removes the table definition. TRUNCATE removes the data only.
22. What is a Broadcast Variable? | Read-only copy of data sent to all nodes (the same mechanism a Broadcast Join relies on).
23. Can you run UPDATE or DELETE? | Not in standard Spark SQL (unless using Delta Lake/Hudi).
24. What is Whole-Stage Codegen? | Compiles multiple operators (Filter + Project) into a single function to avoid virtual function calls.
25. Best config for memory? | Workload-dependent. A common starting point is spark.executor.memory = 16-32GB; off-heap memory (spark.memory.offHeap.size, e.g., 8-16GB) only takes effect when spark.memory.offHeap.enabled=true.

Spark SQL is one of the most important topics for Data Engineer, Big Data Engineer, Databricks Engineer, AWS Glue Developer, Azure Data Engineer, and Senior Spark Developer interviews.


1. What is Spark SQL?

Spark SQL is a module of Apache Spark that allows you to process structured and semi-structured data using SQL queries.

It provides:

  • SQL Interface
  • DataFrame API
  • Dataset API
  • Query Optimization
  • Hive Integration
  • JDBC Connectivity

Instead of writing complex RDD transformations, developers can use SQL-like syntax.

Example

df = spark.read.csv("employees.csv", header=True)

df.createOrReplaceTempView("employees")

result = spark.sql("""
SELECT department,
COUNT(*) as emp_count
FROM employees
GROUP BY department
""")

Why Spark SQL?

Traditional SQL Databases:

  • Limited scalability
  • Vertical scaling

Spark SQL:

  • Distributed processing
  • Parallel execution
  • In-memory computing
  • Petabyte scale processing

Spark SQL Architecture

SQL Query
|
Catalyst Optimizer
|
Logical Plan
|
Optimized Logical Plan
|
Physical Plan
|
Tungsten Engine
|
Execution

Components of Spark SQL

1. DataFrame

A distributed collection of data organized into named columns.

Similar to:

Table
Excel Sheet
Pandas DataFrame

Example:

df = spark.read.parquet("/data/sales")

2. Dataset

Type-safe version of DataFrame.

Available mainly in Scala and Java.

Example:

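import spark.implicits._  // encoders for .as[Employee]

case class Employee(id: Long, name: String)  // JSON integers are inferred as bigint (Long)

val ds = spark.read.json("emp.json").as[Employee]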

3. SQL Engine

Allows execution of SQL queries.

spark.sql("""
SELECT * FROM employees
""")

SparkSession

Entry point to Spark SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("SparkSQL") \
.getOrCreate()

Reading Data

CSV

df = spark.read \
.option("header","true") \
.csv("/data/file.csv")

JSON

df = spark.read.json("/data/file.json")

Parquet

df = spark.read.parquet("/data/file.parquet")

ORC

df = spark.read.orc("/data/file.orc")

Delta

df = spark.read.format("delta").load("/data")

Writing Data

df.write.mode("overwrite").parquet("/output")

Modes:

  • append
  • overwrite
  • ignore
  • errorifexists (default)

Creating Temporary Views

Temp View

Session scoped.

df.createOrReplaceTempView("emp")
spark.sql("SELECT * FROM emp")

Global Temp View

Application scoped.

df.createGlobalTempView("emp")
spark.sql("SELECT * FROM global_temp.emp")

Spark SQL Data Types

Numeric

ByteType
ShortType
IntegerType
LongType
FloatType
DoubleType
DecimalType

String

StringType

Date & Time

DateType
TimestampType

Complex Types

ArrayType
MapType
StructType

Example:

{
"name":"John",
"skills":["Python","Spark"]
}

Common SQL Operations

Select

SELECT * FROM employees

Filter

SELECT *
FROM employees
WHERE salary > 100000

Group By

SELECT department,
COUNT(*)
FROM employees
GROUP BY department

Order By

SELECT *
FROM employees
ORDER BY salary DESC

Joins in Spark SQL

Inner Join

SELECT *
FROM emp e
INNER JOIN dept d
ON e.dept_id=d.dept_id

Left Join

SELECT *
FROM emp e
LEFT JOIN dept d
ON e.dept_id=d.dept_id

Right Join

SELECT *
FROM emp e
RIGHT JOIN dept d
ON e.dept_id=d.dept_id

Full Join

SELECT *
FROM emp e
FULL OUTER JOIN dept d
ON e.dept_id=d.dept_id

Cross Join

SELECT *
FROM emp CROSS JOIN dept

Window Functions

Frequently asked in interviews.

Example:

SELECT *,
ROW_NUMBER() OVER
(
PARTITION BY dept
ORDER BY salary DESC
) rn
FROM employees

Rank

RANK()

Dense Rank

DENSE_RANK()

Lead

LEAD()

Lag

LAG()

Common Built-In Functions

String Functions

UPPER()
LOWER()
TRIM()
CONCAT()
SUBSTRING()

Date Functions

CURRENT_DATE()
DATE_ADD()
DATEDIFF()
MONTH()
YEAR()

Aggregate Functions

SUM()
AVG()
COUNT()
MAX()
MIN()

Catalyst Optimizer

Most important Spark SQL interview topic.

Catalyst performs:

Analysis

Validate query
Resolve columns
Resolve tables

Logical Optimization

Predicate Pushdown
Constant Folding
Column Pruning

Physical Planning

Choose execution strategy

Code Generation

Generate optimized JVM bytecode

Example of Predicate Pushdown

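Query:

SELECT *
FROM sales
WHERE amount > 1000

Without pushdown:
The entire dataset is read, then filtered in Spark.

With pushdown:
The filter is evaluated at the data source, so files and row groups that cannot contain matching rows are skipped.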

Benefits:

  • Less IO
  • Faster execution
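How a reader can skip data is easier to see with numbers. The sketch below is a toy model in plain Python, loosely based on the min/max statistics that Parquet keeps per row group. The data layout, the function scan_with_pushdown, and the values are assumptions for illustration, not a real Parquet reader.

```python
# Each "row group" carries min/max statistics for the amount column.
row_groups = [
    {"min": 10, "max": 900, "rows": [10, 450, 900]},
    {"min": 950, "max": 4000, "rows": [950, 1200, 4000]},
    {"min": 5, "max": 1000, "rows": [5, 1000]},
    {"min": 2000, "max": 9000, "rows": [2000, 9000]},
]


def scan_with_pushdown(groups, threshold):
    """Return rows with amount > threshold, skipping groups that cannot match."""
    matched, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove no row qualifies, so the group is never read
        groups_read += 1
        matched.extend(r for r in group["rows"] if r > threshold)
    return matched, groups_read


# Equivalent of: SELECT * FROM sales WHERE amount > 1000
matched, groups_read = scan_with_pushdown(row_groups, threshold=1000)
```

Only two of the four row groups are read. A group that is read may still contain non-matching rows (950 in the second group), so the filter is applied again on the rows that come back.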

Tungsten Engine

Spark’s execution engine.

Features:

Off-Heap Memory

Reduces JVM GC overhead.

Binary Processing

More compact storage.

Whole Stage Code Generation

Generates JVM bytecode.

Benefits:

  • Faster execution
  • Lower memory usage

Partitioning

Critical interview topic.

Repartition

Creates full shuffle.

df = df.repartition(200)

Coalesce

Avoids full shuffle.

df = df.coalesce(20)
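The practical difference shows up in how rows end up distributed. Here is a toy model in plain Python, with lists standing in for partitions. The two functions are hypothetical simplifications: real coalesce also considers data locality, and repartition(n) without columns distributes rows round-robin across the cluster.

```python
def coalesce(partitions, n):
    """Merge whole existing partitions into n groups; rows are never split up."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every row round-robin, which models a full shuffle."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]
coalesced = coalesce(parts, 2)        # cheap, but sizes stay uneven (7 vs 3 rows)
repartitioned = repartition(parts, 2)  # every row moves, sizes are balanced (5 vs 5)
```

This is the trade-off behind the usual advice: coalesce when reducing partitions cheaply is enough, repartition when the data must be rebalanced or the partition count increased.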

Caching

df.cache()

or

df.persist()

Storage Levels:

MEMORY_ONLY
MEMORY_AND_DISK
DISK_ONLY

Broadcast Join

Small table sent to all executors.

from pyspark.sql.functions import broadcast

df.join(broadcast(dim), "id")

Advantages:

  • Avoids shuffle
  • Faster joins
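Why the shuffle disappears is clear from a sketch of the mechanics. The following plain-Python model is a simplification with hypothetical names (broadcast_hash_join, the sample rows), and it assumes the dimension keys are unique: a hash map is built once from the small table and each partition of the large table is probed in place.

```python
def broadcast_hash_join(fact_partitions, dim_rows):
    """Join each fact partition against a hash map built once from the small table."""
    lookup = dict(dim_rows)  # the "broadcast" copy that every task would receive
    joined = []
    for partition in fact_partitions:  # partitions stay where they are: no shuffle
        joined.append([
            (key, amount, lookup[key]) for key, amount in partition if key in lookup
        ])
    return joined


dim = [(1, "books"), (2, "games")]
fact_partitions = [[(1, 9.5), (3, 4.0)], [(2, 20.0), (1, 1.5)]]
joined = broadcast_hash_join(fact_partitions, dim)
```

The output keeps the partitioning of the large table. The cost is that the whole small table must fit in memory wherever a task runs.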

Adaptive Query Execution (AQE)

Introduced in Spark 3.0; enabled by default since Spark 3.2.

Features:

  • Dynamic partition adjustment
  • Join strategy switching
  • Skew handling

Enable:

spark.conf.set(
"spark.sql.adaptive.enabled",
"true")

Explain Plan

df.explain(True)

Shows:

Parsed Logical Plan
Analyzed Logical Plan
Optimized Logical Plan
Physical Plan

Spark SQL Performance Optimization

Use Parquet

Good:

Parquet
ORC
Delta

Bad:

CSV
JSON

Column Pruning

Good:

SELECT id,name
FROM employees

Bad:

SELECT *
FROM employees

Filter Early

SELECT *
FROM sales
WHERE sale_date='2025-01-01'

Broadcast Small Tables

broadcast(dim_table)

Bucketing

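-- column list is illustrative; Spark needs a schema (or AS SELECT) to create the table
CREATE TABLE sales (customer_id BIGINT, amount DOUBLE)
USING parquet
CLUSTERED BY (customer_id)
INTO 100 BUCKETS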
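The benefit of bucketing two tables the same way can be shown with a small model. This plain-Python sketch uses a stand-in hash (crc32) and made-up rows; Spark uses a different hash function internally, and the bucket count of 4 is chosen only to keep the example small.

```python
import zlib

NUM_BUCKETS = 4


def bucket_of(customer_id):
    # Stand-in hash for illustration; not the function Spark uses.
    return zlib.crc32(str(customer_id).encode("utf-8")) % NUM_BUCKETS


def bucketize(rows):
    # Place each row in the bucket chosen by the hash of its first field.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[0])].append(row)
    return buckets


sales = [(101, 20.0), (102, 5.0), (101, 7.5), (103, 12.0)]
customers = [(101, "ann"), (102, "bob"), (103, "cy")]
sales_buckets, customer_buckets = bucketize(sales), bucketize(customers)

# Bucket i of one table only ever needs bucket i of the other table,
# so the join can proceed bucket by bucket without moving rows around.
joined = sorted(
    (cid, amount, name)
    for i in range(NUM_BUCKETS)
    for cid, amount in sales_buckets[i]
    for cid2, name in customer_buckets[i]
    if cid == cid2
)
```

This only works when both tables are bucketed on the join key with the same number of buckets, which is why the bucket count is part of the table definition.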

Spark SQL Interview Questions & Answers


Q1. What is Spark SQL?

Answer

Spark SQL is a Spark module used for structured data processing. It provides SQL queries, DataFrames, Datasets, Catalyst Optimizer, and integration with Hive.


Q2. Difference between RDD and DataFrame?

RDD | DataFrame
Low-level | High-level
No optimization | Catalyst optimization
Slower | Faster
No schema | Schema based

Q3. Difference between DataFrame and Dataset?

DataFrame | Dataset
Untyped | Typed
Python Supported | Scala/Java Mainly
Faster Development | Compile-time checks

Q4. What is Catalyst Optimizer?

Catalyst is Spark SQL’s query optimization framework.

Functions:

  • Query analysis
  • Logical optimization
  • Physical optimization
  • Code generation

Q5. What is Tungsten?

Execution engine that improves:

  • Memory management
  • CPU efficiency
  • Serialization

Q6. Explain Predicate Pushdown.

Filters are pushed to storage layer.

Instead of:

Read 1 TB
Filter later

Spark reads only required rows.


Q7. What is Column Pruning?

Reading only required columns.

Example:

SELECT id,name

instead of

SELECT *

Q8. What is Broadcast Join?

Small table distributed to all executors.

Used when:

Dimension table < 10 MB (config dependent)

Q9. What causes data skew?

When one partition contains significantly more data than others.

Example:

Customer A → 10M rows
Others → 100 rows

Q10. How do you fix data skew?

Methods:

  • Salting
  • AQE
  • Broadcast Join
  • Repartitioning

Q11. What is AQE?

Adaptive Query Execution dynamically optimizes query execution during runtime.


Q12. Difference Between Repartition and Coalesce?

Repartition | Coalesce
Shuffle | Less shuffle
Increase/decrease partitions | Mostly decrease
Expensive | Cheaper

Q13. Explain Window Functions.

Perform calculations across rows without collapsing records.

Examples:

ROW_NUMBER()
RANK()
DENSE_RANK()
LAG()
LEAD()

Q14. What is a Temporary View?

Session-scoped SQL view.

df.createOrReplaceTempView("emp")

Q15. What is a Global Temporary View?

Accessible across sessions within the same Spark application.

global_temp.viewname

Senior-Level Spark SQL Interview Questions

How would you optimize a 5 TB Spark SQL job?

Answer:

  1. Use Parquet/Delta
  2. Partition correctly
  3. Enable AQE
  4. Broadcast dimensions
  5. Avoid SELECT *
  6. Tune shuffle partitions
  7. Cache reused datasets
  8. Analyze explain plan
  9. Handle skew
  10. Use Z-Ordering (Delta)

How does Spark SQL execute a query internally?

Answer:

SQL Query
→ Parser
→ Logical Plan
→ Catalyst Optimizer
→ Optimized Logical Plan
→ Physical Plan
→ Tungsten Engine
→ DAG Scheduler
→ Tasks
→ Executors
→ Output

Must-Know Real-Time Spark SQL Scenarios

Scenario 1: Slow Join Query

Problem:

Fact Table = 5 TB
Dimension = 20 MB

Solution:

broadcast(dim)

Scenario 2: Too Many Small Files

Solution:

df.coalesce(10).write.mode("overwrite").parquet("/output")

Scenario 3: Data Skew

Solution:

Salting
AQE
Skew Join Optimization

Scenario 4: Memory Issues

Solution:

Persist MEMORY_AND_DISK
Increase executor memory
Avoid unnecessary cache

Top 25 Spark SQL Topics for Interviews

  1. Spark SQL Architecture
  2. SparkSession
  3. DataFrames
  4. Datasets
  5. Catalyst Optimizer
  6. Tungsten
  7. AQE
  8. Joins
  9. Broadcast Join
  10. Window Functions
  11. Partitioning
  12. Bucketing
  13. Caching
  14. Persist
  15. Explain Plan
  16. Predicate Pushdown
  17. Column Pruning
  18. Query Optimization
  19. Data Skew
  20. Shuffle
  21. Temp Views
  22. Hive Metastore
  23. Parquet
  24. Delta Lake Integration
  25. Performance Tuning

These topics cover most Spark SQL questions asked in Data Engineering, Databricks, AWS Glue, Azure Data Engineering, and Senior Big Data interviews.
