Complete PySpark Learning Guide + 150+ Interview Questions & Answers

PySpark

PySpark is one of the most important skills for Data Engineers, Data Architects, Big Data Engineers, Data Scientists, and Analytics Engineers. Most companies using AWS, Azure, Databricks, Microsoft Fabric, Hadoop, or modern Data Lakes rely heavily on PySpark.


1. What is PySpark?

PySpark is the Python API for Apache Spark.

Spark is a distributed computing framework designed for large-scale data processing.

It enables processing:

  • Terabytes/Petabytes of data
  • Structured data
  • Semi-structured data
  • Streaming data
  • Machine Learning workloads

Architecture

                Driver Program
|
SparkSession/SparkContext
|
Cluster Manager
(YARN/K8s/Standalone)
|
----------------------------
| | |
Executor1 Executor2 Executor3
| | |
Tasks Tasks Tasks

Components

  1. Driver
  2. Executor
  3. Cluster Manager
  4. Tasks
  5. Jobs
  6. Stages

2. Why Spark Instead of Hadoop MapReduce?

HadoopSpark
Disk-basedMemory-based
SlowFast
Batch onlyBatch + Streaming
Complex codingEasy APIs
No cachingSupports caching

Interview Question

Q: Why is Spark faster than Hadoop?

Answer:

Spark stores intermediate data in memory instead of repeatedly writing to disk.

This reduces I/O operations significantly and improves performance.


3. Spark Core Concepts

RDD (Resilient Distributed Dataset)

Fundamental Spark abstraction.

Characteristics:

  • Distributed
  • Immutable
  • Fault Tolerant
  • Parallel Processing

Example:

rdd = spark.sparkContext.parallelize([1,2,3,4,5])

DataFrame

Most commonly used structure.

df = spark.read.csv("employee.csv", header=True)

Advantages:

  • Optimized
  • SQL support
  • Catalyst Optimization
  • Tungsten Engine

Dataset

Available in Scala/Java.

Not directly supported in PySpark.


Interview Question

Q: RDD vs DataFrame?

RDDDataFrame
Low-level APIHigh-level API
No optimizationCatalyst optimized
SlowFaster
No schemaSchema supported

4. SparkSession

Entry point to Spark.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("Demo") \
.getOrCreate()

Interview Question

Q: Difference between SparkContext and SparkSession?

SparkContext

  • Old API
  • Core Spark functionality

SparkSession

  • Unified entry point
  • Replaces SQLContext
  • Replaces HiveContext

5. Reading Data

CSV

df = spark.read.csv(
"employee.csv",
header=True,
inferSchema=True
)

JSON

df = spark.read.json("employee.json")

Parquet

df = spark.read.parquet("employee.parquet")

Delta

df = spark.read.format("delta").load(path)

Interview Question

Q: Why Parquet is preferred?

Answer

Parquet is:

  • Columnar
  • Compressed
  • Faster reads
  • Supports Predicate Pushdown

6. DataFrame Transformations

Select

df.select("name","salary")

Filter

df.filter(df.salary > 50000)

WithColumn

from pyspark.sql.functions import col

df = df.withColumn(
"bonus",
col("salary") * 0.1
)

Drop

df.drop("bonus")

Interview Question

Q: What are Transformations?

Transformations create a new DataFrame but are not executed immediately.

Examples:

  • select()
  • filter()
  • join()
  • groupBy()

Spark uses Lazy Evaluation.


7. Actions

Actions trigger execution.

Examples:

count()
collect()
show()
first()
take()

Interview Question

Q: Difference between Transformation and Action?

Transformation:

df.filter(col("salary") > 50000)

No execution.

Action:

df.count()

Triggers execution.


8. Lazy Evaluation

Spark delays execution until action is called.

Example:

df1 = df.filter(...)
df2 = df1.select(...)

df2.show()

Actual execution happens only at show().


Benefits

  • Query optimization
  • Reduced computation
  • Better execution plans

9. DAG (Directed Acyclic Graph)

Spark creates DAG before execution.

Read File
|
Filter
|
Join
|
Aggregate
|
Write

Interview Question

Q: What is DAG?

A logical execution plan that Spark builds before running a job.


10. Narrow vs Wide Transformation

Narrow

No shuffle.

Examples:

map()
filter()
select()

Wide

Requires shuffle.

Examples:

join()
groupBy()
distinct()
repartition()

Interview Question

Q: Why are wide transformations expensive?

Because data moves across executors causing network shuffle.


11. Joins

Inner Join

df1.join(df2,"id","inner")

Left Join

df1.join(df2,"id","left")

Right Join

df1.join(df2,"id","right")

Full Join

df1.join(df2,"id","full")

Anti Join

df1.join(df2,"id","left_anti")

Interview Question

Q: Which join is used for CDC?

Left Anti Join

new_df.join(old_df,"id","left_anti")

Returns newly inserted records.


12. Broadcast Join

Used when one table is small.

from pyspark.sql.functions import broadcast

df.join(
broadcast(dim_df),
"id"
)

Interview Question

Q: Why use Broadcast Join?

Avoids shuffle and improves performance.


13. Aggregations

df.groupBy("department") \
.sum("salary")
df.groupBy("department") \
.agg({"salary":"avg"})

14. Window Functions

Important interview topic.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy("dept") \
.orderBy("salary")

df.withColumn(
"rank",
row_number().over(windowSpec)
)

Common Window Functions

  • row_number()
  • rank()
  • dense_rank()
  • lag()
  • lead()

Interview Question

Q: Difference between rank and dense_rank?

Example:

Salary
100
100
90

Rank:

1
1
3

Dense Rank:

1
1
2

15. Partitioning

Partitions determine parallelism.

Check:

df.rdd.getNumPartitions()

Repartition

df.repartition(10)

Shuffle occurs.


Coalesce

df.coalesce(2)

No full shuffle.


Interview Question

Q: Repartition vs Coalesce?

RepartitionCoalesce
ShuffleMinimal shuffle
Increase/decreaseMostly decrease
ExpensiveCheaper

16. Caching and Persistence

Cache

df.cache()

Persist

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)

Interview Question

Q: Cache vs Persist?

Cache:

MEMORY_ONLY

Persist:

Multiple storage options.


17. UDFs

User Defined Functions.

from pyspark.sql.functions import udf

def bonus(salary):
return salary * 0.1

bonus_udf = udf(bonus)

Interview Question

Q: Why avoid UDFs?

  • Break Catalyst Optimization
  • Slower execution
  • Serialization overhead

Use built-in functions whenever possible.


18. Catalyst Optimizer

Spark SQL optimization engine.

Performs:

  • Predicate Pushdown
  • Constant Folding
  • Join Reordering
  • Projection Pruning

Interview Question

Q: What is Catalyst Optimizer?

Rule-based and cost-based query optimizer that generates optimized execution plans.


19. Tungsten Engine

Responsible for:

  • Memory optimization
  • Code generation
  • CPU optimization

20. Explain Plan

df.explain(True)

Shows:

  1. Parsed Plan
  2. Analyzed Plan
  3. Optimized Plan
  4. Physical Plan

21. Handling Skewed Data

Major interview topic.

Problem

CustomerID
1
1
1
1
1
1
...

One partition becomes huge.

Solutions

  1. Salting
  2. Broadcast Join
  3. AQE
  4. Repartition

Interview Question

Q: What is Data Skew?

Uneven distribution causing few executors to process most data.


22. Adaptive Query Execution (AQE)

Spark 3 feature.

spark.conf.set(
"spark.sql.adaptive.enabled",
"true"
)

Benefits:

  • Dynamic optimization
  • Better joins
  • Better partition sizing

23. PySpark Streaming

stream_df = spark.readStream \
.format("kafka") \
.load()

Components

  • Source
  • Transformation
  • Sink

Interview Question

Q: Difference between Batch and Streaming?

Batch:

Processes historical data.

Streaming:

Processes continuous data.


24. Delta Lake + PySpark

Widely asked in Databricks/Fabric interviews.

Features:

  • ACID Transactions
  • Time Travel
  • Schema Evolution
  • Merge Support

Merge

deltaTable.alias("target").merge(
sourceDF.alias("source"),
"target.id = source.id"
)

Interview Question

Q: How do you implement SCD Type 2 in PySpark?

Answer:

Using Delta Merge:

  1. Expire old records
  2. Insert new version
  3. Maintain start/end dates

25. Top Advanced Interview Questions

Q1: Explain Spark Job Lifecycle.

  1. Driver receives action.
  2. DAG created.
  3. Stages generated.
  4. Tasks assigned.
  5. Executors execute.
  6. Results returned.

Q2: Explain Shuffle.

Redistribution of data across partitions.

Occurs during:

  • Join
  • GroupBy
  • Distinct

Expensive due to network transfer.


Q3: How do you optimize a slow Spark job?

Answer:

  • Use Parquet
  • Avoid UDFs
  • Use Broadcast Join
  • Cache reused DataFrames
  • Reduce Shuffle
  • Enable AQE
  • Optimize partitions
  • Use Predicate Pushdown

Q4: What causes OutOfMemory errors?

  • Large collect()
  • Data skew
  • Too many cached datasets
  • Improper partition sizing

Q5: collect() vs take()

collect():

Returns all records

take():

Returns limited records

take() is safer.


Q6: How many partitions should you create?

Rule:

2–4 partitions per CPU core

Depends on:

  • Cluster size
  • Data volume
  • Workload

Real-Time Scenario Questions

Scenario 1

100M records join taking 2 hours.

Solution:

  • Broadcast small table
  • AQE
  • Repartition join key
  • Analyze skew

Scenario 2

One executor running for 45 minutes.

Cause:

Data skew.

Fix:

  • Salting
  • Repartitioning
  • AQE

Scenario 3

Daily Incremental Load

Approach:

Source → Bronze
Bronze → Silver
Silver → Gold

Use watermark or CDC columns.


Senior Data Engineer Questions

Q: Design ETL for 1 Billion Records Daily.

Answer:

  • Ingest via Kafka/Kinesis
  • Store in Data Lake
  • Process via PySpark
  • Delta Lake storage
  • Partition by date
  • Incremental processing
  • AQE enabled
  • Broadcast dimensions
  • Monitoring with Spark UI

Q: Explain Spark UI.

Tabs:

  1. Jobs
  2. Stages
  3. Storage
  4. Environment
  5. Executors
  6. SQL

Used for troubleshooting performance bottlenecks.


Most Important PySpark Interview Topics (Must Prepare)

Beginner

  • Spark Architecture
  • RDD
  • DataFrame
  • SparkSession
  • Transformations
  • Actions

Intermediate

  • Joins
  • Window Functions
  • Partitioning
  • Caching
  • UDFs
  • Parquet

Advanced

  • Catalyst Optimizer
  • Tungsten
  • AQE
  • Data Skew
  • Shuffle
  • Broadcast Join

Senior Level

  • Spark Internals
  • Job Lifecycle
  • Cluster Sizing
  • Delta Lake
  • SCD Type 2
  • Streaming Architecture
  • Performance Tuning
  • Lakehouse Design
  • Databricks/Fabric Integration

For U.S. Data Engineering interviews (AWS, Databricks, Microsoft Fabric, Healthcare, Banking, Retail, and Life Sciences), the highest-priority PySpark topics are Spark Internals, DataFrame APIs, Joins, Window Functions, Delta Lake, Performance Tuning, AQE, Data Skew handling, Streaming, and large-scale ETL design scenarios. These are asked far more frequently than basic syntax questions.

This comprehensive guide covers PySpark from foundational concepts to advanced optimization techniques, including detailed interview questions and answers. The material is organized for both systematic learning and quick reference for interview preparation .


🚀 Complete PySpark Learning & Interview Guide

Part 1: Learning PySpark – A Structured Roadmap

Mastering PySpark involves understanding its architecture, core concepts, and best practices. Here’s a step-by-step learning path, transitioning from a SQL mindset to distributed computing .

1. Foundation: Spark Architecture & Setup

Before writing code, understand the “What” and “Why” of Spark.

  • What is Apache Spark? A unified analytics engine for large-scale data processing. It’s known for its speed (in-memory computing) and ease of use .
  • Core Components: Spark Core, Spark SQL, Spark Streaming, MLlib (Machine Learning), and GraphX .
  • SparkSession: The unified entry point for all Spark operations. In older versions, SparkContext was used, but now SparkSession encapsulates it .pythonfrom pyspark.sql import SparkSession spark = SparkSession.builder.appName(“MyApp”).master(“local[*]”).getOrCreate()
  • Lazy Evaluation: Transformations are not executed immediately. Spark builds a DAG (Directed Acyclic Graph) of computations, and only an Action triggers the execution . This allows for query optimization.
  • Driver vs. Executors: The Driver coordinates the main function and splits work into tasks. Executors run the tasks on worker nodes.

2. Core Abstractions: RDDs vs DataFrames vs Datasets

This is the most critical concept. While you will primarily use DataFrames in PySpark, understanding RDDs helps with debugging .

FeatureRDD (Resilient Distributed Dataset)DataFrameDataset (Scala/Java only)
StructureLow-level, collection of Java objects. No schema.Row x Column format (like SQL table) with a schema.Typed, object-oriented.
OptimizationManual optimization. No Catalyst Optimizer.Uses Catalyst Optimizer and Tungsten for speed.Combines compile-time safety with Catalyst.
Ease of UseSteep learning curve (mapfilter).Easy. Uses SQL-like syntax and built-in functions.Similar to DataFrames.
When to UseNeed fine-grained control or unstructured data.Default choice for PySpark. Best for structured/semi-structured data and SQL.Not available in PySpark.
PerformanceSlower (no optimizations)Faster (catalyst + tungsten)Very Fast
  • Rule of Thumb: Use DataFrames. Use RDDs only when you need to manipulate data at a granular level that DataFrames don’t support .

3. DataFrame Operations: Transformations vs. Actions

  • Transformations (Lazy): Define the new DataFrame. Examples: select()filter()join()groupBy()orderBy().
  • Actions (Eager): Trigger computation and return results. Examples: show()count()collect()take(n)write().

Code Example:

python

# Transformation (Lazy - No execution yet)
df = spark.read.csv("data.csv", header=True, inferSchema=True)
filtered_df = df.filter(df.age > 18) 
selected_df = filtered_df.select("name", "age")

# Action (Execution Happens)
selected_df.show() 
total_count = selected_df.count() 

4. Advanced Concepts & Optimization

To pass interviews, you must know how to make PySpark fast.

  • Partitioning: Data is split into partitions. Ideal partition size is ~200MB.
    • repartition(n): Increases/decreases partitions. Causes a full shuffle (expensive). Use to increase parallelism before joins .
    • coalesce(n): Decreases partitions. Does not cause a full shuffle (fast). Use to reduce output file sizes .
  • Caching/Persisting: If you reuse a DataFrame multiple times, df.cache() or df.persist() stores it in memory/RAM, preventing recomputation .
  • Broadcast Joins: If one table is small (fits in driver memory), broadcast it. Spark sends a copy to all executors, avoiding expensive shuffles .pythonfrom pyspark.sql.functions import broadcast large_df.join(broadcast(small_df), “key”)
  • UDFs (User Defined Functions): Python UDFs are slow because they serialize data between JVM and Python (one row at a time). Prefer Pandas UDFs (Vectorized) or built-in Spark functions .
  • Spark UI: The port 4040 web interface is your debugging best friend. It shows exactly how long tasks took and how much data was shuffled.

Part 2: Interview Questions & Answers (Detailed)

Here are the top questions categorized by difficulty, with detailed answers you can use in interviews.

🔹 Beginner Level (Freshers)

Q1: What is PySpark?
Answer: PySpark is the Python API for Apache Spark. It allows Python developers to write Spark applications using Python, leveraging Spark’s distributed computing power. It provides libraries for SQL (pyspark.sql), streaming, and machine learning (MLlib). Under the hood, it uses Py4J to allow Python to interface with the Java Virtual Machine (JVM) running Spark .

Q2: What is the difference between map() and flatMap()?
Answer: Both are transformations.

  • map() returns a new RDD/Column by applying a function to each element, outputting exactly one element per input element.
  • flatMap() also applies a function, but the function can return a sequence of elements (e.g., a list). It then flattens this list into multiple elements.
    Example: If map splits a sentence into words, you get [['Hello', 'World'], ['Hi']]. If flatMap does it, you get ['Hello', 'World', 'Hi'].

Q3: How do you handle null values?
Answer: PySpark provides several methods:

  • df.dropna(): Drops rows containing null values.
  • df.fillna(value): Replaces nulls with a specific value (e.g., df.fillna(0, subset=["age"])).
  • df.na.drop() / df.na.fill(): Alternative syntax .

Q4: Explain withColumnRenamed vs withColumn.
Answer:

  • withColumnRenamed(existing, new): Used to change the name of an existing column.
  • withColumn(colName, colExpr): Used to add a new column or replace an existing column (if the name matches) based on an expression.

🔹 Intermediate Level (Experienced)

Q5: How does Lazy Evaluation work in Spark? Why is it beneficial?
Answer: Transformations (like mapfilter) are lazily evaluated, meaning Spark does not execute them immediately. Instead, it builds a DAG (Directed Acyclic Graph) of the computation plan. Only when an Action (like countshowwrite) is called does Spark execute the DAG.
Benefits:

  1. Optimization: Catalyst optimizer can reorder operations (e.g., push down filters) to reduce data transfer.
  2. Fault Tolerance: If a node fails, Spark knows exactly which partitions to recompute by looking at the lineage graph.
  3. Memory Management: Avoids loading intermediate data into memory unnecessarily .

Q6: My job is running slow. How would you optimize it?
Answer: I would follow a systematic debugging process:

  1. Check Spark UI: Look for “red” stages or long-running tasks to identify data skew or shuffle bottlenecks.
  2. Avoid Shuffles: If joins are causing shuffles, check if a Broadcast Join applies (small table).
  3. Filter Early: Push filters (predicate pushdown) as early as possible in the read/transform process to reduce data size.
  4. Tune Partitions: Ensure spark.sql.shuffle.partitions is set appropriately (default 200 is often too low for TB data, too high for MB data).
  5. Remove UDFs: Replace Python UDFs with Spark SQL built-in functions. If necessary, use Vectorized Pandas UDFs .

Q7: Explain Data Skew and how to fix it.
Answer: Data skew occurs when one partition contains significantly more data than others, causing a few executors to work much longer than others (straggler tasks).
Fixes:

  • Salting: Add a random prefix to the skewed key to break it into multiple keys, join with a salted lookup table, then remove the salt.
  • Broadcast Hash Join: If the skewed table is large but the other is small, broadcast the small one.
  • Isolate Skewed Values: Filter out the skewed key, process it separately (e.g., broadcast), and union it back with the rest.

Q8: What are the different Join types in PySpark?
Answer: PySpark supports standard SQL joins:

  • Inner: Returns rows with matching keys.
  • Left Outer / Right Outer: Returns all rows from left/right side, nulls for non-matching.
  • Full Outer: Returns all rows from both sides.
  • Left Semi: Returns rows from left side that exist in right side (does not include right columns).
  • Left Anti: Returns rows from left side that do not exist in right side (filters out matches) .
  • Cross: Cartesian product (use with caution).

🔹 Advanced Level & Scenario Based

Q9: You have 1,000 small files in a directory. Reading them is slow. Why?
Answer: This is the “Small File Problem”.

  • Spark has overhead in scheduling tasks. 1,000 files = 1,000 partitions (potentially).
  • Listing files in HDFS/S3 involves heavy NameNode/RPC operations.
  • Solution: Use spark.sql.files.maxPartitionBytes to combine small files during read, or use coalesce(1) after reading, or use file compaction techniques (e.g., Delta Lake OPTIMIZE).

Q10: df.cache() vs df.persist() vs checkpoint()?
Answer:

  • cache(): Defaults to storing DataFrame in memory (MEMORY_AND_DISK). Use for reusable intermediate results.
  • persist(StorageLevel): Allows specifying storage level (e.g., DISK_ONLYMEMORY_AND_DISK_SER). Gives fine-grained control over memory/disk usage.
  • checkpoint(): Saves the RDD/DataFrame to HDFS/S3 (reliable storage) and breaks the lineage graph. This is useful for very long DAGs to prevent stack overflow or to cut off lengthy recovery chains.

Q11: Write a code to calculate a cumulative sum using Window functions.
Answer:

python

from pyspark.sql.window import Window
from pyspark.sql.functions import sum, col

window_spec = Window.partitionBy("department").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)

df_with_cumsum = df.withColumn("cumulative_sales", sum("sales").over(window_spec))

Q12: How to convert a PySpark DataFrame to a Pandas DataFrame and vice versa?
Answer:

  • PySpark → Pandas:pandas_df = spark_df.toPandas()
    • Warning: This collects all data to the driver. Only use on small DataFrames.
  • Pandas → PySpark:spark_df = spark.createDataFrame(pandas_df)
    • Spark infers the schema automatically.
    • For large data, write Pandas to Parquet/CSV on disk, then use spark.read .

Part 3: Key Resources & Cheat Sheet

Essential Functions Quick Reference

CategoryFunctions
Read Dataspark.read.csv().parquet().json().jdbc()
Select/Filterdf.select()df.filter()df.where()df.drop()
Aggregationdf.groupBy()df.agg()df.pivot()
Joinsdf.join(other_df, on="key", how="inner")
WindowWindow.partitionBy().orderBy().rowsBetween()
Performancedf.cache()df.persist()df.unpersist()df.repartition()df.coalesce()

Official & Learning Resources

  • Official Documentation: spark.apache.org/docs/latest/api/python/ (Always check for version compatibility) .
  • Databricks Notebooks: Use community edition for hands-on practice. Follow the SQL-to-PySpark transition path .
  • Best Practices: Always prefer built-in functions over UDFs; use broadcast joins; avoid collect() on production data .

PySpark is the Python API for Apache Spark, a unified analytics engine for large-scale data processing. It combines Python’s ease of use with Spark’s distributed computing power for handling massive datasets efficiently across clusters.

PySpark supports Spark SQL, DataFrames, Structured Streaming, MLlib (machine learning), and more, making it popular for ETL, analytics, ML, and real-time processing.

1. Apache Spark Architecture (Core Concepts)

Spark uses a master-worker architecture:

  • Driver Program: Runs the main application, creates SparkSession/SparkContext, builds the DAG (Directed Acyclic Graph) of operations, and coordinates tasks. It can become a bottleneck (e.g., OOM on driver).
  • Cluster Manager: Allocates resources (e.g., Standalone, YARN, Kubernetes, Mesos).
  • Executors/Workers: Run tasks on nodes; handle data processing and storage in memory/disk.
  • Tasks, Stages, Jobs: A job consists of stages (narrow/wide transformations); stages break into tasks per partition.

Key Abstractions:

  • RDD (Resilient Distributed Dataset): Low-level, immutable, partitioned collection of objects. Fault-tolerant via lineage (recompute lost partitions). Supports transformations (lazy) and actions (eager).
  • DataFrame: High-level, structured data (like a table with named columns). Built on RDDs but optimized with Catalyst Optimizer (query planning) and Tungsten (memory/execution efficiency). Preferred for most use cases.
  • Dataset: Typed DataFrame (stronger in Scala/Java; limited type safety in Python).

RDD vs DataFrame vs Dataset (key differences):

  • RDD: Low-level, no schema, full control, more code/less optimization. Good for unstructured/custom logic.
  • DataFrame: High-level, schema, SQL-like, Catalyst optimizations, faster for structured data.
  • Dataset: Combines both (type-safe, object-oriented).

Lazy Evaluation: Transformations (e.g., filter, map, select) build the DAG but don’t execute until an action (e.g., count, collect, write). This enables optimization.

Narrow vs Wide Transformations:

  • Narrow: No shuffle (e.g., map, filter); data stays in same partition.
  • Wide: Requires shuffle (e.g., groupBy, join); data moves across nodes, expensive.

Partitioning: Data divided into partitions for parallelism. Tune with repartition() (increase, full shuffle) or coalesce() (decrease, avoid full shuffle).

Broadcast Variables & Accumulators: Broadcast for read-only shared data (avoids shuffle); accumulators for aggregation (e.g., counters).

2. Getting Started

Install via pip install pyspark (or use Databricks, EMR, etc.).

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local[*]") \  # or yarn, k8s
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

Read/Write Data:

  • spark.read.csv/json/parquet/orc/jdbc/…
  • df.write.mode(“overwrite”).parquet(…)

Core Operations (DataFrame API):

  • Selection: df.select(“col1”, col(“col2”) + 1)
  • Filtering: df.filter(col(“age”) > 30)
  • Grouping/Aggregation: df.groupBy(“dept”).agg(avg(“salary”))
  • Joins: df1.join(df2, on=”id”, how=”inner”) (broadcast for small tables)
  • Window functions: For ranking, running totals, etc.

Spark SQL: Register as temp view and run SQL queries.

Pandas API on Spark (pyspark.pandas): Familiar pd.DataFrame syntax on Spark backend.

3. Key Modules

  • Spark SQL & DataFrames: Structured processing.
  • Structured Streaming: For real-time data (micro-batch or continuous).
  • MLlib: Scalable ML (feature engineering, classification, clustering, pipelines).
  • GraphX/GraphFrames: Graphs (less common in PySpark).
  • Spark Connect: For remote execution.

Data Types: Primitives, complex (Array, Map, Struct), semi-structured (JSON).

4. Best Practices & Optimization

  • Use explicit schemas for performance and consistency.
  • Prefer built-in functions over UDFs (UDFs are slower; use Pandas UDFs for vectorized Python code when needed).
  • Minimize shuffles: Broadcast joins, salting for skew, repartition strategically.
  • Caching/Persistence: df.cache() or persist(StorageLevel) for reused DataFrames.
  • Checkpointing: For long lineages (saves to disk).
  • File formats: Prefer Parquet (columnar, compressed, schema).
  • Monitor with Spark UI (stages, tasks, shuffle, skew).
  • Tune configs: Executor memory/cores, parallelism (spark.sql.shuffle.partitions), AQE (Adaptive Query Execution).
  • Avoid collecting large data to driver; use approximations or sampling.

Other tips: Leverage Catalyst, avoid Python loops, handle skew, use Delta Lake for reliability on Databricks.

5. Learning Resources

  • Official Docs: Spark 4.x User Guide (DataFrames, SQL, etc.).
  • Tutorials: Databricks, YouTube full courses, TutorialsPoint, Analytics Vidhya.
  • Books/Practice: “Spark: The Definitive Guide”, Databricks Academy, LeetCode-style problems.
  • Hands-on: Local setup, Databricks Community Edition, or Kaggle notebooks.

Practice on real datasets (CSV → Parquet ETL, joins, window funcs, ML pipelines).

6. Detailed PySpark Interview Questions & Answers

Here’s a curated list of common questions (freshers to advanced), based on popular sources.

Basics:

  1. What is PySpark? Advantages over Pandas? Python API for Spark. Scalable, distributed, fault-tolerant vs. Pandas (single-node, in-memory).
  2. How to create SparkSession? See code above. Entry point for DataFrames/SQL.
  3. RDD vs DataFrame? As detailed earlier. Use DataFrames for most cases.
  4. Lazy Evaluation? Transformations deferred until action; optimizes DAG.

Intermediate: 5. Transformations vs Actions? Transformations (lazy, e.g., map, filter); Actions (trigger computation, e.g., count, show).

  1. Handling Missing Data? dropna(), fillna(), Imputer from MLlib.
  2. Joins & Types? Inner, left, right, outer, cross. Use broadcast for small side.
  3. Caching vs Persist? cache() = MEMORY_ONLY; persist() allows levels (MEMORY_AND_DISK, etc.).
  4. Partitioning: repartition vs coalesce? repartition (full shuffle, any number); coalesce (minimize shuffle, reduce only).
  5. Window Functions? Window.partitionBy(“col”).orderBy(“col2”) + row_number(), rank(), etc.

Advanced: 11. Narrow vs Wide Transformations? Narrow: no shuffle; Wide: shuffle involved.

  1. Catalyst Optimizer? Rule-based query optimizer for DataFrames/SQL; logical → physical plan.
  2. Spark Driver Responsibilities? Orchestrates, schedules, UI, collects results.
  3. Skew Handling? Salting (add random key), broadcast, AQE, custom partitioning.
  4. UDFs Performance? Avoid; use pandas_udf or native funcs. Serialization overhead.
  5. Checkpointing? Materializes RDD/DataFrame to reliable storage; breaks lineage for recovery.
  6. Broadcast Variables Use Case? Small lookup tables, models — sent once to executors.
  7. Performance Tuning? Schema, Parquet, caching, AQE, shuffle partitions, avoid UDFs/driver OOM, monitor UI.
  8. Real-time Scenario: Process streaming data with watermarks, joins, aggregations in Structured Streaming.
  9. Differences: PySpark vs Spark (Scala)? Same engine; PySpark has overhead (Py4J) but easier for Python users/ML.

Coding Questions (common):

  • Rank employees by salary per dept.
  • Find users with >N events in window.
  • ETL: Clean, join, aggregate sales data.
  • Handle skewed joins or large groupBy.

Practice explaining code + execution plan (df.explain()).

For deeper dives, refer to official docs, Databricks blogs, and practice on clusters. Focus on why (optimizations, fault tolerance) over just how. Good luck with learning and interviews!

🤞 Sign up for our newsletter!

We don’t spam! Read more in our privacy policy

Scroll to Top