PySpark is one of the most important skills for Data Engineers, Data Architects, Big Data Engineers, Data Scientists, and Analytics Engineers. Most companies using AWS, Azure, Databricks, Microsoft Fabric, Hadoop, or modern Data Lakes rely heavily on PySpark.
1. What is PySpark?
PySpark is the Python API for Apache Spark.
Spark is a distributed computing framework designed for large-scale data processing.
It enables processing:
- Terabytes/Petabytes of data
- Structured data
- Semi-structured data
- Streaming data
- Machine Learning workloads
Architecture
Driver Program
|
SparkSession/SparkContext
|
Cluster Manager
(YARN/K8s/Standalone)
|
----------------------------
| | |
Executor1 Executor2 Executor3
| | |
Tasks Tasks TasksComponents
- Driver
- Executor
- Cluster Manager
- Tasks
- Jobs
- Stages
2. Why Spark Instead of Hadoop MapReduce?
| Hadoop | Spark |
|---|---|
| Disk-based | Memory-based |
| Slow | Fast |
| Batch only | Batch + Streaming |
| Complex coding | Easy APIs |
| No caching | Supports caching |
Interview Question
Q: Why is Spark faster than Hadoop?
Answer:
Spark stores intermediate data in memory instead of repeatedly writing to disk.
This reduces I/O operations significantly and improves performance.
3. Spark Core Concepts
RDD (Resilient Distributed Dataset)
Fundamental Spark abstraction.
Characteristics:
- Distributed
- Immutable
- Fault Tolerant
- Parallel Processing
Example:
rdd = spark.sparkContext.parallelize([1,2,3,4,5])DataFrame
Most commonly used structure.
df = spark.read.csv("employee.csv", header=True)Advantages:
- Optimized
- SQL support
- Catalyst Optimization
- Tungsten Engine
Dataset
Available in Scala/Java.
Not directly supported in PySpark.
Interview Question
Q: RDD vs DataFrame?
| RDD | DataFrame |
|---|---|
| Low-level API | High-level API |
| No optimization | Catalyst optimized |
| Slow | Faster |
| No schema | Schema supported |
4. SparkSession
Entry point to Spark.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("Demo") \
.getOrCreate()Interview Question
Q: Difference between SparkContext and SparkSession?
SparkContext
- Old API
- Core Spark functionality
SparkSession
- Unified entry point
- Replaces SQLContext
- Replaces HiveContext
5. Reading Data
CSV
df = spark.read.csv(
"employee.csv",
header=True,
inferSchema=True
)JSON
df = spark.read.json("employee.json")Parquet
df = spark.read.parquet("employee.parquet")Delta
df = spark.read.format("delta").load(path)Interview Question
Q: Why Parquet is preferred?
Answer
Parquet is:
- Columnar
- Compressed
- Faster reads
- Supports Predicate Pushdown
6. DataFrame Transformations
Select
df.select("name","salary")Filter
df.filter(df.salary > 50000)WithColumn
from pyspark.sql.functions import col
df = df.withColumn(
"bonus",
col("salary") * 0.1
)Drop
df.drop("bonus")Interview Question
Q: What are Transformations?
Transformations create a new DataFrame but are not executed immediately.
Examples:
- select()
- filter()
- join()
- groupBy()
Spark uses Lazy Evaluation.
7. Actions
Actions trigger execution.
Examples:
count()
collect()
show()
first()
take()Interview Question
Q: Difference between Transformation and Action?
Transformation:
df.filter(col("salary") > 50000)No execution.
Action:
df.count()Triggers execution.
8. Lazy Evaluation
Spark delays execution until action is called.
Example:
df1 = df.filter(...)
df2 = df1.select(...)
df2.show()Actual execution happens only at show().
Benefits
- Query optimization
- Reduced computation
- Better execution plans
9. DAG (Directed Acyclic Graph)
Spark creates DAG before execution.
Read File
|
Filter
|
Join
|
Aggregate
|
WriteInterview Question
Q: What is DAG?
A logical execution plan that Spark builds before running a job.
10. Narrow vs Wide Transformation
Narrow
No shuffle.
Examples:
map()
filter()
select()Wide
Requires shuffle.
Examples:
join()
groupBy()
distinct()
repartition()Interview Question
Q: Why are wide transformations expensive?
Because data moves across executors causing network shuffle.
11. Joins
Inner Join
df1.join(df2,"id","inner")Left Join
df1.join(df2,"id","left")Right Join
df1.join(df2,"id","right")Full Join
df1.join(df2,"id","full")Anti Join
df1.join(df2,"id","left_anti")Interview Question
Q: Which join is used for CDC?
Left Anti Join
new_df.join(old_df,"id","left_anti")Returns newly inserted records.
12. Broadcast Join
Used when one table is small.
from pyspark.sql.functions import broadcast
df.join(
broadcast(dim_df),
"id"
)Interview Question
Q: Why use Broadcast Join?
Avoids shuffle and improves performance.
13. Aggregations
df.groupBy("department") \
.sum("salary")df.groupBy("department") \
.agg({"salary":"avg"})14. Window Functions
Important interview topic.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
windowSpec = Window.partitionBy("dept") \
.orderBy("salary")
df.withColumn(
"rank",
row_number().over(windowSpec)
)Common Window Functions
- row_number()
- rank()
- dense_rank()
- lag()
- lead()
Interview Question
Q: Difference between rank and dense_rank?
Example:
Salary
100
100
90Rank:
1
1
3Dense Rank:
1
1
215. Partitioning
Partitions determine parallelism.
Check:
df.rdd.getNumPartitions()Repartition
df.repartition(10)Shuffle occurs.
Coalesce
df.coalesce(2)No full shuffle.
Interview Question
Q: Repartition vs Coalesce?
| Repartition | Coalesce |
|---|---|
| Shuffle | Minimal shuffle |
| Increase/decrease | Mostly decrease |
| Expensive | Cheaper |
16. Caching and Persistence
Cache
df.cache()Persist
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)Interview Question
Q: Cache vs Persist?
Cache:
MEMORY_ONLYPersist:
Multiple storage options.
17. UDFs
User Defined Functions.
from pyspark.sql.functions import udf
def bonus(salary):
return salary * 0.1
bonus_udf = udf(bonus)Interview Question
Q: Why avoid UDFs?
- Break Catalyst Optimization
- Slower execution
- Serialization overhead
Use built-in functions whenever possible.
18. Catalyst Optimizer
Spark SQL optimization engine.
Performs:
- Predicate Pushdown
- Constant Folding
- Join Reordering
- Projection Pruning
Interview Question
Q: What is Catalyst Optimizer?
Rule-based and cost-based query optimizer that generates optimized execution plans.
19. Tungsten Engine
Responsible for:
- Memory optimization
- Code generation
- CPU optimization
20. Explain Plan
df.explain(True)Shows:
- Parsed Plan
- Analyzed Plan
- Optimized Plan
- Physical Plan
21. Handling Skewed Data
Major interview topic.
Problem
CustomerID
1
1
1
1
1
1
...One partition becomes huge.
Solutions
- Salting
- Broadcast Join
- AQE
- Repartition
Interview Question
Q: What is Data Skew?
Uneven distribution causing few executors to process most data.
22. Adaptive Query Execution (AQE)
Spark 3 feature.
spark.conf.set(
"spark.sql.adaptive.enabled",
"true"
)Benefits:
- Dynamic optimization
- Better joins
- Better partition sizing
23. PySpark Streaming
stream_df = spark.readStream \
.format("kafka") \
.load()Components
- Source
- Transformation
- Sink
Interview Question
Q: Difference between Batch and Streaming?
Batch:
Processes historical data.
Streaming:
Processes continuous data.
24. Delta Lake + PySpark
Widely asked in Databricks/Fabric interviews.
Features:
- ACID Transactions
- Time Travel
- Schema Evolution
- Merge Support
Merge
deltaTable.alias("target").merge(
sourceDF.alias("source"),
"target.id = source.id"
)Interview Question
Q: How do you implement SCD Type 2 in PySpark?
Answer:
Using Delta Merge:
- Expire old records
- Insert new version
- Maintain start/end dates
25. Top Advanced Interview Questions
Q1: Explain Spark Job Lifecycle.
- Driver receives action.
- DAG created.
- Stages generated.
- Tasks assigned.
- Executors execute.
- Results returned.
Q2: Explain Shuffle.
Redistribution of data across partitions.
Occurs during:
- Join
- GroupBy
- Distinct
Expensive due to network transfer.
Q3: How do you optimize a slow Spark job?
Answer:
- Use Parquet
- Avoid UDFs
- Use Broadcast Join
- Cache reused DataFrames
- Reduce Shuffle
- Enable AQE
- Optimize partitions
- Use Predicate Pushdown
Q4: What causes OutOfMemory errors?
- Large collect()
- Data skew
- Too many cached datasets
- Improper partition sizing
Q5: collect() vs take()
collect():
Returns all recordstake():
Returns limited recordstake() is safer.
Q6: How many partitions should you create?
Rule:
2–4 partitions per CPU coreDepends on:
- Cluster size
- Data volume
- Workload
Real-Time Scenario Questions
Scenario 1
100M records join taking 2 hours.
Solution:
- Broadcast small table
- AQE
- Repartition join key
- Analyze skew
Scenario 2
One executor running for 45 minutes.
Cause:
Data skew.
Fix:
- Salting
- Repartitioning
- AQE
Scenario 3
Daily Incremental Load
Approach:
Source → Bronze
Bronze → Silver
Silver → GoldUse watermark or CDC columns.
Senior Data Engineer Questions
Q: Design ETL for 1 Billion Records Daily.
Answer:
- Ingest via Kafka/Kinesis
- Store in Data Lake
- Process via PySpark
- Delta Lake storage
- Partition by date
- Incremental processing
- AQE enabled
- Broadcast dimensions
- Monitoring with Spark UI
Q: Explain Spark UI.
Tabs:
- Jobs
- Stages
- Storage
- Environment
- Executors
- SQL
Used for troubleshooting performance bottlenecks.
Most Important PySpark Interview Topics (Must Prepare)
Beginner
- Spark Architecture
- RDD
- DataFrame
- SparkSession
- Transformations
- Actions
Intermediate
- Joins
- Window Functions
- Partitioning
- Caching
- UDFs
- Parquet
Advanced
- Catalyst Optimizer
- Tungsten
- AQE
- Data Skew
- Shuffle
- Broadcast Join
Senior Level
- Spark Internals
- Job Lifecycle
- Cluster Sizing
- Delta Lake
- SCD Type 2
- Streaming Architecture
- Performance Tuning
- Lakehouse Design
- Databricks/Fabric Integration
For U.S. Data Engineering interviews (AWS, Databricks, Microsoft Fabric, Healthcare, Banking, Retail, and Life Sciences), the highest-priority PySpark topics are Spark Internals, DataFrame APIs, Joins, Window Functions, Delta Lake, Performance Tuning, AQE, Data Skew handling, Streaming, and large-scale ETL design scenarios. These are asked far more frequently than basic syntax questions.
This comprehensive guide covers PySpark from foundational concepts to advanced optimization techniques, including detailed interview questions and answers. The material is organized for both systematic learning and quick reference for interview preparation .
🚀 Complete PySpark Learning & Interview Guide
Part 1: Learning PySpark – A Structured Roadmap
Mastering PySpark involves understanding its architecture, core concepts, and best practices. Here’s a step-by-step learning path, transitioning from a SQL mindset to distributed computing .
1. Foundation: Spark Architecture & Setup
Before writing code, understand the “What” and “Why” of Spark.
- What is Apache Spark? A unified analytics engine for large-scale data processing. It’s known for its speed (in-memory computing) and ease of use .
- Core Components: Spark Core, Spark SQL, Spark Streaming, MLlib (Machine Learning), and GraphX .
- SparkSession: The unified entry point for all Spark operations. In older versions,
SparkContextwas used, but nowSparkSessionencapsulates it .pythonfrom pyspark.sql import SparkSession spark = SparkSession.builder.appName(“MyApp”).master(“local[*]”).getOrCreate() - Lazy Evaluation: Transformations are not executed immediately. Spark builds a DAG (Directed Acyclic Graph) of computations, and only an Action triggers the execution . This allows for query optimization.
- Driver vs. Executors: The Driver coordinates the main function and splits work into tasks. Executors run the tasks on worker nodes.
2. Core Abstractions: RDDs vs DataFrames vs Datasets
This is the most critical concept. While you will primarily use DataFrames in PySpark, understanding RDDs helps with debugging .
| Feature | RDD (Resilient Distributed Dataset) | DataFrame | Dataset (Scala/Java only) |
|---|---|---|---|
| Structure | Low-level, collection of Java objects. No schema. | Row x Column format (like SQL table) with a schema. | Typed, object-oriented. |
| Optimization | Manual optimization. No Catalyst Optimizer. | Uses Catalyst Optimizer and Tungsten for speed. | Combines compile-time safety with Catalyst. |
| Ease of Use | Steep learning curve (map, filter). | Easy. Uses SQL-like syntax and built-in functions. | Similar to DataFrames. |
| When to Use | Need fine-grained control or unstructured data. | Default choice for PySpark. Best for structured/semi-structured data and SQL. | Not available in PySpark. |
| Performance | Slower (no optimizations) | Faster (catalyst + tungsten) | Very Fast |
- Rule of Thumb: Use DataFrames. Use RDDs only when you need to manipulate data at a granular level that DataFrames don’t support .
3. DataFrame Operations: Transformations vs. Actions
- Transformations (Lazy): Define the new DataFrame. Examples:
select(),filter(),join(),groupBy(),orderBy(). - Actions (Eager): Trigger computation and return results. Examples:
show(),count(),collect(),take(n),write().
Code Example:
python
# Transformation (Lazy - No execution yet)
df = spark.read.csv("data.csv", header=True, inferSchema=True)
filtered_df = df.filter(df.age > 18)
selected_df = filtered_df.select("name", "age")
# Action (Execution Happens)
selected_df.show()
total_count = selected_df.count() 4. Advanced Concepts & Optimization
To pass interviews, you must know how to make PySpark fast.
- Partitioning: Data is split into partitions. Ideal partition size is ~200MB.
- Caching/Persisting: If you reuse a DataFrame multiple times,
df.cache()ordf.persist()stores it in memory/RAM, preventing recomputation . - Broadcast Joins: If one table is small (fits in driver memory), broadcast it. Spark sends a copy to all executors, avoiding expensive shuffles .pythonfrom pyspark.sql.functions import broadcast large_df.join(broadcast(small_df), “key”)
- UDFs (User Defined Functions): Python UDFs are slow because they serialize data between JVM and Python (one row at a time). Prefer Pandas UDFs (Vectorized) or built-in Spark functions .
- Spark UI: The port 4040 web interface is your debugging best friend. It shows exactly how long tasks took and how much data was shuffled.
Part 2: Interview Questions & Answers (Detailed)
Here are the top questions categorized by difficulty, with detailed answers you can use in interviews.
🔹 Beginner Level (Freshers)
Q1: What is PySpark?
Answer: PySpark is the Python API for Apache Spark. It allows Python developers to write Spark applications using Python, leveraging Spark’s distributed computing power. It provides libraries for SQL (pyspark.sql), streaming, and machine learning (MLlib). Under the hood, it uses Py4J to allow Python to interface with the Java Virtual Machine (JVM) running Spark .
Q2: What is the difference between map() and flatMap()?
Answer: Both are transformations.
map()returns a new RDD/Column by applying a function to each element, outputting exactly one element per input element.flatMap()also applies a function, but the function can return a sequence of elements (e.g., a list). It then flattens this list into multiple elements.
Example: Ifmapsplits a sentence into words, you get[['Hello', 'World'], ['Hi']]. IfflatMapdoes it, you get['Hello', 'World', 'Hi'].
Q3: How do you handle null values?
Answer: PySpark provides several methods:
df.dropna(): Drops rows containing null values.df.fillna(value): Replaces nulls with a specific value (e.g.,df.fillna(0, subset=["age"])).df.na.drop()/df.na.fill(): Alternative syntax .
Q4: Explain withColumnRenamed vs withColumn.
Answer:
withColumnRenamed(existing, new): Used to change the name of an existing column.withColumn(colName, colExpr): Used to add a new column or replace an existing column (if the name matches) based on an expression.
🔹 Intermediate Level (Experienced)
Q5: How does Lazy Evaluation work in Spark? Why is it beneficial?
Answer: Transformations (like map, filter) are lazily evaluated, meaning Spark does not execute them immediately. Instead, it builds a DAG (Directed Acyclic Graph) of the computation plan. Only when an Action (like count, show, write) is called does Spark execute the DAG.
Benefits:
- Optimization: Catalyst optimizer can reorder operations (e.g., push down filters) to reduce data transfer.
- Fault Tolerance: If a node fails, Spark knows exactly which partitions to recompute by looking at the lineage graph.
- Memory Management: Avoids loading intermediate data into memory unnecessarily .
Q6: My job is running slow. How would you optimize it?
Answer: I would follow a systematic debugging process:
- Check Spark UI: Look for “red” stages or long-running tasks to identify data skew or shuffle bottlenecks.
- Avoid Shuffles: If joins are causing shuffles, check if a Broadcast Join applies (small table).
- Filter Early: Push filters (predicate pushdown) as early as possible in the read/transform process to reduce data size.
- Tune Partitions: Ensure
spark.sql.shuffle.partitionsis set appropriately (default 200 is often too low for TB data, too high for MB data). - Remove UDFs: Replace Python UDFs with Spark SQL built-in functions. If necessary, use Vectorized Pandas UDFs .
Q7: Explain Data Skew and how to fix it.
Answer: Data skew occurs when one partition contains significantly more data than others, causing a few executors to work much longer than others (straggler tasks).
Fixes:
- Salting: Add a random prefix to the skewed key to break it into multiple keys, join with a salted lookup table, then remove the salt.
- Broadcast Hash Join: If the skewed table is large but the other is small, broadcast the small one.
- Isolate Skewed Values: Filter out the skewed key, process it separately (e.g., broadcast), and union it back with the rest.
Q8: What are the different Join types in PySpark?
Answer: PySpark supports standard SQL joins:
- Inner: Returns rows with matching keys.
- Left Outer / Right Outer: Returns all rows from left/right side, nulls for non-matching.
- Full Outer: Returns all rows from both sides.
- Left Semi: Returns rows from left side that exist in right side (does not include right columns).
- Left Anti: Returns rows from left side that do not exist in right side (filters out matches) .
- Cross: Cartesian product (use with caution).
🔹 Advanced Level & Scenario Based
Q9: You have 1,000 small files in a directory. Reading them is slow. Why?
Answer: This is the “Small File Problem”.
- Spark has overhead in scheduling tasks. 1,000 files = 1,000 partitions (potentially).
- Listing files in HDFS/S3 involves heavy NameNode/RPC operations.
- Solution: Use
spark.sql.files.maxPartitionBytesto combine small files during read, or usecoalesce(1)after reading, or use file compaction techniques (e.g., Delta LakeOPTIMIZE).
Q10: df.cache() vs df.persist() vs checkpoint()?
Answer:
cache(): Defaults to storing DataFrame in memory (MEMORY_AND_DISK). Use for reusable intermediate results.persist(StorageLevel): Allows specifying storage level (e.g.,DISK_ONLY,MEMORY_AND_DISK_SER). Gives fine-grained control over memory/disk usage.checkpoint(): Saves the RDD/DataFrame to HDFS/S3 (reliable storage) and breaks the lineage graph. This is useful for very long DAGs to prevent stack overflow or to cut off lengthy recovery chains.
Q11: Write a code to calculate a cumulative sum using Window functions.
Answer:
python
from pyspark.sql.window import Window
from pyspark.sql.functions import sum, col
window_spec = Window.partitionBy("department").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_with_cumsum = df.withColumn("cumulative_sales", sum("sales").over(window_spec))Q12: How to convert a PySpark DataFrame to a Pandas DataFrame and vice versa?
Answer:
- PySpark → Pandas:
pandas_df = spark_df.toPandas()- Warning: This collects all data to the driver. Only use on small DataFrames.
- Pandas → PySpark:
spark_df = spark.createDataFrame(pandas_df)
Part 3: Key Resources & Cheat Sheet
Essential Functions Quick Reference
| Category | Functions |
|---|---|
| Read Data | spark.read.csv(), .parquet(), .json(), .jdbc() |
| Select/Filter | df.select(), df.filter(), df.where(), df.drop() |
| Aggregation | df.groupBy(), df.agg(), df.pivot() |
| Joins | df.join(other_df, on="key", how="inner") |
| Window | Window.partitionBy(), .orderBy(), .rowsBetween() |
| Performance | df.cache(), df.persist(), df.unpersist(), df.repartition(), df.coalesce() |
Official & Learning Resources
- Official Documentation: spark.apache.org/docs/latest/api/python/ (Always check for version compatibility) .
- Databricks Notebooks: Use community edition for hands-on practice. Follow the SQL-to-PySpark transition path .
- Best Practices: Always prefer built-in functions over UDFs; use broadcast joins; avoid
collect()on production data .
PySpark is the Python API for Apache Spark, a unified analytics engine for large-scale data processing. It combines Python’s ease of use with Spark’s distributed computing power for handling massive datasets efficiently across clusters.
PySpark supports Spark SQL, DataFrames, Structured Streaming, MLlib (machine learning), and more, making it popular for ETL, analytics, ML, and real-time processing.
1. Apache Spark Architecture (Core Concepts)
Spark uses a master-worker architecture:
- Driver Program: Runs the main application, creates SparkSession/SparkContext, builds the DAG (Directed Acyclic Graph) of operations, and coordinates tasks. It can become a bottleneck (e.g., OOM on driver).
- Cluster Manager: Allocates resources (e.g., Standalone, YARN, Kubernetes, Mesos).
- Executors/Workers: Run tasks on nodes; handle data processing and storage in memory/disk.
- Tasks, Stages, Jobs: A job consists of stages (narrow/wide transformations); stages break into tasks per partition.
Key Abstractions:
- RDD (Resilient Distributed Dataset): Low-level, immutable, partitioned collection of objects. Fault-tolerant via lineage (recompute lost partitions). Supports transformations (lazy) and actions (eager).
- DataFrame: High-level, structured data (like a table with named columns). Built on RDDs but optimized with Catalyst Optimizer (query planning) and Tungsten (memory/execution efficiency). Preferred for most use cases.
- Dataset: Typed DataFrame (stronger in Scala/Java; limited type safety in Python).
RDD vs DataFrame vs Dataset (key differences):
- RDD: Low-level, no schema, full control, more code/less optimization. Good for unstructured/custom logic.
- DataFrame: High-level, schema, SQL-like, Catalyst optimizations, faster for structured data.
- Dataset: Combines both (type-safe, object-oriented).
Lazy Evaluation: Transformations (e.g., filter, map, select) build the DAG but don’t execute until an action (e.g., count, collect, write). This enables optimization.
Narrow vs Wide Transformations:
- Narrow: No shuffle (e.g., map, filter); data stays in same partition.
- Wide: Requires shuffle (e.g., groupBy, join); data moves across nodes, expensive.
Partitioning: Data divided into partitions for parallelism. Tune with repartition() (increase, full shuffle) or coalesce() (decrease, avoid full shuffle).
Broadcast Variables & Accumulators: Broadcast for read-only shared data (avoids shuffle); accumulators for aggregation (e.g., counters).
2. Getting Started
Install via pip install pyspark (or use Databricks, EMR, etc.).
Python
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("MyApp") \
.master("local[*]") \ # or yarn, k8s
.config("spark.executor.memory", "4g") \
.getOrCreate()Read/Write Data:
- spark.read.csv/json/parquet/orc/jdbc/…
- df.write.mode(“overwrite”).parquet(…)
Core Operations (DataFrame API):
- Selection: df.select(“col1”, col(“col2”) + 1)
- Filtering: df.filter(col(“age”) > 30)
- Grouping/Aggregation: df.groupBy(“dept”).agg(avg(“salary”))
- Joins: df1.join(df2, on=”id”, how=”inner”) (broadcast for small tables)
- Window functions: For ranking, running totals, etc.
Spark SQL: Register as temp view and run SQL queries.
Pandas API on Spark (pyspark.pandas): Familiar pd.DataFrame syntax on Spark backend.
3. Key Modules
- Spark SQL & DataFrames: Structured processing.
- Structured Streaming: For real-time data (micro-batch or continuous).
- MLlib: Scalable ML (feature engineering, classification, clustering, pipelines).
- GraphX/GraphFrames: Graphs (less common in PySpark).
- Spark Connect: For remote execution.
Data Types: Primitives, complex (Array, Map, Struct), semi-structured (JSON).
4. Best Practices & Optimization
- Use explicit schemas for performance and consistency.
- Prefer built-in functions over UDFs (UDFs are slower; use Pandas UDFs for vectorized Python code when needed).
- Minimize shuffles: Broadcast joins, salting for skew, repartition strategically.
- Caching/Persistence: df.cache() or persist(StorageLevel) for reused DataFrames.
- Checkpointing: For long lineages (saves to disk).
- File formats: Prefer Parquet (columnar, compressed, schema).
- Monitor with Spark UI (stages, tasks, shuffle, skew).
- Tune configs: Executor memory/cores, parallelism (spark.sql.shuffle.partitions), AQE (Adaptive Query Execution).
- Avoid collecting large data to driver; use approximations or sampling.
Other tips: Leverage Catalyst, avoid Python loops, handle skew, use Delta Lake for reliability on Databricks.
5. Learning Resources
- Official Docs: Spark 4.x User Guide (DataFrames, SQL, etc.).
- Tutorials: Databricks, YouTube full courses, TutorialsPoint, Analytics Vidhya.
- Books/Practice: “Spark: The Definitive Guide”, Databricks Academy, LeetCode-style problems.
- Hands-on: Local setup, Databricks Community Edition, or Kaggle notebooks.
Practice on real datasets (CSV → Parquet ETL, joins, window funcs, ML pipelines).
6. Detailed PySpark Interview Questions & Answers
Here’s a curated list of common questions (freshers to advanced), based on popular sources.
Basics:
- What is PySpark? Advantages over Pandas? Python API for Spark. Scalable, distributed, fault-tolerant vs. Pandas (single-node, in-memory).
- How to create SparkSession? See code above. Entry point for DataFrames/SQL.
- RDD vs DataFrame? As detailed earlier. Use DataFrames for most cases.
- Lazy Evaluation? Transformations deferred until action; optimizes DAG.
Intermediate: 5. Transformations vs Actions? Transformations (lazy, e.g., map, filter); Actions (trigger computation, e.g., count, show).
- Handling Missing Data? dropna(), fillna(), Imputer from MLlib.
- Joins & Types? Inner, left, right, outer, cross. Use broadcast for small side.
- Caching vs Persist? cache() = MEMORY_ONLY; persist() allows levels (MEMORY_AND_DISK, etc.).
- Partitioning: repartition vs coalesce? repartition (full shuffle, any number); coalesce (minimize shuffle, reduce only).
- Window Functions? Window.partitionBy(“col”).orderBy(“col2”) + row_number(), rank(), etc.
Advanced: 11. Narrow vs Wide Transformations? Narrow: no shuffle; Wide: shuffle involved.
- Catalyst Optimizer? Rule-based query optimizer for DataFrames/SQL; logical → physical plan.
- Spark Driver Responsibilities? Orchestrates, schedules, UI, collects results.
- Skew Handling? Salting (add random key), broadcast, AQE, custom partitioning.
- UDFs Performance? Avoid; use pandas_udf or native funcs. Serialization overhead.
- Checkpointing? Materializes RDD/DataFrame to reliable storage; breaks lineage for recovery.
- Broadcast Variables Use Case? Small lookup tables, models — sent once to executors.
- Performance Tuning? Schema, Parquet, caching, AQE, shuffle partitions, avoid UDFs/driver OOM, monitor UI.
- Real-time Scenario: Process streaming data with watermarks, joins, aggregations in Structured Streaming.
- Differences: PySpark vs Spark (Scala)? Same engine; PySpark has overhead (Py4J) but easier for Python users/ML.
Coding Questions (common):
- Rank employees by salary per dept.
- Find users with >N events in window.
- ETL: Clean, join, aggregate sales data.
- Handle skewed joins or large groupBy.
Practice explaining code + execution plan (df.explain()).
For deeper dives, refer to official docs, Databricks blogs, and practice on clusters. Focus on why (optimizations, fault tolerance) over just how. Good luck with learning and interviews!


