AWS Glue Interview Questions and Answers (Complete Data Engineer Interview Guide)

AWS Glue

AWS Glue is one of the services most frequently asked about in AWS Data Engineering, Big Data, Analytics, AI/ML, and Cloud Engineering interviews.

Interviewers typically test:

  • AWS Glue Fundamentals
  • Architecture
  • ETL/ELT Concepts
  • Glue Jobs
  • Crawlers
  • Data Catalog
  • DynamicFrames
  • Spark Concepts
  • Glue Studio
  • Glue Workflows
  • Performance Optimization
  • Security
  • Monitoring
  • Real-world Scenarios
  • Troubleshooting
  • Advanced Architect-Level Design

1. What is AWS Glue?

Answer

AWS Glue is a fully managed serverless data integration service that helps discover, prepare, move, and transform data for analytics, machine learning, and application development.

It provides:

  • ETL/ELT Processing
  • Metadata Management
  • Data Discovery
  • Data Cataloging
  • Data Quality
  • Workflow Orchestration

Real-World Example

Suppose:

  • Sales Data → S3
  • Customer Data → RDS
  • Product Data → DynamoDB

Glue can:

  1. Read all data
  2. Transform it
  3. Load into Snowflake, Redshift, or S3 Data Lake

2. Why AWS Glue?

Answer

Benefits:

  • Serverless
  • Auto Scaling
  • Built-in Apache Spark
  • No infrastructure management
  • Integrated with AWS ecosystem
  • Pay only for usage

Interview Tip

Companies prefer Glue because they don’t need to manage Spark clusters.


3. What are AWS Glue Components?

Answer

Major components:

Component | Purpose
Data Catalog | Metadata repository
Crawler | Discover schema
ETL Job | Data transformation
Trigger | Start jobs
Workflow | Orchestrate jobs
Connection | Connect external systems
Development Endpoint | Interactive development
Glue Studio | Visual ETL designer
Data Quality | Validate data

4. Explain AWS Glue Architecture

Answer

Flow:

Data Sources
|
v
Crawler
|
v
Data Catalog
|
v
Glue ETL Job
|
v
Target System

Sources:

  • S3
  • RDS
  • Redshift
  • DynamoDB
  • Kafka
  • Snowflake
  • JDBC Sources

Targets:

  • S3
  • Redshift
  • Snowflake
  • OpenSearch
  • JDBC Databases

5. What is a Glue Data Catalog?

Answer

Central metadata repository storing:

  • Table Definitions
  • Database Definitions
  • Partitions
  • Schema Information

It acts like a Hive Metastore.


Example

Table:

sales_data

Columns:

id
amount
date
country

The table definition is stored in the Glue Data Catalog.

Athena and Redshift Spectrum can query the underlying data through it directly.


6. Why is Data Catalog Important?

Answer

Without catalog:

Every application must understand schema.

With catalog:

Single source of truth.

Benefits:

  • Schema management
  • Data discovery
  • Query optimization
  • Governance

7. What is a Glue Crawler?

Answer

Crawler automatically:

  • Scans data
  • Infers schema
  • Detects partitions
  • Creates metadata tables

Example

Crawler scans:

s3://sales/

Folders (Hive-style names, so the crawler can name the partition column):

year=2024/
year=2025/

Creates:

sales_table

Partitions:

year=2024
year=2025
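
The partition detection above can be pictured with a small plain-Python sketch. It assumes Hive-style folder names (name=value) and hypothetical object keys; it only illustrates the idea and is not how the crawler is implemented internally.

```python
from collections import defaultdict


def parse_partitions(key: str) -> dict:
    """Extract Hive-style partition values (name=value folders) from an object key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions


# Hypothetical object keys under s3://sales/
keys = [
    "year=2024/part-0001.csv",
    "year=2024/part-0002.csv",
    "year=2025/part-0001.csv",
]

# Group files by the value of the "year" partition
files_per_partition = defaultdict(list)
for key in keys:
    files_per_partition[parse_partitions(key)["year"]].append(key)

print(sorted(files_per_partition))
```

Folders without a name=value pattern (for example a bare 2024/) yield no named partition in this sketch, which is why Hive-style naming is usually recommended.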

8. How Does a Crawler Work?

Answer

Steps:

  1. Connect source
  2. Sample data
  3. Detect schema
  4. Identify partitions
  5. Update Data Catalog

9. What are Classifiers?

Answer

Classifiers identify data format.

Supported:

  • JSON
  • CSV
  • XML
  • Avro
  • Parquet

Custom classifiers can also be created.


10. What is an AWS Glue Job?

Answer

A Glue Job performs ETL transformations.

Written as:

  • Spark jobs in Python (PySpark)
  • Spark jobs in Scala
  • Python shell jobs (plain Python, no Spark)

Example:

Read CSV

Transform

Write Parquet


11. What is Glue Studio?

Answer

Visual drag-and-drop ETL designer.

Allows:

  • Build ETL pipelines
  • Generate code automatically
  • Monitor jobs

12. What Languages Does Glue Support?

Answer

Supported:

  • Python (PySpark)
  • Scala

PySpark is the most common. Python shell is a separate, non-Spark job type for lightweight scripts.


13. What is DynamicFrame?

Answer

DynamicFrame is AWS Glue’s abstraction over Spark DataFrame.

Provides:

  • Schema flexibility
  • Handling inconsistent data

14. DynamicFrame vs DataFrame

Feature | DynamicFrame | DataFrame
AWS Glue Native | Yes | No
Schema Handling | Flexible | Strict
Performance | Slightly Lower | Faster
Error Handling | Better | Limited

Interview Answer

Use DynamicFrame for ingestion and DataFrame for complex transformations.


15. Convert DynamicFrame to DataFrame

df = dynamic_frame.toDF()

16. Convert DataFrame to DynamicFrame

from awsglue.dynamicframe import DynamicFrame

dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

17. What is GlueContext?

Answer

A wrapper around the Spark SparkContext.

Provides Glue-specific functionality.

glueContext = GlueContext(sc)

18. What is Job Bookmark?

Answer

Tracks previously processed data.

Prevents duplicate processing.


Example

Yesterday:

file1.csv
file2.csv

Processed.

Today:

file3.csv

Only file3 processed.
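
The example above can be simulated in plain Python. This is a conceptual sketch only: the set named bookmark stands in for the state Glue persists between runs, and the real bookmark format is internal to Glue (for S3 sources it is based on object modification times rather than a list of names).

```python
def select_new_files(available, bookmark):
    """Return the files that are not yet recorded in the bookmark, in sorted order."""
    return sorted(f for f in available if f not in bookmark)


bookmark = set()  # stand-in for the state persisted between job runs

# Yesterday's run: two files exist, both are new
run1 = select_new_files(["file1.csv", "file2.csv"], bookmark)
bookmark.update(run1)  # comparable to job.commit() saving the bookmark

# Today's run: one more file has arrived
run2 = select_new_files(["file1.csv", "file2.csv", "file3.csv"], bookmark)
bookmark.update(run2)

print(run1)
print(run2)
```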


19. Benefits of Job Bookmarks

Answer

  • Incremental loading
  • Reduced processing
  • Lower cost
  • Faster execution

20. What Happens If Bookmark Disabled?

Answer

Entire dataset gets reprocessed every run.


21. What Worker Types Exist in Glue?

Answer

Common:

Worker | Use
G.1X | Standard
G.2X | More Memory
G.4X | Large Workloads
G.8X | Heavy Processing
G.16X | Enterprise Scale

22. What is DPU?

Answer

Data Processing Unit.

1 DPU =

  • 4 vCPU
  • 16 GB Memory

Glue pricing is based on DPU consumption.
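
A rough cost estimate follows directly from the DPU definition. The sketch below is a minimal calculation under stated assumptions: the price per DPU-hour is an illustrative parameter (check the current price for your Region), and the 60-second minimum billing duration is an assumption that applies to recent Spark job versions.

```python
def glue_job_cost(dpus: float, runtime_seconds: float,
                  price_per_dpu_hour: float, minimum_seconds: float = 60) -> float:
    """Estimate the cost of one job run: DPUs x billed hours x price per DPU-hour."""
    billed_seconds = max(runtime_seconds, minimum_seconds)  # apply the assumed minimum
    return dpus * (billed_seconds / 3600) * price_per_dpu_hour


# 10 DPUs for 30 minutes at an illustrative 0.44 per DPU-hour
cost = glue_job_cost(dpus=10, runtime_seconds=1800, price_per_dpu_hour=0.44)
print(f"${cost:.2f}")
```

Being able to do this arithmetic aloud is a common follow-up to the DPU question.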


23. How Do You Optimize Glue Costs?

Answer

  1. Job bookmarks
  2. Partitioning
  3. Parquet format
  4. Proper worker sizing
  5. Pushdown predicates
  6. Incremental loads

24. What is Pushdown Predicate?

Answer

Filter data before reading.

Bad:

read all
then filter

Good:

read only required partition

Example:

year=2025
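
In a Glue script, the filter is passed as the push_down_predicate argument when reading from the Data Catalog. The sketch below assumes it runs inside a Glue job where glue_context is a GlueContext; the helper names, database, and table are hypothetical.

```python
def partition_predicate(**partitions) -> str:
    """Build a predicate string such as "year == '2025' and month == '01'"."""
    return " and ".join(f"{name} == '{value}'" for name, value in partitions.items())


def read_partition(glue_context, database, table_name, **partitions):
    """Read only the matching partitions (glue_context is a GlueContext in a Glue job)."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        push_down_predicate=partition_predicate(**partitions),
    )


print(partition_predicate(year="2025"))
# Inside a job (hypothetical names): read_partition(glueContext, "sales_db", "sales_table", year="2025")
```

The predicate only prunes on partition columns, so it helps only when the table is partitioned by the column being filtered.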

25. What File Formats Are Supported?

Answer

  • CSV
  • JSON
  • XML
  • Parquet
  • ORC
  • Avro
  • Iceberg
  • Hudi
  • Delta Lake

26. Why Convert CSV to Parquet?

Answer

Benefits:

  • Columnar
  • Compression
  • Faster queries
  • Reduced storage

Often asked in interviews.


27. Explain Glue Workflow

Answer

Workflow orchestrates:

  • Crawlers
  • Jobs
  • Triggers

into one pipeline.


Example:

Crawler

ETL Job

Validation Job

Load Job

28. Types of Triggers

Answer

  1. Scheduled
  2. On-demand
  3. Event-based
  4. Conditional

29. Can Glue Be Triggered by S3 Events?

Answer

Yes.

Using:

S3 Event
→ EventBridge
→ Glue Workflow
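
The EventBridge step is driven by an event pattern on the rule. The sketch below builds such a pattern for S3 "Object Created" events; the bucket name and prefix are hypothetical, and it assumes EventBridge notifications are enabled on the bucket and that the rule targets a Glue workflow.

```python
import json


def s3_object_created_pattern(bucket: str, prefix: str) -> dict:
    """Event pattern that matches new objects under a prefix in one bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": prefix}]},
        },
    }


pattern = s3_object_created_pattern("sales", "incoming/")
print(json.dumps(pattern, indent=2))  # JSON to use as the rule's event pattern
```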

30. How Does Glue Integrate with Athena?

Answer

Athena directly uses Glue Catalog metadata.

No schema duplication required.


31. How Does Glue Integrate with Redshift?

Answer

Glue:

  • Extract data
  • Transform
  • Load into Redshift

Using JDBC or COPY command.


32. How Does Glue Integrate with Snowflake?

Answer

Using Snowflake Connector.

Typical flow:

S3 → Glue → Snowflake

33. Explain Glue Streaming Jobs

Answer

Process real-time data.

Sources:

  • Kafka
  • MSK
  • Kinesis

Supports near-real-time ETL.


34. Difference Between Batch and Streaming Glue Jobs

Batch | Streaming
Scheduled | Continuous
Finite Data | Infinite Data
Lower Cost | Higher Cost

35. What is AWS Glue Data Quality?

Answer

Built-in data validation framework.

Checks:

  • Nulls
  • Duplicates
  • Completeness
  • Consistency

36. Example Data Quality Rule

IsComplete "customer_id"

Ensures no null values.
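
To make the rule's meaning concrete, here is a plain-Python sketch of what a completeness check verifies, plus a helper that formats rules as a ruleset string. The functions and sample rows are hypothetical illustrations, not Glue's implementation.

```python
def is_complete(rows, column):
    """True when every row has a non-null value for the column."""
    return all(row.get(column) is not None for row in rows)


def build_ruleset(rules):
    """Format rule strings as a ruleset: Rules = [ rule1, rule2 ]."""
    return "Rules = [ " + ", ".join(rules) + " ]"


rows = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": None, "name": "Ben"},
]

print(is_complete(rows, "customer_id"))  # second row has a null customer_id
print(is_complete(rows, "name"))
print(build_ruleset(['IsComplete "customer_id"', 'IsUnique "customer_id"']))
```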


37. How Do You Secure AWS Glue?

Answer

Use:

  • IAM
  • KMS
  • VPC
  • Security Groups
  • Lake Formation
  • Encryption

38. How Is Data Encrypted?

At Rest

  • SSE-S3
  • SSE-KMS

In Transit

  • SSL/TLS

39. Explain VPC Integration

Answer

Glue jobs can run inside VPC.

Needed when accessing:

  • RDS
  • Private Redshift
  • On-prem systems

40. How Do You Monitor Glue Jobs?

Answer

Using:

  • CloudWatch Logs
  • CloudWatch Metrics
  • Job Runs Dashboard
  • EventBridge Alerts

41. Common Glue Failures?

Answer

  • Memory errors
  • Schema mismatch
  • Permission denied
  • JDBC timeout
  • Network issues

42. How Do You Troubleshoot Memory Issues?

Answer

  • Increase workers
  • Repartition data
  • Use Parquet
  • Pushdown predicates

43. Glue vs EMR

Glue | EMR
Serverless | Cluster Managed
ETL Focused | Big Data Platform
Simple | Flexible
Less Control | More Control

Interview Answer

Use Glue for ETL.
Use EMR for complex Spark workloads.


44. Glue vs Lambda

Glue | Lambda
Big Data | Small Data
Spark | Python/Node
GB/TB Scale | MB Scale

45. Glue vs DataBrew

Glue | DataBrew
Developer Focused | Business User Focused
Coding | No-Code

46. Real-Time Scenario

Question

How would you process clickstream data?

Answer

Architecture:

Website

Kinesis

Glue Streaming

S3

Athena

47. Scenario: Incremental Daily Loads

Answer

Use:

  • Job Bookmarks
  • Partitioned S3
  • Workflow

48. Scenario: 5 TB Daily Data

Answer

Optimization:

  • Parquet
  • Partitioning
  • G.8X Workers
  • Predicate Pushdown

49. Scenario: Duplicate Records

Answer

Use Spark:

df.dropDuplicates()

or primary-key validation.
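
dropDuplicates() with no arguments removes only rows that are identical in every column. When duplicates share a key but differ elsewhere, a common approach is to keep the latest record per key. The plain-Python sketch below shows that logic with hypothetical column names; in Spark the same idea is usually expressed with a window function.

```python
def dedupe_latest(records, key, order_by):
    """Keep one record per key: the one with the largest order_by value."""
    latest = {}
    for record in records:
        k = record[key]
        if k not in latest or record[order_by] > latest[k][order_by]:
            latest[k] = record
    return list(latest.values())


records = [
    {"order_id": 1, "status": "NEW", "updated_at": "2025-01-01"},
    {"order_id": 1, "status": "SHIPPED", "updated_at": "2025-01-03"},
    {"order_id": 2, "status": "NEW", "updated_at": "2025-01-02"},
]

deduped = dedupe_latest(records, key="order_id", order_by="updated_at")
print(deduped)
```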


50. Senior Architect Question

Design a Modern AWS Data Lake Using Glue

Answer

Architecture:

Sources

S3 Landing

Glue Crawler

Glue Catalog

Glue ETL

Curated Zone

Athena

Redshift

QuickSight

Services:

  • S3
  • Glue
  • Athena
  • Redshift
  • Lake Formation
  • CloudWatch
  • IAM
  • KMS

Top 20 AWS Glue Interview Questions Asked Most Frequently

  1. What is AWS Glue?
  2. Difference between DynamicFrame and DataFrame?
  3. What is Glue Data Catalog?
  4. What are Crawlers?
  5. What are Job Bookmarks?
  6. What is DPU?
  7. How does Glue pricing work?
  8. Glue vs EMR?
  9. Glue vs Lambda?
  10. How do you optimize Glue jobs?
  11. What are Pushdown Predicates?
  12. How does Glue integrate with Athena?
  13. How does Glue integrate with Redshift?
  14. Explain Glue Workflows.
  15. Explain Glue Streaming.
  16. How do you handle schema evolution?
  17. How do you secure Glue?
  18. How do you troubleshoot failed jobs?
  19. What worker type would you choose for a 5 TB dataset?
  20. Design an end-to-end AWS Data Lake using Glue.

For a Senior Data Engineer, Cloud Engineer, or Solutions Architect interview in the U.S. market, you should also be prepared for advanced topics such as Apache Iceberg, Hudi, Delta Lake, Glue 5.0, Spark optimization, Lake Formation integration, CDC pipelines, and multi-account data lake architectures.

AWS Glue is a fully managed, serverless ETL (Extract, Transform, Load) service from AWS that simplifies discovering, preparing, and combining data for analytics, machine learning, and application development. It handles data cataloging, schema inference, job orchestration, and execution using Apache Spark (with Python/PySpark or Scala support).

It is particularly useful for building data lakes, integrating data from various sources (S3, RDS, Redshift, JDBC, etc.), and preparing it for tools like Amazon Athena, Redshift, or SageMaker.

Basic Questions

1. What is AWS Glue? AWS Glue is a serverless data integration service that automates ETL processes. It includes a Data Catalog for metadata, crawlers for schema discovery, and an ETL engine for running Spark-based jobs. It eliminates infrastructure management, auto-scales, and supports pay-as-you-go pricing based on Data Processing Units (DPUs).

2. What are the main components of AWS Glue?

  • Data Catalog: Centralized metadata repository (databases, tables, schemas) compatible with Hive Metastore.
  • Crawlers: Scan data sources to infer schemas and populate/update the Data Catalog.
  • ETL Jobs: Scripts (Python/Scala) for transforming data; run on Apache Spark.
  • Triggers: Schedule or event-based (e.g., S3 events, job completion) starters for jobs/crawlers.
  • Development Endpoints: For interactive development/testing of scripts.
  • Workflows: Orchestrate complex multi-job/crawler pipelines.
  • Glue Studio: Visual interface for building ETL jobs (no-code/low-code).
  • Glue DataBrew: For data cleaning/preparation with visual recipes.
  • Glue Data Quality: Rules-based monitoring and validation.

3. Explain AWS Glue architecture. Data sources → Crawlers populate Data Catalog → Jobs read from Catalog sources, apply transformations (Spark), and write to targets. Triggers/workflows orchestrate execution. Metadata is stored in the Catalog; execution uses serverless Spark clusters (DPUs). Integration with Lake Formation for governance and security.

4. What are AWS Glue Crawlers and how do they work? Crawlers scan data stores (S3, databases, etc.), infer schemas (using classifiers), and create/update tables in the Data Catalog. They support incremental crawls, schema evolution, and S3 event notifications for efficiency. They handle partitioning and can use custom classifiers.

5. What is the AWS Glue Data Catalog? A persistent, managed metadata store (like a Hive metastore) that holds structural/operational metadata about data assets. It enables discovery, querying (e.g., via Athena), and governance across AWS services. Supports versioning, encryption, and Lake Formation permissions.

Intermediate Questions

6. How does AWS Glue handle schema evolution? Crawlers detect changes (new columns, types) and update table metadata. You can configure behavior (e.g., ignore changes, add new columns, or create new versions). Supports partition indexes and grouping policies.

7. What are Triggers in AWS Glue? Triggers start jobs or crawlers. Types:

  • Scheduled (cron-like).
  • On-demand.
  • Conditional (based on job/crawler success/failure).

They enable chaining for workflows.

8. Explain Glue Jobs. What languages/scripts are supported? Jobs define ETL logic with a script, sources, and targets. Primarily Python (PySpark) or Scala. Glue generates boilerplate scripts; you can edit in Studio, console, or IDEs. Supports bookmarks for incremental processing.

9. What are DPUs in AWS Glue? Data Processing Units measure compute capacity. 1 DPU = 4 vCPUs + 16 GB RAM. You allocate DPUs to jobs; auto-scaling is available. Pricing is per DPU-hour, billed per second with a 1-minute minimum for Spark jobs on Glue 2.0 and later (older versions had a 10-minute minimum).

10. What is Glue Studio? A visual, drag-and-drop interface to build, edit, and monitor ETL jobs without deep coding. Generates Spark code that can be customized.

11. How do you monitor and debug Glue Jobs? Use CloudWatch metrics/logs, job run details in console (logs, errors, timelines), bookmarks, and Glue Data Quality. Enable continuous logging and job insights.

12. What security features does AWS Glue support?

  • IAM roles/policies for fine-grained access.
  • Encryption at rest (KMS) and in transit (SSL).
  • VPC integration, security groups.
  • Lake Formation for row/column-level permissions.
  • Audit logging via CloudTrail.

13. How does AWS Glue integrate with other AWS services?

  • S3: Primary storage.
  • Athena/Redshift: Query targets.
  • Lake Formation: Governance.
  • SageMaker: ML pipelines.
  • EventBridge/Lambda: Event-driven triggers.
  • Kinesis/MSK: Streaming integration.

Advanced / Scenario-Based Questions

14. Explain Glue Workflows. Orchestration tool for complex ETL involving multiple jobs, crawlers, and triggers. Visual graph in console; supports dependencies and start triggers.

15. How do you optimize Glue Job performance?

  • Use partitioning and predicate pushdown.
  • Dynamic allocation and auto-scaling.
  • Appropriate worker types (for example G.1X vs. G.2X; the older Standard type applies to legacy Glue versions).
  • Cache/repartition data; avoid small files.
  • Use Glue bookmarks and incremental crawls.
  • Optimize memory (skew handling, broadcast joins).
  • Push transformations early (filtering).

16. Difference between AWS Glue, EMR, and Data Pipeline.

  • Glue: Serverless ETL-focused, easy cataloging/crawlers, managed Spark. Best for standard ETL.
  • EMR: Managed Hadoop/Spark clusters; more flexible/customizable for complex big data/ML, but requires more management. Cheaper for sustained heavy workloads.
  • Data Pipeline: Workflow orchestration (legacy); less ETL-native than Glue.

17. How would you handle incremental ETL in Glue? Use job bookmarks (tracks processed data), S3 event notifications for crawlers, or custom logic with last-modified timestamps/watermarks in scripts. Combine with partitioning.

18. What are common challenges with Glue and how to address them?

  • Cost for large/spiky workloads → Optimize DPUs, use the Flex execution class for non-urgent jobs, or switch to EMR Serverless.
  • Schema drift → Configure crawler rules.
  • Small files → Compaction jobs.
  • Debugging Spark issues → Use development endpoints/interactive sessions.
  • Cold starts → Use a recent Glue version (startup times are much shorter from Glue 2.0 onward) and schedule jobs with startup latency in mind.

19. Explain Glue Data Quality. Feature to define rules (e.g., completeness, uniqueness) and monitor data in pipelines or lakes. Auto-suggests rules; integrates with workflows for alerts/blocking.

20. How do you handle large-scale data or joins in Glue? Scale DPUs, use broadcast joins for small tables, skew mitigation, repartitioning, and Glue’s optimized connectors. For very large/complex, consider EMR.

Other Common Topics

  • Pricing: DPUs + crawlers (per hour) + Data Catalog requests.
  • Limitations: Less customizable than self-managed Spark; Python shell jobs for lightweight tasks.
  • Best Practices: Use IAM least privilege, partition data, monitor with CloudWatch, version scripts in S3/CodeCommit, test with dev endpoints, and leverage Lake Formation.
  • Development Workflow: Use interactive sessions (Jupyter), Glue Studio, or local Spark testing.

For the most up-to-date details, refer to the official AWS Glue documentation. Practice with real scenarios (e.g., S3 → transformed Parquet in another bucket with partitioning) and hands-on labs. Good luck with your interview!

Preparing for an AWS Glue interview can feel like a significant challenge, as it requires you to demonstrate knowledge not just of the service itself, but of Spark, data governance, and AWS integration.

To help you succeed, I have compiled the most important questions and detailed answers, categorized by topic and difficulty. This guide covers everything from basic definitions to advanced architectural scenarios, directly reflecting what interviewers are looking for in 2025-2026.


🧱 Module 1: Core Concepts & Architecture

1. What is AWS Glue, and why is it considered “serverless”?

Answer:
AWS Glue is a fully managed, serverless ETL (Extract, Transform, Load) service.
It is considered “serverless” because you do not provision or manage clusters (like EC2 or EMR). AWS Glue automatically provisions, scales, and terminates the resources (Spark or Python environments) needed to run your jobs. You only pay for the compute time consumed during execution.

2. What are the primary components of AWS Glue?

Answer:
The architecture is built on several key pillars:

  • Glue Data Catalog: A central metadata repository (Hive metastore compatible) storing table definitions, schemas, and locations (e.g., S3 paths).
  • Crawlers: Automated processes that scan data sources (S3, RDS) to infer schemas and populate the Data Catalog.
  • ETL Jobs: The logic for transforming data. Runs on Apache Spark (for heavy lifting) or Python Shell (for lightweight scripts).
  • Workflows & Triggers: Orchestration tools to chain multiple crawlers and jobs together based on time or event dependencies.
  • Glue Studio: A visual interface to design, debug, and monitor ETL pipelines without heavy coding.

3. Can you explain the difference between a Spark DataFrame and a Glue DynamicFrame?

Answer:
This is a critical distinction for technical interviews.

  • DataFrame (Spark): Lazily evaluated and highly optimized for performance, but it requires a single schema up front. Operations fail when the data does not match that schema (e.g., a referenced column is missing).
  • DynamicFrame (Glue): A Glue-specific abstraction, convertible to and from a DataFrame, in which each record is self-describing, so no schema is required up front. It tolerates inconsistent types by recording them as choice types that you resolve later (for example with resolveChoice). It is ideal for semi-structured data (JSON) or messy data sources.

Code Snippet (Interview Example):

python

# Using DataFrame (one fixed schema for the whole dataset)
df = spark.read.csv("s3://path/")
df.printSchema()  # Prints the schema Spark derived at read time; later references to missing columns fail

# Using DynamicFrame (flexible)
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table"
)
# Inconsistent column types are kept as "choice" types; resolve them explicitly with resolveChoice()

🕷️ Module 2: Crawlers & Data Catalog

4. How does a Crawler handle Schema Evolution?

Answer:
A Crawler scans data stores and updates the Data Catalog. You can configure its behavior for schema updates:

  1. Update the table definition: Adds new columns found in new data files.
  2. Ignore changes: Leaves the catalog as-is (risky for query engines).
  3. Deprecate deleted columns: Marks missing columns as deprecated rather than removing them.
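
These options correspond to the crawler's schema change policy. The sketch below builds the keyword arguments one might pass to boto3's create_crawler; the crawler name, role, database, and path are hypothetical, and the actual API call is left commented out so the block runs without AWS access.

```python
def schema_change_policy(update_table: bool, deprecate_deleted: bool) -> dict:
    """Translate the two choices into a SchemaChangePolicy structure."""
    return {
        # UPDATE_IN_DATABASE applies schema changes; LOG records them without changing the table
        "UpdateBehavior": "UPDATE_IN_DATABASE" if update_table else "LOG",
        # DEPRECATE_IN_DATABASE marks missing objects as deprecated instead of deleting them
        "DeleteBehavior": "DEPRECATE_IN_DATABASE" if deprecate_deleted else "LOG",
    }


crawler_kwargs = {
    "Name": "sales_crawler",        # hypothetical
    "Role": "GlueCrawlerRole",      # hypothetical IAM role
    "DatabaseName": "sales_db",     # hypothetical
    "Targets": {"S3Targets": [{"Path": "s3://sales/"}]},
    "SchemaChangePolicy": schema_change_policy(update_table=True, deprecate_deleted=True),
}

# With credentials configured: boto3.client("glue").create_crawler(**crawler_kwargs)
print(crawler_kwargs["SchemaChangePolicy"])
```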

5. How does the Glue Data Catalog integrate with other services?

Answer:
The Data Catalog acts as the “single source of truth” for metadata across AWS.

  • Amazon Athena: Allows SQL queries directly on S3 data using Glue tables.
  • Amazon Redshift Spectrum: Enables Redshift to query the Data Lake without loading data.
  • Amazon EMR: Can use Glue as a Hive Metastore instead of hosting its own MySQL/PostgreSQL database.

⚙️ Module 3: ETL Jobs, Development & Optimization

6. You are processing millions of log files. How do you optimize performance?

Answer:
Optimization in Glue focuses on minimizing I/O and shuffle operations:

  1. Partition Pruning: Use partitioned data (e.g., year=2024/...) so Glue only reads necessary folders.
  2. Use File Formats: Prefer Parquet (columnar) over CSV/JSON to reduce scan time.
  3. Increase DPUs (Data Processing Units): 1 DPU = 4 vCPU + 16GB RAM. For large datasets, use G.2X or G.4X workers for memory-intensive aggregations.
  4. Job Bookmarks: Enable incremental processing to avoid reprocessing old data.
  5. Broadcast Hash Joins: If joining a large table with a small lookup table, wrap the small table in broadcast() (from pyspark.sql.functions) so Spark ships it to every executor and avoids shuffling the large table across the network.
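
The idea behind a broadcast hash join can be shown in plain Python: the small table becomes an in-memory lookup that every worker holds, and the large table is streamed past it without being shuffled. This is a conceptual sketch with hypothetical data, not Spark's implementation.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join: build a hash table from the small side, then probe it per large row."""
    lookup = {row[key]: row for row in small_rows}  # the "broadcast" copy of the small table
    joined = []
    for row in large_rows:  # the large side is read once and never re-partitioned
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**match, **row})
    return joined


countries = [{"country_code": "US", "country": "United States"}]
events = [
    {"event_id": 1, "country_code": "US"},
    {"event_id": 2, "country_code": "XX"},  # no match, dropped by the inner join
]

result = broadcast_hash_join(events, countries, "country_code")
print(result)
```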

7. What is a Glue Job Bookmark? How does it handle incremental loads?

Answer:
A Job Bookmark keeps track of previously processed data.

  • Function: It stores state from the previous run in a persistent state store. For S3 sources this is based on object last-modified timestamps; for JDBC sources it is based on the values of bookmark key columns.
  • Incremental Load: When the job runs again, Glue checks the bookmark, reads only the new data (e.g., files added since last run), and skips the old data.
  • Use Case: Essential for processing log files in S3 or reading JDBC tables with a monotonically increasing key, where you only want new records.

8. How do you orchestrate a complex pipeline (e.g., Crawl -> Transform -> Load -> Archive)?

Answer:
Using Glue Workflows and Triggers:

  1. Trigger 1: On schedule (e.g., 2 AM).
  2. Node A: Crawler_Logs (populates Catalog).
  3. Trigger 2: Conditional (starts on Crawler_Logs success).
  4. Node B: ETL_Job_Process (transforms data).
  5. Trigger 3: Conditional (starts on ETL_Job_Process success).
  6. Node C: ETL_Job_Archive (moves raw files).

This eliminates the need for a separate orchestrator like Step Functions for simple linear dependencies.
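
The chain above can be expressed as trigger definitions. The sketch below builds the keyword arguments one might pass to boto3's create_trigger, one dictionary per trigger; the workflow and trigger names are hypothetical, the cron expression assumes 2 AM UTC, and the API call is commented out so the block runs without AWS access.

```python
def conditional_trigger(name, workflow, start_job, *, after_crawler=None, after_job=None):
    """Trigger that starts start_job once a crawler or a job has succeeded."""
    if (after_crawler is None) == (after_job is None):
        raise ValueError("pass exactly one of after_crawler or after_job")
    if after_crawler is not None:
        condition = {"LogicalOperator": "EQUALS", "CrawlerName": after_crawler,
                     "CrawlState": "SUCCEEDED"}
    else:
        condition = {"LogicalOperator": "EQUALS", "JobName": after_job,
                     "State": "SUCCEEDED"}
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {"Conditions": [condition]},
        "Actions": [{"JobName": start_job}],
        "StartOnCreation": True,
    }


triggers = [
    {   # Trigger 1: scheduled start of the crawler
        "Name": "nightly_start",
        "WorkflowName": "logs_pipeline",
        "Type": "SCHEDULED",
        "Schedule": "cron(0 2 * * ? *)",
        "Actions": [{"CrawlerName": "Crawler_Logs"}],
        "StartOnCreation": True,
    },
    conditional_trigger("after_crawl", "logs_pipeline", "ETL_Job_Process",
                        after_crawler="Crawler_Logs"),
    conditional_trigger("after_process", "logs_pipeline", "ETL_Job_Archive",
                        after_job="ETL_Job_Process"),
]

# With credentials configured:
# for t in triggers: boto3.client("glue").create_trigger(**t)
print([t["Type"] for t in triggers])
```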

🔒 Module 4: Security, Governance & Advanced Integration

9. How do you connect Glue to a private RDS/Aurora database?

Answer:
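Connecting to private databases involves networking configuration:

  1. VPC Configuration: Glue reaches private resources through elastic network interfaces that it creates in your VPC. On the Glue connection, choose a subnet and security groups that can route to the RDS instance, and allow that traffic in the database's security group. The security group used by Glue also needs a self-referencing inbound rule.
  2. Glue Connection: Create a Connection object of type JDBC with the RDS endpoint, port, and database name.
  3. Secrets Manager (Best Practice): Do not hardcode passwords. Store credentials in AWS Secrets Manager and grant the Glue IAM role permission to GetSecretValue.
  4. IAM: The job role needs the EC2 networking permissions used to create and describe network interfaces (for example ec2:DescribeSecurityGroups, ec2:DescribeSubnets, and ec2:CreateNetworkInterface), which the AWSGlueServiceRole managed policy includes.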
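
The connection described above can be sketched as the ConnectionInput structure for boto3's create_connection. Every identifier below (endpoint, secret name, subnet, security group, Availability Zone) is a hypothetical placeholder, and the SECRET_ID property is an assumption to verify against the current API reference; the API call is commented out so the block runs without AWS access.

```python
def jdbc_connection_input(name, jdbc_url, secret_id, subnet_id,
                          security_group_ids, availability_zone):
    """Build a JDBC connection definition that takes credentials from Secrets Manager."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "SECRET_ID": secret_id,  # assumption: secret holding username and password
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


connection_input = jdbc_connection_input(
    name="rds_orders",
    jdbc_url="jdbc:postgresql://orders.example.internal:5432/orders",
    secret_id="prod/orders/glue",
    subnet_id="subnet-0123456789abcdef0",
    security_group_ids=["sg-0123456789abcdef0"],
    availability_zone="us-east-1a",
)

# With credentials configured:
# boto3.client("glue").create_connection(ConnectionInput=connection_input)
print(connection_input["ConnectionType"])
```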

10. Explain the difference between AWS Glue and Amazon EMR.

Answer:

Feature | AWS Glue | Amazon EMR
Management | Serverless (AWS manages clusters) | You manage clusters (EC2 instances)
Cost | Pay per DPU-hour, billed per second | Pay for EC2 instances plus an EMR charge (spot pricing available)
Start Time | Seconds (cold starts exist) | Minutes (clusters take time to spin up)
Use Case | Ad-hoc ETL, Data Catalog, small/medium workloads | Large-scale big data processing, ML training, long-running clusters
Control | Limited (managed Spark) | Full control over Hadoop/Spark configurations

Interview Tip: Use Glue for serverless, event-driven jobs. Use EMR for massive, persistent clusters or specific framework versions.


💻 Module 5: Coding & Scenario Questions

11. Scenario: Handle a Glue job that suddenly fails due to “Out of Memory” (OOM).

Answer:
Troubleshooting Steps:

  1. Check CloudWatch Logs: Look for java.lang.OutOfMemoryError or specific stage failures.
  2. Diagnose: Usually caused by improper partitioning (e.g., a single partition holding 100GB of data).
  3. Fix:
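    • Code Fix: Use repartition(numPartitions) to redistribute the data evenly across partitions (this triggers a shuffle). coalesce(numPartitions) only merges existing partitions without a full shuffle, so it reduces the partition count but does not fix skew.
    • Configuration Fix: Change the worker type to G.2X (more memory per worker) or increase --number-of-workers.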
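
When choosing numPartitions, a simple starting point is to divide the data size by a target partition size. The sketch below assumes a 128 MiB target, which is a common rule of thumb rather than a Glue requirement; tune it for your workload.

```python
import math


def target_partitions(total_bytes: int, target_partition_bytes: int = 128 * 1024**2) -> int:
    """Number of partitions so that each holds roughly target_partition_bytes."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# The 100 GB example: how many partitions instead of a single oversized one
n = target_partitions(100 * 1024**3)
print(n)
# In the job (assuming df is a Spark DataFrame): df = df.repartition(n)
```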

12. Write a Python/PySpark script to handle incremental loading from S3 to Redshift.

Answer:

python

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Get job arguments (including bookmark)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# 1. Read from Catalog using Job Bookmark (automatic incremental)
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="s3_logs",
    transformation_ctx="datasource0"
)

# 2. Apply mappings or transformations (e.g., casting types)
apply_mapping = ApplyMapping.apply(frame=dynamic_frame, mappings=[...])

# 3. Write to Redshift (Overwrite or Append based on logic)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=apply_mapping,
    catalog_connection="redshift_connection",
    connection_options={
        "dbtable": "public.target_table",
        "database": "dev"
    },
    redshift_tmp_dir="s3://temp-bucket/redshift_staging"
)

job.commit()

13. How do you implement real-time data pipelines with Glue?

Answer:
Using AWS Glue streaming ETL jobs (a recent Glue version such as 3.0 or later is recommended).

  • Source: Amazon Kinesis Data Streams or MSK (Managed Kafka).
  • Processing: Glue runs a continuous Spark Structured Streaming job (serverless).
  • Sink: S3 (Delta Lake format) or JDBC.
  • Key Feature: It handles windowed aggregations (e.g., “Count clicks every 5 minutes”) without managing EC2 instances.
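
The "count clicks every 5 minutes" aggregation is a tumbling window. The plain-Python sketch below shows the bucketing logic on hypothetical timestamps; in a streaming job the same grouping would be expressed with Spark's window functions rather than hand-written code.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 300  # 5-minute tumbling windows


def window_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


def count_clicks(timestamps):
    """Count events per window start."""
    return Counter(window_start(ts) for ts in timestamps)


clicks = [
    datetime(2025, 1, 1, 12, 1, tzinfo=timezone.utc),
    datetime(2025, 1, 1, 12, 4, tzinfo=timezone.utc),
    datetime(2025, 1, 1, 12, 7, tzinfo=timezone.utc),
]

counts = count_clicks(clicks)
for start, n in sorted(counts.items()):
    print(start.isoformat(), n)
```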

📈 Bonus: “What are your best practices for production?”

Interviewers often end with this question to gauge real-world experience.

  1. Data Partitioning: Always write data to S3 partitioned by keys such as year, month, and day (partitionKeys in Glue write options, PARTITIONED BY in table DDL) to enable partition pruning.
  2. Idempotency: Design jobs so that running them twice produces the same result (e.g., using overwrite mode carefully or using S3 versioning).
  3. Monitoring: Set up CloudWatch Alarms for glue.driver.aggregate.elapsedTime and job failures. Enable Continuous Logging for real-time log streaming to CloudWatch.
  4. Cost Control: Use Python Shell Jobs (1/16 DPU) for light data validation or API calls. Only use Spark Jobs (min 2 DPUs) for heavy transformations.
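
Partitioning and idempotency work well together when each run writes to a deterministic prefix derived from the processing date. The sketch below builds such a Hive-style prefix; the bucket and dataset names are hypothetical.

```python
from datetime import date


def partition_prefix(base: str, day: date) -> str:
    """Deterministic Hive-style output prefix for one processing day."""
    return f"{base.rstrip('/')}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


prefix = partition_prefix("s3://curated-bucket/sales", date(2025, 1, 5))
print(prefix)
```

Because a rerun for the same day targets the same prefix, overwriting that prefix replaces the earlier output instead of duplicating it.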
