AWS Glue is one of the services most frequently asked about in AWS Data Engineering, Big Data, Analytics, AI/ML, and Cloud Engineering interviews.
Interviewers typically test:
- AWS Glue Fundamentals
- Architecture
- ETL/ELT Concepts
- Glue Jobs
- Crawlers
- Data Catalog
- DynamicFrames
- Spark Concepts
- Glue Studio
- Glue Workflows
- Performance Optimization
- Security
- Monitoring
- Real-world Scenarios
- Troubleshooting
- Advanced Architect-Level Design
1. What is AWS Glue?
Answer
AWS Glue is a fully managed serverless data integration service that helps discover, prepare, move, and transform data for analytics, machine learning, and application development.
It provides:
- ETL/ELT Processing
- Metadata Management
- Data Discovery
- Data Cataloging
- Data Quality
- Workflow Orchestration
Real-World Example
Suppose:
- Sales Data → S3
- Customer Data → RDS
- Product Data → DynamoDB
Glue can:
- Read all data
- Transform it
- Load into Snowflake, Redshift, or S3 Data Lake
2. Why AWS Glue?
Answer
Benefits:
- Serverless
- Auto Scaling
- Built-in Apache Spark
- No infrastructure management
- Integrated with AWS ecosystem
- Pay only for usage
Interview Tip
Companies prefer Glue because they don’t need to manage Spark clusters.
3. What are AWS Glue Components?
Answer
Major components:
| Component | Purpose |
|---|---|
| Data Catalog | Metadata repository |
| Crawler | Discover schema |
| ETL Job | Data transformation |
| Trigger | Start jobs |
| Workflow | Orchestrate jobs |
| Connection | Connect external systems |
| Development Endpoint | Interactive development |
| Glue Studio | Visual ETL designer |
| Data Quality | Validate data |
4. Explain AWS Glue Architecture
Answer
Flow:
Data Sources
|
v
Crawler
|
v
Data Catalog
|
v
Glue ETL Job
|
v
Target System
Sources:
- S3
- RDS
- Redshift
- DynamoDB
- Kafka
- Snowflake
- JDBC Sources
Targets:
- S3
- Redshift
- Snowflake
- OpenSearch
- JDBC Databases
5. What is a Glue Data Catalog?
Answer
Central metadata repository storing:
- Table Definitions
- Database Definitions
- Partitions
- Schema Information
Acts like a Hive Metastore.
Example
Table:
sales_data
Columns:
id
amount
date
country
Stored in Glue Catalog.
Athena and Redshift Spectrum can query it directly.
6. Why is Data Catalog Important?
Answer
Without catalog:
Every application must understand schema.
With catalog:
Single source of truth.
Benefits:
- Schema management
- Data discovery
- Query optimization
- Governance
7. What is a Glue Crawler?
Answer
Crawler automatically:
- Scans data
- Infers schema
- Detects partitions
- Creates metadata tables
Example
Crawler scans:
s3://sales/
Folders:
year=2024/
year=2025/
Creates:
sales_table
Partitions:
year=2024
year=2025
8. How Does a Crawler Work?
Answer
Steps:
- Connect source
- Sample data
- Detect schema
- Identify partitions
- Update Data Catalog
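The "identify partitions" step can be sketched in plain Python. This is a simplified model, not the crawler's actual implementation: it assumes Hive-style `name=value` folder names, and the object keys below are hypothetical.

```python
from collections import defaultdict


def parse_partitions(key: str) -> dict:
    """Extract Hive-style partition values (name=value) from an S3 object key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions


def summarize(keys: list) -> dict:
    """Group object keys by their partition values."""
    groups = defaultdict(list)
    for key in keys:
        parts = parse_partitions(key)
        groups[tuple(sorted(parts.items()))].append(key)
    return dict(groups)


# Hypothetical object keys under s3://sales/
keys = [
    "sales/year=2024/part-0001.csv",
    "sales/year=2024/part-0002.csv",
    "sales/year=2025/part-0001.csv",
]
partition_index = summarize(keys)
for partition, files in partition_index.items():
    print(dict(partition), len(files))
```

Each distinct set of partition values becomes one partition of the catalog table, which is what later makes partition pruning possible.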
9. What are Classifiers?
Answer
Classifiers identify data format.
Supported:
- JSON
- CSV
- XML
- Avro
- Parquet
Custom classifiers can also be created.
10. What is an AWS Glue Job?
Answer
A Glue Job performs ETL transformations.
Written in:
- PySpark
- Scala
- Python Shell
Example:
Read CSV
Transform
Write Parquet
11. What is Glue Studio?
Answer
Visual drag-and-drop ETL designer.
Allows:
- Build ETL pipelines
- Generate code automatically
- Monitor jobs
12. What Languages Does Glue Support?
Answer
Supported:
- PySpark
- Scala
- Python Shell
PySpark is most common.
13. What is DynamicFrame?
Answer
DynamicFrame is AWS Glue’s abstraction over Spark DataFrame.
Provides:
- Schema flexibility
- Handling inconsistent data
14. DynamicFrame vs DataFrame
| Feature | DynamicFrame | DataFrame |
|---|---|---|
| AWS Glue Native | Yes | No |
| Schema Handling | Flexible | Strict |
| Performance | Slightly Lower | Faster |
| Error Handling | Better | Limited |
Interview Answer
Use DynamicFrame for ingestion and DataFrame for complex transformations.
15. Convert DynamicFrame to DataFrame
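df = dynamic_frame.toDF()
16. Convert DataFrame to DynamicFrame
from awsglue.dynamicframe import DynamicFrame
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")
17. What is GlueContext?
Answer
Glue's wrapper around the Spark SparkContext (it builds on Spark's SQLContext).
Provides Glue-specific functionality such as creating and writing DynamicFrames.
from awsglue.context import GlueContext
glueContext = GlueContext(sc)  # sc is an existing SparkContext
18. What is Job Bookmark?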
Answer
Tracks previously processed data.
Prevents duplicate processing.
Example
Yesterday:
file1.csv
file2.csv
Processed.
Today:
file3.csv
Only file3 processed.
19. Benefits of Job Bookmarks
Answer
- Incremental loading
- Reduced processing
- Lower cost
- Faster execution
20. What Happens If Bookmarks Are Disabled?
Answer
Entire dataset gets reprocessed every run.
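The difference between running with and without bookmarks can be modelled with a toy example. This is a plain-Python sketch of the idea only; `BookmarkStore` and `run_job` are hypothetical names, and Glue keeps its real bookmark state internally (enabled through the `--job-bookmark-option job-bookmark-enable` job parameter and saved when the script calls `job.commit()`).

```python
class BookmarkStore:
    """Toy stand-in for bookmark state: remembers which files a job has processed."""

    def __init__(self):
        self._seen = set()

    def new_files(self, listing):
        """Return only files not seen in a previously committed run."""
        return [f for f in listing if f not in self._seen]

    def commit(self, processed):
        """Record processed files, similar in spirit to job.commit() in a Glue script."""
        self._seen.update(processed)


def run_job(store, listing, bookmarks_enabled=True):
    """Process a listing and return the files that were actually read."""
    to_process = store.new_files(listing) if bookmarks_enabled else list(listing)
    # ... the transformation would happen here ...
    if bookmarks_enabled:
        store.commit(to_process)
    return to_process


store = BookmarkStore()
day1 = run_job(store, ["file1.csv", "file2.csv"])
day2 = run_job(store, ["file1.csv", "file2.csv", "file3.csv"])
no_bookmark = run_job(
    BookmarkStore(), ["file1.csv", "file2.csv", "file3.csv"], bookmarks_enabled=False
)
print(day1, day2, no_bookmark)
```

With the bookmark, the second run reads only `file3.csv`; without it, every run reads all three files again.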
21. What Worker Types Exist in Glue?
Answer
Common:
| Worker | Use |
|---|---|
| G.1X | Standard |
| G.2X | More Memory |
| G.4X | Large Workloads |
| G.8X | Heavy Processing |
| G.16X | Enterprise Scale |
22. What is DPU?
Answer
Data Processing Unit.
1 DPU =
- 4 vCPU
- 16 GB Memory
Glue pricing is based on DPU consumption.
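A rough cost estimate follows directly from the DPU definition. The sketch below is an illustration under stated assumptions: the price per DPU-hour and the minimum billed duration are parameters you should check against current pricing for your Region and Glue version, and `estimate_job_cost` is a hypothetical helper, not an AWS API.

```python
import math

# DPUs per worker for common Glue worker types.
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}


def estimate_job_cost(worker_type, num_workers, runtime_seconds,
                      price_per_dpu_hour=0.44, minimum_seconds=60):
    """Estimate the cost of one job run in USD.

    price_per_dpu_hour and minimum_seconds are assumptions, not fixed values.
    """
    billed_seconds = max(math.ceil(runtime_seconds), minimum_seconds)
    dpus = DPU_PER_WORKER[worker_type] * num_workers
    dpu_hours = dpus * billed_seconds / 3600
    return round(dpu_hours * price_per_dpu_hour, 4)


# Ten G.1X workers running for 15 minutes.
cost = estimate_job_cost("G.1X", num_workers=10, runtime_seconds=15 * 60)
# Two G.1X workers running for 20 seconds are billed for the minimum duration.
short_cost = estimate_job_cost("G.1X", num_workers=2, runtime_seconds=20)
print(cost, short_cost)
```

In an interview, walking through this arithmetic (workers × DPUs per worker × hours × rate) is usually enough to show you understand how worker sizing drives cost.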
23. How Do You Optimize Glue Costs?
Answer
- Job bookmarks
- Partitioning
- Parquet format
- Proper worker sizing
- Pushdown predicates
- Incremental loads
24. What is Pushdown Predicate?
Answer
Filter data before reading.
Bad:
read all
then filter
Good:
read only required partition
Example:
year=2025
25. What File Formats Are Supported?
Answer
- CSV
- JSON
- XML
- Parquet
- ORC
- Avro
- Iceberg
- Hudi
- Delta Lake
26. Why Convert CSV to Parquet?
Answer
Benefits:
- Columnar
- Compression
- Faster queries
- Reduced storage
Often asked in interviews.
27. Explain Glue Workflow
Answer
Workflow orchestrates:
- Crawlers
- Jobs
- Triggers
into one pipeline.
Example:
Crawler
↓
ETL Job
↓
Validation Job
↓
Load Job
28. Types of Triggers
Answer
- Scheduled
- On-demand
- Event-based
- Conditional
29. Can Glue Be Triggered by S3 Events?
Answer
Yes.
Using:
S3 Event
→ EventBridge
→ Glue Workflow
30. How Does Glue Integrate with Athena?
Answer
Athena directly uses Glue Catalog metadata.
No schema duplication required.
31. How Does Glue Integrate with Redshift?
Answer
Glue:
- Extract data
- Transform
- Load into Redshift
Using JDBC or COPY command.
32. How Does Glue Integrate with Snowflake?
Answer
Using Snowflake Connector.
Typical flow:
S3 → Glue → Snowflake
33. Explain Glue Streaming Jobs
Answer
Process real-time data.
Sources:
- Kafka
- MSK
- Kinesis
Supports near-real-time ETL.
34. Difference Between Batch and Streaming Glue Jobs
| Batch | Streaming |
|---|---|
| Scheduled | Continuous |
| Finite Data | Infinite Data |
| Lower Cost | Higher Cost |
35. What is AWS Glue Data Quality?
Answer
Built-in data validation framework.
Checks:
- Nulls
- Duplicates
- Completeness
- Consistency
36. Example Data Quality Rule
IsComplete "customer_id"
Ensures no null values.
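Rules like this are written in DQDL (Data Quality Definition Language) and grouped into a ruleset. The sketch below builds a small ruleset string and models what `IsComplete` checks in plain Python; `build_ruleset` and `is_complete` are hypothetical helpers for illustration, and the sample rows are made up.

```python
def build_ruleset(rules):
    """Join individual DQDL rules into a ruleset definition string."""
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


def is_complete(rows, column):
    """Plain-Python model of IsComplete: True when no row has a null in the column."""
    return all(row.get(column) is not None for row in rows)


ruleset = build_ruleset([
    'IsComplete "customer_id"',
    'IsUnique "customer_id"',
])
print(ruleset)

# Hypothetical sample rows used only to illustrate the check.
rows = [{"customer_id": 1}, {"customer_id": None}]
print(is_complete(rows, "customer_id"))
```

The ruleset string is the part you would attach to a table or job; the Python check only illustrates the semantics of the rule.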
37. How Do You Secure AWS Glue?
Answer
Use:
- IAM
- KMS
- VPC
- Security Groups
- Lake Formation
- Encryption
38. How Is Data Encrypted?
At Rest
- SSE-S3
- SSE-KMS
In Transit
- SSL/TLS
39. Explain VPC Integration
Answer
Glue jobs can run inside VPC.
Needed when accessing:
- RDS
- Private Redshift
- On-prem systems
40. How Do You Monitor Glue Jobs?
Answer
Using:
- CloudWatch Logs
- CloudWatch Metrics
- Job Runs Dashboard
- EventBridge Alerts
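For the EventBridge alerts, the key piece is an event pattern that matches failed job runs. A minimal sketch, assuming a job named `nightly_sales_etl` (hypothetical); the rule target (for example an SNS topic) is not shown.

```python
import json


def glue_failure_pattern(job_names):
    """Build an EventBridge event pattern matching failed or timed-out Glue job runs."""
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": list(job_names),
            "state": ["FAILED", "TIMEOUT"],
        },
    }


# "nightly_sales_etl" is a hypothetical job name.
pattern = glue_failure_pattern(["nightly_sales_etl"])
print(json.dumps(pattern, indent=2))
```

The JSON output is what you would paste into an EventBridge rule as its event pattern.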
41. Common Glue Failures?
Answer
- Memory errors
- Schema mismatch
- Permission denied
- JDBC timeout
- Network issues
42. How Do You Troubleshoot Memory Issues?
Answer
- Increase workers
- Repartition data
- Use Parquet
- Pushdown predicates
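For the repartitioning step, a common starting point is to size partitions by input volume. The sketch below assumes a target of roughly 128 MB per partition, which is a rule of thumb rather than a Glue requirement; `target_partitions` is a hypothetical helper.

```python
import math


def target_partitions(total_bytes, target_partition_bytes=128 * 1024**2):
    """Number of partitions so each holds roughly target_partition_bytes of input."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# 100 GB of input split into pieces of about 128 MB each.
n = target_partitions(100 * 1024**3)
print(n)  # 800
```

In a Spark job the result would typically be passed to `df.repartition(n)` before the memory-heavy stage.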
43. Glue vs EMR
| Glue | EMR |
|---|---|
| Serverless | Cluster Managed |
| ETL Focused | Big Data Platform |
| Simple | Flexible |
| Less Control | More Control |
Interview Answer
Use Glue for ETL.
Use EMR for complex Spark workloads.
44. Glue vs Lambda
| Glue | Lambda |
|---|---|
| Big Data | Small Data |
| Spark | Python/Node |
| GB/TB Scale | MB Scale |
45. Glue vs DataBrew
| Glue | DataBrew |
|---|---|
| Developer Focused | Business User Focused |
| Coding | No-Code |
46. Real-Time Scenario
Question
How would you process clickstream data?
Answer
Architecture:
Website
↓
Kinesis
↓
Glue Streaming
↓
S3
↓
Athena
47. Scenario: Incremental Daily Loads
Answer
Use:
- Job Bookmarks
- Partitioned S3
- Workflow
48. Scenario: 5 TB Daily Data
Answer
Optimization:
- Parquet
- Partitioning
- G.8X Workers
- Predicate Pushdown
49. Scenario: Duplicate Records
Answer
Use Spark:
df.dropDuplicates()
or primary-key validation.
50. Senior Architect Question
Design a Modern AWS Data Lake Using Glue
Answer
Architecture:
Sources
↓
S3 Landing
↓
Glue Crawler
↓
Glue Catalog
↓
Glue ETL
↓
Curated Zone
↓
Athena
↓
Redshift
↓
QuickSight
Services:
- S3
- Glue
- Athena
- Redshift
- Lake Formation
- CloudWatch
- IAM
- KMS
Top 20 AWS Glue Interview Questions Asked Most Frequently
- What is AWS Glue?
- Difference between DynamicFrame and DataFrame?
- What is Glue Data Catalog?
- What are Crawlers?
- What are Job Bookmarks?
- What is DPU?
- How does Glue pricing work?
- Glue vs EMR?
- Glue vs Lambda?
- How do you optimize Glue jobs?
- What are Pushdown Predicates?
- How does Glue integrate with Athena?
- How does Glue integrate with Redshift?
- Explain Glue Workflows.
- Explain Glue Streaming.
- How do you handle schema evolution?
- How do you secure Glue?
- How do you troubleshoot failed jobs?
- What worker type would you choose for a 5 TB dataset?
- Design an end-to-end AWS Data Lake using Glue.
For a Senior Data Engineer, Cloud Engineer, or Solutions Architect interview in the U.S. market, you should also be prepared for advanced topics such as Apache Iceberg, Hudi, Delta Lake, Glue 5.0, Spark optimization, Lake Formation integration, CDC pipelines, and multi-account data lake architectures.
AWS Glue is a fully managed, serverless ETL (Extract, Transform, Load) service from AWS that simplifies discovering, preparing, and combining data for analytics, machine learning, and application development. It handles data cataloging, schema inference, job orchestration, and execution using Apache Spark (with Python/PySpark or Scala support).
It is particularly useful for building data lakes, integrating data from various sources (S3, RDS, Redshift, JDBC, etc.), and preparing it for tools like Amazon Athena, Redshift, or SageMaker.
Basic Questions
1. What is AWS Glue? AWS Glue is a serverless data integration service that automates ETL processes. It includes a Data Catalog for metadata, crawlers for schema discovery, and an ETL engine for running Spark-based jobs. It eliminates infrastructure management, auto-scales, and supports pay-as-you-go pricing based on Data Processing Units (DPUs).
2. What are the main components of AWS Glue?
- Data Catalog: Centralized metadata repository (databases, tables, schemas) compatible with Hive Metastore.
- Crawlers: Scan data sources to infer schemas and populate/update the Data Catalog.
- ETL Jobs: Scripts (Python/Scala) for transforming data; run on Apache Spark.
- Triggers: Schedule or event-based (e.g., S3 events, job completion) starters for jobs/crawlers.
- Development Endpoints: For interactive development/testing of scripts.
- Workflows: Orchestrate complex multi-job/crawler pipelines.
- Glue Studio: Visual interface for building ETL jobs (no-code/low-code).
- Glue DataBrew: For data cleaning/preparation with visual recipes.
- Glue Data Quality: Rules-based monitoring and validation.
3. Explain AWS Glue architecture. Data sources → Crawlers populate Data Catalog → Jobs read from Catalog sources, apply transformations (Spark), and write to targets. Triggers/workflows orchestrate execution. Metadata is stored in the Catalog; execution uses serverless Spark clusters (DPUs). Integration with Lake Formation for governance and security.
4. What are AWS Glue Crawlers and how do they work? Crawlers scan data stores (S3, databases, etc.), infer schemas (using classifiers), and create/update tables in the Data Catalog. They support incremental crawls, schema evolution, and S3 event notifications for efficiency. They handle partitioning and can use custom classifiers.
5. What is the AWS Glue Data Catalog? A persistent, managed metadata store (like a Hive metastore) that holds structural/operational metadata about data assets. It enables discovery, querying (e.g., via Athena), and governance across AWS services. Supports versioning, encryption, and Lake Formation permissions.
Intermediate Questions
6. How does AWS Glue handle schema evolution? Crawlers detect changes (new columns, types) and update table metadata. You can configure behavior (e.g., ignore changes, add new columns, or create new versions). Supports partition indexes and grouping policies.
7. What are Triggers in AWS Glue? Triggers start jobs or crawlers. Types:
- Scheduled (cron-like).
- On-demand.
- Conditional (based on job/crawler success/failure).
They enable chaining for workflows.
8. Explain Glue Jobs. What languages/scripts are supported? Jobs define ETL logic with a script, sources, and targets. Primarily Python (PySpark) or Scala. Glue generates boilerplate scripts; you can edit in Studio, console, or IDEs. Supports bookmarks for incremental processing.
9. What are DPUs in AWS Glue? Data Processing Units measure compute capacity. 1 DPU = 4 vCPUs + 16 GB RAM. You allocate DPUs to jobs; auto-scaling is available. Pricing is per DPU-hour, billed per second with a 1-minute minimum on Glue 2.0 and later (10 minutes on older versions).
10. What is Glue Studio? A visual, drag-and-drop interface to build, edit, and monitor ETL jobs without deep coding. Generates Spark code that can be customized.
11. How do you monitor and debug Glue Jobs? Use CloudWatch metrics/logs, job run details in console (logs, errors, timelines), bookmarks, and Glue Data Quality. Enable continuous logging and job insights.
12. What security features does AWS Glue support?
- IAM roles/policies for fine-grained access.
- Encryption at rest (KMS) and in transit (SSL).
- VPC integration, security groups.
- Lake Formation for row/column-level permissions.
- Audit logging via CloudTrail.
13. How does AWS Glue integrate with other AWS services?
- S3: Primary storage.
- Athena/Redshift: Query targets.
- Lake Formation: Governance.
- SageMaker: ML pipelines.
- EventBridge/Lambda: Event-driven triggers.
- Kinesis/MSK: Streaming integration.
Advanced / Scenario-Based Questions
14. Explain Glue Workflows. Orchestration tool for complex ETL involving multiple jobs, crawlers, and triggers. Visual graph in console; supports dependencies and start triggers.
15. How do you optimize Glue Job performance?
- Use partitioning and predicate pushdown.
- Dynamic allocation and auto-scaling.
- Appropriate worker types (Standard vs. G.1X/G.2X).
- Cache/repartition data; avoid small files.
- Use Glue bookmarks and incremental crawls.
- Optimize memory (skew handling, broadcast joins).
- Push transformations early (filtering).
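Predicate pushdown is the optimization interviewers most often ask you to show in code. The sketch below builds a predicate string and passes it through the `push_down_predicate` argument of `create_dynamic_frame.from_catalog`; `build_predicate` and `read_partition` are hypothetical helpers, and `read_partition` only works inside a Glue job where a `GlueContext` is available.

```python
def build_predicate(partitions):
    """Build a push-down predicate string from partition column values."""
    return " and ".join(f"{col} == '{val}'" for col, val in partitions.items())


def read_partition(glue_context, database, table, partitions):
    """Read only the matching partitions of a catalog table.

    glue_context is an awsglue.context.GlueContext, so this runs only inside Glue.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table,
        push_down_predicate=build_predicate(partitions),
    )


predicate = build_predicate({"year": "2025", "month": "01"})
print(predicate)
```

Because the predicate is applied to partition metadata in the Data Catalog, files in non-matching partitions are never listed or read.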
16. Difference between AWS Glue, EMR, and Data Pipeline.
- Glue: Serverless ETL-focused, easy cataloging/crawlers, managed Spark. Best for standard ETL.
- EMR: Managed Hadoop/Spark clusters; more flexible/customizable for complex big data/ML, but requires more management. Cheaper for sustained heavy workloads.
- Data Pipeline: Workflow orchestration (legacy); less ETL-native than Glue.
17. How would you handle incremental ETL in Glue? Use job bookmarks (tracks processed data), S3 event notifications for crawlers, or custom logic with last-modified timestamps/watermarks in scripts. Combine with partitioning.
18. What are common challenges with Glue and how to address them?
- Cost for large/spiky workloads → Optimize DPUs, use the Flex execution class for non-urgent jobs, or switch to EMR Serverless.
- Schema drift → Configure crawler rules.
- Small files → Compaction jobs.
- Debugging Spark issues → Use development endpoints/interactive sessions.
- Cold starts → Pre-warm or schedule appropriately.
19. Explain Glue Data Quality. Feature to define rules (e.g., completeness, uniqueness) and monitor data in pipelines or lakes. Auto-suggests rules; integrates with workflows for alerts/blocking.
20. How do you handle large-scale data or joins in Glue? Scale DPUs, use broadcast joins for small tables, skew mitigation, repartitioning, and Glue’s optimized connectors. For very large/complex, consider EMR.
Other Common Topics
- Pricing: DPUs + crawlers (per hour) + Data Catalog requests.
- Limitations: Less customizable than self-managed Spark; Python shell jobs for lightweight tasks.
- Best Practices: Use IAM least privilege, partition data, monitor with CloudWatch, version scripts in S3/CodeCommit, test with dev endpoints, and leverage Lake Formation.
- Development Workflow: Use interactive sessions (Jupyter), Glue Studio, or local Spark testing.
For the most up-to-date details, refer to the official AWS Glue documentation. Practice with real scenarios (e.g., S3 → transformed Parquet in another bucket with partitioning) and hands-on labs. Good luck with your interview!
Preparing for an AWS Glue interview can feel like a significant challenge, as it requires you to demonstrate knowledge not just of the service itself, but of Spark, data governance, and AWS integration.
To help you succeed, I have compiled the most important questions and detailed answers, categorized by topic and difficulty. This guide covers everything from basic definitions to advanced architectural scenarios, directly reflecting what interviewers are looking for in 2025-2026.
🧱 Module 1: Core Concepts & Architecture
1. What is AWS Glue, and why is it considered “serverless”?
Answer:
AWS Glue is a fully managed, serverless ETL (Extract, Transform, Load) service.
It is considered “serverless” because you do not provision or manage clusters (such as EC2 instances or EMR clusters). AWS Glue automatically provisions, scales, and terminates the resources (Spark or Python environments) needed to run your jobs. You only pay for the compute time consumed during execution.
2. What are the primary components of AWS Glue?
Answer:
The architecture is built on several key pillars:
- Glue Data Catalog: A central metadata repository (Hive metastore compatible) storing table definitions, schemas, and locations (e.g., S3 paths).
- Crawlers: Automated processes that scan data sources (S3, RDS) to infer schemas and populate the Data Catalog.
- ETL Jobs: The logic for transforming data. Runs on Apache Spark (for heavy lifting) or Python Shell (for lightweight scripts).
- Workflows & Triggers: Orchestration tools to chain multiple crawlers and jobs together based on time or event dependencies.
- Glue Studio: A visual interface to design, debug, and monitor ETL pipelines without heavy coding.
3. Can you explain the difference between a Spark DataFrame and a Glue DynamicFrame?
Answer:
This is a critical distinction for technical interviews.
- DataFrame (Spark): Lazy-evaluated and highly optimized for performance, but it needs a single schema for the whole dataset. Code that references a missing column, or data with inconsistent types, causes errors or unexpected nulls.
- DynamicFrame (Glue): A Glue-specific abstraction, convertible to and from a DataFrame, in which each record is self-describing, so no schema is required up front. Inconsistent types are recorded as choice types that you resolve later (for example with resolveChoice). It is ideal for semi-structured data (JSON) or messy data sources.
Code Snippet (Interview Example):
python
# Using DataFrame (strict)
df = spark.read.csv("s3://path/")
df.printSchema()  # Prints the single schema Spark applies to every row
# Using DynamicFrame (flexible)
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table"
)
# Inconsistent column types surface as choice types; resolve them with resolveChoice
🕷️ Module 2: Crawlers & Data Catalog
4. How does a Crawler handle Schema Evolution?
Answer:
A Crawler scans data stores and updates the Data Catalog. You can configure its behavior for schema updates:
- Update the table definition: Adds new columns found in new data files.
- Ignore changes: Leaves the catalog as-is (risky for query engines).
- Deprecate deleted objects: Marks tables whose underlying data has been removed as deprecated rather than deleting them from the catalog.
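These choices map to the `SchemaChangePolicy` of a crawler. A minimal sketch of the arguments for the Glue `CreateCrawler` API follows; the crawler name, role ARN, database, and S3 path are hypothetical, and the actual `boto3` call is left as a comment so the snippet runs without AWS credentials.

```python
def crawler_definition(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "SchemaChangePolicy": {
            # Apply detected schema changes to the existing table definition.
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            # Mark entries for removed data as deprecated instead of deleting them.
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }


# All values below are hypothetical placeholders.
params = crawler_definition(
    name="sales_crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="raw_db",
    s3_path="s3://sales/",
)
print(params["SchemaChangePolicy"])

# With credentials configured, the crawler could then be created with:
#   boto3.client("glue").create_crawler(**params)
```

Setting `UpdateBehavior` to `LOG` instead corresponds to the "ignore changes" option.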
5. How does the Glue Data Catalog integrate with other services?
Answer:
The Data Catalog acts as the “single source of truth” for metadata across AWS.
- Amazon Athena: Allows SQL queries directly on S3 data using Glue tables.
- Amazon Redshift Spectrum: Enables Redshift to query the Data Lake without loading data.
- Amazon EMR: Can use Glue as a Hive Metastore instead of hosting its own MySQL/PostgreSQL database.
⚙️ Module 3: ETL Jobs, Development & Optimization
6. You are processing millions of log files. How do you optimize performance?
Answer:
Optimization in Glue focuses on minimizing I/O and shuffle operations:
- Partition Pruning: Use partitioned data (e.g., year=2024/...) so Glue only reads necessary folders.
- Use File Formats: Prefer Parquet (columnar) over CSV/JSON to reduce scan time.
- Increase DPUs (Data Processing Units): 1 DPU = 4 vCPU + 16GB RAM. For large datasets, use G.2X or G.4X workers for memory-intensive aggregations.
- Job Bookmarks: Enable incremental processing to avoid reprocessing old data.
- Broadcast Hash Joins: If joining a large table with a small lookup table, wrap the small table in broadcast() (from pyspark.sql.functions) to avoid shuffling the large table across the network.
7. What is a Glue Job Bookmark? How does it handle incremental loads?
Answer:
A Job Bookmark keeps track of previously processed data.
- Function: It stores the last processed timestamp or file name in a persistent state store.
- Incremental Load: When the job runs again, Glue checks the bookmark, reads only the new data (e.g., files added since last run), and skips the old data.
- Use Case: Essential for processing log files in S3 or reading transaction logs from JDBC where you only want new records.
8. How do you orchestrate a complex pipeline (e.g., Crawl -> Transform -> Load -> Archive)?
Answer:
Using Glue Workflows and Triggers:
- Trigger 1: On schedule (e.g., 2 AM).
- Node A: Crawler_Logs (populates Catalog).
- Trigger 2: Conditional (starts on Crawler success).
- Node B: ETL_Job_Process (transforms data).
- Trigger 3: Conditional (starts on ETL_Job_Process success).
- Node C: ETL_Job_Archive (moves raw files).
This eliminates the need for a separate orchestrator like Step Functions for simple linear dependencies.
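The chain above can be expressed as three trigger definitions for the Glue `CreateTrigger` API. This is a sketch under assumptions: the workflow name and trigger names are hypothetical, the schedule assumes 2 AM UTC, and the `boto3` calls are left as a comment so the snippet runs without AWS access.

```python
def scheduled_trigger(workflow, crawler, cron):
    """Start trigger: run the crawler on a schedule."""
    return {
        "Name": f"{workflow}-start",
        "WorkflowName": workflow,
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"CrawlerName": crawler}],
        "StartOnCreation": True,
    }


def conditional_trigger(workflow, name, condition, next_job):
    """Conditional trigger: start next_job when the upstream condition is met."""
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {"Conditions": [{"LogicalOperator": "EQUALS", **condition}]},
        "Actions": [{"JobName": next_job}],
        "StartOnCreation": True,
    }


WORKFLOW = "logs_pipeline"  # hypothetical workflow name
triggers = [
    scheduled_trigger(WORKFLOW, "Crawler_Logs", "cron(0 2 * * ? *)"),
    conditional_trigger(
        WORKFLOW, "after-crawl",
        {"CrawlerName": "Crawler_Logs", "CrawlState": "SUCCEEDED"},
        "ETL_Job_Process",
    ),
    conditional_trigger(
        WORKFLOW, "after-process",
        {"JobName": "ETL_Job_Process", "State": "SUCCEEDED"},
        "ETL_Job_Archive",
    ),
]
for t in triggers:
    print(t["Name"], t["Type"], t["Actions"])

# With credentials configured, each definition could be created with:
#   glue = boto3.client("glue"); glue.create_trigger(**t)
```

Note that crawler conditions use `CrawlState` while job conditions use `State`, a detail that is easy to get wrong when chaining the two.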
🔒 Module 4: Security, Governance & Advanced Integration
9. How do you connect Glue to a private RDS/Aurora database?
Answer:
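Connecting to private databases involves networking configuration:
- VPC Configuration: The Glue job must run in a subnet that can reach the RDS instance. Attach the Glue connection to that VPC, a subnet, and a security group that the database's security group allows; the Glue security group also needs a self-referencing inbound rule.
- Glue Connection: Create a Connection object of type JDBC with the RDS endpoint, port, and database name.
- Secrets Manager (Best Practice): Do not hardcode passwords. Store credentials in AWS Secrets Manager and grant the Glue IAM role permission to GetSecretValue.
- IAM: The role needs the EC2 networking permissions Glue uses to attach to the VPC (for example ec2:DescribeSecurityGroups, ec2:DescribeSubnets, and ec2:CreateNetworkInterface), which the AWSGlueServiceRole managed policy includes.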
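The Secrets Manager step might look like the sketch below. It assumes a PostgreSQL database and a secret whose JSON contains `host`, `port`, and `dbname` keys (the layout RDS-managed secrets typically use); `load_secret` and `jdbc_url_from_secret` are hypothetical helper names, and the sample secret is made up.

```python
import json


def jdbc_url_from_secret(secret_string):
    """Build a PostgreSQL JDBC URL from a Secrets Manager secret JSON string."""
    secret = json.loads(secret_string)
    return f"jdbc:postgresql://{secret['host']}:{secret['port']}/{secret['dbname']}"


def load_secret(secret_id):
    """Fetch the secret string; needs boto3, credentials, and GetSecretValue permission."""
    import boto3  # imported lazily so the rest of the sketch runs without boto3

    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]


# Hypothetical secret content, used here instead of calling AWS.
sample_secret = json.dumps({
    "username": "etl_user",
    "password": "not-a-real-password",
    "host": "db.example.internal",
    "port": 5432,
    "dbname": "sales",
})
url = jdbc_url_from_secret(sample_secret)
print(url)
```

In a real job you would call `load_secret` with your secret's name and pass the username and password to the JDBC reader rather than embedding them in the script.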
10. Explain the difference between AWS Glue and Amazon EMR.
Answer:
| Feature | AWS Glue | Amazon EMR |
|---|---|---|
| Management | Serverless (AWS manages clusters) | You manage clusters (EC2 instances) |
| Cost | Pay per DPU/second | Pay per EC2 hour (plus spot pricing options) |
| Start Time | Seconds (cold starts exist) | Minutes (clusters take time to spin up) |
| Use Case | Ad-hoc ETL, Data Catalog, small/medium workloads | Large-scale big data processing, ML training, long-running clusters |
| Control | Limited (managed Spark) | Full control over Hadoop/Spark configurations |
Interview Tip: Use Glue for serverless, event-driven jobs. Use EMR for massive, persistent clusters or specific framework versions.
💻 Module 5: Coding & Scenario Questions
11. Scenario: Handle a Glue job that suddenly fails due to “Out of Memory” (OOM).
Answer:
Troubleshooting Steps:
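- Check CloudWatch Logs: Look for java.lang.OutOfMemoryError or specific stage failures.
- Diagnose: Usually caused by improper partitioning (e.g., a single partition holding 100GB of data).
- Fix: Repartition the data so no single partition is oversized, move to a larger worker type or add workers, and filter early with pushdown predicates.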
12. Write a Python/PySpark script to handle incremental loading from S3 to Redshift.
Answer:
python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# Get job arguments (including bookmark)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# 1. Read from Catalog using Job Bookmark (automatic incremental)
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="s3_logs",
    transformation_ctx="datasource0"
)
# 2. Apply mappings or transformations (e.g., casting types)
apply_mapping = ApplyMapping.apply(frame=dynamic_frame, mappings=[...])
# 3. Write to Redshift (Overwrite or Append based on logic)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=apply_mapping,
    catalog_connection="redshift_connection",
    connection_options={
        "dbtable": "public.target_table",
        "database": "dev"
    },
    redshift_tmp_dir="s3://temp-bucket/redshift_staging"
)
job.commit()
13. How do you implement real-time data pipelines with Glue?
Answer:
Using AWS Glue Streaming (Glue version 3.0+).
- Source: Amazon Kinesis Data Streams or MSK (Managed Kafka).
- Processing: Glue runs a continuous Spark Streaming job (serverless).
- Sink: S3 (Delta Lake format) or JDBC.
- Key Feature: It handles windowed aggregations (e.g., “Count clicks every 5 minutes”) without managing EC2 instances.
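To make the windowed aggregation concrete, here is a plain-Python model of a 5-minute tumbling window count. It illustrates the concept only and is not Glue or Spark code; the click events are hypothetical `(epoch_seconds, page)` pairs.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60


def window_start(epoch_seconds, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains the timestamp."""
    return epoch_seconds - (epoch_seconds % size)


def count_per_window(events):
    """Count events per (window start, page); events are (epoch_seconds, page) pairs."""
    counts = Counter()
    for ts, page in events:
        counts[(window_start(ts), page)] += 1
    return dict(counts)


# Hypothetical click events.
clicks = [(0, "/home"), (120, "/home"), (299, "/cart"), (300, "/home")]
result = count_per_window(clicks)
print(result)
```

In a Spark streaming job the same grouping is usually expressed with `pyspark.sql.functions.window` on an event-time column, with Spark managing state across micro-batches.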
📈 Bonus: “What are your best practices for production?”
Interviewers often end with this question to gauge real-world experience.
- Data Partitioning: Always write data to S3 using PARTITIONED BY (year, month, day) to enable query pruning.
- Idempotency: Design jobs so that running them twice produces the same result (e.g., using overwrite mode carefully or using S3 versioning).
- Monitoring: Set up CloudWatch Alarms for glue.driver.aggregate.elapsedTime and job failures. Enable Continuous Logging for real-time log streaming to CloudWatch.
- Cost Control: Use Python Shell Jobs (1/16 DPU) for light data validation or API calls. Only use Spark Jobs (min 2 DPUs) for heavy transformations.
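Partitioning and idempotency work well together: if each run writes to a prefix derived only from the processing date, a rerun replaces the same location instead of adding duplicates. A minimal sketch, assuming Hive-style `year=/month=/day=` prefixes; the bucket name and `partition_prefix` helper are hypothetical.

```python
from datetime import date


def partition_prefix(base, day):
    """Build a deterministic Hive-style S3 prefix for one processing date."""
    return f"{base.rstrip('/')}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


# Hypothetical curated-zone location.
prefix = partition_prefix("s3://curated-bucket/sales", date(2025, 1, 5))
print(prefix)
```

Because the prefix is a pure function of the date, clearing or overwriting it before each write makes the daily job safe to rerun.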


