AWS Glue is one of the services that comes up most often in AWS Data Engineering, Big Data, Analytics, AI/ML, and Cloud Engineering interviews.

Interviewers typically test:

AWS Glue Fundamentals
Architecture
ETL/ELT Concepts
Glue Jobs
Crawlers
Data Catalog
DynamicFrames
Spark Concepts
Glue Studio
Glue Workflows
Performance Optimization
Security
Monitoring
Real-world Scenarios
Troubleshooting
Advanced Architect-Level Design

1. What is AWS Glue?

Answer

AWS Glue is a fully managed serverless data integration service that helps discover, prepare, move, and transform data for analytics, machine learning, and application development.

It provides:

ETL/ELT Processing
Metadata Management
Data Discovery
Data Cataloging
Data Quality
Workflow Orchestration

Real-World Example

Suppose:

Sales Data → S3
Customer Data → RDS
Product Data → DynamoDB

Glue can:

Read all data
Transform it
Load into Snowflake, Redshift, or S3 Data Lake

2. Why AWS Glue?

Answer

Benefits:

Serverless
Auto Scaling
Built-in Apache Spark
No infrastructure management
Integrated with AWS ecosystem
Pay only for usage

Interview Tip

Companies prefer Glue because they don’t need to manage Spark clusters.

3. What are AWS Glue Components?

Answer

Major components:

Component	Purpose
Data Catalog	Metadata repository
Crawler	Discover schema
ETL Job	Data transformation
Trigger	Start jobs
Workflow	Orchestrate jobs
Connection	Connect external systems
Development Endpoint	Interactive development (legacy; newer Glue versions use interactive sessions)
Glue Studio	Visual ETL designer
Data Quality	Validate data

4. Explain AWS Glue Architecture

Answer

Flow:

Data Sources
    |
    v
Crawler
    |
    v
Data Catalog
    |
    v
Glue ETL Job
    |
    v
Target System

Sources:

S3
RDS
Redshift
DynamoDB
Kafka
Snowflake
JDBC Sources

Targets:

S3
Redshift
Snowflake
OpenSearch
JDBC Databases

5. What is a Glue Data Catalog?

Answer

Central metadata repository storing:

Table Definitions
Database Definitions
Partitions
Schema Information

Acts like Hive Metastore.

Example

Table:

sales_data

Columns:

id
amount
date
country

Stored in Glue Catalog.

Athena and Redshift Spectrum can query the table directly through the catalog.

6. Why is Data Catalog Important?

Answer

Without catalog:

Every application must understand schema.

With catalog:

Single source of truth.

Benefits:

Schema management
Data discovery
Query optimization
Governance

7. What is a Glue Crawler?

Answer

Crawler automatically:

Scans data
Infers schema
Detects partitions
Creates metadata tables

Example

Crawler scans:

s3://sales/

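Folders:

year=2024/
year=2025/

(With plain 2024/ and 2025/ folders the crawler still detects partitions, but names the column partition_0 instead of year.)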

Creates:

sales_table

Partitions:

year=2024
year=2025
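
As a rough illustration of how folder names become partition columns, here is a minimal sketch in plain Python. It is a simplified model, not the crawler's actual implementation; the function name `partition_values` and the sample keys are made up for the example.

```python
def partition_values(key: str, table_prefix: str) -> dict:
    """Derive partition columns from the folders between the table root and the file."""
    relative = key[len(table_prefix):] if key.startswith(table_prefix) else key
    folders = relative.split("/")[:-1]  # drop the file name
    partitions = {}
    for index, folder in enumerate(folders):
        if "=" in folder:
            name, value = folder.split("=", 1)  # Hive-style folder: year=2025
        else:
            name, value = f"partition_{index}", folder  # unnamed folder: 2025
        partitions[name] = value
    return partitions


hive_style = partition_values("sales/year=2025/part-0001.csv", "sales/")
plain = partition_values("sales/2025/part-0001.csv", "sales/")
print(hive_style, plain)
```

Hive-style folder names give readable partition columns, which is why they are usually preferred in data lake layouts.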

8. How Does a Crawler Work?

Answer

Steps:

Connect source
Sample data
Detect schema
Identify partitions
Update Data Catalog

9. What are Classifiers?

Answer

Classifiers identify data format.

Supported:

JSON
CSV
XML
Avro
Parquet

Custom classifiers can also be created.

10. What is an AWS Glue Job?

Answer

A Glue Job performs ETL transformations.

Job types and languages:

Spark ETL jobs in PySpark
Spark ETL jobs in Scala
Python Shell jobs in plain Python (no Spark)

Example:

Read CSV

Transform

Write Parquet

11. What is Glue Studio?

Answer

Visual drag-and-drop ETL designer.

Allows:

Build ETL pipelines
Generate code automatically
Monitor jobs

12. What Languages Does Glue Support?

Answer

Supported:

Python (PySpark for Spark jobs, plain Python for Python Shell jobs)
Scala (Spark jobs)

PySpark is most common.

13. What is DynamicFrame?

Answer

DynamicFrame is AWS Glue's data abstraction, similar to a Spark DataFrame, except that each record is self-describing, so no schema is required up front.

Provides:

Schema flexibility
Handling inconsistent data

14. DynamicFrame vs DataFrame

Feature	DynamicFrame	DataFrame
AWS Glue Native	Yes	No
Schema Handling	Flexible	Strict
Performance	Slightly Lower	Faster
Error Handling	Better	Limited

Interview Answer

Use DynamicFrame for ingestion and DataFrame for complex transformations.

15. Convert DynamicFrame to DataFrame

df = dynamic_frame.toDF()

16. Convert DataFrame to DynamicFrame

from awsglue.dynamicframe import DynamicFrame

dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

17. What is GlueContext?

Answer

Wrapper around SparkContext.

Provides Glue-specific functionality.

glueContext = GlueContext(sc)

18. What is Job Bookmark?

Answer

Tracks previously processed data.

Prevents duplicate processing.

Example

Yesterday:

file1.csv
file2.csv

Processed.

Today:

file3.csv

Only file3 processed.
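
The file example above can be modelled in a few lines of plain Python. This is a conceptual sketch only, assuming a bookmark that remembers the newest modification time it has seen; Glue's internal bookmark state is more involved, and the timestamps here are invented.

```python
def select_new_files(listing: dict, bookmark: float) -> tuple:
    """Return files modified after the bookmark, plus the advanced bookmark."""
    new_files = sorted(name for name, modified in listing.items() if modified > bookmark)
    new_bookmark = max([bookmark] + [listing[name] for name in new_files])
    return new_files, new_bookmark


# Yesterday: two files exist (values are hypothetical modification times).
listing = {"file1.csv": 100.0, "file2.csv": 200.0}
first_run, bookmark = select_new_files(listing, bookmark=0.0)

# Today: one more file arrives, and only that file is selected.
listing["file3.csv"] = 300.0
second_run, bookmark = select_new_files(listing, bookmark)
print(first_run, second_run, bookmark)
```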

19. Benefits of Job Bookmarks

Answer

Incremental loading
Reduced processing
Lower cost
Faster execution

20. What Happens If Bookmark Disabled?

Answer

Entire dataset gets reprocessed every run.

21. What Worker Types Exist in Glue?

Answer

Common:

Worker	Use
G.1X	Standard
G.2X	More Memory
G.4X	Large Workloads
G.8X	Heavy Processing
G.16X	Enterprise Scale

22. What is DPU?

Answer

Data Processing Unit.

1 DPU =

4 vCPU
16 GB Memory

Glue pricing is based on DPU consumption.
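
To make the DPU arithmetic concrete, here is a small cost estimate in Python. The price of 0.44 USD per DPU-hour and the 60-second billing minimum are assumptions for illustration; check the pricing page for your region and Glue version. The DPU count per worker type follows the usual mapping (G.1X = 1 DPU, G.2X = 2 DPU, and so on).

```python
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}


def job_cost(worker_type, workers, runtime_seconds,
             price_per_dpu_hour=0.44, minimum_seconds=60):
    """Estimate the cost of one job run in USD (assumed rate and minimum)."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    dpu_hours = DPUS_PER_WORKER[worker_type] * workers * billed_seconds / 3600
    return round(dpu_hours * price_per_dpu_hour, 4)


# 10 G.1X workers for 30 minutes = 5 DPU-hours.
cost = job_cost("G.1X", workers=10, runtime_seconds=1800)
print(cost)
```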

23. How Do You Optimize Glue Costs?

Answer

Job bookmarks
Partitioning
Parquet format
Proper worker sizing
Pushdown predicates
Incremental loads

24. What is Pushdown Predicate?

Answer

Filter data before reading.

Bad:

read all
then filter

Good:

read only required partition

Example:

year=2025
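
A minimal sketch of the idea in plain Python: build the filter expression, then show what pruning achieves on a list of partitions. In a Glue script the expression string would be passed as the `push_down_predicate` argument of `create_dynamic_frame.from_catalog`; the helper names and the partition list below are invented for the example.

```python
def build_predicate(**partition_filters) -> str:
    """Build a Spark SQL style expression over partition columns."""
    clauses = [f"{column} == '{value}'" for column, value in partition_filters.items()]
    return " and ".join(clauses)


def prune(partitions: list, **partition_filters) -> list:
    """Keep only partitions whose values match every filter."""
    return [
        p for p in partitions
        if all(p.get(col) == str(val) for col, val in partition_filters.items())
    ]


predicate = build_predicate(year=2025)
partitions = [{"year": "2024"}, {"year": "2025"}]
kept = prune(partitions, year=2025)
print(predicate, kept)
```

Only the surviving partitions are listed and read, so the 2024 data never leaves S3.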

25. What File Formats Are Supported?

Answer

CSV
JSON
XML
Parquet
ORC
Avro
Iceberg
Hudi
Delta Lake

26. Why Convert CSV to Parquet?

Answer

Benefits:

Columnar
Compression
Faster queries
Reduced storage

Often asked in interviews.

27. Explain Glue Workflow

Answer

Workflow orchestrates:

Crawlers
Jobs
Triggers

into one pipeline.

Example:

Crawler
  ↓
ETL Job
  ↓
Validation Job
  ↓
Load Job

28. Types of Triggers

Answer

Scheduled
On-demand
Event-based
Conditional

29. Can Glue Be Triggered by S3 Events?

Answer

Yes.

Using:

S3 Event
→ EventBridge
→ Glue Workflow
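
The EventBridge step in this chain is driven by an event pattern. Below is a sketch of a pattern that matches new objects under a prefix; the bucket name and prefix are placeholders, and the bucket must have EventBridge notifications turned on for the events to arrive.

```python
import json


def s3_object_created_pattern(bucket: str, prefix: str) -> dict:
    """EventBridge pattern for 'Object Created' events under a key prefix."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": prefix}]},
        },
    }


# Hypothetical bucket and prefix.
pattern = s3_object_created_pattern("example-sales-bucket", "incoming/")
pattern_json = json.dumps(pattern, indent=2)
print(pattern_json)
```

The rule that uses this pattern would then target the Glue workflow, which starts through an event-type trigger.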

30. How Does Glue Integrate with Athena?

Answer

Athena directly uses Glue Catalog metadata.

No schema duplication required.

31. How Does Glue Integrate with Redshift?

Answer

Glue:

Extract data
Transform
Load into Redshift

Using JDBC or COPY command.

32. How Does Glue Integrate with Snowflake?

Answer

Using Snowflake Connector.

Typical flow:

S3 → Glue → Snowflake

33. Explain Glue Streaming Jobs

Answer

Process real-time data.

Sources:

Kafka
MSK
Kinesis

Supports near-real-time ETL.

34. Difference Between Batch and Streaming Glue Jobs

Batch	Streaming
Scheduled	Continuous
Finite Data	Infinite Data
Lower Cost	Higher Cost

35. What is AWS Glue Data Quality?

Answer

Built-in data validation framework.

Checks:

Nulls
Duplicates
Completeness
Consistency

36. Example Data Quality Rule

IsComplete "customer_id"

Ensures no null values.
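
To show what such rules check, here is a plain-Python sketch of completeness and uniqueness tests next to the equivalent ruleset text. The helper functions are illustrative stand-ins, not the Glue Data Quality engine, and the sample rows are invented.

```python
def is_complete(rows, column):
    """True when no row has a missing (None) value in the column."""
    return all(row.get(column) is not None for row in rows)


def is_unique(rows, column):
    """True when the non-null values in the column contain no duplicates."""
    values = [row.get(column) for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


# The same checks written as a DQDL ruleset string.
ruleset = 'Rules = [ IsComplete "customer_id", IsUnique "customer_id" ]'

rows = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": None}]
results = {
    "IsComplete": is_complete(rows, "customer_id"),
    "IsUnique": is_unique(rows, "customer_id"),
}
print(results)
```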

37. How Do You Secure AWS Glue?

Answer

Use:

IAM
KMS
VPC
Security Groups
Lake Formation
Encryption

38. How Is Data Encrypted?

At Rest

SSE-S3
SSE-KMS

In Transit

SSL/TLS

39. Explain VPC Integration

Answer

Glue jobs can run inside VPC.

Needed when accessing:

RDS
Private Redshift
On-prem systems

40. How Do You Monitor Glue Jobs?

Answer

Using:

CloudWatch Logs
CloudWatch Metrics
Job Runs Dashboard
EventBridge Alerts

41. Common Glue Failures?

Answer

Memory errors
Schema mismatch
Permission denied
JDBC timeout
Network issues

42. How Do You Troubleshoot Memory Issues?

Answer

Increase workers
Repartition data
Use Parquet
Pushdown predicates

43. Glue vs EMR

Glue	EMR
Serverless	Cluster Managed
ETL Focused	Big Data Platform
Simple	Flexible
Less Control	More Control

Interview Answer

Use Glue for ETL.
Use EMR for complex Spark workloads.

44. Glue vs Lambda

Glue	Lambda
Big Data	Small Data
Spark	Python/Node
GB/TB Scale	MB Scale

45. Glue vs DataBrew

Glue	DataBrew
Developer Focused	Business User Focused
Coding	No-Code

46. Real-Time Scenario

Question

How would you process clickstream data?

Answer

Architecture:

Website
   ↓
Kinesis
   ↓
Glue Streaming
   ↓
S3
   ↓
Athena

47. Scenario: Incremental Daily Loads

Answer

Use:

Job Bookmarks
Partitioned S3
Workflow

48. Scenario: 5 TB Daily Data

Answer

Optimization:

Parquet
Partitioning
G.8X Workers
Predicate Pushdown

49. Scenario: Duplicate Records

Answer

Use Spark:

df.dropDuplicates()

or primary-key validation.

50. Senior Architect Question

Design a Modern AWS Data Lake Using Glue

Answer

Architecture:

Sources
   ↓
S3 Landing
   ↓
Glue Crawler
   ↓
Glue Catalog
   ↓
Glue ETL
   ↓
Curated Zone
   ↓
Athena
   ↓
Redshift
   ↓
QuickSight

Services:

S3
Glue
Athena
Redshift
Lake Formation
CloudWatch
IAM
KMS

Top 20 AWS Glue Interview Questions Asked Most Frequently

What is AWS Glue?
Difference between DynamicFrame and DataFrame?
What is Glue Data Catalog?
What are Crawlers?
What are Job Bookmarks?
What is DPU?
How Glue pricing works?
Glue vs EMR?
Glue vs Lambda?
How do you optimize Glue jobs?
What are Pushdown Predicates?
How does Glue integrate with Athena?
How does Glue integrate with Redshift?
Explain Glue Workflows.
Explain Glue Streaming.
How do you handle schema evolution?
How do you secure Glue?
How do you troubleshoot failed jobs?
What worker type would you choose for a 5 TB dataset?
Design an end-to-end AWS Data Lake using Glue.

For a Senior Data Engineer, Cloud Engineer, or Solutions Architect interview in the U.S. market, you should also be prepared for advanced topics such as Apache Iceberg, Hudi, Delta Lake, Glue 5.0, Spark optimization, Lake Formation integration, CDC pipelines, and multi-account data lake architectures.

AWS Glue is a fully managed, serverless ETL (Extract, Transform, Load) service from AWS that simplifies discovering, preparing, and combining data for analytics, machine learning, and application development. It handles data cataloging, schema inference, job orchestration, and execution using Apache Spark (with Python/PySpark or Scala support).

It is particularly useful for building data lakes, integrating data from various sources (S3, RDS, Redshift, JDBC, etc.), and preparing it for tools like Amazon Athena, Redshift, or SageMaker.

Basic Questions

1. What is AWS Glue? AWS Glue is a serverless data integration service that automates ETL processes. It includes a Data Catalog for metadata, crawlers for schema discovery, and an ETL engine for running Spark-based jobs. It eliminates infrastructure management, auto-scales, and supports pay-as-you-go pricing based on Data Processing Units (DPUs).

2. What are the main components of AWS Glue?

Data Catalog: Centralized metadata repository (databases, tables, schemas) compatible with Hive Metastore.
Crawlers: Scan data sources to infer schemas and populate/update the Data Catalog.
ETL Jobs: Scripts (Python/Scala) for transforming data; run on Apache Spark.
Triggers: Schedule or event-based (e.g., S3 events, job completion) starters for jobs/crawlers.
Development Endpoints: For interactive development/testing of scripts.
Workflows: Orchestrate complex multi-job/crawler pipelines.
Glue Studio: Visual interface for building ETL jobs (no-code/low-code).
Glue DataBrew: For data cleaning/preparation with visual recipes.
Glue Data Quality: Rules-based monitoring and validation.

3. Explain AWS Glue architecture. Data sources → Crawlers populate Data Catalog → Jobs read from Catalog sources, apply transformations (Spark), and write to targets. Triggers/workflows orchestrate execution. Metadata is stored in the Catalog; execution uses serverless Spark clusters (DPUs). Integration with Lake Formation for governance and security.

4. What are AWS Glue Crawlers and how do they work? Crawlers scan data stores (S3, databases, etc.), infer schemas (using classifiers), and create/update tables in the Data Catalog. They support incremental crawls, schema evolution, and S3 event notifications for efficiency. They handle partitioning and can use custom classifiers.

5. What is the AWS Glue Data Catalog? A persistent, managed metadata store (like a Hive metastore) that holds structural/operational metadata about data assets. It enables discovery, querying (e.g., via Athena), and governance across AWS services. Supports versioning, encryption, and Lake Formation permissions.

Intermediate Questions

6. How does AWS Glue handle schema evolution? Crawlers detect changes (new columns, types) and update table metadata. You can configure behavior (e.g., ignore changes, add new columns, or create new versions). Supports partition indexes and grouping policies.

7. What are Triggers in AWS Glue? Triggers start jobs or crawlers. Types:

Scheduled (cron-like).
On-demand.
Conditional (based on job/crawler success/failure).
Event-based (EventBridge events that start a workflow).

They enable chaining for workflows.

8. Explain Glue Jobs. What languages/scripts are supported? Jobs define ETL logic with a script, sources, and targets. Primarily Python (PySpark) or Scala. Glue generates boilerplate scripts; you can edit in Studio, console, or IDEs. Supports bookmarks for incremental processing.

9. What are DPUs in AWS Glue? Data Processing Units measure compute capacity. 1 DPU = 4 vCPUs + 16 GB RAM. You allocate DPUs to jobs; auto-scaling is available. Pricing is per DPU-hour, billed per second with a 1-minute minimum on Glue 2.0 and later (10-minute minimum on older versions).

10. What is Glue Studio? A visual, drag-and-drop interface to build, edit, and monitor ETL jobs without deep coding. Generates Spark code that can be customized.

11. How do you monitor and debug Glue Jobs? Use CloudWatch metrics/logs, job run details in console (logs, errors, timelines), bookmarks, and Glue Data Quality. Enable continuous logging and job insights.

12. What security features does AWS Glue support?

IAM roles/policies for fine-grained access.
Encryption at rest (KMS) and in transit (SSL).
VPC integration, security groups.
Lake Formation for row/column-level permissions.
Audit logging via CloudTrail.

13. How does AWS Glue integrate with other AWS services?

S3: Primary storage.
Athena/Redshift: Query targets.
Lake Formation: Governance.
SageMaker: ML pipelines.
EventBridge/Lambda: Event-driven triggers.
Kinesis/MSK: Streaming integration.

Advanced / Scenario-Based Questions

14. Explain Glue Workflows. Orchestration tool for complex ETL involving multiple jobs, crawlers, and triggers. Visual graph in console; supports dependencies and start triggers.

15. How do you optimize Glue Job performance?

Use partitioning and predicate pushdown.
Dynamic allocation and auto-scaling.
Appropriate worker types (G.1X, G.2X, or larger, depending on memory needs).
Cache/repartition data; avoid small files.
Use Glue bookmarks and incremental crawls.
Optimize memory (skew handling, broadcast joins).
Push transformations early (filtering).

16. Difference between AWS Glue, EMR, and Data Pipeline.

Glue: Serverless ETL-focused, easy cataloging/crawlers, managed Spark. Best for standard ETL.
EMR: Managed Hadoop/Spark clusters; more flexible/customizable for complex big data/ML, but requires more management. Cheaper for sustained heavy workloads.
Data Pipeline: Workflow orchestration (legacy); less ETL-native than Glue.

17. How would you handle incremental ETL in Glue? Use job bookmarks (tracks processed data), S3 event notifications for crawlers, or custom logic with last-modified timestamps/watermarks in scripts. Combine with partitioning.

18. What are common challenges with Glue and how to address them?

Cost for large/spiky workloads → Optimize DPUs, use the Flex execution class for non-urgent jobs, or switch to EMR Serverless.
Schema drift → Configure crawler rules.
Small files → Compaction jobs.
Debugging Spark issues → Use development endpoints/interactive sessions.
Cold starts → Allow for job startup time in schedules and SLAs.
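
For the small-files item, a compaction job mostly comes down to choosing how many output files to write. Here is a sketch of that arithmetic; the 128 MiB target size is an assumption, a common rule of thumb rather than a Glue requirement.

```python
import math


def target_file_count(total_bytes: int, target_file_bytes: int = 128 * 1024**2) -> int:
    """Number of output files needed to hit roughly the target file size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical input: 10,000 files of 50 KiB each.
small_files_total = 10_000 * 50 * 1024
files_after_compaction = target_file_count(small_files_total)
print(files_after_compaction)
```

In a Spark job the result would typically be passed to `repartition()` just before the write.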

19. Explain Glue Data Quality. Feature to define rules (e.g., completeness, uniqueness) and monitor data in pipelines or lakes. Auto-suggests rules; integrates with workflows for alerts/blocking.

20. How do you handle large-scale data or joins in Glue? Scale DPUs, use broadcast joins for small tables, skew mitigation, repartitioning, and Glue’s optimized connectors. For very large/complex, consider EMR.

🧱 Module 1: Core Concepts & Architecture

1. What is AWS Glue, and why is it considered “serverless”?

Answer:
AWS Glue is a fully managed, serverless ETL (Extract, Transform, Load) service.
It is considered “serverless” because you do not provision or manage clusters (like EC2 or EMR). AWS Glue automatically provisions, scales, and terminates the resources (Spark or Python environments) needed to run your jobs. You only pay for the compute time consumed during execution.

2. What are the primary components of AWS Glue?

Answer:
The architecture is built on several key pillars:

Glue Data Catalog: A central metadata repository (Hive metastore compatible) storing table definitions, schemas, and locations (e.g., S3 paths).
Crawlers: Automated processes that scan data sources (S3, RDS) to infer schemas and populate the Data Catalog.
ETL Jobs: The logic for transforming data. Runs on Apache Spark (for heavy lifting) or Python Shell (for lightweight scripts).
Workflows & Triggers: Orchestration tools to chain multiple crawlers and jobs together based on time or event dependencies.
Glue Studio: A visual interface to design, debug, and monitor ETL pipelines without heavy coding.

3. Can you explain the difference between a Spark DataFrame and a Glue DynamicFrame?

Answer:
This is a critical distinction for technical interviews.

DataFrame (Spark): Lazily evaluated and highly optimized, but it works against one fixed schema. Inconsistent records (e.g., a column that is a string in some files and a number in others) have to be reconciled before or during the read.
DynamicFrame (Glue): A Glue-specific abstraction, similar to a DataFrame, in which every record is self-describing, so no schema is required up front. Conflicting types are recorded as choice types that you resolve explicitly with resolveChoice. It is well suited to semi-structured data (JSON) or messy data sources.

Code Snippet (Interview Example):

python

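# Using DataFrame: a single schema is fixed for the whole dataset
df = spark.read.csv("s3://path/")
df.printSchema()  # Prints the schema Spark settled on; it does not validate the data

# Using DynamicFrame (flexible)
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table"
)
# Conflicting types show up as "choice" types; resolve them explicitly, e.g.
# dynamic_frame.resolveChoice(specs=[("amount", "cast:double")])  # "amount" is a hypothetical column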

🕷️ Module 2: Crawlers & Data Catalog

4. How does a Crawler handle Schema Evolution?

Answer:
A Crawler scans data stores and updates the Data Catalog. You can configure its behavior for schema updates:

Update the table definition: Adds new columns found in new data files.
Ignore changes: Leaves the catalog as-is (risky for query engines).
Mark as deprecated: When the underlying data is deleted, the crawler marks the table as deprecated in the catalog rather than removing it.
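
These choices correspond to the crawler's `SchemaChangePolicy` setting. The sketch below builds and validates that structure in plain Python; the helper name is invented, and the resulting dictionary is what would be passed as `SchemaChangePolicy` when creating or updating a crawler.

```python
UPDATE_BEHAVIORS = {"UPDATE_IN_DATABASE", "LOG"}
DELETE_BEHAVIORS = {"LOG", "DELETE_FROM_DATABASE", "DEPRECATE_IN_DATABASE"}


def schema_change_policy(update_behavior: str, delete_behavior: str) -> dict:
    """Build a SchemaChangePolicy dict, rejecting unknown values early."""
    if update_behavior not in UPDATE_BEHAVIORS:
        raise ValueError(f"unknown update behavior: {update_behavior}")
    if delete_behavior not in DELETE_BEHAVIORS:
        raise ValueError(f"unknown delete behavior: {delete_behavior}")
    return {"UpdateBehavior": update_behavior, "DeleteBehavior": delete_behavior}


# Update table definitions on change; deprecate tables whose data disappeared.
policy = schema_change_policy("UPDATE_IN_DATABASE", "DEPRECATE_IN_DATABASE")
print(policy)
```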

5. How does the Glue Data Catalog integrate with other services?

Answer:
The Data Catalog acts as the “single source of truth” for metadata across AWS.

Amazon Athena: Allows SQL queries directly on S3 data using Glue tables.
Amazon Redshift Spectrum: Enables Redshift to query the Data Lake without loading data.
Amazon EMR: Can use Glue as a Hive Metastore instead of hosting its own MySQL/PostgreSQL database.

⚙️ Module 3: ETL Jobs, Development & Optimization

6. You are processing millions of log files. How do you optimize performance?

Answer:
Optimization in Glue focuses on minimizing I/O and shuffle operations:

Partition Pruning: Use partitioned data (e.g., year=2024/...) so Glue only reads necessary folders.
Use File Formats: Prefer Parquet (columnar) over CSV/JSON to reduce scan time.
Increase DPUs (Data Processing Units): 1 DPU = 4 vCPU + 16GB RAM. For large datasets, use G.2X or G.4X workers for memory-intensive aggregations.
Job Bookmarks: Enable incremental processing to avoid reprocessing old data.
Broadcast Hash Joins: If joining a large table with a small lookup table, wrap the small side in broadcast() (from pyspark.sql.functions) to avoid shuffling the large table across the network.
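
The broadcast join idea is easy to see outside Spark. This plain-Python sketch copies the small table into a lookup dictionary (the "broadcast" side) and streams the large table past it, so the large side never needs to be reshuffled. The table contents are invented; in PySpark the equivalent would be a join against `broadcast(small_df)`.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join: hash the small side once, then stream the large side."""
    lookup = {row[key]: row for row in small_rows}  # small table, held in memory
    joined = []
    for row in large_rows:  # large table, scanned once in place
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


orders = [
    {"country": "US", "amount": 10},
    {"country": "DE", "amount": 7},
    {"country": "FR", "amount": 3},  # no match in the lookup table
]
countries = [
    {"country": "US", "region": "AMER"},
    {"country": "DE", "region": "EMEA"},
]
joined = broadcast_hash_join(orders, countries, "country")
print(joined)
```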

7. What is a Glue Job Bookmark? How does it handle incremental loads?

Answer:
A Job Bookmark keeps track of previously processed data.

Function: It stores the last processed timestamp or file name in a persistent state store.
Incremental Load: When the job runs again, Glue checks the bookmark, reads only the new data (e.g., files added since last run), and skips the old data.
Use Case: Essential for processing log files in S3, or for pulling only new rows from JDBC sources, where bookmark keys (ideally monotonically increasing columns) identify the new records.

8. How do you orchestrate a complex pipeline (e.g., Crawl -> Transform -> Load -> Archive)?

Answer:
Using Glue Workflows and Triggers:

Trigger 1: On schedule (e.g., 2 AM).
Node A: Crawler_Logs (populates Catalog).
Trigger 2: Conditional (starts on Crawler success).
Node B: ETL_Job_Process (transforms data).
Trigger 3: Conditional (starts on job success).
Node C: ETL_Job_Archive (moves raw files).
This eliminates the need for a separate orchestrator like Step Functions for simple linear dependencies.
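
The trigger chain above can be written down as request dictionaries. This is a sketch assuming a workflow named `logs_pipeline` (a made-up name); each dictionary has the shape expected by the Glue `create_trigger` API call, but nothing here contacts AWS.

```python
def scheduled_start(workflow, crawler, cron):
    """Trigger 1: start the crawler on a schedule."""
    return {
        "Name": f"{workflow}-scheduled-start",
        "WorkflowName": workflow,
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"CrawlerName": crawler}],
    }


def run_job_after(workflow, job, crawler=None, upstream_job=None):
    """Conditional trigger: start `job` once the upstream crawler or job succeeds."""
    if (crawler is None) == (upstream_job is None):
        raise ValueError("pass exactly one of crawler or upstream_job")
    if crawler is not None:
        condition = {"LogicalOperator": "EQUALS", "CrawlerName": crawler,
                     "CrawlState": "SUCCEEDED"}
    else:
        condition = {"LogicalOperator": "EQUALS", "JobName": upstream_job,
                     "State": "SUCCEEDED"}
    return {
        "Name": f"{workflow}-start-{job}",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {"Conditions": [condition]},
        "Actions": [{"JobName": job}],
    }


triggers = [
    scheduled_start("logs_pipeline", "Crawler_Logs", "cron(0 2 * * ? *)"),  # 2 AM UTC daily
    run_job_after("logs_pipeline", "ETL_Job_Process", crawler="Crawler_Logs"),
    run_job_after("logs_pipeline", "ETL_Job_Archive", upstream_job="ETL_Job_Process"),
]
print([t["Name"] for t in triggers])
```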

🔒 Module 4: Security, Governance & Advanced Integration

9. How do you connect Glue to a private RDS/Aurora database?

Answer:
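Connecting to private databases involves networking configuration:

VPC Configuration: The Glue connection names a VPC subnet and security groups, and Glue attaches network interfaces in that subnet to reach the database. The subnet needs a route to the RDS instance, the RDS security group must allow inbound traffic from the Glue security group, and the Glue security group needs a self-referencing inbound rule.
Glue Connection: Create a Connection object of type JDBC with the RDS endpoint, port, and database name.
Secrets Manager (Best Practice): Do not hardcode passwords. Store credentials in AWS Secrets Manager and grant the Glue IAM role permission to GetSecretValue.
IAM: The role needs the EC2 networking permissions included in the AWSGlueServiceRole managed policy (for example ec2:CreateNetworkInterface, ec2:DescribeSubnets, and ec2:DescribeSecurityGroups).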
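
As an illustration of the Secrets Manager step, the sketch below turns a secret's JSON payload into a JDBC URL. It assumes the key layout that Secrets Manager uses for RDS secrets (`engine`, `host`, `port`, `dbname`); the host and credentials are placeholders. In a real job the string would come from the `SecretString` field returned by `get_secret_value`.

```python
import json

# Secrets Manager RDS secrets name the engine "postgres"; JDBC URLs use "postgresql".
JDBC_SCHEMES = {"postgres": "postgresql", "mysql": "mysql"}


def jdbc_url_from_secret(secret_string: str) -> str:
    """Build a JDBC URL from a secret payload, leaving credentials out of the URL."""
    secret = json.loads(secret_string)
    scheme = JDBC_SCHEMES[secret["engine"]]
    return f"jdbc:{scheme}://{secret['host']}:{secret['port']}/{secret['dbname']}"


# Placeholder secret payload; never hardcode real credentials like this.
example_secret = json.dumps({
    "engine": "postgres",
    "host": "db.example.internal",
    "port": 5432,
    "dbname": "sales",
    "username": "etl_user",
    "password": "not-a-real-password",
})
url = jdbc_url_from_secret(example_secret)
print(url)
```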

10. Explain the difference between AWS Glue and Amazon EMR.

Answer:

Feature	AWS Glue	Amazon EMR
Management	Serverless (AWS manages clusters)	You manage clusters (EC2 instances)
Cost	Pay per DPU-hour, billed per second	Pay for EC2 instances plus an EMR charge (Spot pricing available)
Start Time	Seconds (cold starts exist)	Minutes (clusters take time to spin up)
Use Case	Ad-hoc ETL, Data Catalog, small/medium workloads	Large-scale big data processing, ML training, long-running clusters
Control	Limited (managed Spark)	Full control over Hadoop/Spark configurations

Interview Tip: Use Glue for serverless, event-driven jobs. Use EMR for massive, persistent clusters or specific framework versions.

💻 Module 5: Coding & Scenario Questions

11. Scenario: Handle a Glue job that suddenly fails due to “Out of Memory” (OOM).

Answer:
Troubleshooting Steps:

Check CloudWatch Logs: Look for java.lang.OutOfMemoryError or specific stage failures.
Diagnose: Usually caused by improper partitioning (e.g., a single partition holding 100GB of data).
Fix:
- Code Fix: Use repartition(numPartitions) before wide operations or writes to spread the data evenly. coalesce(numPartitions) only merges partitions without a shuffle, so it does not fix skew.
- Configuration Fix: Change worker type to G.2X (more memory per worker) or increase the --number-of-workers.
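
Skew can be quantified before choosing a fix. The sketch below measures it as the largest partition divided by the mean, then shows key salting, a technique that complements repartitioning when one hot key dominates. The key names and counts are invented, and salting is an addition here rather than something the steps above prescribe.

```python
from collections import Counter


def skew_ratio(partition_sizes):
    """Largest partition divided by the mean; about 1.0 means evenly spread."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean


def salt(key, row_number, buckets):
    """Append a rotating suffix so one hot key spreads over several groups."""
    return f"{key}#{row_number % buckets}"


# 96 rows share one hot key; four other keys have one row each.
keys = ["hot"] * 96 + ["a", "b", "c", "d"]
before = Counter(keys)
after = Counter(salt(k, i, 4) if k == "hot" else k for i, k in enumerate(keys))

ratio_before = skew_ratio(list(before.values()))
ratio_after = skew_ratio(list(after.values()))
print(ratio_before, ratio_after)
```

After aggregating on the salted key, a second aggregation on the original key combines the partial results.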

12. Write a Python/PySpark script to handle incremental loading from S3 to Redshift.

Answer:

python

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Get job arguments (including bookmark)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# 1. Read from Catalog. With bookmarks enabled on the job, transformation_ctx
#    lets Glue track what this source has already read (automatic incremental)
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="s3_logs",
    transformation_ctx="datasource0"
)

# 2. Apply mappings or transformations (e.g., casting types)
apply_mapping = ApplyMapping.apply(frame=dynamic_frame, mappings=[...])

# 3. Write to Redshift (appends by default; add "preactions" SQL to connection_options to truncate first)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=apply_mapping,
    catalog_connection="redshift_connection",
    connection_options={
        "dbtable": "public.target_table",
        "database": "dev"
    },
    redshift_tmp_dir="s3://temp-bucket/redshift_staging"
)

job.commit()
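
The script only loads incrementally when the job run has bookmarks enabled. Here is a sketch of the run request that does so; the job name, worker type, and worker count are placeholders, and the dictionary matches the shape of the Glue `start_job_run` API call without contacting AWS.

```python
def start_job_run_request(job_name, worker_type="G.1X", workers=10):
    """Build a start_job_run request with job bookmarks switched on."""
    return {
        "JobName": job_name,
        "Arguments": {"--job-bookmark-option": "job-bookmark-enable"},
        "WorkerType": worker_type,
        "NumberOfWorkers": workers,
    }


# Hypothetical job name.
request = start_job_run_request("s3_to_redshift_incremental")
print(request)
```

Bookmark state is only saved when the script reaches `job.commit()`, so a failed run is picked up again on the next attempt.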

13. How do you implement real-time data pipelines with Glue?

Answer:
Using AWS Glue streaming ETL jobs.

Source: Amazon Kinesis Data Streams or MSK (Managed Kafka).
Processing: Glue runs a continuous Spark Structured Streaming job (serverless).
Sink: S3 (Delta Lake format) or JDBC.
Key Feature: It handles windowed aggregations (e.g., “Count clicks every 5 minutes”) without managing EC2 instances.
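
A 5-minute click count is a tumbling window aggregation. The plain-Python sketch below shows the bucketing logic on invented event times in seconds; in a streaming job the same grouping would be expressed with Spark's `window` function.

```python
from collections import Counter


def tumbling_window_counts(event_times, window_seconds=300):
    """Count events per fixed, non-overlapping window, keyed by window start."""
    counts = Counter((t // window_seconds) * window_seconds for t in event_times)
    return dict(sorted(counts.items()))


# Hypothetical click timestamps, in seconds since the stream started.
clicks = [0, 10, 299, 300, 450, 905]
counts = tumbling_window_counts(clicks)
print(counts)
```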

📈 Bonus: “What are your best practices for production?”

Interviewers often end with this question to gauge real-world experience.

Data Partitioning: Write data to S3 partitioned by date (e.g., partitionKeys of year, month, day in the sink options) so query engines can prune partitions.
Idempotency: Design jobs so that running them twice produces the same result (e.g., using overwrite mode carefully or using S3 versioning).
Monitoring: Set up CloudWatch Alarms for glue.driver.aggregate.elapsedTime and job failures. Enable Continuous Logging for real-time log streaming to CloudWatch.
Cost Control: Use Python Shell Jobs (1/16 DPU) for light data validation or API calls. Only use Spark Jobs (min 2 DPUs) for heavy transformations.
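
Partitioning and idempotency work together when the output location is derived from the run date. A minimal sketch, assuming a hypothetical curated bucket path:

```python
from datetime import date


def partition_prefix(base: str, day: date) -> str:
    """Map a run date to one Hive-style partition prefix."""
    return f"{base.rstrip('/')}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


# Hypothetical base path.
prefix = partition_prefix("s3://example-curated/sales", date(2025, 1, 15))
print(prefix)
```

Because the same run date always maps to the same prefix, a rerun that overwrites that prefix replaces the earlier output instead of duplicating it.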