Machine Learning Basics, Deep Learning & Neural Networks, Transformers & LLMs, Generative Models Deep Dive

1. Machine Learning Basics

Core concepts that underpin all other topics.

Types of Learning:
- Supervised: Learn mapping from inputs to labeled outputs (e.g., classification, regression).
- Unsupervised: Find hidden structure in unlabeled data (e.g., clustering, dimensionality reduction).
- Reinforcement Learning: Learn actions via rewards/punishments.
Key Ideas:
- Training / validation / test sets.
- Bias-variance tradeoff (underfitting ↔ overfitting).
- Loss functions (MSE, cross-entropy).
- Optimization: Gradient Descent (SGD, Adam).
- Regularization: L1/L2, dropout, early stopping.
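
As a rough illustration of how the loss, optimizer, and regularization ideas above fit together, here is a minimal sketch of full-batch gradient descent on MSE for a one-feature linear model. The function name `fit_linear`, the toy data, and the hyperparameter values are assumptions chosen for the example, not part of these notes.

```python
def fit_linear(xs, ys, lr=0.05, l2=0.0, epochs=2000):
    """Fit y ~ w*x + b by gradient descent on MSE + l2 * w**2 (hypothetical helper)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of (1/n) * sum((w*x + b - y)^2) + l2 * w^2
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n + 2 * l2 * w
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]             # toy data lying exactly on y = 2x + 1

w, b = fit_linear(xs, ys)                  # no regularization: recovers w ~ 2, b ~ 1
w_reg, b_reg = fit_linear(xs, ys, l2=1.0)  # the L2 penalty shrinks the weight
```

With the penalty switched on, the fitted weight is pulled toward zero, which is the bias-for-variance trade that L2 regularization makes.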

2. Deep Learning & Neural Networks

Scaling up ML with multi-layer, differentiable functions.

Basic Unit: Neuron → weighted sum + non-linear activation (ReLU, sigmoid, tanh).
Key Architectures:
- FNN (Feedforward): Fully connected layers.
- CNN: Spatial hierarchies via convolution + pooling (image, video).
- RNN/LSTM: Sequential data (text, time series) — now largely replaced by Transformers.
Training Deep Nets:
- Backpropagation + automatic differentiation.
- Vanishing/exploding gradients → skip connections, batch norm, residual nets (ResNet).
Practical tricks:
- Weight initialization (Xavier/He).
- Learning rate scheduling.
- Data augmentation.
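
Two of the tricks above can be sketched in a few lines. The helpers below are hypothetical names, and the warmup-plus-cosine schedule is just one common choice of learning rate schedule, used here as an assumption.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]

def he_normal(fan_in, fan_out, rng):
    """He init: N(0, std^2) with std = sqrt(2 / fan_in), usually paired with ReLU."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

rng = random.Random(0)
W_xavier = xavier_uniform(fan_in=4, fan_out=3, rng=rng)   # 3 rows of 4 weights
W_he = he_normal(fan_in=4, fan_out=3, rng=rng)
schedule = [cosine_lr(s, total_steps=100, base_lr=0.1, warmup_steps=10) for s in range(100)]
```

Both initializers scale the weights by layer width so that activations keep roughly the same variance from layer to layer.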

3. Transformers & LLMs

Current dominant paradigm for sequence modeling, powering large language models.

Self-Attention (core innovation):
- Query, Key, Value matrices.
- Scaled dot-product attention: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.
- Allows each token to attend directly to all others → eases the long-range dependency limits of RNNs, at a cost that grows quadratically with sequence length.
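
The attention formula above can be sketched in plain Python for small inputs. This is a single-head, unmasked version; the helper names and toy matrices are assumptions made for the example, and real implementations use batched tensor operations instead of lists.

```python
import math

def softmax(row):
    m = max(row)                                   # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]                       # 2 queries
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]           # 3 keys
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]           # 3 values
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors.
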
Transformer Block:
- Multi-head attention (parallel attention heads).
- Feedforward network (per token).
- Residual connections + layer norm.
- Positional encodings (sinusoidal or learned).
LLM Evolution:
- Encoder-only (BERT) → understanding tasks.
- Decoder-only (GPT) → generation, few-shot learning.
- Encoder-decoder (T5, BART) → translation, summarization.
Scaling Laws: Loss falls predictably (roughly as a power law) as data, parameters, and compute grow.
Modern LLM concepts:
- Pretraining (next-token prediction on web-scale text) + fine-tuning (SFT, RLHF).
- In-context learning, Chain-of-Thought.
- Efficient variants: LLaMA, Mistral; MoE (Mixture of Experts) models such as Mixtral (GPT-4 is widely reported, though not officially confirmed, to use MoE).

4. Generative Models Deep Dive

Models that learn the data distribution $p(x)$ to generate new samples.

Taxonomy & Key Methods:

Type	Idea	Example	Pros / Cons
VAEs (Variational Autoencoders)	Encode to latent $z$, decode; maximize ELBO	Stable Diffusion's latent space	Good latent representation, blurry outputs
GANs	Generator vs. discriminator min-max game	StyleGAN, BigGAN	Sharp outputs, but unstable training, mode collapse
Autoregressive models	Predict next token/pixel sequentially	PixelCNN, GPT	Exact likelihood, but slow sampling
Normalizing Flows	Invertible transforms for exact density	Glow, RealNVP	Exact log-likelihood, invertible, but constrained architecture
Diffusion models	Gradually add noise, then reverse	DDPM, Stable Diffusion	State-of-the-art quality, stable training, slower sampling
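
Two of the objectives named in the table can be written out explicitly. The parameter symbols $\theta$, $\phi$, the approximate posterior $q_\phi$, and the data distribution $p_{\text{data}}$ are notation assumed here; they do not appear elsewhere in these notes.

```latex
% VAE: evidence lower bound (ELBO), maximized over \theta and \phi
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)

% GAN: min-max game between generator G and discriminator D
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The first ELBO term rewards accurate reconstruction, while the KL term keeps the encoder's latent distribution close to the prior $p(z)$.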

Current State of the Art:

Text-to-image: Diffusion models (Stable Diffusion, DALL-E 3, Midjourney, Flux).
Text-to-video: Sora (diffusion transformer), Runway Gen-2.
Text generation: Autoregressive LLMs (GPT-4, Claude, Gemini).
Multimodal generation: Combined diffusion + LLM (e.g., any-to-any models).

Key Insight (2025+): Diffusion + Transformer backbones → dominant for high-quality continuous generation (image, video, audio). LLMs + discrete tokens for text/code. Hybrid models emerging.

Summary Comparison Table

Topic	Key Task	Main Tool	Output Example
ML Basics	Predict from features	Linear/logistic regression, trees	House price
Deep Learning	Hierarchical features	CNN, ResNet, LSTM	Classify image
Transformers	Sequence modeling	Attention, GPT, BERT	Write essay
Generative Models	Create new data	Diffusion, GAN, VAE	Draw a cat

2. Machine Learning Basics

Machine Learning (ML) is a subset of AI where systems learn patterns from data instead of being explicitly programmed.

Core Types of Machine Learning

Type	Description	Supervision	Examples
Supervised Learning	Learns from labeled data (input → output pairs)	Full	Spam detection, image classification, house price prediction
Unsupervised Learning	Finds hidden patterns in unlabeled data	None	Customer segmentation, anomaly detection, dimensionality reduction
Semi-Supervised	Uses mostly unlabeled + some labeled data	Partial	Large-scale image labeling
Reinforcement Learning	Learns via trial & error with rewards/penalties	Reward-based	Game playing (AlphaGo), robotics, recommendation

Key ML Algorithms (Classical)

Regression: Linear Regression, Polynomial Regression
Classification: Logistic Regression, Decision Trees, Random Forest, SVM, Naive Bayes, k-NN
Clustering: K-Means, Hierarchical, DBSCAN
Dimensionality Reduction: PCA, t-SNE, UMAP
Ensemble Methods: Bagging, Boosting (XGBoost, LightGBM, CatBoost)

Fundamental Concepts

Bias-Variance Tradeoff: High bias = underfitting, High variance = overfitting.
Overfitting vs Underfitting:
- Overfitting: Model memorizes training data, poor on new data.
- Underfitting: Model too simple to capture patterns.
Evaluation Metrics:
- Regression: MSE, RMSE, MAE, R²
- Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC, Confusion Matrix
Cross-Validation: k-fold CV gives a more reliable performance estimate than a single train/validation split.
Feature Engineering: Creating better input features (still very important).
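
The classification metrics listed above all derive from the four confusion-matrix counts. A minimal sketch for the binary case follows; the function name and toy labels are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts (binary case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0     # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0        # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

On imbalanced data, accuracy can look high while recall is poor, which is why the list above includes precision, recall, and F1 alongside it.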

Training Process:

Split data → Train / Validation / Test sets
Choose model + hyperparameters
Train on training set
Tune on validation set
Evaluate on unseen test set
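
The data-splitting step of the process above might look like the following sketch. The helper names, the 70/15/15 proportions, and the fixed seed are assumptions chosen for the example; `k_fold_indices` shows the index bookkeeping behind k-fold cross-validation.

```python
import random

def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; every index is held out exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

train, val, test = train_val_test_split(range(100))
folds = list(k_fold_indices(10, 3))
```

The test set should be used only once, after hyperparameters have been chosen on the validation set or by cross-validation.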

3. Deep Learning & Neural Networks

Deep Learning is Machine Learning using artificial neural networks with many layers (hence “deep”).

Biological Inspiration

Biological neuron → Artificial Neuron (Perceptron)

Core Components of a Neural Network

Neurons (Nodes)
Layers:
- Input Layer
- Hidden Layers (this is where depth comes from)
- Output Layer
Weights & Biases (learnable parameters)
Activation Functions (introduce non-linearity). Common ones:
- ReLU (Rectified Linear Unit): f(x) = max(0, x) — most popular
- Sigmoid: 1 / (1 + e^(-x))
- Tanh
- Leaky ReLU, GELU, Swish (modern)
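
For reference, the activations listed above can be written as scalar functions. This is a sketch: the Leaky ReLU slope of 0.01 is an assumed default, GELU is shown in its exact erf form rather than the tanh approximation, and Swish is shown with its scaling parameter fixed to 1.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def gelu(x):
    # Exact form: x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    return x * sigmoid(x)

values = {f.__name__: [f(x) for x in (-2.0, 0.0, 2.0)]
          for f in (relu, leaky_relu, sigmoid, tanh, gelu, swish)}
```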

Forward Propagation

Input → Weighted sum → Activation → Next layer → Final output
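
The forward pass described above can be sketched for a tiny two-layer network. The weights, layer sizes, and function names below are made up for illustration.

```python
def dense(x, W, b):
    """One fully connected layer: z_j = sum_i W[j][i] * x[i] + b[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def relu(v):
    return [max(0.0, z) for z in v]

def forward(x, layers):
    """layers is a list of (W, b) pairs; ReLU follows every layer except the last."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = dense(h, W, b)                  # weighted sum
        if i < len(layers) - 1:
            h = relu(h)                     # activation, then on to the next layer
    return h

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),   # hidden layer: 2 inputs -> 2 units
    ([[2.0, 1.0]], [0.1]),                      # output layer: 2 units -> 1 output
]
y = forward([3.0, 1.0], layers)   # hidden pre-activations [2.0, 1.0], output ~ 5.1
```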

Backpropagation (The Learning Algorithm)

Calculate error (Loss function)
Compute gradients using Chain Rule
Update weights using Gradient Descent (or variants: SGD, Adam, RMSprop)
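
The three steps above can be traced by hand for a single sigmoid neuron with squared-error loss. This is a minimal sketch: the starting values and the learning rate of 0.5 are assumptions, and the finite-difference check is included only to confirm the chain-rule gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grads(w, b, x, y):
    # Forward pass
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y) ** 2
    # Backward pass via the chain rule: dL/dw = dL/da * da/dz * dz/dw
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)
    dL_dz = dL_da * da_dz
    return loss, dL_dz * x, dL_dz           # loss, dL/dw, dL/db

w, b, x, y = 0.5, -0.2, 1.5, 1.0
loss, gw, gb = loss_and_grads(w, b, x, y)

# Finite-difference estimate of dL/dw, to check the analytic gradient
eps = 1e-6
num_gw = (loss_and_grads(w + eps, b, x, y)[0] - loss_and_grads(w - eps, b, x, y)[0]) / (2 * eps)

# Plain gradient descent updates
for _ in range(200):
    _, gw_i, gb_i = loss_and_grads(w, b, x, y)
    w -= 0.5 * gw_i
    b -= 0.5 * gb_i
final_loss = loss_and_grads(w, b, x, y)[0]
```

In a multi-layer network the same chain-rule step is repeated layer by layer, from the output back to the input.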

Loss Functions:

Mean Squared Error (regression)
Cross-Entropy (classification)
Binary Cross-Entropy
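
A sketch of the three losses above in plain Python. The cross-entropy here takes raw logits and a target class index, which is one common convention assumed for the example; the clipping constant `eps` is likewise an assumption.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(logits, target_index):
    """-log softmax(logits)[target_index], computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]

def binary_cross_entropy(y_true, p, eps=1e-12):
    """y_true is 0 or 1; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)         # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

reg_loss = mse([1.0, 2.0], [1.0, 4.0])
confident_right = cross_entropy([10.0, 0.0, 0.0], 0)   # small loss
confident_wrong = cross_entropy([10.0, 0.0, 0.0], 1)   # large loss
coin_flip = binary_cross_entropy(1, 0.5)               # equals log(2)
```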

Popular Architectures

Feedforward Neural Networks (MLP) — Basic
Convolutional Neural Networks (CNNs) — Best for images & spatial data
Recurrent Neural Networks (RNNs) — For sequences (old)
LSTMs / GRUs — Improved RNNs (better at long dependencies)
Transformers — Current dominant architecture (next section)

Why Deep Learning Works Well:

Automatic feature learning (greatly reduces the need for manual feature engineering)
Hierarchical representations (edges → shapes → objects)

Challenges:

Typically requires large amounts of data
Computationally expensive (needs GPUs/TPUs)
Black-box nature (hard to interpret)

4. Transformers & Large Language Models (LLMs)

The Transformer (introduced in the 2017 paper “Attention Is All You Need”) is the dominant architecture in modern AI.

Key Innovation: Self-Attention Mechanism

Instead of processing sequentially (like RNNs), Transformers process entire sequences in parallel using attention.

Formula (Simplified):

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Q = Query, K = Key, V = Value
Scaled dot-product attention
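
One way to see why the scores are divided by $\sqrt{d_k}$ is the short variance argument below. It assumes, purely for illustration, that the components of a query $q$ and a key $k$ are independent with mean 0 and variance 1.

```latex
% Assumption: q_i and k_i are independent, with mean 0 and variance 1.
q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad
\mathbb{E}[q \cdot k] = 0, \qquad
\operatorname{Var}(q \cdot k) = d_k

% Dividing by \sqrt{d_k} restores unit variance:
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
```

Without the scaling, scores grow with $d_k$ and push the softmax into regions where its gradients are very small.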

Transformer Architecture

Encoder (for understanding) — Used in BERT
Decoder (for generation) — Used in GPT models
Encoder-Decoder — Used in translation (T5, BART)

Components:

Multi-Head Self-Attention
Feed-Forward Networks
Layer Normalization + Residual Connections
Positional Encoding (since attention has no sense of order)
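
As one example of a positional encoding, here is a sketch of the sinusoidal scheme from the original Transformer paper. The function name is an assumption, and learned position embeddings are a common alternative.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)); PE[pos][2i+1] = cos of the same angle."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):      # i runs over the even dimensions (i = 2i in the formula)
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positions(seq_len=4, d_model=6)
# Each row is added to the token embedding at that position before the first block.
```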

Large Language Models (LLMs)

Most modern LLMs are decoder-only Transformers trained on massive text corpora.

Training Objective: Next Token Prediction (Causal Language Modeling)

Given “The sky is”, predict the next token, e.g. “blue”.
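
To make the next-token objective concrete, here is a deliberately tiny stand-in: a bigram count model. It conditions on only the single previous token, whereas an LLM conditions on the whole context with a Transformer. The corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Estimated P(next | prev) from the counts."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def generate(counts, start, max_new_tokens):
    """Greedy autoregressive decoding: append the most likely next token each step."""
    out = [start]
    for _ in range(max_new_tokens):
        options = counts.get(out[-1])
        if not options:
            break
        out.append(options.most_common(1)[0][0])
    return out

corpus = "the sky is blue . the sea is blue . the grass is green .".split()
counts = train_bigram(corpus)
probs = next_token_probs(counts, "is")     # "blue" gets 2/3, "green" gets 1/3
continuation = generate(counts, "sky", 2)  # ["sky", "is", "blue"]
```

Generation feeds each predicted token back in as input, which is the same loop an LLM runs at inference time.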

Key Scaling:

More parameters → Better performance (some abilities appear to emerge only at larger scales)
More data
More compute

Major LLM Families (as of 2026):

Family	Examples	Strengths
OpenAI	GPT-4o, o1, o3	Reasoning, multimodality
Anthropic	Claude 3.5/4	Safety, long context
Meta	Llama 3.1/4	Open weights
xAI	Grok 2 / Grok 3	Real-time knowledge, truth-seeking
Google	Gemini 2	Multimodal
Mistral	Mistral Large, Mixtral	Efficient

Techniques Used in LLMs:

Pre-training (self-supervised learning on massive unlabeled text)
Instruction Tuning / Supervised Fine-Tuning (SFT)
RLHF (Reinforcement Learning from Human Feedback)
Chain-of-Thought (CoT) prompting
Test-time Compute (o1-style reasoning)

5. Generative Models Deep Dive

Generative Models learn the underlying probability distribution of data to create new samples.

Major Types

Generative Adversarial Networks (GANs)
- Generator vs Discriminator game
- Excellent image quality but mode collapse issues
- Variants: StyleGAN, CycleGAN, BigGAN
Variational Autoencoders (VAEs)
- Encoder compresses data into latent distribution
- Decoder generates from latent space
- Good for smooth interpolation
Flow-based Models (Normalizing Flows)
- Reversible transformations
- Exact likelihood computation
Diffusion Models (Current King for Images)
- Forward process: Gradually add Gaussian noise
- Reverse process: Learn to denoise step-by-step
- Models: Stable Diffusion, DALL·E 3, Midjourney v6, Flux, Imagen 3
Autoregressive Models
- GPT-style: Generate one token at a time
- Excellent for text, also used in image (PixelRNN, Parti)
Multimodal Generative Models
- Generate across modalities (text → image, image → video, audio, etc.)
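
The forward (noising) process of a diffusion model has a closed form in the DDPM formulation, which can be sketched as below. The linear schedule endpoints (1e-4 to 0.02) and the 1000 steps are assumed values in the style of DDPM, and the function names are hypothetical.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T, linearly spaced (T >= 2)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in a single step."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    scale_x, scale_n = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    x_t = [scale_x * x + scale_n * n for x, n in zip(x0, noise)]
    return x_t, noise

betas = linear_beta_schedule(1000)
abar = alpha_bars(betas)
x_t, noise = q_sample([1.0, -1.0], t=999, abar=abar, rng=random.Random(0))
# By the last step almost no signal is left: x_t is close to pure Gaussian noise.
```

In DDPM-style training the network is typically asked to predict `noise` from `x_t` and `t`, and the reverse process then uses that prediction to denoise step by step.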

Comparison of Generative Models

Model Type	Image Quality	Training Stability	Speed	Best For
GANs	Excellent	Poor	Fast	High-fidelity images
VAEs	Good	Good	Fast	Latent space control
Diffusion	Best	Good	Slow	Creative generation
Autoregressive (Transformers)	Very Good	Good	Medium	Text & multimodal

Current Trends (2026):

Consistency Models / Flow Matching — Faster diffusion
Mixture of Experts (MoE) — Efficient scaling
World Models & Video generation (Sora-like models)
Agentic Generation — Models that plan before generating