1. Machine Learning Basics
Core concepts that underpin all other topics.
- Types of Learning:
- Supervised: Learn mapping from inputs to labeled outputs (e.g., classification, regression).
- Unsupervised: Find hidden structure in unlabeled data (e.g., clustering, dimensionality reduction).
- Reinforcement Learning: Learn actions via rewards/punishments.
- Key Ideas:
- Training / validation / test sets.
- Bias-variance tradeoff (underfitting ↔ overfitting).
- Loss functions (MSE, cross-entropy).
- Optimization: Gradient Descent (SGD, Adam).
- Regularization: L1/L2, dropout, early stopping.
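As a rough illustration of how a loss function, gradient descent, and L2 regularization fit together, the sketch below fits a one-variable linear model by batch gradient descent on MSE. The function name `fit_linear_gd`, the toy data, and the learning rate and step count are illustrative assumptions, not values taken from these notes.

```python
def fit_linear_gd(xs, ys, lr=0.05, l2=0.0, steps=2000):
    """Fit y ~ w*x + b by batch gradient descent on MSE + l2 * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current fit.
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error, plus the L2 penalty on w only.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errs, xs)) + 2.0 * l2 * w
        grad_b = 2.0 / n * sum(errs)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # toy data generated by y = 2x + 1

w, b = fit_linear_gd(xs, ys)              # no regularization
w_reg, b_reg = fit_linear_gd(xs, ys, l2=1.0)  # L2 penalty shrinks the slope
print(w, b, w_reg)
```

Without the penalty the fit recovers the slope and intercept of the toy data; with `l2=1.0` the slope is pulled toward zero, which is the shrinkage effect that L2 regularization is meant to produce.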
2. Deep Learning & Neural Networks
Scaling up ML with multi-layer, differentiable functions.
- Basic Unit: Neuron → weighted sum + non-linear activation (ReLU, sigmoid, tanh).
- Key Architectures:
- FNN (Feedforward): Fully connected layers.
- CNN: Spatial hierarchies via convolution + pooling (image, video).
- RNN/LSTM: Sequential data (text, time series) — now largely replaced by Transformers.
- Training Deep Nets:
- Backpropagation + automatic differentiation.
- Vanishing/exploding gradients → skip connections, batch norm, residual nets (ResNet).
- Practical tricks:
- Weight initialization (Xavier/He).
- Learning rate scheduling.
- Data augmentation.
3. Transformers & LLMs
Current dominant paradigm for sequence modeling, powering large language models.
- Self-Attention (core innovation):
- Query, Key, Value matrices.
- Scaled dot-product attention: $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$.
- Allows each token to directly attend to all others → solves long-range dependency limits of RNNs.
- Transformer Block:
- Multi-head attention (parallel attention heads).
- Feedforward network (per token).
- Residual connections + layer norm.
- Positional encodings (sinusoidal or learned).
- LLM Evolution:
- Encoder-only (BERT) → understanding tasks.
- Decoder-only (GPT) → generation, few-shot learning.
- Encoder-decoder (T5, BART) → translation, summarization.
- Scaling Laws: Loss falls predictably (roughly as a power law) as data, parameters, and compute grow.
- Modern LLM concepts:
- Pretraining (next-token prediction on web-scale text) + fine-tuning (SFT, RLHF).
- In-context learning, Chain-of-Thought.
- Efficient variants: LLaMA, Mistral; MoE (Mixture of Experts) models such as Mixtral (GPT-4 is widely reported, though not officially confirmed, to use MoE).
4. Generative Models Deep Dive
Models that learn the data distribution p(x) to generate new samples.
Taxonomy & Key Methods:
| Type | Idea | Example | Pros / Cons |
|---|---|---|---|
| VAEs (Variational Autoencoders) | Encode to latent z, decode; maximize ELBO | Stable Diffusion's latent space | Good latent representation, but blurry outputs |
| GANs | Generator vs. discriminator min-max game | StyleGAN, BigGAN | Sharp outputs, but unstable training, mode collapse |
| Autoregressive models | Predict next token/pixel sequentially | PixelCNN, GPT | Exact likelihood, but slow sampling |
| Normalizing Flows | Invertible transforms for exact density | Glow, RealNVP | Exact log-likelihood, invertible, but constrained architecture |
| Diffusion models | Gradually add noise, then reverse | DDPM, Stable Diffusion | State-of-the-art quality, stable training, slower sampling |
Current State of the Art:
- Text-to-image: Diffusion models (Stable Diffusion, DALL-E 3, Midjourney, Flux).
- Text-to-video: Sora (diffusion transformer), Runway Gen-2.
- Text generation: Autoregressive LLMs (GPT-4, Claude, Gemini).
- Multimodal generation: Combined diffusion + LLM (e.g., any-to-any models).
Key Insight (2025+): Diffusion models with Transformer backbones dominate high-quality continuous generation (image, video, audio), while autoregressive LLMs over discrete tokens dominate text and code. Hybrid models are emerging.
Summary Comparison Table
| Topic | Key Task | Main Tool | Output Example |
|---|---|---|---|
| ML Basics | Predict from features | Linear/logistic regression, trees | House price |
| Deep Learning | Hierarchical features | CNN, ResNet, LSTM | Classify image |
| Transformers | Sequence modeling | Attention, GPT, BERT | Write essay |
| Generative Models | Create new data | Diffusion, GAN, VAE | Draw a cat |
2. Machine Learning Basics
Machine Learning (ML) is a subset of AI where systems learn patterns from data instead of being explicitly programmed.
Core Types of Machine Learning
| Type | Description | Supervision | Examples |
|---|---|---|---|
| Supervised Learning | Learns from labeled data (input → output pairs) | Full | Spam detection, image classification, house price prediction |
| Unsupervised Learning | Finds hidden patterns in unlabeled data | None | Customer segmentation, anomaly detection, dimensionality reduction |
| Semi-Supervised | Uses mostly unlabeled + some labeled data | Partial | Large-scale image labeling |
| Reinforcement Learning | Learns via trial & error with rewards/penalties | Reward-based | Game playing (AlphaGo), robotics, recommendation |
Key ML Algorithms (Classical)
- Regression: Linear Regression, Polynomial Regression
- Classification: Logistic Regression, Decision Trees, Random Forest, SVM, Naive Bayes, k-NN
- Clustering: K-Means, Hierarchical, DBSCAN
- Dimensionality Reduction: PCA, t-SNE, UMAP
- Ensemble Methods: Bagging, Boosting (XGBoost, LightGBM, CatBoost)
Fundamental Concepts
- Bias-Variance Tradeoff: High bias = underfitting, High variance = overfitting.
- Overfitting vs Underfitting:
- Overfitting: Model memorizes the training data and performs poorly on new data.
- Underfitting: Model too simple to capture patterns.
- Evaluation Metrics:
- Regression: MSE, RMSE, MAE, R²
- Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC, Confusion Matrix
- Cross-Validation: k-fold CV to get reliable performance estimate.
- Feature Engineering: Creating better input features (still very important).
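The classification metrics listed above all derive from the four confusion-matrix counts. A minimal sketch for the binary case follows; the function name `binary_metrics` and the example labels are made up for illustration, and returning 0.0 when a denominator is zero is one convention among several.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this example accuracy (0.75) is higher than precision and recall (both 2/3), which hints at why accuracy alone can mislead when classes are imbalanced.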
Training Process:
- Split data → Train / Validation / Test sets
- Choose model + hyperparameters
- Train on training set
- Tune on validation set
- Evaluate on unseen test set
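The first step of the process above, splitting the data, might look like the following sketch. The 70/15/15 proportions, the fixed seed, and the name `train_val_test_split` are assumptions chosen for illustration; the notes do not prescribe particular ratios.

```python
import random


def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Hyperparameters would then be tuned using only `train` and `val`, with `test` touched once at the end so that the final estimate is not biased by model selection.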
3. Deep Learning & Neural Networks
Deep Learning is Machine Learning using artificial neural networks with many layers (hence “deep”).
Biological Inspiration
- Biological neuron → Artificial Neuron (Perceptron): a loose analogy in which weighted inputs are summed and passed through an activation function
Core Components of a Neural Network
- Neurons (Nodes)
- Layers:
- Input Layer
- Hidden Layers (this is where depth comes from)
- Output Layer
- Weights & Biases (learnable parameters)
- Activation Functions (introduce non-linearity). Common ones:
- ReLU (Rectified Linear Unit): f(x) = max(0, x), the most common default
- Sigmoid: 1 / (1 + e^(-x))
- Tanh
- Leaky ReLU, GELU, Swish (modern)
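For reference, the activation functions named above can be written as scalar functions in a few lines. This is a sketch under some assumptions: the leaky slope of 0.01 is a commonly used default rather than a value from these notes, Swish is shown with its scaling parameter fixed at 1, and GELU uses the exact erf-based form rather than the tanh approximation.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):  # slope is an assumed default
    return x if x > 0 else slope * x


def sigmoid(x):
    # Branch on the sign so math.exp never overflows for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x):
    return math.tanh(x)


def swish(x):
    # Swish with beta = 1 (also known as SiLU).
    return x * sigmoid(x)


def gelu(x):
    # Exact GELU: x times the standard normal CDF at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for f in (relu, leaky_relu, sigmoid, tanh, swish, gelu):
    print(f.__name__, [round(f(v), 4) for v in (-2.0, 0.0, 2.0)])
```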
Forward Propagation
Input → Weighted sum → Activation → Next layer → Final output
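The forward pass described above can be traced by hand on a tiny network. The sketch below uses a hypothetical two-input, two-hidden-unit, one-output network; the weights are arbitrary numbers picked so the arithmetic is easy to follow.

```python
def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) per output neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


x = [1.0, 2.0]

# Hidden layer: two neurons with ReLU activation.
hidden = dense(x, [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu)

# Output layer: one linear neuron (no activation), as in regression.
output = dense(hidden, [[2.0, 1.0]], [0.5], identity)

print(hidden, output)
```

The first hidden neuron computes 1·1 + (−1)·2 = −1, which ReLU clips to 0; the second computes 0.5·1 + 0.5·2 = 1.5. The output is then 2·0 + 1·1.5 + 0.5 = 2.0.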
Backpropagation (The Learning Algorithm)
- Calculate error (Loss function)
- Compute gradients using Chain Rule
- Update weights using Gradient Descent (or variants: SGD, Adam, RMSprop)
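These three steps can be followed end to end on a single sigmoid neuron. The sketch below assumes a squared-error loss and arbitrary starting values; it computes the gradient with the chain rule, checks it against a finite-difference estimate, and applies one gradient descent update.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2


def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw (and likewise for b)."""
    a = sigmoid(w * x + b)
    dloss_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)          # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz
    return dloss_dz * x, dloss_dz  # dz/dw = x, dz/db = 1


w, b, x, y, lr = 0.5, -0.2, 1.5, 1.0, 0.1  # arbitrary illustrative values

grad_w, grad_b = gradients(w, b, x, y)

# Sanity check: central finite difference should agree with the chain rule.
eps = 1e-6
numeric_grad_w = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

# One gradient descent step.
loss_before = loss(w, b, x, y)
w_new, b_new = w - lr * grad_w, b - lr * grad_b
loss_after = loss(w_new, b_new, x, y)

print(grad_w, numeric_grad_w, loss_before, loss_after)
```

In a multi-layer network the same chain-rule product is extended backwards through every layer, which is what automatic differentiation frameworks do on the user's behalf.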
Loss Functions:
- Mean Squared Error (regression)
- Cross-Entropy (classification)
- Binary Cross-Entropy
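The loss functions above are short enough to write out directly. In this sketch the clipping constant `eps` is an assumed safeguard against `log(0)`, and `cross_entropy` takes a single example with an integer class index; both are illustrative choices.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error over paired targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels and P(label = 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def cross_entropy(true_class, probs, eps=1e-12):
    """Negative log-probability assigned to the correct class (one example)."""
    return -math.log(max(probs[true_class], eps))


print(mse([1.0, 2.0], [1.0, 4.0]))
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
print(cross_entropy(2, [0.1, 0.2, 0.7]))
```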
Popular Architectures
- Feedforward Neural Networks (MLP): the basic fully connected architecture
- Convolutional Neural Networks (CNNs): well suited to images and other spatial data
- Recurrent Neural Networks (RNNs): for sequences (largely superseded by Transformers)
- LSTMs / GRUs: gated RNNs that handle long-range dependencies better
- Transformers: the current dominant architecture (next section)
Why Deep Learning Works Well:
- Automatic feature learning (greatly reduces the need for manual feature engineering)
- Hierarchical representations (edges → shapes → objects)
Challenges:
- Typically requires large amounts of training data
- Computationally expensive (needs GPUs/TPUs)
- Black-box nature (hard to interpret)
4. Transformers & Large Language Models (LLMs)
The Transformer, introduced in the 2017 paper “Attention Is All You Need”, is the dominant architecture in modern AI.
Key Innovation: Self-Attention Mechanism
Instead of processing sequentially (like RNNs), Transformers process entire sequences in parallel using attention.
Formula (Simplified):
$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
- Q = Query, K = Key, V = Value; d_k = dimension of each key vector
- Scaled dot-product attention
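Scaled dot-product attention can be sketched in plain Python: each query is scored against every key, the scores are divided by the square root of the key dimension, a softmax turns them into weights, and the output is the weighted sum of the values. The sketch covers a single head with no masking or learned projections, and the toy `Q`, `K`, `V` matrices are made up for illustration.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

out = attention(Q, K, V)
print(out)
```

Each query here matches one key more strongly than the other, so its output leans toward the corresponding value without ignoring the other one entirely.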
Transformer Architecture
- Encoder (for understanding) — Used in BERT
- Decoder (for generation) — Used in GPT models
- Encoder-Decoder — Used in translation (T5, BART)
Components:
- Multi-Head Self-Attention
- Feed-Forward Networks
- Layer Normalization + Residual Connections
- Positional Encoding (since attention has no sense of order)
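One way to give attention a sense of order is the fixed sinusoidal encoding from the original Transformer paper, in which even dimensions use a sine and odd dimensions a cosine at geometrically spaced frequencies. The sketch below assumes that scheme; learned position embeddings are another common choice, and the name `positional_encoding` is illustrative.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: one d_model-sized vector per position."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k/d_model).
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


pe = positional_encoding(seq_len=4, d_model=6)
for row in pe:
    print([round(v, 3) for v in row])
```

These vectors are added to the token embeddings before the first layer, so that otherwise identical tokens at different positions receive different inputs.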
Large Language Models (LLMs)
Most modern LLMs are decoder-only Transformers trained on massive text corpora.
Training Objective: Next Token Prediction (Causal Language Modeling)
Given “The sky is”, predict next word “blue”.
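In code, the next-token objective amounts to a softmax over the vocabulary followed by the negative log-probability of the token that actually came next. The four-word vocabulary and the logits below are invented for illustration; a real model would produce logits over tens of thousands of tokens.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and model scores for the token after "The sky is".
vocab = ["blue", "green", "falling", "the"]
logits = [3.0, 1.0, 0.5, -1.0]

probs = softmax(logits)

# Training signal: negative log-likelihood of the true next token.
target = vocab.index("blue")
loss = -math.log(probs[target])

# Greedy decoding: pick the most probable token.
predicted = vocab[probs.index(max(probs))]

print(predicted, round(loss, 4))
```

Pre-training averages this loss over every position in every training sequence, so a single document supplies many prediction targets.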
Key Scaling:
- More parameters → better performance (some abilities appear to emerge only at larger scales)
- More data
- More compute
Major LLM Families (as of 2026):
| Family | Examples | Strengths |
|---|---|---|
| OpenAI | GPT-4o, o1, o3 | Reasoning, multimodality |
| Anthropic | Claude 3.5/4 | Safety, long context |
| Meta | Llama 3.1/4 | Open weights |
| xAI | Grok 2 / Grok 3 | Real-time knowledge, truth-seeking |
| Google | Gemini 2 | Multimodal |
| Mistral | Mistral Large, Mixtral | Efficient |
Techniques Used in LLMs:
- Pre-training (self-supervised, on massive unlabeled text)
- Instruction Tuning / Supervised Fine-Tuning (SFT)
- RLHF (Reinforcement Learning from Human Feedback)
- Chain-of-Thought (CoT) prompting
- Test-time Compute (o1-style reasoning)
5. Generative Models Deep Dive
Generative Models learn the underlying probability distribution of data to create new samples.
Major Types
- Generative Adversarial Networks (GANs)
- Generator vs Discriminator game
- Excellent image quality but mode collapse issues
- Variants: StyleGAN, CycleGAN, BigGAN
- Variational Autoencoders (VAEs)
- Encoder compresses data into latent distribution
- Decoder generates from latent space
- Good for smooth interpolation
- Flow-based Models (Normalizing Flows)
- Reversible transformations
- Exact likelihood computation
- Diffusion Models (Current King for Images)
- Forward process: Gradually add Gaussian noise
- Reverse process: Learn to denoise step-by-step
- Models: Stable Diffusion, DALL·E 3, Midjourney v6, Flux, Imagen 3
- Autoregressive Models
- GPT-style: Generate one token at a time
- Excellent for text, also used for images (PixelRNN, Parti)
- Multimodal Generative Models
- Generate across modalities (text → image, image → video, audio, etc.)
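The forward (noising) process of diffusion models listed above has a closed form in the DDPM formulation: a noisy sample at step t is a scaled copy of the clean data plus Gaussian noise, with the mix set by the cumulative product of (1 − β). The sketch below assumes a linear β schedule from 1e-4 to 0.02 over 1000 steps, as commonly used with DDPM; the function names and the toy data vector are illustrative.

```python
import math
import random


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1)
            for t in range(T)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t): the signal fraction left at step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t given x_0 directly, without simulating every step."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


T = 1000
abar = alpha_bars(linear_betas(T))
rng = random.Random(0)

x0 = [1.0, -1.0, 0.5]                     # toy "clean" data
x_early = q_sample(x0, 0, abar, rng)      # almost unchanged
x_late = q_sample(x0, T - 1, abar, rng)   # almost pure noise

print(abar[0], abar[-1])
print(x_early, x_late)
```

The reverse process trains a network to undo this corruption one step at a time, which is why sampling takes many network evaluations and is slower than a single GAN or VAE forward pass.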
Comparison of Generative Models
| Model Type | Image Quality | Training Stability | Speed | Best For |
|---|---|---|---|---|
| GANs | Excellent | Poor | Fast | High-fidelity images |
| VAEs | Good | Good | Fast | Latent space control |
| Diffusion | Best | Good | Slow | Creative generation |
| Autoregressive (Transformers) | Very Good | Good | Medium | Text & multimodal |
Current Trends (2026):
- Consistency Models / Flow Matching — Faster diffusion
- Mixture of Experts (MoE) — Efficient scaling
- World Models & Video generation (Sora-like models)
- Agentic Generation — Models that plan before generating

