Machine Learning Basics, Deep Learning & Neural Networks, Transformers & LLMs, Generative Models Deep Dive

1. Machine Learning Basics

Core concepts that underpin all other topics.

  • Types of Learning:
    • Supervised: Learn mapping from inputs to labeled outputs (e.g., classification, regression).
    • Unsupervised: Find hidden structure in unlabeled data (e.g., clustering, dimensionality reduction).
    • Reinforcement Learning: Learn actions via rewards/punishments.
  • Key Ideas:
    • Training / validation / test sets.
    • Bias-variance tradeoff (underfitting ↔ overfitting).
    • Loss functions (MSE, cross-entropy).
    • Optimization: Gradient Descent (SGD, Adam).
    • Regularization: L1/L2, dropout, early stopping.
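To make the last three ideas concrete, here is a minimal sketch of batch gradient descent minimizing MSE for a one-feature linear model, with an optional L2 penalty. The function name `fit_linear`, the toy data, and the learning rate are illustrative assumptions rather than anything from a specific library.

```python
def fit_linear(xs, ys, lr=0.05, l2=0.0, steps=2000):
    """Fit y ~ w*x + b by batch gradient descent on MSE, plus an optional L2 penalty on w."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every training point.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of mean((w*x + b - y)^2) + l2 * w^2 with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs)) + 2.0 * l2 * w
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (noise-free, purely for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_linear(xs, ys)              # converges close to w = 2, b = 1
w_reg, _ = fit_linear(xs, ys, l2=1.0)  # the L2 penalty shrinks w toward zero
```

With the penalty switched on, the fitted slope ends up smaller than the true slope: regularization trades a little bias for lower variance.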

2. Deep Learning & Neural Networks

Scaling up ML with multi-layer, differentiable functions.

  • Basic Unit: Neuron → weighted sum + non-linear activation (ReLU, sigmoid, tanh).
  • Key Architectures:
    • FNN (Feedforward): Fully connected layers.
    • CNN: Spatial hierarchies via convolution + pooling (image, video).
    • RNN/LSTM: Sequential data (text, time series) — now largely replaced by Transformers.
  • Training Deep Nets:
    • Backpropagation + automatic differentiation.
    • Vanishing/exploding gradients → skip connections, batch norm, residual nets (ResNet).
  • Practical tricks:
    • Weight initialization (Xavier/He).
    • Learning rate scheduling.
    • Data augmentation.

3. Transformers & LLMs

Current dominant paradigm for sequence modeling, powering large language models.

  • Self-Attention (core innovation):
    • Query, Key, Value matrices.
    • Scaled dot-product attention: \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.
    • Allows each token to directly attend to all others → solves long-range dependency limits of RNNs.
  • Transformer Block:
    • Multi-head attention (parallel attention heads).
    • Feedforward network (per token).
    • Residual connections + layer norm.
    • Positional encodings (sinusoidal or learned).
  • LLM Evolution:
    • Encoder-only (BERT) → understanding tasks.
    • Decoder-only (GPT) → generation, few-shot learning.
    • Encoder-decoder (T5, BART) → translation, summarization.
  • Scaling Laws: Performance improves predictably with data, parameters, compute.
  • Modern LLM concepts:
    • Pretraining (next-token prediction on web-scale text) + fine-tuning (SFT, RLHF).
    • In-context learning, Chain-of-Thought.
    • Efficient variants: LLaMA, Mistral; MoE (Mixture of Experts) models such as Mixtral (GPT-4 is widely reported, though not officially confirmed, to use MoE).

4. Generative Models Deep Dive

Models that learn the data distribution p(x) to generate new samples.

Taxonomy & Key Methods:

| Type | Idea | Example | Pros / Cons |
| --- | --- | --- | --- |
| VAEs (Variational Autoencoders) | Encode to latent z, decode; maximize ELBO | Stable Diffusion’s latent space | Good latent representation, blurry outputs |
| GANs | Generator vs. discriminator min-max game | StyleGAN, BigGAN | Sharp outputs, but unstable training, mode collapse |
| Autoregressive models | Predict next token/pixel sequentially | PixelCNN, GPT | Exact likelihood, but slow sampling |
| Normalizing Flows | Invertible transforms for exact density | Glow, RealNVP | Exact log-likelihood, invertible, but constrained architecture |
| Diffusion models | Gradually add noise, then reverse | DDPM, Stable Diffusion | State-of-the-art quality, stable training, slower sampling |

Current State of the Art:

  • Text-to-image: Diffusion models (Stable Diffusion, DALL-E 3, Midjourney, Flux).
  • Text-to-video: Sora (diffusion transformer), Runway Gen-2.
  • Text generation: Autoregressive LLMs (GPT-4, Claude, Gemini).
  • Multimodal generation: Combined diffusion + LLM (e.g., any-to-any models).

Key Insight (2025+): Diffusion + Transformer backbones → dominant for high-quality continuous generation (image, video, audio). LLMs + discrete tokens for text/code. Hybrid models emerging.


Summary Comparison Table

| Topic | Key Task | Main Tool | Output Example |
| --- | --- | --- | --- |
| ML Basics | Predict from features | Linear/logistic regression, trees | House price |
| Deep Learning | Hierarchical features | CNN, ResNet, LSTM | Classify image |
| Transformers | Sequence modeling | Attention, GPT, BERT | Write essay |
| Generative Models | Create new data | Diffusion, GAN, VAE | Draw a cat |

2. Machine Learning Basics

Machine Learning (ML) is a subset of AI where systems learn patterns from data instead of being explicitly programmed.

Core Types of Machine Learning

| Type | Description | Supervision | Examples |
| --- | --- | --- | --- |
| Supervised Learning | Learns from labeled data (input → output pairs) | Full | Spam detection, image classification, house price prediction |
| Unsupervised Learning | Finds hidden patterns in unlabeled data | None | Customer segmentation, anomaly detection, dimensionality reduction |
| Semi-Supervised | Uses mostly unlabeled + some labeled data | Partial | Large-scale image labeling |
| Reinforcement Learning | Learns via trial & error with rewards/penalties | Reward-based | Game playing (AlphaGo), robotics, recommendation |

Key ML Algorithms (Classical)

  • Regression: Linear Regression, Polynomial Regression
  • Classification: Logistic Regression, Decision Trees, Random Forest, SVM, Naive Bayes, k-NN
  • Clustering: K-Means, Hierarchical, DBSCAN
  • Dimensionality Reduction: PCA, t-SNE, UMAP
  • Ensemble Methods: Bagging, Boosting (XGBoost, LightGBM, CatBoost)

Fundamental Concepts

  • Bias-Variance Tradeoff: High bias = underfitting, High variance = overfitting.
  • Overfitting vs Underfitting:
    • Overfitting: Model memorizes training data, poor on new data.
    • Underfitting: Model too simple to capture patterns.
  • Evaluation Metrics:
    • Regression: MSE, RMSE, MAE, R²
    • Classification: Accuracy, Precision, Recall, F1-Score, ROC-AUC, Confusion Matrix
  • Cross-Validation: k-fold CV to get reliable performance estimate.
  • Feature Engineering: Creating better input features (still very important).
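The classification metrics listed above all derive from the four cells of the confusion matrix. The sketch below computes them by hand for a binary problem; the helper name `classification_metrics` and the example labels are made up for illustration (in practice a library such as scikit-learn would usually be used).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical ground truth and predictions for eight examples.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
# Here tp = 3, fp = 1, fn = 1, tn = 3, so all four scores equal 0.75.
```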

Training Process:

  1. Split data → Train / Validation / Test sets
  2. Choose model + hyperparameters
  3. Train on training set
  4. Tune on validation set
  5. Evaluate on unseen test set
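Step 1 of the process above can be sketched in a few lines. This is a minimal version that assumes the examples are independent, so a single random shuffle is enough; the function name and the 70/15/15 proportions are illustrative choices, not a fixed rule.

```python
import random


def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut the data into disjoint train / validation / test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


# 100 dummy example ids stand in for a real dataset.
train, val, test = train_val_test_split(range(100))
```

The key property is that the three sets never overlap: the test set only gives an honest estimate if the model and its hyperparameters were never tuned on it.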

3. Deep Learning & Neural Networks

Deep Learning is Machine Learning using artificial neural networks with many layers (hence “deep”).

Biological Inspiration

  • Biological neuron → Artificial Neuron (Perceptron)

Core Components of a Neural Network

  1. Neurons (Nodes)
  2. Layers:
    • Input Layer
    • Hidden Layers (this is where depth comes from)
    • Output Layer
  3. Weights & Biases (learnable parameters)
  4. Activation Functions (introduce non-linearity). Common ones:
    • ReLU (Rectified Linear Unit): f(x) = max(0, x) — most popular
    • Sigmoid: 1 / (1 + e^(-x))
    • Tanh
    • Leaky ReLU, GELU, Swish (modern)

Forward Propagation

Input → Weighted sum → Activation → Next layer → Final output
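As a rough sketch of that pipeline, the snippet below pushes one input through a tiny 2 → 2 → 1 network in plain Python. The weights and biases are hand-picked, hypothetical values chosen so the arithmetic is easy to follow; a real network would learn them.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each neuron takes a weighted sum, adds a bias, applies the activation."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical parameters (one row of weights per neuron).
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, -1.0]
out_w = [[1.0, 2.0]]
out_b = [-1.0]

x = [2.0, 1.0]
hidden = dense(x, hidden_w, hidden_b, relu)    # [1.0, 0.5]
output = dense(hidden, out_w, out_b, sigmoid)  # [sigmoid(1.0)], roughly 0.731
```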

Backpropagation (The Learning Algorithm)

  • Calculate error (Loss function)
  • Compute gradients using Chain Rule
  • Update weights using Gradient Descent (or variants: SGD, Adam, RMSprop)
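The three steps above can be traced by hand for the smallest possible network: a single sigmoid neuron with a squared-error loss. This is only a sketch with made-up numbers; frameworks automate the same chain-rule bookkeeping with automatic differentiation. A finite-difference check is included to confirm the analytic gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_fn(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2


def gradients(w, b, x, y):
    """Backpropagate loss -> activation -> weighted sum using the chain rule."""
    a = sigmoid(w * x + b)
    dloss_da = 2.0 * (a - y)       # d(loss)/d(activation)
    da_dz = a * (1.0 - a)          # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz
    return dloss_dz * x, dloss_dz  # dz/dw = x and dz/db = 1


# Illustrative starting point and a single training example.
w, b, x, y = 0.5, -0.2, 1.5, 1.0
grad_w, grad_b = gradients(w, b, x, y)

# Finite-difference check of the analytic gradient with respect to w.
eps = 1e-6
numeric_grad_w = (loss_fn(w + eps, b, x, y) - loss_fn(w - eps, b, x, y)) / (2 * eps)

# One gradient descent step with an illustrative learning rate.
lr = 0.5
loss_before = loss_fn(w, b, x, y)
w, b = w - lr * grad_w, b - lr * grad_b
loss_after = loss_fn(w, b, x, y)  # lower than loss_before
```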

Loss Functions:

  • Mean Squared Error (regression)
  • Cross-Entropy (classification)
  • Binary Cross-Entropy
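For reference, here is a minimal plain-Python sketch of these three losses. The function names and the clamping constant `eps` are illustrative assumptions; deep learning libraries ship numerically safer versions that usually work on logits rather than probabilities.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error for regression targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """y_true holds 0/1 labels; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def cross_entropy(true_class, probs, eps=1e-12):
    """Multi-class loss for one example: -log of the probability given to the true class."""
    return -math.log(max(probs[true_class], eps))


mse_value = mse([3.0, 5.0], [2.0, 7.0])                     # (1 + 4) / 2 = 2.5
bce_confident = binary_cross_entropy([1, 0], [0.9, 0.1])    # small: predictions are right
bce_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])        # large: confidently wrong
ce_value = cross_entropy(2, [0.1, 0.2, 0.7])                # -log(0.7)
```

Cross-entropy punishes confident mistakes much harder than hesitant ones, which is why it is the usual choice for classification.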

Popular Architectures

  • Feedforward Neural Networks (MLP): the basic fully connected architecture
  • Convolutional Neural Networks (CNNs): well suited to images and other spatial data
  • Recurrent Neural Networks (RNNs): designed for sequences, now largely superseded by Transformers
  • LSTMs / GRUs: gated RNN variants that handle long-range dependencies better
  • Transformers: the current dominant architecture (next section)

Why Deep Learning Works Well:

  • Automatic feature learning (greatly reduces the need for manual feature engineering)
  • Hierarchical representations (edges → shapes → objects)

Challenges:

  • Typically requires large amounts of training data
  • Computationally expensive (needs GPUs/TPUs)
  • Black-box nature (hard to interpret)

4. Transformers & Large Language Models (LLMs)

The Transformer (introduced in the 2017 paper “Attention Is All You Need”) is arguably the most important architecture in modern AI.

Key Innovation: Self-Attention Mechanism

Instead of processing sequentially (like RNNs), Transformers process entire sequences in parallel using attention.

Formula (Simplified):

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

  • Q = Query, K = Key, V = Value
  • Scaled dot-product attention
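The attention formula is short enough to implement directly. Below is a minimal single-head sketch in plain Python, with no batching, masking, or learned projections; the 2-dimensional Q, K and V rows are hypothetical values for three tokens. Real implementations use matrix libraries, but the arithmetic is the same.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, on lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical queries, keys and values for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output is a convex combination of the value vectors: a token "reads" mostly from the tokens whose keys match its query.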

Transformer Architecture

  • Encoder (for understanding) — Used in BERT
  • Decoder (for generation) — Used in GPT models
  • Encoder-Decoder — Used in translation (T5, BART)

Components:

  • Multi-Head Self-Attention
  • Feed-Forward Networks
  • Layer Normalization + Residual Connections
  • Positional Encoding (since attention has no sense of order)
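One common way to supply order information is the fixed sinusoidal encoding from the original Transformer paper, where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The sketch below assumes an even `d_model`; the function name is illustrative, and many modern models use learned or rotary encodings instead.

```python
import math


def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table; even columns use sin, odd columns use cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Columns 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(num_positions=4, d_model=6)
# pe[0] is [0, 1, 0, 1, 0, 1]; each later position gets a different pattern.
```

These vectors are added to the token embeddings before the first attention layer, so two identical tokens at different positions no longer look the same to the model.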

Large Language Models (LLMs)

Most modern LLMs are decoder-only Transformers trained on massive text corpora.

Training Objective: Next Token Prediction (Causal Language Modeling)

Given “The sky is”, predict next word “blue”.
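To illustrate the objective (not the architecture), here is a deliberately tiny stand-in: a bigram model that counts which word follows which and turns the counts into next-word probabilities. A real LLM conditions on the whole context with a Transformer and learns by minimizing cross-entropy; the three-sentence corpus and the function names here are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the training sentences."""
    follow_counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follow_counts[prev][nxt] += 1
    return follow_counts


def next_token_probs(follow_counts, context):
    """Probability of each candidate next word, given only the last word of the context."""
    last = context.lower().split()[-1]
    counts = follow_counts[last]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy corpus.
corpus = ["the sky is blue", "the sky is blue today", "the grass is green"]
model = train_bigram(corpus)

probs = next_token_probs(model, "The sky is")  # blue: 2/3, green: 1/3
prediction = max(probs, key=probs.get)         # "blue"
```

Generation is then just repetition: append the chosen word to the context and predict again.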

Key Scaling:

  • More parameters → Better performance (Emergent abilities appear)
  • More data
  • More compute

Major LLM Families (as of 2026):

| Family | Examples | Strengths |
| --- | --- | --- |
| OpenAI | GPT-4o, o1, o3 | Reasoning, multimodality |
| Anthropic | Claude 3.5/4 | Safety, long context |
| Meta | Llama 3.1/4 | Open weights |
| xAI | Grok 2 / Grok 3 | Real-time knowledge, truth-seeking |
| Google | Gemini 2 | Multimodal |
| Mistral | Mistral Large, Mixtral | Efficient |

Techniques Used in LLMs:

  • Pre-training (self-supervised learning on massive unlabeled text)
  • Instruction Tuning / Supervised Fine-Tuning (SFT)
  • RLHF (Reinforcement Learning from Human Feedback)
  • Chain-of-Thought (CoT) prompting
  • Test-time Compute (o1-style reasoning)

5. Generative Models Deep Dive

Generative Models learn the underlying probability distribution of data to create new samples.

Major Types

  1. Generative Adversarial Networks (GANs)
    • Generator vs Discriminator game
    • Excellent image quality but mode collapse issues
    • Variants: StyleGAN, CycleGAN, BigGAN
  2. Variational Autoencoders (VAEs)
    • Encoder compresses data into latent distribution
    • Decoder generates from latent space
    • Good for smooth interpolation
  3. Flow-based Models (Normalizing Flows)
    • Reversible transformations
    • Exact likelihood computation
  4. Diffusion Models (Current King for Images)
    • Forward process: Gradually add Gaussian noise
    • Reverse process: Learn to denoise step-by-step
    • Models: Stable Diffusion, DALL·E 3, Midjourney v6, Flux, Imagen 3
  5. Autoregressive Models
    • GPT-style: Generate one token at a time
    • Excellent for text, also used in image (PixelRNN, Parti)
  6. Multimodal Generative Models
    • Generate across modalities (text → image, image → video, audio, etc.)
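The forward (noising) half of a diffusion model, item 4 in the list above, has a convenient closed form: x_t can be sampled directly from x_0 as sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the running product of (1 − β_s) and ε is standard Gaussian noise. The sketch below assumes the linear β schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps); the three-value "image" and the function names are illustrative. The learned reverse (denoising) network is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced between beta_start and beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """Running product of (1 - beta_t): the fraction of signal variance left at step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in one shot, without simulating the t intermediate steps."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha_bar(betas)

rng = random.Random(0)                        # seeded for reproducibility
x0 = [1.0, -1.0, 0.5]                         # a toy "image" with three pixels
x_early = q_sample(x0, 10, alpha_bars, rng)   # still close to x0
x_late = q_sample(x0, 999, alpha_bars, rng)   # almost pure Gaussian noise
```

Training then amounts to showing a network pairs of (x_t, t) and asking it to predict the noise that was added, which is what makes the step-by-step reverse process possible.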

Comparison of Generative Models

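| Model Type | Image Quality | Training Stability | Speed | Best For |
| --- | --- | --- | --- | --- |
| GANs | Excellent | Poor | Fast | High-fidelity images |
| VAEs | Good | Good | Fast | Latent space control |
| Diffusion | Best | Good | Slow | Creative generation |
| Transformers | Very Good | Good | Medium | Text & multimodal |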

Current Trends (2026):

  • Consistency Models / Flow Matching — Faster diffusion
  • Mixture of Experts (MoE) — Efficient scaling
  • World Models & Video generation (Sora-like models)
  • Agentic Generation — Models that plan before generating
