Neural Networks & Machine Learning
ML (machine learning) enables systems to learn patterns from data without explicit programming. This pack covers the core theory and practical knowledge for building, training, and deploying NN (neural network) based systems.

Learning Paradigms

SL (supervised learning): the model learns a mapping from input X to output Y given labeled training pairs (x_i, y_i). REG (regression) predicts continuous values; CLF (classification) predicts discrete categories. The goal is to minimize expected loss over unseen data (generalization), not just training data (memorization).

UL (unsupervised learning): discovers structure in unlabeled data. Clustering (k-means, DBSCAN, hierarchical), dimensionality reduction (PCA, t-SNE, UMAP), density estimation, and generative models. Autoencoders learn compressed representations by encoding the input to a latent space and then reconstructing it.

RL (reinforcement learning): an agent learns a policy (state-to-action mapping) by interacting with an environment and receiving rewards. Key challenge: the exploration vs. exploitation tradeoff. Q-learning learns an action-value function; policy gradient methods directly optimize the policy. PPO (Proximal Policy Optimization) is the current workhorse for continuous control and RLHF.

SSL (self-supervised learning) has emerged as the dominant pre-training paradigm. Masked language modeling (BERT), autoregressive prediction (GPT), and contrastive learning (SimCLR, CLIP) all create supervision signals from the data itself, eliminating the labeling bottleneck.

Neural Network Fundamentals

A neuron computes output = ACT(W*x + b), where W is the weight matrix, b the bias, and ACT the activation function. Activation functions introduce nonlinearity; without them, a network of any depth collapses to a single linear transformation. ReLU (max(0, x)) dominates hidden layers: simple, sparse activation, no vanishing gradient for positive inputs. GELU (Gaussian Error Linear Unit) is preferred in transformers. Softmax normalizes the output layer for multi-class CLF.

BP (backpropagation) computes gradients via the chain rule from the loss back through the network. SGD (stochastic gradient descent) updates weights: w = w - lr * gradient. The Adam optimizer adapts per-parameter learning rates using first and second moment estimates; it is the default choice for most applications. LR scheduling: warmup (linearly increase LR) then decay (cosine annealing, linear decay, or step decay).

BN (batch normalization) normalizes activations within a mini-batch: it stabilizes training, enables higher LR, and provides mild regularization. LN (layer normalization) normalizes across features within a single sample; it is required for variable-length sequences and small batches, and is standard in transformers.

Regularization techniques prevent OVF (overfitting): L2 regularization (weight decay), dropout (randomly zero activations during training, probability 0.1-0.5), early stopping (monitor validation loss, stop when it increases), and data augmentation (expand the training set with transformed examples).

Convolutional Neural Networks

CNN (convolutional neural networks) exploit spatial locality and translation invariance. A conv layer applies learnable filters (kernels) across the input via a sliding window. Key hyperparameters: kernel size (typically 3x3), stride (step size), padding (preserve spatial dimensions), and number of filters (channel depth). Pooling (max or average) reduces spatial resolution and provides local translation invariance. Modern architectures often replace pooling with strided convolutions.
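A minimal NumPy sketch of the neuron computation above, output = ACT(W*x + b), stacked into a two-layer network with ReLU hidden units and a softmax output (the layer sizes and init scales here are illustrative, not prescriptive):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 16))                          # batch of 4, 16 features
    W1, b1 = rng.normal(size=(16, 32)) * 0.1, np.zeros(32)
    W2, b2 = rng.normal(size=(32, 3)) * 0.1, np.zeros(3)

    h = relu(x @ W1 + b1)        # hidden layer: ACT(W*x + b)
    probs = softmax(h @ W2 + b2) # output layer for 3-class CLF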
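The Adam update described above, written out as one step; compare the final line with plain SGD's w = w - lr * gradient. Hyperparameter defaults follow the common convention; treat this as a sketch, not a reference implementation:

    import numpy as np

    def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update: per-parameter step size from moment estimates."""
        m = beta1 * m + (1 - beta1) * g        # first moment (running mean of grads)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (running mean of grad^2)
        m_hat = m / (1 - beta1 ** t)           # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v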
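The kernel/stride/padding hyperparameters above fix a conv layer's output resolution. A one-line helper shows the standard arithmetic (the 224-pixel input is just an example):

    def conv_out_size(n, k=3, s=1, p=1):
        """Output size along one spatial dimension: floor((n + 2p - k)/s) + 1."""
        return (n + 2 * p - k) // s + 1

    conv_out_size(224, k=3, s=1, p=1)  # 224: "same" padding preserves resolution
    conv_out_size(224, k=3, s=2, p=1)  # 112: a strided conv halves it (replaces pooling)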
ResNet introduced skip connections (residual learning): h(x) = f(x) + x, enabling training of very deep networks (100+ layers) by ensuring gradient flow. Architecture evolution: LeNet → AlexNet (ReLU, dropout, GPU training) → VGG (uniform 3x3) → GoogLeNet (inception modules) → ResNet (residual connections) → EfficientNet (compound scaling). For detection: Faster R-CNN (two-stage), YOLO (single-stage, real-time), DETR (transformer-based).

Transformer Architecture

The transformer replaced recurrence with self-ATN (self-attention), enabling parallel processing of sequences. Self-ATN computes Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V, where Q (queries), K (keys), and V (values) are linear projections of the input. MHA (multi-head attention) runs h parallel attention functions and concatenates the results, allowing the model to attend to information from different representation subspaces at different positions. Typical configs: 8-128 heads, d_model 512-12288. The attention pattern reveals what the model considers relevant, making it a useful interpretability tool.

PE (positional encoding) injects sequence-order information, since self-ATN is permutation-invariant. Options: sinusoidal PE (original transformer), learned PE (BERT, GPT), RoPE (rotary position embedding: encodes relative position, extrapolates to longer sequences), and ALiBi (linear bias: simple and effective for length generalization).

The encoder-decoder architecture (original transformer, T5): the encoder processes the input sequence bidirectionally; the decoder generates output autoregressively, attending to the encoder output via cross-attention. Decoder-only (GPT family): causal masking ensures each position attends only to preceding positions, which is natural for generation tasks.

Scaling laws: model performance improves predictably with compute, data, and parameters following power laws. Chinchilla scaling: optimal training uses ~20 tokens per parameter, so a 7B-parameter model should train on ~140B tokens. Undertrained large models waste compute; smaller well-trained models often outperform them.

Training at Scale

DDP (distributed data parallel): replicate the model on multiple GPUs; each processes a different mini-batch, and gradients are all-reduced (averaged). Linear scaling rule: multiply LR by the number of GPUs (with warmup). FSDP (fully sharded data parallel) shards model parameters, gradients, and optimizer states across GPUs, enabling training of models larger than a single GPU's memory. MP (model parallelism): tensor parallel (split individual layers across GPUs) and pipeline parallel (split layers into stages). 3D parallelism combines data, tensor, and pipeline parallel for the largest models. Communication overhead is the primary bottleneck; NVLink and InfiniBand minimize latency.

MIXP (mixed precision) training: run the forward pass in FP16/BF16, accumulate gradients in FP32. BF16 is preferred (same dynamic range as FP32, so no loss scaling is needed). Mixed precision reduces memory by ~50% and increases throughput 2-3x on tensor cores. With FP16, loss scaling prevents gradient underflow.

Fine-tuning & Adaptation

FT (fine-tuning) adapts pre-trained models to specific tasks. Full FT updates all parameters; it is effective but expensive and risks catastrophic forgetting. PEFT (parameter-efficient fine-tuning) methods update minimal parameters: LoRA (Low-Rank Adaptation: inject trainable low-rank matrices into attention layers, typically rank 8-64, <1% trainable parameters), QLoRA (quantize the base model to 4-bit, apply LoRA), and adapters (small bottleneck modules inserted between layers).
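A NumPy sketch of the attention formula above for a single head, with an optional causal mask for decoder-only models (matrix shapes are illustrative):

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V, causal=False):
        """softmax(Q K^T / sqrt(d_k)) V for one head; Q, K, V are (seq, d_k)."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        if causal:  # decoder-only: each position sees only earlier positions
            mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
            scores = np.where(mask, -1e9, scores)
        return softmax(scores) @ V

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(8, 64)) for _ in range(3))
    out = attention(Q, K, V, causal=True)  # (8, 64)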
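A sketch of the original transformer's sinusoidal PE, assuming an even d_model: sin on even dimensions, cos on odd ones, with geometrically spaced wavelengths so each position gets a unique pattern:

    import numpy as np

    def sinusoidal_pe(seq_len, d_model):
        """PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(...)."""
        pos = np.arange(seq_len)[:, None]
        i = np.arange(0, d_model, 2)[None, :]
        angles = pos / (10000 ** (i / d_model))
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)  # even dimensions
        pe[:, 1::2] = np.cos(angles)  # odd dimensions
        return pe                     # added to the token embeddings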
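A minimal PyTorch sketch of the FP16 mixed-precision loop described above; the tiny linear model and random data are stand-ins, and a CUDA device is assumed. With BF16 (torch.bfloat16) the scaler is unnecessary, since BF16 shares FP32's exponent range:

    import torch

    model = torch.nn.Linear(128, 10).cuda()               # stand-in model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()                  # FP16 needs loss scaling

    for _ in range(10):
        x = torch.randn(32, 128, device="cuda")
        y = torch.randint(0, 10, (32,), device="cuda")
        optimizer.zero_grad()
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = torch.nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()  # scale loss up to avoid FP16 grad underflow
        scaler.step(optimizer)         # unscales grads; skips step on inf/NaN
        scaler.update()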
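A PyTorch sketch of the LoRA idea above: the pre-trained weight stays frozen, and only the low-rank pair A, B (scaled by alpha/r) is trained. The rank and init choices follow common practice but are assumptions here:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base layer plus trainable low-rank update (alpha/r) * B @ A."""
        def __init__(self, base: nn.Linear, r=8, alpha=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # freeze the pre-trained weights
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

    base = nn.Linear(512, 512)                 # e.g. an attention projection
    layer = LoRALinear(base, r=8, alpha=16)
    out = layer(torch.randn(4, 512))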
RLHF (RL from Human Feedback): train a reward model on human preference comparisons, then optimize the policy (the language model) against the reward model using PPO. DPO (Direct Preference Optimization) eliminates the reward model: it optimizes the policy directly from preference pairs, which is simpler and more stable.

Prompt engineering: zero-shot (task description only), few-shot (include examples in context), chain-of-thought (elicit step-by-step reasoning), and RAG (retrieval augmented generation: retrieve relevant context from an external knowledge base). System prompts set behavioral guidelines.

INF (inference) optimization: KV-cache (store computed key-value pairs to avoid recomputation during autoregressive generation), quantization (INT8/INT4 reduces model size and increases throughput with minimal quality loss), speculative decoding (a small draft model generates candidates, the large model verifies them in parallel; 2-3x speedup), and continuous batching (dynamically add/remove requests from the running batch).

Evaluation & Metrics

CLF metrics: accuracy (overall), precision (of predicted positives, how many are correct), recall (of actual positives, how many are found), F1 (harmonic mean of precision and recall), AUC-ROC (area under the receiver operating characteristic curve; threshold-independent). For imbalanced data, accuracy is misleading; use F1 or AUC. REG metrics: MSE (mean squared error), MAE (mean absolute error), R-squared (variance explained). Generation metrics: perplexity (exponentiated cross-entropy; lower is better), BLEU/ROUGE (n-gram overlap with a reference), and human evaluation (the gold standard, but expensive and slow).

MLOps: experiment tracking (W&B, MLflow), model versioning, automated evaluation pipelines, and monitoring for data drift and model degradation in production. A/B testing validates real-world impact before full deployment.
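A sketch of the DPO objective described above, assuming per-sequence log-probabilities have already been computed under both the policy and the frozen reference model (beta is the usual preference-strength hyperparameter):

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        """Push the policy's log-prob margin over the reference toward the chosen response."""
        chosen_reward = beta * (logp_chosen - ref_logp_chosen)
        rejected_reward = beta * (logp_rejected - ref_logp_rejected)
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()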
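A toy NumPy sketch of the KV-cache idea above: at each autoregressive step only the new token is projected, and its key/value are appended to the cache rather than recomputing the whole prefix (random weights stand in for a trained model):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16
    Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))  # stand-in weights
    K_cache = np.zeros((0, d))
    V_cache = np.zeros((0, d))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def decode_step(x):
        """Project only the NEW token; reuse cached K/V from earlier steps."""
        global K_cache, V_cache
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        K_cache = np.vstack([K_cache, k])
        V_cache = np.vstack([V_cache, v])
        scores = (K_cache @ q) / np.sqrt(d)  # attend over all cached positions
        return softmax(scores) @ V_cache

    for _ in range(5):   # one projection per step, not one per prefix token
        out = decode_step(rng.normal(size=d))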
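A NumPy sketch of the binary CLF metrics above, computed from raw confusion counts; on imbalanced data this makes the accuracy-vs-F1 gap explicit (the labels below are toy data):

    import numpy as np

    def clf_metrics(y_true, y_pred):
        """Binary precision/recall/F1 from the confusion counts (positive class = 1)."""
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        precision = tp / (tp + fp)   # of predicted positives, how many are correct
        recall = tp / (tp + fn)      # of actual positives, how many are found
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    y_true = np.array([1, 0, 0, 1, 0, 0, 0, 0])
    y_pred = np.array([1, 0, 0, 0, 0, 1, 0, 0])
    print(clf_metrics(y_true, y_pred))  # (0.5, 0.5, 0.5), though accuracy is 75%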