Large Language Models: Fundamentals
# LLM Fundamentals Knowledge Pack
## Architecture: The Transformer
The Transformer architecture, introduced in "Attention Is All You Need" (2017), is the foundation of modern LLMs. Unlike RNNs or LSTMs, Transformers process entire sequences simultaneously, enabling massive parallelization.
### Self-Attention Mechanism
Instead of reading text sequentially, self-attention allows each token in a sequence to weigh the importance of every other token.
Components of self-attention:
1. **Query (Q)**: What the current token is looking for
2. **Key (K)**: What the other tokens offer
3. **Value (V)**: The actual content of the tokens
The attention output is calculated as `softmax(Q * K^T / sqrt(d_k)) * V`. The scaling factor `sqrt(d_k)` keeps the dot products from growing with the key dimension; without it, large scores push the softmax into saturated regions where gradients vanish.
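The formula above can be sketched in plain NumPy. This is a minimal illustration, not an optimized implementation; the shapes and toy random inputs are assumptions for the demo:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each output row is a mixture of value vectors, mixed according to how strongly that position's query matches every key.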
### Multi-Head Attention
Instead of one set of Q/K/V weights, models use multiple "heads". Each head learns to attend to different types of relationships (e.g., one head for grammar, one for coreference resolution, one for semantic meaning). The outputs of all heads are concatenated and linearly transformed.
### Feed-Forward Networks & Residual Connections
After attention, the representations pass through a feed-forward network (typically a two-layer MLP with a non-linear activation like GELU or SwiGLU). Residual connections (adding the input to the output) and Layer Normalization wrap both the attention and feed-forward blocks to stabilize training and prevent vanishing gradients in deep networks.
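A pre-norm feed-forward sub-block (LayerNorm, expand, GELU, project back, add residual) can be sketched as follows. The dimensions, the 0.02 weight scale, and the tanh GELU approximation are assumptions for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean and unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # Common tanh approximation of GELU.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn_block(x, W1, b1, W2, b2):
    # Pre-norm feed-forward sub-block: x + FFN(LayerNorm(x)).
    h = gelu(layer_norm(x) @ W1 + b1)   # expand d_model -> d_ff
    return x + (h @ W2 + b2)            # project back, add the residual

rng = np.random.default_rng(1)
d_model, d_ff, seq = 16, 64, 4
x = rng.normal(size=(seq, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.02; b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02; b2 = np.zeros(d_model)
y = ffn_block(x, W1, b1, W2, b2)
print(y.shape)  # (4, 16)
```

The residual path means the block only has to learn a *correction* to its input, which is what keeps gradients flowing in very deep stacks.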
## Tokenization & Embeddings
### Subword Tokenization
LLMs don't read words or letters; they read tokens. Common algorithms (Byte-Pair Encoding (BPE), WordPiece, SentencePiece) iteratively merge frequent character pairs.
- Example: "unhappiness" might be split into ["un", "happi", "ness"]
- Benefits: Handles out-of-vocabulary words gracefully, language-agnostic, balances vocabulary size vs. sequence length.
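The iterative merging at the heart of BPE can be sketched on a toy corpus. The corpus and number of merges are assumptions; real tokenizers operate on byte sequences over huge corpora:

```python
from collections import Counter

def most_frequent_pair(words):
    """words: dict mapping symbol tuples -> word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a tuple of characters, with a count.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # perform 3 merges
    words = merge_pair(words, most_frequent_pair(words))
print(list(words))
```

After three merges the frequent substring "low" has become a single symbol, while rarer suffixes like "st" remain split, which is exactly the vocabulary-size vs. sequence-length trade-off described above.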
### Embeddings
Each token is mapped to a dense high-dimensional vector (embedding) via the model's vocabulary matrix. Position embeddings are added to these token embeddings to inject sequence order (since self-attention is inherently permutation-invariant). Modern approaches use Rotary Position Embeddings (RoPE), which inject relative position information directly into the attention calculation rather than adding absolute positions at the input layer.
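The core RoPE operation, rotating each consecutive pair of feature dimensions by a position-dependent angle, can be sketched as follows (the base of 10000 follows the common convention; shapes are assumptions):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply Rotary Position Embeddings to x of shape (seq_len, d), d even.

    Each pair of dimensions (2i, 2i+1) is rotated by angle m * theta_i,
    where m is the position and theta_i = base ** (-2i / d).
    """
    seq_len, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)            # (d/2,)
    angles = np.arange(seq_len)[:, None] * theta[None]   # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.ones((6, 4))
q_rot = rope(q)
# Rotation changes direction, not length: norms are preserved per position.
```

Because rotations compose, the dot product between a rotated query and a rotated key depends only on the *difference* of their positions, which is what makes RoPE a relative scheme.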
## The Training Pipeline
### Phase 1: Pretraining (Next-Token Prediction)
The model is trained on massive corpora (trillions of tokens) from the internet. The objective is simple: given a sequence of tokens, predict the next token. This phase consumes roughly 99% of the compute and teaches the model grammar, facts, reasoning, and world knowledge. Result: a base model (e.g., Llama 3 Base) that acts as a powerful autocomplete but isn't helpful as an assistant.
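The next-token objective is just cross-entropy on a shifted sequence. A minimal sketch, with random logits and a hypothetical 50-token vocabulary standing in for real model outputs:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy for next-token prediction.

    logits: (seq_len, vocab) model outputs; token_ids: (seq_len,) input ids.
    The target at position t is the token at position t + 1.
    """
    logits, targets = logits[:-1], token_ids[1:]   # shift by one position
    # Log-softmax computed stably.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
loss = next_token_loss(rng.normal(size=(8, 50)), rng.integers(0, 50, size=8))
# With random logits over a 50-token vocab, the loss sits near ln(50) ≈ 3.9.
```

Training simply minimizes this quantity over trillions of tokens; everything else the base model "knows" falls out of that single objective.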
### Phase 2: Supervised Fine-Tuning (SFT)
The base model is refined on tens of thousands of high-quality conversational examples (prompt/response pairs). This teaches the model the *format* of a helpful assistant (how to answer questions, write code, or format lists). Result: an instruct model.
### Phase 3: Alignment (RLHF / DPO)
The model is optimized to prefer safe, helpful, and non-toxic responses.
- **RLHF (Reinforcement Learning from Human Feedback)**: Train a separate Reward Model on human preferences (A is better than B), then use Proximal Policy Optimization (PPO) to update the LLM to maximize this reward.
- **DPO (Direct Preference Optimization)**: Bypasses the Reward Model entirely by modifying the SFT loss function to directly increase the probability of preferred responses and decrease the probability of rejected responses.
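The DPO idea above reduces to a simple per-pair loss: a logistic loss on how much more the policy (relative to the frozen SFT reference) favors the chosen response over the rejected one. A sketch with made-up log-probabilities:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Arguments are total log-probabilities of the chosen/rejected responses
    under the policy (pi_*) and the frozen SFT reference model (ref_*).
    """
    # Implicit reward = beta * (policy log-prob - reference log-prob).
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that shifts probability toward the chosen response (relative to
# the reference) gets a lower loss than one that shifts it the other way.
good = dpo_loss(pi_chosen=-10.0, pi_rejected=-20.0,
                ref_chosen=-12.0, ref_rejected=-18.0)
bad = dpo_loss(pi_chosen=-20.0, pi_rejected=-10.0,
               ref_chosen=-12.0, ref_rejected=-18.0)
print(good < bad)  # True
```

Minimizing this loss raises the probability of preferred responses and lowers rejected ones without ever training a separate reward model.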
## Generation Mechanisms
### Inference & the KV Cache
During generation, the LLM predicts one token at a time autoregressively. To avoid recomputing the attention keys and values for all previous tokens at every step, the model caches them (the KV cache). As context length grows (e.g., 1M tokens in Gemini 1.5 Pro), the KV cache becomes a major memory bottleneck. Solutions include Multi-Query Attention (MQA), where all query heads share a single KV head, and Grouped-Query Attention (GQA), where groups of query heads share a smaller number of KV heads.
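A back-of-the-envelope size estimate makes the bottleneck concrete. The model configuration below is hypothetical (32 layers, 32 query heads of dimension 128, fp16 cache); only the formula matters:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_param=2):
    """Rough KV-cache size for one sequence: keys + values, all layers."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_param

# Hypothetical 32-layer model at a 128k-token context, fp16 (2 bytes).
mha = kv_cache_bytes(128_000, 32, 32, 128)   # full multi-head: 32 KV heads
gqa = kv_cache_bytes(128_000, 32, 8, 128)    # grouped-query: 8 KV heads
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
# MHA: 62.5 GiB, GQA: 15.6 GiB
```

Cutting 32 KV heads down to 8 shrinks the cache by exactly 4x, which is why GQA is so common in long-context models.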
### Decoding Strategies
- **Greedy Search**: Always pick the single highest-probability next token. Good for facts/code, bad for creative writing (often repeats itself).
- **Temperature**: Scales the logits before softmax. T=0 acts like greedy search. T=1 is standard. T>1 flattens the distribution, increasing randomness.
- **Top-K**: Restricts selection to the K most likely tokens.
- **Top-P (Nucleus)**: Restricts selection to the smallest set of tokens whose cumulative probability exceeds P.
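All four strategies can live in one small sampler. A sketch over a 4-token toy distribution; real implementations batch this and work on GPU tensors:

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token id from logits with temperature, top-k, and top-p."""
    rng = rng if rng is not None else np.random.default_rng()
    if temperature == 0:                       # greedy decoding
        return int(np.argmax(logits))
    logits = logits / temperature              # T < 1 sharpens, T > 1 flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # most likely first
    if top_k is not None:
        probs[order[top_k:]] = 0.0             # keep only the K best
    if top_p is not None:
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1   # smallest nucleus >= top_p
        probs[order[cutoff:]] = 0.0
    probs /= probs.sum()                       # renormalize survivors
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample(logits, temperature=0))  # 0 (greedy picks the argmax)
```

Note that top-k uses a fixed cutoff while top-p adapts: when the model is confident the nucleus is tiny, and when it is uncertain the nucleus widens.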
## Enhancing LLMs: Retrieval-Augmented Generation (RAG)
LLMs hallucinate and have a knowledge cutoff. RAG addresses both by augmenting the prompt with external information retrieved at runtime.
1. **Index**: Embed documents into vectors; store them in a vector database.
2. **Retrieve**: Embed the user's query; find the Top-K most similar documents using cosine similarity.
3. **Generate**: Inject the retrieved documents into the LLM's prompt: `[Context: {docs}] Answer the question: {query}`.
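The three steps above can be sketched end to end. The bag-of-words `embed` function is a stand-in for a real embedding model, and the documents and query are invented for the demo:

```python
import numpy as np
from collections import Counter

def embed(text, vocab):
    """Toy bag-of-words embedding; a real system would use a neural model."""
    counts = Counter(text.lower().split())
    v = np.array([counts[w] for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "The transformer architecture uses self attention",
    "BPE merges frequent character pairs into subword tokens",
    "The KV cache stores attention keys and values",
]
vocab = sorted({w for d in docs for w in d.lower().split()})

# 1. Index: embed every document (unit vectors, stacked as rows).
index = np.stack([embed(d, vocab) for d in docs])

# 2. Retrieve: embed the query and rank documents by cosine similarity.
query = "how does the kv cache store keys"
scores = index @ embed(query, vocab)   # dot product of unit vectors = cosine
best = docs[int(np.argmax(scores))]

# 3. Generate: inject the retrieved context into the prompt.
prompt = f"[Context: {best}] Answer the question: {query}"
print(best)
```

The KV-cache document wins the retrieval because it shares the most terms with the query; the generation step then grounds the LLM's answer in that retrieved text rather than in parametric memory alone.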