# Data Science Fundamentals
## Exploratory Data Analysis (EDA)
EDA is the first and most critical phase of any data science project. Before modeling, you must understand your data's structure, quality, and relationships.
**Data profiling**: for every column, compute type (numeric, categorical, datetime, text), count of non-null values, unique value count, and basic statistics (mean, median, std, min, max, quartiles for numeric; mode, frequency distribution for categorical). Tools: `pandas.DataFrame.describe()`, `pandas.DataFrame.info()`, or automated profiling with `ydata-profiling`.
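A minimal profiling sketch with pandas; the DataFrame and column names below are made up for illustration:

```python
import pandas as pd

# Tiny stand-in for a real dataset (column names are hypothetical).
df = pd.DataFrame({
    "age": [34, 45, None, 29],
    "plan": ["basic", "pro", "pro", None],
    "signup": pd.to_datetime(["2024-01-05", "2024-02-11", "2024-02-20", "2024-03-02"]),
})

# Per-column profile: dtype, non-null count, unique value count.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "non_null": df.notna().sum(),
    "unique": df.nunique(),
})
print(profile)
print(df.describe(include="all"))  # numeric stats plus top/frequency for object columns
```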
**Missing data analysis**: compute missing percentage per column. Below 5% missing is generally safe to impute. 5-30% requires careful strategy. Above 30% — consider dropping the column or engineering a binary "is_missing" feature (missingness itself can be predictive). Missing mechanisms matter: MCAR (random, safe to drop), MAR (related to observed data, impute using related columns), MNAR (related to the missing value itself, requires domain knowledge).
Imputation strategies: mean/median for numeric (median is robust to outliers), mode for categorical, KNN imputation when features are correlated, iterative imputation (MICE) for complex dependencies. Never impute the target variable — drop those rows.
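A short imputation sketch with scikit-learn, using a synthetic train/test pair (column names are hypothetical); the key point is that the imputer is fit on the training split only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny synthetic frames standing in for real train/test data.
train = pd.DataFrame({"age": [25, np.nan, 40, 31], "income": [50_000, 62_000, np.nan, 48_000]})
test = pd.DataFrame({"age": [np.nan, 52], "income": [55_000, np.nan]})

# Missing percentage per column, worst first.
print((train.isna().mean() * 100).sort_values(ascending=False))

# Median imputation: fit on the training split only, then transform both (avoids leakage).
imputer = SimpleImputer(strategy="median")
train_imp = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
test_imp = pd.DataFrame(imputer.transform(test), columns=test.columns)

# sklearn.impute.KNNImputer(n_neighbors=5) is a drop-in alternative when features are correlated.
```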
**Distribution analysis**: histogram every numeric feature. Check for skewness (log-transform if skew > 1), multimodality (may indicate mixed populations), and outliers (IQR method: below Q1-1.5*IQR or above Q3+1.5*IQR). For categorical features, check class balance and cardinality (high-cardinality categoricals need special encoding).
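A sketch of the IQR outlier rule and a skew check on a synthetic numeric column:

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

prices = pd.Series([12, 14, 15, 13, 16, 14, 120])   # synthetic right-skewed column
print("skew:", prices.skew())                        # > 1 suggests a log transform
print("outliers:", prices[iqr_outlier_mask(prices)].tolist())
print("log-transformed skew:", np.log1p(prices).skew())
```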
**Correlation analysis**: compute Pearson (linear), Spearman (monotonic), and point-biserial (numeric vs. binary) correlations. Create a correlation heatmap. Features correlated above 0.9 with each other introduce multicollinearity — consider dropping one. Features with near-zero correlation to the target may be droppable, but check for non-linear relationships first.
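A sketch of the multicollinearity check, flagging feature pairs with |r| above 0.9 on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["b"] = df["a"] * 0.98 + rng.normal(scale=0.05, size=200)   # nearly collinear with "a"
df["c"] = rng.normal(size=200)

corr = df.corr(method="pearson")   # use method="spearman" for monotonic relationships
# Feature pairs with |r| > 0.9 are candidates for dropping one of the pair.
high = [(i, j, round(corr.loc[i, j], 3))
        for i in corr.columns for j in corr.columns
        if i < j and abs(corr.loc[i, j]) > 0.9]
print(high)
```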
**Target variable analysis**: for classification, check class imbalance (a minority class below 10% requires a resampling strategy). For regression, check the distribution shape — highly skewed targets benefit from a log or Box-Cox transformation. Compute the target's correlation with every feature to identify top predictors early.
## Feature Engineering Techniques
Feature engineering is where domain knowledge transforms raw data into predictive signals. Good feature engineering often matters more than model selection.
**Numeric transformations**: log transform (right-skewed data like income, prices), square root (count data), Box-Cox (automatically finds the optimal power transform), standard scaling (for distance-based models like KNN, SVM), min-max scaling (for neural networks). Binning continuous variables into categories works when the relationship to the target is step-wise rather than linear.
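A sketch of these transforms with scikit-learn preprocessing, on a made-up income column (Box-Cox requires strictly positive values):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer

income = pd.DataFrame({"income": [22_000, 35_000, 41_000, 58_000, 250_000]})  # right-skewed

log_income = np.log1p(income)                                       # log transform
boxcox = PowerTransformer(method="box-cox").fit_transform(income)   # strictly positive input only
standardized = StandardScaler().fit_transform(income)               # for KNN, SVM
minmax = MinMaxScaler().fit_transform(income)                       # for neural networks
```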
**Categorical encoding**: one-hot encoding for low-cardinality categoricals (under 15 categories) — creates one binary column per category. Target encoding for high-cardinality categoricals (ZIP codes, product IDs): replace the category with the mean of the target variable for that category. Target encoding requires regularization to avoid overfitting — use smoothing: `encoded = (count * category_mean + global_mean * smoothing) / (count + smoothing)`. Ordinal encoding for naturally ordered categories (education level, satisfaction rating).
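A minimal implementation of the smoothing formula above, fit on training data only; the `zip` and `churned` columns are hypothetical:

```python
import pandas as pd

def smoothed_target_encode(train: pd.DataFrame, col: str, target: str, smoothing: float = 10.0) -> pd.Series:
    """Smoothed target encoding per the formula above; fit on training data only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    encoded = (stats["count"] * stats["mean"] + global_mean * smoothing) / (stats["count"] + smoothing)
    return train[col].map(encoded)   # map unseen categories to global_mean at inference time

train = pd.DataFrame({"zip": ["A", "A", "B", "B", "B", "C"],
                      "churned": [1, 0, 0, 0, 1, 1]})
train["zip_te"] = smoothed_target_encode(train, "zip", "churned")
print(train)
```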
**Datetime features**: extract year, month, day_of_week, hour, is_weekend, is_holiday, quarter, days_since_reference_date. Cyclical encoding for periodic features: `sin(2*pi*hour/24)` and `cos(2*pi*hour/24)` — captures that hour 23 and hour 0 are close.
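For example, cyclical encoding of an hour-of-day column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})   # hypothetical hour-of-day column
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
# Hour 23 and hour 0 now map to nearby points on the unit circle.
```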
**Interaction features**: multiply two features to capture joint effects. Example: `room_size * room_count = total_area`. Polynomial features (degree 2-3) capture non-linear relationships for linear models. Use domain knowledge to guide which interactions to create — brute-forcing all pairwise interactions causes a feature explosion.
**Aggregation features**: for grouped data (transactions per customer, events per session), compute count, sum, mean, std, min, max, last, time_since_last. These roll up granular data into entity-level features.
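A groupby sketch that rolls a hypothetical transaction log up to one row per customer:

```python
import pandas as pd

tx = pd.DataFrame({   # hypothetical transaction log
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [20.0, 35.0, 12.5, 300.0, 80.0],
    "ts": pd.to_datetime(["2024-01-03", "2024-01-10", "2024-02-01", "2024-01-15", "2024-03-02"]),
})

features = tx.groupby("customer_id").agg(
    tx_count=("amount", "count"),
    tx_sum=("amount", "sum"),
    tx_mean=("amount", "mean"),
    tx_std=("amount", "std"),
    last_tx=("ts", "max"),
)
features["days_since_last"] = (pd.Timestamp("2024-03-31") - features["last_tx"]).dt.days
print(features)
```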
**Text features**: TF-IDF for a bag-of-words representation (works surprisingly well for many tasks). Word count, character count, average word length, sentiment score, presence of specific keywords. For modern approaches, sentence embeddings from pre-trained models capture semantic meaning.
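A minimal TF-IDF sketch with scikit-learn, on made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["late delivery, very disappointed",
        "fast shipping and great quality",
        "quality ok, delivery late"]          # hypothetical review snippets
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X_text = vec.fit_transform(docs)              # sparse matrix: documents x n-gram vocabulary
print(X_text.shape, vec.get_feature_names_out()[:5])
```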
## Model & Algorithm Selection
**Linear models** (logistic regression, linear regression, Ridge, Lasso): fast, interpretable, strong baselines. They work well with many features and handle high dimensionality. Lasso (L1) provides automatic feature selection by zeroing out coefficients. Ridge (L2) handles multicollinearity. Use when interpretability matters or as a first baseline.
**Tree-based models**: single decision trees are interpretable but overfit aggressively. Random forests (ensembles of decision trees trained on bootstrapped samples with random feature subsets) reduce variance and are remarkably robust. Minimal hyperparameter tuning is needed — default settings often work well. Key hyperparameters: `n_estimators` (100-500), `max_depth` (None for full trees, or limit for speed), `min_samples_leaf` (increase to reduce overfitting).
**XGBoost (gradient-boosted trees)**: sequentially builds decision trees where each tree corrects the errors of the ensemble so far. Typically the strongest model for tabular data. Key hyperparameters: `learning_rate` (0.01-0.3, lower = more trees needed), `max_depth` (3-8, lower prevents overfitting), `n_estimators` (100-1000), `subsample` (0.7-0.9), `colsample_bytree` (0.7-0.9), `reg_alpha` (L1), `reg_lambda` (L2). XGBoost handles missing values natively. LightGBM is faster on large datasets with similar accuracy.
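A configuration sketch using the xgboost scikit-learn wrapper (assumes the `xgboost` package is installed; `X_train`/`y_train` are placeholders):

```python
from xgboost import XGBClassifier

# One configuration drawn from the ranges above, shown for illustration only.
model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,     # lower rate -> more trees needed
    max_depth=5,            # shallower trees reduce overfitting
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,          # L1 regularization
    reg_lambda=1.0,         # L2 regularization
)
# model.fit(X_train, y_train)   # X_train / y_train assumed to exist
```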
**Neural networks**: excel at unstructured data (images, text, audio) and very large datasets. For tabular data, neural networks rarely outperform gradient-boosted trees without extensive tuning. Architecture choices: feedforward networks for tabular data (2-4 hidden layers, ReLU activation, dropout 0.2-0.5), CNNs for images, transformers for sequences. They require more data, compute, and tuning than tree models.
**Model selection framework**: start with logistic/linear regression as a baseline. Add a random forest for comparison (minimal tuning). Add XGBoost for best performance on tabular data. Use neural networks only for unstructured data, very large datasets (millions of rows), or when tree models plateau. Always compare to the simplest viable model — if logistic regression achieves 92% AUC and XGBoost achieves 93%, the added complexity may not be worth it.
## Cross-Validation Strategy
Cross-validation (CV) provides honest performance estimates by testing on data the model hasn't seen during training.
**K-fold CV** (standard: k=5 or k=10): split the data into k folds, train on k-1, test on 1, rotate. Average the k scores. Stratified K-fold maintains target class proportions in each fold — essential for imbalanced classification. Use this as the default.
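A stratified 5-fold sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)  # imbalanced toy data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```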
**Time series CV**: data has temporal ordering. Never use random splits — future data leaks into training. Use expanding window (train on all data before fold) or sliding window (fixed training window). `TimeSeriesSplit` in scikit-learn implements this.
**Group CV**: when data has groups (multiple samples per patient, per customer). Use `GroupKFold` to ensure all samples from one group appear in the same fold. Prevents leakage from correlated samples.
**Nested CV**: the outer loop evaluates model performance, the inner loop tunes hyperparameters. This prevents hyperparameter tuning from biasing performance estimates and is essential for honest comparisons between model types. Structure: outer 5-fold for evaluation, inner 3-fold for hyperparameter tuning within each outer fold.
Hyperparameter tuning: grid search exhaustively searches the parameter grid (exponential cost). Random search samples randomly from parameter distributions — nearly as effective as grid search at a fraction of the evaluations. Bayesian optimization (Optuna, Hyperopt) uses prior results to guide the search — best for expensive models. Always tune within CV folds, never on the full dataset.
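A nested-CV sketch combining an inner randomized search with an outer 5-fold evaluation, on synthetic data:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

X, y = make_classification(n_samples=600, random_state=0)

# Inner loop: randomized search over hyperparameters (3-fold).
inner = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(100, 500), "min_samples_leaf": randint(1, 10)},
    n_iter=10, cv=3, scoring="roc_auc", random_state=0,
)
# Outer loop: 5-fold evaluation of the whole tuning procedure, not just the final model.
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(outer_scores.mean(), outer_scores.std())
```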
## Bias-Variance Tradeoff
The bias-variance tradeoff is the central tension in machine learning. Total error = bias² + variance + irreducible error.
**High bias (underfitting)**: the model is too simple to capture the underlying pattern. Symptoms: poor performance on both training and validation data — e.g., training accuracy 75%, validation accuracy 73%. Fixes: a more complex model, more features, less regularization, polynomial/interaction features.
**High variance (overfitting)**: the model memorizes the training data, including noise. Symptoms: excellent training performance, poor validation performance — e.g., training accuracy 99%, validation accuracy 80%. Fixes: more training data (most effective), stronger regularization, a simpler model, feature selection/reduction, dropout (for neural networks), early stopping, ensemble methods.
**Diagnosing bias vs. variance**: plot learning curves (training and validation score vs. training set size). If both curves converge at a low score = high bias. If there is a large gap between the curves = high variance. If both converge at a high score = sweet spot.
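A sketch that prints learning-curve scores instead of plotting them, on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5), scoring="accuracy",
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train={tr:.3f} val={va:.3f}")   # large gap -> high variance; both low -> high bias
```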
Regularization controls the bias-variance balance: L1 (Lasso) drives coefficients to zero (feature selection plus variance reduction). L2 (Ridge) shrinks coefficients (reduces variance without eliminating features). Elastic Net combines both. For tree models, `max_depth`, `min_samples_leaf`, and `max_features` all regularize by limiting tree complexity.
## Evaluation Metrics
**Classification metrics**: accuracy is misleading with class imbalance — 95% accuracy is useless when the majority class makes up 95% of samples. Use the confusion matrix to compute precision (of predicted positives, how many are correct), recall (of actual positives, how many were found), and F1 (the harmonic mean of precision and recall).
ROC-AUC: plots true positive rate vs. false positive rate at all thresholds. AUC of 0.5 = random guessing, 1.0 = perfect. Good for balanced datasets. For imbalanced data, PR-AUC (precision vs. recall curve) is more informative — it focuses on the minority class performance.
Threshold tuning: default 0.5 threshold is arbitrary. Optimize based on business cost. If false negatives cost 10x more than false positives (fraud detection), lower the threshold to increase recall at the cost of precision. Plot precision-recall vs. threshold to find the optimal operating point.
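A sketch of threshold selection from the precision-recall curve; the 90% recall floor stands in for a hypothetical cost model where false negatives are much more expensive than false positives:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, proba)

# precision[:-1] and recall[:-1] align with thresholds (ascending).
ok = np.where(recall[:-1] >= 0.90)[0]   # indices where recall stays at or above the floor
idx = ok[-1]                            # highest threshold that still meets the recall floor
print(f"threshold={thresholds[idx]:.3f} precision={precision[idx]:.3f} recall={recall[idx]:.3f}")
```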
**Regression metrics**: RMSE penalizes large errors more (squared term). MAE is robust to outliers and more interpretable ("average error is $X"). R-squared shows proportion of variance explained (0.8 means model explains 80% of variation). MAPE (mean absolute percentage error) provides scale-independent comparison. Choose based on whether large errors matter disproportionately (RMSE) or not (MAE).
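For example, computing these metrics on a handful of made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([200_000, 310_000, 150_000, 420_000])   # hypothetical house prices
y_pred = np.array([210_000, 295_000, 180_000, 400_000])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"RMSE={rmse:,.0f}  MAE={mae:,.0f}  R2={r2:.3f}  MAPE={mape:.1f}%")
```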
**Business metrics**: ultimately, model performance must map to business impact. A model that improves AUC by 0.02 might translate to $500K in fraud prevention or $0 in revenue impact depending on the use case. Always define the business metric first (revenue, cost savings, customer retention rate) and relate model metrics to it.
**Model comparison protocol**: compare models using the same CV strategy and same train/test splits. Use paired statistical tests (Wilcoxon signed-rank on fold-level scores) to determine if differences are statistically significant. A model that's 0.5% better in AUC but 10x slower to train and impossible to explain may not be the right choice. Consider accuracy, speed, interpretability, and maintenance cost.
## Feature Selection & Dimensionality Reduction
Filter methods: compute a relevance score for each feature independently. Mutual information (works for non-linear relationships), chi-squared (categorical features vs. categorical target), ANOVA F-test (numeric features vs. categorical target), correlation coefficient (numeric vs. numeric). Fast to compute, but ignore feature interactions.
Wrapper methods: train models with different feature subsets and compare performance. Forward selection (start empty, add the best feature iteratively), backward elimination (start full, remove the worst feature iteratively), recursive feature elimination (RFE — train the model, remove the least important feature, repeat). Computationally expensive but captures interactions. Use with cross-validation to prevent overfitting to the feature-selection process.
Embedded methods: feature selection built into model training. Lasso (L1) zeroes out unimportant feature coefficients. Random forests and XGBoost provide feature importance scores (Gini importance, permutation importance). SHAP values provide the most reliable feature importance — they account for feature interactions and are consistent across model types. Permutation importance is model-agnostic: shuffle one feature, measure the performance drop, and repeat for each feature.
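A permutation-importance sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=8, n_informative=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# Shuffle each feature on the validation set and measure the drop in AUC.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0, scoring="roc_auc")
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```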
PCA (Principal Component Analysis): linear dimensionality reduction. Projects data onto orthogonal components that maximize variance. Choose components that explain 90-95% of total variance (plot cumulative explained variance ratio). PCA components are not interpretable — you lose feature meaning. Use for visualization (2D projections), noise reduction, or when you have hundreds of correlated features. Always standardize features before PCA.
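A sketch of choosing the component count from cumulative explained variance, using a bundled scikit-learn dataset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)    # always standardize before PCA

pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1   # smallest count explaining >= 95% variance
print(n_components, cumulative[:n_components])

X_reduced = PCA(n_components=n_components).fit_transform(X_std)
```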
t-SNE and UMAP: non-linear dimensionality reduction for visualization. t-SNE excels at revealing cluster structure in high-dimensional data but is stochastic (run multiple times to verify patterns). UMAP preserves more global structure than t-SNE and is faster. Neither should be used for feature extraction in ML pipelines — they are visualization tools only.
## Practical ML Pipeline Architecture
Data pipeline stages: raw data ingestion, data validation (schema checks, range checks, null checks), feature engineering (transforms, encodings, aggregations), feature store (versioned, reusable features), model training (with cross-validation and hyperparameter tuning), model evaluation (holdout test set, never used during training or tuning), model deployment, monitoring.
Train/validation/test split: the test set is sacred — touch it only once, for final evaluation. Use 70/15/15 or 80/10/10 split. For time series: train on oldest data, validate on middle period, test on most recent. For grouped data: ensure groups don't leak across splits. Stratify by target for classification. For very small datasets (under 1000 rows), use nested CV instead of a fixed split.
Data leakage prevention: the #1 source of ML project failures. Leakage occurs when information from outside the training set influences the model. Common leaks: fitting scalers/encoders on full dataset before splitting (fit only on training data, transform all sets), using future data for FE in time series (use only past data), including features derived from the target, including proxy features that won't be available at prediction time. Always ask: "Would I have this feature at prediction time in production?"
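One way to make the "fit only on training data" rule automatic is to put preprocessing inside a scikit-learn Pipeline, so each CV fold re-fits the scaler on its own training portion; a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)

# The scaler lives inside the pipeline, so no statistics from validation folds
# leak into preprocessing during cross-validation.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])
print(cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean())
```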
ML experiment tracking: log every experiment with parameters, metrics, data version, code version, and model artifacts. Tools: MLflow, Weights & Biases, or even structured CSV logs. Without experiment tracking, you will forget which configuration produced your best result. Version datasets alongside code — the same code on different data produces different results.
## Model Deployment & Monitoring
Model serving patterns: batch prediction (run the model on a schedule, store predictions), real-time API (model behind a REST endpoint, latency-sensitive), streaming (model processes events as they arrive). Choose based on latency requirements and prediction volume. Most data science projects start with batch and add real-time later.
Model monitoring: track prediction distribution shift (are model outputs changing over time?), feature distribution shift (are inputs changing over time?), and performance degradation (are accuracy metrics declining?). Set alerts when prediction distribution KL-divergence exceeds threshold or when rolling accuracy drops below baseline. Concept drift means the relationship between features and target has changed — requires retraining. Data drift means input distributions have changed — may require retraining or investigation.
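A sketch of one possible drift check: KL divergence between binned prediction-score distributions, with synthetic reference and current scores (the alert threshold is a choice left to the team):

```python
import numpy as np
from scipy.stats import entropy

def prediction_drift(reference: np.ndarray, current: np.ndarray, bins: int = 20) -> float:
    """KL divergence between binned score distributions (reference || current)."""
    edges = np.histogram_bin_edges(reference, bins=bins, range=(0.0, 1.0))
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(current, bins=edges)
    p = (p + 1e-6) / (p + 1e-6).sum()   # smooth to avoid division by zero
    q = (q + 1e-6) / (q + 1e-6).sum()
    return float(entropy(p, q))

rng = np.random.default_rng(0)
ref = rng.beta(2, 5, size=5000)   # last month's prediction scores (synthetic)
cur = rng.beta(2, 3, size=5000)   # this week's scores, shifted
print(prediction_drift(ref, cur))  # alert when this exceeds the chosen threshold
```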
Retraining strategy: scheduled retraining (weekly/monthly) is simple but wasteful if data hasn't changed. Trigger-based retraining (retrain when monitoring detects drift) is more efficient. Champion/challenger pattern: new model (challenger) runs in shadow mode alongside current model (champion). Promote challenger only when it demonstrates statistically significant improvement over a sufficient evaluation period.
Model explainability: SHAP values explain individual predictions ("this loan was rejected because income was $30K below the threshold and the debt-to-income ratio was high"). Feature importance explains global model behavior ("income and credit score are the two most important features"). Partial dependence plots show the marginal effect of a single feature on predictions. For regulated industries (finance, healthcare), explainability is not optional — it's legally required.