# Data Science Fundamentals

## EDA Process

Exploratory data analysis (EDA) is the first and most critical phase of any data science project. Before modeling, you must understand your data's structure, quality, and relationships.

**Data profiling**: for every column, compute type (numeric, categorical, datetime, text), count of non-null values, unique value count, and basic statistics (mean, median, std, min, max, quartiles for numeric; mode, frequency distribution for categorical). Tools: `pandas.DataFrame.describe()`, `pandas.DataFrame.info()`, or automated profiling with `ydata-profiling`.
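The profiling step above can be sketched in a few lines of pandas; the sample frame here is hypothetical, used only to illustrate the calls:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51],
    "city": ["NY", "LA", "NY", "SF", "LA"],
})

# Numeric summary: count, mean, std, min, quartiles, max
numeric_profile = df.describe()

# One row per column: dtype, non-null count, unique value count
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "non_null": df.count(),
    "unique": df.nunique(),
})
print(profile)
```

For larger datasets, `ydata-profiling` generates this entire report (plus distributions and correlations) in one call.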

**Missing data analysis**: compute missing percentage per column. Below 5% missing is generally safe to impute. 5-30% requires a careful strategy. Above 30%, consider dropping the column or engineering a binary "is_missing" feature (missingness itself can be predictive). Missing mechanisms matter: MCAR (missing completely at random, safe to drop), MAR (missing at random, related to observed data, impute using related columns), MNAR (missing not at random, related to the missing value itself, requires domain knowledge).
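A minimal sketch of this triage, using the thresholds above on a hypothetical frame:

```python
import pandas as pd
import numpy as np

# Hypothetical data: income is 40% missing, region 10% missing
df = pd.DataFrame({
    "income": [50000, np.nan, 62000, np.nan, 58000,
               71000, np.nan, 49000, np.nan, 65000],
    "region": ["N", "S", None, "N", "S", "N", "S", "N", "S", "N"],
})

# Missing percentage per column
missing_pct = df.isna().mean() * 100

# Bucket columns by the 5% / 30% thresholds
to_impute = missing_pct[(missing_pct > 0) & (missing_pct <= 5)].index
to_review = missing_pct[(missing_pct > 5) & (missing_pct <= 30)].index
to_drop_or_flag = missing_pct[missing_pct > 30].index

# Missingness itself can be predictive: add an indicator feature
df["income_is_missing"] = df["income"].isna().astype(int)
```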

Imputation strategies: mean/median for numeric (median is robust to outliers), mode for categorical, KNN imputation when features are correlated, iterative imputation (MICE, multiple imputation by chained equations) for complex dependencies. Never impute the target variable; drop those rows instead.
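The two simplest strategies, median and mode, look like this in pandas (the frame is hypothetical; the 200.0 outlier shows why median beats mean here):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "price": [10.0, 12.0, np.nan, 11.0, 200.0],  # 200.0 is an outlier
    "color": ["red", "blue", "red", None, "red"],
})

# Median for numeric: robust to the 200.0 outlier (mean would be ~58)
df["price"] = df["price"].fillna(df["price"].median())

# Mode for categorical: fill with the most frequent value
df["color"] = df["color"].fillna(df["color"].mode()[0])
```

For the correlated-feature case, `sklearn.impute.KNNImputer` and `IterativeImputer` implement the KNN and MICE-style strategies respectively.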

**Distribution analysis**: histogram every numeric feature. Check for skewness (log-transform if skew > 1), multimodality (may indicate mixed populations), and outliers (IQR method: below Q1-1.5*IQR or above Q3+1.5*IQR). For categorical features, check class balance and cardinality (high-cardinality categoricals need special encoding).
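The skewness check and IQR fences can be sketched as follows, on a hypothetical right-skewed series:

```python
import pandas as pd
import numpy as np

# Hypothetical right-skewed data with one extreme value
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 120])

# Skewness check: log-transform if skew > 1 (log1p handles zeros safely)
if s.skew() > 1:
    transformed = np.log1p(s)

# IQR method: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```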

**Correlation analysis**: compute Pearson (linear), Spearman (monotonic), and point-biserial (numeric vs. binary) correlations. Create a correlation heatmap. Features correlated above 0.9 with each other introduce multicollinearity — consider dropping one. Features with near-zero correlation to the target may be droppable, but check for non-linear relationships first.
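A sketch of the multicollinearity scan described above, on synthetic data where one feature is deliberately near-collinear with another:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly collinear with a
    "c": rng.normal(size=200),                      # independent noise
})

pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Keep only the upper triangle so each pair appears once, then
# flag pairs with |r| > 0.9 as multicollinearity candidates
upper = pearson.where(np.triu(np.ones(pearson.shape, dtype=bool), k=1))
high_pairs = [
    (i, j)
    for i in upper.index for j in upper.columns
    if pd.notna(upper.loc[i, j]) and abs(upper.loc[i, j]) > 0.9
]
```

Plotting `pearson` with `seaborn.heatmap` gives the correlation heatmap mentioned above.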

**Target variable analysis**: for classification, check class imbalance (minority class below 10% requires resampling strategy). For regression, check distribution shape — highly skewed targets benefit from log or Box-Cox transformation. Compute target correlation with every feature to identify top predictors early.
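Both target checks are one-liners; the targets below are hypothetical:

```python
import pandas as pd
import numpy as np

# Classification: minority class share below 10% triggers resampling
y_clf = pd.Series([0] * 95 + [1] * 5)
minority_share = y_clf.value_counts(normalize=True).min()
needs_resampling = minority_share < 0.10

# Regression: log-transform a highly skewed target (invert with expm1)
y_reg = pd.Series([1.0, 2.0, 3.0, 500.0])
y_log = np.log1p(y_reg)
```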

## Feature Engineering Techniques

Feature engineering is where domain knowledge transforms raw data into predictive signals. Good feature engineering often matters more than model selection.

**Numeric transformations**: log transform (right-skewed data like income, prices), square root (count data), Box-Cox (automatically finds the optimal power transform), standard scaling (for distance-based models like KNN, SVM), min-max scaling (for neural networks). Binning continuous variables into categories works when the relationship to the target is step-wise rather than linear.
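These transforms can be sketched directly in pandas/NumPy (the series are hypothetical; `scipy.stats.boxcox` covers the Box-Cox case):

```python
import pandas as pd
import numpy as np

prices = pd.Series([10.0, 12.0, 15.0, 400.0])  # right-skewed

# Log transform for right-skewed data (log1p handles zeros)
log_prices = np.log1p(prices)

# Square root for count data
sqrt_counts = np.sqrt(pd.Series([0, 1, 4, 9]))

# Standard scaling (mean 0, std 1) for distance-based models
standard = (prices - prices.mean()) / prices.std()

# Min-max scaling to [0, 1] for neural networks
minmax = (prices - prices.min()) / (prices.max() - prices.min())

# Binning for step-wise relationships (labels are illustrative)
binned = pd.cut(prices, bins=3, labels=["low", "mid", "high"])
```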
