Statistical & Machine Learning Reference Guide

Descriptive Statistics

Central Tendency Measures

Mean (Arithmetic Average)

When to use:

For normally distributed data
When outliers are not a concern
When you need a measure sensitive to all values

Example: Average daily sales, average temperature

Median

The middle value when data is ordered

When to use:

When data is skewed
When outliers are present
For ordinal data

Example: House prices, income distributions

Mode

Most frequently occurring value

When to use:

For categorical data
When looking for most common occurrences
In multimodal distributions

Dispersion Measures

Standard Deviation

When to use:

To measure spread around the mean
When data is approximately normal
When units matter

Interquartile Range (IQR)

When to use:

When data is skewed
To identify outliers
When median is preferred over mean

Statistical Tests

Parametric Tests

T-Test (Student's t-test)

For comparing means:

When to use:

Comparing sample mean to population mean
Comparing means of two groups
When data is normally distributed

Example:

Testing if a new drug treatment is effective
Comparing test scores between two classes

ANOVA (Analysis of Variance)

F-statistic calculation:

When to use:

Comparing means of 3+ groups
Testing effects of categorical variables
When assumptions of normality and equal variance hold

Non-parametric Tests

Mann-Whitney U Test

Alternative to t-test when normality cannot be assumed

When to use:

Comparing two independent groups
When data is ordinal or continuous but not normal
Small sample sizes

Kruskal-Wallis Test

Non-parametric alternative to ANOVA

When to use:

Comparing 3+ independent groups
When normality cannot be assumed
When dealing with ordinal data

Regression Analysis

Linear Regression

When to use:

Predicting continuous outcomes
When relationship appears linear
When assumptions of linearity, independence, homoscedasticity are met

Key Metrics:

R-squared (coefficient of determination)
RMSE (Root Mean Square Error)
p-values for coefficients

Multiple Linear Regression

When to use:

Multiple predictors needed
Continuous outcome variable
Linear relationships assumed

Logistic Regression

When to use:

Binary outcome prediction
Probability estimation
Classification problems

Classification Algorithms

Decision Trees

Recursive partitioning using information gain or Gini impurity

When to use:

Need interpretable results
Mixed data types
Non-linear relationships
Hierarchical decision making

Advantages:

Easy to interpret
Handles missing values
No data scaling needed

Random Forest

Ensemble of decision trees with:

Bootstrap sampling
Random feature selection
Majority voting

When to use:

High-dimensional data
Complex relationships
When overfitting is a concern

Support Vector Machines (SVM)

Finding optimal hyperplane:

When to use:

Binary classification
High-dimensional data
When clear margin of separation exists

Clustering Methods

K-Means Clustering

Minimizing within-cluster sum of squares:

When to use:

When number of clusters is known
Spherical clusters expected
Large datasets

Hierarchical Clustering

When to use:

Unknown number of clusters
Need dendrogram visualization
Smaller datasets

DBSCAN

Density-based clustering

When to use:

Arbitrary shaped clusters
When noise/outliers present
When cluster number unknown

Dimensionality Reduction

Principal Component Analysis (PCA)

Finding orthogonal components maximizing variance

When to use:

Feature reduction needed
Correlated features present
Visualization of high-dim data

t-SNE

t-Distributed Stochastic Neighbor Embedding

When to use:

Visualization of high-dim data
When local structure important
Non-linear relationships

Time Series Analysis

Moving Average

When to use:

Smoothing time series
Trend identification
Noise reduction

Exponential Smoothing

When to use:

Short-term forecasting
When recent values more important
Trend and seasonality present

ARIMA Models

(Autoregressive Integrated Moving Average)

When to use:

Complex time series
Seasonal patterns
Need statistical forecasting

Components:

AR: Autoregression
I: Integration (differencing)
MA: Moving Average

Selection Guidelines

Data Type:
- Continuous → Regression, PCA
- Categorical → Classification, Chi-square
- Time-based → Time Series Analysis
- Mixed → Decision Trees
Sample Size:
- Small (n < 30) → Non-parametric tests
- Large → Parametric tests, Deep Learning
Assumptions:
- Normal distribution → Parametric tests
- Non-normal → Non-parametric alternatives
- Linear relationships → Linear regression
- Non-linear → Decision trees, SVM
Goal:
- Prediction → Regression, Classification
- Pattern Discovery → Clustering
- Dimension Reduction → PCA, t-SNE
- Time Forecasting → ARIMA, Exponential Smoothing