Statistical & Machine Learning Reference Guide


Descriptive Statistics

Central Tendency Measures

Mean (Arithmetic Average)

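When to use: Data is roughly symmetric and free of extreme outliers, since every value pulls on the mean equally.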

Example: Average daily sales, average temperature

Median

The middle value when data is ordered

When to use: Data is skewed or contains outliers; the median is not affected by extreme values.

Example: House prices, income distributions

Mode

Most frequently occurring value

When to use: Categorical or discrete data, where the most common value or category is of interest.
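
The contrast between the three measures can be sketched with Python's standard statistics module. The price list is made-up illustration data, chosen so that a single outlier pulls the mean far away from the median.

```python
import statistics

# Hypothetical house prices (in $1000s): nine typical values plus one outlier.
prices = [200, 210, 210, 220, 230, 240, 250, 260, 270, 2000]

mean_price = statistics.mean(prices)      # dragged upward by the 2000 outlier
median_price = statistics.median(prices)  # middle of the ordered values
mode_price = statistics.mode(prices)      # most frequently occurring value

print(f"mean={mean_price}, median={median_price}, mode={mode_price}")
```

Here the mean lands well above nine of the ten prices, while the median stays among the typical values, which is why the median is the usual choice for prices and incomes.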

Dispersion Measures

Standard Deviation

When to use: Roughly symmetric, outlier-free data; typically reported alongside the mean.

Interquartile Range (IQR)

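When to use: Skewed data or data with outliers; typically reported alongside the median. The IQR is the spread of the middle 50% of values (Q3 minus Q1).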
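
Both dispersion measures can be sketched with the standard statistics module. The data is made up, and the "inclusive" quartile method is an assumption; quartile conventions differ between tools, so other software may report slightly different values.

```python
import statistics

# Hypothetical sample of eight measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = statistics.stdev(data)       # divides by n - 1 (sample estimate)
population_sd = statistics.pstdev(data)  # divides by n (whole population)

# Quartiles: three cut points that split the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle 50% of the data

print(f"sample SD={sample_sd:.3f}, population SD={population_sd:.3f}, IQR={iqr}")
```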


Statistical Tests

Parametric Tests

T-Test (Student's t-test)

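For comparing means: $t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}$ for two independent samples (unequal-variance form), where $\bar{x}_i$, $s_i^2$, and $n_i$ are each sample's mean, variance, and size.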

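When to use: Comparing the means of one or two groups when the data are approximately normal (or the samples are large) and the observations are independent.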

Example:
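
A minimal sketch of a two-sample comparison, assuming the unequal-variance (Welch) form of the t-test and made-up measurements. It computes only the t statistic and the approximate degrees of freedom; turning these into a p-value needs the t distribution, which a library routine such as SciPy's scipy.stats.ttest_ind(a, b, equal_var=False) provides.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each sample mean: s^2 / n.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df

# Hypothetical measurements from two independent groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.3, 4.8, 4.0, 4.4]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

A large absolute t relative to the t distribution with the computed degrees of freedom suggests the group means differ.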

ANOVA (Analysis of Variance)

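F-statistic calculation: $F = \mathrm{MS}_{\text{between}} / \mathrm{MS}_{\text{within}}$, the ratio of the between-group mean square to the within-group mean square.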

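When to use: Comparing the means of three or more groups under approximate normality and similar group variances.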
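
The F-statistic for a one-way ANOVA is the between-group mean square divided by the within-group mean square. The following is a hand-rolled sketch on made-up groups; the function name is illustrative, and it returns only F, not a p-value.

```python
import statistics

def one_way_anova_f(groups):
    """F = MS_between / MS_within for a list of groups of numbers."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2
        for group in groups
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - statistics.mean(group)) ** 2 for group in groups for x in group
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical scores for three groups.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f"F = {f_stat}")
```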

Non-parametric Tests

Mann-Whitney U Test

Alternative to t-test when normality cannot be assumed

When to use: Comparing two independent groups with ordinal data, heavy skew, or outliers.
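
The test works on ranks rather than raw values. A minimal sketch of the U statistic, assuming tied values share the average of their ranks; the helper names are illustrative, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared_rank
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U_a, U_b); the smaller of the two is usually reported."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

# Hypothetical samples: every value in the first is below every value in the second.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
```

A U of 0 for one group means its values never exceed the other group's, the most extreme separation possible.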

Kruskal-Wallis Test

Non-parametric alternative to ANOVA

When to use: Comparing three or more independent groups when ANOVA's normality assumption is doubtful.


Regression Analysis

Linear Regression

When to use: Modeling a continuous outcome as a straight-line function of a single predictor.

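Key Metrics: R² (proportion of variance explained), RMSE (typical size of the prediction error), and p-values for the coefficients.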
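
A minimal sketch of ordinary least squares for one predictor, together with R², using closed-form sums rather than a library. The data points are made up, and the function names are illustrative.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Proportion of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical observations that lie close to a straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
r2 = r_squared(x, y, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.4f}")
```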

Multiple Linear Regression

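When to use: Modeling a continuous outcome from several predictors at once, while checking for multicollinearity among them.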

Logistic Regression

When to use: Predicting a binary outcome (for example yes/no) and estimating the probability of each class.
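
Logistic regression passes a linear score through the sigmoid function to produce a probability. The sketch below fits one feature by plain gradient descent on the log-loss; the learning rate, epoch count, and data are assumptions chosen for illustration, not tuned values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log-loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)
for h in (1, 8):
    print(f"P(pass | {h} hours) = {sigmoid(w * h + b):.2f}")
```

The fitted probability rises with hours studied; a threshold such as 0.5 turns it into a class prediction.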


Classification Algorithms

Decision Trees

Recursive partitioning using information gain or Gini impurity

When to use: Interpretable decision rules are needed, or the data mixes numeric and categorical features.

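Advantages: Easy to interpret and visualize, no feature scaling required, and able to capture non-linear relationships.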
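
The two split criteria named above can be sketched directly. Gini impurity and entropy both measure how mixed the class labels in a node are, and information gain is the drop in entropy produced by a split. The labels are made up and the function names are illustrative.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

# Hypothetical node with two classes, split perfectly into pure children.
parent = ["yes", "yes", "no", "no"]
print(gini(parent), entropy(parent))
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))
```

A tree-building algorithm evaluates candidate splits this way and keeps the one with the highest gain (or the lowest weighted impurity).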

Random Forest

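Ensemble of decision trees with: bootstrap sampling of the training rows (bagging) and a random subset of features considered at each split.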

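When to use: Better accuracy and less overfitting than a single tree are wanted, at the cost of some interpretability.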

Support Vector Machines (SVM)

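Finding optimal hyperplane: choose $w$ and $b$ to maximize the margin $2 / \lVert w \rVert$ subject to $y_i (w \cdot x_i + b) \ge 1$ for every training point (hard-margin form).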

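When to use: Small to medium datasets, often high-dimensional, where a clear margin between classes is plausible; kernels handle non-linear boundaries.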


Clustering Methods

K-Means Clustering

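Minimizing within-cluster sum of squares: $\sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$, where $\mu_j$ is the centroid of cluster $C_j$.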

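When to use: The number of clusters k is known or can be estimated, and the clusters are roughly spherical and similar in size.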
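
A minimal sketch of the standard alternating procedure (Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. Starting from the first k points is a simplifying assumption; practical implementations use random or k-means++ initialization. The points are made up.

```python
import math

def kmeans(points, k, max_iter=100):
    """Return (centroids, clusters) for a list of coordinate tuples."""
    centroids = list(points[:k])  # naive initialization: the first k points
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:  # converged: assignments are stable
            break
        centroids = updated
    return centroids, clusters

def wcss(centroids, clusters):
    """Within-cluster sum of squared distances, the quantity k-means minimizes."""
    return sum(
        math.dist(p, c) ** 2
        for c, members in zip(centroids, clusters)
        for p in members
    )

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids, wcss(centroids, clusters))
```

The result depends on the starting centroids, so in practice the algorithm is run several times and the run with the lowest within-cluster sum of squares is kept.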

Hierarchical Clustering

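When to use: The number of clusters is unknown in advance and a dendrogram of nested groupings is useful; best suited to small and medium datasets.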

DBSCAN

Density-based clustering

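When to use: Clusters have irregular shapes or the data contains noise points; the number of clusters does not need to be specified.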


Dimensionality Reduction

Principal Component Analysis (PCA)

Finding orthogonal components maximizing variance

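When to use: Many correlated numeric features need to be compressed into a few uncorrelated components, for visualization or as a preprocessing step.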
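
The first principal component is the direction of greatest variance, which is the leading eigenvector of the covariance matrix. The sketch below finds it by power iteration in plain Python; this is an illustrative shortcut, since library implementations normally use an eigendecomposition or SVD. It assumes the data has some variance and that the starting vector is not orthogonal to the leading direction.

```python
import math

def first_principal_component(points, iterations=200):
    """Return (unit direction, variance along it) for a list of points."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]

    # Sample covariance matrix of the centered data.
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]

    def apply_cov(v):
        return [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]

    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] * dims
    for _ in range(iterations):
        w = apply_cov(v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    variance = sum(vi * wi for vi, wi in zip(v, apply_cov(v)))
    return v, variance

# Hypothetical points lying exactly on the line y = 2x.
direction, variance = first_principal_component([(0, 0), (1, 2), (2, 4), (3, 6)])
print(direction, variance)
```

For these points the direction is proportional to (1, 2), and all of the variance lies along it.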

t-SNE

t-Distributed Stochastic Neighbor Embedding

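When to use: Visualizing high-dimensional data in two or three dimensions. The embedding preserves local neighborhoods, so distances between far-apart clusters should not be over-interpreted.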


Time Series Analysis

Moving Average

When to use: Smoothing short-term fluctuations to reveal the underlying trend.
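
A minimal sketch of a simple (unweighted) moving average, assuming a trailing window of fixed length; the function name and the series are illustrative.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical daily sales figures.
print(moving_average([1, 2, 3, 4, 5], window=3))  # [2.0, 3.0, 4.0]
```

The output is shorter than the input by window - 1 values, and a wider window gives a smoother but slower-reacting series.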

Exponential Smoothing

When to use: Short-term forecasting where recent observations should carry more weight than older ones.
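
A minimal sketch of simple exponential smoothing, where each smoothed value is a weighted blend of the newest observation and the previous smoothed value. Seeding with the first observation is one common convention and is an assumption here; alpha is the smoothing factor between 0 and 1.

```python
def exponential_smoothing(values, alpha):
    """s_t = alpha * x_t + (1 - alpha) * s_(t-1), seeded with s_0 = x_0."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical series; a higher alpha tracks recent values more closely.
print(exponential_smoothing([10, 20, 30], alpha=0.5))  # [10, 15.0, 22.5]
```

The last smoothed value serves as the one-step-ahead forecast in this simple form.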

ARIMA Models

(Autoregressive Integrated Moving Average)

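When to use: Forecasting a univariate series with autocorrelation, once differencing has made it approximately stationary.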

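Components: AR(p), regression on the previous p values; I(d), differencing applied d times; MA(q), regression on the previous q forecast errors.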
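
Two of the building blocks can be sketched in a few lines: differencing (the "integrated" part) and a least-squares estimate of a first-order autoregressive coefficient. This is not a full ARIMA fit; the moving-average part and proper estimation are left to a library such as statsmodels. The function names and series are illustrative.

```python
def difference(series, d=1):
    """Apply first differencing d times to remove trend."""
    for _ in range(d):
        series = [cur - prev for prev, cur in zip(series, series[1:])]
    return series

def fit_ar1(series):
    """Least-squares phi in x_t = phi * x_(t-1) + noise (no intercept)."""
    numerator = sum(prev * cur for prev, cur in zip(series, series[1:]))
    denominator = sum(prev * prev for prev in series[:-1])
    return numerator / denominator

# Hypothetical quadratic trend: two rounds of differencing leave a constant.
print(difference([1, 4, 9, 16, 25]))       # [3, 5, 7, 9]
print(difference([1, 4, 9, 16, 25], d=2))  # [2, 2, 2]

# Hypothetical series that halves at every step, so phi is 0.5.
print(fit_ar1([8, 4, 2, 1]))
```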


Selection Guidelines

  1. Data Type:

    • Continuous → Regression, PCA
    • Categorical → Classification, Chi-square
    • Time-based → Time Series Analysis
    • Mixed → Decision Trees
  2. Sample Size:

    • Small (n < 30) → Non-parametric tests, unless the data are plausibly normal, in which case t-tests still apply
    • Large → Parametric tests, Deep Learning
  3. Assumptions:

    • Normal distribution → Parametric tests
    • Non-normal → Non-parametric alternatives
    • Linear relationships → Linear regression
    • Non-linear → Decision trees, SVM
  4. Goal:

    • Prediction → Regression, Classification
    • Pattern Discovery → Clustering
    • Dimension Reduction → PCA, t-SNE
    • Time Forecasting → ARIMA, Exponential Smoothing