Week 4: Statistics Foundations for Data Analysis

Master the Mathematical Foundation Every Data Analyst Needs

Duration: 7 Days | Level: Intermediate | Prerequisites: Weeks 1-3

Welcome to Week 4! This week marks a pivotal point in your data analytics journey. Statistics is the backbone of data analysis - it's how we make sense of data, draw conclusions, and make predictions with confidence.

Whether you're calculating average sales, determining if a marketing campaign was successful, or predicting customer behavior, you'll use the statistical concepts you learn this week. By the end, you'll be able to analyze data distributions, calculate probabilities, and understand sampling - skills that separate good analysts from great ones.

Don't worry if math isn't your strongest subject. We'll build concepts step by step with practical examples and real-world applications. Let's dive in!

Day 1

Measures of Central Tendency

Understanding Mean, Median, and Mode

Why Statistics for Data Analysts?

Statistics is the science of collecting, organizing, analyzing, and interpreting data. For data analysts, it provides the tools to summarize large datasets, identify patterns, and make data-driven decisions with measurable confidence.

Every time you see a report saying "average customer spends Rs. 500" or "typical delivery time is 3 days," you're seeing statistics in action. These numbers help businesses understand their performance and make informed decisions.

Measures of Central Tendency

Central tendency measures tell us where the "center" of our data lies. There are three main measures:

Measure | Definition | When to Use | Sensitivity to Outliers
Mean | Sum of all values divided by count | Symmetric data without outliers | Highly sensitive
Median | Middle value when data is sorted | Skewed data or data with outliers | Resistant (robust)
Mode | Most frequently occurring value | Categorical data or finding common values | Not affected

1. Mean (Average)

The mean is the most common measure of central tendency. It's calculated by adding all values and dividing by the count.

Formula: Arithmetic Mean

x̄ = (x₁ + x₂ + x₃ + ... + xₙ) / n = Σx / n

Where x̄ (x-bar) is the mean, x values are individual data points, and n is the count.

Example: Calculating Mean Salary

A company has 5 employees with salaries: Rs. 30,000, Rs. 35,000, Rs. 40,000, Rs. 45,000, Rs. 50,000

Mean = (30,000 + 35,000 + 40,000 + 45,000 + 50,000) / 5 = Rs. 40,000

The average salary is Rs. 40,000.

Types of Means

Weighted Mean

When some values are more important than others, we use weighted mean.

Weighted Mean = Σ(wₖ × xₖ) / Σwₖ

Example: A student's grades with different credit weights:

  • Math (4 credits): 85%
  • English (3 credits): 90%
  • History (2 credits): 78%

Weighted Mean = (4×85 + 3×90 + 2×78) / (4+3+2) = (340 + 270 + 156) / 9 = 85.1%
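
To check this in code, numpy's np.average accepts a weights argument; here is a minimal sketch using the grades and credits from the example above:

# Python: weighted mean with numpy (grades and credits from the example above)
import numpy as np

grades = [85, 90, 78]   # Math, English, History scores (%)
credits = [4, 3, 2]     # credit weights

weighted_mean = np.average(grades, weights=credits)  # Σ(w×x) / Σw
print(f"Weighted Mean: {weighted_mean:.1f}%")  # 85.1%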

Trimmed Mean

A trimmed mean removes a fixed percentage of extreme values from each end of the sorted data before calculating the mean. This makes it more robust to outliers.

Example: 10% trimmed mean of [2, 5, 7, 8, 9, 10, 12, 15, 18, 100]

Remove 10% from each end (1 value): Remove 2 and 100

Trimmed Mean = (5+7+8+9+10+12+15+18) / 8 = 10.5

Compare to regular mean: 18.6 (heavily influenced by 100)

2. Median (Middle Value)

The median is the middle value when data is arranged in order. It's particularly useful for skewed distributions.

Finding the Median

For odd n: Median = value at position (n+1)/2

For even n: Median = average of values at positions n/2 and (n/2)+1

Example: Finding Median

Odd count: Data: 12, 15, 18, 22, 30 (n=5)

Position (n+1)/2 = (5+1)/2 = 3 → Median = 18 (the 3rd value)

Even count: Data: 12, 15, 18, 22, 30, 35 (n=6)

Positions n/2 = 3 and (n/2)+1 = 4 → Median = (18+22)/2 = 20
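
Python's standard library handles both cases automatically; a quick sketch using the two datasets above:

# Python: median for odd and even counts (data from the example above)
import statistics

odd_data = [12, 15, 18, 22, 30]
even_data = [12, 15, 18, 22, 30, 35]

print(statistics.median(odd_data))   # 18   (the middle value)
print(statistics.median(even_data))  # 20.0 (average of 18 and 22)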

3. Mode (Most Frequent)

The mode is the value that appears most frequently in a dataset.

Example: Finding Mode

T-shirt sizes sold: S, M, M, L, M, XL, L, M, S, M

Count: S=2, M=5, L=2, XL=1

Mode = M (appears 5 times)

Data can be:

  • Unimodal: One mode (most common)
  • Bimodal: Two modes
  • Multimodal: More than two modes
  • No mode: All values appear equally

Effect of Outliers

Important: Outliers Affect the Mean!

Consider these salaries: Rs. 30K, 35K, 40K, 45K, 500K (CEO)

  • Mean: Rs. 130,000 (misleading!)
  • Median: Rs. 40,000 (better representation)

This is why income and housing prices are often reported as median, not mean.

When to Use Each Measure

Scenario | Best Measure | Reason
Symmetric data (test scores) | Mean | Uses all data, most precise
Income/Housing prices | Median | Skewed by high earners/luxury homes
Most popular product size | Mode | Finds the most common category
Data with outliers | Median or Trimmed Mean | Robust to extreme values

Excel Functions for Central Tendency

  • =AVERAGE(A1:A100) - Calculates mean
  • =MEDIAN(A1:A100) - Finds median
  • =MODE.SNGL(A1:A100) - Returns single mode
  • =MODE.MULT(A1:A100) - Returns multiple modes (array formula)
  • =TRIMMEAN(A1:A100, 0.1) - 10% trimmed mean

# Python: Measures of Central Tendency
import numpy as np
from scipy import stats

# Sample data: Employee salaries (note the Rs. 500,000 outlier)
salaries = [30000, 35000, 40000, 45000, 50000, 55000, 500000]

# Mean
mean_salary = np.mean(salaries)
print(f"Mean: Rs. {mean_salary:,.0f}")  # Rs. 107,857

# Median
median_salary = np.median(salaries)
print(f"Median: Rs. {median_salary:,.0f}")  # Rs. 45,000

# Mode (scipy returns the smallest value when all appear equally often)
mode_result = stats.mode(salaries, keepdims=True)
print(f"Mode: Rs. {mode_result.mode[0]:,.0f}")

# Trimmed Mean: with n=7, a 0.2 cut removes int(7 × 0.2) = 1 value
# from each end (30,000 and 500,000); a 0.1 cut would remove nothing here
trimmed_mean = stats.trim_mean(salaries, 0.2)
print(f"Trimmed Mean: Rs. {trimmed_mean:,.0f}")  # Rs. 45,000

Try It Yourself - Day 1 Exercises

  1. Calculate mean, median, and mode for: 15, 20, 20, 25, 30, 35, 100
  2. A store sells shoes in sizes: 7, 8, 8, 9, 9, 9, 10, 10, 11. What size should they stock most?
  3. Student scores with weights: Quiz (20%): 85, Midterm (30%): 78, Final (50%): 92. Find weighted average.
  4. When would you report median income instead of mean income?

Day 1 Key Takeaways

  • Mean is the arithmetic average - sensitive to outliers
  • Median is the middle value - robust to outliers
  • Mode is the most frequent value - best for categorical data
  • Choose the measure based on your data distribution and presence of outliers
  • Weighted mean accounts for different importance of values

Day 2

Measures of Dispersion

Understanding How Spread Out Your Data Is

Why Dispersion Matters

Knowing the center of your data is just half the story. Two datasets can have the same mean but very different spreads:

Same Mean, Different Spread

Dataset A: 48, 49, 50, 51, 52 → Mean = 50

Dataset B: 10, 30, 50, 70, 90 → Mean = 50

Both have mean 50, but Dataset B is much more spread out!

1. Range

The simplest measure of spread - the difference between the largest and smallest values.

Formula: Range

Range = Maximum Value - Minimum Value

Limitation: Range only uses two values and is heavily affected by outliers.

2. Interquartile Range (IQR)

IQR measures the spread of the middle 50% of data, making it robust to outliers.

Formula: Interquartile Range

IQR = Q3 - Q1

Where Q1 is the 25th percentile and Q3 is the 75th percentile.

Quartiles and Percentiles

Term | Position | Meaning
Q1 (First Quartile) | 25th percentile | 25% of data falls below this value
Q2 (Second Quartile) | 50th percentile | Median - 50% below, 50% above
Q3 (Third Quartile) | 75th percentile | 75% of data falls below this value

Example: Calculating IQR

Data (sorted): 12, 15, 18, 22, 25, 28, 30, 35, 40, 45, 50, 55

n = 12 values

Q1 position: (12+1) × 0.25 = 3.25 → interpolate between 3rd (18) and 4th (22): Q1 = 18 + 0.25 × (22-18) = 19

Q3 position: (12+1) × 0.75 = 9.75 → interpolate between 9th (40) and 10th (45): Q3 = 40 + 0.75 × (45-40) = 43.75

IQR = Q3 - Q1 = 43.75 - 19 = 24.75

3. Variance

Variance measures how far each data point is from the mean, on average. It's the foundation for standard deviation.

Formula: Variance

Population Variance (σ²):

σ² = Σ(xₖ - μ)² / N

Sample Variance (s²):

s² = Σ(xₖ - x̄)² / (n - 1)

Note: We use (n-1) for sample variance to get an unbiased estimate (Bessel's correction).

Example: Calculating Variance

Data: 4, 8, 6, 5, 7 (sample)

Step 1: Calculate mean: (4+8+6+5+7)/5 = 6

Step 2: Calculate squared deviations:

  • (4-6)² = 4
  • (8-6)² = 4
  • (6-6)² = 0
  • (5-6)² = 1
  • (7-6)² = 1

Step 3: Sum = 4+4+0+1+1 = 10

Step 4: Sample Variance = 10/(5-1) = 2.5
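
These four steps translate directly into a few lines of pure Python; a minimal sketch mirroring the calculation above:

# Python: sample variance step by step (data from the example above)
data = [4, 8, 6, 5, 7]
n = len(data)

mean = sum(data) / n                            # Step 1: mean = 6.0
squared_devs = [(x - mean) ** 2 for x in data]  # Step 2: [4, 4, 0, 1, 1]
total = sum(squared_devs)                       # Step 3: sum = 10.0
sample_variance = total / (n - 1)               # Step 4: 10 / 4 = 2.5

print(f"Sample variance: {sample_variance}")          # 2.5
print(f"Sample std dev: {sample_variance ** 0.5:.2f}")  # 1.58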

4. Standard Deviation

Standard deviation is the square root of variance. It's in the same units as the original data, making it easier to interpret.

Formula: Standard Deviation

σ = √(Variance) = √[Σ(xₖ - μ)² / N]

For sample: s = √[Σ(xₖ - x̄)² / (n-1)]

Interpreting Standard Deviation

  • Small SD: Data points are clustered close to the mean
  • Large SD: Data points are spread out from the mean
  • SD is always ≥ 0 (can't be negative)
  • SD = 0 means all values are identical

5. Coefficient of Variation (CV)

CV expresses standard deviation as a percentage of the mean, allowing comparison between datasets with different units or scales.

Formula: Coefficient of Variation

CV = (Standard Deviation / Mean) × 100%

Example: Comparing Variability

Stock A: Mean return = Rs. 100, SD = Rs. 20 → CV = 20%

Stock B: Mean return = Rs. 500, SD = Rs. 50 → CV = 10%

Stock B has higher absolute SD but lower relative variability (lower risk per unit return).
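
The same comparison as a small sketch (the cv helper below is just for illustration):

# Python: coefficient of variation comparison (numbers from the example)
def cv(std_dev, mean):
    return std_dev / mean * 100

print(f"Stock A CV: {cv(20, 100):.0f}%")  # 20%
print(f"Stock B CV: {cv(50, 500):.0f}%")  # 10%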

6. Five-Number Summary and Box Plots

The five-number summary provides a complete picture of data distribution:

Component | Description
Minimum | Smallest value
Q1 | First quartile (25th percentile)
Median (Q2) | Middle value (50th percentile)
Q3 | Third quartile (75th percentile)
Maximum | Largest value

Identifying Outliers with IQR

IQR Outlier Rule

Lower Bound = Q1 - 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR

Values outside these bounds are considered outliers.

Excel Functions for Dispersion

  • =MIN(A1:A100), =MAX(A1:A100) - Range components
  • =QUARTILE.INC(A1:A100, 1) - First quartile
  • =QUARTILE.INC(A1:A100, 3) - Third quartile
  • =VAR.S(A1:A100) - Sample variance
  • =VAR.P(A1:A100) - Population variance
  • =STDEV.S(A1:A100) - Sample standard deviation
  • =STDEV.P(A1:A100) - Population standard deviation
  • =PERCENTILE.INC(A1:A100, 0.9) - 90th percentile

# Python: Measures of Dispersion
import numpy as np

data = [45, 52, 48, 55, 60, 47, 53, 49, 58, 51]

# Range
data_range = np.max(data) - np.min(data)
print(f"Range: {data_range}")  # 15

# Quartiles and IQR
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")

# Variance and Standard Deviation
variance = np.var(data, ddof=1)  # ddof=1 for sample variance
std_dev = np.std(data, ddof=1)   # ddof=1 for sample std dev
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")

# Coefficient of Variation
cv = (std_dev / np.mean(data)) * 100
print(f"Coefficient of Variation: {cv:.2f}%")

# Five-Number Summary
print("\nFive-Number Summary:")
print(f"Min: {np.min(data)}")
print(f"Q1: {q1}")
print(f"Median: {np.median(data)}")
print(f"Q3: {q3}")
print(f"Max: {np.max(data)}")

# Outlier Detection (1.5 × IQR rule)
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_bound or x > upper_bound]
print(f"Outliers: {outliers}")  # [] - no outliers in this data

Try It Yourself - Day 2 Exercises

  1. Calculate range, variance, and standard deviation for: 10, 15, 20, 25, 30
  2. Find the five-number summary for: 2, 5, 7, 8, 12, 15, 18, 22, 35, 50
  3. Identify outliers using the IQR method for the above dataset
  4. Compare two products: Product A (mean=100, SD=15) vs Product B (mean=200, SD=25). Which has more relative variability?

Day 2 Key Takeaways

  • Range is simple but sensitive to outliers
  • IQR measures spread of middle 50% - robust to outliers
  • Variance measures average squared distance from mean
  • Standard deviation is in original units - easier to interpret
  • CV allows comparison across different scales
  • Use IQR rule (1.5 × IQR) to identify outliers

Day 3

Data Distributions & Shape

Understanding How Your Data is Distributed

Frequency Distributions and Histograms

A frequency distribution shows how often each value (or range of values) occurs in a dataset. Histograms visualize this distribution.

Example: Test Scores Frequency Distribution

Score Range | Frequency | Relative Frequency
50-59 | 5 | 10%
60-69 | 12 | 24%
70-79 | 18 | 36%
80-89 | 10 | 20%
90-100 | 5 | 10%
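
A sketch of how such a table can be built with numpy; the scores list here is invented for illustration:

# Python: frequency distribution with numpy (scores invented for illustration)
import numpy as np

scores = [52, 55, 61, 64, 67, 68, 71, 72, 73, 75, 76, 78, 81, 84, 88, 92, 95]
bins = [50, 60, 70, 80, 90, 101]  # 50-59, 60-69, 70-79, 80-89, 90-100

freq, edges = np.histogram(scores, bins=bins)
rel_freq = freq / freq.sum()

for i in range(len(freq)):
    print(f"{int(edges[i])}-{int(edges[i + 1]) - 1}: {freq[i]} ({rel_freq[i]:.0%})")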

Skewness: The Shape of Distribution

Skewness measures the asymmetry of a distribution around its mean.

Type | Description | Mean vs Median | Example
Left (Negative) Skewed | Tail extends to the left | Mean < Median | Age at retirement, easy test scores
Symmetric | Mirror image on both sides | Mean ≈ Median | Heights, IQ scores
Right (Positive) Skewed | Tail extends to the right | Mean > Median | Income, house prices, wait times

Remember: The Tail Tells the Tale!

The skew is named for the direction of the tail, not the peak.

  • Right-skewed: Long tail to the RIGHT → Mean pulled RIGHT (higher)
  • Left-skewed: Long tail to the LEFT → Mean pulled LEFT (lower)

Kurtosis: Peaks and Tails

Kurtosis measures the "tailedness" of a distribution - how much data is in the tails compared to a normal distribution.

Type | Kurtosis | Description
Mesokurtic | = 3 (or excess = 0) | Normal distribution (baseline)
Leptokurtic | > 3 (excess > 0) | Heavy tails, sharp peak, more outliers
Platykurtic | < 3 (excess < 0) | Light tails, flat peak, fewer outliers

The Normal Distribution

The normal (Gaussian) distribution is the most important distribution in statistics. It's bell-shaped and symmetric around the mean.

Properties of Normal Distribution:
  • Symmetric around the mean
  • Mean = Median = Mode
  • Defined by two parameters: mean (μ) and standard deviation (σ)
  • Total area under curve = 1 (100%)
  • Tails extend to infinity but never touch the x-axis

The Empirical Rule (68-95-99.7 Rule)

For normally distributed data:

  • 68% of data falls within 1 SD (μ ± 1σ)
  • 95% of data falls within 2 SD (μ ± 2σ)
  • 99.7% of data falls within 3 SD (μ ± 3σ)

Example: Applying the Empirical Rule

IQ scores are normally distributed with mean = 100 and SD = 15.

  • 68% of people have IQ between 85-115 (100 ± 15)
  • 95% of people have IQ between 70-130 (100 ± 30)
  • 99.7% of people have IQ between 55-145 (100 ± 45)

An IQ of 145+ is in the top 0.15% (very rare!)

Z-Scores and Standardization

A Z-score tells you how many standard deviations a value is from the mean.

Formula: Z-Score

Z = (X - μ) / σ

Where X is the value, μ is the mean, and σ is the standard deviation.

Interpreting Z-Scores

  • Z = 0: Value equals the mean
  • Z = 1: Value is 1 SD above the mean
  • Z = -2: Value is 2 SD below the mean
  • |Z| > 3: Value is an extreme outlier

Example: Calculating and Interpreting Z-Scores

Test scores: Mean = 75, SD = 10

Student A scored 85: Z = (85-75)/10 = 1.0 → 1 SD above average

Student B scored 60: Z = (60-75)/10 = -1.5 → 1.5 SD below average

Student C scored 95: Z = (95-75)/10 = 2.0 → Top ~2.5% (excellent!)

Why Standardize?

Z-scores allow you to:

  • Compare values from different distributions
  • Identify outliers (|Z| > 2 or 3)
  • Calculate probabilities using standard normal tables
  • Create a common scale for different variables

Excel Functions for Distribution Shape

  • =SKEW(A1:A100) - Calculate skewness
  • =KURT(A1:A100) - Calculate excess kurtosis
  • =STANDARDIZE(x, mean, std_dev) - Calculate Z-score
  • =NORM.DIST(x, mean, std_dev, TRUE) - Cumulative probability
  • =NORM.INV(probability, mean, std_dev) - Value for given percentile

# Python: Distribution Shape Analysis
import numpy as np
from scipy import stats

# Generate sample data
np.random.seed(42)
normal_data = np.random.normal(loc=100, scale=15, size=1000)  # IQ-like data

# Skewness and Kurtosis
skewness = stats.skew(normal_data)
kurtosis = stats.kurtosis(normal_data)  # Excess kurtosis
print(f"Skewness: {skewness:.3f}")         # Near 0 for normal data
print(f"Excess Kurtosis: {kurtosis:.3f}")  # Near 0 for normal data

# Z-Score Calculation
value = 130
mean = np.mean(normal_data)
std = np.std(normal_data)
z_score = (value - mean) / std
print(f"\nZ-score for {value}: {z_score:.2f}")

# Standardize all data
z_scores = stats.zscore(normal_data)

# Find percentage above a certain value
above_130 = np.sum(normal_data > 130) / len(normal_data) * 100
print(f"Percentage above 130: {above_130:.2f}%")

# Verify the Empirical Rule
within_1sd = np.sum(np.abs(z_scores) <= 1) / len(z_scores) * 100
within_2sd = np.sum(np.abs(z_scores) <= 2) / len(z_scores) * 100
within_3sd = np.sum(np.abs(z_scores) <= 3) / len(z_scores) * 100
print("\nEmpirical Rule Verification:")
print(f"Within 1 SD: {within_1sd:.1f}% (expected ~68%)")
print(f"Within 2 SD: {within_2sd:.1f}% (expected ~95%)")
print(f"Within 3 SD: {within_3sd:.1f}% (expected ~99.7%)")

Try It Yourself - Day 3 Exercises

  1. Heights of adults are normally distributed with mean=170cm, SD=10cm. What percentage are between 160cm and 180cm?
  2. Calculate Z-scores for students with scores 45, 72, 88 if mean=70 and SD=12. Who performed best relative to the class?
  3. A dataset has mean=50 and median=62. Is it left-skewed, right-skewed, or symmetric?
  4. If SAT scores have mean=1050, SD=200, what score puts you in the top 2.5%?

Day 3 Key Takeaways

  • Histograms visualize frequency distributions
  • Skewness shows direction of tail; affects mean vs median relationship
  • Kurtosis measures tail heaviness and peak sharpness
  • Normal distribution: symmetric, bell-shaped, defined by mean and SD
  • Empirical Rule: 68-95-99.7% within 1-2-3 standard deviations
  • Z-scores standardize data for comparison and probability calculation

Day 4

Probability Fundamentals

The Mathematics of Uncertainty

What is Probability?

Probability is a measure of how likely an event is to occur. It ranges from 0 (impossible) to 1 (certain), often expressed as a percentage (0% to 100%).

Types of Probability

Type | Definition | Example
Classical (Theoretical) | Based on equally likely outcomes | P(heads) = 1/2 for a fair coin
Empirical (Experimental) | Based on observed data | 30% of customers buy Product A (from past data)
Subjective | Based on judgment/experience | "I think there's a 70% chance it will rain"

Basic Probability Formula

P(A) = Number of favorable outcomes / Total number of possible outcomes

Example: Rolling a Die

What's the probability of rolling a number greater than 4?

Favorable outcomes: {5, 6} = 2 outcomes

Total outcomes: {1, 2, 3, 4, 5, 6} = 6 outcomes

P(X > 4) = 2/6 = 1/3 ≈ 33.3%

Complement Rule

The probability of an event NOT happening is 1 minus the probability it does happen.

Complement Rule

P(not A) = P(A') = 1 - P(A)

Example: Complement

If P(rain) = 0.3, then P(no rain) = 1 - 0.3 = 0.7 (70%)

Addition Rule (OR)

For finding the probability of event A OR event B occurring.

Addition Rule

For mutually exclusive events:

P(A or B) = P(A) + P(B)

For non-mutually exclusive events:

P(A or B) = P(A) + P(B) - P(A and B)

Example: Addition Rule

Mutually Exclusive: P(rolling 1 OR 6) = 1/6 + 1/6 = 2/6 = 1/3

Non-Mutually Exclusive: In a class, 40% play cricket, 30% play football, 15% play both.

P(cricket OR football) = 0.40 + 0.30 - 0.15 = 0.55 (55%)

Multiplication Rule (AND)

For finding the probability of event A AND event B both occurring.

Multiplication Rule

For independent events:

P(A and B) = P(A) × P(B)

For dependent events:

P(A and B) = P(A) × P(B|A)

Example: Multiplication Rule

Independent: Flip two coins. P(both heads) = 1/2 × 1/2 = 1/4

Dependent: Draw 2 cards without replacement. P(both aces)?

P(1st ace) = 4/52, P(2nd ace | 1st was ace) = 3/51

P(both aces) = (4/52) × (3/51) = 12/2652 = 1/221 ≈ 0.45%
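
Python's fractions module keeps this calculation exact; a small sketch of the card example:

# Python: dependent events with exact fractions (card example above)
from fractions import Fraction

p_first_ace = Fraction(4, 52)           # reduces to 1/13
p_second_given_first = Fraction(3, 51)  # reduces to 1/17
p_both_aces = p_first_ace * p_second_given_first
print(p_both_aces, f"≈ {float(p_both_aces):.2%}")  # 1/221 ≈ 0.45%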

Conditional Probability

The probability of A given that B has already occurred.

Conditional Probability Formula

P(A|B) = P(A and B) / P(B)

Read as "probability of A given B"

Example: Conditional Probability

In a company: 60% are developers, 40% are managers. Among developers, 30% are senior. Among managers, 50% are senior.

Given that someone is senior, what's the probability they're a developer?

P(Senior and Developer) = 0.60 × 0.30 = 0.18

P(Senior and Manager) = 0.40 × 0.50 = 0.20

P(Senior) = 0.18 + 0.20 = 0.38

P(Developer|Senior) = 0.18 / 0.38 = 47.4%
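
The same calculation as a short sketch, with variable names invented for readability:

# Python: conditional probability (numbers from the example above)
p_dev, p_mgr = 0.60, 0.40
p_senior_given_dev = 0.30
p_senior_given_mgr = 0.50

# Total probability of being senior
p_senior = p_dev * p_senior_given_dev + p_mgr * p_senior_given_mgr  # 0.38

# P(Developer | Senior) = P(Senior and Developer) / P(Senior)
p_dev_given_senior = (p_dev * p_senior_given_dev) / p_senior
print(f"P(Developer | Senior) = {p_dev_given_senior:.1%}")  # 47.4%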

Bayes' Theorem

Bayes' theorem allows us to update probabilities based on new evidence.

Bayes' Theorem

P(A|B) = [P(B|A) × P(A)] / P(B)

This lets us "reverse" conditional probabilities.

Example: Medical Testing (Classic Bayes' Application)

A disease affects 1% of the population. A test is 99% accurate (detects disease when present) and has a 5% false positive rate.

If you test positive, what's the probability you have the disease?

  • P(Disease) = 0.01
  • P(Positive|Disease) = 0.99
  • P(Positive|No Disease) = 0.05

P(Positive) = P(Pos|Disease)×P(Disease) + P(Pos|No Disease)×P(No Disease)

P(Positive) = 0.99×0.01 + 0.05×0.99 = 0.0099 + 0.0495 = 0.0594

P(Disease|Positive) = (0.99 × 0.01) / 0.0594 = 16.7%

Even with a positive test, there's only a 16.7% chance of actually having the disease!

Base Rate Fallacy

The example above shows why we must consider the base rate (prevalence). A "99% accurate" test can still give mostly false positives when the condition is rare!

# Python: Probability Calculations
import numpy as np

# Basic Probability - Rolling a die
outcomes = [1, 2, 3, 4, 5, 6]
favorable = [x for x in outcomes if x > 4]
prob = len(favorable) / len(outcomes)
print(f"P(X > 4) = {prob:.4f}")

# Multiplication Rule - Coin flips (independent events)
def prob_all_heads(n_flips):
    return 0.5 ** n_flips

print(f"P(3 heads in a row) = {prob_all_heads(3):.4f}")

# Bayes' Theorem - Medical Test
p_disease = 0.01
p_positive_given_disease = 0.99
p_positive_given_no_disease = 0.05

# Calculate P(Positive) via the law of total probability
p_positive = (p_positive_given_disease * p_disease
              + p_positive_given_no_disease * (1 - p_disease))

# Apply Bayes' Theorem
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive
print("\nBayes' Theorem - Medical Test:")
print(f"P(Disease | Positive) = {p_disease_given_positive:.2%}")

# Simulation verification
np.random.seed(42)
population = 100000
has_disease = np.random.random(population) < p_disease
test_positive = np.where(
    has_disease,
    np.random.random(population) < p_positive_given_disease,
    np.random.random(population) < p_positive_given_no_disease
)
simulated_prob = np.sum(has_disease & test_positive) / np.sum(test_positive)
print(f"Simulated P(Disease | Positive) = {simulated_prob:.2%}")

Try It Yourself - Day 4 Exercises

  1. What's the probability of drawing a King OR a Heart from a standard deck?
  2. If P(sunny) = 0.7 and P(traffic|sunny) = 0.3, P(traffic|not sunny) = 0.6, what's P(traffic)?
  3. A spam filter is 98% accurate at detecting spam and has a 3% false positive rate. If 20% of emails are spam, what's P(spam|flagged)?
  4. You flip 4 coins. What's the probability of getting exactly 2 heads?

Day 4 Key Takeaways

  • Probability ranges from 0 to 1 (or 0% to 100%)
  • Complement Rule: P(not A) = 1 - P(A)
  • Addition Rule: For OR, add probabilities (subtract overlap if not mutually exclusive)
  • Multiplication Rule: For AND, multiply probabilities (use conditional if dependent)
  • Conditional Probability: P(A|B) = P(A and B) / P(B)
  • Bayes' Theorem: Update beliefs with new evidence; beware of base rate fallacy

Day 5

Common Probability Distributions

Patterns in Random Events

Discrete vs Continuous Distributions

Discrete | Continuous
Countable outcomes (0, 1, 2, 3...) | Infinite outcomes in a range
Number of customers, defects, sales | Height, weight, time, temperature
Probability Mass Function (PMF) | Probability Density Function (PDF)
Binomial, Poisson | Normal, Exponential

1. Binomial Distribution

Models the number of successes in a fixed number of independent trials, each with the same probability of success.

Binomial Distribution

Use when: Fixed number of trials, two outcomes (success/failure), constant probability, independent trials.

Parameters: n (number of trials), p (probability of success)

Mean: μ = n × p

Variance: σ² = n × p × (1-p)

P(X = k) = C(n,k) × pᵏ × (1-p)ⁿ⁻ᵏ

Where C(n,k) = n! / (k! × (n-k)!)

Example: Customer Conversion

A website has a 10% conversion rate. Out of 20 visitors, what's the probability exactly 3 will convert?

n = 20, p = 0.10, k = 3

C(20,3) = 1140

P(X=3) = 1140 × (0.10)³ × (0.90)¹⁷ ≈ 0.190 (19%)

Expected conversions: 20 × 0.10 = 2
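
A quick verification with Python's math.comb; a minimal sketch of the formula above:

# Python: binomial probability by hand (numbers from the example above)
from math import comb

n, p, k = 20, 0.10, 3
prob = comb(n, k) * p**k * (1 - p)**(n - k)  # C(20,3) = 1140
print(f"P(X=3) = {prob:.3f}")                # 0.190
print(f"Expected conversions: {n * p}")      # 2.0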

2. Poisson Distribution

Models the number of events occurring in a fixed interval of time or space, when events occur independently at a constant rate.

Poisson Distribution

Use when: Counting events in fixed time/space, events are rare, independent, rate is constant.

Parameter: λ (lambda) = average rate of occurrence

Mean: μ = λ

Variance: σ² = λ

P(X = k) = (λᵏ × e^(-λ)) / k!

Example: Customer Service Calls

A call center receives an average of 5 calls per hour. What's the probability of receiving exactly 3 calls in an hour?

λ = 5, k = 3

P(X=3) = (5³ × e^(-5)) / 3! = (125 × 0.0067) / 6 ≈ 0.140 (14%)

3. Normal Distribution (Revisited)

The most important continuous distribution - many natural phenomena follow it.

Normal Distribution

Use when: Continuous data, symmetric distribution, result of many small independent factors.

Parameters: μ (mean), σ (standard deviation)

Properties: Symmetric, mean=median=mode, 68-95-99.7 rule applies.

f(x) = (1 / (σ√(2π))) × e^(-½((x-μ)/σ)²)

Example: Test Scores

Exam scores are normally distributed with mean=72, SD=8.

What percentage of students scored above 80?

Z = (80-72)/8 = 1.0

P(Z > 1.0) = 1 - P(Z ≤ 1.0) = 1 - 0.8413 = 0.1587 (15.87%)

4. Exponential Distribution

Models the time between events in a Poisson process - waiting times, lifetimes, etc.

Exponential Distribution

Use when: Time until next event, memoryless process (past doesn't affect future).

Parameter: λ (the event rate)

Mean: μ = 1/λ

Variance: σ² = 1/λ²

P(X ≤ x) = 1 - e^(-λx)

Example: Time Between Customers

Customers arrive at a rate of 10 per hour (λ = 10). What's the probability the next customer arrives within 5 minutes (1/12 hour)?

P(X ≤ 1/12) = 1 - e^(-10 × 1/12) = 1 - e^(-0.833)

P(X ≤ 1/12) = 1 - 0.435 = 0.565 (56.5%)

Choosing the Right Distribution

Question | Distribution | Example
How many successes in n trials? | Binomial | Defects in a batch of 100
How many events in fixed time? | Poisson | Emails per hour
Where does this measurement fall? | Normal | Student height, test scores
How long until next event? | Exponential | Time until next customer

# Python: Working with Distributions
from scipy import stats

# Binomial Distribution
n, p = 20, 0.10
binomial = stats.binom(n, p)
print("Binomial (n=20, p=0.1):")
print(f"  P(X=3) = {binomial.pmf(3):.4f}")
print(f"  P(X<=2) = {binomial.cdf(2):.4f}")
print(f"  Mean = {binomial.mean()}, Var = {binomial.var():.2f}")

# Poisson Distribution
lambda_rate = 5
poisson = stats.poisson(lambda_rate)
print("\nPoisson (λ=5):")
print(f"  P(X=3) = {poisson.pmf(3):.4f}")
print(f"  P(X>=7) = {1 - poisson.cdf(6):.4f}")

# Normal Distribution
mean, std = 72, 8
normal = stats.norm(mean, std)
print("\nNormal (μ=72, σ=8):")
print(f"  P(X>80) = {1 - normal.cdf(80):.4f}")
print(f"  90th percentile = {normal.ppf(0.90):.2f}")

# Exponential Distribution (scipy parameterizes by scale = 1/λ)
rate = 10  # 10 per hour
exponential = stats.expon(scale=1/rate)
print("\nExponential (λ=10 per hour):")
print(f"  P(X<=5min) = {exponential.cdf(1/12):.4f}")
print(f"  Mean time = {exponential.mean()*60:.1f} minutes")

Excel Functions for Distributions

  • =BINOM.DIST(k, n, p, FALSE) - Binomial P(X=k)
  • =BINOM.DIST(k, n, p, TRUE) - Binomial P(X≤k)
  • =POISSON.DIST(k, lambda, FALSE) - Poisson P(X=k)
  • =NORM.DIST(x, mean, std, TRUE) - Normal P(X≤x)
  • =NORM.INV(prob, mean, std) - Normal percentile
  • =EXPON.DIST(x, lambda, TRUE) - Exponential P(X≤x)

Try It Yourself - Day 5 Exercises

  1. A factory has 5% defect rate. In a sample of 50, what's P(exactly 3 defects)?
  2. A website gets 100 visits per hour. What's P(more than 120 visits)?
  3. Heights are normal with mean=165cm, SD=7cm. What height is the 95th percentile?
  4. If buses arrive every 15 minutes on average, what's P(waiting more than 20 minutes)?

Day 5 Key Takeaways

  • Binomial: Fixed trials, count successes (defects, conversions, yes/no)
  • Poisson: Count events in fixed time/space (calls, arrivals, errors)
  • Normal: Continuous measurements, symmetric (heights, scores, errors)
  • Exponential: Time between events, memoryless (wait times, lifetimes)
  • Match the distribution to the question being asked

Day 6

Sampling & Estimation

Making Inferences About Populations

Population vs Sample

  • Population: The entire group you want to study (e.g., all customers, all products)
  • Sample: A subset of the population used to make inferences
  • Parameter: A measure describing the population (μ, σ)
  • Statistic: A measure calculated from a sample (x̄, s)

Why sample? Studying entire populations is often impossible, too expensive, or destructive (quality testing).

Sampling Methods

Method | Description | When to Use
Simple Random | Every member has an equal chance of selection | Homogeneous population
Stratified | Divide into groups (strata), sample from each | Population has distinct subgroups
Cluster | Divide into clusters, randomly select clusters | Population spread geographically
Systematic | Select every k-th member | Ordered list available
Convenience | Use whoever is available | Quick, informal insights (biased!)

Example: Choosing a Sampling Method

Scenario: Survey customer satisfaction for a bank with branches across India.

Simple Random: Randomly select 1000 customers from the entire customer database.

Stratified: Sample proportionally from different account types (Savings, Current, Fixed Deposit).

Cluster: Randomly select 50 branches, survey all customers at those branches.

Sampling Distribution of the Mean

If we take many samples and calculate the mean of each, these sample means form their own distribution.

Properties of Sampling Distribution

  • Center: Mean of sample means = Population mean (E(x̄) = μ)
  • Spread: Standard error = σ / √n (smaller than population SD)
  • Shape: Approximately normal (if n is large enough - CLT)

Standard Error of the Mean

SE = σ / √n

As sample size increases, standard error decreases (estimates become more precise).
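
A tiny sketch of this relationship, assuming a population SD of 50 for illustration:

# Python: standard error vs sample size (population SD of 50 is assumed)
import numpy as np

sigma = 50
for n in [25, 100, 400]:
    print(f"n = {n}: SE = {sigma / np.sqrt(n):.1f}")
# n = 25: SE = 10.0 | n = 100: SE = 5.0 | n = 400: SE = 2.5
# Quadrupling the sample size halves the standard error.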

Central Limit Theorem (CLT)

One of the most important theorems in statistics!

Central Limit Theorem: Regardless of the population distribution, the sampling distribution of the mean approaches a normal distribution as sample size increases (typically n ≥ 30).

CLT in Action

Rolling a die has a uniform distribution (each number equally likely).

Population mean = 3.5, Population SD = 1.71

If we take samples of n=30 and calculate means:

  • Mean of sample means ≈ 3.5
  • Standard error = 1.71/√30 = 0.31
  • Distribution of sample means is approximately normal!
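
A short simulation sketch of this die example (the seed and sample counts are arbitrary):

# Python: CLT with die rolls (uniform population, as in the example above)
import numpy as np

rng = np.random.default_rng(42)
# 10,000 samples of n=30 rolls each; take the mean of every sample
sample_means = rng.integers(1, 7, size=(10_000, 30)).mean(axis=1)

print(f"Mean of sample means: {sample_means.mean():.2f}")  # ≈ 3.5
print(f"Standard error: {sample_means.std():.2f}")         # ≈ 1.71/√30 ≈ 0.31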

Confidence Intervals

A confidence interval gives a range of plausible values for a population parameter.

Confidence Interval for Mean (known σ)

CI = x̄ ± Z × (σ / √n)

Common Z values: 90% CI → Z=1.645, 95% CI → Z=1.96, 99% CI → Z=2.576

Confidence Interval for Mean (unknown σ, use t-distribution)

CI = x̄ ± t × (s / √n)

Use t-distribution with (n-1) degrees of freedom when using sample standard deviation.

Example: Calculating a 95% Confidence Interval

A sample of 36 customers shows average spend of Rs. 500 with SD of Rs. 120.

n = 36, x̄ = 500, s = 120

Standard Error = 120/√36 = 20

For 95% CI with df=35, t ≈ 2.03

CI = 500 ± 2.03 × 20 = 500 ± 40.6

95% CI: (Rs. 459.40, Rs. 540.60)

Interpretation: We are 95% confident the true average spend is between Rs. 459.40 and Rs. 540.60.

Common Misconception

A 95% CI does NOT mean there's a 95% probability the true mean is in this interval. The true mean is fixed - it either is or isn't in the interval.

Instead: If we repeated this sampling process many times, 95% of the intervals would contain the true mean.

Factors Affecting Confidence Interval Width

Factor | Effect on CI Width
Increase sample size (n) | Narrower CI (more precise)
Increase confidence level | Wider CI (more certain)
Larger population SD | Wider CI (more variability)

# Python: Sampling and Confidence Intervals
import numpy as np
from scipy import stats

# Simulate the Central Limit Theorem
np.random.seed(42)
population = np.random.exponential(scale=100, size=100000)  # Skewed population!

# Take many samples and calculate their means
sample_size = 30
n_samples = 1000
sample_means = [np.mean(np.random.choice(population, sample_size))
                for _ in range(n_samples)]

print("Central Limit Theorem Demonstration:")
print(f"Population mean: {np.mean(population):.2f}")
print(f"Mean of sample means: {np.mean(sample_means):.2f}")
print(f"Theoretical SE: {np.std(population)/np.sqrt(sample_size):.2f}")
print(f"Actual SE: {np.std(sample_means):.2f}")

# Confidence Interval Calculation
sample = np.random.choice(population, 36)
x_bar = np.mean(sample)
s = np.std(sample, ddof=1)
n = len(sample)
se = s / np.sqrt(n)

# 95% CI using the t-distribution
confidence = 0.95
t_critical = stats.t.ppf((1 + confidence) / 2, df=n-1)
margin_of_error = t_critical * se
ci_lower = x_bar - margin_of_error
ci_upper = x_bar + margin_of_error

print("\n95% Confidence Interval:")
print(f"Sample mean: {x_bar:.2f}")
print(f"CI: ({ci_lower:.2f}, {ci_upper:.2f})")

# Using scipy's built-in function
ci = stats.t.interval(confidence, df=n-1, loc=x_bar, scale=se)
print(f"SciPy CI: ({ci[0]:.2f}, {ci[1]:.2f})")

Excel Functions for Sampling & Estimation

  • =CONFIDENCE.T(alpha, std_dev, size) - Margin of error (t-distribution)
  • =CONFIDENCE.NORM(alpha, std_dev, size) - Margin of error (normal)
  • =T.INV.2T(alpha, df) - Two-tailed t critical value
  • =NORM.S.INV(probability) - Z critical value

Try It Yourself - Day 6 Exercises

  1. A sample of 49 products has mean weight 100g, SD 14g. Calculate 95% CI for true mean weight.
  2. If you want to cut the margin of error in half, how should sample size change?
  3. Why would stratified sampling be better than simple random sampling for a salary survey across different departments?
  4. Population SD is 50. How large a sample is needed for a margin of error of 5 at 95% confidence?

Day 6 Key Takeaways

  • Samples allow inference about populations without measuring everyone
  • Different sampling methods suit different situations
  • Central Limit Theorem: Sample means are approximately normal for large n
  • Standard error measures precision: SE = σ/√n
  • Confidence intervals give a range of plausible values for parameters
  • Larger samples give narrower (more precise) confidence intervals

Day 7

Week 4 Project & Assessment

Apply Your Statistical Knowledge

Project: Statistical Analysis Report

Create a comprehensive statistical analysis of a business dataset, applying all concepts learned this week.

Project Requirements

  1. Dataset: Use the provided employee salary dataset or your own data (minimum 100 records)
  2. Descriptive Statistics:
    • Calculate mean, median, mode for key variables
    • Calculate range, IQR, variance, standard deviation
    • Create five-number summary and identify outliers
  3. Distribution Analysis:
    • Create histograms for numeric variables
    • Calculate skewness and kurtosis
    • Test for normality visually
    • Calculate and interpret Z-scores
  4. Probability Calculations:
    • Calculate relevant probabilities using appropriate distributions
    • Apply conditional probability to a business question
  5. Sampling & Estimation:
    • Calculate 95% confidence interval for mean salary
    • Explain sampling strategy recommendations
  6. Insights & Recommendations:
    • Summarize key findings
    • Provide data-driven recommendations

Sample Project Structure

Statistical Analysis: Employee Compensation Study

1. Executive Summary

Brief overview of findings...

2. Data Overview

Dataset description, variables, data quality...

3. Descriptive Statistics

Measure | Salary (Rs.) | Experience (Years)
Mean | 65,000 | 7.2
Median | 55,000 | 6.0
Std Dev | 25,000 | 4.5

4. Distribution Analysis

Salary is right-skewed (skewness = 1.2), suggesting a few high earners...

5. Key Findings & Recommendations

Based on our analysis, we recommend...

Assessment Breakdown

Component | Weight | Criteria
Statistical Analysis Report | 50% | Completeness, accuracy, insights
Quiz: Statistics Fundamentals | 30% | Conceptual understanding
Problem-Solving Exercises | 20% | Calculation accuracy, interpretation

Week 4 Summary

  • Day 1: Central tendency - Mean, Median, Mode for summarizing data
  • Day 2: Dispersion - Range, IQR, Variance, SD for measuring spread
  • Day 3: Distributions - Shape, Skewness, Normal distribution, Z-scores
  • Day 4: Probability - Rules, Conditional probability, Bayes' theorem
  • Day 5: Distributions - Binomial, Poisson, Normal, Exponential
  • Day 6: Sampling - Methods, CLT, Confidence intervals
  • Day 7: Project applying all concepts

Self-Assessment Quiz

Test your understanding of this week's concepts:

1. A dataset has mean=50 and median=65. The distribution is:

  • a) Left-skewed (negative skew)
  • b) Right-skewed (positive skew)
  • c) Symmetric
  • d) Cannot determine

2. Which measure of central tendency is most affected by outliers?

  • a) Mode
  • b) Median
  • c) Mean
  • d) All equally affected

3. According to the Empirical Rule, what percentage of data falls within 2 standard deviations of the mean in a normal distribution?

  • a) 68%
  • b) 95%
  • c) 99.7%
  • d) 100%

4. A student scored 85 on a test where mean=75 and SD=5. What is their Z-score?

  • a) 1.0
  • b) 2.0
  • c) -2.0
  • d) 10.0

5. If P(A)=0.4 and P(B)=0.3, and A and B are independent, what is P(A and B)?

  • a) 0.7
  • b) 0.12
  • c) 0.1
  • d) 0.58

6. Which distribution would you use to model the number of customer arrivals per hour?

  • a) Normal
  • b) Binomial
  • c) Poisson
  • d) Exponential

7. As sample size increases, what happens to the standard error?

  • a) Increases
  • b) Decreases
  • c) Stays the same
  • d) Becomes zero

8. A 95% confidence interval means:

  • a) 95% probability the true mean is in this interval
  • b) 95% of data falls in this interval
  • c) If repeated many times, 95% of intervals would contain true mean
  • d) The mean is exactly at the center

9. Which sampling method divides population into subgroups and samples from each?

  • a) Simple random
  • b) Stratified
  • c) Cluster
  • d) Systematic

10. The Central Limit Theorem states that:

  • a) All populations are normally distributed
  • b) Sample means approach normal distribution as n increases
  • c) Larger samples have larger variance
  • d) Standard deviation equals standard error

Quiz Answers

1-a, 2-c, 3-b, 4-b, 5-b, 6-c, 7-b, 8-c, 9-b, 10-b

Scoring: 9-10 correct: Excellent! | 7-8: Good understanding | 5-6: Review needed | Below 5: Re-study the material

About the Instructor

Pawan Mali

Data Analyst & Statistics Educator | 5+ Years Experience

Pawan brings extensive experience in statistical analysis and data-driven decision making. Having worked with companies across finance, healthcare, and e-commerce sectors, he specializes in making complex statistical concepts accessible and practical.

Specializations:

  • Statistical Analysis & Hypothesis Testing
  • Data Visualization & Business Intelligence
  • Python, R, Excel for Data Analysis
  • Machine Learning Fundamentals
