Week 5: Advanced Statistics & Exploratory Data Analysis

Hypothesis Testing, EDA Techniques & Data Storytelling

Duration: 7 Days | Level: Intermediate | Prerequisites: Week 4 Statistics

Welcome to Week 5! You've built a solid foundation in statistics - now it's time to apply those skills to real-world analysis. This week, you'll learn how to test hypotheses, explore data systematically, and communicate findings effectively.

By the end of this week, you'll be able to design experiments, analyze A/B tests, perform comprehensive exploratory data analysis, and present your findings in a compelling way.

Day 1

Hypothesis Testing Fundamentals

Making Data-Driven Decisions with Confidence

What is Hypothesis Testing?

Hypothesis Testing is a statistical method used to make decisions about a population based on sample data. It helps us determine whether observed differences are statistically significant or just due to random chance.

Think of hypothesis testing as a courtroom trial: the defendant is presumed innocent (the null hypothesis), and we convict only when the evidence for guilt (the alternative hypothesis) is strong enough.

Null and Alternative Hypotheses

H₀ (Null Hypothesis)

The "status quo" - no effect, no difference

Example: "The new website design has NO effect on conversion rates"

H₁ (Alternative Hypothesis)

What we're trying to prove - there IS an effect

Example: "The new website design DOES affect conversion rates"

Type I and Type II Errors

Type I Error (False Positive)

Rejecting H₀ when it's actually true

Example: Concluding a drug works when it doesn't

Probability = α (significance level)

Type II Error (False Negative)

Failing to reject H₀ when it's actually false

Example: Missing a real effect that exists

Probability = β

P-values and Significance Levels

P-value: The probability of observing results at least as extreme as the sample results, assuming the null hypothesis is true.

Significance Level (α): The threshold for rejecting H₀, typically 0.05 (5%).

Decision Rule

  • If p-value ≤ α: Reject H₀ (result is statistically significant)
  • If p-value > α: Fail to reject H₀ (result is not statistically significant)

One-tailed vs Two-tailed Tests

Type | When to Use | Example H₁
Two-tailed | Testing for any difference (could be higher OR lower) | μ ≠ 50 (mean is different from 50)
One-tailed (Right) | Testing if the value is greater than a benchmark | μ > 50 (mean is greater than 50)
One-tailed (Left) | Testing if the value is less than a benchmark | μ < 50 (mean is less than 50)
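
In SciPy this choice is made through the `alternative` argument of the t-test functions (available since SciPy 1.6). A minimal sketch, reusing the Day 1 sample data and a hypothesized mean of 50:

# Python sketch: one-tailed vs two-tailed tests (requires SciPy >= 1.6)
from scipy import stats

sample = [52, 48, 55, 51, 49, 53, 50, 54, 47, 52]  # example data
mu0 = 50                                            # hypothesized mean

# Two-tailed: H₁ is μ ≠ 50
t_two, p_two = stats.ttest_1samp(sample, mu0, alternative='two-sided')

# One-tailed (right): H₁ is μ > 50
t_right, p_right = stats.ttest_1samp(sample, mu0, alternative='greater')

# One-tailed (left): H₁ is μ < 50
t_left, p_left = stats.ttest_1samp(sample, mu0, alternative='less')

print(f"Two-tailed p:   {p_two:.4f}")
print(f"Right-tailed p: {p_right:.4f}")
print(f"Left-tailed p:  {p_left:.4f}")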

Steps in Hypothesis Testing

  1. State the hypotheses: Define H₀ and H₁
  2. Choose significance level: Typically α = 0.05
  3. Select the appropriate test: Based on data type and sample size
  4. Calculate the test statistic: Z-score, t-statistic, etc.
  5. Find the p-value: Probability of observing the result
  6. Make a decision: Reject or fail to reject H₀
  7. Interpret results: In context of the business problem

Example: Testing a Marketing Campaign

Scenario: A company claims their new email campaign increases click-through rates above the historical average of 2.5%.

H₀: μ = 2.5% (no improvement)

H₁: μ > 2.5% (improvement exists)

Sample: 1,000 emails sent, 35 clicks (3.5%)

Test: One-proportion z-test (one-tailed)

Result: z ≈ 2.03, p-value ≈ 0.021

Decision: Since p < 0.05, reject H₀. The data provide evidence that the new campaign improves click-through rates.
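
A quick way to check these numbers is to compute the one-proportion z-test directly from the normal distribution; a minimal sketch (figures may differ slightly due to rounding):

# Python sketch: one-proportion z-test for the email campaign example
import numpy as np
from scipy import stats

p0 = 0.025                      # historical click-through rate (H₀)
clicks, n = 35, 1000
p_hat = clicks / n

se = np.sqrt(p0 * (1 - p0) / n)  # standard error under H₀
z = (p_hat - p0) / se
p_value = 1 - stats.norm.cdf(z)  # one-tailed (right) p-value

print(f"z = {z:.2f}, p = {p_value:.4f}")  # roughly z ≈ 2.03, p ≈ 0.02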

# Python: Hypothesis Testing Example
from scipy import stats
import numpy as np

# One-sample t-test
# Testing if sample mean differs from population mean
sample_data = [52, 48, 55, 51, 49, 53, 50, 54, 47, 52]
population_mean = 50

t_stat, p_value = stats.ttest_1samp(sample_data, population_mean)

print(f"Sample Mean: {np.mean(sample_data):.2f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Result: Reject H₀ - Significant difference found")
else:
    print("Result: Fail to reject H₀ - No significant difference")

Try It Yourself - Day 1 Activities

  1. Formulate H₀ and H₁ for: "Does a new training program improve employee productivity?"
  2. A test gives p-value = 0.03 with α = 0.05. What's your conclusion?
  3. Explain the difference between Type I and Type II errors in a medical testing context

Day 1 Key Takeaways

  • Hypothesis testing helps make data-driven decisions
  • H₀ represents "no effect"; H₁ represents what we're testing
  • P-value measures evidence against H₀
  • Reject H₀ when p-value ≤ α (typically 0.05)
  • Type I error: false positive; Type II error: false negative

Day 2

Statistical Tests

Choosing and Applying the Right Test

Z-Test for Large Samples

Use Z-test when: sample size is large (n ≥ 30) and population standard deviation is known.

Z-Test Formula

Z = (x̄ - μ) / (σ / √n)

Where: x̄ = sample mean, μ = population mean, σ = population SD, n = sample size
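
A minimal sketch of this formula in Python; the sample values below are illustrative, not from a real dataset:

# Python sketch: z-test computed directly from the formula above
import numpy as np
from scipy import stats

x_bar = 52.5   # sample mean (illustrative)
mu = 50        # population mean under H₀
sigma = 8      # known population standard deviation
n = 64         # sample size (n ≥ 30)

z = (x_bar - mu) / (sigma / np.sqrt(n))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed p-value

print(f"Z = {z:.2f}, p = {p_value:.4f}")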

T-Tests

Use T-test when: sample size is small (n < 30) or population SD is unknown.

T-Test Type | Use Case | Example
One-Sample | Compare sample mean to a known value | Is average delivery time different from 3 days?
Two-Sample (Independent) | Compare means of two independent groups | Do male and female customers spend differently?
Paired | Compare means from the same group at different times | Did training improve employee scores?
# Python: Different T-Tests
from scipy import stats

# Two-Sample Independent T-Test
group_a = [85, 90, 78, 92, 88, 76, 95, 89]
group_b = [78, 82, 75, 80, 77, 83, 79, 81]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"Two-Sample T-Test: t={t_stat:.3f}, p={p_value:.4f}")

# Paired T-Test (before/after)
before = [200, 220, 190, 210, 205]
after = [180, 195, 170, 185, 175]
t_stat, p_value = stats.ttest_rel(before, after)
print(f"Paired T-Test: t={t_stat:.3f}, p={p_value:.4f}")

Chi-Square Test for Independence

Tests whether two categorical variables are independent or related.

Example: Customer Preference by Region

Is product preference independent of region?

Region | Product A | Product B | Product C
North | 50 | 30 | 20
South | 35 | 45 | 20
West | 40 | 35 | 25

Chi-square test determines if the differences are statistically significant.
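
A minimal sketch of this test in Python on the table above, using scipy's chi2_contingency:

# Python sketch: chi-square test of independence on the region/product table
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [50, 30, 20],  # North
    [35, 45, 20],  # South
    [40, 35, 25],  # West
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Preference appears to depend on region")
else:
    print("No evidence that preference depends on region")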

ANOVA: Analysis of Variance

Compares means across three or more groups to see if at least one is significantly different.

When to Use ANOVA

  • Comparing sales across multiple regions
  • Testing if different treatments have different effects
  • Comparing customer satisfaction across product categories
# Python: ANOVA
from scipy import stats

# Sales data from three regions
region_a = [120, 135, 128, 142, 130]
region_b = [145, 150, 138, 155, 148]
region_c = [110, 115, 122, 108, 118]

f_stat, p_value = stats.f_oneway(region_a, region_b, region_c)
print(f"ANOVA: F={f_stat:.3f}, p={p_value:.4f}")

if p_value < 0.05:
    print("At least one region has significantly different sales")

Choosing the Right Test

Data Type | Comparison | Test
Continuous | 1 sample vs known value | One-sample t-test
Continuous | 2 independent groups | Two-sample t-test
Continuous | Same group, before/after | Paired t-test
Continuous | 3+ groups | ANOVA
Categorical | Independence of 2 variables | Chi-square test

Excel Functions for Statistical Tests

  • =T.TEST(array1, array2, tails, type) - T-test
  • =CHISQ.TEST(observed, expected) - Chi-square test
  • =F.TEST(array1, array2) - F-test for variance
  • Data Analysis ToolPak: ANOVA, t-Test, etc.

Day 2 Key Takeaways

  • Z-test for large samples with known population SD
  • T-test for small samples or unknown population SD
  • Chi-square tests independence between categorical variables
  • ANOVA compares means across 3+ groups
  • Choose tests based on data type and comparison needed

Day 3

Correlation & Relationships

Understanding How Variables Relate

Correlation vs Causation

Critical Concept: Correlation ≠ Causation

Just because two variables move together doesn't mean one causes the other!

Example: Ice cream sales and drowning deaths are correlated. Does ice cream cause drowning? No - both increase in summer (confounding variable).

Pearson Correlation Coefficient (r)

Measures the strength and direction of LINEAR relationship between two continuous variables.

r ranges from -1 (strong negative) through 0 (no correlation) to +1 (strong positive).

r Value | Interpretation
0.9 to 1.0 | Very strong positive
0.7 to 0.9 | Strong positive
0.5 to 0.7 | Moderate positive
0.3 to 0.5 | Weak positive
-0.3 to 0.3 | Little to no correlation
-0.5 to -0.3 | Weak negative
-0.7 to -0.5 | Moderate negative
-1.0 to -0.7 | Strong negative

Spearman Rank Correlation

Use when: data is ordinal, or relationship is monotonic but not linear.

Example: Correlation Analysis

Analyzing relationship between advertising spend and sales:

  • Ad Spend: Rs. 10K, 15K, 20K, 25K, 30K
  • Sales: Rs. 100K, 140K, 180K, 210K, 250K

Pearson r = 0.99 (Very strong positive correlation)

As ad spend increases, sales tend to increase proportionally.

# Python: Correlation Analysis
from scipy import stats
import pandas as pd

# Sample data
ad_spend = [10, 15, 20, 25, 30, 35, 40]
sales = [100, 140, 180, 210, 250, 280, 310]

# Pearson correlation
pearson_r, p_value = stats.pearsonr(ad_spend, sales)
print(f"Pearson r: {pearson_r:.4f}, p-value: {p_value:.4f}")

# Spearman correlation
spearman_r, p_value = stats.spearmanr(ad_spend, sales)
print(f"Spearman r: {spearman_r:.4f}")

# Correlation matrix for multiple variables
df = pd.DataFrame({
    'price': [100, 150, 200, 250, 300],
    'quantity': [500, 400, 350, 250, 200],
    'rating': [4.2, 4.5, 4.1, 3.8, 3.5]
})
print("\nCorrelation Matrix:")
print(df.corr())

Introduction to Linear Regression

While correlation measures relationship strength, regression helps predict one variable from another.

Simple Linear Regression

y = mx + b

Where: y = predicted value, m = slope, x = input, b = intercept
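
A minimal sketch of fitting this line with scipy's linregress, reusing the ad-spend example from earlier (amounts in Rs. thousands):

# Python sketch: simple linear regression with scipy
from scipy import stats

ad_spend = [10, 15, 20, 25, 30]
sales = [100, 140, 180, 210, 250]

result = stats.linregress(ad_spend, sales)
print(f"Slope (m): {result.slope:.2f}")        # ≈ 7.4 for this data
print(f"Intercept (b): {result.intercept:.2f}")  # ≈ 28
print(f"R-squared: {result.rvalue**2:.4f}")

# Predict sales for an ad spend of 22 (thousand)
predicted = result.slope * 22 + result.intercept
print(f"Predicted sales at ad spend 22: {predicted:.1f}")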

Excel Correlation Functions

  • =CORREL(array1, array2) - Pearson correlation
  • =RSQ(known_y, known_x) - R-squared value
  • =SLOPE(known_y, known_x) - Regression slope
  • =INTERCEPT(known_y, known_x) - Regression intercept

Day 3 Key Takeaways

  • Correlation shows relationship strength (-1 to +1)
  • Correlation does NOT imply causation
  • Pearson for linear relationships; Spearman for monotonic
  • Correlation matrices help visualize multiple relationships
  • Regression predicts values; correlation measures association

Day 4

Exploratory Data Analysis

Systematic Data Exploration

What is EDA?

Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, often using visual methods. It's about understanding your data before formal modeling.

EDA Process and Methodology

  1. Understand the data: What does each column represent?
  2. Check data quality: Missing values, duplicates, errors
  3. Univariate analysis: Examine each variable individually
  4. Bivariate analysis: Examine relationships between pairs
  5. Multivariate analysis: Examine complex relationships
  6. Document findings: Insights, issues, recommendations

Univariate Analysis

Analyzing one variable at a time.

Variable Type | Analysis Methods | Visualizations
Numerical | Mean, median, SD, min, max, quartiles | Histogram, box plot, density plot
Categorical | Frequency counts, mode, proportions | Bar chart, pie chart

Bivariate Analysis

Analyzing relationships between two variables.

Variable Combination | Methods
Numerical vs Numerical | Correlation, scatter plot, regression
Categorical vs Categorical | Cross-tabulation, chi-square, stacked bar
Numerical vs Categorical | Group statistics, box plots by group, ANOVA
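
A minimal pandas sketch of these three patterns; the columns (region, segment, ad_spend, sales) are made up for illustration:

# Python sketch: bivariate analysis patterns in pandas (illustrative data)
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'West', 'South', 'West'],
    'segment': ['New', 'Returning', 'Returning', 'New', 'New', 'Returning'],
    'ad_spend': [10, 15, 20, 25, 30, 35],
    'sales': [100, 140, 180, 210, 250, 280],
})

# Numerical vs numerical: correlation
print(df[['ad_spend', 'sales']].corr())

# Categorical vs categorical: cross-tabulation
print(pd.crosstab(df['region'], df['segment']))

# Numerical vs categorical: group statistics
print(df.groupby('region')['sales'].agg(['mean', 'median', 'count']))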

Handling Missing Data

Strategies for Missing Values

  • Remove: Delete rows/columns with missing values (if small %)
  • Impute with mean/median: For numerical data
  • Impute with mode: For categorical data
  • Forward/backward fill: For time series
  • Flag as missing: Create indicator variable
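
A minimal pandas sketch of the strategies listed above; the columns are illustrative:

# Python sketch: common missing-value strategies in pandas (illustrative data)
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': [100, np.nan, 200, 250, np.nan],
    'category': ['A', 'B', None, 'B', 'A'],
})

# Flag as missing: create the indicator before imputing
df['price_was_missing'] = df['price'].isna().astype(int)

# Remove: drop rows with any missing value (only if the share is small)
df_dropped = df.dropna()

# Impute with median (numerical) and mode (categorical)
df['price'] = df['price'].fillna(df['price'].median())
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Forward fill: typical for time series
df_ffill = df.ffill()

print(df)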

Dealing with Outliers

Outlier Detection Methods

  • IQR Method: Values below Q1-1.5×IQR or above Q3+1.5×IQR
  • Z-score: Values with |z| > 3
  • Visual: Box plots, scatter plots
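
A minimal sketch of the IQR and z-score rules on simulated data with two injected outliers:

# Python sketch: IQR and z-score outlier detection (simulated data)
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
values = pd.Series(np.concatenate([rng.normal(50, 5, 200), [95, 5]]))  # inject two outliers

# IQR method: outside Q1 - 1.5*IQR or Q3 + 1.5*IQR
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score method: |z| > 3
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

print(f"IQR flags {len(iqr_outliers)} point(s); z-score flags {len(z_outliers)} point(s)")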

Handling Outliers

  • Investigate: Are they errors or genuine?
  • Remove: If errors or not relevant
  • Cap/Floor: Replace with threshold values
  • Transform: Log transformation reduces impact
  • Keep: If valid and important for analysis
# Python: EDA Workflow
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# 1. Basic info
print(df.info())
print(df.describe())

# 2. Missing values
print("\nMissing Values:")
print(df.isnull().sum())

# 3. Duplicates
print(f"\nDuplicates: {df.duplicated().sum()}")

# 4. Value counts for categorical
print("\nCategory Distribution:")
print(df['category'].value_counts())

# 5. Correlation matrix (numeric columns only)
print("\nCorrelation Matrix:")
print(df.corr(numeric_only=True))

Day 4 Key Takeaways

  • EDA is essential before any modeling or analysis
  • Start with univariate, then bivariate, then multivariate
  • Always check for missing values and outliers
  • Choose handling strategies based on context
  • Document all findings and data quality issues

Day 5

A/B Testing & Experimentation

Data-Driven Decision Making

What is A/B Testing?

A/B Testing (split testing) is a method of comparing two versions of something to determine which performs better. Users are randomly assigned to see either version A (control) or version B (treatment).

Variant A (Control)

Original version

50% of traffic

Baseline performance

Variant B (Treatment)

New version with change

50% of traffic

Test performance

Designing Experiments

  1. Define the goal: What metric are you trying to improve?
  2. Formulate hypothesis: What change do you expect?
  3. Calculate sample size: How many users needed?
  4. Randomize: Randomly assign users to groups
  5. Run the test: Collect data for sufficient time
  6. Analyze results: Statistical significance check
  7. Make decision: Implement winner or iterate

Sample Size Calculation

Factors Affecting Sample Size

  • Baseline conversion rate: Current performance
  • Minimum detectable effect: Smallest improvement worth detecting
  • Significance level (α): Usually 0.05
  • Power (1-β): Usually 0.80 (80%)

Example: Website Button Color Test

Goal: Increase sign-up conversion rate

H₀: Button color has no effect on conversions

H₁: New button color increases conversions

Current rate: 3%

Target improvement: 3.5% (16.7% relative increase)

Sample size needed: roughly 15,000-16,000 per variant (one-tailed test at α = 0.05, 80% power)
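
A minimal sketch of the per-variant sample size calculation behind that figure, using the standard two-proportion formula and assuming a one-tailed test at α = 0.05 with 80% power:

# Python sketch: per-variant sample size for a two-proportion test
import numpy as np
from scipy import stats

p1, p2 = 0.03, 0.035                 # baseline and target conversion rates
alpha, power = 0.05, 0.80

z_alpha = stats.norm.ppf(1 - alpha)  # one-tailed critical value
z_beta = stats.norm.ppf(power)

n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(f"Sample size per variant: {int(np.ceil(n)):,}")  # on the order of 15,000-16,000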

Common A/B Testing Pitfalls

Avoid These Mistakes

  • Peeking: Checking results too early and stopping
  • Small sample: Not enough data for significance
  • Multiple testing: Testing many variants inflates false positives
  • Novelty effect: Users react to change, not the change itself
  • Selection bias: Non-random group assignment
  • Running too short: Miss weekly/monthly patterns
# Python: A/B Test Analysis
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# A/B test results
control_visitors = 10000
control_conversions = 300    # 3%
treatment_visitors = 10000
treatment_conversions = 350  # 3.5%

# Conversion rates
control_rate = control_conversions / control_visitors
treatment_rate = treatment_conversions / treatment_visitors
print(f"Control: {control_rate:.2%}")
print(f"Treatment: {treatment_rate:.2%}")
print(f"Lift: {(treatment_rate - control_rate) / control_rate:.2%}")

# Two-proportion z-test
# alternative='smaller' tests H₁: control rate < treatment rate
count = np.array([control_conversions, treatment_conversions])
nobs = np.array([control_visitors, treatment_visitors])
z_stat, p_value = proportions_ztest(count, nobs, alternative='smaller')

print(f"\nZ-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Result: Statistically significant - Treatment wins!")
else:
    print("Result: Not significant - Cannot conclude treatment is better")

Day 5 Key Takeaways

  • A/B testing enables data-driven decisions
  • Proper sample size is crucial for valid results
  • Random assignment prevents selection bias
  • Avoid peeking - wait for statistical significance
  • Consider practical significance, not just statistical

Day 6

Data Visualization for Analysis

Telling Stories with Data

Choosing the Right Chart Type

Bar Chart

Comparing categories

Line Chart

Trends over time

Scatter Plot

Relationships between variables

Histogram

Distribution of values

Box Plot

Distribution + outliers

Heatmap

Correlation matrix
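
A minimal matplotlib/seaborn sketch pairing each chart type above with one typical call; the dataset is simulated for illustration:

# Python sketch: one typical call per chart type (simulated data)
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'category': list('ABCD') * 25,
    'month': np.tile(np.arange(1, 11), 10),
    'x': rng.normal(50, 10, 100),
    'y': rng.normal(100, 20, 100),
})

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
df.groupby('category')['y'].mean().plot.bar(ax=axes[0, 0], title='Bar: compare categories')
df.groupby('month')['y'].mean().plot.line(ax=axes[0, 1], title='Line: trend over time')
axes[0, 2].scatter(df['x'], df['y'])
axes[0, 2].set_title('Scatter: relationship')
axes[1, 0].hist(df['x'], bins=15)
axes[1, 0].set_title('Histogram: distribution')
sns.boxplot(data=df, x='category', y='y', ax=axes[1, 1])
axes[1, 1].set_title('Box: distribution + outliers')
sns.heatmap(df[['x', 'y', 'month']].corr(), annot=True, ax=axes[1, 2])
axes[1, 2].set_title('Heatmap: correlations')
plt.tight_layout()
plt.show()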

Effective Chart Design Principles

Best Practices

  • Clear title: State what the chart shows
  • Label axes: Include units
  • Start at zero: For bar charts (usually)
  • Remove clutter: No unnecessary gridlines, borders
  • Use color purposefully: Highlight key insights
  • Add context: Benchmarks, targets, annotations

Common Visualization Mistakes

Avoid These Errors

  • Truncated y-axis exaggerating differences
  • 3D charts that distort perception
  • Too many colors or categories
  • Pie charts for more than 5 categories
  • Missing legends or labels
  • Using the wrong chart type for the data

Telling Stories with Data

  1. Start with the insight: What's the key message?
  2. Build context: Why does this matter?
  3. Show evidence: Support with data
  4. Provide recommendations: What should we do?
  5. Make it actionable: Clear next steps
# Python: Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Set style
sns.set_style("whitegrid")

# Sample data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
sales = [100, 120, 115, 140, 160]

# Line chart
plt.figure(figsize=(10, 6))
plt.plot(months, sales, marker='o', linewidth=2)
plt.title('Monthly Sales Trend', fontsize=14, fontweight='bold')
plt.xlabel('Month')
plt.ylabel('Sales (Rs. thousands)')
plt.tight_layout()
plt.show()

# Correlation heatmap
df = pd.DataFrame({
    'price': [100, 150, 200, 250, 300],
    'quantity': [500, 400, 350, 250, 200],
    'revenue': [50000, 60000, 70000, 62500, 60000]
})
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

Day 6 Key Takeaways

  • Choose chart type based on what you're showing
  • Keep visualizations clean and purposeful
  • Always include titles, labels, and context
  • Tell a story - start with the insight
  • End with actionable recommendations

Day 7

Week 5 Project & Assessment

Complete EDA & Statistical Analysis

Project: Complete EDA & Statistical Analysis

Project Requirements

Perform end-to-end exploratory data analysis with statistical testing:

  1. Comprehensive EDA with visualizations
    • Data overview and quality assessment
    • Univariate and bivariate analysis
    • Handle missing values and outliers
  2. Hypothesis testing for business questions
    • Formulate at least 2 hypotheses
    • Apply appropriate statistical tests
    • Interpret results in business context
  3. Correlation analysis and insights
    • Identify key relationships
    • Create correlation matrix
  4. A/B test analysis and recommendations
    • Analyze provided A/B test data
    • Calculate significance
    • Make recommendations
  5. Executive summary presentation
    • Key findings and insights
    • Data-driven recommendations
    • Next steps

Assessment Breakdown

Component | Weight
EDA Report with Visualizations | 50%
Quiz: Hypothesis Testing & EDA | 30%
Presentation Skills | 20%

Week 5 Summary

  • Day 1: Hypothesis testing fundamentals
  • Day 2: Statistical tests (t-test, chi-square, ANOVA)
  • Day 3: Correlation analysis
  • Day 4: EDA methodology
  • Day 5: A/B testing
  • Day 6: Data visualization
  • Day 7: Complete EDA project

Self-Assessment Quiz

1. A p-value of 0.03 with α = 0.05 means:

  • a) Fail to reject H₀
  • b) Reject H₀
  • c) Accept H₀
  • d) Cannot determine

2. Type I error is:

  • a) False negative
  • b) False positive
  • c) True negative
  • d) True positive

3. Which test compares means of 3+ groups?

  • a) T-test
  • b) Chi-square
  • c) ANOVA
  • d) Z-test

4. A correlation of -0.85 indicates:

  • a) Strong positive relationship
  • b) Weak negative relationship
  • c) Strong negative relationship
  • d) No relationship

5. In A/B testing, "peeking" refers to:

  • a) Looking at competitor tests
  • b) Checking results too early
  • c) Testing multiple variants
  • d) Using small samples

Quiz Answers

1-b, 2-b, 3-c, 4-c, 5-b