Week 5: Advanced Statistics & Exploratory Data Analysis

Hypothesis Testing, EDA Techniques & Data Storytelling

Duration: 7 Days | Level: Intermediate | Prerequisites: Week 4 Statistics

Welcome to Week 5! You've built a solid foundation in statistics - now it's time to apply those skills to real-world analysis. This week, you'll learn how to test hypotheses, explore data systematically, and communicate findings effectively.

By the end of this week, you'll be able to design experiments, analyze A/B tests, perform comprehensive exploratory data analysis, and present your findings in a compelling way.

Day 1

Hypothesis Testing Fundamentals

Making Data-Driven Decisions with Confidence

What is Hypothesis Testing?

Hypothesis Testing is a statistical method used to make decisions about a population based on sample data. It helps us determine whether observed differences are statistically significant or just due to random chance.

Think of hypothesis testing as a courtroom trial: the defendant is presumed innocent (the null hypothesis), and we convict only when the evidence for guilt (the alternative hypothesis) is strong enough.

Null and Alternative Hypotheses

H₀ (Null Hypothesis)

The "status quo" - no effect, no difference

Example: "The new website design has NO effect on conversion rates"

H₁ (Alternative Hypothesis)

What we're trying to prove - there IS an effect

Example: "The new website design DOES affect conversion rates"

Type I and Type II Errors

Type I Error (False Positive)

Rejecting H₀ when it's actually true

Example: Concluding a drug works when it doesn't

Probability = α (significance level)

Type II Error (False Negative)

Failing to reject H₀ when it's actually false

Example: Missing a real effect that exists

Probability = β

P-values and Significance Levels

P-value: The probability of observing results at least as extreme as the sample results, assuming the null hypothesis is true.

Significance Level (α): The threshold for rejecting H₀, typically 0.05 (5%).

Decision Rule

  • If p-value ≤ α: Reject H₀ (result is statistically significant)
  • If p-value > α: Fail to reject H₀ (result is not statistically significant)

One-tailed vs Two-tailed Tests

Type | When to Use | Example H₁
Two-tailed | Testing for any difference (could be higher OR lower) | μ ≠ 50 (mean is different from 50)
One-tailed (Right) | Testing if the value is greater than a benchmark | μ > 50 (mean is greater than 50)
One-tailed (Left) | Testing if the value is less than a benchmark | μ < 50 (mean is less than 50)
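
In SciPy this choice is made through the `alternative` argument of the t-test functions (available since SciPy 1.6). A minimal sketch, reusing the Day 1 sample data and a hypothesized mean of 50:

# Python sketch: one-tailed vs two-tailed tests (requires SciPy >= 1.6)
from scipy import stats

sample = [52, 48, 55, 51, 49, 53, 50, 54, 47, 52]  # example data
mu0 = 50                                            # hypothesized mean

# Two-tailed: H₁ is μ ≠ 50
t_two, p_two = stats.ttest_1samp(sample, mu0, alternative='two-sided')

# One-tailed (right): H₁ is μ > 50
t_right, p_right = stats.ttest_1samp(sample, mu0, alternative='greater')

# One-tailed (left): H₁ is μ < 50
t_left, p_left = stats.ttest_1samp(sample, mu0, alternative='less')

print(f"Two-tailed p:   {p_two:.4f}")
print(f"Right-tailed p: {p_right:.4f}")
print(f"Left-tailed p:  {p_left:.4f}")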

Steps in Hypothesis Testing

  1. State the hypotheses: Define H₀ and H₁
  2. Choose significance level: Typically α = 0.05
  3. Select the appropriate test: Based on data type and sample size
  4. Calculate the test statistic: Z-score, t-statistic, etc.
  5. Find the p-value: Probability of observing the result
  6. Make a decision: Reject or fail to reject H₀
  7. Interpret results: In context of the business problem

Example: Testing a Marketing Campaign

Scenario: A company claims their new email campaign increases click-through rates above the historical average of 2.5%.

H₀: μ = 2.5% (no improvement)

H₁: μ > 2.5% (improvement exists)

Sample: 1,000 emails sent, 35 clicks (3.5%)

Test: One-proportion z-test (one-tailed)

Result: z ≈ 2.03, p-value ≈ 0.021

Decision: Since p < 0.05, reject H₀. The data provide evidence that the new campaign improves click-through rates.
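
A quick way to check these numbers is to compute the one-proportion z-test directly from the normal distribution; a minimal sketch (figures may differ slightly due to rounding):

# Python sketch: one-proportion z-test for the email campaign example
import numpy as np
from scipy import stats

p0 = 0.025                      # historical click-through rate (H₀)
clicks, n = 35, 1000
p_hat = clicks / n

se = np.sqrt(p0 * (1 - p0) / n)  # standard error under H₀
z = (p_hat - p0) / se
p_value = 1 - stats.norm.cdf(z)  # one-tailed (right) p-value

print(f"z = {z:.2f}, p = {p_value:.4f}")  # roughly z ≈ 2.03, p ≈ 0.02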

# Python: Hypothesis Testing Example
from scipy import stats
import numpy as np

# One-sample t-test
# Testing if sample mean differs from population mean
sample_data = [52, 48, 55, 51, 49, 53, 50, 54, 47, 52]
population_mean = 50

t_stat, p_value = stats.ttest_1samp(sample_data, population_mean)

print(f"Sample Mean: {np.mean(sample_data):.2f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Result: Reject H₀ - Significant difference found")
else:
    print("Result: Fail to reject H₀ - No significant difference")

Try It Yourself - Day 1 Activities

  1. Formulate H₀ and H₁ for: "Does a new training program improve employee productivity?"
  2. A test gives p-value = 0.03 with α = 0.05. What's your conclusion?
  3. Explain the difference between Type I and Type II errors in a medical testing context

Day 1 Key Takeaways

  • Hypothesis testing helps make data-driven decisions
  • H₀ represents "no effect"; H₁ represents what we're testing
  • P-value measures evidence against H₀
  • Reject H₀ when p-value ≤ α (typically 0.05)
  • Type I error: false positive; Type II error: false negative

Day 2

Statistical Tests

Choosing and Applying the Right Test

Z-Test for Large Samples

Use Z-test when: sample size is large (n ≥ 30) and population standard deviation is known.

Z-Test Formula

Z = (x̄ - μ) / (σ / √n)

Where: x̄ = sample mean, μ = population mean, σ = population SD, n = sample size
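
A minimal sketch of this formula in Python; the sample values below are illustrative, not from a real dataset:

# Python sketch: z-test computed directly from the formula above
import numpy as np
from scipy import stats

x_bar = 52.5   # sample mean (illustrative)
mu = 50        # population mean under H₀
sigma = 8      # known population standard deviation
n = 64         # sample size (n ≥ 30)

z = (x_bar - mu) / (sigma / np.sqrt(n))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed p-value

print(f"Z = {z:.2f}, p = {p_value:.4f}")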

T-Tests

Use T-test when: sample size is small (n < 30) or population SD is unknown.

T-Test Type | Use Case | Example
One-Sample | Compare sample mean to a known value | Is average delivery time different from 3 days?
Two-Sample (Independent) | Compare means of two independent groups | Do male and female customers spend differently?
Paired | Compare means from the same group at different times | Did training improve employee scores?
# Python: Different T-Tests
from scipy import stats

# Two-Sample Independent T-Test
group_a = [85, 90, 78, 92, 88, 76, 95, 89]
group_b = [78, 82, 75, 80, 77, 83, 79, 81]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"Two-Sample T-Test: t={t_stat:.3f}, p={p_value:.4f}")

# Paired T-Test (before/after)
before = [200, 220, 190, 210, 205]
after = [180, 195, 170, 185, 175]
t_stat, p_value = stats.ttest_rel(before, after)
print(f"Paired T-Test: t={t_stat:.3f}, p={p_value:.4f}")

Chi-Square Test for Independence

Tests whether two categorical variables are independent or related.

Example: Customer Preference by Region

Is product preference independent of region?

Region | Product A | Product B | Product C
North | 50 | 30 | 20
South | 35 | 45 | 20
West | 40 | 35 | 25

Chi-square test determines if the differences are statistically significant.
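
A minimal sketch of this test in Python on the table above, using scipy's chi2_contingency:

# Python sketch: chi-square test of independence on the region/product table
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [50, 30, 20],  # North
    [35, 45, 20],  # South
    [40, 35, 25],  # West
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-square = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Preference appears to depend on region")
else:
    print("No evidence that preference depends on region")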

ANOVA: Analysis of Variance

Compares means across three or more groups to see if at least one is significantly different.

When to Use ANOVA

  • Comparing sales across multiple regions
  • Testing if different treatments have different effects
  • Comparing customer satisfaction across product categories
# Python: ANOVA
from scipy import stats

# Sales data from three regions
region_a = [120, 135, 128, 142, 130]
region_b = [145, 150, 138, 155, 148]
region_c = [110, 115, 122, 108, 118]

f_stat, p_value = stats.f_oneway(region_a, region_b, region_c)
print(f"ANOVA: F={f_stat:.3f}, p={p_value:.4f}")

if p_value < 0.05:
    print("At least one region has significantly different sales")

Choosing the Right Test

Data Type | Comparison | Test
Continuous | 1 sample vs known value | One-sample t-test
Continuous | 2 independent groups | Two-sample t-test
Continuous | Same group, before/after | Paired t-test
Continuous | 3+ groups | ANOVA
Categorical | Independence of 2 variables | Chi-square test

Excel Functions for Statistical Tests

  • =T.TEST(array1, array2, tails, type) - T-test
  • =CHISQ.TEST(observed, expected) - Chi-square test
  • =F.TEST(array1, array2) - F-test for variance
  • Data Analysis ToolPak: ANOVA, t-Test, etc.

Day 2 Key Takeaways

  • Z-test for large samples with known population SD
  • T-test for small samples or unknown population SD
  • Chi-square tests independence between categorical variables
  • ANOVA compares means across 3+ groups
  • Choose tests based on data type and comparison needed

Day 3

Correlation & Relationships

Understanding How Variables Relate

Correlation vs Causation

Critical Concept: Correlation ≠ Causation

Just because two variables move together doesn't mean one causes the other!

Example: Ice cream sales and drowning deaths are correlated. Does ice cream cause drowning? No - both increase in summer (confounding variable).

Pearson Correlation Coefficient (r)

Measures the strength and direction of LINEAR relationship between two continuous variables.

r ranges from -1 (strong negative) through 0 (no correlation) to +1 (strong positive).

r Value | Interpretation
0.9 to 1.0 | Very strong positive
0.7 to 0.9 | Strong positive
0.5 to 0.7 | Moderate positive
0.3 to 0.5 | Weak positive
-0.3 to 0.3 | Little to no correlation
-0.5 to -0.3 | Weak negative
-0.7 to -0.5 | Moderate negative
-1.0 to -0.7 | Strong negative

Spearman Rank Correlation

Use when: data is ordinal, or relationship is monotonic but not linear.

Example: Correlation Analysis

Analyzing relationship between advertising spend and sales:

  • Ad Spend: Rs. 10K, 15K, 20K, 25K, 30K
  • Sales: Rs. 100K, 140K, 180K, 210K, 250K

Pearson r = 0.99 (Very strong positive correlation)

As ad spend increases, sales tend to increase proportionally.

# Python: Correlation Analysis
from scipy import stats
import pandas as pd

# Sample data
ad_spend = [10, 15, 20, 25, 30, 35, 40]
sales = [100, 140, 180, 210, 250, 280, 310]

# Pearson correlation
pearson_r, p_value = stats.pearsonr(ad_spend, sales)
print(f"Pearson r: {pearson_r:.4f}, p-value: {p_value:.4f}")

# Spearman correlation
spearman_r, p_value = stats.spearmanr(ad_spend, sales)
print(f"Spearman r: {spearman_r:.4f}")

# Correlation matrix for multiple variables
df = pd.DataFrame({
    'price': [100, 150, 200, 250, 300],
    'quantity': [500, 400, 350, 250, 200],
    'rating': [4.2, 4.5, 4.1, 3.8, 3.5]
})
print("\nCorrelation Matrix:")
print(df.corr())

Introduction to Linear Regression

While correlation measures relationship strength, regression helps predict one variable from another.

Simple Linear Regression

y = mx + b

Where: y = predicted value, m = slope, x = input, b = intercept
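
A minimal sketch of fitting this line with scipy's linregress, reusing the ad-spend example from earlier (amounts in Rs. thousands):

# Python sketch: simple linear regression with scipy
from scipy import stats

ad_spend = [10, 15, 20, 25, 30]
sales = [100, 140, 180, 210, 250]

result = stats.linregress(ad_spend, sales)
print(f"Slope (m): {result.slope:.2f}")        # ≈ 7.4 for this data
print(f"Intercept (b): {result.intercept:.2f}")  # ≈ 28
print(f"R-squared: {result.rvalue**2:.4f}")

# Predict sales for an ad spend of 22 (thousand)
predicted = result.slope * 22 + result.intercept
print(f"Predicted sales at ad spend 22: {predicted:.1f}")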

Excel Correlation Functions

  • =CORREL(array1, array2) - Pearson correlation
  • =RSQ(known_y, known_x) - R-squared value
  • =SLOPE(known_y, known_x) - Regression slope
  • =INTERCEPT(known_y, known_x) - Regression intercept

Day 3 Key Takeaways

  • Correlation shows relationship strength (-1 to +1)
  • Correlation does NOT imply causation
  • Pearson for linear relationships; Spearman for monotonic
  • Correlation matrices help visualize multiple relationships
  • Regression predicts values; correlation measures association

Day 4

Exploratory Data Analysis

Systematic Data Exploration

What is EDA?

Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, often using visual methods. It's about understanding your data before formal modeling.

EDA Process and Methodology

  1. Understand the data: What does each column represent?
  2. Check data quality: Missing values, duplicates, errors
  3. Univariate analysis: Examine each variable individually
  4. Bivariate analysis: Examine relationships between pairs
  5. Multivariate analysis: Examine complex relationships
  6. Document findings: Insights, issues, recommendations

Univariate Analysis

Analyzing one variable at a time.

Variable Type | Analysis Methods | Visualizations
Numerical | Mean, median, SD, min, max, quartiles | Histogram, box plot, density plot
Categorical | Frequency counts, mode, proportions | Bar chart, pie chart

Bivariate Analysis

Analyzing relationships between two variables.

Variable Combination | Methods
Numerical vs Numerical | Correlation, scatter plot, regression
Categorical vs Categorical | Cross-tabulation, chi-square, stacked bar
Numerical vs Categorical | Group statistics, box plots by group, ANOVA
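
A minimal pandas sketch of these three patterns; the columns (region, segment, ad_spend, sales) are made up for illustration:

# Python sketch: bivariate analysis patterns in pandas (illustrative data)
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'West', 'South', 'West'],
    'segment': ['New', 'Returning', 'Returning', 'New', 'New', 'Returning'],
    'ad_spend': [10, 15, 20, 25, 30, 35],
    'sales': [100, 140, 180, 210, 250, 280],
})

# Numerical vs numerical: correlation
print(df[['ad_spend', 'sales']].corr())

# Categorical vs categorical: cross-tabulation
print(pd.crosstab(df['region'], df['segment']))

# Numerical vs categorical: group statistics
print(df.groupby('region')['sales'].agg(['mean', 'median', 'count']))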

Handling Missing Data

Strategies for Missing Values

  • Remove: Delete rows/columns with missing values (if small %)
  • Impute with mean/median: For numerical data
  • Impute with mode: For categorical data
  • Forward/backward fill: For time series
  • Flag as missing: Create indicator variable
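
A minimal pandas sketch of the strategies listed above; the columns are illustrative:

# Python sketch: common missing-value strategies in pandas (illustrative data)
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': [100, np.nan, 200, 250, np.nan],
    'category': ['A', 'B', None, 'B', 'A'],
})

# Flag as missing: create the indicator before imputing
df['price_was_missing'] = df['price'].isna().astype(int)

# Remove: drop rows with any missing value (only if the share is small)
df_dropped = df.dropna()

# Impute with median (numerical) and mode (categorical)
df['price'] = df['price'].fillna(df['price'].median())
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Forward fill: typical for time series
df_ffill = df.ffill()

print(df)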

Dealing with Outliers

Outlier Detection Methods

  • IQR Method: Values below Q1-1.5×IQR or above Q3+1.5×IQR
  • Z-score: Values with |z| > 3
  • Visual: Box plots, scatter plots
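
A minimal sketch of the IQR and z-score rules on simulated data with two injected outliers:

# Python sketch: IQR and z-score outlier detection (simulated data)
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
values = pd.Series(np.concatenate([rng.normal(50, 5, 200), [95, 5]]))  # inject two outliers

# IQR method: outside Q1 - 1.5*IQR or Q3 + 1.5*IQR
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score method: |z| > 3
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

print(f"IQR flags {len(iqr_outliers)} point(s); z-score flags {len(z_outliers)} point(s)")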

Handling Outliers

  • Investigate: Are they errors or genuine?
  • Remove: If errors or not relevant
  • Cap/Floor: Replace with threshold values
  • Transform: Log transformation reduces impact
  • Keep: If valid and important for analysis
# Python: EDA Workflow
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# 1. Basic info
print(df.info())
print(df.describe())

# 2. Missing values
print("\nMissing Values:")
print(df.isnull().sum())

# 3. Duplicates
print(f"\nDuplicates: {df.duplicated().sum()}")

# 4. Value counts for categorical
print("\nCategory Distribution:")
print(df['category'].value_counts())

# 5. Correlation matrix (numeric columns only)
print("\nCorrelation Matrix:")
print(df.corr(numeric_only=True))

Day 4 Key Takeaways

  • EDA is essential before any modeling or analysis
  • Start with univariate, then bivariate, then multivariate
  • Always check for missing values and outliers
  • Choose handling strategies based on context
  • Document all findings and data quality issues

Day 5

A/B Testing & Experimentation

Data-Driven Decision Making

What is A/B Testing?

A/B Testing (split testing) is a method of comparing two versions of something to determine which performs better. Users are randomly assigned to see either version A (control) or version B (treatment).

Variant A (Control)

Original version

50% of traffic

Baseline performance

Variant B (Treatment)

New version with change

50% of traffic

Test performance

Designing Experiments

  1. Define the goal: What metric are you trying to improve?
  2. Formulate hypothesis: What change do you expect?
  3. Calculate sample size: How many users needed?
  4. Randomize: Randomly assign users to groups
  5. Run the test: Collect data for sufficient time
  6. Analyze results: Statistical significance check
  7. Make decision: Implement winner or iterate

Sample Size Calculation

Factors Affecting Sample Size

  • Baseline conversion rate: Current performance
  • Minimum detectable effect: Smallest improvement worth detecting
  • Significance level (α): Usually 0.05
  • Power (1-β): Usually 0.80 (80%)

Example: Website Button Color Test

Goal: Increase sign-up conversion rate

H₀: Button color has no effect on conversions

H₁: New button color increases conversions

Current rate: 3%

Target improvement: 3.5% (16.7% relative increase)

Sample size needed: roughly 15,000-16,000 per variant (one-tailed test at α = 0.05, 80% power)
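
A minimal sketch of the per-variant sample size calculation behind that figure, using the standard two-proportion formula and assuming a one-tailed test at α = 0.05 with 80% power:

# Python sketch: per-variant sample size for a two-proportion test
import numpy as np
from scipy import stats

p1, p2 = 0.03, 0.035                 # baseline and target conversion rates
alpha, power = 0.05, 0.80

z_alpha = stats.norm.ppf(1 - alpha)  # one-tailed critical value
z_beta = stats.norm.ppf(power)

n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(f"Sample size per variant: {int(np.ceil(n)):,}")  # on the order of 15,000-16,000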

Common A/B Testing Pitfalls

Avoid These Mistakes

  • Peeking: Checking results too early and stopping
  • Small sample: Not enough data for significance
  • Multiple testing: Testing many variants inflates false positives
  • Novelty effect: Users react to change, not the change itself
  • Selection bias: Non-random group assignment
  • Running too short: Miss weekly/monthly patterns
# Python: A/B Test Analysis
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# A/B test results
control_visitors = 10000
control_conversions = 300    # 3%
treatment_visitors = 10000
treatment_conversions = 350  # 3.5%

# Conversion rates
control_rate = control_conversions / control_visitors
treatment_rate = treatment_conversions / treatment_visitors
print(f"Control: {control_rate:.2%}")
print(f"Treatment: {treatment_rate:.2%}")
print(f"Lift: {(treatment_rate - control_rate) / control_rate:.2%}")

# Two-proportion z-test
# alternative='smaller' tests H₁: control rate < treatment rate
count = np.array([control_conversions, treatment_conversions])
nobs = np.array([control_visitors, treatment_visitors])
z_stat, p_value = proportions_ztest(count, nobs, alternative='smaller')

print(f"\nZ-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Result: Statistically significant - Treatment wins!")
else:
    print("Result: Not significant - Cannot conclude treatment is better")

Day 5 Key Takeaways

  • A/B testing enables data-driven decisions
  • Proper sample size is crucial for valid results
  • Random assignment prevents selection bias
  • Avoid peeking - wait for statistical significance
  • Consider practical significance, not just statistical

Day 6

Data Visualization for Analysis

Telling Stories with Data

Choosing the Right Chart Type

Bar Chart

Comparing categories

Line Chart

Trends over time

Scatter Plot

Relationships between variables

Histogram

Distribution of values

Box Plot

Distribution + outliers

Heatmap

Correlation matrix
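
A minimal matplotlib/seaborn sketch pairing each chart type above with one typical call; the dataset is simulated for illustration:

# Python sketch: one typical call per chart type (simulated data)
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'category': list('ABCD') * 25,
    'month': np.tile(np.arange(1, 11), 10),
    'x': rng.normal(50, 10, 100),
    'y': rng.normal(100, 20, 100),
})

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
df.groupby('category')['y'].mean().plot.bar(ax=axes[0, 0], title='Bar: compare categories')
df.groupby('month')['y'].mean().plot.line(ax=axes[0, 1], title='Line: trend over time')
axes[0, 2].scatter(df['x'], df['y'])
axes[0, 2].set_title('Scatter: relationship')
axes[1, 0].hist(df['x'], bins=15)
axes[1, 0].set_title('Histogram: distribution')
sns.boxplot(data=df, x='category', y='y', ax=axes[1, 1])
axes[1, 1].set_title('Box: distribution + outliers')
sns.heatmap(df[['x', 'y', 'month']].corr(), annot=True, ax=axes[1, 2])
axes[1, 2].set_title('Heatmap: correlations')
plt.tight_layout()
plt.show()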

Effective Chart Design Principles

Best Practices

  • Clear title: State what the chart shows
  • Label axes: Include units
  • Start at zero: For bar charts (usually)
  • Remove clutter: No unnecessary gridlines, borders
  • Use color purposefully: Highlight key insights
  • Add context: Benchmarks, targets, annotations

Common Visualization Mistakes

Avoid These Errors

  • Truncated y-axis exaggerating differences
  • 3D charts that distort perception
  • Too many colors or categories
  • Pie charts for more than 5 categories
  • Missing legends or labels
  • Using the wrong chart type for the data

Telling Stories with Data

  1. Start with the insight: What's the key message?
  2. Build context: Why does this matter?
  3. Show evidence: Support with data
  4. Provide recommendations: What should we do?
  5. Make it actionable: Clear next steps
# Python: Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Set style
sns.set_style("whitegrid")

# Sample data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
sales = [100, 120, 115, 140, 160]

# Line chart
plt.figure(figsize=(10, 6))
plt.plot(months, sales, marker='o', linewidth=2)
plt.title('Monthly Sales Trend', fontsize=14, fontweight='bold')
plt.xlabel('Month')
plt.ylabel('Sales (Rs. thousands)')
plt.tight_layout()
plt.show()

# Correlation heatmap
df = pd.DataFrame({
    'price': [100, 150, 200, 250, 300],
    'quantity': [500, 400, 350, 250, 200],
    'revenue': [50000, 60000, 70000, 62500, 60000]
})
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

Day 6 Key Takeaways

  • Choose chart type based on what you're showing
  • Keep visualizations clean and purposeful
  • Always include titles, labels, and context
  • Tell a story - start with the insight
  • End with actionable recommendations

Day 7

Week 5 Project & Assessment

Complete EDA & Statistical Analysis

Project: Complete EDA & Statistical Analysis

Project Requirements

Perform end-to-end exploratory data analysis with statistical testing:

  1. Comprehensive EDA with visualizations
    • Data overview and quality assessment
    • Univariate and bivariate analysis
    • Handle missing values and outliers
  2. Hypothesis testing for business questions
    • Formulate at least 2 hypotheses
    • Apply appropriate statistical tests
    • Interpret results in business context
  3. Correlation analysis and insights
    • Identify key relationships
    • Create correlation matrix
  4. A/B test analysis and recommendations
    • Analyze provided A/B test data
    • Calculate significance
    • Make recommendations
  5. Executive summary presentation
    • Key findings and insights
    • Data-driven recommendations
    • Next steps

Assessment Breakdown

Component | Weight
EDA Report with Visualizations | 50%
Quiz: Hypothesis Testing & EDA | 30%
Presentation Skills | 20%

Week 5 Summary

  • Day 1: Hypothesis testing fundamentals
  • Day 2: Statistical tests (t-test, chi-square, ANOVA)
  • Day 3: Correlation analysis
  • Day 4: EDA methodology
  • Day 5: A/B testing
  • Day 6: Data visualization
  • Day 7: Complete EDA project

Self-Assessment Quiz

1. A p-value of 0.03 with α = 0.05 means:

  • a) Fail to reject H₀
  • b) Reject H₀
  • c) Accept H₀
  • d) Cannot determine

2. Type I error is:

  • a) False negative
  • b) False positive
  • c) True negative
  • d) True positive

3. Which test compares means of 3+ groups?

  • a) T-test
  • b) Chi-square
  • c) ANOVA
  • d) Z-test

4. A correlation of -0.85 indicates:

  • a) Strong positive relationship
  • b) Weak negative relationship
  • c) Strong negative relationship
  • d) No relationship

5. In A/B testing, "peeking" refers to:

  • a) Looking at competitor tests
  • b) Checking results too early
  • c) Testing multiple variants
  • d) Using small samples

Quiz Answers

1-b, 2-b, 3-c, 4-c, 5-b