Welcome to Week 5! You've built a solid foundation in statistics - now it's time to apply those skills to real-world analysis. This week, you'll learn how to test hypotheses, explore data systematically, and communicate findings effectively.
By the end of this week, you'll be able to design experiments, analyze A/B tests, perform comprehensive exploratory data analysis, and present your findings in a compelling way.
What You'll Learn This Week
Hypothesis Testing Fundamentals
Making Data-Driven Decisions with Confidence
What is Hypothesis Testing?
Think of hypothesis testing as a courtroom trial: we start with an assumption of innocence (null hypothesis) and need evidence to prove guilt (alternative hypothesis).
Null and Alternative Hypotheses
H₀ (Null Hypothesis)
The "status quo" - no effect, no difference
Example: "The new website design has NO effect on conversion rates"
H₁ (Alternative Hypothesis)
What we're trying to prove - there IS an effect
Example: "The new website design DOES affect conversion rates"
Type I and Type II Errors
Type I Error (False Positive)
Rejecting H₀ when it's actually true
Example: Concluding a drug works when it doesn't
Probability = α (significance level)
Type II Error (False Negative)
Failing to reject H₀ when it's actually false
Example: Missing a real effect that exists
Probability = β
P-values and Significance Levels
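P-value: The probability of observing a result at least as extreme as the one in the sample, assuming H₀ is true.
Significance Level (α): The threshold for rejecting H₀, typically 0.05 (5%).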
Decision Rule
- If p-value ≤ α: Reject H₀ (result is statistically significant)
- If p-value > α: Fail to reject H₀ (result is not statistically significant)
One-tailed vs Two-tailed Tests
| Type | When to Use | Example H₁ |
|---|---|---|
| Two-tailed | Testing for any difference (could be higher OR lower) | μ ≠ 50 (mean is different from 50) |
| One-tailed (Right) | Testing if value is greater than | μ > 50 (mean is greater than 50) |
| One-tailed (Left) | Testing if value is less than | μ < 50 (mean is less than 50) |
Steps in Hypothesis Testing
- State the hypotheses: Define H₀ and H₁
- Choose significance level: Typically α = 0.05
- Select the appropriate test: Based on data type and sample size
- Calculate the test statistic: Z-score, t-statistic, etc.
- Find the p-value: Probability of observing a result at least as extreme as the sample's, assuming H₀ is true
- Make a decision: Reject or fail to reject H₀
- Interpret results: In context of the business problem
Example: Testing a Marketing Campaign
Scenario: A company claims their new email campaign increases click-through rates above the historical average of 2.5%.
H₀: μ = 2.5% (no improvement)
H₁: μ > 2.5% (improvement exists)
Sample: 1000 emails sent, 32 clicks (3.2%)
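Test: One-sample z-test for a proportion (right-tailed)
Result: z ≈ 1.42, p-value ≈ 0.078
Decision: Since p > 0.05, fail to reject H₀. This sample does not provide statistically significant evidence that the new campaign improves click rates.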
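The arithmetic in an example like this can be re-checked with a few lines of Python. The sketch below is one way to run a right-tailed one-sample z-test for a proportion on the sample above (32 clicks out of 1000, baseline 2.5%). It assumes the normal approximation is adequate at this sample size, and the function name `one_sample_proportion_ztest` is illustrative rather than part of any library.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_ztest(successes, n, p0):
    """Right-tailed z-test of H0: rate = p0 against H1: rate > p0."""
    p_hat = successes / n
    # The standard error uses the rate assumed under H0, not the observed rate
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    # Right-tailed p-value: P(Z >= z) under the standard normal distribution
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


z, p_value = one_sample_proportion_ztest(successes=32, n=1000, p0=0.025)
print(f"z = {z:.2f}, one-tailed p-value = {p_value:.3f}")
print("Reject H0" if p_value <= 0.05 else "Fail to reject H0")
```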
Try It Yourself - Day 1 Activities
- Formulate H₀ and H₁ for: "Does a new training program improve employee productivity?"
- A test gives p-value = 0.03 with α = 0.05. What's your conclusion?
- Explain the difference between Type I and Type II errors in a medical testing context
Day 1 Key Takeaways
- Hypothesis testing helps make data-driven decisions
- H₀ represents "no effect"; H₁ represents what we're testing
- P-value measures evidence against H₀: the smaller the p-value, the stronger the evidence
- Reject H₀ when p-value ≤ α (typically 0.05)
- Type I error: false positive; Type II error: false negative
Statistical Tests
Choosing and Applying the Right Test
Z-Test for Large Samples
Use Z-test when: sample size is large (n ≥ 30) and population standard deviation is known.
Z-Test Formula
Z = (x̄ - μ) / (σ / √n)
Where: x̄ = sample mean, μ = population mean, σ = population SD, n = sample size
T-Tests
Use T-test when: sample size is small (n < 30) or population SD is unknown.
| T-Test Type | Use Case | Example |
|---|---|---|
| One-Sample | Compare sample mean to known value | Is average delivery time different from 3 days? |
| Two-Sample (Independent) | Compare means of two independent groups | Do male and female customers spend differently? |
| Paired | Compare means from same group at different times | Did training improve employee scores? |
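As a rough illustration of the paired case ("Did training improve employee scores?"), the sketch below computes a paired t-statistic by hand. The scores are hypothetical and exist only for this example. Because the Python standard library has no t-distribution, the result is compared against a right-tailed critical value from a standard t-table (1.833 for α = 0.05 and 9 degrees of freedom) instead of computing a p-value.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical test scores for the same 10 employees (illustrative data only)
before = [72, 65, 80, 70, 68, 75, 78, 62, 74, 69]
after = [78, 70, 82, 75, 72, 77, 85, 66, 79, 71]

# A paired test works on the per-person differences, not on the two columns separately
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)

# t = mean difference / standard error of the mean difference
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))
df = n - 1

# Right-tailed critical value for alpha = 0.05 and df = 9, from a standard t-table
T_CRITICAL = 1.833

print(f"mean improvement = {mean(diffs):.1f}, t = {t_stat:.2f}, df = {df}")
print("Reject H0" if t_stat > T_CRITICAL else "Fail to reject H0")
```

With these made-up scores the mean improvement is 4.2 points and t is about 7.58, well above the critical value, so H₀ (no improvement) would be rejected.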
Chi-Square Test for Independence
Tests whether two categorical variables are independent or related.
Example: Customer Preference by Region
Is product preference independent of region?
| Region | Product A | Product B | Product C |
|---|---|---|---|
| North | 50 | 30 | 20 |
| South | 35 | 45 | 20 |
| West | 40 | 35 | 25 |
Chi-square test determines if the differences are statistically significant.
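To make the calculation concrete, here is a minimal sketch of the chi-square test of independence applied to the region table above, using only the standard library. The helper `chi2_sf_even_df` is a hypothetical name; it relies on a closed-form tail probability that holds only when the degrees of freedom are even, which happens to be the case here (df = 4).

```python
from math import exp, factorial

# Observed counts from the table: columns are Product A, B, C
observed = {
    "North": [50, 30, 20],
    "South": [35, 45, 20],
    "West": [40, 35, 25],
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
grand_total = sum(row_totals)

# Sum (observed - expected)^2 / expected over every cell,
# where expected = row total * column total / grand total
chi2 = 0.0
for i, row in enumerate(rows):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(rows) - 1) * (len(col_totals) - 1)


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


p_value = chi2_sf_even_df(chi2, df)
print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.3f}")
```

For this table the statistic is about 6.75 with 4 degrees of freedom (p ≈ 0.15), so at α = 0.05 the data do not show a significant link between region and product preference. In practice a library routine such as `scipy.stats.chi2_contingency` handles any table size.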
ANOVA: Analysis of Variance
Compares means across three or more groups to see if at least one is significantly different.
When to Use ANOVA
- Comparing sales across multiple regions
- Testing if different treatments have different effects
- Comparing customer satisfaction across product categories
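The first use case, comparing sales across regions, can be sketched as a one-way ANOVA computed by hand. The sales figures below are hypothetical. Since the standard library has no F-distribution, the F-statistic is compared against a table value (about 3.89 for F(2, 12) at α = 0.05) rather than converted to a p-value.

```python
from statistics import mean

# Hypothetical weekly sales (in thousands) for three regions; illustrative data only
groups = {
    "North": [20, 22, 19, 24, 25],
    "South": [28, 30, 27, 26, 29],
    "West": [18, 20, 17, 19, 21],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F = mean square between groups / mean square within groups
f_stat = (ss_between / df_between) / (ss_within / df_within)

# Critical value of F(2, 12) at alpha = 0.05, from a standard F-table
F_CRITICAL = 3.89

print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
print("Reject H0" if f_stat > F_CRITICAL else "Fail to reject H0")
```

With these made-up numbers F is about 27.4, so at least one regional mean differs. ANOVA alone does not say which one; that requires a follow-up comparison.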
Choosing the Right Test
| Data Type | Comparison | Test |
|---|---|---|
| Continuous | 1 sample vs known value | One-sample t-test |
| Continuous | 2 independent groups | Two-sample t-test |
| Continuous | Same group, before/after | Paired t-test |
| Continuous | 3+ groups | ANOVA |
| Categorical | Independence of 2 variables | Chi-square test |
Excel Functions for Statistical Tests
- =T.TEST(array1, array2, tails, type): T-test
- =CHISQ.TEST(observed, expected): Chi-square test
- =F.TEST(array1, array2): F-test for variance
- Data Analysis ToolPak: ANOVA, t-Test, etc.
Day 2 Key Takeaways
- Z-test for large samples with known population SD
- T-test for small samples or unknown population SD
- Chi-square tests independence between categorical variables
- ANOVA compares means across 3+ groups
- Choose tests based on data type and comparison needed
Correlation & Relationships
Understanding How Variables Relate
Correlation vs Causation
Critical Concept: Correlation ≠ Causation
Just because two variables move together doesn't mean one causes the other!
Example: Ice cream sales and drowning deaths are correlated. Does ice cream cause drowning? No - both increase in summer (confounding variable).
Pearson Correlation Coefficient (r)
Measures the strength and direction of LINEAR relationship between two continuous variables.
| r Value | Interpretation |
|---|---|
| 0.9 to 1.0 | Very strong positive |
| 0.7 to 0.9 | Strong positive |
| 0.5 to 0.7 | Moderate positive |
| 0.3 to 0.5 | Weak positive |
| -0.3 to 0.3 | Little to no correlation |
| -0.5 to -0.3 | Weak negative |
| -0.7 to -0.5 | Moderate negative |
| -1.0 to -0.7 | Strong negative |
Spearman Rank Correlation
Use when: data is ordinal, or relationship is monotonic but not linear.
Example: Correlation Analysis
Analyzing relationship between advertising spend and sales:
- Ad Spend: Rs. 10K, 15K, 20K, 25K, 30K
- Sales: Rs. 100K, 140K, 180K, 210K, 250K
Pearson r = 0.99 (Very strong positive correlation)
As ad spend increases, sales rise along a nearly straight line.
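The coefficient for this example can be reproduced directly from the definition. The sketch below computes Pearson's r for the five data points above and, for comparison, Spearman's rank correlation. The helper names are illustrative, and the simple `ranks` function assumes there are no tied values.

```python
from math import sqrt

ad_spend = [10, 15, 20, 25, 30]     # Rs. thousands
sales = [100, 140, 180, 210, 250]   # Rs. thousands


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their individual spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman correlation is Pearson correlation applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


r = pearson_r(ad_spend, sales)
rho = spearman_rho(ad_spend, sales)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Pearson's r comes out just under 1 (about 0.999), in line with the very strong positive correlation reported above. Spearman's rho is exactly 1 because sales increase every time ad spend increases.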
Introduction to Linear Regression
While correlation measures relationship strength, regression helps predict one variable from another.
Simple Linear Regression
y = mx + b
Where: y = predicted value, m = slope, x = input, b = intercept
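As a small worked example, the slope and intercept can be fitted by ordinary least squares to the ad spend and sales figures from the correlation example. This is a sketch using the textbook formulas; the function names and the prediction at Rs. 35K are illustrative, and predicting beyond the observed range (10K to 30K) is an extrapolation that should be treated with caution.

```python
ad_spend = [10, 15, 20, 25, 30]     # x, in Rs. thousands
sales = [100, 140, 180, 210, 250]   # y, in Rs. thousands


def fit_line(x, y):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Slope = co-variation of x and y divided by the variation of x
    m = sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    # The fitted line always passes through the point (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b


def predict(x_new, m, b):
    return m * x_new + b


m, b = fit_line(ad_spend, sales)
print(f"sales = {m:.1f} * ad_spend + {b:.1f}")
print(f"predicted sales at 35K ad spend: {predict(35, m, b):.0f}K")
```

For this data the fit is m = 7.4 and b = 28, meaning each extra Rs. 1K of ad spend is associated with roughly Rs. 7.4K more in sales.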
Excel Correlation Functions
- =CORREL(array1, array2): Pearson correlation
- =RSQ(known_y, known_x): R-squared value
- =SLOPE(known_y, known_x): Regression slope
- =INTERCEPT(known_y, known_x): Regression intercept
Day 3 Key Takeaways
- Correlation shows relationship strength (-1 to +1)
- Correlation does NOT imply causation
- Pearson for linear relationships; Spearman for monotonic
- Correlation matrices help visualize multiple relationships
- Regression predicts values; correlation measures association
Exploratory Data Analysis
Systematic Data Exploration
What is EDA?
EDA Process and Methodology
- Understand the data: What does each column represent?
- Check data quality: Missing values, duplicates, errors
- Univariate analysis: Examine each variable individually
- Bivariate analysis: Examine relationships between pairs
- Multivariate analysis: Examine complex relationships
- Document findings: Insights, issues, recommendations
Univariate Analysis
Analyzing one variable at a time.
| Variable Type | Analysis Methods | Visualizations |
|---|---|---|
| Numerical | Mean, median, SD, min, max, quartiles | Histogram, box plot, density plot |
| Categorical | Frequency counts, mode, proportions | Bar chart, pie chart |
Bivariate Analysis
Analyzing relationships between two variables.
| Variable Combination | Methods |
|---|---|
| Numerical vs Numerical | Correlation, scatter plot, regression |
| Categorical vs Categorical | Cross-tabulation, chi-square, stacked bar |
| Numerical vs Categorical | Group statistics, box plots by group, ANOVA |
Handling Missing Data
Strategies for Missing Values
- Remove: Delete rows/columns with missing values (if small %)
- Impute with mean/median: For numerical data
- Impute with mode: For categorical data
- Forward/backward fill: For time series
- Flag as missing: Create indicator variable
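A few of these strategies can be sketched in plain Python, with `None` standing in for a missing value. The column names and numbers are hypothetical; with a DataFrame library the same ideas are usually one-liners, but the logic is the same.

```python
from statistics import median

# Hypothetical numerical column with gaps (None marks a missing value)
ages = [34, None, 29, 41, None, 38]

# Flag as missing: keep a record of which rows were originally empty
age_missing = [v is None for v in ages]

# Impute with median: the median of the observed values 29, 34, 38, 41 is 36
observed = [v for v in ages if v is not None]
fill_value = median(observed)
ages_imputed = [fill_value if v is None else v for v in ages]


def forward_fill(values):
    """Carry the last seen value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical daily time series with gaps
daily_sales = [120, None, None, 135, None]

print(ages_imputed)
print(age_missing)
print(forward_fill(daily_sales))
```

Creating the indicator before imputing matters: once the gaps are filled, there is no other way to tell real values from imputed ones.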
Dealing with Outliers
Outlier Detection Methods
- IQR Method: Values below Q1-1.5×IQR or above Q3+1.5×IQR
- Z-score: Values with |z| > 3
- Visual: Box plots, scatter plots
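The two numeric rules can be sketched as follows on a small hypothetical dataset. Quartile conventions differ between tools, so the fences may shift slightly elsewhere; this sketch uses the default method of `statistics.quantiles`. Note also that in very small samples the |z| > 3 rule may be unable to flag anything, because a single extreme value inflates the standard deviation it is measured against.

```python
from statistics import mean, quantiles, stdev

# Hypothetical order values with one suspicious entry (illustrative data only)
values = [12, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 20, 95]

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Z-score method: flag values more than 3 sample standard deviations from the mean
mu, sd = mean(values), stdev(values)
z_outliers = [v for v in values if abs((v - mu) / sd) > 3]

print(f"IQR fences: [{lower_fence}, {upper_fence}] -> {iqr_outliers}")
print(f"z-score rule -> {z_outliers}")
```

Both rules flag the value 95 here. Whether to remove, cap, or keep it is a separate decision that depends on what the value represents.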
Handling Outliers
- Investigate: Are they errors or genuine?
- Remove: If errors or not relevant
- Cap/Floor: Replace with threshold values
- Transform: Log transformation reduces impact
- Keep: If valid and important for analysis
Day 4 Key Takeaways
- EDA is essential before any modeling or analysis
- Start with univariate, then bivariate, then multivariate
- Always check for missing values and outliers
- Choose handling strategies based on context
- Document all findings and data quality issues
A/B Testing & Experimentation
Data-Driven Decision Making
What is A/B Testing?
Variant A (Control)
Original version
50% of traffic
Baseline performance
Variant B (Treatment)
New version with change
50% of traffic
Test performance
Designing Experiments
- Define the goal: What metric are you trying to improve?
- Formulate hypothesis: What change do you expect?
- Calculate sample size: How many users needed?
- Randomize: Randomly assign users to groups
- Run the test: Collect data for sufficient time
- Analyze results: Statistical significance check
- Make decision: Implement winner or iterate
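For the "analyze results" step, one common choice when the metric is a conversion rate is a two-proportion z-test. The sketch below is a minimal version under that assumption; the visitor and conversion counts are hypothetical, and `two_proportion_ztest` is an illustrative name rather than a library function.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test of H0: both variants share the same conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Under H0 the two groups are pooled to estimate the common rate
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_a, rate_b, z, p_value


# Hypothetical results: control (A) vs treatment (B), 15,000 users each
rate_a, rate_b, z, p_value = two_proportion_ztest(450, 15000, 540, 15000)

print(f"A: {rate_a:.1%}, B: {rate_b:.1%}, z = {z:.2f}, p-value = {p_value:.4f}")
print("Significant at 0.05" if p_value <= 0.05 else "Not significant at 0.05")
```

With these made-up counts (3.0% vs 3.6%) the difference is significant, with z around 2.9. Whether a 0.6 point lift is worth shipping is a business question the test does not answer.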
Sample Size Calculation
Factors Affecting Sample Size
- Baseline conversion rate: Current performance
- Minimum detectable effect: Smallest improvement worth detecting
- Significance level (α): Usually 0.05
- Power (1-β): Usually 0.80 (80%)
Example: Website Button Color Test
Goal: Increase sign-up conversion rate
H₀: Button color has no effect on conversions
H₁: New button color increases conversions
Current rate: 3%
Target improvement: 3.5% (16.7% relative increase)
Sample size needed: ~15,000 per variant
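The four factors above combine into a standard normal-approximation formula for comparing two proportions. The source does not say how its figure was derived, so the sketch below is one plausible calculation, not necessarily the one used; the function name and its `one_sided` parameter are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80, one_sided=True):
    """Approximate users needed per variant to detect a change from p1 to p2."""
    norm = NormalDist()
    # A one-sided test puts all of alpha in one tail; a two-sided test splits it
    z_alpha = norm.inv_cdf(1 - alpha) if one_sided else norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    # Sum of the binomial variances of the two groups
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n_one_sided = sample_size_per_variant(0.03, 0.035)
n_two_sided = sample_size_per_variant(0.03, 0.035, one_sided=False)

print(f"one-sided: {n_one_sided} per variant")
print(f"two-sided: {n_two_sided} per variant")
```

Under these assumptions a one-sided test (matching H₁: "increases conversions") needs about 15,500 users per variant, in the same range as the ~15,000 quoted above. A two-sided test would need closer to 19,700.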
Common A/B Testing Pitfalls
Avoid These Mistakes
- Peeking: Checking results too early and stopping
- Small sample: Not enough data for significance
- Multiple testing: Testing many variants inflates false positives
- Novelty effect: Users react to the newness of a change rather than its actual merit, so early lifts can fade
- Selection bias: Non-random group assignment
- Running too short: Miss weekly/monthly patterns
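The multiple-testing pitfall is easy to quantify. Assuming the comparisons are independent and each uses α = 0.05, the chance of at least one false positive grows quickly with the number of tests. The sketch below also shows the Bonferroni adjustment (α divided by the number of tests), which is one simple and conservative remedy among several.

```python
def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


alpha = 0.05
for k in (1, 3, 5, 10):
    fwer = familywise_error_rate(alpha, k)
    bonferroni_alpha = alpha / k  # stricter per-test threshold
    print(f"{k:>2} tests: P(any false positive) = {fwer:.1%}, "
          f"Bonferroni alpha = {bonferroni_alpha:.4f}")
```

With 10 independent comparisons the chance of at least one false positive is about 40%, compared with the 5% most people assume they are working with.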
Day 5 Key Takeaways
- A/B testing enables data-driven decisions
- Proper sample size is crucial for valid results
- Random assignment prevents selection bias
- Avoid peeking - collect the planned sample size before judging significance
- Consider practical significance, not just statistical
Data Visualization for Analysis
Telling Stories with Data
Choosing the Right Chart Type
Bar Chart
Comparing categories
Line Chart
Trends over time
Scatter Plot
Relationships between variables
Histogram
Distribution of values
Box Plot
Distribution + outliers
Heatmap
Correlation matrix
Effective Chart Design Principles
Best Practices
- Clear title: State what the chart shows
- Label axes: Include units
- Start at zero: For bar charts (usually)
- Remove clutter: No unnecessary gridlines, borders
- Use color purposefully: Highlight key insights
- Add context: Benchmarks, targets, annotations
Common Visualization Mistakes
Avoid These Errors
- Truncated y-axis exaggerating differences
- 3D charts that distort perception
- Too many colors or categories
- Pie charts for more than 5 categories
- Missing legends or labels
- Using the wrong chart type for the data
Telling Stories with Data
- Start with the insight: What's the key message?
- Build context: Why does this matter?
- Show evidence: Support with data
- Provide recommendations: What should we do?
- Make it actionable: Clear next steps
Day 6 Key Takeaways
- Choose chart type based on what you're showing
- Keep visualizations clean and purposeful
- Always include titles, labels, and context
- Tell a story - start with the insight
- End with actionable recommendations
Week 5 Project & Assessment
Complete EDA & Statistical Analysis
Project: Complete EDA & Statistical Analysis
Project Requirements
Perform end-to-end exploratory data analysis with statistical testing:
- Comprehensive EDA with visualizations
  - Data overview and quality assessment
  - Univariate and bivariate analysis
  - Handle missing values and outliers
- Hypothesis testing for business questions
  - Formulate at least 2 hypotheses
  - Apply appropriate statistical tests
  - Interpret results in business context
- Correlation analysis and insights
  - Identify key relationships
  - Create correlation matrix
- A/B test analysis and recommendations
  - Analyze provided A/B test data
  - Calculate significance
  - Make recommendations
- Executive summary presentation
  - Key findings and insights
  - Data-driven recommendations
  - Next steps
Assessment Breakdown
| Component | Weight |
|---|---|
| EDA Report with Visualizations | 50% |
| Quiz: Hypothesis Testing & EDA | 30% |
| Presentation Skills | 20% |
Week 5 Summary
- Day 1: Hypothesis testing fundamentals
- Day 2: Statistical tests (t-test, chi-square, ANOVA)
- Day 3: Correlation analysis
- Day 4: EDA methodology
- Day 5: A/B testing
- Day 6: Data visualization
- Day 7: Complete EDA project
Self-Assessment Quiz
1. A p-value of 0.03 with α = 0.05 means:
2. Type I error is:
3. Which test compares means of 3+ groups?
4. A correlation of -0.85 indicates:
5. In A/B testing, "peeking" refers to:
Quiz Answers
1-b, 2-b, 3-c, 4-c, 5-b