Welcome to Week 4! This week marks a pivotal point in your data analytics journey. Statistics is the backbone of data analysis - it's how we make sense of data, draw conclusions, and make predictions with confidence.
Whether you're calculating average sales, determining if a marketing campaign was successful, or predicting customer behavior, you'll use the statistical concepts you learn this week. By the end, you'll be able to analyze data distributions, calculate probabilities, and understand sampling - skills that separate good analysts from great ones.
Don't worry if math isn't your strongest subject. We'll build concepts step by step with practical examples and real-world applications. Let's dive in!
What You'll Learn This Week
Measures of Central Tendency
Understanding Mean, Median, and Mode
Why Statistics for Data Analysts?
Every time you see a report saying "average customer spends Rs. 500" or "typical delivery time is 3 days," you're seeing statistics in action. These numbers help businesses understand their performance and make informed decisions.
Measures of Central Tendency
Central tendency measures tell us where the "center" of our data lies. There are three main measures:
| Measure | Definition | When to Use | Sensitivity to Outliers |
|---|---|---|---|
| Mean | Sum of all values divided by count | Symmetric data without outliers | Highly sensitive |
| Median | Middle value when data is sorted | Skewed data or data with outliers | Resistant (robust) |
| Mode | Most frequently occurring value | Categorical data or finding common values | Not affected |
1. Mean (Average)
The mean is the most common measure of central tendency. It's calculated by adding all values and dividing by the count.
Formula: Arithmetic Mean
x̄ = (x₁ + x₂ + ... + xₙ) / n = Σxᵢ / n
Where x̄ (x-bar) is the mean, xᵢ are the individual data points, and n is the count.
Example: Calculating Mean Salary
A company has 5 employees with salaries: Rs. 30,000, Rs. 35,000, Rs. 40,000, Rs. 45,000, Rs. 50,000
Mean = (30,000 + 35,000 + 40,000 + 45,000 + 50,000) / 5 = Rs. 40,000
The average salary is Rs. 40,000.
Types of Means
Weighted Mean
When some values are more important than others, we use weighted mean.
Example: A student's grades with different credit weights:
- Math (4 credits): 85%
- English (3 credits): 90%
- History (2 credits): 78%
Weighted Mean = (4×85 + 3×90 + 2×78) / (4+3+2) = (340 + 270 + 156) / 9 = 766 / 9 ≈ 85.1%
Trimmed Mean
Remove a percentage of extreme values from both ends before calculating mean. This makes it more robust to outliers.
Example: 10% trimmed mean of [2, 5, 7, 8, 9, 10, 12, 15, 18, 100]
Remove 10% from each end (1 value): Remove 2 and 100
Trimmed Mean = (5+7+8+9+10+12+15+18) / 8 = 10.5
Compare to regular mean: 18.6 (heavily influenced by 100)
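The three means above can be checked with a short Python sketch. The helper names `weighted_mean` and `trimmed_mean` are illustrative (they are not part of any library), and the trimming helper assumes the definition used in this lesson: the stated proportion is removed from each end.

```python
from statistics import mean

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, proportion):
    """Drop `proportion` of the sorted values from EACH end, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    return mean(ordered[k:len(ordered) - k])

salaries = [30_000, 35_000, 40_000, 45_000, 50_000]
grades, credits = [85, 90, 78], [4, 3, 2]
data = [2, 5, 7, 8, 9, 10, 12, 15, 18, 100]

print(mean(salaries))                            # 40000
print(round(weighted_mean(grades, credits), 1))  # 85.1
print(trimmed_mean(data, 0.10))                  # 10.5
print(mean(data))                                # 18.6
```

The last two lines show how much a single extreme value (100) moves the ordinary mean compared with the trimmed mean.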
2. Median (Middle Value)
The median is the middle value when data is arranged in order. It's particularly useful for skewed distributions.
Finding the Median
For odd n: Median = value at position (n+1)/2
For even n: Median = average of values at positions n/2 and (n/2)+1
Example: Finding Median
Odd count: Data: 12, 15, 18, 22, 30 (n=5)
Position = (5+1)/2 = 3 → the 3rd value, so Median = 18
Even count: Data: 12, 15, 18, 22, 30, 35 (n=6)
Positions = 6/2 = 3 and (6/2)+1 = 4 → average of the 3rd and 4th values, so Median = (18+22)/2 = 20
3. Mode (Most Frequent)
The mode is the value that appears most frequently in a dataset.
Example: Finding Mode
T-shirt sizes sold: S, M, M, L, M, XL, L, M, S, M
Count: S=2, M=5, L=2, XL=1
Mode = M (appears 5 times)
Data can be:
- Unimodal: One mode (most common)
- Bimodal: Two modes
- Multimodal: More than two modes
- No mode: All values appear equally
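As a quick check of the median and mode examples, here is a minimal Python sketch using the standard `statistics` module (`multimode` needs Python 3.8 or later). The small bimodal list at the end is made up purely for illustration.

```python
from statistics import median, mode, multimode

odd_data = [12, 15, 18, 22, 30]
even_data = [12, 15, 18, 22, 30, 35]
print(median(odd_data))    # 18
print(median(even_data))   # 20.0

sizes = ["S", "M", "M", "L", "M", "XL", "L", "M", "S", "M"]
print(mode(sizes))         # M

# multimode returns every value tied for most frequent (useful for bimodal data)
bimodal = [1, 1, 2, 2, 3]  # illustrative data, not from the lesson
print(multimode(bimodal))  # [1, 2]
```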
Effect of Outliers
Important: Outliers Affect the Mean!
Consider these salaries: Rs. 30K, 35K, 40K, 45K, 500K (CEO)
- Mean: Rs. 130,000 (misleading!)
- Median: Rs. 40,000 (better representation)
This is why income and housing prices are often reported as median, not mean.
When to Use Each Measure
| Scenario | Best Measure | Reason |
|---|---|---|
| Symmetric data (test scores) | Mean | Uses all data, most precise |
| Income/Housing prices | Median | Skewed by high earners/luxury homes |
| Most popular product size | Mode | Find most common category |
| Data with outliers | Median or Trimmed Mean | Robust to extreme values |
Excel Functions for Central Tendency
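- =AVERAGE(A1:A100) - Calculates mean
- =MEDIAN(A1:A100) - Finds median
- =MODE.SNGL(A1:A100) - Returns single mode
- =MODE.MULT(A1:A100) - Returns multiple modes (array formula)
- =TRIMMEAN(A1:A100, 0.2) - Trimmed mean removing 10% from each end (the second argument is the total fraction excluded, split across both ends)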
Try It Yourself - Day 1 Exercises
- Calculate mean, median, and mode for: 15, 20, 20, 25, 30, 35, 100
- A store sells shoes in sizes: 7, 8, 8, 9, 9, 9, 10, 10, 11. What size should they stock most?
- Student scores with weights: Quiz (20%): 85, Midterm (30%): 78, Final (50%): 92. Find weighted average.
- When would you report median income instead of mean income?
Day 1 Key Takeaways
- Mean is the arithmetic average - sensitive to outliers
- Median is the middle value - robust to outliers
- Mode is the most frequent value - best for categorical data
- Choose the measure based on your data distribution and presence of outliers
- Weighted mean accounts for different importance of values
Measures of Dispersion
Understanding How Spread Out Your Data Is
Why Dispersion Matters
Knowing the center of your data is just half the story. Two datasets can have the same mean but very different spreads:
Same Mean, Different Spread
Dataset A: 48, 49, 50, 51, 52 → Mean = 50
Dataset B: 10, 30, 50, 70, 90 → Mean = 50
Both have mean 50, but Dataset B is much more spread out!
1. Range
The simplest measure of spread - the difference between the largest and smallest values.
Formula: Range
Range = Maximum value - Minimum value
Limitation: Range only uses two values and is heavily affected by outliers.
2. Interquartile Range (IQR)
IQR measures the spread of the middle 50% of data, making it robust to outliers.
Formula: Interquartile Range
IQR = Q3 - Q1
Where Q1 is the 25th percentile and Q3 is the 75th percentile.
Quartiles and Percentiles
| Term | Position | Meaning |
|---|---|---|
| Q1 (First Quartile) | 25th percentile | 25% of data falls below this value |
| Q2 (Second Quartile) | 50th percentile | Median - 50% below, 50% above |
| Q3 (Third Quartile) | 75th percentile | 75% of data falls below this value |
Example: Calculating IQR
Data (sorted): 12, 15, 18, 22, 25, 28, 30, 35, 40, 45, 50, 55
n = 12 values
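Q1 position: (12+1) × 0.25 = 3.25 → between the 3rd value (18) and the 4th value (22) → Q1 = 18 + 0.25 × (22 - 18) = 19
Q3 position: (12+1) × 0.75 = 9.75 → between the 9th value (40) and the 10th value (45) → Q3 = 40 + 0.75 × (45 - 40) = 43.75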
IQR = Q3 - Q1 = 43.75 - 19 = 24.75
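The same quartiles can be reproduced in Python. This is a sketch that assumes the (n+1) × p position method used in the example, which corresponds to `method="exclusive"` in `statistics.quantiles` (Python 3.8+).

```python
from statistics import quantiles

data = [12, 15, 18, 22, 25, 28, 30, 35, 40, 45, 50, 55]

# n=4 splits the data into quartiles and returns the three cut points Q1, Q2, Q3
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q3, iqr)  # 19.0 43.75 24.75
```

Other interpolation conventions exist (for example `method="inclusive"`), and they can return slightly different quartiles for the same data, so it is worth stating which method a report uses.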
3. Variance
Variance measures how far each data point is from the mean, on average. It's the foundation for standard deviation.
Formula: Variance
Population Variance (σ²): σ² = Σ(xᵢ - μ)² / N
Sample Variance (s²): s² = Σ(xᵢ - x̄)² / (n-1)
Note: We divide by (n-1) rather than n for sample variance to get an unbiased estimate of the population variance (Bessel's correction).
Example: Calculating Variance
Data: 4, 8, 6, 5, 7 (sample)
Step 1: Calculate mean: (4+8+6+5+7)/5 = 6
Step 2: Calculate squared deviations:
- (4-6)² = 4
- (8-6)² = 4
- (6-6)² = 0
- (5-6)² = 1
- (7-6)² = 1
Step 3: Sum = 4+4+0+1+1 = 10
Step 4: Sample Variance = 10/(5-1) = 2.5
4. Standard Deviation
Standard deviation is the square root of variance. It's in the same units as the original data, making it easier to interpret.
Formula: Standard Deviation
For population: σ = √[Σ(xᵢ - μ)² / N]
For sample: s = √[Σ(xᵢ - x̄)² / (n-1)]
Interpreting Standard Deviation
- Small SD: Data points are clustered close to the mean
- Large SD: Data points are spread out from the mean
- SD is always ≥ 0 (can't be negative)
- SD = 0 means all values are identical
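To connect variance and standard deviation to the worked example (data 4, 8, 6, 5, 7), here is a small Python sketch using the standard `statistics` module. `variance` and `stdev` divide by (n-1); `pvariance` divides by n.

```python
from statistics import variance, pvariance, stdev

data = [4, 8, 6, 5, 7]

sample_var = variance(data)       # sum of squared deviations / (n - 1)
population_var = pvariance(data)  # sum of squared deviations / n
sample_sd = stdev(data)           # square root of the sample variance

print(sample_var)           # 2.5
print(round(sample_sd, 3))  # 1.581
```

The population variance for the same five values is 10/5 = 2, smaller than the sample variance, which is the effect of Bessel's correction.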
5. Coefficient of Variation (CV)
CV expresses standard deviation as a percentage of the mean, allowing comparison between datasets with different units or scales.
Formula: Coefficient of Variation
CV = (Standard Deviation / Mean) × 100%
Example: Comparing Variability
Stock A: Mean return = Rs. 100, SD = Rs. 20 → CV = 20%
Stock B: Mean return = Rs. 500, SD = Rs. 50 → CV = 10%
Stock B has higher absolute SD but lower relative variability (lower risk per unit return).
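The comparison above is a one-line calculation. In this sketch, `coefficient_of_variation` is an illustrative helper name, and it assumes a positive mean.

```python
def coefficient_of_variation(mean, sd):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * sd / mean

cv_a = coefficient_of_variation(mean=100, sd=20)
cv_b = coefficient_of_variation(mean=500, sd=50)

print(cv_a)  # 20.0
print(cv_b)  # 10.0
```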
6. Five-Number Summary and Box Plots
The five-number summary provides a complete picture of data distribution:
| Component | Description |
|---|---|
| Minimum | Smallest value |
| Q1 | First quartile (25th percentile) |
| Median (Q2) | Middle value (50th percentile) |
| Q3 | Third quartile (75th percentile) |
| Maximum | Largest value |
Identifying Outliers with IQR
IQR Outlier Rule
Lower bound = Q1 - 1.5 × IQR
Upper bound = Q3 + 1.5 × IQR
Values outside these bounds are considered outliers.
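A minimal sketch of the 1.5 × IQR rule in Python, applied to the dataset from the trimmed-mean example on Day 1. The function name `iqr_outliers` is illustrative, and the quartiles are computed with the (n+1) position method (`method="exclusive"`), which is an assumption; other quartile conventions can shift the bounds slightly.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [2, 5, 7, 8, 9, 10, 12, 15, 18, 100]
print(iqr_outliers(data))  # [100]
```

Here Q1 = 6.5 and Q3 = 15.75, so the bounds are -7.375 and 29.625, and only 100 falls outside them.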
Excel Functions for Dispersion
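- =MIN(A1:A100), =MAX(A1:A100) - Range components (Range = MAX - MIN)
- =QUARTILE.INC(A1:A100, 1) - First quartile
- =QUARTILE.INC(A1:A100, 3) - Third quartile
- =VAR.S(A1:A100) - Sample variance
- =VAR.P(A1:A100) - Population variance
- =STDEV.S(A1:A100) - Sample standard deviation
- =STDEV.P(A1:A100) - Population standard deviation
- =PERCENTILE.INC(A1:A100, 0.9) - 90th percentile
Note: QUARTILE.INC interpolates at position (n-1) × p + 1, so its results can differ slightly from the (n+1) × p method used in the IQR example. QUARTILE.EXC uses the (n+1) × p positions.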
Try It Yourself - Day 2 Exercises
- Calculate range, variance, and standard deviation for: 10, 15, 20, 25, 30
- Find the five-number summary for: 2, 5, 7, 8, 12, 15, 18, 22, 35, 50
- Identify outliers using the IQR method for the above dataset
- Compare two products: Product A (mean=100, SD=15) vs Product B (mean=200, SD=25). Which has more relative variability?
Day 2 Key Takeaways
- Range is simple but sensitive to outliers
- IQR measures spread of middle 50% - robust to outliers
- Variance measures average squared distance from mean
- Standard deviation is in original units - easier to interpret
- CV allows comparison across different scales
- Use IQR rule (1.5 × IQR) to identify outliers
Data Distributions & Shape
Understanding How Your Data is Distributed
Frequency Distributions and Histograms
A frequency distribution shows how often each value (or range of values) occurs in a dataset. Histograms visualize this distribution.
Example: Test Scores Frequency Distribution
| Score Range | Frequency | Relative Frequency |
|---|---|---|
| 50-59 | 5 | 10% |
| 60-69 | 12 | 24% |
| 70-79 | 18 | 36% |
| 80-89 | 10 | 20% |
| 90-100 | 5 | 10% |
Skewness: The Shape of Distribution
Skewness measures the asymmetry of a distribution around its mean.
| Type | Description | Mean vs Median | Example |
|---|---|---|---|
| Left (Negative) Skewed | Tail extends to the left | Mean < Median | Age at retirement, Easy test scores |
| Symmetric | Mirror image on both sides | Mean ≈ Median | Heights, IQ scores |
| Right (Positive) Skewed | Tail extends to the right | Mean > Median | Income, House prices, Wait times |
Remember: The Tail Tells the Tale!
The skew is named for the direction of the tail, not the peak.
- Right-skewed: Long tail to the RIGHT → Mean pulled RIGHT (higher)
- Left-skewed: Long tail to the LEFT → Mean pulled LEFT (lower)
Kurtosis: Peaks and Tails
Kurtosis measures the "tailedness" of a distribution - how much data is in the tails compared to a normal distribution.
| Type | Kurtosis | Description |
|---|---|---|
| Mesokurtic | = 3 (or excess = 0) | Normal distribution (baseline) |
| Leptokurtic | > 3 (excess > 0) | Heavy tails, sharp peak, more outliers |
| Platykurtic | < 3 (excess < 0) | Light tails, flat peak, fewer outliers |
The Normal Distribution
The normal (Gaussian) distribution is the most important distribution in statistics. It's bell-shaped and symmetric around the mean.
- Symmetric around the mean
- Mean = Median = Mode
- Defined by two parameters: mean (μ) and standard deviation (σ)
- Total area under curve = 1 (100%)
- Tails extend to infinity but never touch the x-axis
The Empirical Rule (68-95-99.7 Rule)
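For normally distributed data:
- About 68% of values fall within 1 standard deviation of the mean (μ ± 1σ)
- About 95% of values fall within 2 standard deviations of the mean (μ ± 2σ)
- About 99.7% of values fall within 3 standard deviations of the mean (μ ± 3σ)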
Example: Applying the Empirical Rule
IQ scores are normally distributed with mean = 100 and SD = 15.
- 68% of people have IQ between 85-115 (100 ± 15)
- 95% of people have IQ between 70-130 (100 ± 30)
- 99.7% of people have IQ between 55-145 (100 ± 45)
An IQ of 145+ is in the top 0.15% (very rare!)
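The 68-95-99.7 percentages can be verified with Python's built-in `statistics.NormalDist` (Python 3.8+). This is a sketch; `share_within` is an illustrative helper name.

```python
from statistics import NormalDist

def share_within(dist, k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

iq = NormalDist(mu=100, sigma=15)

for k in (1, 2, 3):
    print(k, round(share_within(iq, k) * 100, 1))
# prints: 1 68.3, then 2 95.4, then 3 99.7

# Upper tail beyond 3 SD (IQ of 145 or more), as a fraction
upper_tail = 1 - iq.cdf(145)
```

The exact upper tail is about 0.13%, slightly below the 0.15% obtained from the rounded 99.7% figure.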
Z-Scores and Standardization
A Z-score tells you how many standard deviations a value is from the mean.
Formula: Z-Score
Z = (X - μ) / σ
Where X is the value, μ is the mean, and σ is the standard deviation.
Interpreting Z-Scores
- Z = 0: Value equals the mean
- Z = 1: Value is 1 SD above the mean
- Z = -2: Value is 2 SD below the mean
- |Z| > 3: Value is an extreme outlier
Example: Calculating and Interpreting Z-Scores
Test scores: Mean = 75, SD = 10
Student A scored 85: Z = (85-75)/10 = 1.0 → 1 SD above average
Student B scored 60: Z = (60-75)/10 = -1.5 → 1.5 SD below average
Student C scored 95: Z = (95-75)/10 = 2.0 → Top ~2.5% (excellent!)
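The three Z-scores above follow directly from Z = (X - mean) / SD. A minimal Python sketch, with `z_score` as an illustrative helper name:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

scores = {"Student A": 85, "Student B": 60, "Student C": 95}
z_scores = {name: z_score(x, mean=75, sd=10) for name, x in scores.items()}

for name, z in z_scores.items():
    print(name, z)
# Student A 1.0
# Student B -1.5
# Student C 2.0
```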
Why Standardize?
Z-scores allow you to:
- Compare values from different distributions
- Identify outliers (|Z| > 2 or 3)
- Calculate probabilities using standard normal tables
- Create a common scale for different variables
Excel Functions for Distribution Shape
- =SKEW(A1:A100) - Calculate skewness
- =KURT(A1:A100) - Calculate excess kurtosis
- =STANDARDIZE(x, mean, std_dev) - Calculate Z-score
- =NORM.DIST(x, mean, std_dev, TRUE) - Cumulative probability
- =NORM.INV(probability, mean, std_dev) - Value for given percentile
Try It Yourself - Day 3 Exercises
- Heights of adults are normally distributed with mean=170cm, SD=10cm. What percentage are between 160cm and 180cm?
- Calculate Z-scores for students with scores 45, 72, 88 if mean=70 and SD=12. Who performed best relative to the class?
- A dataset has mean=50 and median=62. Is it left-skewed, right-skewed, or symmetric?
- If SAT scores have mean=1050, SD=200, what score puts you in the top 2.5%?
Day 3 Key Takeaways
- Histograms visualize frequency distributions
- Skewness shows direction of tail; affects mean vs median relationship
- Kurtosis measures tail heaviness and peak sharpness
- Normal distribution: symmetric, bell-shaped, defined by mean and SD
- Empirical Rule: 68-95-99.7% within 1-2-3 standard deviations
- Z-scores standardize data for comparison and probability calculation
Probability Fundamentals
The Mathematics of Uncertainty
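What is Probability?
Probability measures how likely an event is to occur, on a scale from 0 (impossible) to 1 (certain), or equivalently from 0% to 100%.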
Types of Probability
| Type | Definition | Example |
|---|---|---|
| Classical (Theoretical) | Based on equally likely outcomes | P(heads) = 1/2 for a fair coin |
| Empirical (Experimental) | Based on observed data | 30% of customers buy Product A (from past data) |
| Subjective | Based on judgment/experience | "I think there's a 70% chance it will rain" |
Basic Probability Formula
P(Event) = Number of favorable outcomes / Total number of possible outcomes
Example: Rolling a Die
What's the probability of rolling a number greater than 4?
Favorable outcomes: {5, 6} = 2 outcomes
Total outcomes: {1, 2, 3, 4, 5, 6} = 6 outcomes
P(X > 4) = 2/6 = 1/3 ≈ 33.3%
Complement Rule
The probability of an event NOT happening is 1 minus the probability it does happen.
Complement Rule
P(not A) = 1 - P(A)
Example: Complement
If P(rain) = 0.3, then P(no rain) = 1 - 0.3 = 0.7 (70%)
Addition Rule (OR)
For finding the probability of event A OR event B occurring.
Addition Rule
For mutually exclusive events: P(A or B) = P(A) + P(B)
For non-mutually exclusive events: P(A or B) = P(A) + P(B) - P(A and B)
Example: Addition Rule
Mutually Exclusive: P(rolling 1 OR 6) = 1/6 + 1/6 = 2/6 = 1/3
Non-Mutually Exclusive: In a class, 40% play cricket, 30% play football, 15% play both.
P(cricket OR football) = 0.40 + 0.30 - 0.15 = 0.55 (55%)
Multiplication Rule (AND)
For finding the probability of event A AND event B both occurring.
Multiplication Rule
For independent events: P(A and B) = P(A) × P(B)
For dependent events: P(A and B) = P(A) × P(B|A)
Example: Multiplication Rule
Independent: Flip two coins. P(both heads) = 1/2 × 1/2 = 1/4
Dependent: Draw 2 cards without replacement. P(both aces)?
P(1st ace) = 4/52, P(2nd ace | 1st was ace) = 3/51
P(both aces) = (4/52) × (3/51) = 12/2652 = 1/221 ≈ 0.45%
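The addition and multiplication examples can be checked exactly with Python's `fractions.Fraction`, which avoids floating-point rounding. This is a sketch; the variable names are illustrative.

```python
from fractions import Fraction

# Addition rule, mutually exclusive events: rolling a 1 OR a 6
p_one_or_six = Fraction(1, 6) + Fraction(1, 6)

# Addition rule with overlap: cricket OR football, minus those who play both
p_cricket_or_football = Fraction(40, 100) + Fraction(30, 100) - Fraction(15, 100)

# Multiplication rule, independent events: two coin flips both heads
p_two_heads = Fraction(1, 2) * Fraction(1, 2)

# Multiplication rule, dependent events: two aces without replacement
p_two_aces = Fraction(4, 52) * Fraction(3, 51)

print(p_one_or_six)           # 1/3
print(p_cricket_or_football)  # 11/20
print(p_two_heads)            # 1/4
print(p_two_aces)             # 1/221
```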
Conditional Probability
The probability of A given that B has already occurred.
Conditional Probability Formula
P(A|B) = P(A and B) / P(B)
Read as "probability of A given B"
Example: Conditional Probability
In a company: 60% are developers, 40% are managers. Among developers, 30% are senior. Among managers, 50% are senior.
Given that someone is senior, what's the probability they're a developer?
P(Senior and Developer) = 0.60 × 0.30 = 0.18
P(Senior and Manager) = 0.40 × 0.50 = 0.20
P(Senior) = 0.18 + 0.20 = 0.38
P(Developer|Senior) = 0.18 / 0.38 = 47.4%
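The steps of this example translate line by line into Python. This is a sketch with illustrative variable names, using the percentages stated above.

```python
p_dev, p_mgr = 0.60, 0.40
p_senior_given_dev, p_senior_given_mgr = 0.30, 0.50

# Joint probabilities: P(Senior and role) = P(role) * P(Senior | role)
p_senior_and_dev = p_dev * p_senior_given_dev
p_senior_and_mgr = p_mgr * p_senior_given_mgr

# Total probability of being senior, summed over both roles
p_senior = p_senior_and_dev + p_senior_and_mgr

# Conditional probability: P(Developer | Senior)
p_dev_given_senior = p_senior_and_dev / p_senior

print(round(p_senior, 2))            # 0.38
print(round(p_dev_given_senior, 3))  # 0.474
```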
Bayes' Theorem
Bayes' theorem allows us to update probabilities based on new evidence.
Bayes' Theorem
P(A|B) = P(B|A) × P(A) / P(B)
This lets us "reverse" conditional probabilities.
Example: Medical Testing (Classic Bayes' Application)
A disease affects 1% of the population. A test is 99% accurate (detects disease when present) and has a 5% false positive rate.
If you test positive, what's the probability you have the disease?
- P(Disease) = 0.01
- P(Positive|Disease) = 0.99
- P(Positive|No Disease) = 0.05
P(Positive) = P(Pos|Disease)×P(Disease) + P(Pos|No Disease)×P(No Disease)
P(Positive) = 0.99×0.01 + 0.05×0.99 = 0.0099 + 0.0495 = 0.0594
P(Disease|Positive) = (0.99 × 0.01) / 0.0594 = 16.7%
Even with a positive test, there's only a 16.7% chance of actually having the disease!
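The medical-test calculation can be wrapped in a small function so the effect of the base rate is easy to explore. This is a sketch; `bayes_posterior` and its parameter names are illustrative, and the 10% prevalence in the second call is a hypothetical value chosen only for comparison.

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) from the prior and the two test error rates."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# Values from the example: 1% prevalence, 99% detection, 5% false positives
posterior_rare = bayes_posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(round(posterior_rare, 3))  # 0.167

# Same test, hypothetical 10% prevalence: the posterior rises sharply
posterior_common = bayes_posterior(prior=0.10, sensitivity=0.99, false_positive_rate=0.05)
print(round(posterior_common, 3))  # 0.688
```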
Base Rate Fallacy
The example above shows why we must consider the base rate (prevalence). A "99% accurate" test can still give mostly false positives when the condition is rare!
Try It Yourself - Day 4 Exercises
- What's the probability of drawing a King OR a Heart from a standard deck?
- If P(sunny) = 0.7 and P(traffic|sunny) = 0.3, P(traffic|not sunny) = 0.6, what's P(traffic)?
- A spam filter is 98% accurate at detecting spam and has a 3% false positive rate. If 20% of emails are spam, what's P(spam|flagged)?
- You flip 4 coins. What's the probability of getting exactly 2 heads?
Day 4 Key Takeaways
- Probability ranges from 0 to 1 (or 0% to 100%)
- Complement Rule: P(not A) = 1 - P(A)
- Addition Rule: For OR, add probabilities (subtract overlap if not mutually exclusive)
- Multiplication Rule: For AND, multiply probabilities (use conditional if dependent)
- Conditional Probability: P(A|B) = P(A and B) / P(B)
- Bayes' Theorem: Update beliefs with new evidence; beware of base rate fallacy
Common Probability Distributions
Patterns in Random Events
Discrete vs Continuous Distributions
| Discrete | Continuous |
|---|---|
| Countable outcomes (0, 1, 2, 3...) | Infinite outcomes in a range |
| Number of customers, defects, sales | Height, weight, time, temperature |
| Probability Mass Function (PMF) | Probability Density Function (PDF) |
| Binomial, Poisson | Normal, Exponential |
1. Binomial Distribution
Models the number of successes in a fixed number of independent trials, each with the same probability of success.
Binomial Distribution
Use when: Fixed number of trials, two outcomes (success/failure), constant probability, independent trials.
Parameters: n (number of trials), p (probability of success)
Mean: μ = n × p
Variance: σ² = n × p × (1-p)
Formula: P(X = k) = C(n,k) × p^k × (1-p)^(n-k)
Where C(n,k) = n! / (k! × (n-k)!)
Example: Customer Conversion
A website has a 10% conversion rate. Out of 20 visitors, what's the probability exactly 3 will convert?
n = 20, p = 0.10, k = 3
C(20,3) = 1140
P(X=3) = 1140 × (0.10)³ × (0.90)¹⁷ ≈ 0.190 (19%)
Expected conversions: 20 × 0.10 = 2
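The conversion example can be reproduced with the binomial formula P(X = k) = C(n,k) × p^k × (1-p)^(n-k). This is a sketch using only the standard library (`math.comb` needs Python 3.8+); `binomial_pmf` is an illustrative helper name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.10
print(comb(n, 3))                      # 1140
print(round(binomial_pmf(3, n, p), 3)) # 0.19

# The probabilities over all possible k must sum to 1
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
```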
2. Poisson Distribution
Models the number of events occurring in a fixed interval of time or space, when events occur independently at a constant rate.
Poisson Distribution
Use when: Counting events in fixed time/space, events are rare, independent, rate is constant.
Parameter: λ (lambda) = average rate of occurrence
Mean: μ = λ
Variance: σ² = λ
Formula: P(X = k) = (λ^k × e^(-λ)) / k!
Example: Customer Service Calls
A call center receives an average of 5 calls per hour. What's the probability of receiving exactly 3 calls in an hour?
λ = 5, k = 3
P(X=3) = (5³ × e⁻⁵) / 3! = (125 × 0.0067) / 6 ≈ 0.140 (14%)
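The call-center calculation uses the Poisson formula P(X = k) = λ^k × e^(-λ) / k!. A minimal Python sketch, with `poisson_pmf` as an illustrative helper name:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count per interval is lam."""
    return lam**k * exp(-lam) / factorial(k)

p_three_calls = poisson_pmf(3, lam=5)
print(round(p_three_calls, 3))  # 0.14
```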
3. Normal Distribution (Revisited)
The most important continuous distribution - many natural phenomena follow it.
Normal Distribution
Use when: Continuous data, symmetric distribution, result of many small independent factors.
Parameters: μ (mean), σ (standard deviation)
Properties: Symmetric, mean=median=mode, 68-95-99.7 rule applies.
Example: Test Scores
Exam scores are normally distributed with mean=72, SD=8.
What percentage of students scored above 80?
Z = (80-72)/8 = 1.0
P(Z > 1.0) = 1 - P(Z ≤ 1.0) = 1 - 0.8413 = 0.1587 (15.87%)
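Instead of a Z-table, the same tail probability can be read from Python's `statistics.NormalDist` (Python 3.8+). This is a sketch of the calculation above.

```python
from statistics import NormalDist

scores = NormalDist(mu=72, sigma=8)

z = (80 - 72) / 8               # Z-score of the cutoff
p_above_80 = 1 - scores.cdf(80) # upper-tail probability P(X > 80)

print(z, round(p_above_80, 4))  # 1.0 0.1587
```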
4. Exponential Distribution
Models the time between events in a Poisson process - waiting times, lifetimes, etc.
Exponential Distribution
Use when: Time until next event, memoryless process (past doesn't affect future).
Parameter: λ (rate) or 1/λ = μ (mean)
Mean: μ = 1/λ
Variance: σ² = 1/λ²
Formula (cumulative): P(X ≤ x) = 1 - e^(-λx)
Example: Time Between Customers
Customers arrive at a rate of 10 per hour (λ = 10). What's the probability the next customer arrives within 5 minutes (1/12 hour)?
P(X ≤ 1/12) = 1 - e^(-10 × 1/12) = 1 - e^(-0.833)
P(X ≤ 1/12) = 1 - 0.435 = 0.565 (56.5%)
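The waiting-time calculation uses the exponential cumulative probability P(X ≤ x) = 1 - e^(-λx). A short Python sketch, with `exponential_cdf` as an illustrative helper name; the rate and the time must be in the same units (here, hours).

```python
from math import exp

def exponential_cdf(x, rate):
    """Probability that the waiting time is at most x, for events arriving at `rate` per unit time."""
    return 1 - exp(-rate * x)

# 10 customers per hour; 5 minutes = 1/12 hour
p_within_5_min = exponential_cdf(1 / 12, rate=10)
print(round(p_within_5_min, 3))  # 0.565
```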
Choosing the Right Distribution
| Question | Distribution | Example |
|---|---|---|
| How many successes in n trials? | Binomial | Defects in batch of 100 |
| How many events in fixed time? | Poisson | Emails per hour |
| Where does this measurement fall? | Normal | Student height, test scores |
| How long until next event? | Exponential | Time until next customer |
Excel Functions for Distributions
- =BINOM.DIST(k, n, p, FALSE) - Binomial P(X=k)
- =BINOM.DIST(k, n, p, TRUE) - Binomial P(X≤k)
- =POISSON.DIST(k, lambda, FALSE) - Poisson P(X=k)
- =NORM.DIST(x, mean, std, TRUE) - Normal P(X≤x)
- =NORM.INV(prob, mean, std) - Normal percentile
- =EXPON.DIST(x, lambda, TRUE) - Exponential P(X≤x)
Try It Yourself - Day 5 Exercises
- A factory has 5% defect rate. In a sample of 50, what's P(exactly 3 defects)?
- A website gets 100 visits per hour. What's P(more than 120 visits)?
- Heights are normal with mean=165cm, SD=7cm. What height is the 95th percentile?
- If buses arrive every 15 minutes on average, what's P(waiting more than 20 minutes)?
Day 5 Key Takeaways
- Binomial: Fixed trials, count successes (defects, conversions, yes/no)
- Poisson: Count events in fixed time/space (calls, arrivals, errors)
- Normal: Continuous measurements, symmetric (heights, scores, errors)
- Exponential: Time between events, memoryless (wait times, lifetimes)
- Match the distribution to the question being asked
Sampling & Estimation
Making Inferences About Populations
Population vs Sample
- Population: The entire group you want to study (e.g., all customers, all products)
- Sample: A subset of the population used to make inferences
- Parameter: A measure describing the population (μ, σ)
- Statistic: A measure calculated from a sample (x̄, s)
Why sample? Studying entire populations is often impossible, too expensive, or destructive (quality testing).
Sampling Methods
| Method | Description | When to Use |
|---|---|---|
| Simple Random | Every member has equal chance of selection | Homogeneous population |
| Stratified | Divide into groups (strata), sample from each | Population has distinct subgroups |
| Cluster | Divide into clusters, randomly select clusters | Population spread geographically |
| Systematic | Select every k-th member | Ordered list available |
| Convenience | Use whoever is available | Quick, informal insights (biased!) |
Example: Choosing a Sampling Method
Scenario: Survey customer satisfaction for a bank with branches across India.
Simple Random: Randomly select 1000 customers from the entire customer database.
Stratified: Sample proportionally from different account types (Savings, Current, Fixed Deposit).
Cluster: Randomly select 50 branches, survey all customers at those branches.
Sampling Distribution of the Mean
If we take many samples and calculate the mean of each, these sample means form their own distribution.
Properties of Sampling Distribution
- Center: Mean of sample means = Population mean (μ_x̄ = μ)
- Spread: Standard error = σ / √n (smaller than population SD)
- Shape: Approximately normal (if n is large enough - CLT)
Standard Error of the Mean
SE = σ / √n
As sample size increases, standard error decreases (estimates become more precise).
Central Limit Theorem (CLT)
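One of the most important theorems in statistics!
Whatever the shape of the population distribution (as long as its variance is finite), the distribution of sample means approaches a normal distribution as the sample size n grows. A common rule of thumb is n ≥ 30.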
CLT in Action
Rolling a die has a uniform distribution (each number equally likely).
Population mean = 3.5, Population SD = 1.71
If we take samples of n=30 and calculate means:
- Mean of sample means ≈ 3.5
- Standard error = 1.71/√30 = 0.31
- Distribution of sample means is approximately normal!
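The die example can be simulated to see the Central Limit Theorem at work. This is a sketch under stated assumptions: the number of repeated samples (5,000) and the random seed are arbitrary choices for illustration, and the exact output varies with them.

```python
import random
from statistics import mean, stdev

random.seed(42)  # arbitrary seed so the run is repeatable

def sample_mean_of_rolls(n):
    """Roll a fair six-sided die n times and return the average of the rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# Draw many samples of size 30 and keep the mean of each one
sample_means = [sample_mean_of_rolls(30) for _ in range(5_000)]

center = mean(sample_means)   # should be close to the population mean, 3.5
spread = stdev(sample_means)  # should be close to the standard error, 1.71/sqrt(30)

print(round(center, 2), round(spread, 2))
```

A histogram of `sample_means` would look roughly bell-shaped even though single die rolls are uniformly distributed.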
Confidence Intervals
A confidence interval gives a range of plausible values for a population parameter.
Confidence Interval for Mean (known σ)
CI = x̄ ± Z × (σ / √n)
Common Z values: 90% CI → Z=1.645, 95% CI → Z=1.96, 99% CI → Z=2.576
Confidence Interval for Mean (unknown σ, use t-distribution)
CI = x̄ ± t × (s / √n)
Use the t-distribution with (n-1) degrees of freedom when σ is estimated by the sample standard deviation s.
Example: Calculating a 95% Confidence Interval
A sample of 36 customers shows average spend of Rs. 500 with SD of Rs. 120.
n = 36, x̄ = 500, s = 120
Standard Error = 120/√36 = 20
For 95% CI with df=35, t ≈ 2.03
CI = 500 ± 2.03 × 20 = 500 ± 40.6
95% CI: (Rs. 459.40, Rs. 540.60)
Interpretation: We are 95% confident the true average spend is between Rs. 459.40 and Rs. 540.60.
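The interval above is x̄ ± t × (s / √n). Here is a minimal Python sketch of that calculation. The standard library has no t-distribution, so the critical value t ≈ 2.03 (df = 35) is passed in by hand, as in the example; `t_confidence_interval` is an illustrative helper name.

```python
from math import sqrt

def t_confidence_interval(sample_mean, sample_sd, n, t_critical):
    """Confidence interval for a mean: sample_mean +/- t_critical * (sample_sd / sqrt(n))."""
    standard_error = sample_sd / sqrt(n)
    margin = t_critical * standard_error
    return sample_mean - margin, sample_mean + margin

low, high = t_confidence_interval(sample_mean=500, sample_sd=120, n=36, t_critical=2.03)
print(round(low, 2), round(high, 2))  # 459.4 540.6
```

In Excel, the critical value itself can be obtained with =T.INV.2T(0.05, 35).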
Common Misconception
A 95% CI does NOT mean there's a 95% probability the true mean is in this interval. The true mean is fixed - it either is or isn't in the interval.
Instead: If we repeated this sampling process many times, 95% of the intervals would contain the true mean.
Factors Affecting Confidence Interval Width
| Factor | Effect on CI Width |
|---|---|
| Increase sample size (n) | Narrower CI (more precise) |
| Increase confidence level | Wider CI (more certain) |
| Larger population SD | Wider CI (more variability) |
Excel Functions for Sampling & Estimation
- =CONFIDENCE.T(alpha, std_dev, size) - Margin of error (t-distribution)
- =CONFIDENCE.NORM(alpha, std_dev, size) - Margin of error (normal)
- =T.INV.2T(alpha, df) - Two-tailed t critical value
- =NORM.S.INV(probability) - Z critical value
Try It Yourself - Day 6 Exercises
- A sample of 49 products has mean weight 100g, SD 14g. Calculate 95% CI for true mean weight.
- If you want to cut the margin of error in half, how should sample size change?
- Why would stratified sampling be better than simple random sampling for a salary survey across different departments?
- Population SD is 50. How large a sample is needed for a margin of error of 5 at 95% confidence?
Day 6 Key Takeaways
- Samples allow inference about populations without measuring everyone
- Different sampling methods suit different situations
- Central Limit Theorem: Sample means are approximately normal for large n
- Standard error measures precision: SE = σ/√n
- Confidence intervals give a range of plausible values for parameters
- Larger samples give narrower (more precise) confidence intervals
Week 4 Project & Assessment
Apply Your Statistical Knowledge
Project: Statistical Analysis Report
Create a comprehensive statistical analysis of a business dataset, applying all concepts learned this week.
Project Requirements
- Dataset: Use the provided employee salary dataset or your own data (minimum 100 records)
- Descriptive Statistics:
- Calculate mean, median, mode for key variables
- Calculate range, IQR, variance, standard deviation
- Create five-number summary and identify outliers
- Distribution Analysis:
- Create histograms for numeric variables
- Calculate skewness and kurtosis
- Test for normality visually
- Calculate and interpret Z-scores
- Probability Calculations:
- Calculate relevant probabilities using appropriate distributions
- Apply conditional probability to a business question
- Sampling & Estimation:
- Calculate 95% confidence interval for mean salary
- Explain sampling strategy recommendations
- Insights & Recommendations:
- Summarize key findings
- Provide data-driven recommendations
Sample Project Structure
Statistical Analysis: Employee Compensation Study
1. Executive Summary
Brief overview of findings...
2. Data Overview
Dataset description, variables, data quality...
3. Descriptive Statistics
| Measure | Salary (Rs.) | Experience (Years) |
|---|---|---|
| Mean | 65,000 | 7.2 |
| Median | 55,000 | 6.0 |
| Std Dev | 25,000 | 4.5 |
4. Distribution Analysis
Salary is right-skewed (skewness = 1.2), suggesting a few high earners...
5. Key Findings & Recommendations
Based on our analysis, we recommend...
Assessment Breakdown
| Component | Weight | Criteria |
|---|---|---|
| Statistical Analysis Report | 50% | Completeness, accuracy, insights |
| Quiz: Statistics Fundamentals | 30% | Conceptual understanding |
| Problem-Solving Exercises | 20% | Calculation accuracy, interpretation |
Week 4 Summary
- Day 1: Central tendency - Mean, Median, Mode for summarizing data
- Day 2: Dispersion - Range, IQR, Variance, SD for measuring spread
- Day 3: Distributions - Shape, Skewness, Normal distribution, Z-scores
- Day 4: Probability - Rules, Conditional probability, Bayes' theorem
- Day 5: Distributions - Binomial, Poisson, Normal, Exponential
- Day 6: Sampling - Methods, CLT, Confidence intervals
- Day 7: Project applying all concepts
Self-Assessment Quiz
Test your understanding of this week's concepts:
1. A dataset has mean=50 and median=65. The distribution is:
2. Which measure of central tendency is most affected by outliers?
3. According to the Empirical Rule, what percentage of data falls within 2 standard deviations of the mean in a normal distribution?
4. A student scored 85 on a test where mean=75 and SD=5. What is their Z-score?
5. If P(A)=0.4 and P(B)=0.3, and A and B are independent, what is P(A and B)?
6. Which distribution would you use to model the number of customer arrivals per hour?
7. As sample size increases, what happens to the standard error?
8. A 95% confidence interval means:
9. Which sampling method divides population into subgroups and samples from each?
10. The Central Limit Theorem states that:
Quiz Answers
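1-a, 2-c, 3-b, 4-b, 5-b, 6-c, 7-b, 8-c, 9-b, 10-b
In words: 1. Left-skewed (mean < median) | 2. Mean | 3. About 95% | 4. Z = (85-75)/5 = 2 | 5. 0.4 × 0.3 = 0.12 | 6. Poisson | 7. It decreases | 8. If the sampling were repeated many times, about 95% of the intervals would contain the true parameter | 9. Stratified sampling | 10. Sample means are approximately normally distributed for large n, whatever the shape of the population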
Scoring: 9-10 correct: Excellent! | 7-8: Good understanding | 5-6: Review needed | Below 5: Re-study the material