Welcome to Week 3! You've mastered the basics of SQL - now it's time to unlock its full power. This week, we'll dive into advanced techniques that separate junior analysts from senior data professionals.
The techniques you'll learn this week - subqueries, window functions, and CTEs - are used daily in real-world analytics. They're also among the most common topics in data analyst interviews. By the end of this week, you'll be writing sophisticated queries that can answer complex business questions in a single statement.
Let's level up your SQL skills!
What You'll Learn This Week
- Day 1: Subqueries & Nested Queries
- Day 2: Correlated Subqueries
- Day 3: Window Functions - Part 1 (Ranking)
- Day 4: Window Functions - Part 2 (Aggregates & Navigation)
- Day 5: Common Table Expressions (CTEs)
- Day 6: Query Optimization & Best Practices
- Day 7: Week 3 Project & Assessment
- Self-Assessment Quiz
- About the Instructor
Subqueries & Nested Queries
Queries Within Queries - The Foundation of Complex Analysis
What is a Subquery?
Think of a subquery as asking a question to get an answer, then using that answer to ask another question. For example: "What's the average salary?" (subquery) then "Show me all employees who earn more than that average" (outer query).
Why Use Subqueries?
- Dynamic filtering: Filter based on calculated values, not hardcoded numbers
- Multi-step analysis: Break complex problems into logical steps
- Data comparison: Compare rows against aggregated results
- Existence checks: Verify if related records exist
Types of Subqueries
| Type | Returns | Used With | Example Use Case |
|---|---|---|---|
| Scalar | Single value | =, >, <, >=, <= | Find products priced above average |
| Column | List of values | IN, NOT IN, ANY, ALL | Find customers who ordered specific products |
| Table | Result set (rows & columns) | FROM clause, JOINs | Join with aggregated summary data |
Scalar Subqueries
A scalar subquery returns exactly one value (one row, one column). It can be used anywhere a single value is expected.
Example: Find Products Above Average Price
The subquery (SELECT AVG(price) FROM products) executes first, returning a single number. The outer query then uses this value to filter the results.
Example: Compare to Department Average
Column Subqueries (List of Values)
A column subquery returns multiple values (multiple rows, one column). Use with IN, NOT IN, ANY, or ALL operators.
Example: Find Customers Who Made Purchases
Example: Find Products Never Ordered
Watch Out: NULL Values with NOT IN
If the subquery returns any NULL values, NOT IN will return no results! Always add WHERE column IS NOT NULL in your subquery when using NOT IN.
Table Subqueries (Derived Tables)
A table subquery returns a complete result set that can be used in the FROM clause like a regular table. It must have an alias.
Example: Join with Aggregated Data
Subqueries in Different Clauses
In the SELECT Clause
In the FROM Clause
In the WHERE Clause
EXISTS and NOT EXISTS
EXISTS checks if a subquery returns any rows. It's often more efficient than IN for large datasets.
Example: Customers With Orders (Using EXISTS)
Example: Customers Without Orders (Using NOT EXISTS)
EXISTS vs IN: When to Use Which?
- EXISTS: Better for large subquery results, stops at first match
- IN: Better for small, static lists; easier to read
- NOT EXISTS: Handles NULLs correctly, preferred over NOT IN
Day 1 Key Takeaways
- Subqueries let you use the result of one query inside another
- Scalar subqueries return one value; column subqueries return a list
- Table subqueries (derived tables) return full result sets
- Use EXISTS for efficient existence checks on large tables
- Be careful with NULL values when using NOT IN
Correlated Subqueries
Row-by-Row Processing for Advanced Comparisons
What is a Correlated Subquery?
Think of it this way: A regular subquery answers a question once (e.g., "What's the average salary?"). A correlated subquery answers a question for each row (e.g., "For THIS employee, what's the average salary in THEIR department?").
Regular vs Correlated Subquery
| Aspect | Regular Subquery | Correlated Subquery |
|---|---|---|
| Execution | Runs once, result cached | Runs once per outer row |
| References outer query? | No | Yes |
| Performance | Generally faster | Can be slower on large tables |
| Use case | Compare to a fixed value | Row-specific comparisons |
Side-by-Side Comparison
How Correlated Subqueries Work
For each row in the outer query:
- The outer query passes the current row's values to the subquery
- The subquery executes using those values
- The result is used to evaluate the current row
- Move to the next row and repeat
Example: Employees Above Department Average
Result:
| employee_name | department | salary | dept_avg |
|---|---|---|---|
| Priya | Engineering | 95,000 | 78,000 |
| Rahul | Engineering | 88,000 | 78,000 |
| Amit | Sales | 72,000 | 62,000 |
Finding Latest/First Records Per Group
A common use case for correlated subqueries is finding the most recent (or first) record for each group.
Example: Latest Order for Each Customer
Example: Top Product in Each Category
Using ANY and ALL
ANY and ALL operators compare a value against a set of values returned by a subquery.
| Operator | Meaning | Example |
|---|---|---|
| = ANY | Equal to at least one value (same as IN) | salary = ANY (subquery) |
| > ANY | Greater than the minimum value | salary > ANY (subquery) |
| > ALL | Greater than the maximum value | salary > ALL (subquery) |
| < ALL | Less than the minimum value | price < ALL (subquery) |
Example: Using ALL
Performance Considerations
Correlated Subqueries Can Be Slow
Since correlated subqueries execute once per row, they can be slow on large tables. For a table with 1 million rows, the subquery runs 1 million times!
When to Use Alternatives
- Window functions: Often faster for ranking and comparing to group aggregates
- JOINs with derived tables: Calculate aggregates once, then join
- CTEs: More readable alternative that may optimize better
We'll learn these alternatives in the coming days!
Rewriting with a JOIN (Often Faster)
Day 2 Key Takeaways
- Correlated subqueries reference the outer query and run for each row
- Use them for row-specific comparisons (e.g., compare to group average)
- Great for finding latest/first records per group
- ANY and ALL operators compare against sets of values
- Consider JOINs or window functions for better performance
Window Functions - Part 1
Ranking Functions & The Power of OVER()
What are Window Functions?
Window functions are one of the most powerful features in SQL. They let you perform calculations across rows while keeping all individual rows in your result. This is perfect for rankings, running totals, and comparisons.
Window Functions vs GROUP BY
| Aspect | GROUP BY | Window Functions |
|---|---|---|
| Rows returned | One row per group | All original rows |
| Access to detail | Loses individual row data | Keeps all row data |
| Use case | Summaries, totals | Rankings, running totals, comparisons |
GROUP BY vs Window Function Comparison
The OVER() Clause
The OVER() clause defines the "window" - which rows the function should consider for its calculation.
function_name() OVER(
[PARTITION BY column1, column2, ...]
[ORDER BY column3, column4, ...]
)
- PARTITION BY: Divides rows into groups (like GROUP BY, but keeps all rows)
- ORDER BY: Determines the order of rows within each partition
Ranking Functions
Ranking functions assign a rank to each row based on specified ordering.
| Function | Behavior | Ties | Example Output |
|---|---|---|---|
| ROW_NUMBER() | Unique sequential number | No gaps, arbitrary tie-breaking | 1, 2, 3, 4, 5 |
| RANK() | Rank with gaps | Same rank for ties, skips next | 1, 2, 2, 4, 5 |
| DENSE_RANK() | Rank without gaps | Same rank for ties, no skip | 1, 2, 2, 3, 4 |
| NTILE(n) | Divides into n equal groups | N/A | 1, 1, 2, 2, 3, 3 |
Example: Comparing Ranking Functions
Result (notice how ties are handled):
| employee_name | salary | row_num | rank | dense_rank |
|---|---|---|---|---|
| Priya | 95,000 | 1 | 1 | 1 |
| Rahul | 88,000 | 2 | 2 | 2 |
| Amit | 88,000 | 3 | 2 | 2 |
| Sneha | 75,000 | 4 | 4 | 3 |
| Vikram | 65,000 | 5 | 5 | 4 |
PARTITION BY: Ranking Within Groups
Use PARTITION BY to reset rankings for each group.
Example: Rank Employees Within Each Department
Result:
| employee_name | department | salary | dept_rank |
|---|---|---|---|
| Priya | Engineering | 95,000 | 1 |
| Rahul | Engineering | 88,000 | 2 |
| Vikram | Engineering | 51,000 | 3 |
| Amit | Sales | 72,000 | 1 |
| Sneha | Sales | 65,000 | 2 |
NTILE: Creating Percentile Groups
NTILE divides rows into a specified number of roughly equal groups (quartiles, deciles, etc.).
Example: Divide Customers into Quartiles by Purchase Value
Practical Example: Top 3 Products Per Category
Finding Top N Per Group
Why Use a Subquery?
You cannot use window functions directly in a WHERE clause. The window function runs after WHERE. So we wrap it in a subquery to filter on the calculated rank.
Day 3 Key Takeaways
- Window functions perform calculations across rows without collapsing them
- OVER() defines the window; PARTITION BY creates groups; ORDER BY determines order
- ROW_NUMBER gives unique numbers; RANK and DENSE_RANK handle ties differently
- NTILE divides rows into equal groups (great for percentiles)
- To filter on window function results, use a subquery
Window Functions - Part 2
Aggregate Functions, Running Totals & Row Navigation
Aggregate Window Functions
You can use familiar aggregate functions (SUM, AVG, COUNT, MIN, MAX) as window functions to calculate aggregates while keeping all rows.
Example: Compare Each Sale to Department Average
Running Totals (Cumulative Sum)
Adding ORDER BY to an aggregate window function creates a running (cumulative) calculation.
Example: Running Total of Daily Sales
Result:
| sale_date | daily_sales | running_total | running_avg |
|---|---|---|---|
| 2024-01-01 | 1,500 | 1,500 | 1,500.00 |
| 2024-01-02 | 2,200 | 3,700 | 1,850.00 |
| 2024-01-03 | 1,800 | 5,500 | 1,833.33 |
| 2024-01-04 | 3,100 | 8,600 | 2,150.00 |
LAG and LEAD: Accessing Adjacent Rows
LAG looks at previous rows; LEAD looks at following rows. Essential for period-over-period comparisons!
LAG(column, offset, default) OVER(ORDER BY ...)
LEAD(column, offset, default) OVER(ORDER BY ...)
- offset: How many rows back/forward (default: 1)
- default: Value if no row exists (default: NULL)
Example: Month-over-Month Sales Growth
Result:
| month | monthly_sales | prev_month | growth | growth_pct |
|---|---|---|---|---|
| 2024-01 | 45,000 | NULL | NULL | NULL |
| 2024-02 | 52,000 | 45,000 | 7,000 | 15.6% |
| 2024-03 | 48,000 | 52,000 | -4,000 | -7.7% |
| 2024-04 | 61,000 | 48,000 | 13,000 | 27.1% |
Example: Year-over-Year Comparison
FIRST_VALUE and LAST_VALUE
Get the first or last value in a window partition.
Example: Compare Each Sale to First and Best Sale
Frame Specifications: ROWS vs RANGE
Frame specifications let you define exactly which rows to include in the window calculation.
ROWS BETWEEN start AND end
Options for start/end:
- UNBOUNDED PRECEDING (all rows from beginning)
- n PRECEDING (n rows before current)
- CURRENT ROW
- n FOLLOWING (n rows after current)
- UNBOUNDED FOLLOWING (all rows to end)
Example: 7-Day Moving Average
Example: Running Total vs Full Total
Common Window Frame Patterns
- Running total: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
- Moving average: ROWS BETWEEN n PRECEDING AND CURRENT ROW
- Centered average: ROWS BETWEEN n PRECEDING AND n FOLLOWING
- Entire partition: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Day 4 Key Takeaways
- Aggregate functions with OVER() calculate across windows while keeping rows
- Adding ORDER BY creates running/cumulative calculations
- LAG looks back; LEAD looks forward - essential for period comparisons
- Frame specifications (ROWS BETWEEN) control exactly which rows to include
- Moving averages smooth out fluctuations in time-series data
Common Table Expressions (CTEs)
Write Readable, Maintainable, and Reusable Queries
What is a CTE?
Think of CTEs as giving a name to a subquery that you can reference later. Instead of nesting queries inside queries (hard to read), you define named "building blocks" at the top, then use them below.
WITH cte_name AS (
SELECT ...
)
SELECT * FROM cte_name;
CTE vs Subquery: Readability Comparison
Nested Subquery (Hard to Read)
Same Query with CTEs (Easy to Read)
Benefits of CTEs
- Readability: Name your logic, making queries self-documenting
- Reusability: Reference the same CTE multiple times in one query
- Maintainability: Easier to modify and debug
- Recursion: CTEs support recursive queries (hierarchies)
Multiple CTEs
You can define multiple CTEs separated by commas, and later CTEs can reference earlier ones.
Example: Multi-Step Sales Analysis
Recursive CTEs
Recursive CTEs are powerful for working with hierarchical data like org charts, category trees, or bill of materials.
Example: Employee Hierarchy
Result:
| hierarchy | level | path |
|---|---|---|
| Rajesh (CEO) | 1 | Rajesh (CEO) |
| Priya (VP Engineering) | 2 | Rajesh (CEO) -> Priya |
| Rahul (Manager) | 3 | Rajesh (CEO) -> Priya -> Rahul |
| Amit (Manager) | 3 | Rajesh (CEO) -> Priya -> Amit |
| Sneha (VP Sales) | 2 | Rajesh (CEO) -> Sneha |
CTEs vs Subqueries vs Temp Tables
| Feature | CTE | Subquery | Temp Table |
|---|---|---|---|
| Scope | Single query | Single location | Entire session |
| Reusable in query | Yes | No (must repeat) | Yes |
| Supports recursion | Yes | No | No |
| Can be indexed | No | No | Yes |
| Best for | Readability, recursion | Simple, one-time use | Large data, multiple queries |
Day 5 Key Takeaways
- CTEs make complex queries readable by naming intermediate steps
- Use WITH clause before your main SELECT statement
- Multiple CTEs can be chained, with later ones referencing earlier ones
- Recursive CTEs solve hierarchical data problems elegantly
- CTEs are preferred over deeply nested subqueries for maintainability
Query Optimization & Best Practices
Writing Efficient, Production-Ready SQL
Why Query Performance Matters
A query that runs in 30 seconds on your test database might take 30 minutes on production with millions of rows. Understanding optimization is essential for any serious data work.
Understanding Query Execution
Most SQL databases process queries in this order:
- FROM / JOIN: Identify and combine tables
- WHERE: Filter rows
- GROUP BY: Aggregate rows
- HAVING: Filter aggregated results
- SELECT: Choose columns and calculate expressions
- DISTINCT: Remove duplicates
- ORDER BY: Sort results
- LIMIT: Restrict output rows
EXPLAIN: Understanding Query Plans
Use EXPLAIN to see how the database plans to execute your query.
Key Things to Look For in EXPLAIN
- Table scan: Reading entire table (slow for large tables)
- Index usage: Using index for fast lookups (good)
- Rows examined: How many rows the database checks
- Join type: How tables are being joined
Indexing Basics
| Column Type | Should Be Indexed? | Why |
|---|---|---|
| Primary Key | Automatic | Created automatically |
| Foreign Keys | Yes | Used frequently in JOINs |
| WHERE clause columns | Often yes | Speeds up filtering |
| ORDER BY columns | Sometimes | Speeds up sorting |
| Columns with few unique values | Usually no | Index won't help much |
Common Performance Pitfalls
1. SELECT * (Selecting All Columns)
2. Functions on Indexed Columns in WHERE
3. Using OR Instead of IN
4. Not Using LIMIT During Development
SQL Best Practices
Writing Clean, Maintainable SQL
- Use meaningful aliases:
customers cnotcustomers x - Format consistently: Keywords uppercase, columns lowercase
- Comment complex logic: Explain why, not what
- One column per line: Makes diffs and reviews easier
- Use CTEs over nested subqueries: More readable
Well-Formatted Production Query
Day 6 Key Takeaways
- Use EXPLAIN to understand how your queries execute
- Index columns used in WHERE, JOIN, and ORDER BY clauses
- Avoid SELECT * - only fetch columns you need
- Don't apply functions to indexed columns in WHERE clauses
- Write clean, commented, well-formatted SQL for maintainability
Week 3 Project & Assessment
Putting It All Together
Project: Business Analytics SQL Report
Create a comprehensive analytics report for a retail company using all the advanced SQL techniques you've learned this week.
Project Requirements
Using the e-commerce database from Week 2, write queries to answer the following business questions:
Task 1: Customer Segmentation (Window Functions)
Segment customers into tiers based on their total purchase value:
- Use NTILE to divide customers into 4 quartiles
- Label them as Premium, Gold, Silver, Bronze
- Show the count and average purchase value for each tier
Task 2: Sales Trend Analysis (Running Totals)
Create a monthly sales summary with:
- Monthly revenue
- Running total (year-to-date)
- Month-over-month growth percentage
- 3-month moving average
Task 3: Product Performance Ranking (RANK Functions)
For each product category:
- Rank products by total revenue
- Show top 3 products per category
- Include each product's percentage of category revenue
Task 4: Year-over-Year Comparison (LAG)
Compare this year's quarterly performance to last year:
- Revenue by quarter
- Same quarter last year revenue
- YoY growth percentage
- Classify as "Growth", "Stable", or "Decline"
Task 5: Complex Analytics Query (Multiple CTEs)
Build a comprehensive customer analysis using multiple CTEs:
- CTE 1: Customer purchase summary
- CTE 2: Customer recency (days since last order)
- CTE 3: Customer ranking by value
- Final: Combine into RFM-style segmentation
Assessment Criteria
| Component | Weight | Evaluation |
|---|---|---|
| Query Correctness | 40% | Results are accurate |
| Technique Usage | 30% | Appropriate use of advanced SQL |
| Code Quality | 20% | Clean, formatted, commented |
| Efficiency | 10% | No obvious performance issues |
Starter Template
Week 3 Complete!
Congratulations! You've mastered advanced SQL techniques that are used daily by data professionals worldwide. These skills will set you apart in job interviews and enable you to tackle complex analytical challenges.
Next Week: Data Visualization - learn to communicate your insights effectively!
Self-Assessment Quiz
Test your understanding of this week's concepts. Try to answer without looking back!
Question 1: Subqueries
What type of subquery returns a single value that can be compared using =, >, or
Show Answer
Scalar Subquery - Returns exactly one row and one column (a single value).
Question 2: Correlated vs Regular
A correlated subquery executes ___ time(s), while a regular subquery executes ___ time(s).
Show Answer
Once per row / once total - Correlated subqueries reference the outer query and must execute for each row.
Question 3: Ranking Functions
If two employees have the same salary and are ranked, RANK() gives them both 2. What does RANK() assign to the next person?
Show Answer
4 - RANK() skips numbers after ties. DENSE_RANK() would give 3.
Question 4: Window Functions
Which function would you use to access the previous row's value in a window?
Show Answer
LAG() - LAG looks at previous rows; LEAD looks at following rows.
Question 5: CTEs
What keyword is used to start a Common Table Expression?
Show Answer
WITH - CTEs are defined using WITH cte_name AS (query).
Question 6: Optimization
Why is WHERE YEAR(date_col) = 2024 slower than WHERE date_col >= '2024-01-01'?
Show Answer
Applying a function to a column prevents the database from using an index on that column. The second form allows index usage.
Question 7: Frame Specification
To calculate a 7-day moving average, which frame specification would you use?
Show Answer
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW - This includes the current row plus the 6 previous rows (7 total).
Scoring
- 7 correct: Excellent! You've mastered advanced SQL
- 5-6 correct: Good understanding, review weak areas
- 3-4 correct: Review the day's content for missed topics
- Below 3: Consider re-reading the week's content
About the Instructor
Pawan Rama Mali
Data Analytics professional and educator with extensive experience in SQL, database management, and teaching complex analytical concepts to beginners. Passionate about making data skills accessible to everyone and helping students build practical, job-ready expertise.
Specializations:
- SQL & Database Design
- Data Analytics & Business Intelligence
- R Programming & R Shiny Dashboards
- Python for Data Analysis