Week 3: Advanced SQL & Analytics Queries

Master Subqueries, Window Functions, CTEs & Query Optimization

Duration: 7 Days | Level: Intermediate to Advanced | Prerequisite: Week 2

Welcome to Week 3! You've mastered the basics of SQL - now it's time to unlock its full power. This week, we'll dive into advanced techniques that separate junior analysts from senior data professionals.

The techniques you'll learn this week - subqueries, window functions, and CTEs - are used daily in real-world analytics. They're also among the most common topics in data analyst interviews. By the end of this week, you'll be writing sophisticated queries that can answer complex business questions in a single statement.

Let's level up your SQL skills!

Day 1

Subqueries & Nested Queries

Queries Within Queries - The Foundation of Complex Analysis

What is a Subquery?

Subquery: A subquery (also called an inner query or nested query) is a SQL query embedded inside another query. The inner query executes first, and its result is used by the outer query.

Think of a subquery as asking a question to get an answer, then using that answer to ask another question. For example: "What's the average salary?" (subquery) then "Show me all employees who earn more than that average" (outer query).

Why Use Subqueries?

  • Dynamic filtering: Filter based on calculated values, not hardcoded numbers
  • Multi-step analysis: Break complex problems into logical steps
  • Data comparison: Compare rows against aggregated results
  • Existence checks: Verify if related records exist

Types of Subqueries

Type Returns Used With Example Use Case
Scalar Single value =, >, <, >=, <= Find products priced above average
Column List of values IN, NOT IN, ANY, ALL Find customers who ordered specific products
Table Result set (rows & columns) FROM clause, JOINs Join with aggregated summary data

Scalar Subqueries

A scalar subquery returns exactly one value (one row, one column). It can be used anywhere a single value is expected.

Example: Find Products Above Average Price

-- Find all products priced above the average SELECT product_name, price FROM products WHERE price > ( SELECT AVG(price) FROM products ); -- The subquery calculates: AVG(price) = 250.00 -- The outer query then finds: WHERE price > 250.00

The subquery (SELECT AVG(price) FROM products) executes first, returning a single number. The outer query then uses this value to filter the results.

Example: Compare to Department Average

-- Find employees earning more than their company's average salary SELECT employee_name, salary, department FROM employees WHERE salary > ( SELECT AVG(salary) FROM employees ) ORDER BY salary DESC;

Column Subqueries (List of Values)

A column subquery returns multiple values (multiple rows, one column). Use with IN, NOT IN, ANY, or ALL operators.

Example: Find Customers Who Made Purchases

-- Find customers who have placed at least one order SELECT customer_name, email, city FROM customers WHERE customer_id IN ( SELECT DISTINCT customer_id FROM orders ); -- The subquery returns: (1, 3, 5, 7, 12, ...) -- Outer query finds customers with those IDs

Example: Find Products Never Ordered

-- Find products that have never been ordered SELECT product_id, product_name, price FROM products WHERE product_id NOT IN ( SELECT DISTINCT product_id FROM order_items WHERE product_id IS NOT NULL );

Watch Out: NULL Values with NOT IN

If the subquery returns any NULL values, NOT IN will return no results! Always add WHERE column IS NOT NULL in your subquery when using NOT IN.

Table Subqueries (Derived Tables)

A table subquery returns a complete result set that can be used in the FROM clause like a regular table. It must have an alias.

Example: Join with Aggregated Data

-- Compare each product's price to its category average SELECT p.product_name, p.category, p.price, cat_avg.avg_price AS category_avg, p.price - cat_avg.avg_price AS difference FROM products p JOIN ( SELECT category, AVG(price) AS avg_price FROM products GROUP BY category ) AS cat_avg ON p.category = cat_avg.category ORDER BY difference DESC;

Subqueries in Different Clauses

In the SELECT Clause

-- Show each product with the total count of products in its category SELECT product_name, category, price, (SELECT COUNT(*) FROM products p2 WHERE p2.category = p1.category) AS products_in_category FROM products p1;

In the FROM Clause

-- Get statistics from aggregated data SELECT AVG(order_total) AS avg_order_value, MAX(order_total) AS max_order_value, MIN(order_total) AS min_order_value FROM ( SELECT order_id, SUM(quantity * unit_price) AS order_total FROM order_items GROUP BY order_id ) AS order_totals;

In the WHERE Clause

-- Find orders placed on the busiest day SELECT * FROM orders WHERE DATE(order_date) = ( SELECT DATE(order_date) FROM orders GROUP BY DATE(order_date) ORDER BY COUNT(*) DESC LIMIT 1 );

EXISTS and NOT EXISTS

EXISTS checks if a subquery returns any rows. It's often more efficient than IN for large datasets.

Example: Customers With Orders (Using EXISTS)

-- Find customers who have placed orders (efficient method) SELECT customer_name, email FROM customers c WHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id ); -- EXISTS stops as soon as it finds one matching row -- More efficient than IN for large tables

Example: Customers Without Orders (Using NOT EXISTS)

-- Find customers who have never placed an order SELECT customer_name, email, signup_date FROM customers c WHERE NOT EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id );

EXISTS vs IN: When to Use Which?

  • EXISTS: Better for large subquery results, stops at first match
  • IN: Better for small, static lists; easier to read
  • NOT EXISTS: Handles NULLs correctly, preferred over NOT IN

Day 1 Key Takeaways

  • Subqueries let you use the result of one query inside another
  • Scalar subqueries return one value; column subqueries return a list
  • Table subqueries (derived tables) return full result sets
  • Use EXISTS for efficient existence checks on large tables
  • Be careful with NULL values when using NOT IN
Day 2

Correlated Subqueries

Row-by-Row Processing for Advanced Comparisons

What is a Correlated Subquery?

Correlated Subquery: A subquery that references columns from the outer query. Unlike regular subqueries that execute once, correlated subqueries execute once for each row in the outer query.

Think of it this way: A regular subquery answers a question once (e.g., "What's the average salary?"). A correlated subquery answers a question for each row (e.g., "For THIS employee, what's the average salary in THEIR department?").

Regular vs Correlated Subquery

Aspect Regular Subquery Correlated Subquery
Execution Runs once, result cached Runs once per outer row
References outer query? No Yes
Performance Generally faster Can be slower on large tables
Use case Compare to a fixed value Row-specific comparisons

Side-by-Side Comparison

-- REGULAR SUBQUERY: Compare to company average SELECT employee_name, department, salary FROM employees WHERE salary > ( SELECT AVG(salary) FROM employees -- Runs ONCE ); -- CORRELATED SUBQUERY: Compare to department average SELECT employee_name, department, salary FROM employees e1 WHERE salary > ( SELECT AVG(salary) FROM employees e2 WHERE e2.department = e1.department -- References outer query! );

How Correlated Subqueries Work

For each row in the outer query:

  1. The outer query passes the current row's values to the subquery
  2. The subquery executes using those values
  3. The result is used to evaluate the current row
  4. Move to the next row and repeat

Example: Employees Above Department Average

-- Find employees who earn more than their department's average SELECT employee_name, department, salary, (SELECT ROUND(AVG(salary), 2) FROM employees e2 WHERE e2.department = e1.department) AS dept_avg FROM employees e1 WHERE salary > ( SELECT AVG(salary) FROM employees e2 WHERE e2.department = e1.department ) ORDER BY department, salary DESC;

Result:

employee_namedepartmentsalarydept_avg
PriyaEngineering95,00078,000
RahulEngineering88,00078,000
AmitSales72,00062,000

Finding Latest/First Records Per Group

A common use case for correlated subqueries is finding the most recent (or first) record for each group.

Example: Latest Order for Each Customer

-- Get the most recent order for each customer SELECT c.customer_name, o.order_id, o.order_date, o.total_amount FROM customers c JOIN orders o ON c.customer_id = o.customer_id WHERE o.order_date = ( SELECT MAX(order_date) FROM orders o2 WHERE o2.customer_id = c.customer_id ) ORDER BY o.order_date DESC;

Example: Top Product in Each Category

-- Find the highest-priced product in each category SELECT product_name, category, price FROM products p1 WHERE price = ( SELECT MAX(price) FROM products p2 WHERE p2.category = p1.category ) ORDER BY category;

Using ANY and ALL

ANY and ALL operators compare a value against a set of values returned by a subquery.

Operator Meaning Example
= ANY Equal to at least one value (same as IN) salary = ANY (subquery)
> ANY Greater than the minimum value salary > ANY (subquery)
> ALL Greater than the maximum value salary > ALL (subquery)
< ALL Less than the minimum value price < ALL (subquery)

Example: Using ALL

-- Find products more expensive than ALL products in 'Accessories' category SELECT product_name, category, price FROM products WHERE price > ALL ( SELECT price FROM products WHERE category = 'Accessories' ); -- This is equivalent to: -- WHERE price > (SELECT MAX(price) FROM products WHERE category = 'Accessories')

Performance Considerations

Correlated Subqueries Can Be Slow

Since correlated subqueries execute once per row, they can be slow on large tables. For a table with 1 million rows, the subquery runs 1 million times!

When to Use Alternatives

  • Window functions: Often faster for ranking and comparing to group aggregates
  • JOINs with derived tables: Calculate aggregates once, then join
  • CTEs: More readable alternative that may optimize better

We'll learn these alternatives in the coming days!

Rewriting with a JOIN (Often Faster)

-- Instead of correlated subquery: SELECT e1.employee_name, e1.department, e1.salary FROM employees e1 JOIN ( SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department ) dept_avgs ON e1.department = dept_avgs.department WHERE e1.salary > dept_avgs.avg_salary; -- This calculates department averages ONCE, then joins

Day 2 Key Takeaways

  • Correlated subqueries reference the outer query and run for each row
  • Use them for row-specific comparisons (e.g., compare to group average)
  • Great for finding latest/first records per group
  • ANY and ALL operators compare against sets of values
  • Consider JOINs or window functions for better performance
Day 3

Window Functions - Part 1

Ranking Functions & The Power of OVER()

What are Window Functions?

Window Function: A function that performs calculations across a set of rows related to the current row (the "window"), without collapsing them into a single output row like GROUP BY does.

Window functions are one of the most powerful features in SQL. They let you perform calculations across rows while keeping all individual rows in your result. This is perfect for rankings, running totals, and comparisons.

Window Functions vs GROUP BY

Aspect GROUP BY Window Functions
Rows returned One row per group All original rows
Access to detail Loses individual row data Keeps all row data
Use case Summaries, totals Rankings, running totals, comparisons

GROUP BY vs Window Function Comparison

-- GROUP BY: Returns 3 rows (one per department) SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department; -- Result: -- Engineering | 78000 -- Sales | 62000 -- Marketing | 55000 -- WINDOW FUNCTION: Returns ALL rows with avg added SELECT employee_name, department, salary, AVG(salary) OVER(PARTITION BY department) AS dept_avg FROM employees; -- Result: -- Rahul | Engineering | 88000 | 78000 -- Priya | Engineering | 95000 | 78000 -- Vikram | Engineering | 51000 | 78000 -- Amit | Sales | 72000 | 62000 -- ...

The OVER() Clause

The OVER() clause defines the "window" - which rows the function should consider for its calculation.

Syntax:
function_name() OVER(
  [PARTITION BY column1, column2, ...]
  [ORDER BY column3, column4, ...]
)
  • PARTITION BY: Divides rows into groups (like GROUP BY, but keeps all rows)
  • ORDER BY: Determines the order of rows within each partition

Ranking Functions

Ranking functions assign a rank to each row based on specified ordering.

Function Behavior Ties Example Output
ROW_NUMBER() Unique sequential number No gaps, arbitrary tie-breaking 1, 2, 3, 4, 5
RANK() Rank with gaps Same rank for ties, skips next 1, 2, 2, 4, 5
DENSE_RANK() Rank without gaps Same rank for ties, no skip 1, 2, 2, 3, 4
NTILE(n) Divides into n equal groups N/A 1, 1, 2, 2, 3, 3

Example: Comparing Ranking Functions

SELECT employee_name, department, salary, ROW_NUMBER() OVER(ORDER BY salary DESC) AS row_num, RANK() OVER(ORDER BY salary DESC) AS rank, DENSE_RANK() OVER(ORDER BY salary DESC) AS dense_rank FROM employees;

Result (notice how ties are handled):

employee_namesalaryrow_numrankdense_rank
Priya95,000111
Rahul88,000222
Amit88,000322
Sneha75,000443
Vikram65,000554

PARTITION BY: Ranking Within Groups

Use PARTITION BY to reset rankings for each group.

Example: Rank Employees Within Each Department

SELECT employee_name, department, salary, RANK() OVER( PARTITION BY department ORDER BY salary DESC ) AS dept_rank FROM employees ORDER BY department, dept_rank;

Result:

employee_namedepartmentsalarydept_rank
PriyaEngineering95,0001
RahulEngineering88,0002
VikramEngineering51,0003
AmitSales72,0001
SnehaSales65,0002

NTILE: Creating Percentile Groups

NTILE divides rows into a specified number of roughly equal groups (quartiles, deciles, etc.).

Example: Divide Customers into Quartiles by Purchase Value

SELECT customer_name, total_purchases, NTILE(4) OVER(ORDER BY total_purchases DESC) AS quartile, CASE NTILE(4) OVER(ORDER BY total_purchases DESC) WHEN 1 THEN 'Premium' WHEN 2 THEN 'Gold' WHEN 3 THEN 'Silver' ELSE 'Bronze' END AS customer_tier FROM customer_summary ORDER BY total_purchases DESC;

Practical Example: Top 3 Products Per Category

Finding Top N Per Group

-- Get top 3 selling products in each category SELECT * FROM ( SELECT product_name, category, total_sales, ROW_NUMBER() OVER( PARTITION BY category ORDER BY total_sales DESC ) AS sales_rank FROM products ) ranked WHERE sales_rank <= 3 ORDER BY category, sales_rank;

Why Use a Subquery?

You cannot use window functions directly in a WHERE clause. The window function runs after WHERE. So we wrap it in a subquery to filter on the calculated rank.

Day 3 Key Takeaways

  • Window functions perform calculations across rows without collapsing them
  • OVER() defines the window; PARTITION BY creates groups; ORDER BY determines order
  • ROW_NUMBER gives unique numbers; RANK and DENSE_RANK handle ties differently
  • NTILE divides rows into equal groups (great for percentiles)
  • To filter on window function results, use a subquery
Day 4

Window Functions - Part 2

Aggregate Functions, Running Totals & Row Navigation

Aggregate Window Functions

You can use familiar aggregate functions (SUM, AVG, COUNT, MIN, MAX) as window functions to calculate aggregates while keeping all rows.

Example: Compare Each Sale to Department Average

SELECT employee_name, department, sale_amount, AVG(sale_amount) OVER(PARTITION BY department) AS dept_avg, sale_amount - AVG(sale_amount) OVER(PARTITION BY department) AS diff_from_avg, ROUND(sale_amount * 100.0 / SUM(sale_amount) OVER(PARTITION BY department), 1) AS pct_of_dept FROM sales;

Running Totals (Cumulative Sum)

Adding ORDER BY to an aggregate window function creates a running (cumulative) calculation.

Example: Running Total of Daily Sales

SELECT sale_date, daily_sales, SUM(daily_sales) OVER(ORDER BY sale_date) AS running_total, AVG(daily_sales) OVER(ORDER BY sale_date) AS running_avg FROM daily_sales ORDER BY sale_date;

Result:

sale_datedaily_salesrunning_totalrunning_avg
2024-01-011,5001,5001,500.00
2024-01-022,2003,7001,850.00
2024-01-031,8005,5001,833.33
2024-01-043,1008,6002,150.00

LAG and LEAD: Accessing Adjacent Rows

LAG looks at previous rows; LEAD looks at following rows. Essential for period-over-period comparisons!

Syntax:
LAG(column, offset, default) OVER(ORDER BY ...)
LEAD(column, offset, default) OVER(ORDER BY ...)

- offset: How many rows back/forward (default: 1)
- default: Value if no row exists (default: NULL)

Example: Month-over-Month Sales Growth

SELECT month, monthly_sales, LAG(monthly_sales, 1) OVER(ORDER BY month) AS prev_month_sales, monthly_sales - LAG(monthly_sales, 1) OVER(ORDER BY month) AS growth, ROUND( (monthly_sales - LAG(monthly_sales, 1) OVER(ORDER BY month)) * 100.0 / LAG(monthly_sales, 1) OVER(ORDER BY month), 1 ) AS growth_pct FROM monthly_sales;

Result:

monthmonthly_salesprev_monthgrowthgrowth_pct
2024-0145,000NULLNULLNULL
2024-0252,00045,0007,00015.6%
2024-0348,00052,000-4,000-7.7%
2024-0461,00048,00013,00027.1%

Example: Year-over-Year Comparison

SELECT year, quarter, revenue, LAG(revenue, 4) OVER(ORDER BY year, quarter) AS same_quarter_last_year, ROUND( (revenue - LAG(revenue, 4) OVER(ORDER BY year, quarter)) * 100.0 / LAG(revenue, 4) OVER(ORDER BY year, quarter), 1 ) AS yoy_growth_pct FROM quarterly_revenue;

FIRST_VALUE and LAST_VALUE

Get the first or last value in a window partition.

Example: Compare Each Sale to First and Best Sale

SELECT sale_date, product_name, sale_amount, FIRST_VALUE(sale_amount) OVER( PARTITION BY product_name ORDER BY sale_date ) AS first_sale, MAX(sale_amount) OVER( PARTITION BY product_name ) AS best_sale, sale_amount - FIRST_VALUE(sale_amount) OVER( PARTITION BY product_name ORDER BY sale_date ) AS diff_from_first FROM sales;

Frame Specifications: ROWS vs RANGE

Frame specifications let you define exactly which rows to include in the window calculation.

Syntax:
ROWS BETWEEN start AND end

Options for start/end:
- UNBOUNDED PRECEDING (all rows from beginning)
- n PRECEDING (n rows before current)
- CURRENT ROW
- n FOLLOWING (n rows after current)
- UNBOUNDED FOLLOWING (all rows to end)

Example: 7-Day Moving Average

SELECT sale_date, daily_sales, ROUND(AVG(daily_sales) OVER( ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ), 2) AS moving_avg_7day FROM daily_sales ORDER BY sale_date;

Example: Running Total vs Full Total

SELECT sale_date, daily_sales, -- Running total (default frame when ORDER BY is present) SUM(daily_sales) OVER( ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_total, -- Total of all rows SUM(daily_sales) OVER() AS grand_total, -- Percentage of total ROUND(daily_sales * 100.0 / SUM(daily_sales) OVER(), 1) AS pct_of_total FROM daily_sales;

Common Window Frame Patterns

  • Running total: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  • Moving average: ROWS BETWEEN n PRECEDING AND CURRENT ROW
  • Centered average: ROWS BETWEEN n PRECEDING AND n FOLLOWING
  • Entire partition: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

Day 4 Key Takeaways

  • Aggregate functions with OVER() calculate across windows while keeping rows
  • Adding ORDER BY creates running/cumulative calculations
  • LAG looks back; LEAD looks forward - essential for period comparisons
  • Frame specifications (ROWS BETWEEN) control exactly which rows to include
  • Moving averages smooth out fluctuations in time-series data
Day 5

Common Table Expressions (CTEs)

Write Readable, Maintainable, and Reusable Queries

What is a CTE?

Common Table Expression (CTE): A temporary named result set defined within a query using the WITH clause. It exists only for the duration of that query and makes complex queries more readable.

Think of CTEs as giving a name to a subquery that you can reference later. Instead of nesting queries inside queries (hard to read), you define named "building blocks" at the top, then use them below.

Basic Syntax:
WITH cte_name AS (
    SELECT ...
)
SELECT * FROM cte_name;

CTE vs Subquery: Readability Comparison

Nested Subquery (Hard to Read)

SELECT department, avg_salary, company_avg, avg_salary - company_avg AS diff FROM ( SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department ) dept_salaries CROSS JOIN ( SELECT AVG(salary) AS company_avg FROM employees ) company;

Same Query with CTEs (Easy to Read)

WITH dept_salaries AS ( SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department ), company_avg AS ( SELECT AVG(salary) AS avg_salary FROM employees ) SELECT d.department, d.avg_salary, c.avg_salary AS company_avg, d.avg_salary - c.avg_salary AS diff FROM dept_salaries d CROSS JOIN company_avg c;

Benefits of CTEs

  • Readability: Name your logic, making queries self-documenting
  • Reusability: Reference the same CTE multiple times in one query
  • Maintainability: Easier to modify and debug
  • Recursion: CTEs support recursive queries (hierarchies)

Multiple CTEs

You can define multiple CTEs separated by commas, and later CTEs can reference earlier ones.

Example: Multi-Step Sales Analysis

WITH -- Step 1: Calculate monthly sales monthly_sales AS ( SELECT DATE_FORMAT(order_date, '%Y-%m') AS month, SUM(total_amount) AS revenue FROM orders GROUP BY DATE_FORMAT(order_date, '%Y-%m') ), -- Step 2: Add previous month for comparison sales_with_prev AS ( SELECT month, revenue, LAG(revenue) OVER(ORDER BY month) AS prev_month_revenue FROM monthly_sales ), -- Step 3: Calculate growth metrics growth_analysis AS ( SELECT month, revenue, prev_month_revenue, revenue - prev_month_revenue AS growth, ROUND((revenue - prev_month_revenue) * 100.0 / prev_month_revenue, 1) AS growth_pct FROM sales_with_prev ) -- Final: Select with classifications SELECT month, revenue, growth, growth_pct, CASE WHEN growth_pct > 20 THEN 'Strong Growth' WHEN growth_pct > 0 THEN 'Moderate Growth' WHEN growth_pct > -10 THEN 'Slight Decline' ELSE 'Significant Decline' END AS performance FROM growth_analysis WHERE prev_month_revenue IS NOT NULL;

Recursive CTEs

Recursive CTEs are powerful for working with hierarchical data like org charts, category trees, or bill of materials.

Example: Employee Hierarchy

WITH RECURSIVE org_chart AS ( -- Base case: Start with the CEO (no manager) SELECT employee_id, employee_name, manager_id, 1 AS level, employee_name AS path FROM employees WHERE manager_id IS NULL UNION ALL -- Recursive case: Find employees who report to previous level SELECT e.employee_id, e.employee_name, e.manager_id, o.level + 1, CONCAT(o.path, ' -> ', e.employee_name) FROM employees e JOIN org_chart o ON e.manager_id = o.employee_id ) SELECT REPEAT(' ', level - 1) || employee_name AS hierarchy, level, path FROM org_chart ORDER BY path;

Result:

hierarchylevelpath
Rajesh (CEO)1Rajesh (CEO)
  Priya (VP Engineering)2Rajesh (CEO) -> Priya
    Rahul (Manager)3Rajesh (CEO) -> Priya -> Rahul
    Amit (Manager)3Rajesh (CEO) -> Priya -> Amit
  Sneha (VP Sales)2Rajesh (CEO) -> Sneha

CTEs vs Subqueries vs Temp Tables

Feature CTE Subquery Temp Table
Scope Single query Single location Entire session
Reusable in query Yes No (must repeat) Yes
Supports recursion Yes No No
Can be indexed No No Yes
Best for Readability, recursion Simple, one-time use Large data, multiple queries

Day 5 Key Takeaways

  • CTEs make complex queries readable by naming intermediate steps
  • Use WITH clause before your main SELECT statement
  • Multiple CTEs can be chained, with later ones referencing earlier ones
  • Recursive CTEs solve hierarchical data problems elegantly
  • CTEs are preferred over deeply nested subqueries for maintainability
Day 6

Query Optimization & Best Practices

Writing Efficient, Production-Ready SQL

Why Query Performance Matters

A query that runs in 30 seconds on your test database might take 30 minutes on production with millions of rows. Understanding optimization is essential for any serious data work.

Understanding Query Execution

Most SQL databases process queries in this order:

  1. FROM / JOIN: Identify and combine tables
  2. WHERE: Filter rows
  3. GROUP BY: Aggregate rows
  4. HAVING: Filter aggregated results
  5. SELECT: Choose columns and calculate expressions
  6. DISTINCT: Remove duplicates
  7. ORDER BY: Sort results
  8. LIMIT: Restrict output rows

EXPLAIN: Understanding Query Plans

Use EXPLAIN to see how the database plans to execute your query.

EXPLAIN SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id WHERE o.order_date > '2024-01-01';

Key Things to Look For in EXPLAIN

  • Table scan: Reading entire table (slow for large tables)
  • Index usage: Using index for fast lookups (good)
  • Rows examined: How many rows the database checks
  • Join type: How tables are being joined

Indexing Basics

Index: A data structure that improves the speed of data retrieval. Think of it like a book's index - instead of reading every page, you look up keywords to find specific pages quickly.
Column Type Should Be Indexed? Why
Primary Key Automatic Created automatically
Foreign Keys Yes Used frequently in JOINs
WHERE clause columns Often yes Speeds up filtering
ORDER BY columns Sometimes Speeds up sorting
Columns with few unique values Usually no Index won't help much

Common Performance Pitfalls

1. SELECT * (Selecting All Columns)

-- BAD: Fetches all columns, even if you only need a few SELECT * FROM orders; -- GOOD: Only fetch what you need SELECT order_id, customer_id, order_date FROM orders;

2. Functions on Indexed Columns in WHERE

-- BAD: Function prevents index usage SELECT * FROM orders WHERE YEAR(order_date) = 2024; -- GOOD: Compare directly to enable index usage SELECT * FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

3. Using OR Instead of IN

-- Less efficient SELECT * FROM products WHERE category = 'Electronics' OR category = 'Clothing' OR category = 'Books'; -- More efficient and readable SELECT * FROM products WHERE category IN ('Electronics', 'Clothing', 'Books');

4. Not Using LIMIT During Development

-- When testing on large tables, always limit results first SELECT * FROM orders WHERE total_amount > 1000 LIMIT 100; -- Test with small sample first!

SQL Best Practices

Writing Clean, Maintainable SQL

  • Use meaningful aliases: customers c not customers x
  • Format consistently: Keywords uppercase, columns lowercase
  • Comment complex logic: Explain why, not what
  • One column per line: Makes diffs and reviews easier
  • Use CTEs over nested subqueries: More readable

Well-Formatted Production Query

/* * Monthly Sales Report with YoY Comparison * Author: Data Team * Last Updated: 2024-01-15 */ WITH monthly_revenue AS ( -- Calculate revenue by month SELECT DATE_FORMAT(order_date, '%Y-%m-01') AS month_start, SUM(total_amount) AS revenue, COUNT(DISTINCT customer_id) AS unique_customers FROM orders WHERE order_status = 'completed' GROUP BY DATE_FORMAT(order_date, '%Y-%m-01') ) SELECT month_start, revenue, unique_customers, LAG(revenue, 12) OVER(ORDER BY month_start) AS revenue_ly, ROUND( (revenue - LAG(revenue, 12) OVER(ORDER BY month_start)) * 100.0 / NULLIF(LAG(revenue, 12) OVER(ORDER BY month_start), 0), 1 ) AS yoy_growth_pct FROM monthly_revenue ORDER BY month_start DESC LIMIT 24; -- Last 2 years

Day 6 Key Takeaways

  • Use EXPLAIN to understand how your queries execute
  • Index columns used in WHERE, JOIN, and ORDER BY clauses
  • Avoid SELECT * - only fetch columns you need
  • Don't apply functions to indexed columns in WHERE clauses
  • Write clean, commented, well-formatted SQL for maintainability
Day 7

Week 3 Project & Assessment

Putting It All Together

Project: Business Analytics SQL Report

Create a comprehensive analytics report for a retail company using all the advanced SQL techniques you've learned this week.

Project Requirements

Using the e-commerce database from Week 2, write queries to answer the following business questions:

Task 1: Customer Segmentation (Window Functions)

Segment customers into tiers based on their total purchase value:

  • Use NTILE to divide customers into 4 quartiles
  • Label them as Premium, Gold, Silver, Bronze
  • Show the count and average purchase value for each tier

Task 2: Sales Trend Analysis (Running Totals)

Create a monthly sales summary with:

  • Monthly revenue
  • Running total (year-to-date)
  • Month-over-month growth percentage
  • 3-month moving average

Task 3: Product Performance Ranking (RANK Functions)

For each product category:

  • Rank products by total revenue
  • Show top 3 products per category
  • Include each product's percentage of category revenue

Task 4: Year-over-Year Comparison (LAG)

Compare this year's quarterly performance to last year:

  • Revenue by quarter
  • Same quarter last year revenue
  • YoY growth percentage
  • Classify as "Growth", "Stable", or "Decline"

Task 5: Complex Analytics Query (Multiple CTEs)

Build a comprehensive customer analysis using multiple CTEs:

  • CTE 1: Customer purchase summary
  • CTE 2: Customer recency (days since last order)
  • CTE 3: Customer ranking by value
  • Final: Combine into RFM-style segmentation

Assessment Criteria

Component Weight Evaluation
Query Correctness 40% Results are accurate
Technique Usage 30% Appropriate use of advanced SQL
Code Quality 20% Clean, formatted, commented
Efficiency 10% No obvious performance issues

Starter Template

/* * Week 3 Project: Business Analytics Report * Student Name: [Your Name] * Date: [Date] */ -- Task 1: Customer Segmentation WITH customer_totals AS ( -- Your code here ) SELECT ...; -- Task 2: Sales Trend Analysis -- Your code here -- Continue for remaining tasks...

Week 3 Complete!

Congratulations! You've mastered advanced SQL techniques that are used daily by data professionals worldwide. These skills will set you apart in job interviews and enable you to tackle complex analytical challenges.

Next Week: Data Visualization - learn to communicate your insights effectively!

Self-Assessment Quiz

Test your understanding of this week's concepts. Try to answer without looking back!

Question 1: Subqueries

What type of subquery returns a single value that can be compared using =, >, or

Show Answer

Scalar Subquery - Returns exactly one row and one column (a single value).

Question 2: Correlated vs Regular

A correlated subquery executes ___ time(s), while a regular subquery executes ___ time(s).

Show Answer

Once per row / once total - Correlated subqueries reference the outer query and must execute for each row.

Question 3: Ranking Functions

If two employees have the same salary and are ranked, RANK() gives them both 2. What does RANK() assign to the next person?

Show Answer

4 - RANK() skips numbers after ties. DENSE_RANK() would give 3.

Question 4: Window Functions

Which function would you use to access the previous row's value in a window?

Show Answer

LAG() - LAG looks at previous rows; LEAD looks at following rows.

Question 5: CTEs

What keyword is used to start a Common Table Expression?

Show Answer

WITH - CTEs are defined using WITH cte_name AS (query).

Question 6: Optimization

Why is WHERE YEAR(date_col) = 2024 slower than WHERE date_col >= '2024-01-01'?

Show Answer

Applying a function to a column prevents the database from using an index on that column. The second form allows index usage.

Question 7: Frame Specification

To calculate a 7-day moving average, which frame specification would you use?

Show Answer

ROWS BETWEEN 6 PRECEDING AND CURRENT ROW - This includes the current row plus the 6 previous rows (7 total).

Scoring

  • 7 correct: Excellent! You've mastered advanced SQL
  • 5-6 correct: Good understanding, review weak areas
  • 3-4 correct: Review the day's content for missed topics
  • Below 3: Consider re-reading the week's content

About the Instructor

Pawan Rama Mali

Data Analytics professional and educator with extensive experience in SQL, database management, and teaching complex analytical concepts to beginners. Passionate about making data skills accessible to everyone and helping students build practical, job-ready expertise.

Specializations:

  • SQL & Database Design
  • Data Analytics & Business Intelligence
  • R Programming & R Shiny Dashboards
  • Python for Data Analysis

Visit Instructor's Portfolio