Chapter 3 · Data Analyst
Advanced SQL that sets you apart
~10 min read
The verbs from Chapter 2 answer most questions. The patterns in this chapter are what let you answer the interesting ones cleanly, and they are the difference between a candidate who can query and one interviewers want to hire.
3.1 Window functions#
Window functions compute across a set of rows related to the current row without collapsing them the way GROUP BY does. They are the clearest junior-to-senior dividing line. Use them for running totals, rankings, period-over-period change, and top N per group.
| Function | What it gives you |
|---|---|
| ROW_NUMBER() | Unique sequential number per partition (dedupe, top-N) |
| RANK() / DENSE_RANK() | Ranking with or without gaps on ties |
| LAG() / LEAD() | Previous or next row's value (period-over-period) |
| SUM() OVER (...) | Running or cumulative totals |
| NTILE(n) | Split rows into n buckets (quartiles, deciles) |
-- Rank products by revenue WITHIN each category
SELECT category, product, revenue,
RANK() OVER (PARTITION BY category
ORDER BY revenue DESC) AS rank_in_cat
FROM product_revenue;
-- Month-over-month change with LAG
SELECT month, revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change
FROM monthly_revenue;3.2 CTEs, writing SQL humans can read#
A Common Table Expression (WITH name AS (...)) names a query step so you can build complex logic in readable stages instead of deeply nested subqueries. Senior analysts use them constantly, not because they run faster, but because they make logic reviewable. A query someone else can verify is worth more than a clever one nobody trusts.
WITH paid AS (
SELECT customer_id, amount
FROM orders WHERE status = 'paid'
),
by_customer AS (
SELECT customer_id, SUM(amount) AS total
FROM paid GROUP BY customer_id
)
SELECT * FROM by_customer WHERE total > 1000
ORDER BY total DESC;3.3 The fan-out trap#
One pattern causes more silent errors than any other. When you join to a table that has multiple matching rows per key, each row on the other side is duplicated once per match. This is called fan-out, and it quietly multiplies any value you then SUM. A revenue total that looks a little high after adding a join is fan-out until proven otherwise.
3.4 Conditional aggregation, the pivot trick#
A pattern you will use weekly: turning rows into columns by combining CASE with an aggregate. It answers questions like how many orders were paid versus refunded per month in a single, readable pass, without a self-join.
-- One row per month, split by status
SELECT DATE_TRUNC('month', created_at) AS month,
COUNT(*) FILTER (WHERE status='paid') AS paid,
COUNT(*) FILTER (WHERE status='refunded') AS refunded,
SUM(amount) FILTER (WHERE status='paid') AS paid_revenue
FROM orders
GROUP BY 1 ORDER BY month;Where FILTER is not supported, the same idea works with SUM(CASE WHEN status='paid' THEN 1 ELSE 0 END). Recognizing this pattern turns many awkward reporting requests into one clean query.
3.5 Working with dates#
| Task | Typical function | Example |
|---|---|---|
| Truncate to a period | DATE_TRUNC | Group events by week or month |
| Difference between dates | DATEDIFF / age | Days from signup to first order |
| Filter a rolling window | INTERVAL | Last 30 days of activity |
| Extract a part | EXTRACT | Day-of-week seasonality |
Vague dates cause silent errors. Does "last month" mean the previous calendar month or the trailing 30 days? Does a day boundary use UTC or the user's timezone? State the choice in your query and to your stakeholder. A report that shifts by a day because of a timezone assumption can send a whole team chasing a problem that does not exist.
3.6 Query performance basics#
- Select only the columns you need. SELECT star moves and processes far more data than required, especially on wide tables.
- Filter early. Push WHERE conditions as close to the source as possible so fewer rows flow downstream.
- Lean on indexes and partitions. Joins and filters on indexed or partitioned columns are dramatically faster; an unindexed join on millions of rows is the usual culprit behind a slow query.
- Read the query plan. EXPLAIN reveals full table scans and poor join orders so you fix the real bottleneck instead of guessing.
3.7 Advanced SQL practice#
| Mistake | What it causes | The fix |
|---|---|---|
| SUM after a fan-out join | Inflated totals | Aggregate the many-side first |
| COUNT(column) with NULLs | Undercounts silently | Use COUNT(*) or handle NULLs |
| AVG over NULLs meant as zero | Overstated average | COALESCE to zero if that is the intent |
| Filtering on a SELECT alias in WHERE | Query error | Filter on the expression, or use a subquery |
| Ambiguous date window | Off-by-one-day results | State calendar vs rolling, and timezone |
How would you find the second-highest salary in a table?
A LEFT JOIN returns more rows than the left table has. Why?
Get the next chapter and weekly interview tips by email
One short email per week. Skim in a minute. Unsubscribe anytime.
