Chapter 3 · Data Analyst

Advanced SQL that sets you apart

~10 min read

The verbs from Chapter 2 answer most questions. The patterns in this chapter are what let you answer the interesting ones cleanly, and they are the difference between a candidate who can query and one interviewers want to hire.

3.1 Window functions#

Window functions compute across a set of rows related to the current row without collapsing them the way GROUP BY does. They are the clearest junior-to-senior dividing line. Use them for running totals, rankings, period-over-period change, and top N per group.

Function	What it gives you
ROW_NUMBER()	Unique sequential number per partition (dedupe, top-N)
RANK() / DENSE_RANK()	Ranking with or without gaps on ties
LAG() / LEAD()	Previous or next row's value (period-over-period)
SUM() OVER (...)	Running or cumulative totals
NTILE(n)	Split rows into n buckets (quartiles, deciles)

sql

-- Rank products by revenue WITHIN each category
SELECT category, product, revenue,
       RANK() OVER (PARTITION BY category
                    ORDER BY revenue DESC) AS rank_in_cat
FROM   product_revenue;

-- Month-over-month change with LAG
SELECT month, revenue,
       revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change
FROM   monthly_revenue;

3.2 CTEs, writing SQL humans can read#

A Common Table Expression (WITH name AS (...)) names a query step so you can build complex logic in readable stages instead of deeply nested subqueries. Senior analysts use them constantly, not because they run faster, but because they make logic reviewable. A query someone else can verify is worth more than a clever one nobody trusts.

sql

WITH paid AS (
  SELECT customer_id, amount
  FROM   orders WHERE status = 'paid'
),
by_customer AS (
  SELECT customer_id, SUM(amount) AS total
  FROM   paid GROUP BY customer_id
)
SELECT * FROM by_customer WHERE total > 1000
ORDER BY total DESC;

3.3 The fan-out trap#

One pattern causes more silent errors than any other. When you join to a table that has multiple matching rows per key, each row on the other side is duplicated once per match. This is called fan-out, and it quietly multiplies any value you then SUM. A revenue total that looks a little high after adding a join is fan-out until proven otherwise.

3.4 Conditional aggregation, the pivot trick#

A pattern you will use weekly: turning rows into columns by combining CASE with an aggregate. It answers questions like how many orders were paid versus refunded per month in a single, readable pass, without a self-join.

sql

-- One row per month, split by status
SELECT DATE_TRUNC('month', created_at) AS month,
       COUNT(*) FILTER (WHERE status='paid')     AS paid,
       COUNT(*) FILTER (WHERE status='refunded') AS refunded,
       SUM(amount) FILTER (WHERE status='paid')  AS paid_revenue
FROM   orders
GROUP BY 1 ORDER BY month;

Where FILTER is not supported, the same idea works with SUM(CASE WHEN status='paid' THEN 1 ELSE 0 END). Recognizing this pattern turns many awkward reporting requests into one clean query.

3.5 Working with dates#

Task	Typical function	Example
Truncate to a period	DATE_TRUNC	Group events by week or month
Difference between dates	DATEDIFF / age	Days from signup to first order
Filter a rolling window	INTERVAL	Last 30 days of activity
Extract a part	EXTRACT	Day-of-week seasonality

Vague dates cause silent errors. Does "last month" mean the previous calendar month or the trailing 30 days? Does a day boundary use UTC or the user's timezone? State the choice in your query and to your stakeholder. A report that shifts by a day because of a timezone assumption can send a whole team chasing a problem that does not exist.

3.6 Query performance basics#

Select only the columns you need. SELECT star moves and processes far more data than required, especially on wide tables.
Filter early. Push WHERE conditions as close to the source as possible so fewer rows flow downstream.
Lean on indexes and partitions. Joins and filters on indexed or partitioned columns are dramatically faster; an unindexed join on millions of rows is the usual culprit behind a slow query.
Read the query plan. EXPLAIN reveals full table scans and poor join orders so you fix the real bottleneck instead of guessing.

3.7 Advanced SQL practice#

Mistake	What it causes	The fix
SUM after a fan-out join	Inflated totals	Aggregate the many-side first
COUNT(column) with NULLs	Undercounts silently	Use COUNT(*) or handle NULLs
AVG over NULLs meant as zero	Overstated average	COALESCE to zero if that is the intent
Filtering on a SELECT alias in WHERE	Query error	Filter on the expression, or use a subquery
Ambiguous date window	Off-by-one-day results	State calendar vs rolling, and timezone

How would you find the second-highest salary in a table?

A LEFT JOIN returns more rows than the left table has. Why?

Get the next chapter and weekly interview tips by email

One short email per week. Skim in a minute. Unsubscribe anytime.

SQL, the analyst's primary language

How data is stored and moved