SQL & Querying Data

Most data analysis begins with a deceptively simple question: how do I get exactly the rows I need out of a database? The answer, for fifty years and counting, is SQL — Structured Query Language. It has outlived countless trendier tools because it got something fundamentally right, and for an analyst it's the most-used skill of all: before you model, visualise, or report anything, you have to query it.

This page is the working analyst's SQL — not a syntax reference, but the handful of ideas that separate someone who fights the language from someone who's fluent in it: how a query actually executes, what joins really do, the modern superpower of window functions, and the one trap (NULL) that catches everyone at least once. It's the practical companion to the database systems page — that one is the theory, this is the skill.

Why SQL endures

SQL's longevity comes from one design choice: it's declarative. You describe what you want, not how to get it. You don't write loops over rows or specify which index to use — you state the result you want, and the database's query optimiser figures out the most efficient way to produce it.

That's a profound separation of concerns. Your query stays a clear statement of intent while the engine handles the messy mechanics, and the same query keeps working as the data grows from thousands of rows to billions. It's also why SQL reads almost like English — a strength that hides the one thing beginners get wrong, which is the next section.

The core verbs

The backbone of a query is a small set of clauses, each doing one job. A query that uses all of them:

SELECT   department, COUNT(*) AS staff
FROM     employees
WHERE    start_date >= '2020-01-01'
GROUP BY department
HAVING   COUNT(*) > 5
ORDER BY staff DESC;

SELECT — which columns (and computed values) to return.
FROM — which table(s) to pull from.
WHERE — filter individual rows before grouping.
GROUP BY — collapse rows into groups for aggregation (COUNT, SUM, AVG).
HAVING — filter the groups (after aggregation) — the distinction from WHERE trips people up constantly.
ORDER BY — sort the final result.

The order it really runs

Here's the single most clarifying fact about SQL: it does not execute in the order you write it. You write SELECT first, but the database runs it nearly last. The logical execution order is:

SQL's logical execution order. You write SELECT first, but it runs sixth — FROM and WHERE pick and filter rows first, then grouping, then SELECT computes its columns, then ORDER BY sorts. This order explains the rules that otherwise feel arbitrary.

This order isn't trivia — it explains the rules that otherwise feel arbitrary. Why can't you use a SELECT alias in WHERE? Because WHERE runs before SELECT exists. Why does WHERE filter rows but HAVING filter groups? Because WHERE runs before grouping and HAVING after. Why can't a window function go in WHERE? Same reason — it's computed at SELECT, too late to filter on (wrap it in a CTE and filter outside). Internalise the order and a dozen "gotchas" become obvious.

Joins: combining tables

Data lives in separate tables (customers here, orders there), and joins stitch them back together on a shared key. The four you need:

INNER JOIN — only rows that match in both tables.
LEFT JOIN — every row from the left table, plus matches from the right (NULLs where there's no match). The analyst's workhorse — "all customers, with their orders if any".
RIGHT JOIN — the mirror image (rarely needed; just flip the tables).
FULL OUTER JOIN — every row from both, matched where possible.

Window functions: the modern superpower

Window functions are the feature that turns SQL from a retrieval language into an analytical one. A normal aggregate (GROUP BY) collapses rows into one summary row. A window function computes across a set of related rows while keeping every row — so you can put a running total, a rank, or "compared to last month" right beside each record.

SELECT  month, revenue,
        SUM(revenue) OVER (ORDER BY month)          AS running_total,
        revenue - LAG(revenue) OVER (ORDER BY month) AS change_vs_prev,
        RANK() OVER (ORDER BY revenue DESC)         AS rank
FROM    monthly_sales;

The OVER (...) clause defines the "window" of rows to compute over — PARTITION BY splits into groups, ORDER BY orders within them. Running totals, rank-within-group, month-on-month change, moving averages, "top N per category" — all the questions that used to need awkward self-joins become one clean line. They're computed at the SELECT step, which is exactly why you can't filter on them directly (back to the execution order).

CTEs: readable, layered logic

Real questions need several steps, and the wrong way to write them is a pyramid of nested subqueries read inside-out. A Common Table Expression (the WITH clause) names each step so the query reads top to bottom like a recipe:

WITH recent AS (
    SELECT * FROM orders WHERE order_date >= '2026-01-01'
),
by_customer AS (
    SELECT customer_id, SUM(amount) AS total
    FROM recent
    GROUP BY customer_id
)
SELECT * FROM by_customer WHERE total > 1000;

Each CTE is a named, reusable building block. The query becomes a sequence of clear stages rather than a tangle — easier to read, debug, and hand to someone else. For an analyst whose queries have to be understood and re-run by others, this readability is not a luxury.

The NULL trap

The bug that catches everyone eventually: in SQL, NULL doesn't mean zero or empty — it means unknown. And because "unknown" infects any comparison, SQL runs on three-valued logic: TRUE, FALSE, and UNKNOWN.

Where it shows up in my work

Where every analysis begins

SQL is the first tool I reach for in almost any task — the data lives in databases, and getting the right rows out is step zero before any analysis. Fluency here pays off daily: knowing the execution order means I write queries that work the first time, window functions turn "running total" or "rank within group" into one line instead of a workaround, and CTEs keep a complex extract readable enough to be checked and reused.

And the traps are exactly the ones that produce wrong numbers quietly — a join fan-out doubling a total, a NULL comparison silently dropping the rows I meant to keep. Catching those is the difference between a query that looks right and one that is right. It pairs with data preparation (the query is where prep starts) and reproducibility (a saved, version-controlled query is a re-runnable step).

Refresh in 60 seconds

SQL endures because it's declarative — say what you want, let the optimiser find how. The skill behind every analysis.
Core verbs: SELECT / FROM / WHERE / GROUP BY / HAVING / ORDER BY. WHERE filters rows, HAVING filters groups.
It runs FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY, not top-to-bottom — which explains every "why can't I…" gotcha (aliases, window functions in WHERE).
Joins combine tables; LEFT JOIN is the workhorse. Beware the fan-out — a non-unique key multiplies rows and inflates sums silently. Know the grain.
Window functions (OVER) compute across rows while keeping each one — running totals, rank, lag. CTEs (WITH) make multi-step logic readable.
NULL = unknown → three-valued logic. NULL = NULL isn't TRUE; use IS NULL. != silently drops NULL rows. Handle it explicitly.

The logical-execution-order framing, window-function placement, and NULL three-valued-logic cautions reflect current SQL references alongside coursework.