Survival Analysis

Many of the most important questions aren't "will this happen?" but "how long until it happens?" — how long until a patient relapses, a customer churns, a machine fails, a case is resolved, a person re-offends. Survival analysis (or time-to-event analysis) is the branch of statistics built for exactly these questions, and it exists as its own field because of one peculiar, unavoidable feature of the data: when your study ends, the event hasn't happened to everyone yet — and what you do with those unfinished cases changes everything.

It's a genuinely distinct tool worth knowing, because the obvious approaches all quietly fail on time-to-event data. This page builds it up: why ordinary methods break, the central idea of censoring, the two functions that describe survival, and the two workhorse methods — Kaplan-Meier and the Cox model.

How long until…?

At first glance you might reach for tools you already have, and each fails in an instructive way. Treat it as a regression on "time until event"? You can't: for many subjects the event hasn't happened, so their time is unknown. Treat it as classification ("did the event happen by time T?")? You throw away the rich information of when, and the answer depends arbitrarily on where you draw T.

The data has a special structure — a duration, plus whether the event has actually occurred yet — that needs purpose-built methods. The crux of all of them is how they handle the cases that haven't finished.

Censoring: the idea that defines the field

Censoring is the heart of survival analysis. A subject is right-censored when you know they survived up to a certain point, but not what happened after — because the study ended, or they dropped out, before the event occurred. You don't know their true event time; you only know it's longer than what you observed.

Figure (censoring timeline): subjects who experience the event during the study are marked with a dot; subjects still event-free when observation ends are right-censored and marked with an arrow. A censored time isn't missing or zero: it's a real lower bound (the event would happen later), and survival methods use exactly that partial information.

Censored cases carry real information ("survived at least this long"), and the cardinal sin is to mishandle them. Drop them and you bias the result, because you systematically lose the longest survivors. Treat them as if the event happened at the censoring time and you bias it too, because every censored duration understates the true event time. Either shortcut makes survival look shorter than it is. The whole machinery below exists to use that partial information correctly.
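A toy illustration of the two shortcuts, using made-up numbers: the `true_times` list and the `study_end` cutoff are assumptions for the sketch, and in a real study the true times of censored subjects would not be observable at all.

```python
# Hypothetical toy data: the event times we would see with unlimited follow-up.
true_times = [2, 4, 6, 8, 10]
study_end = 7  # assumed fixed end of observation

# What the study actually records: a duration plus an event flag.
durations = [min(t, study_end) for t in true_times]
events = [t <= study_end for t in true_times]  # False means right-censored

true_mean = sum(true_times) / len(true_times)

# Shortcut 1: drop the censored subjects entirely.
finished = [d for d, e in zip(durations, events) if e]
mean_dropped = sum(finished) / len(finished)

# Shortcut 2: pretend each censoring time was the event time.
mean_as_event = sum(durations) / len(durations)

print(true_mean, mean_dropped, mean_as_event)  # 6.0 4.0 5.2
```

In this setup neither shortcut recovers the true mean of 6.0, which is the gap that the methods below are designed to close.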

Two ways to describe survival

Survival is described by two complementary functions. The survival function:

$$S(t) = \Pr(T > t)$$

This is the probability of surviving (not having the event) beyond time $t$. It starts at 1 and declines toward 0 as events accumulate. The hazard function $h(t)$ takes a different angle: it's the instantaneous rate of the event at time $t$, given you've survived that far, that is, the risk right now for those still at risk. Survival answers "what fraction last this long?"; hazard answers "for a survivor, how dangerous is this moment?" They're two views of the same process, and different methods model one or the other.
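For readers who want the link between the two functions written out, the standard definitions for a continuous event time $T$ are sketched below. The symbols $\Delta t$ and $u$ are notation introduced here, not part of the text above.

```latex
\[
h(t) = \lim_{\Delta t \to 0}
       \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t},
\qquad
S(t) = \exp\!\left(-\int_0^t h(u)\,du\right)
\]
```

The second identity is what makes them "two views of the same process": knowing the hazard at every time determines the survival curve, and vice versa.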

Kaplan-Meier: estimating the curve

The Kaplan-Meier estimator is the workhorse for estimating $S(t)$ from data, and it handles censoring elegantly. It produces the familiar step curve: survival stays flat, then drops a step at each time an event actually occurs, with the size of each drop set by how many were still at risk just before. Censored subjects don't cause a drop — they simply leave the "at risk" pool at their censoring time, so they correctly contribute to the denominator up to that point and no further.
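The step rule described above is usually written as a product over event times. In this sketch the notation is an assumption added for illustration: $t_i$ are the distinct times at which events occur, $d_i$ is the number of events at $t_i$, and $n_i$ is the number still at risk just before $t_i$.

```latex
\[
\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)
\]
```

Censored subjects never add a factor to the product; they only shrink $n_i$ for the event times that come after they leave.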

That's the clever bit: by only stepping down at observed events and adjusting the at-risk count as censored cases exit, Kaplan-Meier extracts a consistent estimate of the survival curve from data that's riddled with unfinished cases, provided the censoring is non-informative (being censored says nothing about a subject's underlying risk). The result is the single most recognisable picture in the field, and the standard way to show "what fraction are still event-free over time".
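A minimal pure-Python sketch of the estimator, written from the description above rather than taken from any particular library. The function name `kaplan_meier` and the toy data are hypothetical.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct observed event time."""
    n_at_risk = len(durations)
    # Number of events at each time (censored subjects are not counted here).
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    # Everyone leaves the at-risk pool at their recorded time, event or not.
    exit_counts = Counter(durations)

    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Step down by the fraction of at-risk subjects who had the event.
            survival *= 1 - d / n_at_risk
            curve.append((t, survival))
        # Subjects censored at t were still at risk at t, so remove them after.
        n_at_risk -= exit_counts[t]
    return curve

# Hypothetical toy data: 1 = event observed, 0 = right-censored.
durations = [2, 3, 4, 6, 7]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier(durations, events)
print([(t, round(s, 3)) for t, s in curve])  # [(2, 0.8), (4, 0.533), (6, 0.267)]
```

Note how the subject censored at time 3 causes no step, yet the drop at time 4 is larger (one event among three at risk rather than four) because that subject has left the pool.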

Comparing groups: the log-rank test

Often the real question is comparative: does group A survive longer than group B (treatment vs control, one cohort vs another)? You plot a Kaplan-Meier curve for each and compare them with the log-rank test, a hypothesis test for whether two (or more) survival curves differ more than chance would explain. It's the survival-analysis counterpart to comparing group means, built to respect censoring. It tells you whether the curves differ, but not by how much, and it does not adjust for other factors. That is where the Cox model comes in.
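To make the test concrete, here is a rough two-group sketch of the usual log-rank statistic: at each event time it compares the events observed in group A with the number expected if both groups shared one hazard, then refers the result to a chi-square distribution with one degree of freedom. The function name `logrank_test` and the data are hypothetical, and the sketch assumes at least one informative event time (otherwise the variance is zero).

```python
import math

def logrank_test(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(durations_a, events_a) if e}
        | {t for t, e in zip(durations_b, events_b) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Subjects still at risk just before t in each group.
        n_a = sum(1 for d in durations_a if d >= t)
        n_b = sum(1 for d in durations_b if d >= t)
        # Events occurring exactly at t in each group.
        d_a = sum(1 for d, e in zip(durations_a, events_a) if e and d == t)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if e and d == t)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of group A's event count at t.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical data: group A fails early, group B fails late, no censoring.
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(chi2, p)
```

A small p-value says the curves differ by more than chance would explain; it does not say how large the difference is.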

The Cox proportional hazards model

To ask "how does each factor affect survival, holding the others constant?" you need a regression, and the Cox proportional hazards model is the dominant one. Rather than model the survival curve directly, it models the hazard, because covariate effects are simple to express as multipliers on the hazard and that form keeps estimation tractable. Its form:

$$h(t \mid \mathbf{x}) = h_0(t)\, \exp(\beta_1 x_1 + \cdots + \beta_p x_p)$$

The beauty is that it's semi-parametric: the baseline hazard $h_0(t)$ (how risk changes over time in general) is left unspecified, so you make no assumption about the shape of the baseline survival curve. You only estimate the $\beta$ coefficients, the effect of each covariate. Exponentiating a coefficient gives a hazard ratio: $e^{\beta} = 2$ means a one-unit increase in that covariate doubles the instantaneous risk at every time, holding the other covariates fixed; a ratio below 1 means the factor is protective. That single, interpretable number ("this factor multiplies the risk by X") is why the Cox model is everywhere in medicine, reliability, and social science.
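To show that $\beta$ can be estimated without ever specifying $h_0(t)$, here is a rough single-covariate sketch that maximises the Cox partial likelihood with Newton's method. It is an illustration under stated assumptions, not how any particular library does it: the function name `fit_cox_one_covariate` and the toy data are hypothetical, tied event times are handled in the simple Breslow style, and the data must not be perfectly separated (otherwise the estimate diverges).

```python
import math

def fit_cox_one_covariate(durations, events, x, n_iter=25, tol=1e-10):
    """Estimate beta for a single covariate; h_0(t) never appears."""
    beta = 0.0
    for _ in range(n_iter):
        gradient, information = 0.0, 0.0
        for t_i, e_i, x_i in zip(durations, events, x):
            if not e_i:
                continue  # censored subjects only appear inside risk sets
            # Covariate values of everyone still at risk at time t_i.
            risk_x = [x_j for t_j, x_j in zip(durations, x) if t_j >= t_i]
            weights = [math.exp(beta * x_j) for x_j in risk_x]
            s0 = sum(weights)
            s1 = sum(w * x_j for w, x_j in zip(weights, risk_x))
            s2 = sum(w * x_j * x_j for w, x_j in zip(weights, risk_x))
            mean_x = s1 / s0
            gradient += x_i - mean_x                # score contribution
            information += s2 / s0 - mean_x ** 2    # weighted variance of x
        if information == 0:
            break  # covariate never varies within a risk set
        step = gradient / information
        beta += step
        if abs(step) < tol:
            break
    return beta

# Hypothetical data: x = 1 marks a factor whose carriers tend to fail earlier.
durations = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
x = [1, 1, 0, 1, 0, 1, 0, 0]
beta = fit_cox_one_covariate(durations, events, x)
hazard_ratio = math.exp(beta)
print(beta, hazard_ratio)
```

Only the ordering of event times and the makeup of each risk set enter the calculation, which is why the baseline hazard can stay unspecified.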

The proportional-hazards assumption

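The Cox model buys its flexibility with one key assumption, hidden in the name: proportional hazards. It assumes a covariate's effect is a constant multiplier on the hazard at all times, so the hazard ratio between two groups doesn't change as time passes. When that fails (crossing survival curves are a telltale sign), a single hazard ratio is only an average over a changing effect and can mislead.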
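The assumption can be read directly off the model's form. Taking two hypothetical covariate profiles $\mathbf{x}_a$ and $\mathbf{x}_b$ (names introduced here for illustration) and dividing their hazards, the baseline cancels:

```latex
\[
\frac{h(t \mid \mathbf{x}_a)}{h(t \mid \mathbf{x}_b)}
  = \frac{h_0(t)\,\exp(\boldsymbol{\beta}^\top \mathbf{x}_a)}
         {h_0(t)\,\exp(\boldsymbol{\beta}^\top \mathbf{x}_b)}
  = \exp\!\big(\boldsymbol{\beta}^\top (\mathbf{x}_a - \mathbf{x}_b)\big)
\]
```

The right-hand side contains no $t$, so the model forces the ratio to be the same at every time. If the real effect fades or reverses, the model cannot represent it without extensions.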

Where it shows up in my work

When the question is 'how long?'

A surprising number of analytical questions are really time-to-event questions in disguise: how long until a case is resolved, the time-to-recurrence of an issue, how long someone stays in a program before exiting. The most valuable thing survival analysis gives me is the discipline around censoring — recognising that the cases that haven't finished yet carry real information, and that dropping them (the tempting shortcut) systematically biases the answer toward whatever finished quickly.

Kaplan-Meier is the honest way to show "what fraction remain over time", the log-rank test compares two groups' timelines properly, and the Cox model gives an interpretable hazard ratio — "this factor multiplies the risk by X" — while adjusting for confounders, which pairs naturally with the causal-inference mindset. Knowing the proportional-hazards assumption is what keeps that hazard ratio honest rather than a convenient average of a changing story.

Refresh in 60 seconds

Survival analysis answers "how long until the event?" — ordinary regression/classification fail because the event hasn't happened to everyone.
Censoring is the key idea: a right-censored subject was event-free up to some time (a real lower bound). Don't drop them and don't treat censoring as the event — both bias the result.
Two views: survival $S(t)=\Pr(T>t)$ (fraction lasting past t) and hazard $h(t)$ (instantaneous risk given survival so far).
Kaplan-Meier estimates the survival curve (the step curve), handling censoring via the at-risk pool. Log-rank test compares two (or more) curves.
Cox proportional hazards $h(t\mid x)=h_0(t)e^{\beta^\top x}$ — semi-parametric (baseline left free); $e^\beta$ = hazard ratio ("multiplies risk by X").
Check the proportional-hazards assumption — a constant hazard ratio over time; when it's violated (curves cross), the single number misleads.

The censoring framing, Kaplan-Meier/log-rank pairing, and the Cox model with its proportional-hazards caveat reflect current survival-analysis references alongside statistics coursework.