Elements of Data Processing

There's a rule of thumb every data scientist learns the hard way: roughly 80% of the work is preparing the data, and only the last 20% is the modelling everyone talks about. Raw data is almost never ready to use: it's messy, inconsistent, scattered across sources, and full of gaps. Turning it into something clean and analysable is data processing, and it's the foundation the whole rest of the field stands on.

It's unglamorous, but it's where the leverage is: the best model in the world can't rescue bad inputs (garbage in, garbage out), while careful prep makes even a simple method work. This page covers the practical craft (the steps, the principles, and the traps) that turns raw data into a clean table you can actually trust.

The unglamorous 80%

Why does data prep dominate? Because raw data is collected for some other purpose than your analysis — a transaction log records sales, not your research question — so it never arrives in the shape you need. It has typos, missing fields, inconsistent formats ("NSW" / "N.S.W." / "New South Wales"), duplicate records, and values that are simply wrong.

The discipline matters because every error here propagates. A mis-parsed date, a silently dropped row, a units mix-up — none of it announces itself, and all of it quietly corrupts everything downstream. So the goal isn't just "clean the data"; it's to clean it deliberately and reproducibly, knowing exactly what you changed and why. The analysts who are trusted are the ones whose data prep you can audit.

The data pipeline

Data processing is best seen as a pipeline: a sequence of stages that takes raw inputs and produces analysis-ready data. The stages are roughly the same whatever the project:

Acquire — pull the data from its sources (files, databases, APIs).
Clean — fix errors, handle missing values, remove duplicates.
Transform — reshape, derive new fields, standardise formats.
Integrate — combine multiple sources into one coherent dataset.
Store — save the result in a form ready for analysis.

The data pipeline. Raw sources flow through acquire → clean → transform → integrate, producing the analysis-ready dataset that modelling and visualisation depend on. Most of the real effort lives in the clean and transform stages.
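The stages above can be sketched as a chain of small functions, each taking rows in and handing rows on. This is only an illustration using the Python standard library: the function names, the sample CSV text, and the region lookup are all hypothetical, and dropping rows with a missing amount is a choice made for the sketch, not a general rule.

```python
import csv
import io
import json

# Hypothetical raw inputs: a CSV of sales and a lookup of region names.
RAW_SALES = "id,state,amount\n1,NSW,10.5\n2,N.S.W.,4\n2,N.S.W.,4\n3,VIC,\n"
REGIONS = {"NSW": "New South Wales", "VIC": "Victoria"}


def acquire(text):
    """Acquire: read rows from a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: drop exact duplicates and rows with no amount."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen or row["amount"] == "":
            continue
        seen.add(key)
        out.append(row)
    return out


def transform(rows):
    """Transform: standardise the state code and parse the numbers."""
    return [
        {"id": int(r["id"]),
         "state": r["state"].replace(".", "").upper(),
         "amount": float(r["amount"])}
        for r in rows
    ]


def integrate(rows, regions):
    """Integrate: attach the region name from a second source."""
    return [{**r, "region": regions.get(r["state"])} for r in rows]


def store(rows):
    """Store: serialise the result (here, to a JSON string)."""
    return json.dumps(rows, sort_keys=True)


ready = integrate(transform(clean(acquire(RAW_SALES))), REGIONS)
stored = store(ready)
```

Keeping each stage as its own function makes it easy to inspect the rows between stages, which is where silent errors are usually caught.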

Types and structures of data

How hard the processing is depends on how structured the data already is:

Structured — neat rows and columns with a fixed schema, like a database table or a CSV. Easiest to work with.
Semi-structured — has some organisation but no rigid table shape: JSON, XML, log files. Common from web APIs, and needs flattening into tables.
Unstructured — free text, images, audio. No inherent table form; extracting features from it is a project in itself (the NLP page is exactly this for text).
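Flattening semi-structured data usually means choosing what one row should represent and repeating the parent fields on each row. A minimal sketch, assuming a made-up API response of orders with nested customers and item lists (the field names are hypothetical):

```python
import json

# Hypothetical API response: nested objects and a list per record.
payload = json.loads("""
[
  {"id": 1, "customer": {"name": "Ana", "state": "NSW"},
   "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
  {"id": 2, "customer": {"name": "Raj", "state": "VIC"},
   "items": [{"sku": "A1", "qty": 5}]}
]
""")


def flatten_orders(orders):
    """Return one flat row per order item, repeating the order fields."""
    rows = []
    for order in orders:
        for item in order.get("items", []):
            rows.append({
                "order_id": order["id"],
                "customer_name": order["customer"]["name"],
                "customer_state": order["customer"]["state"],
                "sku": item["sku"],
                "qty": item["qty"],
            })
    return rows


rows = flatten_orders(payload)
```

Here one row is one order item, so the two orders become three rows. Choosing "one row per order" instead would need the items summarised or moved to a separate table.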

It also pays to know each column's measurement type — numerical (continuous or count), categorical (ordered or not), date/time — because that decides what cleaning and which analysis are valid. Treating a postcode as a number, or an ordered rating as unordered, is a classic and costly slip.
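Both slips are easy to reproduce. The sketch below uses a few invented records to show what goes wrong when a postcode is parsed as a number and when an ordered rating is sorted as plain text; the rating levels and their order are assumptions for the example.

```python
# Hypothetical records as they might arrive from a CSV: everything is text.
records = [
    {"postcode": "0800", "rating": "high"},
    {"postcode": "2000", "rating": "low"},
    {"postcode": "3000", "rating": "medium"},
]

# A postcode is a label, not a quantity: converting it to int drops the
# leading zero and invites meaningless arithmetic.
as_number = int(records[0]["postcode"])   # 800, no longer a valid postcode
as_label = records[0]["postcode"]         # "0800", kept as text

# An ordered category needs its order stated explicitly; alphabetical
# sorting would put "high" before "low" before "medium".
RATING_ORDER = ["low", "medium", "high"]
rank = {level: i for i, level in enumerate(RATING_ORDER)}

alphabetical = sorted(r["rating"] for r in records)
by_level = sorted((r["rating"] for r in records), key=rank.__getitem__)
```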

Tidy data

The single most useful organising principle is tidy data, and it's deceptively simple: each variable is a column, each observation is a row, and each cell holds one value. Data that follows this shape is trivial to filter, group, join, and plot; data that doesn't fights you at every step.

Most messy real data violates it — values stuffed into column headers (a column per year), multiple variables crammed in one cell ("Male 25–34"), or one observation spread across several rows. A huge share of "data wrangling" is simply reshaping messy data into the tidy form, after which the analysis becomes almost easy. Learn to recognise the tidy shape and you have a target to wrangle toward every time.
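As a concrete case, here is a small sketch of reshaping a "column per year" table into tidy form, plus splitting a combined cell. The `melt` helper and the sample figures are hypothetical, written with plain dictionaries; dataframe libraries offer an equivalent operation.

```python
# Hypothetical wide table: one column per year, so the year values live in
# the headers instead of in a column of their own.
wide = [
    {"state": "NSW", "2022": 105, "2023": 118},
    {"state": "VIC", "2022": 98, "2023": 101},
]


def melt(rows, id_field, var_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            long_rows.append({
                id_field: row[id_field],
                var_name: column,
                value_name: value,
            })
    return long_rows


tidy = melt(wide, id_field="state", var_name="year", value_name="sales")

# Two variables crammed into one cell are split into two columns.
sex, age_band = "Male 25–34".split(" ", 1)
```

After melting, each row is one observation (a state in a year) and "year" is an ordinary column that can be filtered or grouped like any other.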

Cleaning

Cleaning is the heart of the work — finding and fixing what's wrong. The recurring jobs:

Missing values — decide per case: drop the row, drop the column, or impute (fill with the mean or median, or with a model's prediction). The dangerous move is ignoring them. Always ask why a value is missing, because "not recorded" and "not applicable" mean different things.
Duplicates — the same record entered twice silently double-counts; de-duplicate, but carefully (two real people can share a name).
Outliers — flag extreme values and investigate. Some are errors (a typo'd age of 200); some are the most important real signal. Never delete blindly.
Inconsistent formats & types — standardise dates, units, categories, and capitalisation; parse numbers stored as text. This is the tedious bulk of cleaning, and where reproducibility matters most.
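The four jobs above can be combined into one small cleaning pass. This is a sketch under stated assumptions, not a recipe: the sample rows, the alias table, the 0 to 120 plausible age range, and the choice of median imputation are all invented for illustration.

```python
from statistics import median

# Hypothetical survey rows; None marks a value that was not recorded.
raw = [
    {"id": 1, "state": "NSW", "age": 34},
    {"id": 2, "state": "N.S.W.", "age": None},
    {"id": 3, "state": "new south wales", "age": 200},
    {"id": 3, "state": "new south wales", "age": 200},
    {"id": 4, "state": "VIC", "age": 41},
]

# Assumed mapping of spelling variants to one canonical code.
STATE_ALIASES = {"nsw": "NSW", "new south wales": "NSW", "vic": "VIC"}


def standardise_state(value):
    key = value.replace(".", "").strip().lower()
    return STATE_ALIASES.get(key, value)


# 1. Standardise formats first, so duplicates compare equal.
rows = [{**r, "state": standardise_state(r["state"])} for r in raw]

# 2. De-duplicate on the full record, not on one field such as a name.
seen, deduped = set(), []
for r in rows:
    key = (r["id"], r["state"], r["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 3. Flag implausible values for review instead of deleting them.
#    The 0-120 range is an assumption for this sketch.
for r in deduped:
    r["age_suspect"] = r["age"] is not None and not 0 <= r["age"] <= 120

# 4. Impute missing ages with the median of the trusted values, and keep
#    a marker so the imputation stays visible downstream.
trusted = [
    r["age"] for r in deduped
    if r["age"] is not None and not r["age_suspect"]
]
fill = median(trusted)
for r in deduped:
    r["age_imputed"] = r["age"] is None
    if r["age"] is None:
        r["age"] = fill
```

The flag columns are the point of the design: every change is recorded alongside the data, so a reviewer can see which ages were imputed and which were marked as suspect.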

Reshaping and joining

With clean columns, two transformations do most of the heavy lifting. Reshaping moves data between wide (a column per category) and long (a row per category) layouts, by pivoting and melting, to reach the tidy form a given task needs. Joining stitches datasets together on a shared key, the same operation as the SQL joins on the database page: an inner join keeps only rows with a match in both tables, while a left join keeps every row from the left table and fills in matches from the right where they exist. Integrating sources well, without accidentally multiplying or dropping rows in the process, is a core data-processing skill.
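The difference between the two joins, and the way a duplicate key multiplies rows, shows up clearly on a tiny example. The `join` helper and both tables below are hypothetical, written with plain lists of dictionaries for illustration.

```python
# Hypothetical tables sharing the key "customer_id".
orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c2"},
    {"order_id": 3, "customer_id": "c9"},    # no matching customer
]
customers = [
    {"customer_id": "c1", "name": "Ana"},
    {"customer_id": "c2", "name": "Raj"},
    {"customer_id": "c2", "name": "Raj K."},  # duplicate key on the right
]


def join(left, right, key, how="inner"):
    """Inner or left join of two lists of dicts on a shared key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_fields = {f for row in right for f in row if f != key}
    out = []
    for lrow in left:
        matches = index.get(lrow[key], [])
        for match in matches:
            out.append({**lrow, **match})
        if not matches and how == "left":
            # Keep the unmatched left row, with empty right-hand fields.
            out.append({**lrow, **{f: None for f in right_fields}})
    return out


inner = join(orders, customers, "customer_id", how="inner")
left = join(orders, customers, "customer_id", how="left")
```

Three orders go in, but the inner join returns three rows only because order 3 was dropped and order 2 was doubled, and the left join returns four. Comparing row counts before and after a join is a cheap check that catches both problems.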

Getting the data in

Before any of that, you have to get the data — and where it comes from shapes how you process it:

Files — CSV, Excel, JSON. Simple, but watch encodings and inconsistent schemas.
Databases — query exactly the slice you need with SQL, rather than pulling everything.
APIs — request structured data over the web, usually JSON, often paginated.
Web scraping — extract data from pages built for humans when there's no API. Powerful but brittle, and you must respect terms and rate limits.
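Two of these sources can be sketched without touching a real service. The first half queries a slice from an in-memory SQLite database with a hypothetical "sales" table; the second half loops over a paginated response, where `fetch_page` is a stand-in for an HTTP request and the "items" and "next_page" field names are assumptions, since every API names these differently.

```python
import sqlite3

# Databases: ask for the slice you need instead of pulling the whole table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, state TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "NSW", 10.5), (2, "VIC", 7.0), (3, "NSW", 3.25)],
)
nsw_sales = conn.execute(
    "SELECT id, amount FROM sales WHERE state = ? ORDER BY id", ("NSW",)
).fetchall()
conn.close()

# APIs: keep requesting pages until the service reports there are no more.
FAKE_PAGES = {
    1: {"items": ["a", "b"], "next_page": 2},
    2: {"items": ["c"], "next_page": None},
}


def fetch_page(page):
    """Stand-in for an HTTP call that returns one page of results."""
    return FAKE_PAGES[page]


def fetch_all(first_page=1):
    items, page = [], first_page
    while page is not None:
        response = fetch_page(page)
        items.extend(response["items"])
        page = response["next_page"]
    return items


all_items = fetch_all()
```

Passing the state as a query parameter (the `?` placeholder) rather than pasting it into the SQL string keeps the query safe when the value comes from outside the script.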

Whatever the source, the first move is the same: understand the data before transforming it — its shape, its types, its quirks. Exploratory checks up front save you from cleaning the wrong thing.
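A first look can be as simple as counting rows and columns and checking what kind of values each column actually holds. The sketch below is one possible way to do that with the standard library; the sample text and the `profile` helper are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical file contents; a real run would open the file instead.
text = "id,state,amount\n1,NSW,10.5\n2,,7\n3,VIC,n/a\n"
rows = list(csv.DictReader(io.StringIO(text)))


def looks_numeric(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def profile(rows):
    """Per column: how many values are blank, numeric, or other text."""
    report = {}
    for column in rows[0]:
        kinds = Counter()
        for row in rows:
            value = row[column].strip()
            if value == "":
                kinds["blank"] += 1
            elif looks_numeric(value):
                kinds["numeric"] += 1
            else:
                kinds["text"] += 1
        report[column] = dict(kinds)
    return report


shape = (len(rows), len(rows[0]))
summary = profile(rows)
```

A column that is mostly numeric with a few text values, like "amount" here, is usually a sign of a placeholder such as "n/a" that needs handling before the column can be parsed.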

Features and reproducibility

Processing shades into feature engineering — creating the input columns a model actually learns from: deriving "age" from a birth date, encoding categories as numbers, scaling values to a common range, bucketing a continuous variable. Thoughtful features routinely beat a fancier algorithm on raw inputs, which is why this step is where a lot of real modelling skill lives — it's the on-ramp to the machine learning page.
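The four examples named above can each be written in a few lines. This is a sketch on invented records: the reference date, the age bands, and the column names are assumptions, and min-max scaling as written needs at least two distinct income values.

```python
from datetime import date

# Hypothetical cleaned records and a fixed reference date, so the derived
# ages do not change each time the script runs.
AS_OF = date(2024, 1, 1)
people = [
    {"born": date(1990, 6, 15), "state": "NSW", "income": 50_000},
    {"born": date(1975, 1, 1), "state": "VIC", "income": 90_000},
    {"born": date(2001, 12, 31), "state": "NSW", "income": 70_000},
]


def age_on(born, as_of):
    """Whole years between two dates."""
    before_birthday = (as_of.month, as_of.day) < (born.month, born.day)
    return as_of.year - born.year - before_birthday


def bucket(age):
    """Bucket a continuous age into bands (cut points are arbitrary)."""
    return "under 30" if age < 30 else "30-49" if age < 50 else "50+"


states = sorted({p["state"] for p in people})
lo = min(p["income"] for p in people)
hi = max(p["income"] for p in people)

features = []
for p in people:
    age = age_on(p["born"], AS_OF)
    row = {
        "age": age,
        "age_band": bucket(age),
        # Min-max scaling to the range 0..1.
        "income_scaled": (p["income"] - lo) / (hi - lo),
    }
    # One-hot encoding: one 0/1 column per category.
    for s in states:
        row[f"state_{s}"] = int(p["state"] == s)
    features.append(row)
```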

Underpinning all of it is reproducibility: the entire path from raw to ready should be a script anyone can re-run to get the identical result. That's what makes data work trustworthy and auditable — and it's the difference between an analysis people can rely on and a number nobody can explain. Data quality — completeness, accuracy, consistency, timeliness — is the standard you're processing toward.
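One way to make "identical result" checkable is to keep the whole path in a single function with no hidden state and to fingerprint its output. The sketch below is an illustration of that idea; the `process` and `fingerprint` names and the sample rows are hypothetical.

```python
import hashlib
import json

# Hypothetical raw input; in practice this would be read from a file that
# is never edited by hand.
RAW = [
    {"id": 2, "state": "vic", "amount": "7"},
    {"id": 1, "state": "NSW", "amount": "10.5"},
]


def process(raw):
    """The whole raw-to-ready path as one function with no hidden state."""
    rows = [
        {"id": r["id"],
         "state": r["state"].upper(),
         "amount": float(r["amount"])}
        for r in raw
    ]
    # Sort so the output order does not depend on the input order.
    return sorted(rows, key=lambda r: r["id"])


def fingerprint(rows):
    """Stable hash of the output, for comparing runs."""
    encoded = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


first = fingerprint(process(RAW))
second = fingerprint(process(list(reversed(RAW))))
```

If two runs on the same raw data give different fingerprints, something in the path depends on more than the input, such as row order, the current date, or a manual step.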