Chapter 5 · Data Analyst

Data quality and cleaning

~6 min read

Practitioners and studies alike put cleaning and preparation at roughly sixty to eighty percent of analytics work. If that sounds tedious, reframe it: this is where you build the trust that makes every downstream number believable. It is the job, not a detour from it.

5.1 A cleaning routine#

  1. Profile first. Row counts, distinct counts, min and max, and null rates per column. Anomalies jump out immediately when you look at the shape of the data before diving in.
  2. Handle missing data deliberately. Decide per column whether to drop, fill, or flag. Never let NULLs silently ride into an average.
  3. De-duplicate on purpose. Know your true key. Use ROW_NUMBER() to keep the right copy rather than a blind DISTINCT that may hide real duplicates.
  4. Fix types and formats. Dates stored as text, numbers with currency symbols, inconsistent casing in categories. Standardize before analyzing.
  5. Validate against a known total. Reconcile to a trusted figure so you know the cleaned data still ties out to reality.

5.2 The dimensions of data quality#

DimensionQuestion it answers
AccuracyDoes the value reflect reality?
CompletenessAre required values present, or missing?
ConsistencyDo systems agree with each other?
TimelinessIs the data fresh enough for the decision?
UniquenessAre there unintended duplicates?
ValidityDoes the value conform to its expected format or range?

5.3 Outliers and missing values#

Two decisions come up constantly and deserve deliberate judgment rather than reflex. With outliers, first ask whether the extreme value is an error or a real, important event. A negative age is an error to fix; a single enormous order may be your most valuable customer and must not be dropped. With missing values, decide per column whether absence means zero, unknown, or not applicable, because each implies a different treatment and the wrong choice quietly distorts every aggregate that follows.

5.4 Cleaning practice#

Walk through how you would approach a messy dataset.

Name three dimensions of data quality.

Get the next chapter and weekly interview tips by email

One short email per week. Skim in a minute. Unsubscribe anytime.