Chapter 4 · Data Analyst

How data is stored and moved

~9 min read

Before you can analyze data, you should understand where it comes from and how it is shaped for you. This chapter covers the modern data stack at the level an analyst needs: how data is modeled in a warehouse, how it gets there, and the vocabulary that lets you talk credibly with data engineers. This is often missing from beginner material, and it is exactly what makes you effective on a real team.

4.1 OLTP versus OLAP#

Operational systems that run the business (the app, the checkout, the CRM) use OLTP databases: normalized into many small tables, optimized for fast writes and integrity. Analytics happens in OLAP systems: a data warehouse, denormalized into wide tables optimized for fast reads across huge volumes. The same data looks different in each because each is tuned for a different job.

OLTP (operational)OLAP (analytics)
ShapeMany small normalized tablesFew wide denormalized tables
Optimized forFast writes, integrityFast reads, aggregation
ExampleThe app's order databaseThe warehouse you query

4.2 The star schema#

Most analytics warehouses organize data into a star schema: fact tables holding the events or measurements (sales, clicks, sessions) surrounded by dimension tables holding descriptive context (who, what, when, where). It looks like a star, and it is designed so an analyst can slice any measure by any attribute with simple joins.

Figure 4. A star schema. The fact table holds measures and foreign keys; dimensions hold descriptive attributes.

4.3 ETL and ELT, how data arrives#

Data gets from source systems into the warehouse through a pipeline. In the classic ETL pattern, data is Extracted, Transformed, then Loaded. In the modern ELT pattern, raw data is Extracted and Loaded first, then Transformed inside the warehouse (often with a tool like dbt). ELT dominates today because warehouse compute is cheap and keeping the raw data lets you re-transform when definitions change.

StageWhat happensAnalyst's stake
ExtractPull from source systems and APIsKnow your sources and their quirks
LoadLand raw data in the warehouseUnderstand freshness and lag
TransformClean and model into usable tablesThis is where your tables come from

4.4 Slowly changing dimensions#

A subtle but important warehousing concept. When a descriptive attribute changes over time (a customer moves city, a product changes category), how should history be recorded? A Type 1 dimension overwrites the old value, keeping only the current state. A Type 2 dimension keeps a full history by adding a new row with valid-from and valid-to dates. Knowing the difference matters because it determines whether a historical report reflects the world as it was then or as it is now.

4.5 Data marts and the semantic layer#

Warehouses are often organized in layers. Raw tables land first, then cleaned and conformed tables, then data marts: curated, business-friendly tables built for a specific team such as finance or marketing. Many companies add a semantic layer on top, where metrics like revenue or active users are defined once so every dashboard agrees. As an analyst, knowing which layer you are querying tells you how much you can trust a table and whether a metric is already defined for you.

4.6 A worked modeling example#

Suppose you are asked for revenue by customer segment over time. In a star schema this is a clean join: the sales fact table gives you revenue and dates, the customer dimension gives you the segment, and the date dimension gives you the period. You group by segment and month and sum revenue. Good modeling turns hard-sounding questions into short queries. When a request feels painful to write, it is often a sign the underlying tables are poorly modeled, which is worth raising with the data team rather than working around with fragile SQL.

4.7 Warehousing practice#

What is the difference between a fact and a dimension table?

Why has ELT largely replaced ETL?

Get the next chapter and weekly interview tips by email

One short email per week. Skim in a minute. Unsubscribe anytime.