
Data Modeling Interview Guide — Star Schema, Slowly-Changing Dimensions, and OBT Trade-Offs

10 min read · April 25, 2026

A practical guide to data modeling interviews: define grain, design facts and dimensions, handle slowly-changing dimensions, and explain when star schema or OBT is the right trade-off.

This data modeling interview guide covers the star schema, slowly-changing dimensions, and OBT trade-offs that show up in analytics engineering, data engineering, and product analytics interviews. The goal is not to recite warehouse jargon. The goal is to prove you can design tables that answer real business questions without creating a brittle mess.

Strong candidates begin with the grain, define facts and dimensions, explain how history is handled, and make pragmatic trade-offs between normalized models and wide one-big-table designs. Interviewers want to hear that you can balance correctness, usability, cost, performance, and change.

What data modeling interviews test

Data modeling prompts usually sound like one of these:

| Prompt | What the interviewer is testing |
|---|---|
| Design a model for orders, customers, and payments | Can you identify grain, facts, dimensions, and relationships? |
| Build analytics tables for a subscription product | Can you handle lifecycle events, revenue, churn, and time? |
| Model user activity for a product dashboard | Can you convert event streams into queryable metrics? |
| Handle customer attributes that change over time | Do you understand slowly-changing dimensions? |
| Decide between star schema and OBT | Can you explain trade-offs instead of using dogma? |

The best answers are grounded in use cases. Before drawing tables, ask what questions the model must answer: daily revenue, active users, retention, cohort conversion, refund rate, sales pipeline, inventory availability, or support SLA. A model that is perfect for finance may be clumsy for product analytics. A model that is perfect for a dashboard may be unsafe for audit reporting.

Start with the grain

The most important sentence in a data modeling interview is: "First, I would define the grain." Grain is what one row represents. If the grain is unclear, every metric becomes ambiguous.

Examples:

  • fact_orders: one row per order placed.
  • fact_order_items: one row per item within an order.
  • fact_payments: one row per payment attempt or successful payment, depending on use case.
  • fact_subscription_daily: one row per subscription per calendar day.
  • fact_user_sessions: one row per user session.
  • fact_job_applications: one row per candidate application to a job.

Grain prevents double counting. If revenue is stored on both order and item rows, analysts can accidentally multiply revenue by item count. If subscription status is stored only as the latest state, churn history disappears. If payments are modeled only as successful charges, failed payment diagnostics vanish.
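
A quick way to enforce this in practice is a uniqueness check on the grain. A minimal sketch, assuming a fact_orders table keyed by order_id (names here are illustrative):

```sql
-- Grain check: fact_orders should have exactly one row per order.
-- Any rows returned are duplicates that would inflate revenue.
SELECT
    order_id,
    COUNT(*) AS row_count
FROM fact_orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```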

In interviews, say what your table excludes. "One row per order, excluding payment attempts and fulfillment events, which I would model separately." That clarity is a senior signal.

Facts and dimensions

A classic star schema has fact tables at the center and dimension tables around them.

Fact tables store events or measurements: orders, payments, shipments, sessions, page views, applications, messages, renewals, support tickets. They usually contain foreign keys to dimensions plus numeric measures like amount, quantity, duration, count, or latency.

Dimension tables describe the entities around facts: users, customers, products, plans, geographies, devices, campaigns, employers, stores, dates. They contain attributes analysts filter or group by.

A simple ecommerce star schema might include:

| Table | Grain | Example fields |
|---|---|---|
| fact_orders | One row per order | order_id, customer_key, order_date_key, gross_amount, discount_amount, net_amount |
| fact_order_items | One row per order item | order_id, product_key, quantity, item_price, item_discount |
| fact_payments | One row per payment attempt | order_id, payment_method_key, status, amount, failure_reason |
| dim_customer | One row per customer version | customer_key, customer_id, segment, signup_channel, effective dates |
| dim_product | One row per product version | product_key, sku, category, brand, price_band |
| dim_date | One row per date | date_key, week, month, quarter, fiscal period |

The star pattern is popular because it makes analytics easier. Analysts can join facts to dimensions and group by business attributes. BI tools understand the shape. Metric definitions can be centralized.
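
As a sketch of what that looks like in practice, assuming the table and column names from the example above, revenue by segment and month is a simple join and group-by:

```sql
-- Net revenue by customer segment and month,
-- joining the fact to its dimensions.
SELECT
    d.month,
    c.segment,
    SUM(f.net_amount) AS net_revenue
FROM fact_orders AS f
JOIN dim_customer AS c ON f.customer_key = c.customer_key
JOIN dim_date AS d ON f.order_date_key = d.date_key
GROUP BY d.month, c.segment;
```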

Star schema: when it is the right answer

A star schema is usually the right default for curated analytics layers. It works especially well when:

  • Many teams query the same business entities.
  • You need consistent dimensions across facts.
  • Metrics must be auditable and reusable.
  • The warehouse feeds BI tools, semantic layers, or finance reporting.
  • History matters and dimensions need versioning.

The advantages are clear. Fact tables stay narrow and scalable. Dimensions centralize business attributes. Analysts can combine facts consistently. Changes to product categories, customer segments, or plan names can be managed in one place.

The cost is complexity. Users need to understand joins. Poorly designed dimensions can create fanout. Many small tables can slow exploratory work if the warehouse or BI layer is not optimized. For early-stage products, a full star schema can be overkill.

A strong interview answer frames star schema as a stable analytics contract, not a religious rule. "For the curated layer used by finance and company dashboards, I would use a star schema so metric definitions and historical attributes are consistent. For exploratory product analytics, I might also provide a denormalized table."

Slowly-changing dimensions: Type 0, Type 1, Type 2, and practical history

Slowly-changing dimensions, or SCDs, handle attributes that change over time. Examples: customer segment, sales territory, product category, account owner, company size, plan tier, employer industry, or user country.

The three common approaches:

| SCD type | Behavior | Use when |
|---|---|---|
| Type 0 | Never change the original value | Immutable attributes like original signup date or first acquisition channel |
| Type 1 | Overwrite with the latest value | Corrections or attributes where history does not matter |
| Type 2 | Add a new version row with effective dates | Historical reporting must reflect the attribute at event time |

Type 1 is simple but destroys history. If a customer moves from SMB to Enterprise and you overwrite segment, last year's revenue will now appear to have come from Enterprise. That might be wrong for historical reporting.

Type 2 preserves history by creating versioned rows (see the DDL sketch after this list):

  • customer_key: surrogate key for the version
  • customer_id: natural business key
  • segment
  • effective_start_at
  • effective_end_at
  • is_current
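
Materialized as a table definition, those fields might look like the sketch below (column types are assumptions; adjust for your warehouse):

```sql
-- A Type 2 customer dimension: one row per customer version.
CREATE TABLE dim_customer (
    customer_key       BIGINT    PRIMARY KEY,  -- surrogate key, one per version
    customer_id        BIGINT    NOT NULL,     -- natural business key
    segment            VARCHAR(50),
    effective_start_at TIMESTAMP NOT NULL,
    effective_end_at   TIMESTAMP,              -- NULL while the version is current
    is_current         BOOLEAN   NOT NULL
);
```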

Facts join to the dimension version that was valid at the event time, so revenue from January is attributed to SMB and post-upgrade revenue to Enterprise.
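
A point-in-time join might look like the sketch below, assuming the fact carries the natural key and an order timestamp; many teams instead resolve the surrogate key at load time, which moves this logic into the pipeline:

```sql
-- Join each order to the customer version valid at order time,
-- so historical revenue keeps the segment the customer had at purchase.
SELECT
    f.order_id,
    f.net_amount,
    c.segment AS segment_at_order_time
FROM fact_orders AS f
JOIN dim_customer AS c
  ON  f.customer_id = c.customer_id
  AND f.ordered_at >= c.effective_start_at
  AND f.ordered_at <  COALESCE(c.effective_end_at, TIMESTAMP '9999-12-31 00:00:00');
```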

In interviews, mention that not every attribute deserves Type 2. Versioning every field creates noisy dimensions and expensive joins. Use Type 2 for attributes that materially change analysis: segment, plan, region, owner, category, pricing tier. Use Type 1 for typo corrections or low-value display attributes.
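
To make the difference concrete, here is a hedged sketch of both update styles against the dim_customer table above (keys and values are placeholders):

```sql
-- Type 1: overwrite in place. The old segment is gone forever.
UPDATE dim_customer
SET segment = 'Enterprise'
WHERE customer_id = 42 AND is_current = TRUE;

-- Type 2: close the current version, then insert a new one.
UPDATE dim_customer
SET effective_end_at = CURRENT_TIMESTAMP,
    is_current = FALSE
WHERE customer_id = 42 AND is_current = TRUE;

INSERT INTO dim_customer
    (customer_key, customer_id, segment, effective_start_at, effective_end_at, is_current)
VALUES
    (10042, 42, 'Enterprise', CURRENT_TIMESTAMP, NULL, TRUE);
-- customer_key 10042 is a placeholder; real pipelines generate surrogate keys.
```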

OBT trade-offs: one big table is not lazy if it is intentional

OBT, or one big table, is a wide denormalized table that pre-joins facts and dimensions for common analysis. Example: analytics_orders_wide with order facts, customer attributes, product category, campaign, geography, and payment status all in one table.

OBT is useful when:

  • Analysts need fast self-serve exploration.
  • Query performance matters more than storage efficiency.
  • The metric has one dominant grain.
  • Source dimensions are stable enough to snapshot safely.
  • BI tools struggle with complex joins.

OBT is risky when:

  • Multiple grains are mixed in one table.
  • Arrays or repeated entities create fanout.
  • Historical attributes are unclear.
  • Sensitive fields are copied everywhere.
  • Every new question requires adding more columns.

The interview-safe answer is: use star schema for canonical modeling and OBT for consumption when it has a clear grain. Do not use OBT as a dumping ground.

For example, an orders OBT can be excellent if it has one row per order and includes customer segment at order time, acquisition channel, shipping region, payment status, and refund flags. It becomes dangerous if it mixes in item-level fields without moving to item grain, folds in multiple payment attempts or shipments per order, or carries latest customer attributes with no snapshot logic.
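
A hedged sketch of the safe version, keeping one row per order and making the snapshot and fanout handling explicit (names follow the earlier ecommerce example; the payment rollup is an assumption):

```sql
-- analytics_orders_wide: one row per order, attributes as of order time.
CREATE TABLE analytics_orders_wide AS
SELECT
    f.order_id,
    f.order_date_key,
    f.net_amount,
    c.segment AS segment_at_order_time,  -- from the SCD2 version valid at order time
    c.signup_channel,
    p.was_paid,
    p.attempt_count
FROM fact_orders AS f
JOIN dim_customer AS c ON f.customer_key = c.customer_key
LEFT JOIN (
    -- Collapse payment attempts to order grain before joining, to avoid fanout.
    SELECT
        order_id,
        MAX(CASE WHEN status = 'succeeded' THEN 1 ELSE 0 END) AS was_paid,
        COUNT(*) AS attempt_count
    FROM fact_payments
    GROUP BY order_id
) AS p ON f.order_id = p.order_id;
```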

Star schema vs OBT decision table

| Decision factor | Star schema | OBT |
|---|---|---|
| Metric consistency | Strong | Depends on build logic |
| Ease for analysts | Medium; requires joins | High for common questions |
| Historical accuracy | Strong with SCDs | Strong only if snapshots are explicit |
| Performance | Good with proper warehouse design | Often faster for dashboards |
| Storage cost | Lower | Higher due to duplicated attributes |
| Change management | Cleaner | Can become wide and brittle |
| Best use | Curated semantic layer | Dashboard/exploration layer |

A mature answer often proposes both: raw sources, cleaned staging, dimensional marts, then optional wide tables or semantic-layer models for high-traffic use cases.

Worked example: model a subscription product

Prompt: "Design a data model for a subscription app that tracks trials, paid plans, cancellations, and revenue."

Clarify the questions first: daily active subscriptions, trial conversion, MRR, churn, refunds, plan mix, acquisition channel performance, and cohort retention.

Core tables:

  • dim_customer: one row per customer version, with customer_id, signup_channel, region, segment, and SCD2 fields for segment or region if needed.
  • dim_plan: plan_key, plan_id, plan_name, billing_period, list_price, product_family, effective dates if pricing changes.
  • fact_subscription_events: one row per subscription event such as trial_started, converted, renewed, canceled, reactivated, plan_changed.
  • fact_subscription_daily: one row per subscription per day with status, plan_key, mrr, is_trial, is_paid, is_canceled, days_since_start.
  • fact_invoices: one row per invoice with amount_due, amount_paid, discount, tax, refund, payment_status.
  • fact_payment_attempts: one row per attempt to capture failures, retry success, and dunning.

Why both events and daily snapshot? Events preserve lifecycle detail. Daily snapshot makes MRR, active subscriptions, and churn easy to query. If you only store events, analysts must reconstruct state constantly. If you only store daily snapshots, you lose event-level causality.
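
One common way to build the daily snapshot from the event table is to turn each event into a validity window and join it to a date spine. A simplified sketch, assuming one state-changing event per subscription per day and a dim_date calendar with a full_date column (all names are assumptions):

```sql
-- fact_subscription_daily: one row per subscription per day,
-- carrying forward the state set by the most recent event.
WITH event_windows AS (
    SELECT
        subscription_id,
        status,        -- e.g. trial, paid, canceled
        plan_key,
        mrr,
        event_date AS valid_from,
        LEAD(event_date) OVER (
            PARTITION BY subscription_id
            ORDER BY event_date
        ) AS valid_to  -- NULL for the latest event
    FROM fact_subscription_events
)
SELECT
    d.full_date,
    w.subscription_id,
    w.status,
    w.plan_key,
    w.mrr
FROM dim_date AS d
JOIN event_windows AS w
  ON  d.full_date >= w.valid_from
  AND (w.valid_to IS NULL OR d.full_date < w.valid_to)
WHERE d.full_date <= CURRENT_DATE;
```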

SCD handling: plan changes should be captured at event time and in daily snapshots. Customer segment may be Type 2 if segment reporting matters historically. Acquisition channel is usually Type 0 because first-touch source should not change.

OBT option: create subscription_daily_wide with one row per subscription per day, including plan, customer segment at that day, acquisition channel, region, MRR, and status flags. Use it for dashboards. Keep canonical facts and dimensions underneath.

Modeling event data for product analytics

Event streams are tempting to query directly, but raw events are rarely enough. They can contain duplicate events, schema drift, bot traffic, unstitched anonymous-to-logged-in identities, late arrivals, and inconsistent properties.

A good event modeling flow:

  1. Raw events land unchanged.
  2. Staging models clean names, timestamps, identities, and event properties (see the deduplication sketch after this list).
  3. Session or user-activity facts aggregate repeated events into useful grains.
  4. Metric tables define activation, retention, conversion, and engagement.
  5. Wide consumption tables support common product analytics questions.
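
Step 2 usually starts with deduplication. A minimal staging sketch, assuming raw events carry an event_id and a load timestamp (names are illustrative):

```sql
-- stg_events: keep one row per event_id, preferring the latest-loaded copy.
WITH ranked AS (
    SELECT
        event_id,
        user_id,
        event_name,
        event_at,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY loaded_at DESC
        ) AS rn
    FROM raw_events
)
SELECT
    event_id,
    user_id,
    LOWER(event_name) AS event_name,  -- normalize naming drift
    event_at
FROM ranked
WHERE rn = 1;
```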

For a job marketplace, raw events might include job_viewed, apply_clicked, application_submitted, employer_viewed_candidate, employer_replied, interview_requested. A useful model would include fact_applications at one row per application and fact_candidate_activity_daily at one row per candidate per day. The first supports marketplace quality and employer outcomes. The second supports retention and engagement.

Common traps in data modeling interviews

The first trap is mixing grains. If one table has one row per order but includes product category from multiple items, category-level revenue will be wrong. Either move to order-item grain or use carefully defined derived fields like primary_category.
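
If you stay at order grain, the derived field needs an explicit, deterministic rule. One hedged sketch: take the category of the highest-revenue item in each order (names follow the earlier ecommerce example):

```sql
-- primary_category: category of the highest-revenue item in each order.
SELECT
    order_id,
    category AS primary_category
FROM (
    SELECT
        i.order_id,
        p.category,
        ROW_NUMBER() OVER (
            PARTITION BY i.order_id
            ORDER BY i.quantity * i.item_price DESC
        ) AS rn
    FROM fact_order_items AS i
    JOIN dim_product AS p ON i.product_key = p.product_key
) AS ranked
WHERE rn = 1;
```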

Second, candidates forget history. Latest customer segment is not the same as customer segment at purchase time. If historical reporting matters, use SCD2 or snapshot facts.

Third, candidates model only happy paths. Payments fail, orders refund, subscriptions pause, users merge accounts, products change categories, employers repost jobs, and shipments split. You do not need every edge case, but mention the important ones.

Fourth, candidates over-normalize analytics models. A perfectly normalized OLTP-style model can make simple BI questions painful. Analytics models should be built for reading and aggregation.

Fifth, candidates ignore governance. Sensitive attributes, deletion requests, and access control matter. Wide tables copy data everywhere, so be careful with PII and compliance-sensitive fields.

Prep checklist for data modeling interviews

Practice these steps until they are automatic:

  • Ask what business questions the model must answer.
  • Define the grain of every fact table.
  • Separate facts, dimensions, and bridge tables when needed.
  • Identify mutable attributes and choose Type 0, Type 1, or Type 2.
  • Name the natural key and surrogate key where history matters.
  • Check for fanout and double counting.
  • Decide whether a snapshot table is needed.
  • Explain whether you would provide an OBT for consumption.
  • Mention data quality tests: uniqueness, not-null keys, referential integrity, accepted values, freshness (see the sketch after this list).
  • Call out privacy and access considerations.
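
As an illustration of the data quality item above, a referential integrity check is a few lines of SQL; in practice, tools like dbt generate tests of this shape:

```sql
-- Referential integrity: every fact row should point at a known dimension row.
-- Rows returned here are orphaned facts that would silently drop out of joins.
SELECT f.order_id, f.customer_key
FROM fact_orders AS f
LEFT JOIN dim_customer AS c ON f.customer_key = c.customer_key
WHERE c.customer_key IS NULL;
```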

A concise closing answer might be: "I would model canonical facts and dimensions in a star schema, use SCD2 for historically important customer and plan attributes, add daily snapshots where state needs to be measured over time, and expose a wide table for common dashboard queries without making it the source of truth." That is the balance interviewers want: principled, but practical.