Data Engineer Interview Questions — Pipelines, SQL Optimization, and Warehouse Design

9 min read · April 25, 2026

Data engineer interviews test practical judgment: modeling data, moving it reliably, optimizing SQL, and designing warehouses that analysts and products can trust. This guide covers the questions, answer patterns, and senior-level signals that matter in 2026.

A data engineer interview in 2026 is about trust. Can the company trust you to move data correctly, model it clearly, recover when pipelines fail, and build systems that analysts, ML teams, finance, and product teams can depend on? The work is less glamorous than a whiteboard full of distributed systems boxes, but the business impact is enormous. Bad data pipelines create bad decisions, broken experiments, missed revenue, and angry customers.

The senior data engineering bar has moved up. Companies expect SQL depth, data modeling judgment, orchestration experience, cloud warehouse knowledge, streaming basics, data quality discipline, and enough product sense to understand why the data exists. The interview loop is designed to find people who can build boring, durable systems and explain tradeoffs clearly.

What the data engineer loop usually tests

| Round | What they test | Strong signal |
|---|---|---|
| SQL | Can you transform and debug real data? | Correct joins, windows, deduping, performance awareness |
| Data modeling | Can you design useful tables? | Grain, dimensions, facts, slowly changing history |
| Pipeline design | Can you move data reliably? | Idempotency, retries, backfills, observability |
| Warehouse architecture | Can you support analytics at scale? | Partitioning, clustering, governance, cost control |
| Systems / streaming | Can you reason about events and latency? | Ordering, exactly-once limits, schemas, replay |
| Behavioral | Can you work with stakeholders? | Clear ownership, incident response, prioritization |

Expect the interviewer to care about edge cases: late-arriving events, duplicate rows, schema changes, timezone problems, backfills, PII, cost explosions, and consumers who interpret metrics differently.

SQL questions

Common SQL prompts include:

  • Deduplicate events and keep the latest record per entity.
  • Calculate daily active users and seven-day retention.
  • Build a funnel from page view to purchase.
  • Find customers whose revenue dropped by more than 50% month over month.
  • Compute rolling averages and rank top products within categories.
  • Identify gaps or overlaps in subscription periods.
  • Reconcile payments between an internal ledger and a processor export.

Senior candidates talk through assumptions before writing SQL. What is the primary key? Can events arrive late? Are timestamps UTC? Are there soft deletes? Is revenue net of refunds? Does one user have multiple accounts? These questions show production awareness.
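
As a quick illustration of the first prompt, one common shape is a window function that ranks rows per entity and keeps the newest one. The table and column names here (events, entity_id, updated_at) are hypothetical placeholders, not a specific schema:

```sql
-- Keep the latest record per entity; names are hypothetical placeholders.
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY entity_id      -- one winner per entity
      ORDER BY updated_at DESC    -- newest record first
    ) AS rn
  FROM events
)
SELECT *
FROM ranked
WHERE rn = 1;
```

If duplicates can share the same updated_at, add a tiebreaker column to the ORDER BY so the result stays deterministic.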

For performance, know how joins, filters, partitions, clustering, sort keys, and window functions affect cost. In Snowflake, BigQuery, Redshift, Databricks, and Postgres, the syntax differs but the principles rhyme: reduce scanned data, join at the right grain, avoid exploding intermediate results, aggregate when possible, and inspect query plans or warehouse execution details.

A common optimization example: a query joins raw events to users and orders, then filters to one month at the end. Move the date filter into the earliest event CTE. Deduplicate before joining. Select only needed columns. Aggregate high-cardinality events before joining to dimensions. These simple moves can cut cost by 80%.
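
A hedged sketch of that rewrite, using hypothetical raw_events and users tables; exact syntax varies by warehouse, but the shape (filter early, dedupe, aggregate, then join) is the point:

```sql
-- Push the date filter into the earliest CTE, reduce rows, then join.
WITH month_events AS (
  SELECT user_id, event_type, event_ts
  FROM raw_events
  WHERE event_ts >= DATE '2026-03-01'
    AND event_ts <  DATE '2026-04-01'       -- filter before any join
),
deduped AS (
  SELECT DISTINCT user_id, event_type, event_ts
  FROM month_events                          -- dedupe before joining
),
per_user AS (
  SELECT user_id, COUNT(*) AS event_count
  FROM deduped
  GROUP BY user_id                           -- aggregate high-cardinality events first
)
SELECT u.country, SUM(p.event_count) AS events
FROM per_user p
JOIN users u ON u.user_id = p.user_id        -- join at user grain, only needed columns
GROUP BY u.country;
```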

Data modeling questions

"Design a warehouse model for an e-commerce business." Start with business processes: orders, payments, shipments, refunds, inventory, customers, sessions, marketing touches. Define grain. A fact_orders table might be one row per order; fact_order_items one row per item; fact_payments one row per payment transaction. Dimensions include customer, product, merchant, date, channel, and geography.

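A compressed sketch of one of those tables, with hypothetical column names; the comment stating the grain is the habit interviewers listen for:

```sql
-- fact_order_items. Grain: one row per order line item.
CREATE TABLE fact_order_items (
  order_item_key  BIGINT,          -- surrogate key
  order_id        BIGINT,          -- degenerate dimension
  customer_key    BIGINT,          -- FK to dim_customer
  product_key     BIGINT,          -- FK to dim_product
  order_date_key  INT,             -- FK to dim_date
  channel_key     INT,             -- FK to dim_channel
  quantity        INT,
  unit_price      DECIMAL(12, 2),
  item_revenue    DECIMAL(12, 2)   -- additive measure at this grain
);
```
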
The interviewer is listening for grain discipline. Many warehouse problems come from mixing grains: order-level revenue joined to item-level rows, daily snapshots joined to event streams, or customer attributes that change over time. State the grain in every table.

"How do you handle slowly changing dimensions?" Explain Type 1 overwrite for attributes where history does not matter, Type 2 for history-preserving attributes, and snapshots for facts that change over time. For example, customer plan tier may need history because revenue analysis depends on what tier the customer had at the time of purchase. Use valid_from, valid_to, current_flag, and surrogate keys when appropriate.

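A minimal Type 2 sketch, assuming a hypothetical dim_customer versioned with valid_from and valid_to (open-ended on the current row); the join shows how a fact picks up the attribute value that was current at transaction time:

```sql
-- Which plan tier did the customer have when they placed each order?
SELECT
  f.order_id,
  f.order_ts,
  d.plan_tier                      -- the tier in effect at order time
FROM fact_orders f
JOIN dim_customer d
  ON  d.customer_id = f.customer_id
  AND f.order_ts >= d.valid_from
  AND f.order_ts <  d.valid_to;    -- valid_to is e.g. 9999-12-31 on the current row
```
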
"Star schema or wide tables?" The senior answer is "it depends." Star schemas are clean, governed, and reusable. Wide tables are easier for analysts and BI tools. Many modern teams use layered models: normalized or raw ingest, cleaned staging, conformed facts/dimensions, and curated marts or wide tables for common use cases. The key is ownership and documentation.

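To make the layering concrete, here is a tiny staging view with made-up source columns: the staging layer only renames, types, and lightly filters the raw ingest, and marts are built from staged or conformed models rather than from raw tables:

```sql
-- stg_orders: cleaned staging view over the raw ingest (hypothetical columns).
CREATE VIEW staging.stg_orders AS
SELECT
  CAST(id AS BIGINT)            AS order_id,
  CAST(user_id AS BIGINT)       AS customer_id,
  LOWER(status)                 AS order_status,
  CAST(created_at AS TIMESTAMP) AS ordered_at
FROM raw.orders
WHERE NOT COALESCE(is_deleted, FALSE);   -- drop soft-deleted rows
```
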
Pipeline design questions

"Design a pipeline that ingests product events into the warehouse." A strong answer covers event collection, schema validation, ingestion, storage, transformation, quality checks, and consumption. Mention event contracts, versioning, dead-letter queues, replay, partitioning, and monitoring.

A practical architecture:

  1. Product emits events with a versioned schema.
  2. Events land in a durable stream or object storage.
  3. Validation separates good records from malformed records.
  4. Raw data is stored immutably for replay.
  5. Transformations create cleaned event tables and business marts.
  6. Quality checks validate volume, freshness, uniqueness, and referential integrity.
  7. Dashboards and downstream jobs consume governed tables.

"How do you make a pipeline idempotent?" Idempotency means rerunning the job produces the same result. Use deterministic keys, merge/upsert semantics, partition replacement, checkpointing, and immutable raw inputs. Avoid append-only transforms without dedupe unless duplicates are acceptable. For batch jobs, partition-based rebuilds are often simpler and safer than row-by-row updates.

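Two common idempotent write patterns, sketched with hypothetical table names; exact syntax differs across warehouses, but most support both shapes:

```sql
-- Pattern 1: partition replacement. Rerunning the job rewrites the whole day,
-- so a retry or backfill cannot double-count.
DELETE FROM clean_events WHERE event_date = DATE '2026-04-01';
INSERT INTO clean_events (event_id, user_id, event_type, event_ts, event_date)
SELECT DISTINCT event_id, user_id, event_type, event_ts, CAST(event_ts AS DATE)
FROM raw_events
WHERE CAST(event_ts AS DATE) = DATE '2026-04-01';

-- Pattern 2: merge/upsert keyed on a deterministic ID, so reprocessed rows collapse.
MERGE INTO dim_user AS t
USING staged_users AS s
  ON t.user_id = s.user_id
WHEN MATCHED THEN
  UPDATE SET email = s.email, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (user_id, email, updated_at) VALUES (s.user_id, s.email, s.updated_at);
```
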
"A pipeline failed for three days. How do you backfill?" First stop the bleeding: identify root cause, fix code or upstream data, and prevent new corrupt output. Then assess downstream impact. Backfill raw-to-cleaned tables in dependency order, ideally by partition. Validate row counts, checksums, sample records, and key metrics. Communicate to consumers. If dashboards were wrong, say so clearly. Senior data engineers are trusted because they handle bad news cleanly.

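After the backfill, a per-partition reconciliation between layers is a cheap way to catch missed or duplicated days; raw_events and clean_events are hypothetical names:

```sql
-- Compare daily row counts between raw and cleaned layers for the backfilled range.
SELECT
  r.event_date,
  r.raw_rows,
  COALESCE(c.clean_rows, 0)              AS clean_rows,
  r.raw_rows - COALESCE(c.clean_rows, 0) AS diff
FROM (
  SELECT CAST(event_ts AS DATE) AS event_date, COUNT(*) AS raw_rows
  FROM raw_events
  GROUP BY 1
) r
LEFT JOIN (
  SELECT event_date, COUNT(*) AS clean_rows
  FROM clean_events
  GROUP BY 1
) c ON c.event_date = r.event_date
WHERE r.event_date BETWEEN DATE '2026-03-28' AND DATE '2026-03-30'
ORDER BY r.event_date;
```
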
Warehouse and lakehouse design

Modern data stacks often combine object storage, cloud warehouses, transformation tools, orchestration, and BI. You should be ready to discuss Snowflake, BigQuery, Redshift, Databricks, Spark, dbt-style transformations, Airflow/Dagster/Prefect, Kafka/Kinesis/Pub/Sub, and table formats like Iceberg or Delta if relevant. You do not need every tool, but you need the concepts.

Key design questions:

  • What data needs low latency versus daily batch?
  • Which tables are raw, cleaned, conformed, and curated?
  • How are schemas versioned and documented?
  • How are PII, access controls, retention, and deletion handled?
  • How are costs monitored by team, pipeline, or query pattern?
  • How do analysts discover trusted metrics?
  • What is the recovery plan when upstream systems change?

For warehouses, discuss partitioning by date or ingestion time, clustering by commonly filtered keys, materialized views for expensive aggregates, and workload isolation. In BigQuery, scanned bytes matter. In Snowflake, warehouse size and runtime matter. In Databricks, file size, partition count, and compaction matter. Senior candidates adapt the principle to the platform.
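
A BigQuery-flavored sketch of those knobs on a hypothetical events table; Snowflake clustering keys, Redshift sort keys, and Delta OPTIMIZE/ZORDER express the same idea with different syntax:

```sql
-- Partition by event date so date-filtered queries scan only the relevant days,
-- and cluster by the keys most queries filter or join on.
CREATE TABLE analytics.events (
  event_id   STRING,
  user_id    STRING,
  event_type STRING,
  event_ts   TIMESTAMP
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type;
```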

Streaming questions

Streaming interviews often test whether you understand tradeoffs, not whether you can recite Kafka internals.

"Batch or streaming?" Choose streaming when latency matters: fraud detection, inventory availability, notifications, operational alerts. Choose batch when daily correctness is enough and simplicity matters. A near-real-time pipeline can be much more expensive and harder to debug than an hourly batch job. Senior data engineers do not build streaming systems for dashboard vanity.

"How do you handle late events?" Use event time, watermarks, allowed lateness windows, and correction processes. For financial or compliance data, late events may require restating historical partitions. For analytics dashboards, you might show a metric as preliminary for 48 hours. Name the business tolerance.

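One common batch-side correction, with hypothetical table names (date arithmetic syntax varies by engine): recompute the trailing partitions on every run so late arrivals are folded in while the metric is still labeled preliminary:

```sql
-- Restate the trailing 2 days of the daily metric on every run; events that
-- arrived late since the last run are picked up by the recomputation.
DELETE FROM daily_purchases
WHERE metric_date >= CURRENT_DATE - INTERVAL '2' DAY;

INSERT INTO daily_purchases (metric_date, purchases)
SELECT
  CAST(event_ts AS DATE) AS metric_date,
  COUNT(*)               AS purchases
FROM purchase_events
WHERE CAST(event_ts AS DATE) >= CURRENT_DATE - INTERVAL '2' DAY
GROUP BY 1;
```

A view or a flag over the trailing window can mark those days as preliminary for dashboard consumers.
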
"Exactly once?" Be careful. Exactly-once semantics are often limited to a specific system boundary. End-to-end exactly once across producers, streams, storage, and consumers is hard. Practical systems use idempotent producers, deterministic event IDs, transactional sinks where available, and deduplication. A senior answer avoids magical guarantees.

Design exercise answer pattern

For any architecture prompt, use the same sequence: consumers, source systems, freshness requirement, correctness requirement, data model, orchestration, validation, observability, access control, and cost. Then state the tradeoff. For example, an executive revenue dashboard may need daily audited accuracy, while a fraud operations dashboard may accept approximate real-time counts with later correction. That single distinction changes storage, SLA, ownership, and alerting. Senior candidates do not treat all pipelines as equally urgent; they match engineering complexity to the decision the data supports.

Data quality and observability

Data quality is now a first-class interview topic. Discuss tests for freshness, volume, uniqueness, null rates, accepted values, referential integrity, distribution drift, and reconciliation against source systems. For critical metrics, add alerting and ownership. A failed data test without an owner is just noise.
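
A few of these tests reduce to small queries; a sketch against a hypothetical clean_events table, where each query should have an owner and an alert threshold:

```sql
-- Freshness: how old is the newest loaded event?
SELECT MAX(event_ts) AS latest_event FROM clean_events;

-- Uniqueness: duplicated primary keys should return zero rows.
SELECT event_id, COUNT(*) AS copies
FROM clean_events
GROUP BY event_id
HAVING COUNT(*) > 1;

-- Null rate on a required column.
SELECT SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS null_rate
FROM clean_events;
```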

Good data observability includes:

  • Pipeline runtime and failure alerts
  • Data freshness SLAs
  • Row count and volume anomaly detection
  • Schema change detection
  • Lineage from source to dashboard
  • Cost monitoring
  • Incident runbooks
  • Consumer communication channels

When asked about a broken dashboard, diagnose systematically: data source, ingestion, transformation, metric definition, BI filter, permissions, cache, and recent code or schema changes. Do not assume the dashboard is wrong; sometimes the business actually changed.

Behavioral questions

Prepare stories for:

  • A pipeline incident and how you handled communication
  • A data model you redesigned because the old one could not scale
  • A time you pushed back on a stakeholder's metric definition
  • A cost optimization project
  • A migration from one warehouse or orchestration system to another
  • A time you balanced speed and correctness
  • Mentoring analysts or engineers on data usage

For stakeholder stories, be concrete. "Marketing wanted a real-time attribution dashboard. I showed that daily freshness met the decision cadence, saving an estimated $6K/month in streaming costs, and built an hourly alert only for campaign outages." That is better than saying "I prioritize business value."

Questions to ask the company

Ask:

  • What data is considered mission-critical here?
  • Who owns metric definitions and data contracts?
  • What are the biggest reliability problems in the current stack?
  • How much of the work is analytics engineering versus platform engineering?
  • What latency requirements are real versus aspirational?
  • How are data incidents communicated?
  • What warehouse or pipeline costs are currently painful?
  • What would a successful first six months look like?

The answers reveal whether you are joining a mature data platform, a dashboard factory, or a rescue mission.

Final prep checklist

Practice SQL with windows, CTEs, dedupe, cohorts, funnels, and reconciliation. Review dimensional modeling and slowly changing dimensions. Prepare to design an ingestion pipeline with idempotency, backfills, quality checks, and observability. Have one cost optimization story and one incident story ready.

The 2026 data engineering bar is practical reliability. Show that you can make data correct, understandable, recoverable, and affordable. If your answers keep returning to grain, ownership, validation, and business use, you will sound like the person companies trust with their data foundation.