Data Engineer Interview Questions — Pipelines, SQL Optimization, and Warehouse Design
Data engineer interviews test practical judgment: modeling data, moving it reliably, optimizing SQL, and designing warehouses that analysts and products can trust. This guide covers the questions to expect in 2026, the answer patterns that work, and the signals of senior-level judgment.
A data engineer interview in 2026 is about trust. Can the company trust you to move data correctly, model it clearly, recover when pipelines fail, and build systems that analysts, ML teams, finance, and product teams can depend on? The work is less glamorous than a whiteboard full of distributed systems boxes, but the business impact is enormous. Bad data pipelines create bad decisions, broken experiments, missed revenue, and angry customers.
The senior data engineering bar has moved up. Companies expect SQL depth, data modeling judgment, orchestration experience, cloud warehouse knowledge, streaming basics, data quality discipline, and enough product sense to understand why the data exists. The interview loop is designed to find people who can build boring, durable systems and explain tradeoffs clearly.
What the data engineer loop usually tests
| Round | What they test | Strong signal |
|---|---|---|
| SQL | Can you transform and debug real data? | Correct joins, windows, deduping, performance awareness |
| Data modeling | Can you design useful tables? | Grain, dimensions, facts, slowly changing history |
| Pipeline design | Can you move data reliably? | Idempotency, retries, backfills, observability |
| Warehouse architecture | Can you support analytics at scale? | Partitioning, clustering, governance, cost control |
| Systems / streaming | Can you reason about events and latency? | Ordering, exactly-once limits, schemas, replay |
| Behavioral | Can you work with stakeholders? | Clear ownership, incident response, prioritization |
Expect the interviewer to care about edge cases: late-arriving events, duplicate rows, schema changes, timezone problems, backfills, PII, cost explosions, and consumers who interpret metrics differently.
SQL questions
Common SQL prompts include:
- Deduplicate events and keep the latest record per entity.
- Calculate daily active users and seven-day retention.
- Build a funnel from page view to purchase.
- Find customers whose revenue dropped by more than 50% month over month.
- Compute rolling averages and rank top products within categories.
- Identify gaps or overlaps in subscription periods.
- Reconcile payments between an internal ledger and a processor export.
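The first prompt above, deduplicating events while keeping the latest record per entity, is usually solved with `ROW_NUMBER`. A minimal sketch, run here through SQLite so it is executable; the table and column names (`events`, `user_id`, `updated_at`) are illustrative assumptions, not a fixed schema:

```python
# Deduplicate events, keeping the latest record per entity.
# Requires SQLite 3.25+ for window function support.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, status TEXT, updated_at TEXT);
INSERT INTO events VALUES
  (1, 'trial', '2026-01-01'),
  (1, 'paid',  '2026-01-05'),
  (2, 'trial', '2026-01-02');
""")

# ROW_NUMBER picks exactly one row per entity; in production you would
# add a second ORDER BY key to make ties on updated_at deterministic.
latest = conn.execute("""
SELECT user_id, status, updated_at
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY user_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM events
)
WHERE rn = 1
ORDER BY user_id
""").fetchall()

print(latest)  # [(1, 'paid', '2026-01-05'), (2, 'trial', '2026-01-02')]
```

Interviewers often prefer `ROW_NUMBER` over `MAX(updated_at)` plus a self-join because it keeps the whole row, handles ties explicitly, and reads at the stated grain.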
Senior candidates talk through assumptions before writing SQL. What is the primary key? Can events arrive late? Are timestamps UTC? Are there soft deletes? Is revenue net of refunds? Does one user have multiple accounts? These questions show production awareness.
For performance, know how joins, filters, partitions, clustering, sort keys, and window functions affect cost. In Snowflake, BigQuery, Redshift, Databricks, and Postgres, the syntax differs but the principles rhyme: reduce scanned data, join at the right grain, avoid exploding intermediate results, aggregate when possible, and inspect query plans or warehouse execution details.
A common optimization example: a query joins raw events to users and orders, then filters to one month at the end. Move the date filter into the earliest event CTE. Deduplicate before joining. Select only needed columns. Aggregate high-cardinality events before joining to dimensions. These simple moves can cut cost by 80%.
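The optimized shape of that query can be sketched as follows: filter and aggregate the high-cardinality event table inside a CTE, then join the small result to the dimension. The schema (`raw_events`, `users`) is an illustrative assumption, and SQLite stands in for a real warehouse:

```python
# Push the date filter and aggregation into the earliest CTE instead of
# filtering after the joins; on a columnar warehouse this shrinks both
# the scanned bytes and the join input.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (user_id INTEGER, event_date TEXT, event TEXT);
CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
INSERT INTO raw_events VALUES
  (1, '2026-01-10', 'purchase'),
  (1, '2025-12-01', 'purchase'),
  (2, '2026-01-15', 'view');
INSERT INTO users VALUES (1, 'US'), (2, 'DE');
""")

rows = conn.execute("""
WITH jan_events AS (
  -- filter first, aggregate to one row per user, select only needed columns
  SELECT user_id, COUNT(*) AS events
  FROM raw_events
  WHERE event_date >= '2026-01-01' AND event_date < '2026-02-01'
  GROUP BY user_id
)
SELECT u.country, SUM(e.events) AS events
FROM jan_events e
JOIN users u USING (user_id)
GROUP BY u.country
ORDER BY u.country
""").fetchall()

print(rows)  # [('DE', 1), ('US', 1)]
```

The half-open date range (`>= first day, < first day of next month`) is also a small senior signal: it avoids timestamp-boundary bugs that `BETWEEN` invites.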
Data modeling questions
"Design a warehouse model for an e-commerce business." Start with business processes: orders, payments, shipments, refunds, inventory, customers, sessions, marketing touches. Define grain. A fact_orders table might be one row per order; fact_order_items one row per item; fact_payments one row per payment transaction. Dimensions include customer, product, merchant, date, channel, and geography.
The interviewer is listening for grain discipline. Many warehouse problems come from mixing grains: order-level revenue joined to item-level rows, daily snapshots joined to event streams, or customer attributes that change over time. State the grain in every table.
"How do you handle slowly changing dimensions?" Explain Type 1 overwrite for attributes where history does not matter, Type 2 for history-preserving attributes, and snapshots for facts that change over time. For example, customer plan tier may need history because revenue analysis depends on what tier the customer had at the time of purchase. Use valid_from, valid_to, current_flag, and surrogate keys when appropriate.
"Star schema or wide tables?" The senior answer is "it depends." Star schemas are clean, governed, and reusable. Wide tables are easier for analysts and BI tools. Many modern teams use layered models: normalized or raw ingest, cleaned staging, conformed facts/dimensions, and curated marts or wide tables for common use cases. The key is ownership and documentation.
Pipeline design questions
"Design a pipeline that ingests product events into the warehouse." A strong answer covers event collection, schema validation, ingestion, storage, transformation, quality checks, and consumption. Mention event contracts, versioning, dead-letter queues, replay, partitioning, and monitoring.
A practical architecture:
- Product emits events with a versioned schema.
- Events land in a durable stream or object storage.
- Validation separates good records from malformed records.
- Raw data is stored immutably for replay.
- Transformations create cleaned event tables and business marts.
- Quality checks validate volume, freshness, uniqueness, and referential integrity.
- Dashboards and downstream jobs consume governed tables.
"How do you make a pipeline idempotent?" Idempotency means rerunning the job produces the same result. Use deterministic keys, merge/upsert semantics, partition replacement, checkpointing, and immutable raw inputs. Avoid append-only transforms without dedupe unless duplicates are acceptable. For batch jobs, partition-based rebuilds are often simpler and safer than row-by-row updates.
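Partition replacement can be sketched in a few lines: the job deletes and rewrites the partition it owns inside one transaction, so rerunning it is safe. The table and partition column (`stg_orders`, `ds`) are illustrative assumptions:

```python
# Partition-based idempotency: a rerun replaces the whole partition,
# so running the job twice yields the same table state as running it once.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (ds TEXT, order_id INTEGER, amount REAL)")

def load_partition(conn, ds, rows):
    """Idempotent load: replace the ds partition, never append to it."""
    with conn:  # one transaction: delete + insert commit together or not at all
        conn.execute("DELETE FROM stg_orders WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO stg_orders VALUES (?, ?, ?)",
            [(ds, oid, amt) for oid, amt in rows],
        )

# Simulate a retry: the same load runs twice.
load_partition(conn, "2026-01-10", [(1, 9.99), (2, 4.50)])
load_partition(conn, "2026-01-10", [(1, 9.99), (2, 4.50)])

count = conn.execute(
    "SELECT COUNT(*) FROM stg_orders WHERE ds = '2026-01-10'"
).fetchone()[0]
print(count)  # 2, not 4 — the rerun did not duplicate rows
```

In a real warehouse the same pattern appears as `INSERT OVERWRITE`, `MERGE`, or atomic partition swaps; the transactional delete-then-insert here is the simplest stand-in for that idea.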
"A pipeline failed for three days. How do you backfill?" First stop the bleeding: identify root cause, fix code or upstream data, and prevent new corrupt output. Then assess downstream impact. Backfill raw-to-cleaned tables in dependency order, ideally by partition. Validate row counts, checksums, sample records, and key metrics. Communicate to consumers. If dashboards were wrong, say so clearly. Senior data engineers are trusted because they handle bad news cleanly.
Warehouse and lakehouse design
Modern data stacks often combine object storage, cloud warehouses, transformation tools, orchestration, and BI. You should be ready to discuss Snowflake, BigQuery, Redshift, Databricks, Spark, dbt-style transformations, orchestrators such as Airflow, Dagster, or Prefect, streaming systems such as Kafka, Kinesis, or Google Pub/Sub, and table formats like Iceberg or Delta if relevant. You do not need every tool, but you need the concepts.
Key design questions:
- What data needs low latency versus daily batch?
- Which tables are raw, cleaned, conformed, and curated?
- How are schemas versioned and documented?
- How are PII, access controls, retention, and deletion handled?
- How are costs monitored by team, pipeline, or query pattern?
- How do analysts discover trusted metrics?
- What is the recovery plan when upstream systems change?
For warehouses, discuss partitioning by date or ingestion time, clustering by commonly filtered keys, materialized views for expensive aggregates, and workload isolation. In BigQuery, scanned bytes matter. In Snowflake, warehouse size and runtime matter. In Databricks, file size, partition count, and compaction matter. Senior candidates adapt the principle to the platform.
Streaming questions
Streaming interviews often test whether you understand tradeoffs, not whether you can recite Kafka internals.
"Batch or streaming?" Choose streaming when latency matters: fraud detection, inventory availability, notifications, operational alerts. Choose batch when daily correctness is enough and simplicity matters. A near-real-time pipeline can be much more expensive and harder to debug than an hourly batch job. Senior data engineers do not build streaming systems for dashboard vanity.
"How do you handle late events?" Use event time, watermarks, allowed lateness windows, and correction processes. For financial or compliance data, late events may require restating historical partitions. For analytics dashboards, you might show a metric as preliminary for 48 hours. Name the business tolerance.
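The "preliminary for 48 hours" idea can be sketched directly: bucket events by event time and flag a day as final only once the allowed-lateness window has passed. The 48-hour threshold and function names are illustrative assumptions:

```python
# Event-time bucketing with allowed lateness: a day's count stays
# "preliminary" until 48 hours after the day ends, because late events
# may still restate it.
from collections import Counter
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(hours=48)

def bucket_events(event_times, now):
    """Count events per event-time day, flagging days that may still change."""
    counts = Counter(t.date() for t in event_times)
    return {
        day: {
            "count": n,
            # Final only once the lateness window after end-of-day has closed.
            "final": now - datetime.combine(day, datetime.max.time()) > ALLOWED_LATENESS,
        }
        for day, n in counts.items()
    }

now = datetime(2026, 1, 10, 12, 0)
events = [datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 17), datetime(2026, 1, 9, 23)]
result = bucket_events(events, now)
print(result[datetime(2026, 1, 5).date()]["final"])  # True  — window closed
print(result[datetime(2026, 1, 9).date()]["final"])  # False — still preliminary
```

The same shape generalizes: a watermark is just a system-maintained version of this "how late can this bucket still change" boundary.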
"Exactly once?" Be careful. Exactly-once semantics are often limited to a specific system boundary. End-to-end exactly once across producers, streams, storage, and consumers is hard. Practical systems use idempotent producers, deterministic event IDs, transactional sinks where available, and deduplication. A senior answer avoids magical guarantees.
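The deterministic-event-ID piece of that answer can be sketched as follows. The ID scheme and field names are illustrative assumptions; in production the `seen` set would live in a transactional store alongside the sink, not in process memory:

```python
# Effectively-once via deduplication: with deterministic event IDs,
# a consumer drops redeliveries instead of relying on end-to-end
# exactly-once guarantees.
import hashlib

def event_id(user_id, action, ts):
    """Deterministic ID: the same logical event always hashes the same."""
    return hashlib.sha256(f"{user_id}|{action}|{ts}".encode()).hexdigest()

seen = set()        # stand-in for a durable, transactional dedupe store
processed = []

def consume(event):
    """Process each logical event at most once, even if redelivered."""
    eid = event_id(event["user_id"], event["action"], event["ts"])
    if eid in seen:  # duplicate delivery from an at-least-once stream: skip
        return
    seen.add(eid)
    processed.append(event)

msg = {"user_id": 1, "action": "purchase", "ts": "2026-01-05T09:00:00Z"}
consume(msg)
consume(msg)  # the broker retried; the consumer stays correct
print(len(processed))  # 1
```

This is the honest version of "exactly once": at-least-once delivery plus idempotent, deduplicating consumption.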
Design exercise answer pattern
For any architecture prompt, use the same sequence: consumers, source systems, freshness requirement, correctness requirement, data model, orchestration, validation, observability, access control, and cost. Then state the tradeoff. For example, an executive revenue dashboard may need daily audited accuracy, while a fraud operations dashboard may accept approximate real-time counts with later correction. That single distinction changes storage, SLA, ownership, and alerting. Senior candidates do not treat all pipelines as equally urgent; they match engineering complexity to the decision the data supports.
Data quality and observability
Data quality is now a first-class interview topic. Discuss tests for freshness, volume, uniqueness, null rates, accepted values, referential integrity, distribution drift, and reconciliation against source systems. For critical metrics, add alerting and ownership. A failed data test without an owner is just noise.
Good data observability includes:
- Pipeline runtime and failure alerts
- Data freshness SLAs
- Row count and volume anomaly detection
- Schema change detection
- Lineage from source to dashboard
- Cost monitoring
- Incident runbooks
- Consumer communication channels
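A few of the quality checks named above (freshness, uniqueness, null rate) can be sketched as simple assertions over a table. Thresholds, table names, and the check harness are illustrative assumptions; tools like dbt tests or Great Expectations package the same idea:

```python
# Minimal data quality checks over a fact table: freshness of the newest
# partition, uniqueness at the declared grain, and a null-rate budget.
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_orders (order_id INTEGER, customer_id INTEGER, ds TEXT);
INSERT INTO fact_orders VALUES
  (1, 10,   '2026-01-09'),
  (2, 11,   '2026-01-10'),
  (3, NULL, '2026-01-10');
""")

def run_checks(conn, as_of):
    failures = []
    # Freshness: newest partition should be no more than one day old.
    max_ds = conn.execute("SELECT MAX(ds) FROM fact_orders").fetchone()[0]
    if (as_of - date.fromisoformat(max_ds)).days > 1:
        failures.append("freshness")
    # Uniqueness: order_id is the declared grain, so no duplicates allowed.
    dupes = conn.execute("""
        SELECT COUNT(*) FROM (
          SELECT order_id FROM fact_orders GROUP BY order_id HAVING COUNT(*) > 1
        )
    """).fetchone()[0]
    if dupes:
        failures.append("uniqueness")
    # Null rate: customer_id may be at most 10% null.
    total, nulls = conn.execute(
        "SELECT COUNT(*), SUM(customer_id IS NULL) FROM fact_orders"
    ).fetchone()
    if nulls / total > 0.10:
        failures.append("null_rate")
    return failures

print(run_checks(conn, date(2026, 1, 10)))  # ['null_rate']
```

Each failure string would map to an alert with a named owner; as the section above puts it, a failed test without an owner is just noise.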
When asked about a broken dashboard, diagnose systematically: data source, ingestion, transformation, metric definition, BI filter, permissions, cache, and recent code or schema changes. Do not assume the dashboard is wrong; sometimes the business actually changed.
Behavioral questions
Prepare stories for:
- A pipeline incident and how you handled communication
- A data model you redesigned because the old one could not scale
- A time you pushed back on a stakeholder's metric definition
- A cost optimization project
- A migration from one warehouse or orchestration system to another
- A time you balanced speed and correctness
- Mentoring analysts or engineers on data usage
For stakeholder stories, be concrete. "Marketing wanted a real-time attribution dashboard. I showed that daily freshness met the decision cadence, saving an estimated $6K/month in streaming costs, and built an hourly alert only for campaign outages." That is better than saying "I prioritize business value."
Questions to ask the company
Ask:
- What data is considered mission-critical here?
- Who owns metric definitions and data contracts?
- What are the biggest reliability problems in the current stack?
- How much of the work is analytics engineering versus platform engineering?
- What latency requirements are real versus aspirational?
- How are data incidents communicated?
- What warehouse or pipeline costs are currently painful?
- What would a successful first six months look like?
The answers reveal whether you are joining a mature data platform, a dashboard factory, or a rescue mission.
Final prep checklist
Practice SQL with windows, CTEs, dedupe, cohorts, funnels, and reconciliation. Review dimensional modeling and slowly changing dimensions. Prepare to design an ingestion pipeline with idempotency, backfills, quality checks, and observability. Have one cost optimization story and one incident story ready.
The 2026 data engineering bar is practical reliability. Show that you can make data correct, understandable, recoverable, and affordable. If your answers keep returning to grain, ownership, validation, and business use, you will sound like the person companies trust with their data foundation.
Related guides
- Android Engineer Interview Questions in 2026 — Kotlin, Jetpack Compose, and Android System Design — Android interviews in 2026 test Kotlin, coroutines, Jetpack Compose, lifecycle, offline behavior, and release judgment. This guide gives the questions and answer patterns that show native Android production maturity.
- Data Analyst Interview Questions in 2026: SQL, Cases & Dashboards — A no-fluff guide to exactly what data analyst interviews test in 2026—SQL, business cases, and dashboard design—with real examples and salary context.
- Data Scientist Interview Questions in 2026: Stats, SQL & Cases — Crack DS interviews in 2026 with this no-fluff guide to statistics, SQL, and case study questions—plus concrete prep steps for the week ahead.
- Frontend Engineer Interview Questions — React, Browser Internals, and Frontend System Design — Frontend interviews in 2026 blend React fluency, JavaScript fundamentals, browser knowledge, accessibility, performance, and frontend system design. This guide explains the questions to expect and how to answer with senior-level judgment.
- iOS Engineer Interview Questions in 2026 — Swift, UIKit, SwiftUI, and Mobile System Design — iOS interviews in 2026 combine Swift depth, UIKit maintenance, SwiftUI judgment, concurrency, and mobile system design. This guide gives practical questions, strong-answer patterns, and prep steps for native app roles.
