Data Engineer

Role Summary

Hands-on engineer building reliable data pipelines, streaming systems, and the embedding and feature platforms that AI applications depend on. Comfortable across batch ETL, Kafka and Flink streaming, and the operational discipline that distinguishes a working pipeline from a production-grade one.

Ships pipelines that are observable, recoverable, and obviously correct rather than impressively complex. Treats data quality as a contract, not a downstream concern. Maintains strong opinions on idempotency, late-arriving data, and the failure modes that show up only in production.
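As a concrete illustration of the idempotency and late-arriving-data concerns above, here is a minimal sketch of an upsert that can be replayed safely. The `Event` fields and the `apply_events` helper are hypothetical names chosen for this example; they are not taken from any specific engagement or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str    # unique per source event; used only to break ties
    entity_id: str   # the row being updated in the target table
    event_time: int  # when the event happened at the source (epoch seconds)
    value: float


def apply_events(state: dict[str, Event], batch: list[Event]) -> dict[str, Event]:
    """Upsert a batch into state, keeping the newest event per entity.

    Replaying the same batch leaves state unchanged, and an event that
    arrives late cannot overwrite a newer value already in state.
    """
    for event in batch:
        current = state.get(event.entity_id)
        # Compare on (event_time, event_id) so the outcome does not depend
        # on the order in which events happen to arrive.
        if current is None or (event.event_time, event.event_id) > (
            current.event_time,
            current.event_id,
        ):
            state[event.entity_id] = event
    return state
```

Because the merge rule depends only on the events themselves, a retry after a partial failure or a full replay from the source produces the same final state.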

Skills

Python production engineering for data pipelines
SQL fluency including window functions, CTEs, and query optimization
Apache Spark (Scala and Python) at production scale
dbt for transformation and testing
Workflow orchestration (Airflow, Dagster, Prefect, cloud-native schedulers)
Apache Kafka and Confluent ecosystem
Apache Flink for low-latency, event-at-a-time streaming workloads
Spark Streaming and Structured Streaming for micro-batch use cases
Cloud-native streaming (Kinesis, Pub/Sub, Event Hubs)
CDC tooling (Debezium, AWS DMS, cloud-native CDC services)
Lakehouse table formats (Delta, Iceberg, Hudi)
Cloud data warehouses (Snowflake, BigQuery, Databricks, Redshift)
Feature stores (Feast, Tecton, Databricks Feature Store, custom)
Vector databases for embedding pipelines (Pinecone, Weaviate, pgvector)
Data-quality frameworks (Great Expectations, Soda)
Data observability tooling integration
Schema-registry usage and evolution discipline
Idempotency and replay patterns for streaming pipelines
Late-arriving and out-of-order data handling
Backfill and reprocessing pipeline design
Container packaging for data jobs
Infrastructure-as-code for data infrastructure

Capabilities & Focus Areas

Batch and streaming pipeline construction for analytical and ML workloads
Data-quality validation, lineage, and observability instrumentation
Feature stores supporting both batch and online inference
Embedding pipelines for retrieval-augmented generation use cases
Streaming infrastructure with appropriate exactly-once and ordering guarantees
Data-contract validation at ingestion boundaries
CDC pipeline construction from source systems to lakehouse
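Data-contract validation at an ingestion boundary can be sketched as follows. The `ORDER_CONTRACT` schema, its field names, and both helper functions are assumptions made up for illustration; a real engagement would more likely express the contract in a schema registry or a data-quality framework.

```python
from typing import Any

# Hypothetical contract for an "orders" feed: field name -> (type, required).
Contract = dict[str, tuple[type, bool]]

ORDER_CONTRACT: Contract = {
    "order_id": (str, True),
    "amount_cents": (int, True),
    "currency": (str, True),
    "coupon_code": (str, False),
}


def contract_violations(record: dict[str, Any], contract: Contract) -> list[str]:
    """Return a description of every way the record breaks the contract."""
    violations = []
    for field, (expected_type, required) in contract.items():
        value = record.get(field)
        if value is None:
            if required:
                violations.append(f"missing required field: {field}")
            continue
        if not isinstance(value, expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return violations


def split_batch(records: list[dict[str, Any]], contract: Contract):
    """Separate records that satisfy the contract from those to quarantine."""
    accepted, quarantined = [], []
    for record in records:
        violations = contract_violations(record, contract)
        if violations:
            quarantined.append((record, violations))
        else:
            accepted.append(record)
    return accepted, quarantined
```

Quarantining bad records together with the reason they failed keeps the problem visible at ingestion instead of leaving it for downstream consumers to discover.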

Typical Engagement Patterns

Four- to twelve-week pipeline build and remediation engagements
Embedded data engineering augmentation for client teams (three to twelve months)
Feature-store implementation and integration engagements
Streaming-platform consolidation and migration programs
Discrete embedding pipeline builds for new RAG use cases

Outcomes Delivered

Pipelines that survive late-arriving data and source-system schema changes
Feature stores with point-in-time correctness and zero training-serving skew
Streaming systems meeting exactly-once and ordering guarantees in production
Data-quality issues surfaced at ingestion, not by downstream consumers
Engineering teams that can ship the next pipeline without outside consulting support
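Point-in-time correctness, named in the feature-store outcome above, means a training row only sees feature values that were known at its label timestamp. A minimal sketch, assuming a per-entity feature history sorted by timestamp; `feature_as_of` is a hypothetical helper, not a feature-store API.

```python
from bisect import bisect_right
from typing import Optional


def feature_as_of(
    history: list[tuple[int, float]], label_time: int
) -> Optional[float]:
    """Return the latest feature value recorded at or before label_time.

    history holds (timestamp, value) pairs sorted by ascending timestamp.
    Values recorded after label_time are never returned, which is what
    prevents future information from leaking into training data.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, label_time)
    return history[idx - 1][1] if idx > 0 else None
```

Applying the same as-of rule when serving online is one way to keep training and serving features consistent.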

Need this role for an engagement?

Brief us on the scope and timeline and we'll match a senior practitioner.
