Role Summary

Hands-on engineer building reliable data pipelines, streaming systems, and the embedding and feature platforms that AI applications depend on. Comfortable across batch ETL, Kafka and Flink streaming, and the operational discipline that distinguishes a working pipeline from a production-grade one.

Ships pipelines that are observable, recoverable, and obviously correct rather than impressively complex. Treats data quality as a contract, not a downstream concern. Maintains strong opinions on idempotency, late-arriving data, and the failure modes that show up only in production.

Skills

  • Python production engineering for data pipelines
  • SQL fluency including window functions, CTEs, and query optimization
  • Apache Spark (Scala and Python) at production scale
  • dbt for transformation and testing
  • Workflow orchestration (Airflow, Dagster, Prefect, cloud-native schedulers)
  • Apache Kafka and Confluent ecosystem
  • Apache Flink for true streaming workloads
  • Spark Streaming and Structured Streaming for micro-batch use cases
  • Cloud-native streaming (Kinesis, Pub/Sub, Event Hubs)
  • CDC tooling (Debezium, AWS DMS, native cloud CDC)
  • Lakehouse table formats (Delta, Iceberg, Hudi)
  • Cloud data warehouses (Snowflake, BigQuery, Databricks, Redshift)
  • Feature stores (Feast, Tecton, Databricks Feature Store, custom)
  • Vector databases for embedding pipelines (Pinecone, Weaviate, pgvector)
  • Data-quality frameworks (Great Expectations, Soda)
  • Data observability tooling integration
  • Schema-registry usage and evolution discipline
  • Idempotency and replay patterns for streaming pipelines
  • Late-arriving and out-of-order data handling
  • Backfill and reprocessing pipeline design
  • Container packaging for data jobs
  • Infrastructure-as-code for data infrastructure

Capabilities & Focus Areas

  • Batch and streaming pipeline construction for analytical and ML workloads
  • Data-quality validation, lineage, and observability instrumentation
  • Feature stores supporting both batch and online inference
  • Embedding pipelines for retrieval-augmented generation use cases
  • Streaming infrastructure with appropriate exactly-once and ordering guarantees
  • Data-contract validation at ingestion boundaries
  • CDC pipeline construction from source systems to lakehouse

Typical Engagement Patterns

  • Four- to twelve-week pipeline build and remediation engagements
  • Embedded data engineering augmentation for client teams (three to twelve months)
  • Feature-store implementation and integration engagements
  • Streaming-platform consolidation and migration programs
  • Discrete embedding pipeline builds for new RAG use cases

Outcomes Delivered

  • Pipelines that survive late-arriving data and source-system schema changes
  • Feature stores with point-in-time correctness and zero training-serving skew
  • Streaming systems meeting exactly-once and ordering guarantees in production
  • Data-quality issues surfaced at ingestion, not by downstream consumers
  • Engineering teams that can ship the next pipeline without outside consulting support

Need this role for an engagement?

Brief us on the scope and timeline, and we'll match you with a senior practitioner.
