Python Developer
Developer / Senior Developer – PySpark & Python Data Engineering

We are seeking a skilled PySpark and Python Data Engineer to design, build, and optimize large-scale data pipelines for batch and streaming workloads. This role supports enterprise data engineering initiatives within a multinational FMCG environment and requires strong expertise in modern lakehouse platforms, cloud-native data processing, and scalable integration patterns.

Primary Skill: PySpark and Python Development
Location: Cincinnati, OH
Function: Development Services – Data Engineering, Analytics, Digital & Emerging Tech
Industry: FMCG
Cloud Strategy: Hyperscaler-first approach across Azure, GCP, and AWS using Databricks and Delta Lake
Level: Developer / Senior Developer

Key Responsibilities

PySpark Development
- Design and develop production-grade PySpark applications for large-scale batch and streaming data processing.
- Implement advanced PySpark DataFrame operations, including complex transformations, window functions, pivot/unpivot logic, and nested structure handling.
- Build efficient multi-dataset joins using broadcast joins, sort-merge joins, and skew-handling strategies.
- Develop custom UDFs and Pandas UDFs for performance-critical transformations.
- Optimise aggregations and group-by operations for large FMCG datasets.
- Implement Structured Streaming pipelines using sources such as Kafka, Azure Event Hubs, and GCP Pub/Sub.
- Apply watermarking, windowing, and stateful streaming strategies, including mapGroupsWithState, to handle late-arriving data and real-time processing needs.
- Ensure appropriate delivery semantics, including exactly-once and at-least-once processing where required.
- Apply advanced Spark performance tuning techniques, including partition optimisation, skew mitigation, broadcast management, AQE tuning, and executor sizing.
- Develop and maintain reusable PySpark libraries to support shared data processing capabilities across the platform.

Python Engineering
- Build Python-based data services, automation scripts, and utility frameworks that support the enterprise data platform.
- Develop REST API integrations using Python libraries such as requests and httpx to consume SAP OData, Salesforce, and third-party FMCG APIs.
- Implement data validation and reconciliation frameworks using tools such as Great Expectations and Pandera.
- Develop orchestration scripts and helper utilities for Airflow DAGs and Databricks Workflows.
- Apply sound software engineering practices, including unit testing with pytest, integration testing with Testcontainers, type hints, modular design, and strong dependency management.
- Implement Python-based data quality checks focused on completeness, consistency, and conformity.

Data Lakehouse and Cloud Platform Engineering
- Build and manage lakehouse architectures on hyperscaler platforms using Azure Databricks, GCP Dataproc, and AWS EMR.
- Work with ACID-compliant data lake technologies such as Delta Lake, Apache Iceberg, and Apache Hudi.
- Implement medallion architecture patterns (Bronze, Silver, Gold) to support progressive data refinement.
- Leverage Delta Lake capabilities including ACID transactions, schema enforcement, Time Travel, Delta Live Tables, OPTIMIZE and Z-Order, and Change Data Feed.
- Manage Databricks Workflows and job clusters for production pipeline execution.
- Implement Databricks Auto Loader for scalable, incremental ingestion from cloud storage.
- Use Unity Catalog to support governance, lineage, and access control across the data estate.

Data Ingestion and Integration
- Build ingestion pipelines from diverse FMCG data sources, including SAP S/4HANA, Salesforce, operational databases, streaming platforms, and file-based feeds.
- Support integrations across SAP OData APIs, BAPI extracts, IDoc-based feeds, Salesforce REST and Bulk APIs, Oracle, Azure SQL, Cloud Spanner, Kafka, Event Hubs, Pub/Sub, SFTP, Azure Blob, GCS, S3,
and common file formats such as CSV, Parquet, Avro, and JSON.
- Implement Change Data Capture (CDC) patterns for near real-time database synchronisation.
- Design schema evolution strategies that accommodate upstream source changes with minimal disruption.
- Publish processed data to downstream systems such as BigQuery, Azure Synapse, Snowflake, feature stores, Power BI, and Looker.

SQL and Data Modelling
- Write and optimise complex SQL queries for extraction, transformation, validation, and reconciliation.
- Design star and snowflake schemas to support FMCG analytics domains and reporting needs.
- Use Spark SQL for large-scale analytical processing.
- Develop SQL-based data quality checks and reconciliation frameworks.
- Improve query performance through execution plan analysis, partition pruning, and predicate pushdown.

Skills
Mandatory Skill: Python