Data Engineer
Location: Phoenix, AZ

Data Engineer with strong PySpark experience to work on large-scale data processing and analytics initiatives. The ideal candidate will have hands-on experience working with large datasets, complex joins, and performance optimization, along with the ability to apply basic analytical thinking and deliver clear, stakeholder-ready outputs.

Key Responsibilities

Data Engineering & Development
- Design, develop, and maintain scalable data pipelines using PySpark.
- Write efficient and optimized PySpark code to process and transform large-scale datasets.
- Handle joins across multiple large databases, ensuring performance, accuracy, and scalability.
- Optimize Spark jobs to minimize runtime, memory usage, and compute cost.
- Work with structured and semi-structured data from multiple sources.

Data Preparation & Analysis Support
- Build and curate training and analytical datasets by joining and transforming multiple data sources.
- Apply basic analytical skills to understand data patterns, anomalies, and business relevance.
- Perform data validation and quality checks, including:
  - Record counts and reconciliation
  - Duplicate detection
  - Null and outlier checks
  - Schema and data-type validation
- Ensure datasets are analysis-ready and trustworthy.

Stakeholder Interaction & Reporting
- Understand business objectives and translate them into data requirements.
- Ask the right questions to determine:
  - Level of aggregation required
  - Metric definitions
  - Data freshness and accuracy expectations
  - Preferred output and reporting formats
- Present results and insights clearly to stakeholders.
- Create reports and summaries using Excel for business users and leadership.

Expected Technical Approach (Problem-Solving Mindset)
Candidates are expected to demonstrate the ability to:
- Approach complex data projects methodically, starting with:
  - Understanding business objectives
  - Reviewing source data structure and volume
  - Designing efficient join strategies
- Choose the right join types, partitioning strategies, and caching techniques.
- Validate data at every stage of the pipeline.
- Balance technical accuracy with business usability when presenting results.

Core Skill Sets (Must-Have)
- Strong hands-on experience with PySpark
- Extensive experience working with large datasets
- Proven expertise in joining large databases efficiently
- Ability to write high-performance, optimized code
- Basic analytical skills to interpret and validate data
- Reporting skills using Excel

Good to Have Skills
- Experience in model development or supporting analytics/modeling teams
- SAS experience
- Exposure to Cloudera or similar big data platforms
- Understanding of data warehousing and analytics workflows

Soft Skills & Competencies
- Strong problem-solving and logical thinking