
Senior Machine Learning Engineer – GPU Optimization & CUDA Systems

A market-leading high-frequency trading firm is seeking Senior Machine Learning Engineers to join a specialist performance engineering team focused on low-level optimization for large-scale AI workloads.

This role is heavily focused on GPU performance, CUDA kernel optimization, and systems-level acceleration work later in the ML pipeline. The team works on extracting maximum performance from modern hardware architectures to support highly demanding training and inference workloads.

You will work close to the metal, optimizing critical components across CUDA, C++, memory management, and GPU execution paths. The work combines deep systems engineering with cutting-edge machine learning infrastructure.

Key responsibilities:

- Develop and optimize CUDA kernels for high-performance ML workloads
- Improve GPU utilization, memory efficiency, and execution performance
- Profile and optimize bottlenecks across training and inference pipelines
- Work on compiler/runtime-level optimizations and kernel fusion strategies
- Collaborate with ML systems and infrastructure teams on end-to-end acceleration
- Build highly optimized C++ components for latency and throughput-sensitive systems

Requirements:

- Strong C++ and CUDA development experience
- Deep understanding of GPU architecture and performance optimization
- Experience profiling and debugging GPU workloads using tools such as Nsight
- Knowledge of PyTorch internals, Triton, NCCL, CUTLASS, or similar frameworks
- Strong systems programming background with focus on performance engineering
- Experience working on high-throughput or low-latency distributed systems
- Computer Science, Mathematics, Physics, Engineering, or related technical degree preferred

This is an opportunity to work on some of the most technically challenging AI infrastructure problems in the industry, within an environment that values engineering excellence, autonomy, and performance.