About the Company:
Our client is a leader in cloud computing for the AI economy, providing solutions that let businesses tackle real-world challenges without costly infrastructure or large in-house AI/ML teams.
About the Role:
Our client is building a cutting-edge inference platform for AI models, enabling efficient, large-scale deployment of text, vision, audio, and multimodal architectures. This role involves working with one of the world’s largest GPU clouds, running tens of thousands of GPUs.
Responsibilities:
Develop and optimize low-level kernels and runtime components for AI inference
Improve the performance of inference engines on GPU platforms
Profile and debug system and hardware-level performance issues
Integrate support for new hardware architectures (Hopper, Blackwell, Rubin)
Collaborate with ML and backend teams to optimize end-to-end execution
Required Qualifications:
Strong proficiency in C++ or GPU programming, with a focus on high-performance code and memory management
Experience in GPU programming or systems-level software development (e.g., OS internals, kernel modules, or device drivers)
Hands-on experience with profiling and debugging tools for both CPU and GPU performance optimization
Solid understanding of CPU/GPU architecture and memory hierarchy
Preferred Qualifications:
Experience with GPU programming frameworks and libraries (CUDA, ROCm, CUTLASS, CuTe, ThunderKittens, Triton, Pallas, Mosaic GPU)
Familiarity with ML inference runtimes (e.g., TensorRT, TVM)
Knowledge of Linux internals, drivers, or compiler toolchains
Experience with tools like perf, VTune, Nsight, or ROCm profiler
Familiarity with popular inference engines (e.g., vLLM, SGLang, TGI)