Senior GPU Cluster Software Engineer

2 months ago


Shanghai, China NVIDIA Full time

As a member of the System Software team, you'll be responsible for building profiling solutions for large-scale, real-world applications running on GPU compute clusters, making them work efficiently and improving the user experience for customers as well as for the engineers supporting the cluster. Much of our software development focuses on profiling a varied set of applications running on different GPU clusters, and on accurately measuring and displaying the user experience on these clusters with actionable inputs for customers and the engineers supporting the cluster. Creating a fault-tolerant distributed system while minimizing data loss and limiting time spent on reactive operational work is key to product quality and dynamic day-to-day work. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:

  • Work in an agile and fast-paced global environment to gather requirements, architect, design, implement, test, deploy, release, and support large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities with promised uptime

  • Build internal profiling tools for real-world ML/DL applications running on HPC GPU clusters for failure and efficiency analysis to help improve current and future generations of GPU clusters and associated hardware

  • Understand state-of-the-art improvements in the ML/DL domain, and work with various application owners and research teams to add or improve profiling capabilities for current and potential future supported features

What we need to see:

  • BS or higher in Computer Science or a related field (or equivalent experience) and 5+ years of software development experience (in Python)

  • Experience with GitLab (or another source code management system), including branch/release workflows, CI/CD pipelines, etc.

  • Solid understanding of algorithms, data structures, and runtime/space complexity

  • Experience working with distributed system software architecture

  • Basic understanding of HPC GPU clusters and Slurm

  • Basic understanding of machine learning concepts and terminology

  • Background with databases, both SQL and NoSQL (Prometheus, Elasticsearch, OpenSearch, Redis, etc.)

  • Experience with distributed data pipelines, telemetry, visualization (Kibana, Grafana, etc.), and alerting (PagerDuty, etc.)

Ways to stand out from the crowd:

  • Experience debugging functional and performance issues in HPC GPU clusters

  • Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster

  • Knowledge of LLM training features and libraries: checkpointing, parallelism, PyTorch, Megatron-LM, NCCL

  • Experience with HPC schedulers such as Slurm

  • Background with OpenTelemetry


We have other current jobs related to this field that you can find below


  • Shanghai, Shanghai, China Bosch Full time

    Job Title: GPU Cluster DevOps Engineer About the Company: Join an international DevOps team at a leading tech company specializing in AI Deep Learning Platforms. As a GPU Cluster DevOps Engineer, you will play a key role in the operation and development of cutting-edge technology. Job Description: Work in an international DevOps team responsible for GPU...


  • Shanghai, Shanghai, China Bosch Group Full time

    Job Description: Working in an international DevOps team, you will be responsible for the operation and development of the GPU cluster for the AI Deep Learning Platform. Development of additional features for the service, such as rollout of new software and implementation of new cluster interfaces (e.g. RESTful API, load balancing). Implementation of performance monitoring...


  • Shanghai, China Bosch Full time

    Job Description: Working in an international DevOps team, you will be responsible for the operation and development of the GPU cluster for the AI Deep Learning Platform. Development of additional features for the service, such as rollout of new software and implementation of new cluster interfaces (RESTful API, load balancing). Implementation of performance...


  • Shanghai, China Bosch Group Full time

    Job Description: Working in an international DevOps team, you will be responsible for the operation and development of the GPU cluster for the AI Deep Learning Platform. Development of additional features for the service, such as rollout of new software and implementation of new cluster interfaces (e.g. RESTful API, load balancing). Implementation of performance monitoring...


  • Shanghai, Shanghai, China NVIDIA Full time

    NVIDIA has a rich history of revolutionizing computer graphics, PC gaming, and accelerated computing over the past three decades. This legacy of innovation thrives on cutting-edge technology and an exceptional team of professionals. Embarking on uncharted territory requires vision, creativity, and top-tier talent. As part of the NVIDIA family, you'll immerse...


  • Shanghai, China NVIDIA Full time

    A key part of NVIDIA's strength is our sophisticated development tools and modelling environments that enable our incredible pace of delivering new technology to market. We are looking for forward-thinking, hard-working, and creative people to join a multifaceted software team with high production-quality standards. This software engineering role involves...


  • Shanghai, China Optiver Full time

    WHO WE ARE: Optiver is a global market maker with offices in Amsterdam, London, Chicago, Austin, Sydney, Shanghai, Hong Kong, Singapore and Taipei. Founded in 1986, today we are a leading liquidity provider, with close to 2,000 employees in offices around the world, united in our commitment to improve the market through competitive pricing, execution and...


  • Shanghai, Shanghai, China Optiver Full time

    WHO WE ARE: Optiver is a global market maker with offices in Amsterdam, London, Chicago, Austin, Sydney, Shanghai, Hong Kong, Singapore and Taipei. Founded in 1986, today we are a leading liquidity provider, with close to 2,000 employees in offices around the world, united in our commitment to improve the market through competitive pricing, execution and...


  • Shanghai, China Amazon Innovation Center (Shenzhen) Company Limited Shanghai Branch - O93 Full time

    This job is mainly responsible for the architecture design and implementation of graphics software on embedded systems, including GPU middleware, drivers, and virtualization. The goal is to build an advanced, high-performance graphics software system that can be easily adapted to various hardware platforms. The responsibilities of a graphics software...


  • Shanghai, Shanghai, China NVIDIA Full time

    We are looking for a Senior Software Test Development Engineer in NVIDIA's Deep Learning SWQA team. The position is in the NVIDIA Deep Learning Software Quality Assurance team, which defines, develops, and performs tests to validate robustness and measure the performance of NVIDIA's Deep Learning software and GPU Infrastructure for autonomous driving, healthcare,...


  • Shanghai, China NVIDIA Full time

    NVIDIA is hiring a Senior Software Engineer to help develop its world-class AI Infrastructure and leading-edge software on NVIDIA’s high-performance DRIVE platform for Autonomous Vehicles. We aim to build a highly efficient end-to-end data pipeline for ground truth production, to satisfy the needs of various AV teams and achieve the high quality and...


  • Shanghai, China NVIDIA Full time

    A key part of NVIDIA's strength is our sophisticated analysis tools that empower NVIDIA engineers to improve perf and power efficiency of our products and the running applications. We are looking for forward-thinking, hard-working, and creative people to join a multifaceted software team with high standards! This software engineering role involves developing...


  • Shanghai, China NVIDIA Full time

    We are now looking for a Power Methodology and Analysis engineer. NVIDIA prides itself on having energy-efficient products. We believe that continuing to maintain our products' energy efficiency compared to the competition is key to our continued success. Our team is responsible for researching, developing, and deploying methodologies to help NVIDIA's products...


  • Shanghai, China NVIDIA Full time

    We are looking for a Senior Software Test Development Engineer in NVIDIA’s Deep Learning SWQA team. The position is in the NVIDIA Deep Learning Software Quality Assurance team, which defines, develops, and performs tests to validate robustness and measure the performance of NVIDIA’s Deep Learning software and GPU Infrastructure for autonomous driving,...

  • Senior HPC Engineer

    2 weeks ago


    Shanghai, Shanghai, China NVIDIA Full time

    NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's fueled by great technology—and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots,...


  • Shanghai, Shanghai, China Amazon Innovation Center (Shenzhen) Company Limited Shanghai Branch - O93 Full time

    This job is mainly responsible for the architecture design and implementation of graphics software on embedded systems, including GPU middleware, drivers, and virtualization. The goal is to build an advanced, high-performance graphics software system that can be easily adapted to various hardware platforms. The responsibilities of a graphics software...


  • Shanghai, China NVIDIA Full time

    NVIDIA is hiring a Senior Computer Vision Software Engineer to help develop its world-class AI Infrastructure and leading-edge software on NVIDIA’s high-performance DRIVE platform for Autonomous Vehicles. We aim to build a highly efficient end-to-end data pipeline for ground truth production, to satisfy the needs of various AV teams and achieve the high...


  • Shanghai, Shanghai, China NVIDIA Full time

    We are currently seeking a CPU computing engineer located in Shanghai. NVIDIA's development of the GPU in the late 90s was the catalyst for the expansion of the PC gaming industry, reshaped modern computer graphics, and transformed parallel computing. More recently, GPU deep learning has ushered in a new era of AI, with the GPU serving as the central hub for...