Senior GPU Cluster Software Engineer

20 hours ago


Shanghai, Shanghai, China NVIDIA Full time

We are seeking a highly skilled Senior GPU Cluster Software Engineer to join our System Software team at NVIDIA. As a member of this team, you will be responsible for designing, developing, and deploying large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities.

Key Responsibilities:
  • Design and implement large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities.
  • Develop internal profiling tools for real-world ML/DL applications running on HPC GPU clusters for failure and efficiency analysis.
  • Collaborate with application owners and research teams to identify and address profiling needs for currently supported and potential future features.
Requirements:
  • BS+ in Computer Science or related field (or equivalent experience) and 5+ years of software development experience in Python.
  • Experience with GitLab (or another source code management system), including branching and release workflows and CI/CD pipelines.
  • Solid understanding of algorithms, data structures, and runtime/space complexity.
  • Experience working with distributed system software architecture.
  • Basic understanding of HPC GPU clusters and Slurm.
  • Basic understanding of machine learning concepts and terminology.
  • Background with SQL and NoSQL databases and data stores (Prometheus, Elasticsearch, OpenSearch, Redis, etc.).
  • Experience with distributed data pipelines, telemetry, visualization (Kibana, Grafana, etc.), and alerting (PagerDuty, etc.).
Preferred Qualifications:
  • Experience debugging functional and performance issues in HPC GPU clusters.
  • Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster.
  • Knowledge of LLM training features and libraries, such as checkpointing, parallelism, PyTorch, Megatron-LM, and NCCL.
  • Experience with HPC schedulers such as Slurm.
  • Background with OpenTelemetry.
