Senior GPU Cluster Software Engineer
7 months ago
As a member of the System Software team, you'll be responsible for building profiling solutions for large-scale, real-world applications running on GPU compute clusters, to make them run efficiently and to improve the user experience for customers as well as the engineers supporting the cluster. Much of our software development focuses on profiling a varied set of applications running on different GPU clusters, and on accurately measuring and displaying the user experience on these clusters with actionable inputs for customers and the engineers supporting the cluster. Creating a fault-tolerant distributed system while minimizing data loss and limiting time spent on reactive operational work is key to product quality and dynamic day-to-day work. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.
What you'll be doing:
Work in an agile and fast-paced global environment to gather requirements, architect, design, implement, test, deploy, release, and support large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities with promised uptime
Build internal profiling tools for real-world ML/DL applications running on HPC GPU clusters for failure and efficiency analysis, to help improve current and future generations of GPU clusters and associated hardware
Understand state-of-the-art improvements in the ML/DL domain, and work with application owners and research teams to add or improve profiling for current and potential future supported features
What we need to see:
BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development (in Python)
Experience with GitLab (or another source code management system), including branch/release workflows, CI/CD pipelines, etc.
Solid understanding of algorithms, data structures, and runtime/space complexity
Experience working with distributed system software architecture
Basic understanding of HPC GPU clusters and Slurm
Basic understanding of machine learning concepts and terminology
Background with databases - SQL and NoSQL (Prometheus, Elasticsearch, OpenSearch, Redis, etc.)
Experience with distributed data pipelines, telemetry, visualization (Kibana, Grafana, etc.), and alerting (PagerDuty, etc.)
Ways to stand out from the crowd:
Experience debugging functional and performance issues in HPC GPU clusters
Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster
Knowledge of LLM training features and libraries - checkpointing, parallelism, PyTorch, Megatron-LM, NCCL
Experience with HPC schedulers such as Slurm
Background with OpenTelemetry
-
Senior GPU Cluster Developer
4 weeks ago
Shanghai, Shanghai, China NVIDIA Full time
We are seeking a highly skilled Senior GPU Cluster Software Engineer to join our team at NVIDIA. This is a unique opportunity to work on large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities. About the Role: As a key member of our System Software team, you will be responsible for building profiling...
-
GPU Cluster Software Engineer
1 month ago
Shanghai, Shanghai, China NVIDIA Full time
As a member of the System Software team at NVIDIA, you will be responsible for building and optimizing large-scale distributed systems infrastructure with monitoring, logging, visualization, and alerting capabilities. Your focus will be on creating profiling solutions for real-world applications running on GPU compute clusters to improve efficiency and user...
-
Automotive Cluster Software Expert
3 weeks ago
Shanghai, Shanghai, China Bosch Group Full time
We are seeking a highly skilled Automotive Cluster Software Expert to join our team at Bosch Group. As a key member of our software development team, you will play a crucial role in designing and developing cutting-edge automotive instrument clusters. Job Summary: This is an exciting opportunity for an experienced software engineer to lead the development of...
-
Shanghai, Shanghai, China Optiver Full time
About the Role: Optiver is a global market maker with a presence in multiple continents, and our Shanghai office is a rapidly growing participant in the Chinese markets. We are seeking a highly skilled Senior Machine Learning Platform Engineer to join our team and help shape the future of our company. Key Responsibilities: Design and develop the infrastructure...
-
Senior Machine Learning Platform Engineer
6 months ago
Shanghai, China Optiver Full time
WHO WE ARE: Optiver is a global market maker with offices in Amsterdam, London, Chicago, Austin, Sydney, Shanghai, Hong Kong, Singapore and Taipei. Founded in 1986, today we are a leading liquidity provider, with close to 2,000 employees in offices around the world, united in our commitment to improve the market through competitive pricing, execution and...
-
Shanghai, Shanghai, China Bosch Full time
Job Overview: We are seeking a highly skilled Software Development Engineer to join our team at Bosch, focusing on the development of automotive instrument clusters. This role offers an exciting opportunity to work on cutting-edge technologies and collaborate with cross-functional teams. Salary: The estimated annual salary for this position is $120,000 -...
-
Shanghai, Shanghai, China NVIDIA Full time
We are seeking an exceptional Software Engineer to join our team at NVIDIA, working on our GPU-accelerated library of primitives for deep neural networks. The ideal candidate will have a strong background in software development, particularly in C/C++ and CUDA development, and experience with linear algebra, machine learning, and computer...
-
Cluster Software Development Engineer_XC-CP
3 weeks ago
Shanghai, China Bosch Full time
Job Description. Responsibilities: Software Design and Development: Develop core software modules for automotive instrument clusters, including HMI frameworks, graphics rendering, and functional features. Ensure software integration with IDC systems, such as infotainment, ADAS displays, and user interaction modules. Implement HMI solutions...
-
Cluster Software Development Engineer_XC-CP
3 weeks ago
Shanghai, China Bosch Group Full time
Job Description. Responsibilities: Software Design and Development: Develop core software modules for automotive instrument clusters, including HMI frameworks, graphics rendering, and functional features. Ensure software integration with IDC systems, such as infotainment, ADAS displays, and user interaction modules. Implement HMI solutions using Kanzi, Unity...
-
Senior Software Development Engineer
6 months ago
Shanghai, China Amazon Innovation Center (Shenzhen) Company Limited Shanghai Branch - O93 Full time
This job is mainly responsible for the architecture design and implementation of graphics software on embedded systems, including GPU middleware, drivers, and virtualization. The goal is to build an advanced, high-performance graphics software system that can be easily adapted to various hardware platforms. The responsibilities of a graphics software...
-
Senior Software Quality Assurance Engineer
1 month ago
Shanghai, Shanghai, China NVIDIA Full time
We are seeking a Senior Software Quality Assurance Engineer to join NVIDIA's Deep Learning SWQA team. This role is part of NVIDIA's Deep Learning Software Quality Assurance team, responsible for defining, developing, and performing tests to validate the robustness and performance of NVIDIA's Deep Learning software and GPU infrastructure for various AI...
-
Senior Graphics Architecture Engineer
1 month ago
Shanghai, Shanghai, China Amazon Innovation Center (Shenzhen) Company Limited Shanghai Branch - O93 Full time
Job Title: Senior Graphics Architecture Engineer. About the Role: We are seeking an experienced Senior Graphics Architecture Engineer to join our team at Amazon Innovation Center (Shenzhen) Company Limited Shanghai Branch - O93. As a key member of our graphics software development team, you will be responsible for designing and implementing high-performance...
-
Senior HPC Systems Engineer
2 months ago
Shanghai, Shanghai, China NVIDIA Full time
Unlock the Power of AI and HPC with NVIDIA. NVIDIA is revolutionizing the world of computer graphics, gaming, and accelerated computing. As a Senior HPC Engineer, you'll be part of a dynamic team that's pushing the boundaries of what's possible with AI and HPC. Our team is dedicated to delivering cutting-edge solutions that empower customers to achieve their...
-
Senior System Profiling Software Engineer
2 months ago
Shanghai, Shanghai, China NVIDIA Full time
A key part of NVIDIA's strength lies in our sophisticated analysis tools that empower engineers to improve performance and power efficiency of our products and running applications. We are seeking forward-thinking, hard-working, and creative individuals to join a multifaceted software team with high standards. This software engineering role involves...
-
GPU Graphics Architecture Engineer
4 weeks ago
Shanghai, Shanghai, China NVIDIA Full time
About NVIDIA: NVIDIA is a leader in the technology industry, renowned for its innovative and cutting-edge products. Job Overview: We are seeking a skilled GPU Graphics Performance Architect to join our team. The successful candidate will be responsible for investigating and studying state-of-the-art real-time rendering techniques and their implementation on GPU,...
-
Senior System Profiling Software Engineer
7 months ago
Shanghai, China NVIDIA Full time
A key part of NVIDIA's strength is our sophisticated analysis tools that empower NVIDIA engineers to improve the performance and power efficiency of our products and the applications running on them. We are looking for forward-thinking, hard-working, and creative people to join a multifaceted software team with high standards! This software engineering role involves developing...
-
Chief Data Platform Specialist
4 weeks ago
Shanghai, Shanghai, China Optiver Full time
Company Overview: Optiver is a global market maker with offices around the world, united in its commitment to improving the market through competitive pricing, execution, and risk management. By providing liquidity on multiple exchanges across the globe, Optiver participates in safeguarding healthy and efficient markets. Salary: The estimated salary for this...
-
Senior GPU Power Analysis Engineer
6 months ago
Shanghai, China NVIDIA Full time
We are now looking for a Power Methodology and Analysis engineer. NVIDIA prides itself on having energy-efficient products. We believe that continuing to maintain our products' energy efficiency compared to the competition is key to our continued success. Our team is responsible for researching, developing, and deploying methodologies to help NVIDIA's products...
-
Senior System Profiling Software Engineer
1 month ago
Shanghai, Shanghai, China NVIDIA Full time
A key part of NVIDIA's strength is our sophisticated analysis tools that empower engineers to improve product efficiency and application performance. We are seeking forward-thinking individuals to join our software team, which sets high standards. This role involves developing analysis tools for various OS and hardware combinations, from single systems to...
-
GPU Computing Engineer
7 months ago
Shanghai, China NVIDIA Full time
We are now looking for a GPU computing engineer based in Shanghai. NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers,...