Site Reliability Engineer
7 days ago
BentoML is a leading inference platform provider that helps AI teams run large language models and other generative AI workloads at scale. With support from investors such as DCM, enterprises around the world rely on us for consistent scalability and performance in production. Our portfolio includes both open source and commercial products, and our goal is to help each team build its own competitive advantage through AI.
Role
Join BentoML as a Senior Site Reliability Engineer and take charge of the infrastructure that delivers large language model and generative AI services worldwide. You will architect and operate Kubernetes clusters across AWS, Google Cloud, and on-premises environments, turning vast GPU fleets into responsive inference pools. Your work will span writing clean Terraform code, refining GitOps pipelines, tuning Prometheus, and leading incident response. You will set service level objectives that matter, guide teammates through complex production challenges, and build processes that keep our platform robust and fast. If you thrive on solving difficult problems at scale and want your decisions to shape how enterprises run AI in production, this role is for you.
Responsibilities
Kubernetes operations – design, run, and improve large multi-cluster Kubernetes environments on AWS and Google Cloud, plus on-prem clusters; add support for Azure or Oracle Cloud when needed.
Infrastructure as code – manage everything with Terraform or Pulumi and follow GitOps workflows.
CI/CD – keep automated build and release pipelines reliable, with safe rollback paths.
GPU fleet management – run NVIDIA drivers, MIG partitioning, autoscaling, and firmware updates; extend the same practices to AMD GPUs when they appear.
Observability – operate and scale Prometheus and Grafana, define SLIs/SLOs, and automate capacity tracking.
Incident response – share an on-call rotation, lead post-incident reviews, and keep runbooks current.
Mentorship and process building – establish standard SRE processes and teach best practices to the wider engineering team.
Requirements
Expert knowledge of Kubernetes internals and large-cluster administration, both cloud and on-prem.
Hands-on experience with AWS and Google Cloud; familiarity with Azure or Oracle Cloud is a plus.
Strong skills with Terraform or Pulumi, GitOps tools (Argo CD, Flux, or similar), and CI/CD pipelines.
Deep understanding of Linux and networking fundamentals.
Experience managing NVIDIA GPU clusters; AMD/ROCm knowledge is a bonus.
Familiarity with specialized GPU clouds such as Lambda or Nebius is helpful.
Solid background with Prometheus and Grafana at scale.
Clear written and spoken communication and comfort working across time zones.
Why join us
Remote work – work from where you are most productive and collaborate with teammates in North America and Asia.
Technical scope – operate distributed LLM inference and large GPU clusters worldwide.
Customer reach – support organizations around the globe that rely on BentoML.
Influence – lead SRE practices and infrastructure choices.
Compensation – competitive salary, equity, learning budget, and paid conference travel.
-
High Voltage Site Service Engineer
1 week ago
Beijing, Beijing, China Hitachi Full time CN¥100,000 - CN¥150,000 per year
Location: Beijing, China
Job ID: R0114401
Date Posted:
Company Name: HITACHI ENERGY (CHINA) LTD.
Profession (Job Category): Engineering & Science
Job Schedule: Full time
Remote: No
Job Description:
The opportunity
Hitachi Energy Service is a trusted lifecycle partner, providing customers with secure, sustainable, and innovative service solutions, globally. Our Service...
-
High Voltage Site Service Engineer
7 days ago
Beijing, Beijing, China Hitachi Energy Full time $60,000 - $120,000 per year
The opportunity
Hitachi Energy Service is a trusted lifecycle partner, providing customers with secure, sustainable, and innovative service solutions, globally. Our Service offerings empower customers & partners to holistically manage the asset lifecycle, from start-of-life (e.g., Install & Commission), through services designed to strengthen operational-life...
-
Beijing, Beijing, China dss+ Full time CN¥120,000 - CN¥240,000 per year
Job description
Who are we?
We are an international management consulting firm specialising in safety management, cultural transformation, and operational improvement. Our strong DuPont heritage and established reputation working with multinational companies and institutions around the world make us recognised leaders in our field. As trusted partners, we...
-
On-Site Technical Lead
7 days ago
Beijing, Beijing, China Alma Lasers Full time CN¥60,000 - CN¥120,000 per year
Key Responsibilities
· Oversee on-site installation, setup, and operation of production and testing equipment.
· Ensure adherence to defined engineering, safety, and quality standards.
· Coordinate technical and operational activities between local manufacturing, QA, logistics, and the overseas engineering teams.
· Support process validation, testing procedures, and...
-
On-Site Technical Lead
7 days ago
Beijing, Beijing, China Alma Lasers Full time CN¥100,000 - CN¥120,000 per year
Key Responsibilities
· Oversee on-site installation, setup, and operation of production and testing equipment.
· Ensure adherence to defined engineering, safety, and quality standards.
· Coordinate technical and operational activities between local manufacturing, QA, logistics, and the overseas engineering...
-
Beijing, Beijing, China Universal Beijing Resort Full time
Job Summary
Perform electrical and controls discipline related design, engineering, configuration, installation, testing, and maintenance for projects, events, and facilities to provide safe, reliable, and operable attractions focused on Live Entertainment productions. Ensure designs, components, systems, and installation are optimized for safety and...
-
Engineering Manager
1 week ago
Beijing, Beijing, China 仲量联行测量师事务所(上海)有限公司 Full time CN¥1,200,000 - CN¥1,500,000 per year
Job responsibilities:
Job Description Summary
Reporting to the Site Senior Facilities Manager and collaborating with Regional Teams as well as the Regional Engineering/Project/EHS organization to maintain the reliability of facilities operations. The Engineering Manager will ensure successful Engineering and Project delivery to the client at the Beijing site. As an Engineering...
-
Engineering Manager
1 week ago
Beijing, Beijing, China JLL Full time CN¥1,200,000 - CN¥2,400,000 per year
Job Description Summary
Reporting to the Site Senior Facilities Manager and collaborating with Regional Teams as well as the Regional Engineering/Project/EHS organization to maintain the reliability of facilities operations. The Engineering Manager will ensure successful Engineering and Project delivery to the client at the Beijing site. As an Engineering Manager, maintaining...
-
Engineering Manager
1 week ago
Beijing, Beijing, China JLL Full time CN¥600,000 - CN¥1,200,000 per year
JLL empowers you to shape a brighter way. Our people at JLL and JLL Technologies are shaping the future of real estate for a better world by combining world class services, advisory and technology for our clients. We are committed to hiring the best, most talented people and empowering them to thrive, grow meaningful careers and to find a place where...
-
Sr. Software Engineer--Teams
1 week ago
Beijing, Beijing, China Microsoft Full time CN¥1,500,000 - CN¥1,800,000 per year
Design, develop, and maintain new features and enhance existing systems. Write clean, testable, and maintainable code. Troubleshoot live-site issues, deploy fixes, and improve system reliability. Work collaboratively with cross-functional teams to drive project success. Ensure security compliance by configuring, updating, and maintaining security tools and...