Senior Site Reliability Engineering (SRE) Leader for AI and GPU Infrastructure

NVIDIA

Description

NVIDIA is leading the way in the AI revolution, transforming industries with our cutting-edge GPU technology. Our GPUs fuel groundbreaking innovations across domains such as self-driving cars, computer vision, and speech recognition. As the premier AI computing company, we relentlessly push the boundaries of AI, big data, and deep learning. We are searching for a bold and visionary leader to join us as a Senior SRE Leader. In this role, you will manage globally distributed clusters, ensuring seamless operations and delivering AI services that drive advancements in life sciences and natural language processing. Your responsibilities will include building and operating large-scale GPU clusters across multiple cloud providers and designing processes that enhance our operational ecosystem.

Company Culture and Environment

At NVIDIA, we foster a culture of creativity and autonomy, encouraging our engineers to innovate and drive technology forward. Our teams are made up of some of the most experienced and versatile professionals in the industry, contributing to a collaborative and supportive work environment that values diverse perspectives.

Career Growth and Development Opportunities

NVIDIA offers numerous opportunities for professional growth and career advancement. As part of our extraordinary engineering teams, you will have the chance to develop your skills, mentor others, and lead projects that influence the future of AI.

Detailed Benefits and Perks
• Comprehensive benefits package including equity options
• Opportunities for ongoing training and development
• Flexible working conditions and a supportive work environment
• Access to cutting-edge technologies and tools

Compensation and Benefits

The base salary range for this position is 272,000 USD - 419,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits (https://www.nvidia.com/en-us/benefits/).

Why you should apply for this position today

By joining NVIDIA, you will be part of a transformative journey in AI, working on groundbreaking projects and collaborating with talented teams. Your contributions will directly impact the future of technology and innovation in diverse industries.

Skills
• Strong Unix/Linux knowledge and proficiency in at least two programming languages (Perl, Python, Go)
• Expertise in managing large-scale distributed systems and AI/HPC environments
• Experience supporting AI/ML workloads with operational best practices
• Leadership experience with mentoring and coaching skills
• Ability to quickly learn and integrate new technologies
• Strong collaboration skills across engineering, server, storage, and security teams

Responsibilities
• Manage distributed, multi-location GPU clusters for AI research
• Lead a team of SREs, driving cluster operational excellence and efficiency
• Deliver scalable distributed systems and AI services in fast-paced environments
• Build strong, globally distributed teams and drive technical strategy
• Collaborate across the company to enhance the GPU ecosystem for AI use cases
• Address reliability, efficiency, and productivity challenges for GPU infrastructure
• Define strategy, manage projects, and provide technical leadership across multiple areas
• Ensure transparency on budget and operational efficiency with internal collaborators

Qualifications
• 10+ years of experience in engineering management; 3+ years in leadership roles
• Bachelor’s or Master’s in Computer Science or a related field, or equivalent experience
• Proven experience managing large-scale distributed systems and AI/HPC environments
• Familiarity with deep learning frameworks like PyTorch and TensorFlow

Education Requirements
• Bachelor’s degree in Computer Science, Engineering, or a related field
• Master’s degree preferred but not required

Education Requirements Credential Category
• Computer Science
• Engineering

Experience Requirements
• 10+ years of overall engineering management experience
• 3+ years in leadership roles
• Background in supporting AI/ML workloads and driving operational best practices

Why work in Santa Clara, CA

Santa Clara is a vibrant hub for technology and innovation, home to many leading tech companies. The city offers a diverse cultural scene, excellent dining options, and beautiful parks for recreation. Living in Santa Clara provides a unique opportunity to be at the forefront of technological advancements while enjoying a high quality of life.

Location: Santa Clara, CA

Posted: Oct. 11, 2024, 4:36 p.m.
