HPC Linux Administrator for AI Infrastructure (Scientist 2/3)

Los Alamos National Laboratory

What You Will Do

Join the High Performance Operations Group (HPC-OPS) in operating and maintaining some of the fastest supercomputers in the world. Designing, operating, and maintaining these systems requires highly skilled personnel who specialize in both the hardware and software aspects of High Performance Computing. Innovators at heart, HPC-OPS Linux Administrators work both independently and collaboratively to maintain and implement capability improvements across a complex computing environment. This team is currently building on-premises, cloud-like infrastructure to support the AI/ML/LLM needs of the laboratory.

The Platforms Team is seeking to add highly knowledgeable and motivated team members to help build and deploy the AI/ML/LLM infrastructure for LANL. This person will be an expert Linux Administrator who will help design, build, and run our production NVIDIA DGX/HGX pods optimized for our environment and workflow. They will run and manage both admin and user-facing services with an understanding of modern AI/ML/LLM user workflows, Kubernetes, and other common tools. The successful candidate will participate in periodic on-call responsibilities managing NVIDIA SuperPODs and Kubernetes clusters, while actively growing their technical skills and staying up to date with the latest technologies in the field. In addition, the selected candidate will have the opportunity to develop technical products such as technical documentation, presentations, technical papers, and reports, to communicate findings internally and at conferences.

The selected HPC Cluster/NVIDIA SuperPOD Linux Administrator (Scientist 2/3) will provide strategic design, testing, analysis, administration, configuration management, verification, and validation of the newly developed cloud-like infrastructure and specialized compute infrastructure for AI/ML workloads. Mentoring of students, junior staff, and peers in technical and professional growth activities is highly valued, as is maintaining state-of-the-art technical expertise and knowledge within HPC system administration and developing new skills in related disciplines. This is your chance to directly support our national security mission and continue to make LANL the best place to work as a member of a dynamic, team-oriented, and leading-edge technical capability team.

What You Need

Minimum Job Requirements:

Scientist 2: ($101,700 - $168,200)
• Advanced Linux Administration Expertise: Demonstrated knowledge of administering production Linux computer systems, including strong command line Linux operating system skills, working knowledge of or experience with hardware and software security practices, and experience scripting in Bash, Perl, Python, or similar languages.
• Configuration Management Expertise: Demonstrated experience with configuration and automation tools and practices, such as Chef, Puppet, Ansible, Salt, CFEngine, or similar tools.
• Troubleshooting and Technical Analysis Acumen: Significant knowledge and demonstrated experience in formulating and testing hypotheses, investigating alternative solutions, and recommending solutions to technical problems.
• Computer Networking Expertise: Working knowledge of networking concepts and practices.
• Communication and Teaming Skills: Demonstrated effective communication skills, both verbal and written, including the ability to communicate technical information to both technical and non-technical personnel, to provide assistance and knowledge to peers, to collaborate with Group members, other HPC Group personnel and vendor representatives, as required, and to formulate and communicate technical results and findings to technical audiences and readerships (examples can include publications, team projects, and presentations).
• Troubleshooting skills: Demonstrated ability to troubleshoot hardware and software errors, prioritizing problems and assessing impact to stakeholders, documenting problems and solutions.

Additional Job Requirements for Scientist 3: ($122,300 - $206,300)

In addition to the Job Requirements outlined above, qualification at the Scientist 3 level requires:
• AI/ML Expertise: Strong understanding of AI/ML workflows and experience setting up and maintaining user-facing AI/ML tools and services (such as JupyterHub).
• Container Orchestration Expertise: Demonstrated experience managing, administering and maintaining large production Kubernetes clusters.
• Troubleshooting Expertise: Experience troubleshooting and debugging user workflows in a Kubernetes environment.
• Computer Networking Expertise: Experience with high-performance interconnects, preferably NVLink and InfiniBand networks.
• Leadership: Demonstrated experience with project planning and management. Ability to develop and lead complex projects, generate formal project plans, delegate tasks, and provide routine updates to management.
• HPC Experience: Demonstrated experience building, installing, and administering HPC systems. Experience with modern i

Location: Los Alamos, NM

Posted: Oct. 13, 2024, 3:39 a.m.
