Senior Site Reliability Engineer - DGX Cloud
Description
Join NVIDIA as a Senior Site Reliability Engineer for DGX Cloud, where the focus is on designing, building, and maintaining large-scale production systems with high efficiency and availability. This role combines software and systems engineering practices to ensure our cloud services deliver maximum reliability and uptime. The engineer will automate operations, tune performance, and optimize system efficiency while facilitating developer changes with minimal disruption.
Responsibilities
Design and support operational aspects of large-scale Kubernetes clusters, with a focus on performance and real-time monitoring.
Improve the lifecycle of services, from design through operation and refinement.
Support systems through system design consulting, tool development, and capacity management.
Monitor system availability and health, and scale systems through automation.
Participate in incident response and write blameless postmortems.
Join the on-call rotation for production system support.
Requirements
Bachelor's degree in Computer Science or related technical field, or equivalent experience.
5+ years of experience in infrastructure automation and distributed system design.
Experience with languages such as Python, Go, Perl, or Ruby.
Strong knowledge of Linux, networking, and containers.
About
Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

