Staff Software Engineer, ML Fleet, Monitoring
Company: Google
Location: Sunnyvale
Posted on: April 2, 2026
Job Description:
Minimum qualifications:
- Bachelor’s degree or equivalent practical experience.
- 8 years of experience in software development.
- 5 years of experience testing and launching software products, and 3 years of experience with software design and architecture.
- 5 years of experience with one or more of the following: speech/audio (e.g., technology duplicating and responding to the human voice), reinforcement learning (e.g., sequential decision making), machine learning (ML) infrastructure, or specialization in another ML field.
- 5 years of experience with ML design and ML infrastructure (e.g., model deployment, model evaluation, data processing, debugging, fine-tuning).

Preferred qualifications:
- Master’s degree or PhD in Engineering, Computer Science, or a related technical field.
- 8 years of experience with data structures and algorithms.
- 3 years of experience in a technical leadership role leading project teams and setting technical direction.
- 3 years of experience working in an organization involving cross-functional or cross-business projects.
- Experience in predictive maintenance, anomaly detection, or systems reliability engineering.
- Ability to translate complex technical findings into actionable business strategies for executive stakeholders.

About the job
Google's software engineers
develop the next-generation technologies that change how billions
of users connect, explore, and interact with information and one
another. Our products need to handle information at massive scale,
and extend well beyond web search. We're looking for engineers who
bring fresh ideas from all areas, including information retrieval,
distributed computing, large-scale system design, networking and
data storage, security, artificial intelligence, natural language
processing, UI design and mobile; the list goes on and is growing
every day. As a software engineer, you will work on a specific
project critical to Google’s needs with opportunities to switch
teams and projects as you and our fast-paced business grow and
evolve. We need our engineers to be versatile, display leadership
qualities and be enthusiastic to take on new problems across the
full stack as we continue to push technology forward.

In this role, you will take control of the world’s largest data center
footprint as an Applied Artificial Intelligence/Machine Learning (AI/ML)
Specialist on a team responsible for the fault tolerance of
Google’s entire fleet, including the ML Tensor Processing Units
(TPUs). You will pioneer the use of AI/ML to solve complex
infrastructure challenges by leveraging petabytes of operational
and telemetry data, directly empowering the very AI/ML systems that
drive the future of Google.

The AI and Infrastructure team is
redefining what’s possible. We empower Google customers with
breakthrough capabilities and insights by delivering AI and
Infrastructure at unparalleled scale, efficiency, reliability and
velocity. Our customers include Googlers, Google Cloud customers,
and billions of Google users worldwide. We're the driving force
behind Google's groundbreaking innovations, empowering the
development of our cutting-edge AI models, delivering unparalleled
computing power to global services, and providing the essential
platforms that enable developers to build the future. From software
to hardware, our teams are shaping the future of world-leading
hyperscale computing, with key teams working on the development of
our TPUs, Vertex AI for Google Cloud, Google Global Networking,
Data Center operations, systems research, and much more.

The US base salary range for this full-time position is $207,000-$300,000
+ bonus + equity + benefits. Our salary ranges are determined by role,
level, and location. Within the range, individual pay is determined
by work location and additional factors, including job-related
skills, experience, and relevant education or training. Your
recruiter can share more about the specific salary range for your
preferred location during the hiring process. Please note that the
compensation details listed in US role postings reflect the base
salary only, and do not include bonus, equity, or benefits. Learn
more about benefits at Google.

Responsibilities
- Lead the design and implementation of solutions in specialized ML areas, optimize ML infrastructure, and guide the development of model optimization and data processing strategies.
- Design and implement AI/ML models to predict, detect, and mitigate hardware and software faults across a global fleet.
- Analyze petabytes of telemetry and performance data to uncover insights that improve the reliability of ML TPUs and traditional compute infrastructure.
- Build scalable automated systems that allow Google’s data center footprint to grow while maintaining industry-leading uptime.
- Partner with hardware designers and Site Reliability Engineers (SREs) to integrate intelligent diagnostics into the core data center lifecycle.
Keywords: Google, West Sacramento, Staff Software Engineer, ML Fleet, Monitoring, Engineering, Sunnyvale, California