Vacancy in Poland: Senior Site Reliability Engineer - AI Studio (Inference Platform)

Why Join Us?
We are leading a new era in cloud computing to empower the global AI economy. Our mission is to build tools and infrastructure that enable customers to deploy state-of-the-art AI solutions at scale—without prohibitive costs or the need for large in-house teams. You will work at the forefront of AI infrastructure alongside some of the most innovative engineers in the industry.

Where You’ll Work
Headquartered in Amsterdam and listed on Nasdaq, the company operates globally with R&D hubs across Europe, North America, and Israel. Our team of over 800 professionals includes more than 400 highly skilled engineers specializing in hardware, software, and AI research.

About the Role
You will be part of a team building one of the world’s largest GPU clouds, powering an inference platform that makes deploying any kind of foundation model—text, vision, audio, and emerging multimodal architectures—fast, reliable, and effortless at scale.

In this role, you will own the reliability, performance, and observability of the entire inference stack. Your responsibilities will include:

Designing and refining telemetry pipelines to transform hundreds of terabytes of signals into clear, actionable insights.
Tuning Kubernetes autoscalers to optimize GPU efficiency.
Crafting Terraform modules that embed resilience into every new cluster.
Hardening request-routing and retry logic to make transient failures invisible to end users.
Developing automation and runbooks to detect, isolate, and resolve incidents quickly.
Driving a strong post-mortem culture to prevent recurrence.

All of this work serves a single goal: scaling the platform smoothly while meeting ambitious cost and reliability objectives.

What We’re Looking For

Deep experience with Kubernetes, Prometheus, Grafana, Terraform, and infrastructure-as-code.
Proficiency in scripting with Python or Bash.
Strong understanding of alerting, SLOs, and high-throughput API reliability.
Familiarity with the behavior of distributed systems in production.
Experience managing GPU-heavy workloads (vLLM, Triton, Ray, or similar).
Background in MLOps or model-hosting platforms is a plus.
Passion for building self-healing systems and debugging performance from kernel to application layer.
Strong collaboration skills and a mindset of making reliability an invisible feature for users.

What We Offer

Competitive salary and a comprehensive benefits package.
Opportunities for professional growth within a rapidly scaling organization.
Hybrid working arrangements.
A dynamic, collaborative environment that values initiative and innovation.