Vacancy in Poland: Senior Site Reliability Engineer, AI Studio (Inference Platform)

Why Join Us?
We are leading a new era in cloud computing to empower the global AI economy. Our mission is to build tools and infrastructure that enable customers to deploy state-of-the-art AI solutions at scale—without prohibitive costs or the need for large in-house teams. You will work at the forefront of AI infrastructure alongside some of the most innovative engineers in the industry.

Where You’ll Work
Headquartered in Amsterdam and listed on Nasdaq, the company operates globally with R&D hubs across Europe, North America, and Israel. Our team of over 800 professionals includes more than 400 highly skilled engineers specializing in hardware, software, and AI research.

About the Role
You will be part of a team building one of the world’s largest GPU clouds, powering an inference platform that makes deploying any kind of foundation model—text, vision, audio, and emerging multimodal architectures—fast, reliable, and effortless at scale.

In this role, you will own the reliability, performance, and observability of the entire inference stack. Your responsibilities will include:

  • Designing and refining telemetry pipelines to transform hundreds of terabytes of signals into clear, actionable insights.

  • Tuning Kubernetes autoscalers to optimize GPU efficiency.

  • Crafting Terraform modules that embed resilience into every new cluster.

  • Hardening request-routing and retry logic to make transient failures invisible to end users.

  • Developing automation and runbooks to detect, isolate, and resolve incidents quickly.

  • Driving a strong post-mortem culture to prevent recurrence.

All of this work serves a single goal: scaling the platform smoothly while meeting ambitious cost and reliability objectives.

What We’re Looking For

  • Deep experience with Kubernetes, Prometheus, Grafana, Terraform, and infrastructure-as-code.

  • Proficiency in scripting with Python or Bash.

  • Strong understanding of alerting, SLOs, and high-throughput API reliability.

  • Familiarity with the behavior of distributed systems in production.

  • Experience managing GPU-heavy workloads (vLLM, Triton, Ray, or similar).

  • Background in MLOps or model-hosting platforms is a plus.

  • Passion for building self-healing systems and debugging performance from kernel to application layer.

  • Strong collaboration skills and a mindset of making reliability an invisible feature for users.

What We Offer

  • Competitive salary and a comprehensive benefits package.

  • Opportunities for professional growth within a rapidly scaling organization.

  • Hybrid working arrangements.

  • A dynamic, collaborative environment that values initiative and innovation.
