Site Reliability Engineer

ilert GmbH

3 months ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English, German

Experience level

Intermediate

Job location

Remote

Tech stack

Artificial Intelligence

Amazon Web Services

Software Debugging

Linux

DevOps

Disaster Recovery

Distributed Systems

Identity and Access Management

Performance Tuning

Reliability Engineering

Backend

Kubernetes

Apache Kafka

Terraform

Job description

As a Site Reliability Engineer at ilert, you'll own the reliability, performance, and scalability of our core platform across AWS, Kubernetes, Kafka, and more.

Build & operate a highly available platform

Run and evolve our AWS-based infrastructure
Operate and optimize our self-managed Kafka and ClickHouse clusters and our observability stack
Ensure resilience, disaster recovery, and capacity planning across the stack

Improve reliability & performance

Build and maintain SLOs, SLIs, error budgets, and observability dashboards
Debug production issues across layers (networking, Kubernetes, application, DB)
Improve performance of our ingestion pipeline

Automation & tooling

Automate operations with Terraform, Helm, Kubernetes operators, and internal tooling
Build tooling for safer deploys, blue/green rollouts, and automated verification
Strengthen incident response workflows through deep collaboration with our AI SRE agent team

Security & compliance

Implement best practices for workload isolation, secrets management, IAM, and auditability
Support our ISO 27001 posture by automating controls and hardening our infrastructure

Cross-functional impact

Partner with Backend, AI, and Product teams to design reliable services
Participate in the on-call rotation
Lead post-incident reviews and drive long-term reliability improvements

Requirements

3+ years of experience as an SRE, Platform Engineer, DevOps Engineer, or Infrastructure Engineer
Strong hands-on experience with AWS, Kubernetes, Linux internals, networking, and performance tuning
Experience operating self-managed distributed systems, ideally Kafka or ClickHouse
Strong understanding of observability
Experience automating infrastructure with Terraform and CI/CD systems
Fluent English (our working language); German optional

Benefits & conditions

Location: Hybrid - Cologne (Rheinauhafen) - 3 days in the office, 2 remote (Tue + Thu)
Team: Engineering · Reports to CTO

Product-centric - 100% focused on solving a mission-critical pain felt by every always-on business
Hybrid freedom - 2 days remote by default; gorgeous Rheinauhafen roof terrace when you're in town
Focus > meetings - We time-box syncs, favour async docs and protect maker time
28 days off - …plus public holidays
Commute perks - subsidised public transport