Site Reliability Engineer

ilert GmbH
3 months ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English, German
Experience level
Intermediate

Job location

Remote

Tech stack

Artificial Intelligence
Amazon Web Services
Software Debugging
Linux
DevOps
Disaster Recovery
Distributed Systems
Identity and Access Management
Performance Tuning
Reliability Engineering
Backend
Kubernetes
Apache Kafka
Terraform

Job description

As a Site Reliability Engineer at ilert, you'll own the reliability, performance, and scalability of our core platform across AWS, Kubernetes, Kafka, and more.

Build & operate a highly available platform

  • Run and evolve our AWS-based infrastructure
  • Operate and optimize our self-managed Kafka and ClickHouse clusters and our observability stack
  • Ensure resilience, disaster recovery, and capacity planning across the stack

Improve reliability & performance

  • Build and maintain SLOs, SLIs, error budgets, and observability dashboards
  • Debug production issues across layers (networking, Kubernetes, application, DB)
  • Improve performance of our ingestion pipeline

Automation & tooling

  • Automate operations with Terraform, Helm, Kubernetes operators, and internal tooling
  • Build tooling for safer deploys, blue/green rollouts, and automated verification
  • Strengthen incident response workflows through deep collaboration with our AI SRE agent team

Security & compliance

  • Implement best practices for workload isolation, secrets management, IAM, and auditability
  • Support our ISO 27001 posture by automating controls and hardening our infrastructure

Cross-functional impact

  • Partner with Backend, AI, and Product teams to design reliable services
  • Participate in the on-call rotation
  • Lead post-incident reviews and drive long-term reliability improvements

Requirements

  • 3+ years of experience as an SRE, Platform Engineer, DevOps Engineer, or Infrastructure Engineer
  • Strong hands-on experience with AWS, Kubernetes, Linux internals, networking, and performance tuning
  • Experience operating self-managed distributed systems, ideally Kafka or ClickHouse
  • Strong understanding of observability
  • Experience automating infrastructure with Terraform and CI/CD systems
  • Fluent English (our working language); German optional

Benefits & conditions

Location: Hybrid - Cologne (Rheinauhafen) - 3 days in the office, 2 remote (Tue + Thu)
Team: Engineering · Reports to CTO

  • Product-centric - 100% focused on solving a mission-critical pain felt by every always-on business
  • Hybrid freedom - 2 days remote by default; gorgeous Rheinauhafen roof terrace when you're in town
  • Focus > meetings - We time-box syncs, favour async docs and protect maker time
  • 28 days off - plus public holidays
  • Commute perks - subsidised public transport
