SRE (Site Reliability Engineer)
TON Foundation
TON Foundation is a non-profit organization supporting the growth of the TON Blockchain and its ecosystem. Founded in Switzerland in 2023 and backed by a global community, the Foundation empowers developers, creators, and businesses through grants, technical resources, and strategic partnerships. TON operates as a decentralized, open-source network, independent of centralized control and open to contributions from all.
We are looking for a Site Reliability Engineer to ensure a resilient, secure, and production-ready platform that enables the safe and efficient deployment of applications and services. This role focuses on improving service availability, monitoring, incident response, and system reliability, while supporting operational teams and driving continuous improvements in scalability, uptime, and platform stability.
Responsibilities
Increase the resilience and reliability of our PaaS solutions, including:
- Configure and maintain monitoring and alerting for our Kubernetes clusters and production services
- Load testing and performance tuning across our production services
- Build dashboards, monitoring, and alerting mechanisms
- Develop and integrate solutions with a bias for automation in order to improve and maintain reliability across the production estate and make recovery easier
- Design and implement fault-tolerant solutions across stateful services and supporting infrastructure
- Design and track metrics for uptime and performance ensuring high levels of visibility are maintained
- Collaborate closely with all other engineering functions to provide timely feedback from our environments
- Participate in the on-call rota and support incident response and service recovery
Requirements
- Experience with monitoring systems such as Prometheus, Grafana, and VictoriaMetrics
- Experience designing and supporting fault-tolerant Redis, RabbitMQ, and PostgreSQL clusters
- Strong understanding of scaling, resilience, and high availability under load
- Proficiency in load testing and performance tooling such as K6
- Strong Linux and scripting skills for platform automation and troubleshooting
- Ability to work closely with engineering teams to improve delivery, reliability, and developer experience