Deltatre

Site Reliability Engineer (SRE)

Engineering & Technology

Toronto

Position

The Site Reliability Engineer (SRE) is responsible for improving the reliability, stability, and operational readiness of critical digital platforms. The role focuses on proactively reducing risk, strengthening system resilience, and enabling product and engineering teams to operate with confidence—particularly during live events, launches, and other high-traffic periods. This role is dedicated to a major downtown Toronto-based client.

The role requires a degree of flexibility to support live operations onsite (in the client’s operations center) and to provide regular on-call support during evening and weekend live event windows and other key periods. If these requirements lead to work beyond 44 hours per week, overtime will be paid.

Outside of these event-driven windows, the role supports flexible and remote working arrangements, provided some consistent onsite presence is maintained.

SREs will operate, monitor, and enhance the Deltatre OTT platform, which is designed to withstand millions of concurrent users, using the latest technologies. On a daily basis, SREs will automate, maintain, secure, and innovate on our cloud-based platform. SREs will collaborate with other engineering teams, service owners, and support teams to ensure services are highly available and performant.

Key Responsibilities

Reliability & Stability

Improve system availability, performance, and fault tolerance across production environments.
Define, measure, and track Service Level Objectives (SLOs), error budgets, and reliability metrics.
Identify systemic risks and lead initiatives to reduce operational fragility.

Incident Management & Readiness

Lead or support incident response for high-severity production issues, particularly during evenings, weekends, and live operations as required.
Establish and refine incident response processes, runbooks, and escalation paths, ensuring that B2B and Incident Management teams are informed of and trained on the procedures.
Conduct post-incident reviews (blameless retrospectives) and ensure follow-up actions are completed.

Observability & Tooling

Design and maintain monitoring, alerting, and logging strategies that prioritize actionable signals over noise.
Improve visibility into system health to enable faster detection and resolution of issues.
Partner with engineering teams to embed reliability considerations into system design.

Automation & Operational Efficiency

Reduce manual operational effort through automation, tooling, and improved deployment practices.
Improve deployment safety, rollback mechanisms, and change management processes.
Support capacity planning and performance testing.

Requirements

We’re looking for a persistent, hands-on problem solver who takes ownership from first alert through to permanent resolution. You’ll have practical experience across most of the components in our technology stack and be comfortable operating in live, high-availability environments.

Core technical experience includes:

Cloud platforms such as AWS and/or Azure
Containerized workloads using Docker and OCI-compliant containers
MongoDB (including monitoring and operating in production) and Redis
CI/CD pipelines using tools such as Bamboo, GitHub, and Octopus
Scripting and automation with PowerShell and/or bash
Observability and monitoring platforms such as New Relic and Datadog
Infrastructure as Code using Terraform and/or CloudFormation

Programming & systems expertise:

Proficiency in one or more general-purpose programming languages, such as C#, JavaScript, Java, PowerShell, Go, or Python
Strong ability to read, understand, and debug .NET / C# applications (a significant advantage, as our backend services are written in C#)
Experience developing or supporting highly scalable, distributed systems
Hands-on experience with microservices architectures, leveraging virtualization and/or containerization
Full-stack troubleshooting capability, spanning network, application, infrastructure, and distributed services layers
Familiarity with load and performance testing tools such as k6, Gatling, or JMeter

We’re looking for someone who is:

driven to push boundaries and lead change and performance
communicative, leaving no one in the dark and working successfully with your team
reliable, so we know we can call on you to meet deadlines
passionate about the latest technologies and standards
proactive in suggesting improvements and in identifying and fixing potential issues
technically solid, able to advise both clients and internal teams

Our people are key to our success and we pride ourselves on offering a dynamic, creative, innovative and supportive environment. Having the right combination of a 'can-do' approach, strong work ethic, integrity, friendliness and attention to detail is crucial.

Even if you don’t tick all the boxes for one particular role but have a keen interest in what we do, send us your details; we may find a suitable match during the interview process.

Deltatre consciously nurtures an environment where each and every team member feels safe to bring their whole selves to work, in which everyone is valued and respected for who they are and what they bring. Everyone has the opportunity to reach their full potential, and every team member is expected to treat everyone with dignity and respect, value different perspectives, use inclusive language and work in alignment with Deltatre's commitment to diversity and inclusion. At Deltatre, everyone is welcome and celebrated.

We are committed to ensuring that we provide equal opportunities for all. Please let us know if you need us to make any adjustments or if you have any special requirements for the interview process. Depending on the role, the process normally includes a written test and an interview.

The salary range for this position is CAD 110,000 to CAD 164,000.