Job Title: SRE Lead – Banking Domain (Wealth Management Preferred)
Location: Downtown Toronto, ON (Onsite – 5 Days/Week)
Experience: 10+ Years
About the Role:
We are looking for a highly skilled Site Reliability Engineering (SRE) Lead with a strong background in the Banking domain, ideally within Wealth Management. The ideal candidate will lead the SRE function to ensure system reliability, scalability, and performance across mission-critical financial applications. This role combines hands-on technical work with leadership responsibilities to drive service excellence and operational efficiency.
Key Responsibilities:
· Lead and mentor a team of SREs responsible for production stability, reliability, and availability of banking and wealth management systems.
· Design and implement monitoring, alerting, and incident response strategies to proactively manage system health.
· Collaborate with development and infrastructure teams to drive DevOps and automation initiatives, ensuring smooth CI/CD pipelines.
· Define and implement SLIs, SLOs, and SLAs to measure and improve service performance.
· Own incident management, root cause analysis (RCA), and problem resolution to minimize downtime and business impact.
· Lead capacity planning, performance tuning, and disaster recovery strategies.
· Drive observability and resilience engineering best practices across all platforms.
· Work closely with stakeholders in banking and wealth management domains to align reliability goals with business needs.
· Establish governance processes and ensure compliance with financial regulatory and security standards.
· Develop dashboards and reporting metrics to provide visibility into system performance and reliability.
· Champion a culture of continuous improvement, automation, and a reliability-first mindset.
Required Skills & Experience:
· 10+ years of total IT experience, with at least 4 years in Site Reliability Engineering or Production Operations leadership roles.
· Strong domain experience in Banking, with exposure to Wealth Management systems (highly desirable).
· Expertise in Linux/Unix administration, networking, and cloud infrastructure (AWS, Azure, or GCP).
· Strong scripting and automation experience (Python, Shell, or similar).
· Proficiency in monitoring and observability tools such as Prometheus, Grafana, Splunk, ELK, AppDynamics, or Dynatrace.
· Experience with CI/CD pipelines and tooling such as Git, Jenkins, Ansible, Terraform, or equivalent.
· In-depth understanding of incident, problem, and change management based on ITIL principles.
· Proven track record in managing production systems supporting large-scale, high-availability financial applications.
· Excellent communication, stakeholder management, and team leadership skills.