Job Title: Site Reliability Engineer (SRE)
Primary Skill: Site Reliability Engineering
Location: Mississauga, ON
Job Type: Full-Time (Hybrid)
Job Description:
We are seeking a skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of critical systems and applications. The successful candidate will work closely with development and operations teams to automate processes, monitor system health, and improve system resilience.
Key Responsibilities:
Maintain and improve system reliability, availability, and performance.
Implement monitoring, alerting, and incident response processes.
Automate infrastructure and operational tasks using scripting and DevOps tools.
Collaborate with development teams to ensure scalable and reliable system design.
Manage CI/CD pipelines and deployment processes.
Perform root cause analysis and implement solutions to prevent recurring incidents.
Support cloud infrastructure and containerized environments.
Document system architecture, procedures, and operational practices.
Required Skills & Qualifications:
Experience in Site Reliability Engineering or DevOps roles.
Strong knowledge of Linux systems and cloud platforms (AWS, Azure, or GCP).
Experience with monitoring tools such as Prometheus or Grafana.
Knowledge of container technologies such as Docker and Kubernetes.
Proficiency in scripting languages such as Python, Bash, or Go.
Strong troubleshooting and problem-solving skills.
Preferred Qualifications:
Experience with infrastructure as code (Terraform, Ansible, or similar).
Knowledge of CI/CD tools such as Jenkins, GitHub Actions, or GitLab CI.
Familiarity with microservices architecture and distributed systems.