Job Type: Full Time
Job Category: IT
Job Description
Role: AWS Cloud Ops SME
Location: Rockville MD (Remote)
Duration: Full-time
Experience: 8+ years (must)
Responsibilities
- Oversee the management and maintenance of cloud infrastructure, ensuring high availability and reliability. Act as the primary point of contact for all cloud infrastructure-related issues and escalations.
- Ensure cloud resources are optimally configured and managed to meet performance and cost objectives.
- Implement and maintain monitoring solutions to track the health and performance of cloud infrastructure.
- Drive major and potential incidents end to end, providing periodic updates to client stakeholders for approvals and recommendations.
- Ensure due diligence and impact analysis for all changes implemented on the cloud platforms.
- Lead and mentor a team of cloud engineers and administrators, fostering a collaborative and high-performing work environment.
- Provide guidance and support to team members, facilitating their professional development and growth.
- Coordinate and manage the team's daily activities, ensuring alignment with organizational goals and priorities.
- Lead the response to cloud-related incidents, ensuring timely resolution and minimal impact on business operations.
- Develop and implement incident management processes and procedures.
- Perform root cause analysis and implement preventive measures to avoid recurrence of issues.
- Identify opportunities to automate repetitive tasks and processes to improve efficiency and reduce operational overhead.
- Develop and implement automation scripts and tools, leveraging Infrastructure as Code (IaC) practices.
- Continuously evaluate and improve cloud operations processes and procedures.
- Ensure cloud infrastructure adheres to security policies, standards, and best practices.
- Implement and maintain security controls to protect cloud resources and data.
- Ensure compliance with regulatory requirements and industry standards (e.g., GDPR, HIPAA).
- Monitor and analyze cloud resource usage, ensuring efficient utilization and avoiding over-provisioning.
- Conduct capacity planning to support future growth and demand.
- Implement cost management strategies to optimize cloud spending.
- Develop and implement disaster recovery and business continuity plans for cloud infrastructure.
- Ensure regular testing and validation of disaster recovery procedures.
- Ensure cloud infrastructure is resilient and can recover quickly from failures or disruptions.
- Work closely with other IT teams, business units, and stakeholders to understand requirements and deliver cloud solutions that meet their needs.
- Collaborate with vendors and service providers to evaluate and integrate new cloud technologies and services.
- Communicate effectively with stakeholders, providing regular updates on cloud operations and performance.
- Maintain comprehensive documentation of cloud infrastructure, configurations, processes, and procedures.
- Generate regular reports on cloud performance, incidents, and operational metrics.
- Ensure documentation is up to date and accessible to relevant stakeholders.
Skills:
- AWS, Terraform, IaC, Python
- AWS Cloud Infra Management
- Control Tower, AWS Organizations policies and management
- Multi-Account deployment and management
- AWS Backup and SSM patching processes (in detail)
- AMI deployments & pushing config to multiple accounts
- AWS EC2, ECS, EKS, RDS, S3, SageMaker, CloudFront, Lambda, etc.
- AWS S3, SFTP and Site externalization methods.
- IaC: Terraform, CloudFormation templates, and Python.
- IAM policies, access management, and restrictions.
- AWS Networking: VPC, ALB, NLB, Transit Gateway, WAF
- Azure AD SSO and App Proxy.
- CI/CD and basic DevOps
- Linux OS troubleshooting, Bash & Ansible.
- Any Windows AD skills would be an added advantage.
Required Skills
Cloud Developer, Cloud Security Engineer