We are seeking an experienced AI Engineer – Intelligent Operations (Infrastructure) to design and implement AI-driven solutions that enhance infrastructure monitoring, automation, and operational efficiency. The ideal candidate will work at the intersection of AI/ML, cloud infrastructure, and DevOps to build intelligent operational systems.
Develop and deploy AI/ML models for infrastructure monitoring and predictive maintenance
Automate incident detection, root cause analysis, and remediation workflows
Integrate AI solutions with cloud and on-prem infrastructure platforms
Build data pipelines for infrastructure logs and telemetry analysis
Collaborate with DevOps, SRE, and Cloud teams
Optimize system performance, scalability, and reliability
Implement MLOps practices for model deployment and lifecycle management
Provide technical leadership and maintain clear technical documentation
Strong experience with Python and AI/ML frameworks (TensorFlow, PyTorch, scikit-learn)
Experience working with infrastructure monitoring data (logs, metrics, traces)
Knowledge of cloud platforms (AWS, Azure, or GCP)
Experience with Docker and Kubernetes
Understanding of DevOps and CI/CD practices
Strong analytical and problem-solving skills
Experience in AIOps or intelligent automation
Knowledge of monitoring tools (Splunk, Datadog, Prometheus, etc.)
Experience with MLOps tools (MLflow, SageMaker, Vertex AI)
Strong communication and stakeholder collaboration skills