Job Description

Senior DevOps Engineer - AI/ML Infrastructure

Position Overview: We are seeking an experienced Senior DevOps Engineer to build and maintain the production infrastructure for our enterprise AI automation platform. This role combines traditional DevOps expertise with specialized knowledge of AI/ML workloads, focusing on reliability, scalability, and cost optimization of agentic AI systems. The successful candidate will work as part of our Agentic AI development team to ensure robust, production-ready deployments of complex AI workflows.

Key Responsibilities:

  • Design and implement CI/CD pipelines for AI applications, including model deployment and agent workflows
  • Build and maintain Kubernetes clusters optimized for AI workloads, including GPU resource management
  • Implement comprehensive monitoring and observability for AI systems, including custom metrics for model performance
  • Develop Infrastructure-as-Code solutions for scalable AI service deployments
  • Establish reliability engineering practices, including SLA management and incident response for AI systems
  • Optimize cloud infrastructure costs, with a focus on GPU utilization and LLM API usage
  • Implement security and compliance frameworks for AI applications and data pipelines
  • Collaborate with development teams to ensure production readiness of AI agents and RAG systems
  • Manage multi-cloud deployments and vendor integrations for AI services

Required Qualifications:

  • Bachelor's degree in Computer Science, Engineering, or related technical field
  • 7-10 years of DevOps/Infrastructure experience with demonstrated production system ownership
  • Strong expertise in Kubernetes orchestration and container management (Docker)
  • Proficiency in Python scripting and automation
  • Extensive experience with Linux system administration and performance tuning
  • Hands-on experience with Jenkins or similar CI/CD platforms
  • Production experience with cloud platforms (AWS, GCP, or Azure)
  • Experience with Infrastructure-as-Code tools (Terraform, CloudFormation, or similar)

AI/ML Infrastructure Requirements:

  • Experience deploying and managing AI/ML workloads in production environments
  • Understanding of RAG system infrastructure requirements and vector database operations
  • Knowledge of LLM API integration patterns and rate limiting strategies
  • Experience with GPU cluster management and resource optimization
  • Familiarity with AI agent workflows and their operational characteristics

Site Reliability Engineering Skills:

  • Production monitoring and alerting experience with tools such as Prometheus, Grafana, or Datadog
  • Incident response and post-mortem experience with complex distributed systems
  • Capacity planning and performance optimization for high-traffic applications
  • Experience with log aggregation and distributed tracing systems
  • Understanding of reliability patterns, including circuit breakers and graceful degradation

Preferred Qualifications:

  • Experience with MLOps practices and model deployment pipelines
  • Knowledge of AI-specific monitoring, including model drift detection and performance metrics
  • Experience with cost optimization strategies for AI workloads
  • Background in financial services, gaming, or other high-availability environments
  • Certification in a major cloud platform (AWS Solutions Architect, GCP Professional, etc.)
  • Experience with service mesh technologies (Istio, Linkerd)

Technical Environment:

  • Multi-cloud infrastructure with primary focus on AWS/GCP
  • Kubernetes-based container orchestration
  • Modern observability stack with custom AI metrics
  • GitOps workflows and infrastructure automation
  • Integration with enterprise security and compliance frameworks