Ops - Software Engineer
Work Location: Petaling Jaya (First Avenue)
Job Description: [C]DevOps Engineer, ML Platform - AI Core Infrastructure
Get to Know the Role:
- As a DevOps Engineer in the ML Platform team, you will contribute to the creation and maintenance of our machine learning infrastructure. This is a heavily Infra/SRE-based role, embedded in an ML Platform team. You will help maintain, upgrade, and improve our infrastructure and provide ongoing support for it.
The Day-to-Day Activities:
- Deliver high-quality AI infrastructure solutions: You will work with the Machine Learning Platform team to design and develop the infrastructure to support distributed data processing and model training. You will utilize GitOps to ensure the reproducibility of the system's cloud infrastructure across different Kubernetes clusters.
- Develop observability solutions for Machine Learning pipelines: You will be responsible for developing and integrating monitoring and alerting within Grab’s monitoring stack powered by Datadog, Prometheus, and Grafana. You will also contribute to the creation of runbooks and DevOps guides.
The Must-Haves:
- Understanding of Terraform and popular modules such as the EKS module.
- Able to understand complex code bases and analyze dependencies.
- Understanding of Kubernetes and experience in managing large clusters.
- Understanding of core AWS cloud concepts like EC2, Auto Scaling groups, launch templates, subnets, etc.
- Understanding of core cluster components like CoreDNS, autoscaler, CSI drivers, load balancer controllers, service mesh, etc.
- Ability to perform zero-downtime cluster upgrades for clusters serving critical online traffic and batch jobs with tight SLAs.
Experience:
- Fresh graduates with relevant degrees in Computer Science, Software Engineering, or a related field and a strong understanding of the above requirements are welcome to apply.
- Prior experience in MLOps or related fields is a plus.