Role: Technical Program Manager (Platform and SRE)
Function: Program Management / Site Reliability
Location: Bengaluru
Industry: AI infrastructure, Cloud Computing
About Company
This role is with a rapidly growing AI infrastructure startup founded in 2025 in Bengaluru by a leadership team with deep product, cloud, and systems experience from global-scale tech companies. The company has built a GenAI-powered private cloud platform that automates and manages complex AI workloads across hybrid, on-prem, edge, and sovereign cloud environments. It is designed for enterprise sectors where performance, data security, and compliance are critical. Backed by leading global VCs and prominent operators (approx. $10M seed raised), the company is recognized for strong engineering rigor and product clarity. Its platform focuses on AI-native orchestration, deep observability, and cost/performance optimization to help large enterprises deploy and scale AI with confidence. This is an opportunity to join early and shape the future of AI-first cloud infrastructure.
Position Overview
You orchestrate mission-critical platform and SRE programs that power an AI-first cloud used by security-sensitive enterprises. You partner with engineering and leadership to deliver secure, observable, and cost-optimized infrastructure that enables customers to run AI workloads with confidence. Your work establishes the incident management standards and identity controls, and delivers the FinOps insights, that shape the platform’s rapid growth.
Role & Responsibilities
- Run end-to-end programs for platform capabilities, including logging, metrics, traces, dashboards, alert policies, and cost views.
- Drive security and identity initiatives covering key management, RBAC, SSO, baseline policies, and audit trails.
- Coordinate delivery of platform Infrastructure-as-Code modules, shared environments, and drift detection in partnership with SRE and infrastructure teams.
- Standardize incident management for data and AI platforms: define SLOs, create runbooks, manage rollout strategy, and lead post-incident reviews.
- Track FinOps metrics for GPU and general compute, and present usage and optimization insights to leadership.
Must-Have Criteria
- 4–5 years in technical program management or platform/SRE/DevOps roles running multi-team deliveries.
- Solid understanding of cloud infrastructure, containers and Kubernetes, observability stacks, and CI/CD.
- Hands-on exposure to Terraform or other Infrastructure-as-Code tools and platform security concepts (identity, secrets, policy as code).
- Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
Nice to Have
- Experience with API gateways or service meshes.
- Prior work in high-growth or distributed teams.
- Familiarity with ML/AI platform SRE.