Overview
8+ years of strong SRE experience
Must have 3 years of experience with Kubernetes and Docker
Implement and manage monitoring (ELK), alerting, and logging systems to ensure proactive identification and resolution of issues
Engage in and contribute to system monitoring, incident management, performance tuning, and fault finding
Must have scripting experience in Python, PowerShell, or another scripting language
Must be an effective communicator with excellent logical reasoning and problem-solving skills and a drive to make a difference
Good to have: experience with AI/ML Ops, release management, and CI/CD using tools such as GitHub, Black Duck Hub, and Coverity, plus container signing, with a good understanding of software configuration management.
Ability to understand and communicate customer issues.
Experience developing and supporting enterprise applications.
Good written and verbal communication skills with the ability to document and communicate technical information to IT professionals
Technical Skills:
Azure Cloud: core services, Event Hubs, IoT Hub, AKS, Databricks
Databases: PostgreSQL, MongoDB, Redis
Scripting: Bash / Python
Git
GitHub Actions
Key Responsibilities
Design, operate, and optimize scalable cloud environments using Azure services, with an emphasis on event-driven systems, secure networking, and platform resilience.
Design and maintain Kubernetes environments with Azure AKS, the Istio service mesh, and gRPC-based microservices.
Build and manage GitOps-based CI/CD pipelines for multi-environment deployments and image promotion strategies.
Implement and maintain secure authentication patterns (service principals, API tokens, managed identities).
Optimize environments through performance analysis.
Conduct system profiling, debugging, and performance tuning for .NET microservices in production.
Lead risk assessments, architectural evaluations, and operational readiness testing for cluster/network changes.
Lead and own production releases end to end, including planning, readiness, and execution.
Provide cross-team technical guidance on DevOps, security, networking, and platform decisions.