Salary
Not Disclosed
Job Type
full time
Location
Khurda
Shift: GMT+05:30 Asia/Kolkata (IST)
Placement Type: Full Time Indefinite Contract (40 hrs a week/160 hrs a month)
Note: This is a requirement for one of Uplers' clients - Strategic Transformation Through Digital Physical Innovation

What do you need for this opportunity?
Must have skills required: Grafana, Kubernetes tools, Monitoring tools, Prometheus, Scripting, CI/CD, CockroachDB, Terraform, AWS, Docker, GCP, GitHub, Kubernetes

Strategic Transformation Through Digital Physical Innovation is looking for:
DevOps Automation Engineer (Dual-Cloud Infrastructure - AWS, Google Cloud, CockroachDB, Tailscale)

About The Role
This role spans:
- Dual-cloud infrastructure (AWS and GCP)
- Developer workstation management
- Security automation
- Incident response
- CI/CD pipeline operations

The platform is scaling to serve millions of consumers across 20,000 veterinary clinics in 100 countries, with tens of millions of projected concurrent sessions. You must have real-world experience operating infrastructure at this scale, including:
- Massive server loads
- Database replication and failover at scale
- Disaster recovery
- Performance optimization under heavy concurrent traffic

The current production stack runs on ECS Fargate with RDS PostgreSQL, with CockroachDB planned for distributed workloads. You will manage:
- Tailscale VPN infrastructure
- 20 EC2 dev boxes
- Cloudflare for 23 zones
- PagerDuty alerting
- A robust security automation layer

This is a hands-on role with real operational ownership.

What You'll Do

Cloud Infrastructure (AWS and GCP)
- Design, build, and manage dual-cloud infrastructure across AWS and Google Cloud Platform
- Manage ECS Fargate deployments (task definitions, service discovery, ALB target groups, and blue/green deployments)
- Automate infrastructure provisioning using Terraform with modular, reusable configurations
- Build and maintain CI/CD pipelines using GitLab CI and GitHub Actions
- Manage containerized applications using Docker, ECS, and Kubernetes (EKS/GKE for planned workloads)
- Support multi-tenant and multi-region application architectures across 6 global regions
- Implement and maintain CockroachDB clusters for distributed, geo-partitioned data (planned migration from RDS PostgreSQL)
- Implement infrastructure cost optimization through:
  - Auto-scaling
  - Reserved capacity
  - Right-sizing
  - Spot instances
  - Savings Plans
- Continuously monitor and reduce cloud spend across AWS and GCP
- Optimize database costs through:
  - Right-sizing instances
  - Storage tiering
  - Reserved capacity
  - Query performance tuning

Developer Workstation Infrastructure
- Provision and manage 20 EC2 dev boxes across 3 AWS regions
- Build custom AMIs using Packer for dev boxes and DERP relays
- Deploy and maintain:
  - Memory watchdog
  - noVNC
  - CloudWatch agent configurations
- Run fleet management commands across dev boxes via AWS Systems Manager (SSM)
- Monitor dev box health and performance

Tailscale VPN Administration
- Manage Tailscale ACL policies and user access
- Operate custom DERP relays in 3 regions
- Configure app connectors for SaaS IP lockdown
- Maintain Mullvad VPN integration for egress control

Security Automation
- Own GuardDuty, Security Hub, and AWS Config across all regions
- Manage EventBridge rules for security alert routing
- Build and manage:
  - IAM policies
  - Secrets management
  - WAF
  - Zero-trust networking
- Administer GitHub Enterprise security, including:
  - Org management
  - IP allowlists
  - Secret scanning policies
  - Runner management

Scale, Performance, and Disaster Recovery
- Design and operate infrastructure capable of handling millions of concurrent users and tens of millions of sessions across global regions
- Implement and manage auto-scaling policies, including:
  - ECS service auto-scaling
  - EC2 ASGs
  - RDS read replicas
- Conduct load testing and capacity planning
- Design and maintain database scaling strategies:
  - Read replicas
  - Connection pooling
  - Query optimization
  - Sharding for high-throughput workloads
- Own disaster recovery (DR) planning and execution:
  - Multi-region failover
  - RTO/RPO targets
  - Automated recovery runbooks
  - Regular DR drills
- Implement and manage database backup strategies:
  - Point-in-time recovery
  - Cross-region replication
  - Automated restore testing
- Optimize CDN and edge caching (Cloudflare) for global traffic at scale
- Monitor and resolve performance bottlenecks across:
  - Application servers
  - Databases
  - Caches
  - Network layers
- Build runbooks for incident response during:
  - High-traffic events
  - Database failovers
  - Regional outages

Monitoring, Alerting, and Incident Response
- Configure and maintain PagerDuty
- Monitor system performance using:
  - Prometheus
  - Grafana
  - CloudWatch
  - Cloud Monitoring
- Manage EBS backup automation, including:
  - Daily backups
  - 30-day retention
  - Cross-region copy
  - Vault lock

CI/CD and Repository Operations
- Manage GitLab mirroring from GitHub
- Maintain 45 cron jobs on the admin box
- Manage Cloudflare across 23 zones, including:
  - CDN
  - DNS
  - WAF configuration
- Collaborate with developers to improve deployment workflows and reduce lead time

AI/ML Infrastructure and Tooling
- Use Claude Code / Cursor for:
  - Terraform authoring
  - Script generation
  - Infrastructure debugging
- Support AI/ML infrastructure, including:
  - GPU instance management
  - Model deployment pipelines
- Maintain and improve AI-assisted monitoring and alerting
- Support infrastructure requirements for AI-enabled platform capabilities

Must-Have Skills
- 3-5 years of experience in DevOps / Cloud / Infrastructure Automation at scale
- High-Scale Production Experience (Critical): must have operated infrastructure serving millions of users with high concurrency
- Experience with:
  - Server load management
  - Database scaling (read replicas, connection pooling, sharding)
  - Auto-scaling policies
  - Performance optimization under heavy traffic
- Strong hands-on experience with AWS, including:
  - ECS Fargate
  - EKS
  - Lambda
  - S3
  - RDS
  - CloudFront
  - SQS
  - IAM
  - SSM
  - GuardDuty
  - Security Hub
  - Config
- Working experience with Google Cloud Platform, including:
  - GKE
  - Cloud Run
  - BigQuery
  - Cloud Functions
  - IAM
- ECS Fargate production experience
- Terraform (Infrastructure as Code) with multi-environment, modular patterns
- Tailscale VPN administration (ACLs, DERP relays, app connectors)
- Packer for AMI builds
- Docker and container orchestration in production
- Experience with GitLab CI/CD and/or GitHub Actions
- GitHub Enterprise administration
- Cloudflare CDN/DNS/WAF management
- PagerDuty or equivalent incident response configuration
- Production experience with CockroachDB or distributed SQL databases (or strong willingness to learn)
- Disaster recovery planning and execution, including:
  - Multi-region failover
  - Backup automation
  - RTO/RPO targets
  - Recovery runbooks
- Database performance optimization at scale, including:
  - Replication
  - Connection pooling
  - Query tuning
  - Capacity planning
- Cost Optimization (Critical): proven track record of reducing cloud infrastructure costs through:
  - Right-sizing
  - Reserved capacity
  - Spot instances
  - Storage tiering
  - Waste reduction
- Good understanding of Linux systems, networking, and security fundamentals
- Strong communication skills and ability to work in a remote, globally distributed team

Nice-to-Have Skills
- Experience with Kubernetes tools:
  - Helm
  - ArgoCD
  - Flux
- Experience with monitoring stacks:
  - Prometheus
  - Grafana
  - ELK
  - Loki
- AWS Systems Manager fleet management at scale
- Experience working in startup or fast-paced product environments
- Scripting experience:
  - Bash
  - Python
  - Go
- Experience supporting AI/ML workloads and GPU infrastructure
- Experience with chaos engineering tools (Gremlin, Litmus) for resilience testing
- FinOps certification or formal cloud cost management framework experience

What We're Looking For (Mindset)
- Strong ownership and problem-solving mindset
- Comfort working in a fast-growing, evolving environment
- Ability to balance speed with stability and security
- Willingness to learn and adapt to new tools and technologies
- Clear, proactive communicator who surfaces issues early

How to apply for this opportunity
- Step 1: Click on Apply and register or log in on our portal.
- Step 2: Complete the Screening Form and upload your updated resume.
- Step 3: Increase your chances of getting shortlisted and meet the client for the interview.

About Uplers:
Our goal is to make hiring reliable, simple, and fast. Our role is to help all our talents find and apply for relevant contractual onsite opportunities and progress in their careers. We will support you with any grievances or challenges you may face during the engagement.