pallav@devops:~$ whoami

Pallav

DevOps Team Lead

$ echo "25+ clusters, 70+ servers, 1 engineer who started it all from scratch"
Leading a team of 5. Managing 25+ K8s clusters and 70+ servers across on-prem & cloud.
pallav@devops:~
$ kubectl get nodes | wc -l
70+
$ kubectl get clusters --all
25 clusters ready
$ uptime
3 years, 0 regrets
$ cat /etc/motto
automate everything,
monitor the rest.
# narrator: the alerts still fire at 3 AM
// what_i_do

Technical Identity

🏗️

Infrastructure at Scale

25+ Kubernetes clusters, 70+ servers spanning on-premise bare metal and AWS EKS. Built from the first node to production-grade multi-cluster operations.

🎮

GPU Workload Orchestration

NVIDIA device plugin, GPU scheduling, spot instance management for AI/ML inference pipelines. Making expensive hardware earn its keep.

📊

Observability from Zero

Built the entire monitoring stack: Prometheus, Grafana, Loki, LibreNMS. Went from "the client told us it's down" to proactive incident detection.

🔄

GitOps & Automation

ArgoCD, Helm charts, Jenkins, GitHub Actions, Bash/Python automation. If it can be automated, it should be. If it can't, rewrite it until it can.

☁️

Cloud & Migrations

AWS (EKS, EC2, S3, IAM, VPC), on-prem to cloud migrations, capacity planning. Moved workloads to EKS in 72 hours with zero data loss.

// git log --oneline career.txt

Career Journey

May 2023
$ git init
DevOps Intern — Staqu Technologies. Deployed video analytics across 30 servers. Wrote the first automation script that actually saved time instead of creating more work.
Aug 2023
$ git commit -m "promoted"
DevOps Engineer. Started the Docker → Kubernetes → EKS migration path. Went from "what's a pod?" to managing multi-node clusters in production.
Dec 2025
$ git merge feature/leadership
DevOps Team Lead. Now leading a team of 5, managing all infrastructure operations — 25+ clusters, 70+ servers, and the occasional 3 AM alert that keeps things interesting.
// ls -la /projects/production/

Production Experience

OPERATION: SCALE-30

Large-Scale Government Deployment

Deployed an AI application across 30+ on-premise servers for 3+ government projects. Built the automation and monitoring layer from scratch.

  • Bash automation cut deployment from 2 days to half a day per server
  • First monitoring stack: Prometheus + Grafana with Slack alerting
  • 30+ servers provisioned and maintained remotely
Bash Prometheus Grafana Linux On-Prem
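
How the Slack alerting was wired is not stated above, and Alertmanager ships its own Slack receiver. As an illustration only, here is a minimal Python sketch of the other common shape: a small relay that turns an Alertmanager webhook payload into a Slack incoming-webhook message body. The function name and the sample alert (`NodeDown`, `srv-01:9100`) are hypothetical.

```python
import json


def alertmanager_to_slack(payload: dict) -> dict:
    """Convert an Alertmanager webhook payload into a Slack message body.

    Assumes the standard webhook shape: a top-level "alerts" list whose
    items carry "status", "labels" and "annotations".
    """
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed alert")
        instance = labels.get("instance", "unknown instance")
        line = f"[{status}] {name} on {instance}"
        summary = annotations.get("summary", "")
        if summary:
            line += f": {summary}"
        lines.append(line)
    # Slack incoming webhooks accept a JSON body with a "text" field.
    text = "\n".join(lines) or "Alertmanager webhook received with no alerts"
    return {"text": text}


# Hypothetical sample payload, for illustration only.
sample = {
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "NodeDown", "instance": "srv-01:9100"},
            "annotations": {"summary": "node exporter unreachable"},
        }
    ]
}
print(json.dumps(alertmanager_to_slack(sample), indent=2))
```

Keeping the formatting in a pure function means the message layout can be tested without a live Slack workspace or a running Alertmanager.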
OPERATION: MESH-7

Multi-Location Distributed Infrastructure

7 Kubernetes clusters across 7 physical locations with 35 servers. Each with unique configurations, all managed remotely.

  • 7 clusters, 7 locations, 35 servers — each with unique configs
  • Remote management via SSH and AnyDesk
  • Zero on-site visits for routine operations
Kubernetes SSH AnyDesk Multi-Cluster
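
One way to keep an eye on several clusters from a single workstation is to loop over kubeconfig contexts. The sketch below is an assumption about how that could look, not a description of the tooling actually used: it assumes each cluster is reachable as a named context, and the context names (`site-01`, `site-02`) are hypothetical. It passes `--no-headers` so the header row is not counted as a node.

```python
import subprocess


def nodes_command(context: str) -> list[str]:
    """Build the kubectl invocation that lists nodes for one context."""
    return ["kubectl", "--context", context, "get", "nodes", "--no-headers"]


def count_nodes(kubectl_output: str) -> int:
    """Count nodes: one non-empty line per node when headers are suppressed."""
    return sum(1 for line in kubectl_output.splitlines() if line.strip())


def run_kubectl(cmd: list[str]) -> str:
    """Run a command and return its stdout; raises if kubectl exits non-zero."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def nodes_per_cluster(contexts, run=run_kubectl) -> dict:
    """Map each context name to its node count.

    `run` is injectable so the logic can be exercised without a real cluster.
    """
    return {ctx: count_nodes(run(nodes_command(ctx))) for ctx in contexts}
```

Against real clusters this would be called as `nodes_per_cluster(["site-01", "site-02"])`; in tests, a stub can be passed as `run` in place of the real kubectl call.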
OPERATION: GPU-SQUEEZE

GPU Cluster on Constrained Hardware

4-node cluster (2 GPU + 2 storage) on limited hardware. Made it work anyway, because "get better hardware" wasn't in the budget.

  • GPU workload scheduling with NVIDIA device plugin
  • Provisioned via Rancher on low-spec hardware
  • Custom health-check scripts for resource optimization
Rancher NVIDIA GPU Kubernetes
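
With the NVIDIA device plugin running, GPUs appear to the scheduler as the extended resource `nvidia.com/gpu`, which a container requests under `resources.limits` in whole numbers. A minimal sketch of generating such a Pod manifest in Python follows; the helper name, pod name, and image are hypothetical placeholders, not the actual workloads.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests whole GPUs via the device plugin."""
    if gpus < 1:
        raise ValueError("gpus must be a positive whole number")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested in limits; the scheduler then places
                    # the pod only on a node that advertises free GPUs.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical pod name and image, for illustration only.
manifest = gpu_pod_manifest("inference", "registry.example.com/inference:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.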
OPERATION: CLOUD-72H

Cloud Migration in 72 Hours

Full migration from on-premise to AWS EKS in 3 days. Covered the BOQ (bill of quantities), provisioning, workload migration, and DNS cutover, with zero data loss.

  • Complete on-prem to EKS migration in 72 hours
  • Created BOQ, provisioned EKS, migrated all workloads
  • Zero data loss, minimal downtime
AWS EKS EC2 Migration Kubernetes
OPERATION: EYES-OPEN

Observability Stack from Zero

Replaced the "client → CSM → manager → engineer" complaint chain with actual monitoring. Revolutionary concept, apparently.

  • Prometheus, Grafana, Loki, LibreNMS
  • Multiple exporters, Slack-integrated alerts
  • Went from reactive firefighting to proactive detection
Prometheus Grafana Loki LibreNMS
// cat /etc/tech-stack.conf

Tech Stack

Cloud & Infrastructure
AWS, EKS, EC2, S3, IAM, VPC, Linux
Containers & Orchestration
Docker, Kubernetes, Helm
CI/CD & GitOps
ArgoCD, GitHub Actions, GitLab CI, Jenkins
Monitoring & Observability
Prometheus, Grafana, Loki, LibreNMS
IaC & Configuration
Terraform, Ansible
Databases
SQL, TimescaleDB, Redis
Other Tools
Bash, Python, Git, Trivy, Nginx, Rancher
// cat ~/blog/README.md

Knowledge Sharing

I write technical articles explaining DevOps concepts the way I wish someone had explained them to me: with real scenarios, actual commands, and the occasional production horror story.

Read the Blog →
📝
// tail -f ~/learning.log

Currently Learning

🌐

Networking & Traffic Flow

Deep-diving into networking and traffic flow in distributed systems, because "it works on localhost" isn't a valid network architecture.

🛡️

Reliability Engineering

SLOs, observability practices, incident response frameworks. Building systems that fail gracefully instead of spectacularly.
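
For a sense of what an SLO means in practice, the error budget is the downtime a target still allows over a window. A small sketch of that arithmetic, assuming an availability SLO measured over a rolling window (the 99.9% target and 30-day window are example values, not stated commitments):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO allows."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


# Example values only: a 99.9% target over a 30-day window.
print(f"{error_budget_minutes(99.9):.1f} minutes of budget")
```

Each additional nine shrinks the budget tenfold, which is why the target needs to be chosen before the alerting thresholds.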

pallav@devops:~$ ping pallav

Let's Talk Infrastructure

Open to Senior DevOps / SRE opportunities. If you need someone who builds infrastructure from scratch and keeps it running, let's connect.