pallav@devops:~$ whoami

Pallav

DevOps Team Lead

$ echo "25+ clusters, 70+ servers, 1 engineer who started it all from scratch"

Leading a team of 5, managing 25+ K8s clusters and 70+ servers across on-prem & cloud

💼 LinkedIn 📖 Blogs 📄 Resume

$ kubectl get nodes | wc -l

70+

$ kubectl get clusters --all

25 clusters ready

$ uptime

3 years, 0 regrets

$ cat /etc/motto

automate everything,

monitor the rest.

# narrator: the alerts still fire at 3 AM

// what_i_do

Technical Identity

🏗️

Infrastructure at Scale

25+ Kubernetes clusters, 70+ servers spanning on-premise bare metal and AWS EKS. Built from the first node to production-grade multi-cluster operations.

🎮

GPU Workload Orchestration

NVIDIA device plugin, GPU scheduling, spot instance management for AI/ML inference pipelines. Making expensive hardware earn its keep.
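For readers new to GPU scheduling, here is a minimal sketch of the core idea: once the NVIDIA device plugin is running, each GPU node advertises the extended resource `nvidia.com/gpu`, and a pod claims GPUs through its resource limits. The helper name, pod name, and image below are hypothetical placeholders for illustration, not taken from any real deployment.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that asks for `gpus` NVIDIA GPUs.

    `name` and `image` are hypothetical placeholders.
    """
    if gpus < 1:
        raise ValueError("a GPU pod needs at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are an extended resource and are requested via limits;
                    # the scheduler only places the pod on a node with enough free GPUs.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output can be saved and applied.
    manifest = gpu_pod_manifest("inference-demo", "example.com/inference:latest")
    print(json.dumps(manifest, indent=2))
```

Generating the manifest from code rather than hand-editing it is one way to keep the GPU count per workload explicit and reviewable.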

📊

Observability from Zero

Built the entire monitoring stack: Prometheus, Grafana, Loki, LibreNMS. Went from "the client told us it's down" to proactive incident detection.

🔄

GitOps & Automation

ArgoCD, Helm charts, Jenkins, GitHub Actions, Bash/Python automation. If it can be automated, it should be. If it can't, rewrite it until it can.

☁️

Cloud & Migrations

AWS (EKS, EC2, S3, IAM, VPC), on-prem to cloud migrations, capacity planning. Moved workloads to EKS in 72 hours, zero data loss.

// git log --oneline career.txt

Career Journey

May 2023

$ git init

DevOps Intern — Staqu Technologies. Deployed video analytics across 30 servers. Wrote the first automation script that actually saved time instead of creating more work.

Aug 2023

$ git commit -m "promoted"

DevOps Engineer. Started the Docker → Kubernetes → EKS migration path. Went from "what's a pod?" to managing multi-node clusters in production.

Dec 2025

$ git merge feature/leadership

DevOps Team Lead. Now leading a team of 5, managing all infrastructure operations — 25+ clusters, 70+ servers, and the occasional 3 AM alert that keeps things interesting.

// ls -la /projects/production/

Production Experience

OPERATION: SCALE-30

Large-Scale Government Deployment

Deployed an AI application across 30+ on-premise servers for 3+ government projects. Built the automation and monitoring layer from scratch.

Bash automation cut deployment from 2 days to half a day per server
First monitoring stack: Prometheus + Grafana with Slack alerting
30+ servers provisioned and maintained remotely

Bash, Prometheus, Grafana, Linux, On-Prem

OPERATION: MESH-7

Multi-Location Distributed Infrastructure

7 Kubernetes clusters across 7 physical locations with 35 servers. Each with unique configurations, all managed remotely.

7 clusters, 7 locations, 35 servers — each with unique configs
Remote management via SSH and AnyDesk
Zero on-site visits for routine operations

Kubernetes, SSH, AnyDesk, Multi-Cluster

OPERATION: GPU-SQUEEZE

GPU Cluster on Constrained Hardware

4-node cluster (2 GPU + 2 storage) on limited hardware. Made it work anyway, because "get better hardware" wasn't in the budget.

GPU workload scheduling with NVIDIA device plugin
Provisioned via Rancher on low-spec hardware
Custom health-check scripts for resource optimization

Rancher, NVIDIA, GPU, Kubernetes
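The health-check scripts themselves are not shown here, so the following is only a sketch of what one such check could look like, assuming the goal is to flag GPUs that are close to running out of memory. The `nvidia-smi` query flags are real; the function names and the 90% threshold are assumptions made for illustration.

```python
import subprocess

# nvidia-smi query returning one CSV row per GPU, without header or units
# (utilization in percent, memory in MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Turn nvidia-smi CSV rows into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = (field.strip() for field in line.split(","))
        stats.append(
            {
                "index": int(index),
                "util_pct": float(util),
                "mem_used_mib": float(used),
                "mem_total_mib": float(total),
            }
        )
    return stats


def gpus_over_memory(stats: list[dict], threshold: float = 0.9) -> list[int]:
    """Return indices of GPUs whose memory usage exceeds `threshold` (assumed 90%)."""
    return [
        s["index"]
        for s in stats
        if s["mem_used_mib"] / s["mem_total_mib"] > threshold
    ]


def read_gpu_stats() -> list[dict]:
    """Run nvidia-smi on the local node; requires the NVIDIA driver to be installed."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Keeping the parsing separate from the `nvidia-smi` call means the check can be exercised with canned output on a machine that has no GPU at all.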

OPERATION: CLOUD-72H

Cloud Migration in 72 Hours

Full migration from on-premise to AWS EKS in 3 days: BOQ (bill of quantities), provisioning, workload migration, DNS cutover, zero data loss.

Complete on-prem to EKS migration in 72 hours
Created BOQ, provisioned EKS, migrated all workloads
Zero data loss, minimal downtime

AWS, EKS, EC2, Migration, Kubernetes

OPERATION: EYES-OPEN

Observability Stack from Zero

Replaced the "client → CSM → manager → engineer" complaint chain with actual monitoring. Revolutionary concept, apparently.

Prometheus, Grafana, Loki, LibreNMS
Multiple exporters, Slack-integrated alerts
Went from reactive firefighting to proactive detection

Prometheus, Grafana, Loki, LibreNMS
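To make "Slack-integrated alerts" a little more concrete, here is a hedged sketch of turning an Alertmanager-style webhook body into a Slack message. Alertmanager also ships a native Slack receiver, and nothing here claims to be how this stack was actually wired; the function names and the message format are assumptions for illustration. Only the payload fields (`alerts`, `status`, `labels`, `annotations`) and the Slack incoming-webhook `text` field are standard.

```python
import json
import urllib.request


def slack_text(alert: dict) -> str:
    """Render one alert as a single human-readable line."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    status = alert.get("status", "unknown").upper()
    name = labels.get("alertname", "UnnamedAlert")
    instance = labels.get("instance", "unknown instance")
    summary = annotations.get("summary", "no summary provided")
    return f"[{status}] {name} on {instance}: {summary}"


def build_slack_payload(webhook_body: dict) -> dict:
    """Convert an Alertmanager-style webhook body into a Slack webhook payload."""
    lines = [slack_text(alert) for alert in webhook_body.get("alerts", [])]
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload to a Slack incoming webhook (the URL is supplied by the caller)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

The point of the one-line format is that whoever is on call can tell what broke and where without opening a dashboard first.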

// cat /etc/tech-stack.conf

Tech Stack

Cloud & Infrastructure

AWS, EKS, EC2, S3, IAM, VPC, Linux

Containers & Orchestration

Docker, Kubernetes, Helm

CI/CD & GitOps

ArgoCD, GitHub Actions, GitLab CI, Jenkins

Monitoring & Observability

Prometheus, Grafana, Loki, LibreNMS

IaC & Configuration

Terraform, Ansible

Databases

SQL, TimescaleDB, Redis

Other Tools

Bash, Python, Git, Trivy, Nginx, Rancher

// cat ~/blog/README.md

Knowledge Sharing

I write technical articles explaining DevOps concepts the way I wish someone had explained them to me, with real scenarios, actual commands, and the occasional production horror story.

Prometheus + Grafana Setup, Container Internals, OOMKilled Debugging, ArgoCD GitOps, Docker Image Optimization

Read the Blog →

// tail -f ~/learning.log

Currently Learning

🌐

Networking & Traffic Flow

Deep-diving into networking and traffic flow in distributed systems, because "it works on localhost" isn't a valid network architecture.

🛡️

Reliability Engineering

SLOs, observability practices, incident response frameworks. Building systems that fail gracefully instead of spectacularly.
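A small worked example of the arithmetic behind SLOs: an availability target implies an error budget, which is the amount of downtime the target still tolerates over a window. The sketch below assumes a simple time-based availability SLO; the function names and the 30-day default window are illustrative choices, not a standard.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over the window.

    `slo_target` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
```

For a 99.9% target over 30 days the budget works out to roughly 43.2 minutes, so a single 22-minute outage spends about half of it.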

Let's Talk Infrastructure