Machine Learning Platform, Distributed Systems, and AI Infrastructure
About
I like building the parts of AI systems that most people only notice when they fail: the infrastructure, deployment paths, inference layers, and platform pieces that keep models usable in production. My work sits across distributed systems, machine learning platforms, AI platforms, agentic platforms, and the practical details that make these systems reliable.
I currently work at Oracle and studied at Carnegie Mellon University and the Indian Institute of Technology (BHU). This site is where I keep the writing, research artifacts, experiments, talks, and patent work that connect back to the same thread: making ML/AI systems scale without losing sight of how they actually run.
Core Areas
- Distributed systems for ML/AI workloads
- Machine learning platforms
- Machine learning
- AI and agentic platforms
- LLM inference and serving infrastructure
- Cloud reliability, scale, and performance
Public Footprint
- Forbes Technology Council contributor
- Research article archived with DOI on Zenodo
- LinkedIn profile with 500+ connections
- Blog posts on GPT-2 internals, plus articles for Forbes and Oracle AI
- Patent filing on optimized model deployment
Research
Research Articles
- FailFast-Fargate: Predictive Container Restart Policies for SLO-Driven ECS/Fargate Services
FailFast-Fargate studies a proactive task replacement framework for Amazon ECS/Fargate services. The goal is to detect degradation before user-visible SLO violations occur, instead of waiting for hard failures or health-check breaches after error budgets have already been consumed.
The approach estimates short-horizon SLO risk from task telemetry and trace signals, then compares the likely cost of leaving a degrading task in service against the cost of replacing it. It is designed to work through standard ECS mechanisms without requiring changes to the underlying AWS ECS architecture.
Synthetic degradation experiments show reductions in error-budget burn and SLO impact while keeping restart rates controlled. The work frames predictive restart policies as a practical path toward self-healing, SLO-aware services on ECS/Fargate.
Keywords: ECS, Fargate, SLO, cloud computing, distributed systems, predictive restart, self-healing.
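The cost comparison described in the FailFast-Fargate summary above can be illustrated with a minimal sketch. This is not the article's implementation: the `TaskRisk` fields, the `should_replace` function, the `margin` parameter, and all numeric values are hypothetical, and the rule assumes that a short-horizon violation probability has already been estimated from telemetry by some other component.

```python
from dataclasses import dataclass


@dataclass
class TaskRisk:
    """Hypothetical per-task inputs; none of these names come from the article."""
    violation_probability: float  # estimated chance of an SLO violation in the short horizon (0..1)
    violation_cost: float         # assumed cost of a violation, e.g. error-budget units burned
    replacement_cost: float       # assumed cost of replacing the task, e.g. cold start and lost capacity


def should_replace(risk: TaskRisk, margin: float = 1.0) -> bool:
    """Return True when the expected cost of keeping the task exceeds the cost of replacing it.

    A margin above 1 makes the rule more conservative, which is one simple
    (assumed) way to keep restart rates from growing unchecked.
    """
    if not 0.0 <= risk.violation_probability <= 1.0:
        raise ValueError("violation_probability must be in [0, 1]")
    expected_cost_of_keeping = risk.violation_probability * risk.violation_cost
    return expected_cost_of_keeping > margin * risk.replacement_cost


# Illustrative values only; they are not results from the article.
healthy = TaskRisk(violation_probability=0.02, violation_cost=100.0, replacement_cost=5.0)
degrading = TaskRisk(violation_probability=0.30, violation_cost=100.0, replacement_cost=5.0)

print(should_replace(healthy))
print(should_replace(degrading))
print(should_replace(degrading, margin=10.0))
```

With these made-up numbers the healthy task is left in service, the degrading task is flagged for replacement, and raising `margin` suppresses the replacement again. In a real ECS/Fargate service the replacement itself would go through the standard ECS mechanisms the summary mentions; that part is deliberately left out of this sketch.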
Blog Posts
This is my corner of the internet where I nerd out about the stuff I love — systems, infrastructure, and AI. If it scales, breaks, or learns, I’m probably writing about it.
- Deconstruction Series #1: Rebuilding GPT-2 in Pure C
Welcome to the GPT-2 Deconstruction Series, a deep dive into how GPT-2 really works, built from the ground up in pure C. No Python. No PyTorch. No magic. Just raw logic, memory management, and the beauty (and pain) of doing everything yourself. Whether you’re here to learn how transformers tick or just enjoy bending C to your will, this is your guide to building GPT-2 step by step, from tokenization to text generation. Check out the GPT-2 C implementation: gpt2.c
- The New Frontier Of LLM Inference: Where The Next Tenfold Gains Will Come From
A Forbes Technology Council article on how brute-force scaling is giving way to inference engine improvements rooted in core computer systems design.
- Deploy an LLM on OCI Data Science with NVIDIA Triton
Oracle blog post on deploying a large language model using OCI Data Science and NVIDIA Triton.
Talks and Judging
Talks
- UCLA LA Hacks at Pauley Pavilion
Judged UCLA's flagship hackathon, reviewed demos from 1,000+ participants, and highlighted the Codebreaker winning team.
Patents
- Model Deployment System for Generating Optimized Models for a Target Environment
Patent application covering a deployment system that generates models optimized for a target environment. I am listed as an inventor alongside other inventors.