Nick Gupta

Overview

As a Staff/Principal-level Machine Learning Engineer and technical lead, I design, ship, and operate production ML and GenAI systems end-to-end, turning ambiguous goals into measurable roadmaps, reliable architectures, and scalable execution. I hold a Bachelor's Degree in Computer Science from Columbia University in the City of New York. I am a U.S. citizen and do not require visa sponsorship.

I specialize in building high-impact, reusable foundations that make multiple teams faster: modern retrieval and ranking stacks, LLM/VLM-powered assistants, and distributed agentic systems with strong evaluation and safety guardrails. My work emphasizes practical optimization: cost-aware model routing, caching and batching, efficient serving (CPU/GPU), and compression techniques such as distillation and quantization to meet strict p95/p99 latency, reliability, and cost constraints in real production environments.

I am known for cross-functional leadership and technical decision-making that scales: aligning product, data, platform/SRE, privacy/security, and research stakeholders through crisp design docs, clear success metrics, and fast feedback loops. I value inclusive, high-trust teams and bring a calm, metrics-driven approach to building systems that are secure, observable, and maintainable over the long term.

My focus is generative and agentic AI: retrieval and reranking, RAG, tool-use orchestration, evaluation harnesses, and safety guardrails, paired with systems-level optimization (batching/caching, quantization/distillation, GPU efficiency, and cost-aware routing) to hit strict latency, reliability, and cost targets.

TECHNICAL SKILLS

Software
- Programming: Python, Java, C, C++, Rust, Ruby, SQL, NoSQL, RESTful APIs, GraphQL, unit and integration testing
- Application: React.js, JavaScript, TypeScript, Swift, HTML, CSS, Django, Flask, Node.js, Express, Selenium
- Cloud: GCP, AWS, Azure, DigitalOcean, Netlify
- Tools: Git, Vim, VS Code, Splunk, Confluence, Jira, Bitbucket, GitHub Actions, Docker, Kubernetes, Linux, Shell Scripting
Machine Learning
- Deep Learning: Agentic AI, PyTorch, TensorFlow, Keras, ML Recommender and Ranking Systems, Large Language Models (LLMs), Generative AI, RAG, LangChain, Multimodal Learning, Transformers, BERT, T5, Scikit-learn, NLP, NLTK, Knowledge Graphs
- Data: Pandas, NumPy, SciPy, PyTest, Spark, Hadoop, MapReduce, Tableau, Avro, Parquet, Data Parallelism, Model Parallelism, Hybrid Parallelism, Quantization
- MLOps: Ray, Slurm, Airflow, MLflow, AutoML, Continuous ML, YARN, Kubeflow, Jenkins, Argo, CircleCI, GPU Scaling, vLLM, Distillation
- ML Systems: GRPO, SWiRL, DPO, PPO, Kafka, ZooKeeper, etcd, SHAP, LIME, NVIDIA NeMo, NVIDIA Inference Server, CUDA

Education

Columbia University
Bachelor's Degree, Computer Science

Coursework

Artificial Intelligence with Python
Natural Language Processing with Python
Advanced Programming with C/C++
Algorithmic Trading with Python (audit)
Data Structures with Java
Cloud Computing and Big Data in AWS, GCP, and Azure with Python, JavaScript, HTML/CSS
Introduction to Cryptography
Fundamentals of Computer Systems
Linear Algebra
Building a Technology Startup
Computer Science Theory

Stanford University
Certificate, Machine Learning Specialization in Supervised, Unsupervised, and Advanced ML Algorithms

Coursework

Supervised Learning: Regression and Classification
- Build machine learning models in Python using popular machine learning libraries NumPy & scikit-learn
- Build & train supervised machine learning models for prediction & binary classification tasks, including linear regression & logistic regression
Unsupervised Learning: Clustering, Anomaly Detection, Recommender Systems, Deep Reinforcement Learning, Collaborative Filtering, Content-Based Deep Learning
- Use unsupervised learning techniques, including clustering and anomaly detection
- Build recommender systems with a collaborative filtering approach and a content-based deep learning method
- Build a deep reinforcement learning model
Advanced Machine Learning Algorithms: Multi-Class Classification in Neural Networks with TensorFlow, Best Practices in Machine Learning Development, Random Forests, Boosted Trees, Regression Trees, XGBoost
- Build and train a neural network with TensorFlow to perform multi-class classification
- Apply best practices for machine learning development so that your models generalize to data and tasks in the real world
- Build and use decision trees and tree ensemble methods, including random forests and boosted trees

Experience

NVIDIA
Senior/Staff Machine Learning Engineer

- Contract

Architected and shipped a Ray Jobs API distributed execution backend spanning three production codebases (NVIDIA NeMo-RL, NeMo-Skills, and the nvflow orchestration platform), enabling SFT and GRPO reinforcement-learning post-training to run on customer-managed Ray clusters for the first time (previously Slurm-only), unblocking a top-tier global investment bank's regulated, air-gapped on-prem deployment of NVIDIA's GenAI post-training stack
Proved the full SFT-to-GRPO RLVR pipeline end-to-end on multi-node [N]x H100/H200 GPUs - tensor-parallel policy, colocated vLLM generation, DTensor workers, and an LLM-as-judge reward loop - sustaining ~93 TFLOPS/GPU and landing trained Qwen3-class checkpoints with 100% unattended automation across a heterogeneous multi-cluster topology
Designed a heterogeneous multi-cluster Ray architecture (independent Python-runtime clusters) with automatic per-stage GPU/CPU cross-cluster routing, letting a single command fan an end-to-end RL workflow across clusters and cutting manual orchestration steps by 68%
Drove the open-source upstream contribution of the Ray execution backend into NVIDIA's NeMo-Skills project, clearing maintainer review with a guaranteed zero-behavior-change default path, adversarial multi-agent test coverage, and hundreds of regression tests proving byte-identical Slurm command emission - protecting 82+ existing production deployments from regression
Eliminated 100% of manual intervention from the RL run loop by engineering a shared-filesystem serve-reaping protocol, process-group GPU reclamation, and text diffusion decoding, collapsing a multi-step babysat process into a single unattended run and recovering ~74 GB of leaked GPU memory per failed attempt that previously OOM'd downstream training
Led security hardening of air-gapped enterprise containers to pass the customer's vulnerability scan gate, driving fix-available High/Critical CVEs from [~437] to 0, filing [~92] upstream security bugs, and standing up a repeatable scan-remediate-promote pipeline that produces signed, digest-pinned images compliant with the bank's air-gap and regulatory requirements
Validated cross-scheduler model-quality parity, reproducing evaluation benchmarks within ±2.5% pass@1 of the established baseline, and reduced judge-induced rollout failures by lowering rollout concurrency from 24 to 4 to eliminate rate-limit storms
Authored cold-start self-serve documentation and troubleshooting runbooks enabling the customer to run the platform end-to-end without NVIDIA support, accelerating the experimental-to-production transition and defining the Slurm-compatibility regression bar plus automated NSPECT/Trivy release gating

Amazon
Tech Lead Machine Learning Engineer in Artificial General Intelligence Customization (AGI-C) Team

- Full-time

Integrated GRPO, DPO, SWiRL, and RLVR into a unified RLHF stack to align LLM-generated inverse design code with verified metamaterial ground truth, enabling zero-shot generalization, physics-constrained reasoning, and unsupervised fine-tuning of instruction-following LLMs used by MIT researchers to accelerate synthesis validation for over 3 million material candidates
Engineered a batched, parallelized, and highly distributed reward evaluation pipeline across the 800 TB MetaGen materials database, reducing runtime from 82 hours to 57 seconds for over 10 million completions, with >98% throughput efficiency using PyTorch, Hugging Face TRL, and CUDA-aware sharded evaluation on multi-node clusters
Achieved a >5,000x speedup in inference-time reward computation with >92.3% top-1 structural match accuracy using RLVR-based relative scoring, enabling scalable RLHF-style fine-tuning on commodity 16 GB VRAM hardware via LoRA and 4-bit quantization, supporting models from 350M to 1.3B parameters
Developed a high-fidelity LLM evaluation framework for multi-turn scientific reasoning and code trace validation, analyzing over 2.1 billion tokens across 12 reasoning task types
Used Hopfield episodic memory layers, text diffusion decoding, and Tree of Thought prompting techniques to enforce logically consistent and unit-consistent reasoning chains, boosting multi-hop pass rate by 94% and trace accuracy by 88%
Improved inverse design success rate from 46% to 97%, while reducing GPU-hour cost per training cycle by ~98%, demonstrating the real-world feasibility of modern RLHF and alignment techniques in high-throughput scientific GenAI pipelines
Enabled unsupervised fine-tuning workflows across domains, using self-consistency checks, rule-based constraints, and reward function introspection to curate alignment signals without human labeling, automating preference modeling for LLM self-alignment at scale
Deployed Self-Adapting Language Models to dynamically adjust reasoning behavior and output formatting based on prompt context, reducing domain-specific hallucinations by 87% and improving structured generation pass@1 by 83% in physics and materials applications
