Building production-scale data pipelines that power AI & geospatial systems
- 🔭 Currently working on: End-to-end Apache Beam pipelines for a legal document intelligence platform, transforming millions of files into LLM-ready datasets
- 🌱 Learning: System Design at scale, FastAPI, advanced GCP architecture
- 💼 Experience: 3+ years in data engineering at Infocusp Innovations, across geospatial & document intelligence domains
- 📝 Writing: Data engineering, pipelines & GCP on Medium
- 🎯 Domains: Document AI data prep · Geospatial & satellite imagery · Pipeline orchestration · Cloud-native infra
January 2026 – Present · Pune, India
- Architected end-to-end Apache Beam pipelines for a legal document intelligence platform — ingesting millions of files (PDFs, emails, docs, sheets) and transforming them into structured, LLM-ready datasets with full metadata preservation
- Designed and built core annotation extraction and PDF rendering libraries, enabling downstream AI systems to semantically understand document structure and content
- Owned orchestration and scheduling architecture using Apache Airflow + GCP Cloud Composer, establishing reliable deployment patterns for production-scale enterprise pipelines
- Defined data schemas and pipeline contracts with AI/ML teams, ensuring clean, correctly formatted data delivery for model consumption
Tech Stack: Python · Apache Beam · Airflow · GCP Cloud Composer · GCS · Pub/Sub · Docker
July 2023 – December 2025 · Pune, India
- Built scalable ingestion and processing pipelines for petabyte-scale geospatial datasets (satellite imagery, GeoTIFF files, spectral data, and environmental metrics from Google Earth Engine), with individual runs exceeding 100 GB
- Designed an automated export pipeline using Cloud Scheduler + Cloud Workflows + Cloud Batch jobs to deliver fresh geospatial datasets at regular intervals for downstream model training cycles
- Achieved a 20% efficiency improvement by optimizing ingestion algorithms and refactoring pipeline logic using OOP design principles
- Containerized distributed processing workloads using Kubernetes + Docker; provisioned all GCP infrastructure reproducibly via Terraform. Used GCS, Pub/Sub, and Cloud Logging for reliable, observable pipeline runs
Tech Stack: Python · GCP (GCS, Pub/Sub, Cloud Scheduler, Cloud Workflows, Cloud Batch, Kubernetes) · Docker · Terraform · Google Earth Engine API
January 2023 – June 2023 · Pune, India
- Processed large-volume healthcare device data using Apache Spark (PySpark) on a distributed cluster (AWS Glue + Terraform), extracting structured insights from raw sensor streams
- Built self-serve Streamlit dashboards for data visualization, reducing manual analysis time for the data science team
Tech Stack: PySpark · Apache Spark · AWS Glue · Streamlit · Terraform
Asynchronous web scraping library with anti-bot-detection evasion, built with Playwright
A production-ready Python library for scraping protected websites (job platforms, social networks, e-commerce) that bypasses anti-bot systems. Features session management, proxy support, and advanced HTML parsing. Published on PyPI with 2,000+ downloads.
Tech Stack: Python · Playwright · Asyncio · Bright Data Proxy
Highlights:
- 🔐 Session management with cookies and browser fingerprints for authenticated scraping
- 🛡️ Advanced anti-detection techniques to bypass bot protection systems
- ⚡ Fully asynchronous architecture for high-performance concurrent scraping
- 📦 Published open-source on PyPI with 2,000+ downloads
- 🌐 Integrated proxy support (Bright Data) and CLI tool for session generation
- 🥈 2nd Place — Competitive Coding Competition, PCCOE (8.56 CGPA)
- ⭐ 5-Star Problem Solving — HackerRank
- ⭐ 5-Star C++ — HackerRank
- ⭐ 4-Star Competitive Programming — CodeChef
- 📜 Competitive Programming Essentials — Master Algorithms Certification
- 💻 Active on GeeksforGeeks · LeetCode
I write about data engineering, pipeline architecture, and GCP, covering everything from building production-scale systems to the lessons learned along the way. Read on Medium
💡 Open to collaborating on interesting data engineering, pipeline, or GCP-related projects

