
Hi 👋, I'm Omkar Musale

Software Engineer II | Data Engineering | Apache Beam · Airflow · GCP

Building production-scale data pipelines that power AI & geospatial systems

🚀 About Me

🔭 Currently working on: End-to-end Apache Beam pipelines for a legal document intelligence platform, transforming millions of files into LLM-ready datasets
🌱 Learning: System Design at scale, FastAPI, advanced GCP architecture
💼 Experience: 3+ years at Infocusp Innovations in Data Engineering, across geospatial & document intelligence domains
📝 Writing: Data engineering, pipelines & GCP on Medium
🎯 Domains: Document AI data prep · Geospatial & satellite imagery · Pipeline orchestration · Cloud-native infra

💼 Work Experience

Software Engineer II | Infocusp Innovations

January 2026 – Present · Pune, India

Architected end-to-end Apache Beam pipelines for a legal document intelligence platform — ingesting millions of files (PDFs, emails, docs, sheets) and transforming them into structured, LLM-ready datasets with full metadata preservation
Designed and built core annotation extraction and PDF rendering libraries, enabling downstream AI systems to semantically understand document structure and content
Owned orchestration and scheduling architecture using Apache Airflow + GCP Cloud Composer, establishing reliable deployment patterns for production-scale enterprise pipelines
Defined data schemas and pipeline contracts with AI/ML teams, ensuring clean, correctly formatted data delivery for model consumption

Tech Stack: Python · Apache Beam · Airflow · GCP Cloud Composer · GCS · Pub/Sub · Docker

Software Engineer | Infocusp Innovations

July 2023 – December 2025 · Pune, India

Built scalable ingestion and processing pipelines for petabyte-scale geospatial datasets — satellite imagery, GeoTIFF files, spectral data, and environmental metrics from Google Earth Engine. Handled datasets exceeding 100GB per run
Designed an automated export pipeline using Cloud Scheduler + Cloud Workflows + Batch Jobs to deliver fresh geospatial datasets at regular intervals for downstream model training cycles
Achieved a 20% efficiency improvement by optimizing ingestion algorithms and refactoring pipeline logic using OOP design principles
Containerized distributed processing workloads using Kubernetes + Docker; provisioned all GCP infrastructure reproducibly via Terraform. Used GCS, Pub/Sub, and Cloud Logging for reliable, observable pipeline runs

Tech Stack: Python · GCP (GCS, Pub/Sub, Cloud Scheduler, Cloud Workflows, Cloud Batch, Kubernetes) · Docker · Terraform · Google Earth Engine API

Software Engineer Intern | Infocusp Innovations

January 2023 – June 2023 · Pune, India

Processed large-volume healthcare device data using Apache Spark (PySpark) on a distributed cluster (AWS Glue + Terraform), extracting structured insights from raw sensor streams
Built self-serve Streamlit dashboards for data visualization, reducing manual analysis time for the data science team

Tech Stack: PySpark · Apache Spark · AWS Glue · Streamlit · Terraform

🛠️ Technical Skills

Languages

Pipeline & Big Data

Cloud & DevOps

Tools

📊 Featured Projects

🕷️ IntelliScraper

Asynchronous web scraping library with anti-bot-detection evasion, built with Playwright

A production-ready Python library for scraping protected websites (job platforms, social networks, e-commerce) that bypasses anti-bot systems. Features session management, proxy support, and advanced HTML parsing. Published on PyPI with 2,000+ downloads.

Tech Stack: Python · Playwright · Asyncio · Bright Data Proxy

Highlights:

🔐 Session management with cookies and browser fingerprints for authenticated scraping
🛡️ Advanced anti-detection techniques to bypass bot protection systems
⚡ Fully asynchronous architecture for high-performance concurrent scraping
📦 Published open-source on PyPI with 2.08K+ downloads
🌐 Integrated proxy support (Bright Data) and CLI tool for session generation
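
The asynchronous, concurrent scraping pattern mentioned in the highlights can be sketched roughly as follows. This is a minimal illustration using only the standard library, not IntelliScraper's actual API: `fetch_page`, `scrape_all`, and `max_concurrency` are hypothetical names, and the fetch is a stub standing in for a real browser call.

```python
import asyncio


async def fetch_page(url: str) -> str:
    """Hypothetical stand-in for a real browser fetch (e.g. via Playwright)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"<html>{url}</html>"


async def scrape_all(urls: list[str], max_concurrency: int = 3) -> dict[str, str]:
    """Fetch all URLs concurrently with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> tuple[str, str]:
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            return url, await fetch_page(url)

    # gather schedules every fetch and returns results in input order.
    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(results)
```

Calling `asyncio.run(scrape_all(urls))` returns a mapping from each URL to its fetched HTML. Bounding concurrency with a semaphore is one common way to keep throughput high without overwhelming the target site or the proxy.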

🏆 Achievements & Certifications

🥈 2nd Place — Competitive Coding Competition, PCCOE (8.56 CGPA)
⭐ 5-Star Problem Solving — HackerRank
⭐ 5-Star C++ — HackerRank
⭐ 4-Star Competitive Programming — CodeChef
📜 Competitive Programming Essentials — Master Algorithms Certification
💻 Active on GeeksforGeeks · LeetCode

📝 Latest Blog Posts

I write about data engineering, pipeline architecture, and GCP, from building production-scale systems to the lessons learned along the way. Read on Medium

🤝 Connect With Me

💡 Open to collaborating on interesting data engineering, pipeline, or GCP-related projects
