omkarmusale0910/README.md

Hi 👋, I'm Omkar Musale

Software Engineer II  |  Data Engineering  |  Apache Beam · Airflow · GCP

Building production-scale data pipelines that power AI & geospatial systems

LinkedIn Medium Email


🚀 About Me

  • 🔭 Currently working on: End-to-end Apache Beam pipelines for a legal document intelligence platform, transforming millions of files into LLM-ready datasets
  • 🌱 Learning: System Design at scale, FastAPI, advanced GCP architecture
  • 💼 Experience: 3+ years in data engineering at Infocusp Innovations, across geospatial & document intelligence domains
  • 📝 Writing: Data engineering, pipelines & GCP on Medium
  • 🎯 Domains: Document AI data prep · Geospatial & satellite imagery · Pipeline orchestration · Cloud-native infra

💼 Work Experience

Software Engineer II | Infocusp Innovations

January 2026 – Present · Pune, India

  • Architected end-to-end Apache Beam pipelines for a legal document intelligence platform — ingesting millions of files (PDFs, emails, docs, sheets) and transforming them into structured, LLM-ready datasets with full metadata preservation
  • Designed and built core annotation extraction and PDF rendering libraries, enabling downstream AI systems to semantically understand document structure and content
  • Owned orchestration and scheduling architecture using Apache Airflow + GCP Cloud Composer, establishing reliable deployment patterns for production-scale enterprise pipelines
  • Defined data schemas and pipeline contracts with AI/ML teams, ensuring clean, correctly formatted data delivery for model consumption

Tech Stack: Python · Apache Beam · Airflow · GCP Cloud Composer · GCS · Pub/Sub · Docker


Software Engineer | Infocusp Innovations

July 2023 – December 2025 · Pune, India

  • Built scalable ingestion and processing pipelines for petabyte-scale geospatial datasets — satellite imagery, GeoTIFF files, spectral data, and environmental metrics from Google Earth Engine. Handled datasets exceeding 100GB per run
  • Designed an automated export pipeline using Cloud Scheduler + Cloud Workflows + Batch Jobs to deliver fresh geospatial datasets at regular intervals for downstream model training cycles
  • Achieved a 20% efficiency improvement by optimizing ingestion algorithms and refactoring pipeline logic using OOP design principles
  • Containerized distributed processing workloads using Kubernetes + Docker; provisioned all GCP infrastructure reproducibly via Terraform. Used GCS, Pub/Sub, and Cloud Logging for reliable, observable pipeline runs

Tech Stack: Python · GCP (GCS, Pub/Sub, Cloud Scheduler, Cloud Workflows, Cloud Batch, Kubernetes) · Docker · Terraform · Google Earth Engine API


Software Engineer Intern | Infocusp Innovations

January 2023 – June 2023 · Pune, India

  • Processed large-volume healthcare device data using Apache Spark (PySpark) on a distributed cluster (AWS Glue + Terraform), extracting structured insights from raw sensor streams
  • Built self-serve Streamlit dashboards for data visualization, reducing manual analysis time for the data science team

Tech Stack: PySpark · Apache Spark · AWS Glue · Streamlit · Terraform


🛠️ Technical Skills

Languages

Python C++ Bash

Pipeline & Big Data

Apache Beam Apache Airflow Apache Spark

Cloud & DevOps

GCP Cloud Composer Pub/Sub Kubernetes Docker Terraform

Tools

Git Linux FastAPI


📊 Featured Projects

IntelliScraper (PyPI · GitHub)

An asynchronous web scraping library with anti-bot-detection evasion, built with Playwright

A production-ready Python library for scraping protected websites (job platforms, social networks, e-commerce) that bypasses anti-bot systems. Features session management, proxy support, and advanced HTML parsing. Published on PyPI with 2,000+ downloads.

Tech Stack: Python · Playwright · Asyncio · Bright Data Proxy

Highlights:

  • 🔐 Session management with cookies and browser fingerprints for authenticated scraping
  • 🛡️ Advanced anti-detection techniques to bypass bot protection systems
  • ⚡ Fully asynchronous architecture for high-performance concurrent scraping
  • 📦 Published open-source on PyPI with 2,000+ downloads
  • 🌐 Integrated proxy support (Bright Data) and CLI tool for session generation
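
The "fully asynchronous" highlight can be illustrated with a minimal sketch of bounded concurrent fetching using only `asyncio`. This is not IntelliScraper's API: `fetch_page`, `scrape_all`, and `max_concurrency` are hypothetical names, and the fetch is a stub standing in for a real browser call (for example, via Playwright).

```python
import asyncio


async def fetch_page(url: str) -> str:
    """Hypothetical stand-in for a real page fetch (e.g. a Playwright call)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"<html>{url}</html>"


async def scrape_all(urls: list[str], max_concurrency: int = 3) -> dict[str, str]:
    """Fetch all URLs concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> tuple[str, str]:
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            return url, await fetch_page(url)

    # gather() schedules every task and returns results in input order.
    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(results)


if __name__ == "__main__":
    demo_urls = [f"https://example.com/page/{i}" for i in range(5)]
    pages = asyncio.run(scrape_all(demo_urls))
    print(f"fetched {len(pages)} pages")
```

Capping concurrency with a semaphore keeps the scraper from opening too many pages at once, which matters when each fetch holds a browser context or a proxy connection.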

🏆 Achievements & Certifications

  • 🥈 2nd Place — Competitive Coding Competition, PCCOE (8.56 CGPA)
  • 5-Star Problem Solving — HackerRank
  • 5-Star C++ — HackerRank
  • 4-Star Competitive Programming — CodeChef
  • 📜 Competitive Programming Essentials — Master Algorithms Certification
  • 💻 Active on GeeksforGeeks · LeetCode

📝 Latest Blog Posts

I write about data engineering, pipeline architecture, and GCP, from building production-scale systems to the lessons learned along the way. Read on Medium.


🤝 Connect With Me

LinkedIn Medium CodeChef HackerRank LeetCode


💡 Open to collaborating on interesting data engineering, pipeline, or GCP-related projects

Pinned

  1. IntelliScraper (Public)

    A powerful, anti-bot detection web scraping solution built with Playwright, designed for scraping protected sites like LinkedIn and other platforms that require authentication. Features session man…

    Python 3

  2. pdfnotes (Public)

    Extract PDF comments and highlights for humans and AI

    Makefile