Skip to content

MarkPhamm/skytrax_reviews

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

407 Commits
 
 
 
 

Repository files navigation

Skytrax Global Airlines Analytics Project

image

Repositories

Repository Owner Purpose
skytrax_extract_load MarkPhamm EL pipeline: scrapes 160K+ airline reviews, stages to S3, loads into Snowflake via Airflow
skytrax_transformation MarkPhamm dbt transformation into star schema with slim CI/CD, IaC, and hosted dbt docs
skytrax_data_cleaning DucLe-2005 Cleans raw scraped data and standardizes formats using modular Python functions
skytrax_dashboard_website nguyentienTCU Dashboard website for visualising insights from processed airline reviews
spirit_airlines_dashboard MiaTran1112 Mode dashboard analyzing Spirit Airlines customer satisfaction

Team

Leadership

Members


Project Overview

An end-to-end analytics pipeline that ingests, transforms, and visualises customer review data for every airline on Skytrax (AirlineQuality.com). The project scrapes 160,000+ reviews, loads them into Snowflake, transforms them into a Kimball star schema with dbt, and serves insights through an interactive dashboard.

Self-selection bias: Skytrax reviews are self-reported. Passengers with extreme experiences are more likely to post, so KPIs skew away from the broader flying population. The goal is directional insight, not population-level generalisation.


Architecture

airlinequality.com
        | scrape (26 A-Z parallel tasks)
        v
S3: raw/YYYY/MM/raw_data_YYYYMMDD.csv
        | clean + transform
        v
S3: processed/YYYY/MM/clean_data_YYYYMMDD.csv
        | COPY INTO
        v
Snowflake: SKYTRAX_REVIEWS_DB.RAW.AIRLINE_REVIEWS
        | dbt source
        v
staging -> intermediate -> marts (star schema)
        |
        v
Dashboard / BI Tools / RAG Chatbot

Stack

Layer Technology
Scraping Python 3.12, BeautifulSoup, pandas
Orchestration Apache Airflow (Astronomer Runtime, Docker)
Storage AWS S3 (date-partitioned landing zone)
Data Warehouse Snowflake
Transformation dbt (dbt-snowflake), SQLFluff
IaC Terraform (AWS S3/IAM/CloudFront/OIDC + Snowflake RBAC/schemas/warehouses)
CI/CD GitHub Actions (slim CI with merge-base, defer/favor-state CD)
Auth AWS IAM OIDC (keyless GitHub Actions)
Docs dbt docs hosted on CloudFront + S3
Dashboard Next.js, TailwindCSS, Chart.js, LangChain, ChromaDB

Part 1: Extract & Load

Repo: skytrax_reviews_extract_load

Pipeline

Three Airflow DAGs chained via Airflow Datasets (no cron, no polling):

DAG Trigger What it does
skytrax_crawl Daily 02:00 UTC Scrapes reviews with 26 parallel A-Z tasks, uploads raw CSVs to S3
skytrax_process Dataset (raw) Downloads raw CSVs, cleans/transforms, uploads processed CSVs
skytrax_snowflake Dataset (processed) Runs COPY INTO Snowflake for each review date

Loading Strategy

  • Incremental (daily): scrapes only yesterday's reviews. Each review date maps to exactly one CSV, so re-runs are idempotent.
  • Bulk backfill: trigger with full_scrape=True to scrape all historical reviews (back to 2010). Snowflake's COPY INTO tracks loaded files — no duplicates.

S3 Layout

s3://skytrax-reviews-landing-<account-id>/
  raw/YYYY/MM/raw_data_YYYYMMDD.csv
  processed/YYYY/MM/clean_data_YYYYMMDD.csv
  • Versioning enabled, AES256 encryption, lifecycle rules (Standard-IA after 30d, expire old versions after 90d), all public access blocked.

Data Cleaning

Column standardisation (snake_case), ISO 8601 dates, text cleaning, route parsing (origin/destination/connections), aircraft name normalisation, and numeric rating conversion.


Part 2: Transformation & CI/CD

Repo: skytrax_reviews_transformation

Data Model (Star Schema)

Follows Kimball methodology with deterministic surrogate keys. Grain: one row per customer review per flight.

Model Type Description
fct_review Fact Review metrics, ratings, calculated averages, rating bands
dim_customer Dimension Reviewer identity and flight count
dim_airline Dimension Airline name
dim_aircraft Dimension Aircraft model, manufacturer, seat capacity
dim_location Dimension City + airport (role-playing: origin, destination, transit)
dim_date Dimension Calendar + fiscal dates (role-playing: submitted, flown)

Live dbt Docs — auto-generated and hosted on CloudFront, updated on every deploy.

Snowflake Infrastructure

All managed by Terraform — users, roles, grants, schemas, warehouses. No manual setup.

Schema Purpose
RAW Raw source data from EL pipeline
SOURCE Staging views — 1:1 source mirrors
INTERMEDIATE Cleaned/normalized business logic
MARTS Star schema dims + facts
STAGING CI scratch space
DEV_* Per-user dev schemas for local development

CI/CD

  • Continuous Integration (PRs): merge-base state comparison — only changed models are linted (SQLFluff), compiled, run, and tested.
  • Continuous Deployment (merge to main): dbt build --select state:modified+ --defer --favor-state rebuilds only modified models + downstream. Uploads manifest, run_results, and dbt docs to S3. Invalidates CloudFront cache.
  • Keyless auth: GitHub Actions authenticates to AWS via OIDC — no static credentials.

Part 3: Visualisation

Repo: skytrax_dashboard_website | Live Dashboard

  • Interactive KPI Cards: overall satisfaction, NPS-like scores, category averages
  • Multi-Dimensional Filters: airline, aircraft, route, cabin class, traveller type
  • Data Explorer: drag-and-drop or SQL-like querying for power users
  • RAG Chatbot: natural-language Q&A across the full corpus of reviews

Key Business Insights

Economy-Class Trends

Economy Insights

  • Ground-staff service and boarding efficiency dominate complaints across airlines
  • Major hubs (LHR, CDG, JFK) see the highest negative volume — tied to long queues and staff shortages
  • 92% of low-rating Economy reviews cite at-airport factors rather than in-flight experience

Economy Recommendations

Recommendations: boost ground-handling staffing during peak waves, deploy self-service kiosks and real-time queue monitoring.

Premium-Cabin Expectations

  • Business & First passengers focus on seat comfort, bedding quality, and connectivity speed
  • Consistency gaps between aircraft sub-fleets drive dissatisfaction
  • Food quality is the second-largest driver of sub-4-star ratings

Recommendations: accelerate fleet-wide seat upgrades, introduce chef-curated rotating menus, guarantee minimum bandwidth per passenger.


Next Steps

  1. Expand Data Sources — integrate on-time performance and DOT complaint data
  2. Real-Time Ingestion — CDC-style pipelines to surface insights within hours of review publication
  3. Predictive Modelling — sentiment + operational variables to forecast NPS by airline and route
  4. Monetisation — benchmarking dashboards for airlines and airports via subscription

© 2025 Skytrax Global Airlines Analytics Project

About

End-to-end ELT pipeline for 160K+ Skytrax airline reviews: Airflow orchestration, BeautifulSoup scraping, S3 staging, Snowflake warehouse, dbt star schema transformation, Terraform IaC, GitHub Actions CI/CD, and Next.js dashboard with LangChain RAG chatbot.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors