| Repository | Owner | Purpose |
|---|---|---|
| skytrax_extract_load | MarkPhamm | EL pipeline: scrapes 160K+ airline reviews, stages to S3, loads into Snowflake via Airflow |
| skytrax_transformation | MarkPhamm | dbt transformation into star schema with slim CI/CD, IaC, and hosted dbt docs |
| skytrax_data_cleaning | DucLe-2005 | Cleans raw scraped data and standardizes formats using modular Python functions |
| skytrax_dashboard_website | nguyentienTCU | Dashboard website for visualising insights from processed airline reviews |
| spirit_airlines_dashboard | MiaTran1112 | Mode dashboard analyzing Spirit Airlines customer satisfaction |
- Data Analysts: Trang Dam, Gwennie Nguyen, Jenny Tran, Mia Tran, Alyssa Le
- Data Engineers: Leonard Dau, Thieu Nguyen, Viet Lam Nguyen
- Software Engineers: Tien Nguyen, Anh Duc Le
- Data Scientists: Robin Tran, Trung Dam
- Scrum Master: Hien Dinh
An end-to-end analytics pipeline that ingests, transforms, and visualises customer review data for every airline on Skytrax (AirlineQuality.com). The project scrapes 160,000+ reviews, loads them into Snowflake, transforms them into a Kimball star schema with dbt, and serves insights through an interactive dashboard.
Self-selection bias: Skytrax reviews are self-reported. Passengers with extreme experiences are more likely to post, so KPIs skew away from the broader flying population. The goal is directional insight, not population-level generalisation.
airlinequality.com
| scrape (26 A-Z parallel tasks)
v
S3: raw/YYYY/MM/raw_data_YYYYMMDD.csv
| clean + transform
v
S3: processed/YYYY/MM/clean_data_YYYYMMDD.csv
| COPY INTO
v
Snowflake: SKYTRAX_REVIEWS_DB.RAW.AIRLINE_REVIEWS
| dbt source
v
staging -> intermediate -> marts (star schema)
|
v
Dashboard / BI Tools / RAG Chatbot
| Layer | Technology |
|---|---|
| Scraping | Python 3.12, BeautifulSoup, pandas |
| Orchestration | Apache Airflow (Astronomer Runtime, Docker) |
| Storage | AWS S3 (date-partitioned landing zone) |
| Data Warehouse | Snowflake |
| Transformation | dbt (dbt-snowflake), SQLFluff |
| IaC | Terraform (AWS S3/IAM/CloudFront/OIDC + Snowflake RBAC/schemas/warehouses) |
| CI/CD | GitHub Actions (slim CI with merge-base, defer/favor-state CD) |
| Auth | AWS IAM OIDC (keyless GitHub Actions) |
| Docs | dbt docs hosted on CloudFront + S3 |
| Dashboard | Next.js, TailwindCSS, Chart.js, LangChain, ChromaDB |
Repo: skytrax_reviews_extract_load
Three Airflow DAGs chained via Airflow Datasets (no cron, no polling):
| DAG | Trigger | What it does |
|---|---|---|
skytrax_crawl |
Daily 02:00 UTC | Scrapes reviews with 26 parallel A-Z tasks, uploads raw CSVs to S3 |
skytrax_process |
Dataset (raw) | Downloads raw CSVs, cleans/transforms, uploads processed CSVs |
skytrax_snowflake |
Dataset (processed) | Runs COPY INTO Snowflake for each review date |
- Incremental (daily): scrapes only yesterday's reviews. Each review date maps to exactly one CSV, so re-runs are idempotent.
- Bulk backfill: trigger with
full_scrape=Trueto scrape all historical reviews (back to 2010). Snowflake'sCOPY INTOtracks loaded files — no duplicates.
s3://skytrax-reviews-landing-<account-id>/
raw/YYYY/MM/raw_data_YYYYMMDD.csv
processed/YYYY/MM/clean_data_YYYYMMDD.csv
- Versioning enabled, AES256 encryption, lifecycle rules (Standard-IA after 30d, expire old versions after 90d), all public access blocked.
Column standardisation (snake_case), ISO 8601 dates, text cleaning, route parsing (origin/destination/connections), aircraft name normalisation, and numeric rating conversion.
Repo: skytrax_reviews_transformation
Follows Kimball methodology with deterministic surrogate keys. Grain: one row per customer review per flight.
| Model | Type | Description |
|---|---|---|
fct_review |
Fact | Review metrics, ratings, calculated averages, rating bands |
dim_customer |
Dimension | Reviewer identity and flight count |
dim_airline |
Dimension | Airline name |
dim_aircraft |
Dimension | Aircraft model, manufacturer, seat capacity |
dim_location |
Dimension | City + airport (role-playing: origin, destination, transit) |
dim_date |
Dimension | Calendar + fiscal dates (role-playing: submitted, flown) |
Live dbt Docs — auto-generated and hosted on CloudFront, updated on every deploy.
All managed by Terraform — users, roles, grants, schemas, warehouses. No manual setup.
| Schema | Purpose |
|---|---|
RAW |
Raw source data from EL pipeline |
SOURCE |
Staging views — 1:1 source mirrors |
INTERMEDIATE |
Cleaned/normalized business logic |
MARTS |
Star schema dims + facts |
STAGING |
CI scratch space |
DEV_* |
Per-user dev schemas for local development |
- Continuous Integration (PRs): merge-base state comparison — only changed models are linted (SQLFluff), compiled, run, and tested.
- Continuous Deployment (merge to
main):dbt build --select state:modified+ --defer --favor-staterebuilds only modified models + downstream. Uploads manifest, run_results, and dbt docs to S3. Invalidates CloudFront cache. - Keyless auth: GitHub Actions authenticates to AWS via OIDC — no static credentials.
Repo: skytrax_dashboard_website | Live Dashboard
- Interactive KPI Cards: overall satisfaction, NPS-like scores, category averages
- Multi-Dimensional Filters: airline, aircraft, route, cabin class, traveller type
- Data Explorer: drag-and-drop or SQL-like querying for power users
- RAG Chatbot: natural-language Q&A across the full corpus of reviews
- Ground-staff service and boarding efficiency dominate complaints across airlines
- Major hubs (LHR, CDG, JFK) see the highest negative volume — tied to long queues and staff shortages
- 92% of low-rating Economy reviews cite at-airport factors rather than in-flight experience
Recommendations: boost ground-handling staffing during peak waves, deploy self-service kiosks and real-time queue monitoring.
- Business & First passengers focus on seat comfort, bedding quality, and connectivity speed
- Consistency gaps between aircraft sub-fleets drive dissatisfaction
- Food quality is the second-largest driver of sub-4-star ratings
Recommendations: accelerate fleet-wide seat upgrades, introduce chef-curated rotating menus, guarantee minimum bandwidth per passenger.
- Expand Data Sources — integrate on-time performance and DOT complaint data
- Real-Time Ingestion — CDC-style pipelines to surface insights within hours of review publication
- Predictive Modelling — sentiment + operational variables to forecast NPS by airline and route
- Monetisation — benchmarking dashboards for airlines and airports via subscription
© 2025 Skytrax Global Airlines Analytics Project

