This project focuses on performing Exploratory Data Analysis (EDA) and feature engineering on flight pricing data to understand the factors that influence ticket prices and to prepare a clean, model-ready dataset for machine learning.
The emphasis is not only on transforming data but also on making justified, data-driven decisions from EDA, closely mirroring real-world data science workflows.
- Understand the distribution and behavior of flight prices
- Identify key factors affecting ticket pricing
- Perform EDA-driven feature engineering
- Prepare a clean dataset suitable for predictive modeling
The dataset contains information related to flight bookings, including:
- Airline – Name of the airline
- Source – City of departure
- Destination – City of arrival
- Date_of_Journey – Date of travel
- Route – Cities covered during the journey
- Dep_Time – Departure time
- Arrival_Time – Arrival time
- Duration – Total journey duration
- Total_Stops – Number of stops
- Additional_Info – Additional flight details
- Price – Ticket price (Target variable)
EDA was conducted to understand data structure, identify patterns, and guide preprocessing and feature engineering decisions.
- Dataset shape, data types, and missing value analysis
- Distribution and outlier analysis of flight prices
- Airline-wise and stop-wise price comparison
- Time-based analysis (journey month, weekday, departure and arrival hours)
- Investigation of missing values in critical columns
- Flight prices are right-skewed with realistic high-value outliers
- Airline and number of stops strongly influence ticket prices
- Clear seasonality is observed across journey months
- Time-of-day features show non-linear relationships with price
- Flight duration is an important feature but is stored as text and requires conversion
These insights directly informed feature engineering decisions.
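The checks listed above can be sketched in pandas roughly as follows. This is a minimal illustration rather than the notebook's actual code: the sample rows are made up, the `%d/%m/%Y` date format is an assumption, and only the column names come from the dataset description.

```python
import pandas as pd

# Made-up sample rows; only the column names mirror the dataset description.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways", "IndiGo", "Jet Airways", "Air India"],
    "Total_Stops": ["non-stop", "1 stop", "2 stops", "non-stop", "1 stop", None],
    "Date_of_Journey": ["24/03/2019", "01/05/2019", "09/06/2019",
                        "12/05/2019", "01/03/2019", "27/06/2019"],
    "Price": [3897, 7662, 13882, 4200, 11087, 6218],
})

# Structure: shape, data types, and missing values per column.
print(df.shape)
print(df.dtypes)
missing_counts = df.isna().sum()

# Target distribution: a positive skew indicates a long right tail.
price_skew = df["Price"].skew()

# IQR rule to flag unusually expensive tickets for manual inspection.
q1, q3 = df["Price"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
high_outliers = df[df["Price"] > upper_fence]

# Airline-wise and stop-wise comparison (median is robust to the skew).
median_by_airline = df.groupby("Airline")["Price"].median().sort_values(ascending=False)
median_by_stops = df.groupby("Total_Stops")["Price"].median()

# Time-based view: median price per journey month (assumed day/month/year format).
journey_date = pd.to_datetime(df["Date_of_Journey"], format="%d/%m/%Y")
median_by_month = df.groupby(journey_date.dt.month)["Price"].median()

print(missing_counts, price_skew, median_by_airline, median_by_month, sep="\n\n")
```

Medians are used for the group comparisons because a right-skewed target pulls group means toward the few expensive tickets.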
Feature engineering was performed in an iterative and EDA-driven manner to preserve meaningful information while making the data suitable for machine learning models.
- Extracted day, month, and weekday from journey date
- Extracted hour and minute features from departure and arrival times
- Converted flight duration into total minutes
- Encoded `Total_Stops` as an ordinal numerical feature
- Removed `Route` after extracting stop-related information to avoid redundancy
- Applied one-hot encoding to categorical variables
- Removed records with missing `Route` and `Total_Stops` due to logical dependency
All transformations were justified based on exploratory analysis.
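A rough sketch of these transformations on a few made-up rows is shown below. The text formats (`24/03/2019`, `22:20`, `2h 50m`, `1 stop`), the new column names, and the helper `duration_to_minutes` are assumptions made for illustration; the notebook may handle them differently. Arrival time would follow the same pattern as departure time.

```python
import re
import pandas as pd

# Made-up rows; value formats are assumptions, column names follow the dataset description.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo", "Air India"],
    "Date_of_Journey": ["24/03/2019", "01/05/2019", "09/06/2019", "12/05/2019"],
    "Route": ["DEL → BOM", "CCU → DEL → BLR", "DEL → HYD → BOM → COK", None],
    "Dep_Time": ["22:20", "05:50", "09:25", "18:05"],
    "Duration": ["2h 50m", "7h 25m", "19h", "45m"],
    "Total_Stops": ["non-stop", "1 stop", "2 stops", None],
    "Price": [3897, 7662, 13882, 6218],
})

# Drop records where Route or Total_Stops is missing.
df = df.dropna(subset=["Route", "Total_Stops"]).reset_index(drop=True)

# Journey date -> day, month, weekday (Monday = 0).
journey = pd.to_datetime(df["Date_of_Journey"], format="%d/%m/%Y")
df["Journey_Day"] = journey.dt.day
df["Journey_Month"] = journey.dt.month
df["Journey_Weekday"] = journey.dt.weekday

# Departure time -> hour and minute.
dep = pd.to_datetime(df["Dep_Time"], format="%H:%M")
df["Dep_Hour"] = dep.dt.hour
df["Dep_Minute"] = dep.dt.minute


def duration_to_minutes(text: str) -> int:
    """Convert strings such as '2h 50m', '19h', or '45m' to total minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


df["Duration_Minutes"] = df["Duration"].apply(duration_to_minutes)

# Ordinal encoding: more stops maps to a larger integer (label strings are assumed).
stop_order = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}
df["Total_Stops"] = df["Total_Stops"].map(stop_order)

# Drop the raw text columns, then one-hot encode the remaining categoricals.
df = df.drop(columns=["Date_of_Journey", "Dep_Time", "Duration", "Route"])
df = pd.get_dummies(df, columns=["Airline"], dtype=int)

print(df)
```

An explicit mapping is used for `Total_Stops` because the stop count has a natural order, whereas airline names do not and are therefore one-hot encoded.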
- Contains only numeric, model-ready features
- No missing values
- Suitable for regression-based machine learning models
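The first two properties can be checked programmatically before modeling. The helper below, `check_model_ready`, is a hypothetical name introduced here for illustration and is not part of the notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_model_ready(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the frame passes."""
    problems = []
    # Every column must be numeric (integers, floats, or booleans).
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    # No column may contain missing values.
    with_missing = [col for col in df.columns if df[col].isna().any()]
    if with_missing:
        problems.append(f"columns with missing values: {with_missing}")
    return problems


# Illustrative frames: one clean, one with a text column and a missing value.
clean = pd.DataFrame({"Total_Stops": [0, 1], "Duration_Minutes": [170, 445], "Price": [3897, 7662]})
dirty = pd.DataFrame({"Airline": ["IndiGo", "Air India"], "Price": [3897.0, None]})

print(check_model_ready(clean))
print(check_model_ready(dirty))
```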
Raw Data
↓
Initial EDA
↓
Key Insights
↓
Feature Engineering
↓
Final Clean Dataset
This workflow ensures that preprocessing decisions are transparent, explainable,
and data-driven.
Repository Structure
├── EDA_and_Feature_Engineering_Flight_Price_Prediction.ipynb
├── README.md
└── data/
    └── flight_price_dataset.xlsx
Future Scope
- Train and evaluate machine learning models
- Perform feature importance analysis
- Hyperparameter tuning
- Model deployment using a simple web interface
👤 Author
Himanchal Mishra
Engineering Student | Data Analytics Enthusiast