Design and implement a data warehouse to manage automobile accident cases across 49 US states, using a star schema on Snowflake for the data warehouse architecture.
- Data Source: This project uses two datasets from Kaggle: US Accidents (2016 - 2023) and Traffic Accidents and Vehicles.
  - US Accidents (2016 - 2023): a countrywide car accident dataset covering 49 states of the USA; the accident data were collected from February 2016 to March 2023.
  - Traffic Accidents and Vehicles: each line in the file represents the involvement of a unique vehicle in a unique traffic accident, with various vehicle and passenger properties as columns.
- Extract Data: Data is extracted from CSV files, then ingested into the bronze folder of the MinIO data lake using Python and Airflow.
- Transform Data: Data is retrieved from MinIO's bronze directory using Spark and FastAPI for transformation and cleaning; the output is loaded into MinIO's silver directory.
- Load Data: Once the data has been cleaned, we load it into the Snowflake data warehouse's staging schema using Python and Airflow.
- Warehouse: With the data in Snowflake's staging schema, the warehouse is built and deployed with a star schema architecture by creating dimension and fact tables; dbt is used to transform and test the data.
- Serving: Analyze the data to improve road safety: identify high-risk accident areas to implement preventative measures, and identify factors that contribute to accidents (weather, road conditions, human error). Then visualize and create reports with Power BI.
- Package and Orchestration: Components are packaged using Docker and orchestrated using Apache Airflow.
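As a sketch of the extract step, the following shows the kind of Python callable an Airflow `PythonOperator` could run to upload a raw CSV into MinIO's bronze zone. The bucket name, key layout, and environment-variable names are illustrative assumptions, not the project's actual configuration:

```python
"""Hypothetical Airflow task callable: ingest a raw CSV into MinIO's bronze zone."""
import os

BUCKET = "datalake"  # assumed bucket name


def bronze_key(dataset: str, ds: str) -> str:
    """Build the bronze-zone object key, partitioned by execution date (ds)."""
    return f"bronze/{dataset}/{ds}/{dataset}.csv"


def ingest_csv_to_bronze(local_path: str, dataset: str, ds: str) -> str:
    """Upload one CSV into the bronze folder; returns the object key."""
    # Imported inside the callable (an Airflow convention) so the scheduler
    # can parse the DAG file without the MinIO client installed.
    from minio import Minio

    client = Minio(
        os.environ.get("MINIO_ENDPOINT", "minio:9000"),
        access_key=os.environ.get("MINIO_ACCESS_KEY"),
        secret_key=os.environ.get("MINIO_SECRET_KEY"),
        secure=False,
    )
    if not client.bucket_exists(BUCKET):
        client.make_bucket(BUCKET)
    key = bronze_key(dataset, ds)
    client.fput_object(BUCKET, key, local_path)
    return key
```

In the DAG this callable would be wrapped in a `PythonOperator`, one task per source CSV.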
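The transform step might look like this PySpark sketch, which standardizes the raw headers and drops bad rows before writing Parquet to the silver zone. The paths, column names, and the deduplication key are assumptions based on the US Accidents schema:

```python
"""Hypothetical Spark cleaning job: bronze CSV -> silver Parquet."""
import re


def to_snake_case(col: str) -> str:
    """Normalize a raw CSV header like 'Wind_Chill(F)' to snake_case."""
    col = re.sub(r"[^\w]+", "_", col.strip())          # non-word chars -> underscore
    return re.sub(r"_+", "_", col).strip("_").lower()  # collapse/trim underscores


def clean_accidents(spark, bronze_path: str, silver_path: str) -> None:
    """Read raw CSV from bronze, standardize columns, drop bad rows, write Parquet."""
    from pyspark.sql import functions as F

    df = spark.read.option("header", True).csv(bronze_path)
    df = df.toDF(*[to_snake_case(c) for c in df.columns])
    df = (
        df.dropDuplicates(["id"])                   # 'id' assumed unique per accident
          .filter(F.col("start_time").isNotNull())  # assumed timestamp column
          .withColumn("start_time", F.to_timestamp("start_time"))
    )
    df.write.mode("overwrite").parquet(silver_path)
```

The same pattern would apply to the vehicles dataset, with its own key and timestamp columns.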
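For the load step, a minimal sketch using the Snowflake Python connector to run `COPY INTO` from a stage holding the silver Parquet files. The stage, database, and table names here are assumptions:

```python
"""Hypothetical loader: silver Parquet files -> Snowflake STAGING schema."""
import os


def copy_into_sql(table: str, stage: str) -> str:
    """Render the COPY INTO statement for one staging table."""
    return (
        f"COPY INTO STAGING.{table} "
        f"FROM @{stage}/{table}/ "
        "FILE_FORMAT = (TYPE = PARQUET) "
        "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
    )


def load_to_staging(table: str, stage: str = "SILVER_STAGE") -> None:
    """Execute the COPY INTO for one table; credentials come from the environment."""
    import snowflake.connector  # imported lazily, as in the Airflow callables

    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database="ACCIDENTS_DWH",  # assumed database name
    )
    try:
        conn.cursor().execute(copy_into_sql(table, stage))
    finally:
        conn.close()
```

`MATCH_BY_COLUMN_NAME` lets Snowflake map Parquet columns to table columns by name, so the staging DDL does not have to mirror the file's column order.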
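The star schema itself could be sketched as one fact table referencing a handful of dimensions. In the project these would live as dbt models rather than raw DDL, and the table and column names below are illustrative only:

```python
"""Illustrative star-schema shape for the warehouse layer (names are assumed)."""

DIMENSIONS = {
    "DIM_DATE": ["DATE_KEY INT", "FULL_DATE DATE", "YEAR INT", "MONTH INT", "DAY INT"],
    "DIM_LOCATION": ["LOCATION_KEY INT", "CITY STRING", "COUNTY STRING", "STATE STRING"],
    "DIM_WEATHER": ["WEATHER_KEY INT", "CONDITION STRING", "TEMPERATURE_F FLOAT"],
}

# The fact table holds one row per accident, with foreign keys into each dimension.
FACT = ("FACT_ACCIDENT", [
    "ACCIDENT_ID STRING",
    "DATE_KEY INT", "LOCATION_KEY INT", "WEATHER_KEY INT",  # FKs to dimensions
    "SEVERITY INT", "DISTANCE_MI FLOAT",
])


def ddl(table: str, columns: list) -> str:
    """Render a CREATE TABLE statement for one star-schema table."""
    return f"CREATE TABLE IF NOT EXISTS {table} (\n  " + ",\n  ".join(columns) + "\n);"
```

This shape supports the serving-layer questions directly: high-risk areas group the fact table by `DIM_LOCATION`, while contributing factors join through `DIM_WEATHER`.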
Tech stack: Apache Airflow, Apache Spark, Docker, dbt, Snowflake, MinIO, FastAPI, Power BI

