African News Classifier

Project Overview

This project implements an end-to-end Machine Learning pipeline to categorize news articles relevant to the Central African region (in this case Cameroon) into distinct classes: Politics, Economy, Sports, ....

This project shows a complete Data Engineering workflow, from scraping real-time data using google gnews to training a discriminative model using TF-IDF vectorization.

The Pipeline

1. Data Collection

Data was obtained using the Google News API, targeting specific queries to build a localized dataset.

Source: Real-time news aggregation.
Context: Focus on Cameroonian keywords (e.g., "Lions Indomptables", "Économie Cameroun", ...).

2. Preprocessing

To reduce noise and dimensionality in the dataset, the raw text undergoes cleaning using NLTK:

Normalization: Text lowercasing and punctuation removal.
Stop-words Removal: Filtering out common French words to focus on semantic content.
Tokenization: Splitting text into individual semantic units.

3. Mathematical Modeling (TF-IDF)

Text data is converted into numerical vectors using Term Frequency-Inverse Document Frequency, TF-IDF. This statistical measure evaluates how relevant a word is to a document in a collection and classification is performed.

$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$

Where $N$ is the total number of articles and $df_i$ is the number of articles containing term $i$.

Results & Performance

The model was evaluated on a test set and produced the following.

Confusion Matrix

The matrix shows that the model effectively distinguishes 'Sports' from 'Politics', though some overlap exists between 'Economy' and 'Politics' due to factors like shared vocabulary.*

How to Run

Clone the repository:

git clone https://github.com/JeffreyYAJ/African-news-classifier.git
cd African-news-classifier

Install dependencies:
```
pip install -r requirements.txt
```

Run the pipeline from the jupyter notebook.

Tools Used

Pandas: Data manipulation.
GNews: Data extraction.
NLTK: Natural Language Processing.
Scikit-Learn: Model training and evaluation.
Matplotlib/Seaborn: Data visualization.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
data		data
African_News_classifier.ipynb		African_News_classifier.ipynb
Evaluation.py		Evaluation.py
README.md		README.md
data_cleaner.py		data_cleaner.py
data_scraper.py		data_scraper.py
matrix.png		matrix.png
requirements.txt		requirements.txt
results.png		results.png
td_idf_model.py		td_idf_model.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

African News Classifier

Project Overview

The Pipeline

1. Data Collection

2. Preprocessing

3. Mathematical Modeling (TF-IDF)

Results & Performance

Confusion Matrix

How to Run

Tools Used

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

African News Classifier

Project Overview

The Pipeline

1. Data Collection

2. Preprocessing

3. Mathematical Modeling (TF-IDF)

Results & Performance

Confusion Matrix

How to Run

Tools Used

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages