This project implements an end-to-end machine learning pipeline that classifies news articles about the Central African region (here, Cameroon) into distinct categories such as Politics, Economy, and Sports.
It covers a complete data engineering workflow, from scraping real-time data from Google News with the GNews package to training a discriminative model on TF-IDF features.
Data was obtained using the Google News API, targeting specific queries to build a localized dataset.
- Source: Real-time news aggregation.
- Context: Focus on Cameroonian keywords (e.g., "Lions Indomptables", "Économie Cameroun", ...).
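
As a rough sketch of how query-driven collection could be organised, the snippet below groups queries by target class and attaches that class as a label to every article a query returns. The `collect_articles` helper, the `fetch` callable, the query-to-label mapping and the `fake_fetch` stand-in are all assumptions made for illustration; the notebook may label and deduplicate articles differently.

```python
def collect_articles(queries_by_label, fetch):
    """Build labelled rows from a {label: [queries]} mapping.

    `fetch(query)` must return a list of dicts with 'title' and
    'description' keys (the shape of a typical news-search result).
    """
    rows = []
    seen_titles = set()
    for label, queries in queries_by_label.items():
        for query in queries:
            for item in fetch(query):
                title = item.get("title", "")
                # Skip empty results and articles already collected
                if not title or title in seen_titles:
                    continue
                seen_titles.add(title)
                description = item.get("description") or ""
                rows.append({
                    "text": f"{title} {description}".strip(),
                    "label": label,
                    "query": query,
                })
    return rows


def fake_fetch(query):
    # Offline stand-in for a real news search (hypothetical data)
    return [{"title": f"Article sur {query}", "description": "Résumé."}]


QUERIES = {
    "Sports": ["Lions Indomptables"],
    "Economy": ["Économie Cameroun"],
}
rows = collect_articles(QUERIES, fake_fetch)
```

With the `gnews` package installed, something like `GNews(language="fr", max_results=20).get_news` could be passed as `fetch` in place of `fake_fetch`; that call needs network access, and the parameter values shown are only examples.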
To reduce noise and dimensionality in the dataset, the raw text undergoes cleaning using NLTK:
- Normalization: Text lowercasing and punctuation removal.
- Stop-words Removal: Filtering out common French words to focus on semantic content.
- Tokenization: Splitting text into individual semantic units.
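
The cleaning steps above can be sketched in a few lines. To keep the example runnable without downloading NLTK data, it uses a tiny hard-coded stop-word set and whitespace splitting; these are stand-ins (assumptions), and in the project they would presumably be replaced by NLTK's `stopwords.words("french")` and `nltk.word_tokenize`.

```python
import re

# Tiny illustrative stop-word set (assumption); NLTK's French list is much longer
FRENCH_STOP_WORDS = {
    "le", "la", "les", "l", "de", "des", "du", "d", "un", "une",
    "et", "à", "au", "aux", "en", "pour", "dans", "sur", "est",
}


def preprocess(text, stop_words=FRENCH_STOP_WORDS):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a letter, digit, underscore or whitespace
    # with a space, so "l'économie" becomes "l économie"
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in stop_words]


tokens = preprocess("Les Lions Indomptables gagnent le match !")
# tokens == ["lions", "indomptables", "gagnent", "match"]
```

Tokenizing before filtering keeps the stop-word check to a simple set lookup per token.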
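Text is converted into numerical vectors using Term Frequency-Inverse Document Frequency (TF-IDF), a statistical measure of how relevant a word is to a document within a collection. Classification is then performed on these vectors. In its standard form, the weight of term $i$ in article $j$ is:

$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$

where $tf_{i,j}$ is the number of times term $i$ appears in article $j$, $N$ is the total number of articles, and $df_i$ is the number of articles containing term $i$.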
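
As a worked illustration of this weighting (term frequency multiplied by $\log(N / df_i)$), the sketch below computes it by hand on three made-up token lists. The `tf_idf` helper and the toy documents are assumptions for illustration only. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each vector by default, so its numbers will differ from this textbook version.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # df[term] = number of documents that contain the term at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy corpus (hypothetical, already preprocessed)
docs = [
    ["lions", "indomptables", "match"],
    ["économie", "cameroun", "croissance"],
    ["match", "cameroun", "victoire"],
]
weights = tf_idf(docs)
```

Here "lions" appears in one of three documents and gets weight $\log(3)$, while "match" appears in two and gets the smaller weight $\log(3/2)$: rarer terms are treated as more informative.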
The model was evaluated on a test set and produced the following.
The confusion matrix shows that the model effectively distinguishes 'Sports' from 'Politics', though some overlap exists between 'Economy' and 'Politics', likely due in part to shared vocabulary.
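
For readers who want to reproduce this kind of evaluation, here is a minimal sketch using scikit-learn. The classifier (`LogisticRegression`), the tiny French sentences and their labels are all assumptions chosen for illustration; the source does not state which model the notebook trains, and real results depend on the scraped dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline

# Made-up training and test examples (hypothetical data)
train_texts = [
    "les lions indomptables remportent le match",
    "victoire du cameroun en finale de la coupe",
    "le gouvernement annonce une réforme électorale",
    "le président rencontre les députés à yaoundé",
    "la croissance économique du cameroun ralentit",
    "hausse des prix du cacao sur le marché",
]
train_labels = ["Sports", "Sports", "Politics", "Politics", "Economy", "Economy"]

test_texts = [
    "les lions indomptables perdent le match",
    "le gouvernement vote le budget de croissance",
    "les prix du marché augmentent",
]
test_labels = ["Sports", "Politics", "Economy"]

# TF-IDF features feed directly into a linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# Rows are true classes, columns are predicted classes, in this order
labels = ["Economy", "Politics", "Sports"]
cm = confusion_matrix(test_labels, model.predict(test_texts), labels=labels)
print(cm)
```

Off-diagonal counts in the Economy and Politics rows are where the kind of overlap described above would show up.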
- Clone the repository:

      git clone https://github.com/JeffreyYAJ/African-news-classifier.git
      cd African-news-classifier

- Install dependencies:

      pip install -r requirements.txt

- Run the pipeline from the Jupyter notebook.
- Pandas: Data manipulation.
- GNews: Data extraction.
- NLTK: Natural Language Processing.
- Scikit-Learn: Model training and evaluation.
- Matplotlib/Seaborn: Data visualization.

