
Commit 27f489d

Added documentation on how to use the filtering

1 parent d37270a · commit 27f489d

1 file changed: ac_dc/README.md (31 additions, 2 deletions)
This is the data filtering code for BigScience.
See [this document](https://docs.google.com/document/d/1bx7lzAIWALH2IX5PLAiRfkHr3025dC-ZYkEmq4zB2gI/edit) for more details.
The supported languages are defined in the file [languages_id.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/languages_id.py).
### Filtering
#### 0. Understand the filtering pipeline
Take a look at the PDF [explanation_filtering_pipeline.pdf](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/explanation_filtering_pipeline.pdf) for an explanation of the filtering pipeline.
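
The overall idea of the pipeline can be sketched as a chain of document-level predicates: a document is kept only if it passes every filter. The code below is an illustrative sketch with made-up filters and thresholds, not the actual ac_dc implementation.

```python
# Illustrative sketch of a document-level filtering pipeline (not the actual
# ac_dc code): each filter is a predicate, and a document is kept only if it
# passes every filter in order.

def filter_min_length(doc: str, min_chars: int = 20) -> bool:
    """Discard documents that are too short to be useful."""
    return len(doc) >= min_chars

def filter_max_digit_ratio(doc: str, max_ratio: float = 0.3) -> bool:
    """Discard documents dominated by digits (e.g. tables, logs)."""
    if not doc:
        return False
    return sum(c.isdigit() for c in doc) / len(doc) <= max_ratio

FILTERS = [filter_min_length, filter_max_digit_ratio]

def keep_document(doc: str) -> bool:
    return all(f(doc) for f in FILTERS)

docs = ["Hello, this is a normal sentence about language models.",
        "42",
        "1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 x"]
kept = [d for d in docs if keep_document(d)]  # only the first document survives
```

The real pipeline applies many more filters (language identification, stop word ratio, bad word ratio, perplexity, etc.), but the keep/discard structure is the same.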
#### 1. Define the lists of stop words and bad words, and check how the anonymization and the normalization of texts are done
For robustness or ethical reasons, you might want to redefine the lists of stop words and bad words in the files [stopwords.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/stopwords.py) and [badwords.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/badwords.py).
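
To see how such lists typically drive filters, here is a toy sketch: natural text usually contains a minimum share of stop words, while keyword spam or gibberish does not, and bad words above a threshold flag a document for removal. The word lists and thresholds below are made up; the real per-language lists live in `stopwords.py` and `badwords.py`.

```python
# Toy stop word / bad word filters; the lists and cutoffs are illustrative only.
TOY_STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}
TOY_BADWORDS = {"badword1", "badword2"}

def stopword_ratio(doc: str) -> float:
    words = doc.lower().split()
    if not words:
        return 0.0
    return sum(w in TOY_STOPWORDS for w in words) / len(words)

def passes_stopword_filter(doc: str, min_ratio: float = 0.1) -> bool:
    # Keep documents with a plausible share of stop words.
    return stopword_ratio(doc) >= min_ratio

def passes_badword_filter(doc: str, max_ratio: float = 0.0) -> bool:
    # Discard documents whose bad word ratio exceeds the cutoff.
    words = doc.lower().split()
    if not words:
        return True
    return sum(w in TOY_BADWORDS for w in words) / len(words) <= max_ratio
```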
You can also check how the anonymization and the normalization of texts are done in the files [anonymization.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/anonymization.py) and [normalization.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/normalization.py) (by default, anonymization is applied and normalization is not).
#### 2. Download everything you need
To run the filtering code, you need to download the dataset on which the filtering will take place, as well as the required models: the FastText model for language identification (download it [here](https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin)), and the SentencePiece and KenLM models for tokenization and the calculation of perplexity scores (download them with the file [download_kenlm_models.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/download_kenlm_models.py)).
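
To give an intuition for what the KenLM models are used for, here is a toy perplexity computation: a language model trained on clean text assigns low perplexity to fluent text and high perplexity to gibberish, so documents above a cutoff can be discarded. The unigram model below is a self-contained stand-in for KenLM's n-gram models, purely for illustration.

```python
# Toy sketch of the perplexity idea behind the KenLM step. A unigram model
# over a tiny corpus stands in for a real KenLM n-gram model.
import math
from collections import Counter

def train_unigram(corpus):
    counts = Counter(w for doc in corpus for w in doc.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def perplexity(doc, model, floor=1e-6):
    # Unseen words get a small floor probability instead of zero.
    words = doc.split()
    log_prob = sum(math.log(model.get(w, floor)) for w in words)
    return math.exp(-log_prob / max(len(words), 1))

model = train_unigram(["the cat sat", "the dog sat", "the cat ran"])
# In-domain text scores a low perplexity; out-of-domain gibberish a huge one.
```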
#### 3. Choose the filtering parameters
The filtering parameters for each language are specified in the file [parameters_filtering.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/parameters_filtering.py). It is strongly recommended to look at the data and use the visualization code in the directory [visualization](https://github.com/bigscience-workshop/data_tooling/tree/master/ac_dc/visualization) to choose these parameters.
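
As a rough picture of what such per-language parameters look like, here is a hypothetical fragment; the key names and values below are illustrative only, and the actual parameters and their names are defined in `parameters_filtering.py`.

```python
# Hypothetical shape of per-language filtering parameters (illustrative only;
# see parameters_filtering.py for the real keys and values).
parameters_filtering_en = {
    "number_words_min_cutoff": 10,   # discard very short documents
    "stopwords_min_cutoff": 0.10,    # minimum stop word ratio
    "badwords_max_cutoff": 0.05,     # maximum bad word ratio
    "lang_id_min_cutoff": 0.80,      # minimum language-id confidence
    "perplexity_max_cutoff": 1500,   # maximum KenLM perplexity
}
```

Cutoffs like these are exactly what the visualization code helps you choose: you plot the distribution of each statistic over real data and pick thresholds that cut off the bad tail without losing good documents.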
#### 4. Run the filtering
Run the filtering with the file [main_filtering.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/main_filtering.py), specifying the dataset to use and the paths to the downloaded models. The individual filters are implemented in the file [filtering.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/filtering.py).
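
Conceptually, the driver just streams the dataset through the configured filters. This is a minimal sketch in that spirit; the function names and stand-in filters are illustrative, not the actual API of `main_filtering.py`.

```python
# Minimal sketch of a filtering driver (illustrative; not the actual API).
def run_filtering(docs, filters):
    """Keep a document only if every filter accepts it."""
    return [doc for doc in docs if all(f(doc) for f in filters)]

filters = [
    lambda d: len(d.split()) >= 3,  # stand-in for a minimum-length filter
    lambda d: not d.isupper(),      # stand-in for a simple quality filter
]

kept = run_filtering(
    ["a short normal sentence", "HI", "ALL CAPS SPAM HERE"],
    filters,
)
```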
#### 5. Do the deduplication
Run the deduplication, which is detailed in the following section, with the file [deduplicate.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/deduplicate.py).
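
The simplest form of deduplication is exact matching on lightly normalized text: hash each document and keep only the first occurrence of each hash. The sketch below shows that general idea only; the actual logic lives in `deduplicate.py`.

```python
# Sketch of exact deduplication by hashing normalized text (illustrative;
# see deduplicate.py for the real implementation).
import hashlib

def dedup_exact(docs):
    seen, unique = set(), []
    for doc in docs:
        # Normalize case and whitespace so trivially different copies collide.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```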
### Deduplication
