|
2 | 2 |
|
3 | 3 | This is the data filtering code for BigScience. |
4 | 4 |
|
5 | | -See this document for more details.. |
| 5 | +See [this document](https://docs.google.com/document/d/1bx7lzAIWALH2IX5PLAiRfkHr3025dC-ZYkEmq4zB2gI/edit) for more details. |
6 | 6 |
|
7 | | -https://docs.google.com/document/d/1bx7lzAIWALH2IX5PLAiRfkHr3025dC-ZYkEmq4zB2gI/edit |
| 7 | +The supported languages are defined in the file [languages_id.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/languages_id.py). |
| 8 | + |
| 9 | + |
| 10 | +### Filtering |
| 11 | + |
| 12 | +#### 0. Understand the filtering pipeline |
| 13 | + |
| 14 | +See [explanation_filtering_pipeline.pdf](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/explanation_filtering_pipeline.pdf) for an explanation of the filtering pipeline. |
| 15 | + |
| 16 | +#### 1. Define the lists of stop words and bad words, and check how the anonymization and the normalization of texts are done |
| 17 | + |
| 18 | +For robustness or ethical reasons, you might want to redefine the lists of stop words and bad words in the files [stopwords.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/stopwords.py) and [badwords.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/badwords.py). |
| 19 | + |
| 20 | +You can also check how the anonymization and the normalization of texts are done in the files [anonymization.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/anonymization.py) and [normalization.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/normalization.py). By default, anonymization is applied and normalization is not. |
| 21 | + |
| 22 | +#### 2. Download everything you need |
| 23 | + |
| 24 | +To run the filtering code, download the dataset to be filtered along with the required models: the fastText model for language identification (download [here](https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin)) and the SentencePiece and KenLM models for tokenization and perplexity scoring (download with the file [download_kenlm_models.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/download_kenlm_models.py)). |
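As a minimal sketch of the fastText download step, assuming you prefer to fetch the model from Python instead of a browser: the helper `download_if_missing` and the destination path are hypothetical names that are not part of this repository; only the URL comes from the text above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL of the fastText language identification model, as linked above.
FASTTEXT_LID_URL = (
    "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"
)


def download_if_missing(url: str, dest: Path) -> Path:
    """Download `url` to `dest` unless `dest` already exists.

    Hypothetical helper for illustration, not part of the repository.
    """
    dest = Path(dest)
    if dest.exists():
        # Skip the download so repeated runs do not refetch a large file.
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file at the final path.
    tmp = dest.with_name(dest.name + ".part")
    urlretrieve(url, tmp)
    tmp.replace(dest)
    return dest
```

Calling `download_if_missing(FASTTEXT_LID_URL, Path("models/lid.176.bin"))` would then place the model under a `models` directory, which is an arbitrary location chosen for this example.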
| 25 | + |
| 26 | +#### 3. Choose the filtering parameters |
| 27 | + |
| 28 | +The filtering parameters for each language are specified in the file [parameters_filtering.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/parameters_filtering.py). It is strongly recommended to look at the data and use the visualization code in the directory [visualization](https://github.com/bigscience-workshop/data_tooling/tree/master/ac_dc/visualization) to choose these parameters. |
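To illustrate what a per-language cutoff can look like, here is a minimal sketch of a word-ratio filter built on stop-word and bad-word lists. It is not the code in `filtering.py`: the class `FilterParams`, its field names, the whitespace tokenization, and the threshold values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class FilterParams:
    """Hypothetical per-language thresholds (names are illustrative only)."""

    stopwords_min_ratio: float  # keep documents with at least this share of stop words
    badwords_max_ratio: float  # keep documents with at most this share of bad words


def word_ratio(text: str, vocabulary: set) -> float:
    """Fraction of whitespace-separated words of `text` found in `vocabulary`."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in vocabulary for word in words) / len(words)


def keep_document(text: str, stopwords: set, badwords: set, params: FilterParams) -> bool:
    """Return True if the document passes both ratio checks."""
    return (
        word_ratio(text, stopwords) >= params.stopwords_min_ratio
        and word_ratio(text, badwords) <= params.badwords_max_ratio
    )
```

A very low stop-word ratio often signals text that is not natural language, such as keyword lists, which is one reason to inspect the data before fixing the cutoffs.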
| 29 | + |
| 30 | +#### 4. Run the filtering |
| 31 | + |
| 32 | +Run the filtering with the file [main_filtering.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/main_filtering.py), specifying the dataset to filter and the paths to the downloaded models. The individual filters are implemented in the file [filtering.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/filtering.py). |
| 33 | + |
| 34 | +#### 5. Do the deduplication |
| 35 | + |
| 36 | +Run the deduplication, which is detailed in the following section, with the file [deduplicate.py](https://github.com/bigscience-workshop/data_tooling/blob/master/ac_dc/deduplicate.py). |
8 | 37 |
|
9 | 38 |
|
10 | 39 | ### Deduplication |
|