Commit 2797090

pii-manager v. 0.5.0 (#320)
* pii-manager v. 0.5.0
* new task list parsing code, adding a "full" format based on dicts, in addition to the previous "simplified" format based on tuples
* refactored to allow more than one task for a given PII and country
* added the capability to add task processors programmatically
* TASK_ANY split into LANG_ANY & COUNTRY_ANY
* context validation spec, for all three task implementation types
* PII detectors for international phone numbers, for en-any & es-any
* PII detector for IP addresses, language independent
* added reading task descriptors from a JSON file
* PII detectors for GOV_ID
  - lang pt, countries PT & BR
  - lang es, country MX
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* added missing __init__ files

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 7f6d398 commit 2797090

68 files changed

Lines changed: 2054 additions & 304 deletions

cc_pseudo_crawl/cc_lookup.py

Lines changed: 107 additions & 46 deletions
Original file line number | Diff line number | Diff line change
@@ -9,8 +9,9 @@
99

1010
from pyathena import connect
1111

12-
logging.basicConfig(level='INFO',
13-
format='%(asctime)s %(levelname)s %(name)s: %(message)s')
12+
logging.basicConfig(
13+
level="INFO", format="%(asctime)s %(levelname)s %(name)s: %(message)s"
14+
)
1415

1516
join_template = """
1617
CREATE TABLE {db}._tmp_overlap
@@ -45,7 +46,7 @@
4546
WHERE cc.crawl = '{crawl}'
4647
"""
4748

48-
drop_tmp_table = 'DROP TABLE `{db}._tmp_overlap`;'
49+
drop_tmp_table = "DROP TABLE `{db}._tmp_overlap`;"
4950

5051
# list of crawls
5152
# Note: in order to get a list of released crawls:
@@ -54,49 +55,103 @@
5455
# - see
5556
# https://commoncrawl.s3.amazonaws.com/crawl-data/index.html
5657
crawls = [
57-
'CC-MAIN-2013-20', 'CC-MAIN-2013-48',
58+
"CC-MAIN-2013-20",
59+
"CC-MAIN-2013-48",
5860
#
59-
'CC-MAIN-2014-10', 'CC-MAIN-2014-15', 'CC-MAIN-2014-23',
60-
'CC-MAIN-2014-35', 'CC-MAIN-2014-41', 'CC-MAIN-2014-42',
61-
'CC-MAIN-2014-49', 'CC-MAIN-2014-52',
61+
"CC-MAIN-2014-10",
62+
"CC-MAIN-2014-15",
63+
"CC-MAIN-2014-23",
64+
"CC-MAIN-2014-35",
65+
"CC-MAIN-2014-41",
66+
"CC-MAIN-2014-42",
67+
"CC-MAIN-2014-49",
68+
"CC-MAIN-2014-52",
6269
#
63-
'CC-MAIN-2015-06', 'CC-MAIN-2015-11', 'CC-MAIN-2015-14',
64-
'CC-MAIN-2015-18', 'CC-MAIN-2015-22', 'CC-MAIN-2015-27',
65-
'CC-MAIN-2015-32', 'CC-MAIN-2015-35', 'CC-MAIN-2015-40',
66-
'CC-MAIN-2015-48',
70+
"CC-MAIN-2015-06",
71+
"CC-MAIN-2015-11",
72+
"CC-MAIN-2015-14",
73+
"CC-MAIN-2015-18",
74+
"CC-MAIN-2015-22",
75+
"CC-MAIN-2015-27",
76+
"CC-MAIN-2015-32",
77+
"CC-MAIN-2015-35",
78+
"CC-MAIN-2015-40",
79+
"CC-MAIN-2015-48",
6780
#
68-
'CC-MAIN-2016-07', 'CC-MAIN-2016-18', 'CC-MAIN-2016-22',
69-
'CC-MAIN-2016-26', 'CC-MAIN-2016-30', 'CC-MAIN-2016-36',
70-
'CC-MAIN-2016-40', 'CC-MAIN-2016-44', 'CC-MAIN-2016-50',
81+
"CC-MAIN-2016-07",
82+
"CC-MAIN-2016-18",
83+
"CC-MAIN-2016-22",
84+
"CC-MAIN-2016-26",
85+
"CC-MAIN-2016-30",
86+
"CC-MAIN-2016-36",
87+
"CC-MAIN-2016-40",
88+
"CC-MAIN-2016-44",
89+
"CC-MAIN-2016-50",
7190
#
72-
'CC-MAIN-2017-04', 'CC-MAIN-2017-09', 'CC-MAIN-2017-13',
73-
'CC-MAIN-2017-17', 'CC-MAIN-2017-22', 'CC-MAIN-2017-26',
74-
'CC-MAIN-2017-30', 'CC-MAIN-2017-34', 'CC-MAIN-2017-39',
75-
'CC-MAIN-2017-43', 'CC-MAIN-2017-47', 'CC-MAIN-2017-51',
91+
"CC-MAIN-2017-04",
92+
"CC-MAIN-2017-09",
93+
"CC-MAIN-2017-13",
94+
"CC-MAIN-2017-17",
95+
"CC-MAIN-2017-22",
96+
"CC-MAIN-2017-26",
97+
"CC-MAIN-2017-30",
98+
"CC-MAIN-2017-34",
99+
"CC-MAIN-2017-39",
100+
"CC-MAIN-2017-43",
101+
"CC-MAIN-2017-47",
102+
"CC-MAIN-2017-51",
76103
#
77-
'CC-MAIN-2018-05', 'CC-MAIN-2018-09', 'CC-MAIN-2018-13',
78-
'CC-MAIN-2018-17', 'CC-MAIN-2018-22', 'CC-MAIN-2018-26',
79-
'CC-MAIN-2018-30', 'CC-MAIN-2018-34', 'CC-MAIN-2018-39',
80-
'CC-MAIN-2018-43', 'CC-MAIN-2018-47', 'CC-MAIN-2018-51',
104+
"CC-MAIN-2018-05",
105+
"CC-MAIN-2018-09",
106+
"CC-MAIN-2018-13",
107+
"CC-MAIN-2018-17",
108+
"CC-MAIN-2018-22",
109+
"CC-MAIN-2018-26",
110+
"CC-MAIN-2018-30",
111+
"CC-MAIN-2018-34",
112+
"CC-MAIN-2018-39",
113+
"CC-MAIN-2018-43",
114+
"CC-MAIN-2018-47",
115+
"CC-MAIN-2018-51",
81116
#
82-
'CC-MAIN-2019-04', 'CC-MAIN-2019-09', 'CC-MAIN-2019-13',
83-
'CC-MAIN-2019-18', 'CC-MAIN-2019-22', 'CC-MAIN-2019-26',
84-
'CC-MAIN-2019-30', 'CC-MAIN-2019-35', 'CC-MAIN-2019-39',
85-
'CC-MAIN-2019-43', 'CC-MAIN-2019-47', 'CC-MAIN-2019-51',
117+
"CC-MAIN-2019-04",
118+
"CC-MAIN-2019-09",
119+
"CC-MAIN-2019-13",
120+
"CC-MAIN-2019-18",
121+
"CC-MAIN-2019-22",
122+
"CC-MAIN-2019-26",
123+
"CC-MAIN-2019-30",
124+
"CC-MAIN-2019-35",
125+
"CC-MAIN-2019-39",
126+
"CC-MAIN-2019-43",
127+
"CC-MAIN-2019-47",
128+
"CC-MAIN-2019-51",
86129
#
87-
'CC-MAIN-2020-05', 'CC-MAIN-2020-10', 'CC-MAIN-2020-16',
88-
'CC-MAIN-2020-24', 'CC-MAIN-2020-29', 'CC-MAIN-2020-34',
89-
'CC-MAIN-2020-40', 'CC-MAIN-2020-45', 'CC-MAIN-2020-50',
130+
"CC-MAIN-2020-05",
131+
"CC-MAIN-2020-10",
132+
"CC-MAIN-2020-16",
133+
"CC-MAIN-2020-24",
134+
"CC-MAIN-2020-29",
135+
"CC-MAIN-2020-34",
136+
"CC-MAIN-2020-40",
137+
"CC-MAIN-2020-45",
138+
"CC-MAIN-2020-50",
90139
#
91-
'CC-MAIN-2021-04', 'CC-MAIN-2021-10', 'CC-MAIN-2021-17',
92-
'CC-MAIN-2021-21', 'CC-MAIN-2021-25', 'CC-MAIN-2021-31',
93-
'CC-MAIN-2021-39', 'CC-MAIN-2021-43', 'CC-MAIN-2021-49',
140+
"CC-MAIN-2021-04",
141+
"CC-MAIN-2021-10",
142+
"CC-MAIN-2021-17",
143+
"CC-MAIN-2021-21",
144+
"CC-MAIN-2021-25",
145+
"CC-MAIN-2021-31",
146+
"CC-MAIN-2021-39",
147+
"CC-MAIN-2021-43",
148+
"CC-MAIN-2021-49",
94149
#
95150
]
96151

97152

98153
s3_location = sys.argv[1]
99-
s3_location = s3_location.rstrip('/') # no trailing slash!
154+
s3_location = s3_location.rstrip("/") # no trailing slash!
100155

101156
seed_table = sys.argv[2]
102157

@@ -106,23 +161,29 @@
106161
crawls = filter(lambda c: crawl_selector.match(c), crawls)
107162

108163

109-
cursor = connect(s3_staging_dir="{}/staging".format(s3_location),
110-
region_name="us-east-1").cursor()
164+
cursor = connect(
165+
s3_staging_dir="{}/staging".format(s3_location), region_name="us-east-1"
166+
).cursor()
111167

112168
for crawl in crawls:
113-
query = join_template.format(crawl=crawl,
114-
s3_location="{}/cc".format(s3_location),
115-
db='bigscience',
116-
seed_table=seed_table,
117-
tid='bs')
169+
query = join_template.format(
170+
crawl=crawl,
171+
s3_location="{}/cc".format(s3_location),
172+
db="bigscience",
173+
seed_table=seed_table,
174+
tid="bs",
175+
)
118176
logging.info("Athena query: %s", query)
119177

120178
cursor.execute(query)
121179
logging.info("Athena query ID %s: %s", cursor.query_id, cursor.result_set.state)
122-
logging.info(" data_scanned_in_bytes: %d", cursor.result_set.data_scanned_in_bytes)
123-
logging.info(" total_execution_time_in_millis: %d", cursor.result_set.total_execution_time_in_millis)
124-
125-
cursor.execute(drop_tmp_table.format(db='bigscience'))
180+
logging.info(
181+
" data_scanned_in_bytes: %d", cursor.result_set.data_scanned_in_bytes
182+
)
183+
logging.info(
184+
" total_execution_time_in_millis: %d",
185+
cursor.result_set.total_execution_time_in_millis,
186+
)
187+
188+
cursor.execute(drop_tmp_table.format(db="bigscience"))
126189
logging.info("Drop temporary table: %s", cursor.result_set.state)
127-
128-

cc_pseudo_crawl/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
11
pyathena
22
surt
3-
tldextract
3+
tldextract

cc_pseudo_crawl/sourcing_sheet_seeds/README.md

Lines changed: 1 addition & 1 deletion
@@ -8,4 +8,4 @@ Steps:
88

99
2. do the lookups / table join, see [general instructions](../README.md) using the crawl selector `CC-MAIN-202[01]` to restrict the join for the last 2 years
1010

11-
3. prepare [coverage metrics](./cc-metrics.ipynb)
11+
3. prepare [coverage metrics](./cc-metrics.ipynb)

cc_pseudo_crawl/sourcing_sheet_seeds/candidate_websites_for_crawling.csv

Lines changed: 3 additions & 3 deletions
@@ -30,9 +30,9 @@
3030
38,energia-imdea,https://www.energia.imdea.org/,unknown,"multiple releases",,es,spain,"general news","text (web)",manual,,unknown,unknown,energia-imdea,
3131
39,urgente24,https://www.urgente24.com/,unknown,"multiple releases",,es,argentina,"general news","text (web)",manual,,unknown,unknown,urgente24,
3232
40,"corte electoral",https://www.corteelectoral.gub.uy/,unknown,"multiple releases",,es,uruguay,"general news","text (web)",manual,,unknown,unknown,"corte electoral",
33-
41,"observatorio de la política china",https://politica-china.org,unknown,"multiple releases",,es,spain,"general news","text (web)",manual,,unknown,unknown,"1, According to the “Provisions of the Supreme People's Court on the People's Court's Publication of Judgment Documents online” (最高人民法院关于人民法院在互联网公布裁判文书的规定), the online publication of judgment documents should be based on the principle of openness, with non-publicity as an exception. Judicial documents involving national security, juvenile delinquency, divorce proceedings, support or guardianship of minor children, etc., shall not be made public. In
34-
public judgment documents, information concerning personal privacy,
35-
trade secrets, etc., other than the names of the parties, shall also be
33+
41,"observatorio de la política china",https://politica-china.org,unknown,"multiple releases",,es,spain,"general news","text (web)",manual,,unknown,unknown,"1, According to the “Provisions of the Supreme People's Court on the People's Court's Publication of Judgment Documents online” (最高人民法院关于人民法院在互联网公布裁判文书的规定), the online publication of judgment documents should be based on the principle of openness, with non-publicity as an exception. Judicial documents involving national security, juvenile delinquency, divorce proceedings, support or guardianship of minor children, etc., shall not be made public. In
34+
public judgment documents, information concerning personal privacy,
35+
trade secrets, etc., other than the names of the parties, shall also be
3636
deleted from the document.",
3737
42,"icefi - instituto centroamericano de estudios fiscales",http://icefi.org/,unknown,"multiple releases",,es,gt,"general news","text (web)",manual,,unknown,unknown,"icefi - instituto centroamericano de estudios fiscales",
3838
43,"baja california sur - gobierno del estado",http://www.bcs.gob.mx/,unknown,"multiple releases",,es,mexico,"general news","text (web)",manual,,unknown,unknown,"baja california sur - gobierno del estado",

pii-manager/CHANGES.md

Lines changed: 14 additions & 0 deletions
@@ -1,3 +1,17 @@
1+
v. 0.5.0
2+
* new task list parsing code, adding a "full" format based on dicts, in
3+
addition to the previous "simplified" format based on tuples
4+
* refactored to allow more than one task for a given PII and country
5+
* added the capability to add task descriptors programmatically
6+
* added reading task descriptors from a JSON file
7+
* context validation spec, for all three task implementation types
8+
* TASK_ANY split into LANG_ANY & COUNTRY_ANY
9+
* PII detectors for international phone numbers, for en-any & es-any
10+
* PII detector for IP addresses, language independent
11+
* PII detectors for GOV_ID
12+
- lang pt, countries PT & BR
13+
- lang es, country MX
14+
115
v. 0.4.0
216
* PII GOV_ID task for es-ES and en-AU
317
* PII EMAIL_ADDRESS task

pii-manager/Makefile

Lines changed: 3 additions & 3 deletions
@@ -78,11 +78,11 @@ $(VENV)/bin/pytest:
7878

7979
# -----------------------------------------------------------------------
8080

81-
upload-check:
81+
upload-check: $(PKGFILE)
8282
twine check $(PKGFILE)
8383

84-
upload-test:
84+
upload-test: $(PKGFILE)
8585
twine upload --repository pypitest $(PKGFILE)
8686

87-
upload:
87+
upload: $(PKGFILE)
8888
twine upload $(PKGFILE)

pii-manager/doc/contributing.md

Lines changed: 94 additions & 7 deletions
@@ -5,7 +5,7 @@ repository with the following changes:
55

66
1. If the task type is a new one, add an identifier for it in [PiiEnum]
77
2. If it is for a language not yet covered, add a new language subfolder
8-
undder the [lang] folder, using the [ISO 639-1] code for the language
8+
under the [lang] folder, using the [ISO 639-1] code for the language
99
3. Then
1010
* If it is a country-independent PII, it goes into the `any` subdir
1111
(create that directory if it is not present)
@@ -15,8 +15,8 @@ repository with the following changes:
1515
module (the name of the file is not relevant). The module must contain:
1616
* The task implementation, which can have any of three flavours (regex,
1717
function or class), see below
18-
* The task descriptor, a list with the (compulsory) name `PII_TASKS` (see
19-
below)
18+
* The task descriptor, a list containing all defined tasks. The list
19+
variable *must* be named `PII_TASKS` (see below)
2020
5. Finally, add a unit test to check the validity for the task code, in the
2121
proper place under [test/unit/lang]. There should be at least
2222
- a positive test: one valid PII that has to be detected. For the cases
@@ -35,25 +35,112 @@ including where to add the required documentation explaining the task.
3535

3636
## Task descriptor
3737

38-
The task descriptor is a Python list that contains at least one tuple defining
39-
the entry points for this task (there might be more than one, if the file
40-
implements more than one PII).
38+
The task descriptor is a Python list that contains at least one element
39+
defining the entry points for this task (there might be more than one, if
40+
the file implements more than one PII).
4141

4242
* The name of the list **must be** `PII_TASKS`
43+
* A task entry in the list can have two different shapes: simplified and full.
44+
In a `PII_TASKS` list they can be combined freely.
4345

44-
* Each defined task must be a 2- or 3-element tuple, with these elements:
46+
47+
### Simplified description
48+
49+
In a simplified description a task must be a 2- or 3-element tuple, with
50+
these elements:
4551
- the PII identifier for the task: a member of [PiiEnum]
4652
- the [task implementation]: the regex, function or class implementing the
4753
PII extractor
4854
- (only if the implementation is of regex type) a text description of the
4955
task (for documentation purposes)
5056

5157

58+
### Full description
59+
60+
In a full description a task is a dictionary with these compulsory fields:
61+
* `pii`: the PII identifier for the task: a member of [PiiEnum]
62+
* `type`: the task type: `regex`, `callable` or `PiiTask`
63+
* `task`: for regex tasks, a raw string (containing the regex to be used);
64+
for function tasks a callable and for PiiTask either a class or a string
65+
with a full class name.
66+
67+
And these optional fields
68+
* `lang`: language this task is designed for (it can also be `LANG_ANY`). If
69+
not present, the language will be determined from the folder structure the
70+
task implementation is in
71+
* `country`: country this task is designed for. If not present, the country
72+
will be determined from the folder structure the task implementation is in,
73+
if possible (else, a `None` value will be used, meaning the task is not
74+
country-dependent)
75+
* `name`: a name for the task. If not present, a name will be generated from
76+
the `pii_name` class-level attribute (PiiTask) or from the class/function
77+
name.
78+
This is meant to provide a higher level of detail than the `PiiEnum`
79+
generic name (e.g. for different types of Government ID). Class-type tasks
80+
can use a dynamic name at runtime (detected PII might have different names),
81+
while function and regexes will have a fixed name.
82+
* `doc`: the documentation for the class. If not present, the docstring for
83+
callable and class types will be used (for regex types, the task will have
84+
no documentation)
85+
* `kwargs`: a dictionary of additional arguments. For `PiiTask` task types,
86+
they will be added as arguments to the class constructor; for `callable`
87+
types they will be added to each call to the task function. It is ignored
88+
for `regex` types.
89+
* `context` and `context_width`: for context validation, see below.
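
As an illustration of the two shapes described above, a `PII_TASKS` list mixing a simplified and a full entry might look roughly like the sketch below. Everything except the field names is an assumption made for the example: the `PiiEnum` class is a local stand-in for the real one in `pii_manager`, and `IP_PATTERN`, `find_phone_numbers` and the context values are hypothetical. The call signature that pii-manager expects from function-type tasks is not shown in this diff, so the function here simply takes a text string.

```python
import re
from enum import Enum, auto


class PiiEnum(Enum):
    """Local stand-in for pii_manager's PiiEnum (assumption: only these members)."""

    IP_ADDRESS = auto()
    PHONE_NUMBER = auto()


# Hypothetical, deliberately loose pattern; for illustration only
IP_PATTERN = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"


def find_phone_numbers(text):
    """Hypothetical function-type task: yield strings that look like phone numbers."""
    yield from re.findall(r"\+\d[\d ]{6,}\d", text)


PII_TASKS = [
    # Simplified description: a tuple of (PII identifier, implementation,
    # text description); the third element applies to regex tasks only
    (PiiEnum.IP_ADDRESS, IP_PATTERN, "IPv4 address (illustrative pattern)"),
    # Full description: a dict with the compulsory fields pii/type/task
    # plus any optional fields
    {
        "pii": PiiEnum.PHONE_NUMBER,
        "type": "callable",
        "task": find_phone_numbers,
        "name": "international phone number",
        "context": {"type": "word", "value": ["phone", "tel"], "width": (16, 0)},
    },
]
```

The second entry has to use the dict form because it carries a `context` field, which the simplified tuple form cannot express.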
90+
91+
92+
## Context validation
93+
94+
A task of any of the three types of [task implementation] may also include an
95+
additional step for context validation. In this step, all detected PII are
96+
further validated by ensuring that a document chunk around the detected PII
97+
string (before and after) contains one of the specified text contexts.
98+
99+
Note that task descriptors with context *must* be a full description, i.e. the
100+
dict version.
101+
102+
Context validation can have three variants:
103+
* `string`: each text context is a substring to be matched.
104+
* `word`: each text context is also a substring to be matched, but matching
105+
is ensured to work only on full words (the substring can contain more than
106+
one word, but it will be matched only on sequences of full words).
107+
* `regex`: each context is a regular expression (to be matched by the [regex]
108+
Python package)
109+
110+
Regardless of the variant, matching is always performed after normalizing
111+
the extracted document context chunk: normalize whitespace (replace all
112+
whitespace chunks by a single space) and lowercase the chunk.
113+
114+
This validation acts as a filter after the task implementation produces its
115+
results, and it is automatically applied if the task descriptor includes
116+
context (for class-based tasks, it can be replaced with custom code by
117+
overriding the `__call__` method, defined in the parent class).
118+
119+
Context is defined by a `context` field in the task descriptor. This field
120+
is a dictionary, with the following elements:
121+
* `type` indicates the variant, and it can be `string`, `word` or `regex`.
122+
If not present, `string` is assumed.
123+
* `value` should be a list of strings, any of which validate a document
124+
context. For `regex` mode the strings will be regex patterns.
125+
* `width` is an integer, or a tuple of two integers, that defines
126+
the width (in characters) of the document chunk (before and after the PII
127+
string) that makes up the document context. If not specified, a default is
128+
used.
129+
130+
As a shortcut, the `context` field can also contain a simple list of strings
131+
(or a single string). In this case, the context is defined as type `string`,
132+
with the default width, and the field contents define the value.
133+
134+
An example of context validation can be seen in the [international phone
135+
number] task.
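
The context validation logic described in this section can be sketched as follows. This is not the pii-manager implementation, only a minimal reading of the text under stated assumptions: it uses the standard library `re` module instead of the [regex] package, `DEFAULT_WIDTH` is a made-up value (the text does not state the real default), the chunks before and after the PII string are checked separately, and the function names are hypothetical.

```python
import re

DEFAULT_WIDTH = 64  # hypothetical default; the real value is not given in the text


def normalize(chunk):
    """Collapse each whitespace run to a single space and lowercase the chunk."""
    return re.sub(r"\s+", " ", chunk).lower()


def context_matches(chunk, values, ctx_type="string"):
    """Return True if any context value matches the normalized chunk."""
    chunk = normalize(chunk)
    if ctx_type == "string":
        return any(v in chunk for v in values)
    if ctx_type == "word":
        # Match only when the value is delimited by non-word characters
        return any(
            re.search(r"(?<!\w)" + re.escape(v) + r"(?!\w)", chunk) for v in values
        )
    if ctx_type == "regex":
        return any(re.search(v, chunk) for v in values)
    raise ValueError(f"unknown context type: {ctx_type}")


def validate(text, start, end, context):
    """Check the text around text[start:end] against a `context` spec."""
    # Shortcut forms: a single string or a plain list of strings
    if isinstance(context, str):
        context = {"value": [context]}
    elif isinstance(context, list):
        context = {"value": context}
    width = context.get("width", DEFAULT_WIDTH)
    before, after = (width, width) if isinstance(width, int) else width
    chunks = (text[max(0, start - before):start], text[end:end + after])
    ctx_type = context.get("type", "string")
    return any(context_matches(c, context["value"], ctx_type) for c in chunks)
```

With this sketch, a `word` context of `phone` accepts a match preceded by `PHONE:` but a `word` context of `hone` does not, since `hone` only occurs there as part of a longer word.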
52136

53137
[task implementation]: #task-implementation
54138
[PiiEnum]: ../src/pii_manager/piienum.py
55139
[tasks]: tasks.md
56140
[lang]: ../src/pii_manager/lang
57141
[test/unit/lang]: ../test/unit/lang
142+
[international phone number]: ../src/pii_manager/lang/en/any/international_phone_number.py
143+
58144
[ISO 639-1]: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
59145
[ISO 3166-1]: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
146+
[regex]: https://github.com/mrabarnett/mrab-regex
