
Commit 95ec2b4

Pseudo-Crawl curated list of sites: Data Sourcing Candidate seeds spreadsheet (#317)
* Initialize directory for pseudo-crawl-related tools and subsets; add seeds from the data sourcing candidate seeds spreadsheet
* Clean up the seeds of the data sourcing candidate seeds spreadsheet
* Add tools and instructions on how to look up the seed list in Common Crawl
* Add metrics notebook and data
1 parent 275a92d commit 95ec2b4

8 files changed

Lines changed: 3158 additions & 0 deletions

File tree

cc_pseudo_crawl/README.md

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@

# Extracting Content from Common Crawl for a Curated List of Sites

a.k.a. "pseudo-crawls"

- tools to extract content from Common Crawl for a curated list of sites
- metrics about planned and ongoing pseudo-crawls to understand their coverage (size, languages, content types, etc.)

## Preliminary Steps
- create an AWS account in order to use [Athena](https://aws.amazon.com/athena/) to perform the lookups
- in Athena, create the database `ccindex` and the table `ccindex`, see https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/ (a condensed sketch of the table definition follows after this list)
- create the database `bigscience`, which holds the joined data and more
```sql
CREATE DATABASE bigscience;
```
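
For reference, a condensed sketch of the `ccindex` table definition, keeping only the columns used by the queries below; the complete, authoritative DDL and exact S3 location are given in the linked Common Crawl article:

```sql
-- Condensed sketch; see the linked article for the full column list.
CREATE EXTERNAL TABLE IF NOT EXISTS ccindex.ccindex (
  url_surtkey                STRING,
  url                        STRING,
  url_host_name              STRING,
  url_host_tld               STRING,
  url_host_registered_domain STRING,
  fetch_time                 TIMESTAMP,
  fetch_status               SMALLINT,
  fetch_redirect             STRING,
  content_mime_detected      STRING,
  content_languages          STRING,
  warc_filename              STRING,
  warc_record_offset         INT,
  warc_record_length         INT)
PARTITIONED BY (
  crawl  STRING,
  subset STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';
```

After creating the table, load its partitions with `MSCK REPAIR TABLE ccindex;` so that the released crawls become visible.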
## Looking Up URLs per Site List

For every site list:

1. create a seed table which includes the join columns (host or domain name, SURT URL). See [cleanup-seeds](./sourcing_sheet_seeds/cleanup-seeds.ipynb) for an example of this and the following step; a minimal sketch of both steps follows below.

2. export the table to a file, ideally in a columnar format (Parquet or ORC)
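
A minimal sketch of these two steps, assuming a CSV with `id`, `title`, `link` and `language` columns (`seeds.csv` is a hypothetical file name; writing Parquet requires `pyarrow` or `fastparquet`; the linked notebook is the actual reference):

```python
import pandas as pd
import surt        # canonicalizes URLs into SURT form
import tldextract  # splits host names into registered domain and subdomain

seeds = pd.read_csv('seeds.csv')  # hypothetical export of the site list

# derive the join columns from the seed URL
seeds['url_host_name'] = seeds['link'].apply(
    lambda url: tldextract.extract(url).fqdn)
seeds['url_host_registered_domain'] = seeds['link'].apply(
    lambda url: tldextract.extract(url).registered_domain)
# SURT key used as prefix match in the index join
seeds['url_surtkey'] = seeds['link'].apply(surt.surt)

seeds.to_parquet('seeds.gz.parquet', compression='gzip')
```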
3. upload the seed file to S3
```
aws s3 cp seeds.gz.parquet s3://bucket/path/seeds/
```
Note: the S3 path must point to a bucket with write permissions granted. The path also needs to be adjusted in the following commands.
4. import the seed table into Athena
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS bigscience.seeds (
  `id` int,
  `title` string,
  `link` string,
  `language` string,
  `url_path_prefix` string,
  `url_host_name` string,
  `url_host_registered_domain` string,
  `url_surtkey` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://bucket/path/seeds/'
TBLPROPERTIES ('has_encrypted_data'='false');
```
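A quick, illustrative check that the import succeeded:
```sql
SELECT COUNT(*) AS n_seeds FROM bigscience.seeds;
```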
5. join the seeds table crawl by crawl with Common Crawl's index, creating a temporary table which is later used as one partition of the result table
```
python3 cc_lookup.py s3://bucket/path seeds "CC-MAIN-2021"
```
The third argument is a regular expression selecting crawls by name; here it matches all crawls of the year 2021. The join data is written to `s3://bucket/path/cc`.
6. finally, create a table holding the result data in order to get further metrics or prepare the content export
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS bigscience.cc (
  id INT,
  title STRING,
  link STRING,
  language STRING,
  url_surtkey_prefix STRING,
  url_surtkey STRING,
  url_host_tld STRING,
  url_host_registered_domain STRING,
  url_host_name STRING,
  url STRING,
  fetch_status SMALLINT,
  fetch_time TIMESTAMP,
  warc_filename STRING,
  warc_record_offset INT,
  warc_record_length INT,
  fetch_redirect STRING,
  content_mime_detected STRING,
  content_languages STRING)
PARTITIONED BY (
  crawl STRING,
  subset STRING)
STORED AS parquet
LOCATION 's3://bucket/path/cc/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'parquet.compression'='GZIP');
```
7. load the partitions of the result table
```sql
MSCK REPAIR TABLE bigscience.cc;
```
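
As a sanity check (an illustrative query, not part of the pipeline), coverage can then be inspected per crawl and subset:

```sql
SELECT crawl,
       subset,
       COUNT(*) AS n_captures,
       COUNT(DISTINCT url_host_registered_domain) AS n_domains
FROM bigscience.cc
GROUP BY crawl, subset
ORDER BY crawl, subset;
```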

cc_pseudo_crawl/cc_lookup.py

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
#!/usr/bin/python3

# Iterate over monthly crawls and store the joined data
# as one partition each of the result table.

import logging
import re
import sys

from pyathena import connect

logging.basicConfig(level='INFO',
                    format='%(asctime)s %(levelname)s %(name)s: %(message)s')
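# The seed table is joined against Common Crawl's columnar index one
# crawl at a time. The RIGHT OUTER JOIN keeps every seed even if a
# crawl contains no matching captures, and
#   strpos(cc.url_surtkey, {tid}.url_surtkey) = 1
# selects all captures whose SURT key starts with the seed's SURT prefix.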
join_template = """
CREATE TABLE {db}._tmp_overlap
WITH (external_location = '{s3_location}/crawl={crawl}/',
      partitioned_by = ARRAY['subset'],
      format = 'PARQUET',
      parquet_compression = 'GZIP')
AS SELECT
       {tid}.id AS id,
       {tid}.title AS title,
       {tid}.link AS link,
       {tid}.language AS language,
       {tid}.url_surtkey AS url_surtkey_prefix,
       cc.url_surtkey AS url_surtkey,
       cc.url_host_tld AS url_host_tld,
       cc.url_host_registered_domain AS url_host_registered_domain,
       cc.url_host_name AS url_host_name,
       cc.url AS url,
       cc.fetch_status AS fetch_status,
       cc.fetch_time AS fetch_time,
       cc.warc_filename AS warc_filename,
       cc.warc_record_offset AS warc_record_offset,
       cc.warc_record_length AS warc_record_length,
       cc.fetch_redirect AS fetch_redirect,
       cc.content_mime_detected AS content_mime_detected,
       cc.content_languages AS content_languages,
       cc.subset AS subset
FROM ccindex.ccindex AS cc
     RIGHT OUTER JOIN {db}.{seed_table} AS {tid}
     ON cc.url_host_registered_domain = {tid}.url_host_registered_domain
        AND strpos(cc.url_surtkey, {tid}.url_surtkey) = 1
WHERE cc.crawl = '{crawl}'
"""
drop_tmp_table = 'DROP TABLE `{db}._tmp_overlap`;'

# list of crawls
# Note: in order to get a list of released crawls
# - query Athena:
#     SHOW PARTITIONS ccindex
# - or see
#     https://commoncrawl.s3.amazonaws.com/crawl-data/index.html
crawls = [
    'CC-MAIN-2013-20', 'CC-MAIN-2013-48',
    #
    'CC-MAIN-2014-10', 'CC-MAIN-2014-15', 'CC-MAIN-2014-23',
    'CC-MAIN-2014-35', 'CC-MAIN-2014-41', 'CC-MAIN-2014-42',
    'CC-MAIN-2014-49', 'CC-MAIN-2014-52',
    #
    'CC-MAIN-2015-06', 'CC-MAIN-2015-11', 'CC-MAIN-2015-14',
    'CC-MAIN-2015-18', 'CC-MAIN-2015-22', 'CC-MAIN-2015-27',
    'CC-MAIN-2015-32', 'CC-MAIN-2015-35', 'CC-MAIN-2015-40',
    'CC-MAIN-2015-48',
    #
    'CC-MAIN-2016-07', 'CC-MAIN-2016-18', 'CC-MAIN-2016-22',
    'CC-MAIN-2016-26', 'CC-MAIN-2016-30', 'CC-MAIN-2016-36',
    'CC-MAIN-2016-40', 'CC-MAIN-2016-44', 'CC-MAIN-2016-50',
    #
    'CC-MAIN-2017-04', 'CC-MAIN-2017-09', 'CC-MAIN-2017-13',
    'CC-MAIN-2017-17', 'CC-MAIN-2017-22', 'CC-MAIN-2017-26',
    'CC-MAIN-2017-30', 'CC-MAIN-2017-34', 'CC-MAIN-2017-39',
    'CC-MAIN-2017-43', 'CC-MAIN-2017-47', 'CC-MAIN-2017-51',
    #
    'CC-MAIN-2018-05', 'CC-MAIN-2018-09', 'CC-MAIN-2018-13',
    'CC-MAIN-2018-17', 'CC-MAIN-2018-22', 'CC-MAIN-2018-26',
    'CC-MAIN-2018-30', 'CC-MAIN-2018-34', 'CC-MAIN-2018-39',
    'CC-MAIN-2018-43', 'CC-MAIN-2018-47', 'CC-MAIN-2018-51',
    #
    'CC-MAIN-2019-04', 'CC-MAIN-2019-09', 'CC-MAIN-2019-13',
    'CC-MAIN-2019-18', 'CC-MAIN-2019-22', 'CC-MAIN-2019-26',
    'CC-MAIN-2019-30', 'CC-MAIN-2019-35', 'CC-MAIN-2019-39',
    'CC-MAIN-2019-43', 'CC-MAIN-2019-47', 'CC-MAIN-2019-51',
    #
    'CC-MAIN-2020-05', 'CC-MAIN-2020-10', 'CC-MAIN-2020-16',
    'CC-MAIN-2020-24', 'CC-MAIN-2020-29', 'CC-MAIN-2020-34',
    'CC-MAIN-2020-40', 'CC-MAIN-2020-45', 'CC-MAIN-2020-50',
    #
    'CC-MAIN-2021-04', 'CC-MAIN-2021-10', 'CC-MAIN-2021-17',
    'CC-MAIN-2021-21', 'CC-MAIN-2021-25', 'CC-MAIN-2021-31',
    'CC-MAIN-2021-39', 'CC-MAIN-2021-43', 'CC-MAIN-2021-49',
]
s3_location = sys.argv[1]
s3_location = s3_location.rstrip('/')  # no trailing slash!

seed_table = sys.argv[2]

crawl_selector = re.compile(sys.argv[3], re.IGNORECASE)

crawls = filter(lambda c: crawl_selector.match(c), crawls)

cursor = connect(s3_staging_dir="{}/staging".format(s3_location),
                 region_name="us-east-1").cursor()
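# Run the join crawl by crawl: every query writes one partition below
# {s3_location}/cc/, and the temporary table is dropped afterwards
# (dropping the table removes only the metadata; the data stays in S3).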
for crawl in crawls:
    query = join_template.format(crawl=crawl,
                                 s3_location="{}/cc".format(s3_location),
                                 db='bigscience',
                                 seed_table=seed_table,
                                 tid='bs')
    logging.info("Athena query: %s", query)

    cursor.execute(query)
    logging.info("Athena query ID %s: %s", cursor.query_id, cursor.result_set.state)
    logging.info("  data_scanned_in_bytes: %d", cursor.result_set.data_scanned_in_bytes)
    logging.info("  total_execution_time_in_millis: %d", cursor.result_set.total_execution_time_in_millis)

    cursor.execute(drop_tmp_table.format(db='bigscience'))
    logging.info("Drop temporary table: %s", cursor.result_set.state)

cc_pseudo_crawl/requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
pyathena
surt
tldextract
cc_pseudo_crawl/sourcing_sheet_seeds/README.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@

# Pseudo-Crawl Data Sourcing Candidate Seeds Spreadsheet

Source: https://docs.google.com/spreadsheets/d/1DNLAGz--qvLh-0qQ7pMPGiNeUMgp-fRgn-8mbLagC7U/edit#gid=513216703 (timestamp 2021-11-28, reverted edits by an anonymous user on record 16, diariovasco.com), exported as [candidate_websites_for_crawling.csv](./candidate_websites_for_crawling.csv)

Steps:

1. run [cleanup-seeds](./cleanup-seeds.ipynb) to prepare a clean seed list

2. do the lookups / table join following the [general instructions](../README.md), using the crawl selector `CC-MAIN-202[01]` to restrict the join to the last two years (see the example command below)

3. prepare the [coverage metrics](./cc-metrics.ipynb)
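
For step 2, the invocation looks like this (assuming it is run from within `cc_pseudo_crawl/`, with `s3://bucket/path` replaced by your own bucket, as in the general instructions):

```
python3 cc_lookup.py s3://bucket/path seeds "CC-MAIN-202[01]"
```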
