## Overview

Starting from a root URL, scrapers recursively follow links that match a set of rules, passing each valid response through a chain of filters before writing the file on the local filesystem. They also create an index of the pages' metadata (determined by one filter), which is dumped into a JSON file at the end.

Scrapers rely on the following libraries:

* [Typhoeus](https://github.com/typhoeus/typhoeus) for making HTTP requests
* [HTML::Pipeline](https://github.com/jch/html-pipeline) for applying filters
* [Nokogiri](http://nokogiri.org/) for parsing HTML

There are currently two kinds of scrapers: [`UrlScraper`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/scrapers/url_scraper.rb), which downloads files via HTTP, and [`FileScraper`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/scrapers/file_scraper.rb), which reads them from the local filesystem. They function almost identically (both use URLs), except that `FileScraper` substitutes the base URL with a local path before reading a file. `FileScraper` uses the placeholder `localhost` base URL by default and includes a filter to remove any URL pointing to it at the end.
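The substitution `FileScraper` performs can be pictured with a minimal sketch (illustrative only — the method, constant names, and directory below are hypothetical, not DevDocs' actual code):

```ruby
# Hypothetical sketch: a FileScraper-style class reads files instead of making
# HTTP requests, by swapping the base URL prefix for a local directory.
BASE_URL   = 'localhost'     # FileScraper's placeholder base URL
SOURCE_DIR = '/path/to/docs' # hypothetical local copy of the documents

# Map a scraped URL to the corresponding path on disk.
def local_path(url)
  url.sub(BASE_URL, SOURCE_DIR)
end

# Reading a "response" is then just a file read (no HTTP involved).
def read_doc(url)
  File.read(local_path(url))
end
```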

To be processed, a response must meet the following requirements:

* 200 status code
* HTML content type
* effective URL (after redirection) contained in the base URL (explained below)

(`FileScraper` only checks if the file exists and is not empty.)
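The requirements above can be summarized in a small sketch (the helper names are assumptions for illustration, not DevDocs' actual implementation):

```ruby
# Illustrative check mirroring the three requirements listed above.
def process_response?(status, content_type, effective_url, base_url)
  status == 200 &&                        # 200 status code
    content_type.to_s.include?('html') && # HTML content type
    inside?(effective_url, base_url)      # effective URL inside the base URL
end

# "inside" means starting with the base URL at a path-segment boundary:
# '/doc/' and '/doc/a' are inside '/doc', but '/docs' is not.
def inside?(url, base)
  url == base || url.start_with?(base.chomp('/') + '/')
end
```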

Each URL is requested only once (case-insensitive).

Configuration is done via class attributes and divided into three main categories:

* [Attributes](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#attributes) — essential information such as name, version, URL, etc.
* [Filter stacks](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-stacks) — the list of filters that will be applied to each page.
* [Filter options](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-options) — the options passed to said filters.
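Putting the three categories together, a scraper definition might look roughly like this (a sketch: the class, filter, and option names are hypothetical, and the `html_filters`/`options` attributes are assumed to follow the conventions described in the sections linked above):

```ruby
module Docs
  class MyDoc < UrlScraper
    # Attributes: essential information such as name, version, URL, etc.
    self.name = 'MyDoc'
    self.base_url = 'https://example.com/docs/'

    # Filter stack: the list of filters applied to each page
    # (filter names are illustrative).
    html_filters.push 'my_doc/entries', 'my_doc/clean_html'

    # Filter options: the options passed to said filters.
    options[:skip] = %w(changelog.html)
  end
end
```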

## Attributes
* `base_url` [String] **(required in `UrlScraper`)**

  The documents' location. Only URLs _inside_ the `base_url` will be scraped. "inside" more or less means "starting with", except that `/docs` is outside `/doc` (but `/doc/` is inside).

  `FileScraper`'s default is `localhost`. (Note: any iframe, image, or skipped link pointing to `localhost` will be removed by the `CleanLocalUrls` filter; the value should be overridden if the documents are available online.)

  Unless `root_path` is set, the root/initial URL is equal to `base_url`.
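  As a sketch of that rule (an assumed helper, not DevDocs' code — the actual joining logic may differ, and the assumption here is that a `root_path` such as `/index.html` is simply appended to `base_url`):

  ```ruby
  # Assumed illustration: derive the initial URL from base_url and an
  # optional root_path (e.g. '/index.html').
  def initial_url(base_url, root_path = nil)
    root_path ? base_url.chomp('/') + root_path : base_url
  end
  ```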