Commit a8ffebc

Updated Scraper Reference (markdown)
1 parent 0caa366 commit a8ffebc

1 file changed

Lines changed: 5 additions & 4 deletions

File tree

Scraper-Reference.md

```diff
@@ -10,21 +10,21 @@
 ## Overview

-Starting from a root url, scrapers recursively follow links that match a set of rules, passing each valid response through a chain of filters before writing the file on the local filesystem. They also create an index of the pages' metadata (determined by one filter), which is dumped into a JSON file at the end.
+Starting from a root URL, scrapers recursively follow links that match a set of rules, passing each valid response through a chain of filters before writing the file on the local filesystem. They also create an index of the pages' metadata (determined by one filter), which is dumped into a JSON file at the end.

 Scrapers rely on the following libraries:

 * [Typhoeus](https://github.com/typhoeus/typhoeus) for making HTTP requests
 * [HTML::Pipeline](https://github.com/jch/html-pipeline) for applying filters
 * [Nokogiri](http://nokogiri.org/) for parsing HTML

-There are currently two kinds of scrapers: [`UrlScraper`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/scrapers/url_scraper.rb) which downloads files via HTTP and [`FileScraper`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/scrapers/file_scraper.rb) which reads them from the local filesystem. They function almost identically (both use URLs), except that `FileScraper` substitutes the base url with a local path before reading a file.
+There are currently two kinds of scrapers: [`UrlScraper`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/scrapers/url_scraper.rb) which downloads files via HTTP and [`FileScraper`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/scrapers/file_scraper.rb) which reads them from the local filesystem. They function almost identically (both use URLs), except that `FileScraper` substitutes the base URL with a local path before reading a file. `FileScraper` uses the placeholder `localhost` base URL by default and includes a filter to remove any URL pointing to it at the end.

 To be processed, a response must meet the following requirements:

 * 200 status code
 * HTML content type
-* effective URL (after redirection) contained in the base url (explained below)
+* effective URL (after redirection) contained in the base URL (explained below)

 (`FileScraper` only checks if the file exists and is not empty.)

@@ -34,7 +34,7 @@ Each URL is requested only once (case-insensitive).
 Configuration is done via class attributes and divided into three main categories:

-* [Attributes](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#attributes) — essential information such as name, version, url, etc.
+* [Attributes](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#attributes) — essential information such as name, version, URL, etc.
 * [Filter stacks](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-stacks) — the list of filters that will be applied to each page.
 * [Filter options](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-options) — the options passed to said filters.

@@ -59,6 +59,7 @@ Configuration is done via class attributes and divided into three main categorie
 * `base_url` [String] **(required in `UrlScraper`)**
   The documents' location. Only URLs _inside_ the `base_url` will be scraped. "inside" more or less means "starting with" except that `/docs` is outside `/doc` (but `/doc/` is inside).
+  `FileScraper`'s default is `localhost`. (Note: any iframe, image, or skipped link pointing to localhost will be removed by the `CleanLocalUrls` filter; the value should be overridden if the documents are available online.)
   Unless `root_path` is set, the root/initial URL is equal to `base_url`.

 * `root_path` [String] **(inherited)**
```
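
The response requirements and the `base_url` "inside" rule described in this change can be sketched in plain Ruby. This is a hypothetical re-implementation for illustration only, not devdocs' actual code: the `Response` struct and the helper names are assumptions.

```ruby
# Hedged sketch (not devdocs' actual code) of the response checks:
# 200 status, HTML content type, and effective URL inside the base URL.
Response = Struct.new(:status, :content_type, :effective_url, keyword_init: true)

# "inside" more or less means "starting with", except on a path boundary:
# /docs is outside /doc, but /doc/ is inside.
def inside_base_url?(url, base_url)
  base = base_url.chomp('/')
  url == base || url.start_with?(base + '/')
end

def process_response?(response, base_url)
  response.status == 200 &&
    response.content_type.to_s.include?('html') &&
    inside_base_url?(response.effective_url, base_url)
end

ok = Response.new(status: 200, content_type: 'text/html',
                  effective_url: 'https://example.com/doc/intro')
process_response?(ok, 'https://example.com/doc')   # => true

outside = Response.new(status: 200, content_type: 'text/html',
                       effective_url: 'https://example.com/docs')
process_response?(outside, 'https://example.com/doc')  # => false
```

Checking the boundary on a path separator rather than with a bare prefix match is what makes `/docs` fall outside `/doc` while `/doc/` stays inside.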
