We’ll use that to construct the regular expression for the links to follow: /tor/\d+.
We’ll use XPath for selecting the data to extract from the web page HTML source. Let’s take one of those torrent pages:
http://www.mininova.org/tor/2676093
And look at the page HTML source to construct the XPath to select the data we want, which is: torrent name, description and size.
By looking at the page HTML source, we can see that the file name is contained inside a <h1> tag:
<h1>Darwin - The Evolution Of An Exhibition</h1>
An XPath expression to extract the name could be:
//h1/text()
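As a rough illustration of what //h1/text() selects, here is a minimal sketch that uses only the Python standard library instead of Scrapy's selectors. The SAMPLE_PAGE markup and the extract_names helper are hypothetical stand-ins, not the real torrent page or any Scrapy API. ElementTree supports only a subset of XPath and has no text() step, so the sketch reads the .text attribute of each matched element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the torrent page (assumption:
# the real page's markup is not reproduced here).
SAMPLE_PAGE = """
<html>
  <body>
    <h1>Darwin - The Evolution Of An Exhibition</h1>
    <p>Placeholder paragraph</p>
  </body>
</html>
"""


def extract_names(page):
    """Return the text of every <h1> element in a well-formed page."""
    root = ET.fromstring(page.strip())
    # ".//h1" is ElementTree's closest equivalent of "//h1"; the text is
    # read from .text because ElementTree has no text() step.
    return [h1.text for h1 in root.findall(".//h1")]


print(extract_names(SAMPLE_PAGE))
```

Real-world HTML is rarely well-formed XML, which is one reason a scraping library's own selectors are usually a better fit than ElementTree.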
And the description is contained inside a
0 credits | 222 pages | 988.92 KB | 2 years ago 3
Spiders
Write the rules to crawl your websites.
Selectors
Extract the data from web pages using XPath.
Scrapy shell
Test your extraction code in an interactive environment.
Item Loaders
Populate your …
… that to construct the regular expression for the links to follow: /tor/\d+.
We’ll use XPath [http://www.w3.org/TR/xpath] for selecting the data to extract from the web
page HTML source. Let’s take one of those torrent pages:
http://www.mininova.org/tor/2676093
And look at the page HTML source to construct the XPath to select the data we want
which is: torrent name, description and size.
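The link pattern /tor/\d+ mentioned above can be tried out with Python's re module. This is only a sketch: the candidate_links list and the links_to_follow helper are hypothetical, and apart from the torrent page URL quoted in the text, the links are invented for illustration.

```python
import re

# Pattern from the text: "/tor/" followed by one or more digits.
TORRENT_LINK = re.compile(r"/tor/\d+")

# Hypothetical links, as they might be collected from a listing page.
# Only the first URL appears in the text; the others are made up.
candidate_links = [
    "http://www.mininova.org/tor/2676093",
    "http://www.mininova.org/today",
    "http://www.mininova.org/tor/abc",
]


def links_to_follow(links):
    """Keep only the links whose URL contains the torrent page pattern."""
    return [link for link in links if TORRENT_LINK.search(link)]


print(links_to_follow(candidate_links))
```

Only the first link is kept, because it is the only one where "/tor/" is followed by digits.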
By looking at the page HTML
0 credits | 298 pages | 544.11 KB | 2 years ago 3
extended CSS selectors and
XPath expressions, with helper methods to extract using regular expressions.
• An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape … based on XPath or CSS expressions
called Scrapy Selectors. For more information about selectors and other extraction mechanisms see the Selectors
documentation.
Here are some examples of XPath expressions … simple examples of what you can do with XPath, but XPath expressions are indeed much
more powerful. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this
tutorial
0 credits | 244 pages | 1.05 MB | 2 years ago 3
Spiders
Write the rules to crawl your websites.
Selectors
Extract the data from web pages using XPath.
Scrapy shell
Test your extraction code in an interactive environment.
Items
Define the data you … extended CSS selectors and XPath expressions, with helper methods to
extract using regular expressions.
An interactive shell console (IPython aware) for trying out the CSS and XPath
expressions to scrape data … There are several ways to extract data from web pages. Scrapy uses a mechanism
based on XPath [http://www.w3.org/TR/xpath] or CSS [http://www.w3.org/TR/selectors] expressions
called Scrapy Selectors. For more
0 credits | 303 pages | 533.88 KB | 2 years ago 3
for quote in response.css('div.quote'):
    yield {
        'text': quote.css('span.text::text').extract_first(),
        'author': quote.xpath('span/small/text()').extract_first(),
    }
next_page = response.css('li.next a::attr("href")').extract_first()
… extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
• An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape … can try selecting elements using CSS with the response object:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
The result of running response
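To make the extraction pattern in the spider fragment above concrete without installing Scrapy, here is a standard-library analogue. It is a sketch under stated assumptions: QUOTES_PAGE is a hypothetical, well-formed snippet, parse_quotes is an illustrative helper, and ElementTree path expressions replace the response.css() and .xpath() calls. It is not how Scrapy itself evaluates selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (assumption: the real site's
# markup is only approximated here).
QUOTES_PAGE = """
<html><body>
  <div class="quote">
    <span class="text">First quote text</span>
    <span>by <small class="author">Author One</small></span>
  </div>
  <div class="quote">
    <span class="text">Second quote text</span>
    <span>by <small class="author">Author Two</small></span>
  </div>
  <ul><li class="next"><a href="/page/2/">Next</a></li></ul>
</body></html>
"""


def parse_quotes(page):
    """Return (list of quote dicts, next page link or None)."""
    root = ET.fromstring(page.strip())
    quotes = []
    # Rough equivalent of response.css('div.quote')
    for quote in root.findall(".//div[@class='quote']"):
        quotes.append({
            # Rough equivalent of quote.css('span.text::text')
            "text": quote.find("span[@class='text']").text,
            # Rough equivalent of quote.xpath('span/small/text()')
            "author": quote.find("span/small").text,
        })
    # Rough equivalent of response.css('li.next a::attr("href")')
    next_link = root.find(".//li[@class='next']/a")
    next_page = next_link.get("href") if next_link is not None else None
    return quotes, next_page


quotes, next_page = parse_quotes(QUOTES_PAGE)
print(quotes)
print(next_page)
```

Note that ElementTree's [@class='quote'] predicate matches the attribute value exactly, whereas a CSS class selector such as div.quote also matches elements that carry additional classes.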
0 credits | 272 pages | 1.11 MB | 2 years ago 3
We’ll use that to construct the regular expression for the links to follow: /tor/\d+.
We’ll use XPath for selecting the data to extract from the web page HTML source. Let’s take one of those torrent pages:
http://www.mininova.org/tor/2676093
And look at the page HTML source to construct the XPath to select the data we want, which is: torrent name, description and size.
By looking at the page HTML source, we can see that the file name is contained inside a <h1> tag:
<h1>Darwin - The Evolution Of An Exhibition</h1>
An XPath expression to extract the name could be:
//h1/text()
And the description is contained inside a
0 credits | 199 pages | 926.97 KB | 2 years ago 3
for quote in response.css('div.quote'):
    yield {
        'text': quote.css('span.text::text').extract_first(),
        'author': quote.xpath('span/small/text()').extract_first(),
    }
next_page = response.css('li.next a::attr("href")').extract_first()
… extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
• An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape … can try selecting elements using CSS with the response object:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
The result of running response
0 credits | 266 pages | 1.10 MB | 2 years ago 3