Basics Of Scrapy
The basics of Scrapy - a Python-based web scraping and crawling framework
-
Install Scrapy:
```
pip install Scrapy
```
Upgrade Scrapy:
```
pip install --upgrade scrapy
```
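To verify the installation (or see which version an upgrade gave you), Scrapy provides a version subcommand:
```
scrapy version
```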
-
Start a new project:
```
scrapy startproject <name>
```
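For example, with `tutorial` as the project name (the name is arbitrary; `tutorial` is used here only for illustration):
```
scrapy startproject tutorial
```
This generates a `tutorial/` directory with the layout described below.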
-
Basic Structure:
- scrapy.cfg: the project configuration file
- <name>/: the project's Python module; you'll later import your code from here
- <name>/items.py: the project's items file (a minimal sketch follows this list)
- <name>/pipelines.py: the project's pipelines file
- <name>/settings.py: the project's settings file
- <name>/spiders/: a directory where you'll later put your spiders
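As a rough idea of what items.py can hold, here is a minimal sketch matching the dmoz example below (the DmozItem class and its fields are assumptions for this example, not something startproject generates):
```
import scrapy

# Each Field() declares one attribute a spider can fill in
class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
```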
-
Create a spider:
In <name>/spiders/, create a file such as dmoz_spider.py:
```
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Write each downloaded page to a file named after
        # the second-to-last segment of its URL
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
```
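Instead of saving raw pages, parse() can extract structured data with selectors and yield it. A minimal sketch of such a variant (the class name, spider name, and XPath expressions are assumptions for illustration; the actual dmoz.org markup may differ):
```
import scrapy

class DmozItemSpider(scrapy.Spider):
    name = "dmoz_items"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        # Each matching <li> becomes one scraped item; plain dicts work,
        # or a DmozItem (as sketched above) could be yielded instead
        for sel in response.xpath('//ul/li'):
            yield {
                'title': sel.xpath('a/text()').extract_first(),
                'link': sel.xpath('a/@href').extract_first(),
            }
```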
-
Run the crawl (from the project's top-level directory):
```
scrapy crawl dmoz
```
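The dmoz spider above writes files itself, but when a spider yields items (like the dict-yielding sketch above), Scrapy's built-in feed exports can serialize them, e.g. to JSON with the -o flag:
```
scrapy crawl dmoz_items -o items.json
```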
-
Using selectors in the shell:
```
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
```
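Once the shell starts, it exposes a ready-made response object, so selector expressions can be tried interactively (the queries below are illustrative; title queries are generic enough to match most pages):
```
response.xpath('//title/text()').extract()        # all matches as a list
response.xpath('//title/text()').extract_first()  # first match, or None
response.css('title::text').extract_first()       # the same query via CSS
```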