The basics of Scrapy - a Python-based web scraping and crawling framework

  1. Install Scrapy:

    ```
    pip install Scrapy
    ```

    Upgrade Scrapy:

    ```
    pip install --upgrade scrapy
    ```

  2. Start a new project:

    ```
    scrapy startproject <name>
    ```
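
    For example, starting a project with the hypothetical name `tutorial`:

    ```
    scrapy startproject tutorial
    ```

    This creates a `tutorial/` directory with the layout described in the next step.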

  3. Basic Structure:

    - scrapy.cfg: the project configuration file
    - <name>/: the project’s Python module; you’ll later import your code from here
    - <name>/items.py: the project’s items file
    - <name>/pipelines.py: the project’s pipelines file
    - <name>/settings.py: the project’s settings file
    - <name>/spiders/: a directory where you’ll later put your spiders
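
    Items declared in items.py define the structured data a spider extracts. A minimal sketch of what <name>/items.py could look like (the `BookItem` class and its fields are hypothetical, not part of the generated template):

    ```
    import scrapy

    class BookItem(scrapy.Item):
        # Each Field() declares one attribute a spider can populate.
        title = scrapy.Field()
        link = scrapy.Field()
        description = scrapy.Field()
    ```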

  4. Create a spider at **/spiders/_spider.py**:

    ```
    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]

        def parse(self, response):
            # Save each fetched page to a file named after the last URL path segment
            filename = response.url.split("/")[-2]
            with open(filename, 'wb') as f:
                f.write(response.body)
    ```
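
    As written, `parse` only saves raw pages to disk. A common next step is to extract data with selectors and yield items instead; a sketch, assuming the hypothetical `BookItem` from step 3 and the old dmoz.org list markup:

    ```
    # Assumes: from <name>.items import BookItem
    def parse(self, response):
        # One <li> per link on the dmoz listing page (assumed structure)
        for sel in response.xpath('//ul/li'):
            item = BookItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['description'] = sel.xpath('text()').extract()
            yield item
    ```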
    
  5. Run the crawl:

    ```
    scrapy crawl dmoz
    ```
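
    Scraped items can also be written straight to a feed export file; for example (the output filename is arbitrary):

    ```
    scrapy crawl dmoz -o items.json
    ```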

  6. Using selectors in the shell:

    ```
    scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ```
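
    Inside the shell, the fetched page is available as `response`, and selectors can be tried interactively; the XPath/CSS expressions below assume the old dmoz.org list markup:

    ```
    >>> response.xpath('//title/text()').extract()
    >>> response.css('title::text').extract()
    >>> response.xpath('//ul/li/a/@href').extract()
    ```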