How to do basic web scraping using Scrapy on a Windows Azure virtual machine

In this tutorial, I will walk through how to scrape links and titles from a website using Scrapy on an Azure virtual machine.

First, create a VM through the Azure portal. It can be either Windows or Linux, since Scrapy runs on both. This tutorial uses Ubuntu 13.04.

*[Screenshot: scrapy-azure-1]*

After the VM is created and running, go to its Dashboard and copy the public IP address.

*[Screenshot: scrapy-azure-2]*

Use your favorite SSH client to connect to the VM. The username is azureuser.
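For example, from a terminal (`<public-ip>` is a placeholder for the IP address you copied):

```
ssh azureuser@<public-ip>
```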

*[Screenshot: scrapy-azure-3]*

After you connect, use pip or easy_install to install Scrapy:

```
pip install Scrapy
```

or

```
easy_install Scrapy
```

On Ubuntu 13.04, make sure to install python-dev first with `sudo apt-get install python-dev`, or the installation will fail with an error.

In this tutorial, we are going to scrape the titles of and links to Medium's collections, since Medium doesn't provide a list of all collections. Start by creating a new Scrapy project:
```
scrapy startproject medium
```
This will give you a folder structure in your home directory.
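The generated layout looks roughly like this (Scrapy 0.16-era defaults; exact files may vary by version):

```
medium/
    scrapy.cfg        # project configuration file
    medium/
        __init__.py
        items.py      # item definitions go here
        pipelines.py
        settings.py
        spiders/
            __init__.py
```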
Open ~/medium/medium/items.py with your favorite text editor (I use nano, don't judge me!) and define your item class. In this case, we have title and link as our fields.
[![scrapy-azure-4](http://sertacozercan.com/wp-content/uploads/scrapy-azure-4-150x106.png)](http://sertacozercan.com/wp-content/uploads/scrapy-azure-4.png)
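As a rough sketch of what the screenshot shows (the class name MediumItem is my assumption), items.py would look something like this:

```python
# ~/medium/medium/items.py
from scrapy.item import Item, Field

class MediumItem(Item):
    title = Field()  # title of the collection
    link = Field()   # link to the collection
```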
Now, we want to create a spider to scrape the page. Create a new Python file with a name of your choosing in ~/medium/medium/spiders/ (for example, mediumspider.py).
Import the BaseSpider and XPath selector classes, since these are what we are going to use. Also, import the item class you defined in the last step.
Give the spider a name, an allowed domain, and the page you want to scrape.
Define a parse method that scrapes every link inside a specific class name (collection-item in this case), extracts specific pieces (the text inside div/h3 and the href attribute of the link itself), and appends each result to a list called items.
[![scrapy-azure-5](http://sertacozercan.com/wp-content/uploads/scrapy-azure-5-150x106.png)](http://sertacozercan.com/wp-content/uploads/scrapy-azure-5.png)
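A minimal sketch of such a spider, assuming the Scrapy 0.16-era API (BaseSpider and HtmlXPathSelector) and the MediumItem class from above; the start URL and exact XPath expressions are assumptions based on the description:

```python
# ~/medium/medium/spiders/mediumspider.py
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from medium.items import MediumItem

class MediumSpider(BaseSpider):
    name = "medium"                       # used by "scrapy crawl medium"
    allowed_domains = ["medium.com"]
    start_urls = ["https://medium.com/"]  # page to scrape; an assumption

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        # every link with the collection-item class
        for sel in hxs.select('//a[@class="collection-item"]'):
            item = MediumItem()
            item['title'] = sel.select('div/h3/text()').extract()  # text inside div/h3
            item['link'] = sel.select('@href').extract()           # href of the link itself
            items.append(item)
        return items
```

(On newer Scrapy versions you would subclass scrapy.Spider and use response.xpath() instead.)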
From the project directory (~/medium), you can run this with

```
scrapy crawl medium
```
or you can [export the output to a file](http://doc.scrapy.org/en/0.16/topics/feed-exports.html#topics-feed-exports) in a format such as JSON, XML, or CSV with

```
scrapy crawl medium -o items.json -t json
```

You can schedule a cron job to automate this process. Please post a comment if you found this helpful or have suggestions on how to improve it to help others.
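For example, a crontab entry along these lines (added with `crontab -e`) would run the spider every day at 3 AM; the project path is an assumption, so adjust it for your VM:

```
# run the Medium spider daily at 03:00 (path is an assumption)
0 3 * * * cd /home/azureuser/medium && scrapy crawl medium -o items.json -t json
```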