HttpRequest and HttpResponse Classes
Let’s try to scrape data from my own blog, www.shaktitanwar.com. We will try to extract all the blog titles on the home page.
There are various facets to scraping:
- First we need the URL that we want to scrape, i.e. shaktitanwar.com.
- What data needs to be scraped? In our case it’s the blog titles.
- Which HTTP verb does the URL use to serve the data? In our case it’s a GET request.
- Which HTTP headers does the website require in order to serve the content properly?
First we need to analyze the requests that are being sent to the server. We can do that using any of the many developer tools available. Some of the popular ones are Firebug, Chrome Developer Tools, IE Developer Tools, Fiddler, Burp Suite and Wireshark.
For this demo I am going to use Fiddler.
You can download Fiddler from the link below:
Once you have downloaded and started Fiddler, open any browser of your choice and make a request to the URL you need to scrape. The request will be logged in Fiddler, and you need to check the request verb and the HTTP headers that are sent to the server with the request. These are shown in the Headers tab, as shown below:
In our case the HTTP headers and verb are:
HTTP Verb: GET
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
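Before moving on to the .NET classes, it can help to write the captured request down as code. The following is only a minimal sketch in Python, used here as a quick stand-in. It builds the request object with the verb and headers captured above but does not send it. The http:// scheme and the names BLOG_URL, CAPTURED_HEADERS and build_request are assumptions made for illustration; they do not come from the Fiddler capture.

```python
from urllib.request import Request

# Assumption: the article does not state the scheme, so plain http is used here.
BLOG_URL = "http://www.shaktitanwar.com/"

# Header values copied from the Fiddler capture above.
CAPTURED_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
    ),
    "Accept-Encoding": "gzip, deflate, sdch",
}


def build_request(url, headers):
    """Build (but do not send) a GET request carrying the captured headers."""
    return Request(url, headers=headers, method="GET")


request = build_request(BLOG_URL, CAPTURED_HEADERS)

# Print the request line and headers to compare them with the Fiddler capture.
print(request.get_method(), request.full_url)
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Note that sending the Accept-Encoding header tells the server that a compressed response is acceptable, so whichever client finally sends this request may have to decompress the body before the blog titles can be read.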
In the next article we will try to fetch the data using .NET classes.