Automating applications is serious business. Most businesses need data from a variety of sources. There are usually two techniques for fetching that data:
1.) Calling Provider API :
This is the white hat approach to getting data from data sources.
It’s more reliable, since the provider publishes the API contracts and they don’t change often.
It usually comes at a price.
Providers don’t expose all of their data via APIs.
2.) Web Scraping :
This is the black hat approach, and one that providers typically forbid in their terms of service. Scraping is the process of extracting data that is publicly available on website pages.
It’s free, and we can scrape whatever data is available on the website.
Providers usually don’t show all of their data on websites, i.e. they show fewer pages than what’s actually available. For example, Google shows only 60 pages of search results out of millions for a particular search.
We have to deal with spam blockers on sites, such as CAPTCHAs and IP blocks.
I will focus on the black hat approach, since there are not enough resources about it on the internet. Scraping can be done in several ways:
- Browser Automation
- Headless automation using tools like Selenium
- Sending socket requests using the native HttpRequest and HttpResponse classes
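
To make the last option concrete, here is a minimal sketch of the request-and-parse style of scraping. It is only an illustration under a few assumptions: it uses Python's standard `urllib.request` and `html.parser` modules rather than the HttpRequest and HttpResponse classes mentioned above, and the names `LinkExtractor`, `extract_links`, `fetch_html` and the `example-scraper/0.1` user agent string are hypothetical, chosen just for this example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def fetch_html(url, user_agent="example-scraper/0.1"):
    """Download a page and return its body as text.

    The user agent value is a made-up placeholder, not a required one.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `extract_links(fetch_html("https://example.com/"))` would then return the link targets found on that page. Keeping the download and the parsing in separate functions means the parsing logic can be tried on saved HTML without touching the network, which helps when a site starts blocking repeated requests.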