Websites generally want to differentiate bots from human users and, ideally, block those that are malicious or not useful. Some verticals require considerably more scraping effort than others. Below we outline some of our experiences in bullet points.
How do sites ideally protect themselves against scraping/crawling?
- Putting a strict rate limit on data-center IPs/non-ISP IPs
- Putting a reasonable rate limit on all other IPs
- JS cross-check on all visitors (except friendly bots; e.g. Googlebot can be verified and whitelisted)
- Checking browser profile
- Checking user behaviour on site (e.g. mouse movement, typing behaviour, scrolling)
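The first two points above (stricter limits for data-center IPs, looser ones for everyone else) can be sketched as a simple fixed-window rate limiter. The per-minute budgets below are made-up assumptions; real services tune limits per endpoint and combine them with the other signals in the list.

```python
import time
from collections import defaultdict

# Illustrative per-minute budgets (assumptions, not real-world values).
DATACENTER_LIMIT = 10    # strict limit for data-center / non-ISP IPs
RESIDENTIAL_LIMIT = 120  # more generous limit for all other IPs

class RateLimiter:
    """Fixed-window rate limiter keyed by IP, with a stricter
    budget for IPs classified as data-center ranges."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.counters = defaultdict(lambda: [0.0, 0])  # ip -> [window_start, count]

    def allow(self, ip, is_datacenter, now=None):
        now = time.time() if now is None else now
        limit = DATACENTER_LIMIT if is_datacenter else RESIDENTIAL_LIMIT
        start, count = self.counters[ip]
        if now - start >= self.window:     # window expired: start a new one
            self.counters[ip] = [now, 1]
            return True
        if count < limit:
            self.counters[ip][1] = count + 1
            return True
        return False                       # over budget: block or challenge
```

A real deployment would back this with a shared store (e.g. Redis) instead of in-process memory, but the decision logic is the same.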
How do travel sites typically protect against scraping?
- Block all wget/PhantomJS/other headless-browser requests
- JS support needed
- Real user session must be impersonated
- Datacenter IPs blocked completely
- Tier 1 IPs highly recommended
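Impersonating a real user session, as the list above requires, starts with browser-like headers and persistent cookies. A minimal stdlib sketch (header values are illustrative; sites also fingerprint header ordering and the TLS stack, and neither `urllib` nor `requests` executes JS, so JS-gated sites need a real browser such as Selenium or Playwright):

```python
import http.cookiejar
import urllib.request

def make_browser_session():
    """Return a urllib opener that sends browser-like headers and keeps
    cookies across requests, like a real user session would."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [
        # Example Chrome-on-Windows user agent; rotate and keep current.
        ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"),
        ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
        ("Accept-Language", "en-US,en;q=0.9"),
    ]
    return opener
```

Usage would be `make_browser_session().open(url)`; the cookie jar then carries session cookies into subsequent requests automatically.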
How does Google do bot-blocking?
- Allows a certain rate of all request types
- User agents are important
- Highly sophisticated bot detection
- Datacenter IPs get permanently blocked if the allowed rate is exceeded
- Tier 1 IPs recommended (unless you have a very large pool of datacenter IPs to burn)
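Since Google tolerates a certain request rate but permanently blocks datacenter IPs that exceed it, the safe client-side counterpart is to throttle requests per host. A minimal sketch (the delay values are assumptions, and `sleep`/`clock` are injectable only to make the logic testable):

```python
import random
import time

class Throttle:
    """Enforce a minimum (jittered) delay between requests to the same
    host, so the crawl stays under the rate the site tolerates."""

    def __init__(self, min_delay=5.0, jitter=2.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.min_delay = min_delay  # base seconds between requests (assumed value)
        self.jitter = jitter        # random extra delay to avoid a robotic cadence
        self.sleep = sleep
        self.clock = clock
        self.last = {}              # host -> timestamp of last request

    def wait(self, host):
        """Block until it is polite to hit `host` again, then record the hit."""
        now = self.clock()
        last = self.last.get(host)
        if last is not None:
            delay = self.min_delay + random.uniform(0, self.jitter)
            remaining = delay - (now - last)
            if remaining > 0:
                self.sleep(remaining)
        self.last[host] = self.clock()
```

Calling `throttle.wait("www.google.com")` before each fetch keeps a single IP's request rate steady; with a pool of IPs you would apply the same budget per IP.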
What players are there in bot-blocking and anti-scraping technologies?
- Distil Networks, now part of Imperva, is one of the veterans of the industry and the market leader, with more than 100k sites using its service according to BuiltWith
- BuiltWith shows about 4k sites using DataDome
- For ShieldSquare, BuiltWith shows about 12k sites using their bot-detection services
- Custom-made solutions usually rely on lists of datacenter IPs that are publicly available on the internet
- CAPTCHA services like reCAPTCHA are often used in combination with Distil, ShieldSquare & Co.
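The "custom-made" approach above (matching visitors against published datacenter IP lists) can be sketched with the stdlib `ipaddress` module. The CIDR blocks below are illustrative placeholders from the reserved TEST-NET ranges, not a real provider list:

```python
import ipaddress

# Placeholder ranges standing in for a real, regularly updated list of
# cloud/datacenter CIDR blocks (e.g. published provider IP ranges).
DATACENTER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),     # TEST-NET-1 (placeholder)
    ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2 (placeholder)
]

def is_datacenter_ip(ip_str):
    """Return True if the IP falls inside any known datacenter range."""
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in DATACENTER_RANGES)
```

For large lists, a linear scan is too slow; production blockers typically compile the ranges into a radix/prefix tree for O(prefix-length) lookups.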