Bot Detection

Websites generally want to distinguish bots from human users and, ideally, block those that are malicious or not useful. How much effort scraping takes depends on the vertical: some types of sites are considerably harder to scrape than others. Below we outline some of our experiences in bullet points.

How do sites ideally protect themselves against scraping/crawling?

Putting a strict rate limit on data-center IPs/non-ISP IPs
Putting a reasonable rate limit on all other IPs
JS cross-check on all visitors (except friendly bots; Googlebot, for example, can be identified)
Checking browser profile
Checking user behaviour on site (e.g. mouse movement, typing behaviour, scrolling)
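
The two rate-limit bullets above can be sketched as a per-IP sliding-window counter with a stricter tier for data-center IPs. This is a minimal illustration, not a description of any particular vendor's implementation: the limits, the window length, the class name SlidingWindowLimiter and the is_datacenter flag (assumed to come from some separate IP classification step) are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque

# Hypothetical limits: requests allowed per window. Tune for the real site.
STRICT_LIMIT = 10      # data-center / non-ISP IPs
DEFAULT_LIMIT = 120    # all other IPs
WINDOW_SECONDS = 60.0


class SlidingWindowLimiter:
    """Per-IP sliding-window counter with a stricter tier for data-center IPs."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock                # injectable so the limiter can be tested
        self._hits = defaultdict(deque)    # ip -> timestamps of recent requests

    def allow(self, ip: str, is_datacenter: bool) -> bool:
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= WINDOW_SECONDS:
            hits.popleft()
        limit = STRICT_LIMIT if is_datacenter else DEFAULT_LIMIT
        if len(hits) >= limit:
            return False                   # over the limit: reject this request
        hits.append(now)
        return True
```

A request handler would call allow(ip, is_datacenter) once per request and answer with an error status (for example HTTP 429) when it returns False. State is kept in memory here; a multi-server deployment would need a shared store instead.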

How do travel sites typically avoid scraping?

All wget, PhantomJS and other headless-browser requests are blocked
JavaScript support is required
Real user session must be impersonated
Datacenter IPs blocked completely
Tier 1 IPs highly recommended
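
As a rough illustration of the first bullet above, the simplest form of blocking wget, PhantomJS and headless browsers is a User-Agent screen. The pattern list and the function name looks_automated are assumptions for this sketch. Real protection layers combine such a check with JavaScript and behaviour checks, since a User-Agent header is trivial to spoof.

```python
import re
from typing import Optional

# Illustrative, non-exhaustive patterns (an assumption, not a vendor list).
AUTOMATION_UA = re.compile(
    r"wget|curl|phantomjs|headlesschrome|python-requests",
    re.IGNORECASE,
)


def looks_automated(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent is missing or names a known automation tool."""
    if not user_agent or not user_agent.strip():
        return True  # browsers always send a User-Agent; treat absence as suspicious
    return AUTOMATION_UA.search(user_agent) is not None
```

This only catches clients that announce themselves, which is why the remaining bullets (JavaScript support, a realistic user session, non-datacenter IPs) matter in practice.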

How does Google do bot-blocking?

Allows a certain rate of all request types
User agents are important
Highly sophisticated bot detection
Datacenter IPs get a permanent block if the allowed rate is exceeded
Tier 1 IPs recommended (unless you have a very large pool of datacenter IPs to burn)

What players are there in bot-blocking and anti scraping technologies?

Distil Bot Discovery, now Imperva, is one of the veterans of the industry. It is the market leader, with more than 100k sites using its service according to BuiltWith
BuiltWith shows about 4k sites using DataDome
For ShieldSquare, BuiltWith shows 12k sites using its bot detection services
Custom-made solutions usually rely on lists of datacenter IPs that are available on the internet
CAPTCHA services like reCAPTCHA are often used in combination with Distil, ShieldSquare & Co.
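
To make the custom-solution bullet above more concrete, here is a minimal sketch of checking a visitor's address against a list of datacenter IP ranges. The ranges below are placeholders taken from reserved documentation blocks, not real provider ranges, and the names DATACENTER_RANGES and is_datacenter_ip are hypothetical. A real deployment would load and regularly refresh a published list.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges), NOT real datacenter ranges.
DATACENTER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_datacenter_ip(ip: str) -> bool:
    """Return True if the address falls inside any listed datacenter range."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # unparsable input is simply treated as "not listed"
    return any(addr in network for network in DATACENTER_RANGES)
```

A linear scan is fine for a short list. With many thousands of ranges, a sorted structure or prefix tree would be the more usual choice.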
