
Some verticals typically require more effort to scrape than others. Here are some of our experiences, in bullet points.

How do sites ideally protect themselves against scraping/crawling?

  • Putting a strict rate limit on datacenter IPs / non-ISP IPs

  • Putting a reasonable rate limit on all other IPs

  • JS cross-check on all visitors (except friendly bots, e.g. Google bot can be detected)

  • Checking browser profile

  • Checking user behaviour on site (e.g. mouse movement, typing behaviour, scrolling)
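The first two points above can be sketched as a per-IP rate limiter with a different budget per IP class. The limits and class names below are illustrative assumptions, not values any real service publishes:

```python
import time
from collections import defaultdict

# Hypothetical per-class budgets (requests per minute). Real services tune
# these and combine them with JS cross-checks and behaviour analysis.
LIMITS = {"datacenter": 10, "residential": 120}

class RateLimiter:
    """Sliding-window rate limiter: strict for datacenter IPs,
    reasonable for everything else."""

    def __init__(self, limits, window=60.0):
        self.limits = limits
        self.window = window
        self.hits = defaultdict(list)  # ip -> list of request timestamps

    def allow(self, ip, ip_class, now=None):
        now = time.time() if now is None else now
        # Drop timestamps that fell out of the sliding window.
        recent = [t for t in self.hits[ip] if t > now - self.window]
        self.hits[ip] = recent
        if len(recent) >= self.limits[ip_class]:
            return False  # over budget for this IP class
        recent.append(now)
        return True
```

A datacenter IP exhausts its budget an order of magnitude faster than a residential one, which is exactly the asymmetry the first two bullets describe.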

How travel sites typically avoid being crawled

  • Block all wget/PhantomJS/headless-browser requests

  • JS support needed

  • Real user session must be impersonated

  • Datacenter IPs blocked completely

  • Tier 1 IPs highly recommended
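To illustrate the first three points, here is a minimal sketch of both sides: the browser-like header set a scraper would need to send, and the crude tool-detection check a site might run against it. All header values and tool markers are illustrative assumptions, not any vendor's actual rules:

```python
# A real user session sends a full browser header set; wget-style requests
# send a telltale User-Agent and almost nothing else. Values are examples
# only and go stale as browsers update.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

def looks_like_script(headers):
    """Crude server-side check a travel site might run: flag requests
    whose User-Agent matches known tools or that lack the headers every
    real browser sends."""
    ua = headers.get("User-Agent", "").lower()
    tool_markers = ("wget", "curl", "phantomjs", "headless", "python-requests")
    if any(marker in ua for marker in tool_markers):
        return True
    required = ("Accept", "Accept-Language")
    return not all(h in headers for h in required)
```

Note that headers alone only get you past the first bullet; the "JS support needed" bullet means the page must also be rendered in an environment that executes JavaScript.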

How does Google do bot-blocking?

  • Allows a certain rate for all request types

  • User agents are important

  • Highly sophisticated bot detection

  • Datacenter IPs get permanent block if allowed rate exceeded

  • Tier 1 IPs recommended (unless you have a shit load of datacenter IPs to burn)
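Because exceeding the allowed rate gets a datacenter IP blocked permanently, scrapers working against such a target typically rotate through a pool and retire burned IPs rather than retrying them. A minimal sketch (class and method names are our own):

```python
class IPPool:
    """Rotating proxy pool: since a permanently blocked datacenter IP
    never recovers, a burned IP is removed from rotation for good."""

    def __init__(self, ips):
        self.ips = list(ips)
        self.i = 0

    def next_ip(self):
        """Return the next IP in round-robin order."""
        if not self.ips:
            raise RuntimeError("all IPs burned")
        ip = self.ips[self.i % len(self.ips)]
        self.i += 1
        return ip

    def mark_blocked(self, ip):
        """Retire an IP after observing a permanent block
        (e.g. a CAPTCHA wall or hard 403)."""
        if ip in self.ips:
            self.ips.remove(ip)
            self.i = 0
```

This is the "shit load of datacenter IPs to burn" strategy in code form: the pool only shrinks, which is why Tier 1 / residential IPs are the recommended alternative.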

Who are the players in bot-blocking and anti-scraping technologies?

  • Distil bot detection, now part of Imperva, is one of the veterans of the industry. They are the market leader, with more than 100k sites using their service according to BuiltWith

  • BuiltWith shows about 4k sites using DataDome

  • For ShieldSquare, BuiltWith shows about 12k sites using their bot detection services

  • Custom-made solutions usually rely on lists of datacenter IPs that are freely available on the internet

  • CAPTCHA services like reCAPTCHA are often used in combination with Distil, ShieldSquare & Co.
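The custom-made approach from the list above can be sketched with Python's standard `ipaddress` module: match each visitor's IP against published datacenter CIDR ranges. The two ranges below are illustrative samples, not a usable blocklist; real lists contain thousands of networks:

```python
import ipaddress

# Illustrative CIDR blocks only (assumed, cloud-provider-style ranges);
# a production blocklist would be built from published provider IP lists.
DATACENTER_RANGES = [
    ipaddress.ip_network("3.0.0.0/9"),     # sample AWS-style range
    ipaddress.ip_network("34.64.0.0/10"),  # sample GCP-style range
]

def is_datacenter_ip(ip):
    """Return True if the IP falls inside any known datacenter range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)
```

This is cheap to run on every request, which is why it is the typical first layer of home-grown bot blocking before anything behavioural is added.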

Bot detection explained

Proxies for data scraping & aggregation make competitor & market monitoring simpler than ever! Forget about getting blocked while crawling reviews, prices & product data; remain anonymous with the smartest proxies on the market. Ramp up your business with residential IPs & precise geo-targeting. Simple and API-based.