Merge branch 'main' of github.com:niespodd/browser-fingerprinting into main
189
README.md
@@ -1,25 +1,172 @@
# Browser Fingerprinting, Bot Detection 👨‍🔧
# Avoiding bot detection: How to scrape the web without getting blocked? 👨‍🔧

Whether you're just starting to build a web scraper from scratch and wondering why your solution isn't working, or you've been working with crawlers for a while and are stuck on a page that tells you you're a bot and won't let you go any further, keep reading.

Anti-bot solutions have evolved in recent years. More and more websites are introducing security measures: from simple ones, such as filtering IP addresses according to their geolocation, to advanced ones based on in-depth analysis of browser parameters and behavioral analysis. All this makes scraping web content more difficult and costly than it was a few years ago. Nevertheless, it is still possible. Here I highlight a few tips that you may find helpful.

## Where to begin building an undetectable bot?

Below you can find a curated list of services that I have used to get around different anti-bot protections. Depending on your use case, you may need one of the following:

| Scenario/use case | Solution | Example |
| - | - | - |
| **Short-lived sessions without auth** | Pool of rotating IP addresses | This comes in handy when you scrape websites like Amazon, Walmart or public LinkedIn pages, that is, any website where no sign-in is required. You plan to make a high number of short-lived sessions and can afford being blocked every now and then. |
| **Geographically restricted websites** | Region-specific pool of IP addresses | This is useful when the website uses a firewall, such as [the one from Cloudflare, to block an entire geography](https://community.cloudflare.com/t/blocking-entire-countries/24172/8) from accessing it. |
| **Long-lived sessions after sign-in** | Repeatable pool of IP addresses and a stable set of browser fingerprints | The most common scenario here is social media automation, e.g. you build a tool to automate social media accounts to manage ads more efficiently. |
| **JavaScript-based detection** | Use of popular evasion libraries, such as [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth) | A number of websites use [FingerprintJS](https://fingerprintjs.com/), which can be easily bypassed when you employ open-source plugins such as the aforementioned Puppeteer stealth plugin alongside your existing software. |
| **Detection with browser fingerprinting techniques** | Natural-looking browser fingerprints, that is, fingerprints that cover the whole surface validated by the JavaScript solution installed on the target website. | This is one of the most advanced cases. Mainstream examples are credit card processors such as Adyen or Stripe. A very sophisticated browser fingerprint is created to detect credit fraud or to prompt additional authorization from the user. |
| **Unique set of detection techniques** | Specialized bot software that targets the unique detection surface of the target website. | Good examples are [sneaker marketplace websites and e-commerce shops, reportedly under heavy attack from custom-made bot software](https://www.businessinsider.com/sneaker-bots-how-to-buy-make-and-run-the-tech-2021-1). |
| **Simple custom-made detection techniques** | Before diving into any of the above, if you are targeting a smaller website, it is very likely that all you need is a [Scrapy script with tweaks](https://www.zyte.com/blog/how-to-scrape-the-web-without-getting-blocked/) and a cheap data-center proxy, and you are good to go. | - |
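
To make the difference between a rotating pool (short-lived sessions) and a repeatable pool (long-lived sessions) from the table above more concrete, here is a minimal Python sketch. The proxy URLs and the helper names `rotating_proxies` and `sticky_proxy` are hypothetical and only serve as an illustration; a real provider will usually give you its own endpoints and session parameters.

```python
import hashlib
import itertools
from typing import Dict, Iterator, List


def rotating_proxies(proxy_urls: List[str]) -> Iterator[Dict[str, str]]:
    """Cycle through the pool forever: a different proxy for each short-lived session."""
    for url in itertools.cycle(proxy_urls):
        yield {"http": url, "https": url}


def sticky_proxy(session_key: str, proxy_urls: List[str]) -> Dict[str, str]:
    """Map the same session key (e.g. an account name) to the same proxy on every call."""
    digest = hashlib.sha256(session_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(proxy_urls)
    url = proxy_urls[index]
    return {"http": url, "https": url}


# Hypothetical proxy endpoints, for illustration only.
POOL = [
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
]

rotation = rotating_proxies(POOL)
first, second = next(rotation), next(rotation)  # two consecutive sessions, two different exits
pinned = sticky_proxy("account-42", POOL)       # the same exit every time for this account
```

Both helpers return mappings in the shape that the `proxies=` argument of `requests.get` accepts, so they can be plugged into an existing script. Note that the sticky mapping changes when the pool itself changes, so a long-lived account may need an explicit account-to-proxy table instead.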

Once you have decided which type of evasion your project needs, you can use the list below to pick the most suitable provider:

### Recommended services

<table>
<thead>
<tr>
<th>Type</th>
<th width=50%>Service</th>
<th>Note</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=3><b>Proxy</b></td>
<td>
<a href="https://cutt.ly/VRkFS7T">
<b>BrightData (formerly Luminati Networks)</b><br>
<img src="./assets/brightdata.png">
</a>
</td>
<td>
One of the most reliable, stable and recommended proxy providers. Best to begin there and, if it turns out to be too pricey, move to cheaper alternatives.
</td>
</tr>

<tr>
<td>
<a href="https://cutt.ly/GRkG2uZ">
<b>Global Peer to Business Proxy Network - infatica.io</b><br>
<img src="./assets/infatica.png">
</a>
</td>
<td>
An alternative to BrightData that is <b>three times cheaper</b>; however, do mind their terms of use.
</td>
</tr>

<tr>
<td>
<a href="#">
<b>Oxylabs</b><br>
<img src="./assets/oxylabs.png">
</a>
</td>
<td>
A competitor to BrightData with a very similar pricing model. Rumor has it that they have a better TCP fingerprint masking mechanism in place.
</td>
</tr>

<tr>
<td rowspan=2>
<b>Scraping as a service</b>
</td>
<td>
<a href="https://cutt.ly/VRkHvnL">
<b>ScrapingBee</b><br>
<img src="./assets/scrapingbee.png">
</a>
</td>
<td>
One of the most advanced stealth scraping-as-a-service providers. At times it may be cheaper than building a dedicated scraping solution, as they do not charge for the amount of traffic used.
</td>
</tr>
<tr>
<td>
<a href="https://cutt.ly/8RkGETc">
<b>Apify.io</b><br>
<img src="./assets/apify.png">
</a>
</td>
<td>
Handy when your project is about one-off scraping. Their data understanding algorithm makes extracting data a breeze.
</td>
</tr>

<tr>
<td>
<b>De-captcha as a service</b>
</td>
<td>
<a href="https://cutt.ly/NRkGtmo">
<b>Anti Captcha: Captcha Solving Service. Bypass reCAPTCHA, FunCaptcha (...)</b><br>
<img src="./assets/anticaptcha.png">
</a>
</td>
<td>
Self-explanatory. Bitcoin accepted ❤️.
</td>
</tr>

</tbody>
</table>

## List of anti-bot software providers

This is a non-exhaustive list of companies that provide the most advanced anti-bot solutions for businesses ranging from smaller e-commerce sites to Fortune 500 companies:

Here I study various aspects of existing evasion techniques to get around anti-bot systems. The technical findings that I am sharing below are based on observations from running web scraping scripts for a few months against websites protected by:
* [Akamai Bot Manager by Akamai](https://www.akamai.com/uk/en/products/security/bot-manager.jsp)
* [Advanced Bot Protection by Imperva (formerly Distil Networks)](https://www.imperva.com/products/advanced-bot-protection-management/)
* [DataDome](https://datadome.co/)
* [Advanced Bot Protection by Imperva](https://www.imperva.com/products/advanced-bot-protection-management/) (formerly Distil Networks)
* [DataDome Bot Protection](https://datadome.co/bot-protection/)
* [PerimeterX](https://www.perimeterx.com/)
* [Shape Security](https://www.shapesecurity.com/)
* [Cloudflare Bot Management](https://www.cloudflare.com/en-gb/products/bot-management/)
* [Barracuda Advanced Bot Protection](https://www.barracuda.com/products/advanced-bot-protection)
* [HUMAN](https://www.humansecurity.com/products/platform)
* [Kasada](https://www.kasada.io/)
* [Alibaba Cloud Anti-Bot Service](https://www.alibabacloud.com/products/antibot)
* [Travatar](https://travatar.ai/)

and a few other custom-built ones (incl. social media platforms). [Having trouble bypassing one of them?](#support)
### How do you know who is getting you blocked?

---
<img src="./assets/botty_mcbotface.png">

**Looking for a stellar web scraping service?** Check out the ScrapingBee service, which runs in the cloud with no extra charges for traffic from premium and residential proxies, and has battle-tested anti-fingerprinting features.

Join [extra.community](https://extra.community/). It runs an automated tester, **Botty McBotface**, that uses several sophisticated techniques to determine exactly which protection a tested website uses (credits to [berstend](https://github.com/berstend) and others from #insiders).
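
As a rough first check before reaching for a dedicated tester, you can often guess the vendor from the names of the cookies and response headers a website sends back. The sketch below is not how Botty McBotface works; it is a simple heuristic, and the marker names in it are commonly reported ones that I am assuming here, so treat a match as a hint only. Vendors can rename or drop these markers at any time.

```python
from typing import Iterable, List, Mapping

# Assumed, commonly reported markers. A match is a hint, not proof.
COOKIE_PREFIXES = {
    "Akamai Bot Manager": ("_abck", "bm_sz", "ak_bmsc"),
    "Imperva": ("visid_incap_", "incap_ses_", "reese84"),
    "DataDome": ("datadome",),
    "PerimeterX": ("_px",),
    "Cloudflare": ("__cf_bm", "cf_clearance"),
}
HEADER_PREFIXES = {
    "Cloudflare": ("cf-ray",),
    "Kasada": ("x-kpsdk-",),
}


def guess_vendors(headers: Mapping[str, str], cookie_names: Iterable[str]) -> List[str]:
    """Return a sorted list of vendors whose known markers appear in the response."""
    cookies = [name.lower() for name in cookie_names]
    header_names = [name.lower() for name in headers]
    found = set()
    for vendor, prefixes in COOKIE_PREFIXES.items():
        # str.startswith accepts a tuple of prefixes.
        if any(cookie.startswith(prefixes) for cookie in cookies):
            found.add(vendor)
    for vendor, prefixes in HEADER_PREFIXES.items():
        if any(header.startswith(prefixes) for header in header_names):
            found.add(vendor)
    return sorted(found)
```

With the `requests` library you could call it as `guess_vendors(resp.headers, resp.cookies.keys())`. Keep in mind that some of these cookies are set by JavaScript rather than by the HTTP response, so a plain HTTP client may miss them; an empty result does not mean the website is unprotected.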



### Available stealth browsers with automation features

**Important:** Use this software at your own risk. Some of these browsers contain malware, just FYI. **I do not recommend using them.**

| Stealth Browser | Puppeteer | Selenium | Evasions | SDK/Tooling | Origin |
| - | - | - | - | - | - |
| [GoLogin](https://gologinapp.com) | ✔️ | ✔️ | 🤮 | 👍 | 🇺🇸 + 🇷🇺 |
| [Incogniton](https://incogniton.com) | ✔️ | ✔️ | 🤮 | ✔️ | ❓ |
| [ClonBrowser](https://www.clonbrowser.com/) | ✔️ | ✔️ | 🤮 | ✔️ | ❓ |
| [MultiLogin](https://multilogin.com) | ✔️ | ✔️ | 🤮 | ✔️ | 🇪🇪 + 🇷🇺 |
| [Indigo Browser](https://indigobrowser.com) | ✔️ | ✔️ | 🤮 | ✔️ | ❓ |
| [GhostBrowser](https://ghostbrowser.com) | ❌ | ❌ | ❌ | 👍 | ❓ |
| [Kameleo](https://kameleo.io) | ✔️ | ✔️ | 🤮 | ✔️ | ❓ |
| [AntBrowser](https://antbrowser.pro) | ❌ | ❌ | ❌ | ❌ | 🇷🇺 |
| [CheBrowser](https://beta.chebrowser.site) | ❌ | ❌ | 🤮/✔️ | 👍 | 🇷🇺 |

**Legend:** 🤮 - Evasion based on noise. ❌ - No. ✔️ - Acceptable (with support libraries or not). 👍 - Very nice.

---

A ⭐ on this repo will be **appreciated**!

# Technicalities
---

# Technical insights into bypassing bot detection

Here I study various aspects of the evasion techniques used to get around bot detection systems deployed by major websites. I cover both technical and non-technical matters, including recommendations, references to scientific papers and more.

The technical findings that I am sharing below are based on observations from running web scraping scripts for a few months against websites protected by [the major anti-bot solution vendors](#list-of-anti-bot-software-providers).

*I constantly add stuff to this section. Over time I will try to make it look and feel more structured.*

@@ -58,23 +205,6 @@ A ⭐ on this repo will be **appreciated**!

tbd (if you have an active subscription to any of these services and don't mind sharing an account, drop me an email ❤️)
### Fingerprint test pages
@@ -95,6 +225,7 @@ These websites may be useful to test fingerprinting techniques against a web scr
| http://uniquemachine.org/ | - |
| http://dnscookie.com/ | - |
| https://whatleaks.com/ | - |
| https://kitchensink.ssl.fun/vendor/shape/fp | - |

# Non-technical notes
@@ -121,12 +252,6 @@ In this case, most of the time the vendor will be only able to **cluster the bad

If you think this is the way to go, [google "captcha resolve api"](https://letmegooglethat.com/?q=captcha+resolve+api).

## Tester

Check out my tester application:


## Support

If you have problems scraping a specific website, write me a short email at `dniespodziany@gmail.com`. Let's have a quick tête-à-tête consultation via Skype 😊.

Binary files:

| File | Mode | Before size | After size |
| - | - | - | - |
| assets/anticaptcha.png | Normal file | - | 16 KiB |
| assets/apify.png | Normal file | - | 22 KiB |
| assets/botty_mcbotface.png | Normal file | - | 30 KiB |
| assets/brightdata.png | Normal file | - | 8.6 KiB |
| assets/infatica.png | Normal file | - | 19 KiB |
| assets/oxylabs.png | Normal file | - | 8.9 KiB |
| assets/scrapingbee.png | Normal file | - | 5.9 KiB |
| scrapingbee.png | - | 37 KiB | - |