Show HN: PyDoll – Async Python scraping engine with native CAPTCHA bypass

github.com

136 points by thalissonvs 9 days ago


renegat0x0 - 9 days ago

I think I will add this to my AIO package. My project lets you crawl pages. It provides a barebones page, and scraping results are returned as JSON.

This has been very useful for me, since I don't have to set up Selenium for the nth time. I just use one crawling server for my projects.

Link:

https://github.com/rumca-js/crawler-buddy

jdnier - 9 days ago

Hi, just wondering what you're thinking about how your tool might be abused.

mfrye0 - 9 days ago

Checking it out and I see you're using CDP.

It's been a bit, but I'm pretty sure use of CDP can be detected. Has anything changed on that front, or are you aware and you're just bypassing with automated captcha handling?

hk1337 - 9 days ago

> Say goodbye to webdriver compatibility nightmares

That's cool, but Chrome is the only browser I have had these issues with. We have a cron process that uses Selenium, initially with Chrome, and every time there was a Chrome browser update we had to update the web driver. I switched it to Firefox and haven't had to update the web driver since.

I like the async portion of this but this seems like MechanicalSoup?

*EDIT* MechanicalSoup doesn't necessarily have async, AFAIK.
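
For context on the async point, here is a minimal sketch of what async buys a scraper, using only the Python standard library. It is not PyDoll's or MechanicalSoup's API. `fetch`, `crawl`, and `run_demo` are hypothetical names, and `fetch` sleeps instead of doing real network I/O:

```python
import asyncio
import time


async def fetch(url: str, delay: float = 0.1) -> str:
    # Hypothetical stand-in for a page load: sleeps instead of doing real I/O.
    await asyncio.sleep(delay)
    return f"fetched {url}"


async def crawl(urls: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fetch(u) for u in urls))


def run_demo() -> tuple[list[str], float]:
    # Placeholder URLs; nothing is actually requested.
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    start = time.perf_counter()
    results = asyncio.run(crawl(urls))
    return results, time.perf_counter() - start
```

Because the five waits overlap, the elapsed time should be close to a single delay rather than the sum of all five, which is the main appeal of an async scraping loop.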

nickspacek - 9 days ago

As someone who uses ISPs and browser configurations that seem to frustrate Cloudflare/reCAPTCHA to the point of frequently having to solve them during day-to-day browsing, it would be interesting to develop a proxy server that could automatically/transparently solve captchas for me.

whall6 - 9 days ago

The web scraping arms race continues.

bobbyraduloff - 9 days ago

Is there a write-up on how you deal with the captchas?

antiloper - 9 days ago

[flagged]