How to bypass captcha on the Diggernaut web scraping platform: solving Google reCaptcha v2


Mikhail Sisin, co-founder of the cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.


Google reCaptcha v2 is no longer a problem for our users. We recently integrated the Death by Captcha service, so you can now bypass this type of captcha easily.

Here is what reCaptcha v2 looks like:

Bypass captcha: identifying reCaptcha v2 by token API

If you see this captcha on a website you need to scrape, read on: this guide shows how to handle it in your web scraper, using a real example. We are going to scrape this website: http://www.receita.fazenda.gov.br/PessoaJuridica/CNPJ/cnpjreva/Cnpjreva_Solicitacao2.asp.

To use this solution, you need an account with the Death by Captcha service. The service is not free: solving 1,000 such captchas costs approximately $2.89 (the price as of 02.28.2018).
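To get a feel for the budget, the quoted price can be turned into a quick estimate. This is a minimal sketch: the rate is the one quoted above and may have changed since, and `estimate_cost` is a hypothetical helper name, not part of Diggernaut or Death by Captcha.

```python
from decimal import Decimal

# Price per 1,000 captchas as quoted in this article (02.28.2018).
# Assumption: check the provider's current pricing before relying on it.
PRICE_PER_1000_USD = Decimal("2.89")


def estimate_cost(captcha_count: int) -> Decimal:
    """Return the estimated cost in USD, rounded to cents."""
    if captcha_count < 0:
        raise ValueError("captcha_count must be non-negative")
    cost = PRICE_PER_1000_USD * captcha_count / 1000
    return cost.quantize(Decimal("0.01"))


if __name__ == "__main__":
    print(estimate_cost(50_000))  # prints 144.50
```

Unsuccessful attempts and retries are not modeled here, so treat the result as a rough figure.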

Diggernaut solves the captcha automatically. You need to load the page with the captcha and call the special command captcha_resolve with the following parameters:

provider: the captcha solving provider; should be set to deathbycaptcha.com
type: the captcha type; should be set to nocaptchav2
username: your Death by Captcha account username
password: your Death by Captcha account password

Also, please NOTE: for this captcha to be resolved successfully, the people who solve it must be able to open the page with the captcha from the same IP address as your web scraper. The only way to achieve this is to use YOUR OWN PROXY SERVER in the digger configuration. Our rotating proxies cannot be accessed from any IP outside of our network, so they cannot be used for this operation.

So the base code for our scraper to resolve the captcha would be:

---
config:
    debug: 2
    agent: Firefox
    proxy: PUT YOUR OWN PROXY HERE
do:
# We retry the call until the captcha is resolved successfully (the success rate is not 100%),
# so we set a variable that the walk command uses for its repeat mode
- variable_set:
    field: repeat
    value: "yes"
# Loading the page with the captcha
- walk:
    to: http://www.receita.fazenda.gov.br/PessoaJuridica/CNPJ/cnpjreva/cnpjreva_solicitacao2.asp
    repeat: <%repeat%>
    do:
    # Resolving the captcha
    - captcha_resolve:
        provider: deathbycaptcha.com
        type: nocaptchav2
        username: YOUR DBC USERNAME HERE
        password: YOUR DBC PASSWORD HERE

Don’t run this code yet. If the captcha has been resolved successfully, the token value is stored in the captcha variable. So the first thing to do after the captcha_resolve step is to check whether the variable holds a value and, depending on the result, either retry the call or submit the token to the server and get the results.

---
config:
    debug: 2
    agent: Firefox
    proxy: PUT YOUR OWN PROXY HERE
do:
# We retry the call until the captcha is resolved successfully (the success rate is not 100%),
# so we set a variable that the walk command uses for its repeat mode
- variable_set:
    field: repeat
    value: "yes"
# Loading the page with the captcha
- walk:
    to: http://www.receita.fazenda.gov.br/PessoaJuridica/CNPJ/cnpjreva/cnpjreva_solicitacao2.asp
    repeat: <%repeat%>
    do:
    # Resolving the captcha
    - captcha_resolve:
        provider: deathbycaptcha.com
        type: nocaptchav2
        username: YOUR DBC USERNAME HERE
        password: YOUR DBC PASSWORD HERE
    # Switching to the body block
    - find:
        path: body
        do:
        # Reading the captcha variable value into the register
        - variable_get: captcha
        # Checking that the register is not empty
        - if:
            match: \w+
            do:
            # If it is not empty, turn off repeat mode for the walk command
            - variable_set:
                field: repeat
                value: "no"
            # Submit the token along with the other form data: we are requesting
            # information about a company by its CNPJ number
            - walk:
                to:
                    post: http://www.receita.fazenda.gov.br/PessoaJuridica/CNPJ/cnpjreva/valida_recaptcha.asp
                    data:
                        origem: comprovante
                        cnpj: 05754558000186
                        g-recaptcha-response: <%captcha%>
                        submit1: Consultar
                        search_type: cnpj
                do:
                - find:
                    path: 'div#principal'
                    do:
                    - object_new: item
                    - find:
                        path: td:haschild(font:contains('NÚMERO DE INSCRIÇÃO')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: registration_number
                    - find:
                        path: td:haschild(font:contains('DATA DE ABERTURA')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: registration_date
                    - find:
                        path: td:haschild(font:contains('NOME EMPRESARIAL')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: company_name
                    - find:
                        path: td:haschild(font:contains('CÓDIGO E DESCRIÇÃO DA ATIVIDADE ECONÔMICA PRINCIPAL')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: primary_code
                    - find:
                        path: td:haschild(font:contains('CÓDIGO E DESCRIÇÃO DAS ATIVIDADES ECONÔMICAS SECUNDÁRIAS')) b
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_push:
                            object: item
                            field: secondary_codes
                    - find:
                        path: td:haschild(font:contains('CÓDIGO E DESCRIÇÃO DA NATUREZA JURÍDICA')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: legal_code
                    - find:
                        path: td:haschild(font:contains('LOGRADOURO')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: street
                    - find:
                        path: td:haschild(font:contains('BAIRRO/DISTRITO')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: district
                    - find:
                        path: td:haschild(font:contains('MUNICÍPIO')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: municipal
                    - find:
                        path: td:haschild(font:contains('TELEFONE')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: phone
                    - find:
                        path: td:haschild(font:contains('E-MAIL')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: email
                    - object_save:
                        name: item

You should get the following data when you run it.

[
    {
        "item": {
            "company_name": "LIST - LOGISTICA INTEGRADA, SERVICOS E TRANSPORTES LTDA",
            "district": "COROADO",
            "legal_code": "206-2 - Sociedade Empresária Limitada",
            "municipal": "MANAUS",
            "phone": "(92) 3622-7885",
            "primary_code": "52.12-5-00 - Carga e descarga",
            "registration_date": "26/06/2003",
            "registration_number": "05.754.558/0001-86",
            "secondary_codes": [
                "49.30-2-02 - Transporte rodoviário de carga, exceto produtos perigosos e mudanças, intermunicipal, interestadual e internacional",
                "77.39-0-99 - Aluguel de outras máquinas e equipamentos comerciais e industriais não especificados anteriormente, sem operador",
                "77.19-5-99 - Locação de outros meios de transporte não especificados anteriormente, sem condutor",
                "52.29-0-99 - Outras atividades auxiliares dos transportes terrestres não especificadas anteriormente",
                "49.30-2-01 - Transporte rodoviário de carga, exceto produtos perigosos e mudanças, municipal",
                "52.50-8-03 - Agenciamento de cargas, exceto para o transporte marítimo"
            ],
            "street": "R PROFESSORA RAYMUNDA MAGALHAES"
        }
    }
]
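For readers who think in general-purpose code, the retry-and-check flow of the digger can be pictured roughly as follows. This is only an illustration in Python, not how Diggernaut is implemented: `solve_once` is a hypothetical stand-in for the captcha resolve step, and `max_attempts` is an added safety limit that the digger configuration itself does not have.

```python
import re
from typing import Callable, Optional

# Mirrors the digger's `match: \w+` check: the token counts as present
# if it contains at least one word character.
TOKEN_PATTERN = re.compile(r"\w+")


def solve_with_retries(
    solve_once: Callable[[], Optional[str]],
    max_attempts: int = 5,
) -> Optional[str]:
    """Call `solve_once` until it yields a non-empty token or attempts run out.

    `solve_once` is a hypothetical callable standing in for one
    page load plus one captcha resolve attempt.
    """
    for _ in range(max_attempts):
        token = solve_once()
        if token and TOKEN_PATTERN.search(token):
            # Corresponds to setting the repeat variable to "no".
            return token
    return None


if __name__ == "__main__":
    # Simulated solver: fails twice, then returns a token.
    responses = iter(["", None, "tok_123"])
    print(solve_with_retries(lambda: next(responses)))  # prints tok_123
```

Capping the number of attempts is a design choice worth considering, since every attempt is billed by the solving service and takes time.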

This particular solution can be used in a “Data on demand” scenario, when someone enters a number on your website. Your website sends a command to the Diggernaut API to run the digger with a specific incoming parameter. The API then returns the results to your website, which renders them for your user. Please note that the captcha resolve process is not very fast: for reCaptcha v2, for example, it may take more than a minute.
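On the website side, such a flow might look like the sketch below. It is written under stated assumptions: `run_digger` is a hypothetical callable that wraps whatever Diggernaut API call you use (the article does not specify the endpoint or its parameters, so none is shown), and the 180-second default timeout is an arbitrary allowance for the slow captcha step.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Dict, List

DiggerResult = List[Dict[str, Any]]


def normalize_cnpj(raw: str) -> str:
    """Strip punctuation from user input and check that 14 digits remain."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 14:
        raise ValueError("a CNPJ number must contain exactly 14 digits")
    return digits


def lookup_company(
    raw_cnpj: str,
    run_digger: Callable[[str], DiggerResult],
    timeout_seconds: float = 180.0,
) -> DiggerResult:
    """Validate the input, run the digger, and give up after a timeout.

    `run_digger` is a hypothetical wrapper around the API call that starts
    the digger with the CNPJ as its incoming parameter and returns the data.
    """
    cnpj = normalize_cnpj(raw_cnpj)
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(run_digger, cnpj)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout as exc:
        raise TimeoutError("digger did not finish in time") from exc
    finally:
        # Do not block the web request on a digger call that is still running.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    # Fake digger used only to demonstrate the call shape.
    fake = lambda cnpj: [{"item": {"registration_number": cnpj}}]
    print(lookup_company("05.754.558/0001-86", fake))
```

Validating the number before calling the API avoids paying for a captcha on input that cannot be a valid CNPJ.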

In any case, we hope this article was useful and that you now know how to bypass reCaptcha v2 with Diggernaut and Death by Captcha. We are planning to implement other captcha solutions soon, so please stay tuned.