256 Kilobytes

Busting reCaptcha Myths: Reverse Engineering Anti-spam Algorithms and Examining Existing Research in Realistic Contexts

Articles in Hacking the Government | By August R. Garcia

Published Mon, 22 Apr 2019 22:23:34 -0700 | Last updated Sat, 08 Jun 2019 01:28:52 -0700

As one of the most widely used and recognizable captchas on the planet, reCaptcha and its “I am not a robot” verification are the target of regular circumvention attempts. In response, reCaptcha has been updated frequently, with its most recent major version, reCaptcha 3.0, launching toward the end of 2018. For the lulz, I wrote a reCaptcha solver and examined the security updates that have been made. From this, it has been concluded that:

  • Requests to solve an audio captcha instead of the visual captcha now result in increased scrutiny by reCaptcha, making previous audio captcha solvers largely nonviable for real-world spambots.
  • Due to improvements in machine learning and image classification, reCaptcha has given increasingly less value to the image-identifying aspect of captcha solving as well, despite trusting it more than the audio captcha (and the image captcha has been retired as of reCaptcha v3.0).
  • For a reCaptcha solver to be plausibly used in real-world scenarios (i.e., scenarios that use cheap and/or free proxies), it must be able to farm Google cookies.

These points, as well as an overview of reCaptcha, reCaptcha solving, anti-spam practices, and general implications, are included below.

Context and Earlier Security Research

Earlier Security Research

Here are two of the most useful and notable projects on breaking reCaptcha (and similar captchas), the research and code from which are referenced and expanded on in various places throughout this post.

I'm Not a Human: Breaking the Google Recaptcha

If you want to learn about “click the image” captchas and how to approach building bots to solve them, this presentation from the 2016 Black Hat conference and the accompanying written publication and slides are some of the best places to start.

The presentation covers the ins-and-outs of these types of captchas, an overview of how image classifiers work, and touches on a number of factors that captchas often take into account behind-the-scenes.

unCaptcha: A Low-resource Defeat of reCaptcha's Audio Challenge

A substantial amount of open-source code was made available in mid to late 2017 through unCaptcha, a reCaptcha solving project published by Kevin Bock, Daven Patel, George Hughey, and Dave Levin from the University of Maryland.

Parts of the code used in this post have been taken and/or modified from unCaptcha.

How reCaptcha Works

Point-Based Spam Filters

When discussing spam filters, incorrect statements are made constantly:

  • “You only have to click the fucking cars if you’re in a private browsing/incognito window.”
  • “You can’t solve reCaptcha on Tor.”
  • “reCaptcha always gives a harder captcha if you’re using a proxy.”

However, while these statements each have a very mild amount of truth in them, they are all wrong. Rather, reCaptcha uses a system wherein a number of factors are taken into account to determine a “score” for how spammy or non-spammy the user is. While the exact details are proprietary secrets, the basic concept is simple:

  • Factors that look suspicious, like using an IP address in a known data center (i.e., a known proxy rather than a residential IP address) increase the spam score.
  • Factors like having an aged Google search cookie can lower the spam score.
  • Once a score is calculated, different actions can be taken based on where the user’s score falls on a scale:
    • Very spammy requests are rejected entirely with this error message:
      • "Your computer or network may be sending automated queries. To protect our users, we can't process your request right now. For more details visit our help page."
    • Moderately spammy requests are given a large number of image-clicking challenges that deliberately load in slowly to stall users.
    • Neutral requests are given a small number of image-clicking challenges that load in quickly.
    • Trustworthy requests are given the green checkmark with no captchas needing to be solved.

This is the same type of system used for many other spam filters, such as those used to determine whether an email should go to a user’s inbox, spam folder, or be rejected by the server entirely.
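The point system above can be sketched in a few lines. This is only an illustration of the concept: the signal names, weights, and cut-offs below are made up for the example, since reCaptcha's real factors and thresholds are proprietary.

```python
# Hypothetical signals and weights; reCaptcha's real ones are not public.
SIGNAL_WEIGHTS = {
    "datacenter_ip": 40,        # request comes from a known proxy / data center
    "requested_audio": 30,      # asked for the audio challenge
    "aged_google_cookie": -50,  # an aged Google cookie lowers suspicion
}

# Hypothetical cut-offs, checked from most spammy to least spammy.
ACTIONS = [
    (60, "reject"),                # the "automated queries" error
    (30, "many_slow_challenges"),  # lots of slow-loading image challenges
    (0, "few_fast_challenges"),    # a few quick image challenges
]


def spam_score(signals):
    """Add up the points for every signal seen on the request."""
    return sum(SIGNAL_WEIGHTS[name] for name in signals)


def choose_action(signals):
    """Map the total score onto one of the four outcomes described above."""
    score = spam_score(signals)
    for cutoff, action in ACTIONS:
        if score >= cutoff:
            return action
    return "green_checkmark"  # trustworthy: no challenge at all


print(choose_action(["datacenter_ip", "requested_audio"]))
print(choose_action(["datacenter_ip", "aged_google_cookie"]))
```

The useful property of a design like this is that no single factor decides the outcome: a bad signal (a data center IP) can be offset by a good one (an aged cookie), which is why the absolute statements quoted above are wrong.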

reCaptcha Versions

Cracking reCaptcha

Creating a bot that is capable of solving reCaptcha requires three parts:

  • Code to interact with the reCaptcha user interface to click buttons, code to handle error messages from reCaptcha, and so on.
  • A solution to solve or bypass the reCaptcha challenges. There are three general possibilities:
    • Solving the audio captcha;
    • Solving the visual captcha; or
    • Creating a cookie farm to prevent the captcha from requiring a challenge to be solved at all.
  • A setup that meets certain bare minimum standards for non-spamminess, to prevent requests from being rejected outright and to prevent the rest of your code from being dragged down by having too many “spam points” from general incompetence.

Bare Minimum Standards

Considering that spam filters are based on a point system, fucking up basic shit can make the complicated part of your code pointless. To ensure that a bot meets bare minimum standards for not looking like a spambot, these points are, for all intents and purposes, required.

IP Addresses and Proxies

A user’s IP address can be used to infer a large amount of information about an HTTP request.

As discussed in the best proxy guide on the Internet, it is possible to determine whether an IP address is residential or a known proxy. While some requests from known proxies are humans doing regular things, many are spam.

While it is possible to repeatedly load and solve some reCaptcha demo without a proxy, connecting from your home IP address each time, this is unrealistic in terms of creating a reCaptcha solver that would actually be used for webspam purposes. Spambots are generally used to create accounts, post comments, and to otherwise interact with websites in massive bulk. Since these sites will generally log IP addresses, for a reCaptcha solver to be viable in real-world webspam scenarios, it must be able to solve reCaptcha while connected to proxies.

More specifically, it should be able to solve reCaptcha from cheap or free proxies. No webspammer is spending hundreds to thousands of dollars for the highest quality proxies, since that would destroy their profit margins, likely causing them to lose money and making the entire project pointless.

Programmatically Interacting with the reCaptcha UI

Browser Emulation vs. Headless Requests

The reCaptcha developers made the decision to refrain from including an API. Because of this, it’s necessary to write custom code to interact with reCaptcha. When deciding how to do this, the first question is whether to:

  • Make headless requests where raw requests are made to the server without actually rendering the page source as it would be seen by a human. Robots can work with raw HTML, CSS, JS, and other code without needing to convert it to a pretty, rendered UI. For most bots, headless requests are ideal because they are substantially less resource intensive and are also generally easier to write code for. However, one limitation of headless requests is that it is relatively easy for websites to check whether a request is being made headlessly, using processes similar to those used to detect user-agent spoofing (more on user agent spoofing in this article).
  • Emulate the browser where the robot actually opens an instance of Firefox, Chrome, or some other browser as if it were a human.

I heard from a guy who knows a guy who knows a guy who said that it’s possible to solve reCaptcha 2.0 with headless requests. Regardless, this post uses browser emulation via Selenium.

“Mouse Movement” is Irrelevant

If you look up “how does reCaptcha know whether or not you are a robot” (or similar queries), you’ll likely find 800 normie-tier articles about how reCaptcha “looks at unnatural activity” that robots do, but not humans. Generally, the example that is given is “mouse movement.” While this is an easily-understandable example, it’s important to note that it’s fucking wrong.

Actually clicking the image is a trivial task for a robot. In addition to the fact that clicking the reCaptcha checkbox can be done without bothering to simulate any mouse movement at all, here’s an even easier way to verify that mouse movement is clearly not used:

  1. Open some recaptcha demo, like this one.
  2. Press tab until you’ve selected the reCaptcha checkbox.
  3. Press enter.

Congratulations. You can even unplug your mouse while doing this.

Audio Captchas

As you may or may not know, reCaptcha 2.0 includes an option to request an audio captcha instead of a visual captcha, for accessibility purposes. Previous solutions to reCaptcha, such as unCaptcha, relied on the audio captcha option, which was able to be solved fairly easily by:

  1. Downloading the audio file
  2. Splitting the file into separate files for each of the individual spoken letters/numbers
  3. Processing it with multiple voice-to-text programs
  4. Determining the most likely input for each letter based on the result (e.g., if five voice to text programs heard “seven” and two heard “something else,” assuming that the answer is “seven” for that character)
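The voting in step four can be sketched as follows. This is a minimal illustration rather than unCaptcha's actual code: the function names are hypothetical, and the transcriptions are assumed to have already been collected from whichever voice-to-text programs are in use.

```python
from collections import Counter


def vote_on_segment(transcriptions):
    """Return the most common non-empty transcription for one audio segment.

    `transcriptions` holds one guess per voice-to-text program; programs that
    heard nothing are expected to contribute None or an empty string.
    """
    votes = Counter(t.strip().lower() for t in transcriptions if t and t.strip())
    if not votes:
        return None  # no program produced a usable guess
    return votes.most_common(1)[0][0]


def solve_audio_challenge(per_segment_transcriptions):
    """Vote on each segment independently and return the answers in order."""
    return [vote_on_segment(segment) for segment in per_segment_transcriptions]


# Five programs heard "seven" and two heard something else.
print(vote_on_segment(["seven"] * 5 + ["eleven", "heaven"]))
```

How ties are broken is left to `Counter.most_common` here; a real solver would probably want to weight the more reliable voice-to-text programs more heavily.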

An example of this is shown in the video below:

The specific script being used in the video above is unCaptcha.

Audio Captchas are Under Increased Scrutiny

However, while this solution still works for solving the captcha puzzle, choosing the option to request an audio captcha instead of a visual captcha results in increased suspicion from reCaptcha (i.e., adding “spam points”) over the visual option. This is presumably because either:

  • The audio captcha has been solved by a number of different open-source solutions and people started exploiting the audio captcha; or
  • Google does not like blind people.

Because of this, when using most proxies, requesting an audio captcha (rather than an image captcha) will--in almost all cases--result in an outright block from reCaptcha, returning the error:

  • "Your computer or network may be sending automated queries. To protect our users, we can't process your request right now. For more details visit our help page."

Of course, this depends on the type of proxy being used. Exceptionally clean (i.e., exceptionally expensive) proxies and/or other green flags may allow the audio captcha to be served and solved correctly even with this increased scrutiny, but this additional cost and effort is impractical for any realistic spambot.

The implication of this is that a reliable, modular solution to the visual reCaptcha is needed to bypass reCaptcha tests on a large scale.

The Image Captcha

Overview of Image Classification Algorithms

To solve the image captcha, the easiest approach is to find an existing image classifier and to retrain it for this specific use case. While there are many subtopics and concepts to learn about machine learning and image classification algorithms, the basic concept behind machine learning can be simplified as follows:

  1. Get some problems for your robot to solve
  2. Create an “answer key”
  3. Split the problems into two sets:
    1. Roughly 80% as the “training set”
    2. Roughly 20% as the “validation set”
  4. Repeat the following steps indefinitely until you are satisfied with the algorithm generated:
    1. Have the robot use the training set and associated answer key to attempt to create an algorithm that can accurately solve this category of problem.
    2. Have the robot use that algorithm to try to solve the problems in the validation set.
    3. If, based on the results from the previous step, the algorithm is better than what was being used in the previous loop, keep it. Otherwise, go back to the old algorithm and create another offshoot.
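The loop above can be shown with a deliberately tiny toy problem. The sketch below is not how TensorFlow trains an image classifier; it "learns" a single number (a threshold) by random nudging, purely to make the 80/20 split and the keep-it-if-it-is-better step concrete. All names and parameters are made up for the example.

```python
import random


def accuracy(threshold, examples):
    """Fraction of (value, label) pairs that a 'value >= threshold' rule gets right."""
    correct = sum((value >= threshold) == label for value, label in examples)
    return correct / float(len(examples))


def train(examples, rounds=200, seed=0):
    """Toy hill-climbing trainer that returns (best_threshold, validation_accuracy)."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    split = int(len(shuffled) * 0.8)  # roughly 80% training, 20% validation
    training, validation = shuffled[:split], shuffled[split:]

    best = 0.0
    best_score = accuracy(best, validation)
    for _ in range(rounds):
        # Create an "offshoot" of the current algorithm by nudging it slightly.
        candidate = best + rng.uniform(-0.2, 0.2)
        # Discard offshoots that do worse against the training answer key.
        if accuracy(candidate, training) < accuracy(best, training):
            continue
        # Keep the offshoot only if it also beats the old one on validation.
        score = accuracy(candidate, validation)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# The "answer key": a value counts as a match when it is at least 0.6.
examples = [(i / 100.0, i >= 60) for i in range(100)]
threshold, score = train(examples)
print("learned threshold %.2f, validation accuracy %.2f" % (threshold, score))
```

Checking against the validation set matters because those problems were never used to build the algorithm, so a good score there suggests the algorithm generalizes rather than just memorizing the answer key.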

As mentioned in this thread, this video by SethBling and this video by CGP Grey (embedded below) are additional understandable, entry-level explanations of machine learning.

One of the most straightforward methods to get started with machine learning is through Google’s Tensorflow for Poets tutorial, which uses Python and the open-source TensorFlow library to create and train a basic image classifier to identify flowers, which can easily be adjusted to identify other arbitrary images.

Gathering Training Images

To train an image classifier for this particular use case (solving reCaptcha), training images are needed. Fortunately, it’s fairly easy to acquire these with a script that:

  • Opens a page with reCaptcha on it, like this demo site.
  • Clicks the image
  • Downloads the images to a folder/category with the type of challenge (traffic lights, street signs, etcetera)
  • Hits the “get a new challenge” button to repeat/refresh until reCaptcha throws an error
  • Upon getting an error or otherwise being unable to download more images, close the instance, cycle to a new proxy, and start from step one.

Repeat until you have somewhere between 1,000 and 10,000 training images for each of the major categories. As of the time of this bot being tested, almost all challenges were for one of these five categories:

  • “Bus”
  • “Cars”
  • “Roads”
  • “Store Front”
  • “Street Signs”

While there were other categories like fire hydrants and statues, they were rare enough that they could be skipped with little to no repercussions.

Once a fair number of training images have been downloaded, sort them into “Cars” and “Not Cars” categories (or comparable) to be used as training data for the image classifier. When doing this, I hired the worst fucking VA in the world; a better option would have been Amazon’s Mechanical Turk, which is full of people who do an obscene quantity of microtasks to make modest amounts of money.

Regardless of how they get sorted, once you have the data, you can retrain the classifier for reCaptcha solving.
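As a rough sketch of the sorting step, the snippet below copies labelled images into one sub-folder per label, which is how the flower photos in the Tensorflow for Poets tutorial are laid out. The function name, the label names, and the idea of passing the human-assigned labels in as a dictionary are all assumptions made for the example.

```python
import os
import shutil


def sort_into_label_folders(labels, source_dir, dest_dir):
    """Copy each image into dest_dir/<label>/ and return a count per label.

    `labels` maps a filename in source_dir to its human-assigned label,
    e.g. {"image_001.jpeg": "cars", "image_002.jpeg": "not_cars"}.
    """
    counts = {}
    for filename, label in sorted(labels.items()):
        label_dir = os.path.join(dest_dir, label)
        if not os.path.isdir(label_dir):
            os.makedirs(label_dir)
        shutil.copy(os.path.join(source_dir, filename), label_dir)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

The returned counts make it easy to spot a badly unbalanced category (say, 50 “cars” against 5,000 “not_cars”) before spending time on retraining.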

While solving the image captcha is a """relatively""" trivial process (compared to when it was first introduced), there are many cases (particularly for Tor users) where, despite many captchas being solved correctly, the reCaptcha either takes (literally) 3-5 minutes to solve or gives the user the "automated queries GTFO" error after wasting like five minutes of their time. The same is true for reCaptcha solvers working from proxies. While the images are clicked correctly and some captchas will eventually get solved correctly, this is slow as fuck and completely impractical. From the tests I ran, you're looking at roughly one captcha being solved per hour (when running in a single thread). To get around this, it's necessary to reduce the "spam score" further by either:

Code and Resources

Here is the code used during this project. Note that various dependencies will likely need to be installed to make all of this fucking shit run, most notably Selenium. Originally written in April 2018.

Recaptcha Photos

The sorted training photos used for the image captcha solver are included in this .tar.gz:

The image categories include images from challenges for buses, cars, roads, store fronts, and street signs, sorted into folders that do and do not match the categories.

ris.py

# ris.py
# Probably mostly from unCaptcha (?)
import requests
import os
import time
import random
import json
import threading
import multiprocessing
import copy
import pickle

from PIL import Image
from os import walk
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Verbosity flag read by check_image(); it is referenced below but was never defined in this listing
DEBUG = 0

def parse_test_file(test_filename):
    # Use a context manager so the file handle is always closed
    with open(test_filename, "r") as test_file:
        return json.load(test_file)

def test_all():
    dirs = list()
    test_root = "images"
    for (dirName, subDir, _) in walk(test_root):
        dirs.extend(subDir)
        break
    for d in dirs:
        full_path = os.path.join(test_root, d)
        try:
            search_directory(full_path)
        except Exception as exc:
            print("test %s failed" % full_path)
            print(exc)  # not every exception has a .message attribute

def search_directory(directory, target_keyword=None, width=4):
    f = []
    trues = 0
    threads = []
    oracle = None
    try:
        oracle = parse_test_file(os.path.join(directory, "oracle.json"))
        if target_keyword is None:
            target_keyword = oracle["target_keyword"]
    except IOError as err:
        print("no oracle file found")
    manage_vars = multiprocessing.Manager()
    for (_, _, filenames) in walk(directory):
        f.extend([file for file in filenames if "image" in file or "output" in file])
    i = 0
    ret_vals = manage_vars.dict()
    target_syns = list()
    for targ_key in target_keyword.split():
        target_syns.extend(get_synonyms(targ_key))
        target_syns.append(targ_key)

    print("testing " + directory)
    for img_file in f:
        t = multiprocessing.Process(target=reverse_search2, args=(os.path.join(directory, img_file), img_file, ret_vals, target_syns))
        threads.append(t)
        t.start()
        i+=1
    # Join every worker; range(0, i-1) skipped the last process
    for t in threads:
        t.join()
    print("")
    # print ret_vals
    # print oracle
    if oracle: # local testing only
        for img_file in ret_vals.keys():
            # print str(ret_vals[img_file]) + " " + str(oracle[img_file])
            if(ret_vals[img_file] == oracle[img_file]):
                trues += 1
        print("  %s correct out of %s" % (str(trues), len(ret_vals)))
        return ret_vals
    else: # live testing only

        return get_coor(ret_vals, width)

def reverse_search2(img_file, filename, ret_vals, target_keyword="vehicle"):
    ret_vals[filename] = reverse_search(img_file, target_keyword)

# determines if an image keywords matches the target keyword
# uses the synonyms of the image keyword
def check_image(img_keywords, target_syns, syn_image=False):
    #print ("Checking keywords against: " + target_keyword)

    for k in img_keywords:
        #print(k)
        if syn_image:
            image_syns = get_synonyms(k)
            if image_syns:
                for image_s in image_syns:
                    for target_s in target_syns:
                        # print("- %s" % (target_s))
                        if target_s == image_s:
                            return True
        else:
            for target_s in target_syns:
                # print("- %s" % (target_s))
                if target_s == k:
                    if (DEBUG > 0):
                        print("Found " + target_s + " equal to " + k)
                    return True
    return False

def get_coor(click_dict, width=4):
    x = 1
    y = 1
    coor_dict = dict()
    for key in sorted(click_dict.keys()):
        coor_dict[(x, y)] = click_dict[key]
        y += 1
        if y > width:
            x += 1
            y = 1
    return coor_dict
    
def pprint(matrix):
    s = [[str(e) for e in row] for row in matrix]
    lens = [max(map(len, col)) for col in zip(*s)]
    fmt = '\t'.join('{{:{}}}'.format(x) for x in lens)
    table = [fmt.format(*row) for row in s]
    print('\n'.join(table))

recaptcha_solver.py

# Recaptcha Solver
# recaptcha_solver.py
# Created on 25 April, 2018
# Base code taken from the open source code for Uncaptcha

from selenium import webdriver
from selenium.webdriver.common import action_chains, keys

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

import sys
import os
import time
from time import sleep

from bs4 import BeautifulSoup
import urllib
import urllib2

import pdb
import logging
import random
from random import uniform
import threading
from threading import Timer,Thread,Event
import argparse

from termcolor import colored
import pprint

import Tkinter as tk
from PIL import ImageTk, Image

import traceback

# Custom Modules
import ris

# Import the functions to run reCaptcha images through the image classifier / to get data from the image classifier
# recaptcha_solver/tensorflow-for-poets-2/scripts/label_image.py
classifier_path = os.path.join(  os.getcwd(), 'recaptcha_solver/tensorflow-for-poets-2/scripts'  )
sys.path.append(classifier_path) # FIXME
import label_image
sys.path.remove(classifier_path)


# ***** ***** ***** ***** FUNCTION DEFINITIONS ***** ***** ***** ***** #
def recaptcha_solver_demo(driver):
    print "Starting the reCAPTCHA solver demo..."
    driver.get("https://patrickhlauke.github.io/recaptcha/")
    #driver.get("https://www.google.com/recaptcha/api2/demo")
    solve_recaptcha(driver, 5)
    print "End of recaptcha_solver_demo() function..."


# Solve all reCaptchas that are currently loaded on the screen of the driver
# returns True if reCaptcha is solved correctly
# Otherwise, returns False
def solve_recaptcha(driver, seconds_to_wait=15):
    if ( click_recaptcha(driver, seconds_to_wait) ):
        wait_for_initial_challenge_to_load(driver)

        try:
            return solve_visual_captcha(driver)
            # Uncomment this and comment out the line above if you want to download training images instead of solve captchas
            #download_training_images(driver) 
        except:
            traceback.print_exc()
            print "Error somewhere, lmao"

    return False

def click_recaptcha(driver, seconds_to_wait=15):
    for iii in range(seconds_to_wait):
        try:
            print "Trying to find the recaptcha's iFrame..."
            #recaptcha_iframe = driver.find_element(By.CSS_SELECTOR, "iframe[title=\"recaptcha challenge\"]")
            #recaptcha_iframe = driver.find_element_by_css_selector("#g-recaptcha iframe")
            recaptcha_iframe = driver.find_element_by_css_selector(".g-recaptcha iframe")
            print "Found frame..."
            driver.delete_all_cookies() #Is this even a good choice? (Re: Asian woman)
            print "Cookies deleted..."
            driver.switch_to.frame(recaptcha_iframe)            
            print "Switched to the iFrame..."
            driver.delete_all_cookies() 
            print "Cookies deleted (again)..."
            
            print "Trying to find the recaptcha..."
            recaptcha = driver.find_element_by_css_selector("#recaptcha-anchor")
            print "Trying to click the recaptcha..."
            recaptcha.click()
            print "The reCaptcha has been clicked..."

            driver.switch_to.default_content()
            return True  
        except:
            driver.switch_to.default_content()
            sys.stdout.flush()
            sys.stdout.write( '\r' + colored("Waiting for a reCaptcha to load. "+str(iii)+" seconds have passed. Will abandon at "+str(seconds_to_wait)+" seconds...", 'cyan') )
            #print colored("Waiting for a reCaptcha to load. "+str(iii)+" seconds have passed. Will wait up to "+str(seconds_to_wait)+" seconds...", 'yellow')
            time.sleep(1)
    print colored("\nError: No reCaptcha was detected on the page after waiting for "+str(seconds_to_wait)+" seconds.", 'red')
    return False

def wait_for_initial_challenge_to_load(driver):
    print "Waiting for challenges to load, probably..."
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe[title=\"recaptcha challenge\"]")))
    iframe = driver.find_element(By.CSS_SELECTOR, "iframe[title=\"recaptcha challenge\"]")
    driver.switch_to.frame(iframe)
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.ID, "rc-imageselect")))

    print "reCaptcha has loaded..."


############################## CAPTCHA ERRORS ###############################
# Checks to see if reCaptcha has given the "Try again later" error (which disallows attempts to solve a reCaptcha at all)
# "Your computer or network may be sending automated queries. To protect our users, we can't process your request right now. For more details visit our help page"

# FIXME
def check_for_automated_queries_error(driver):
    try:
        if driver.find_element_by_css_selector(".rc-doscaptcha-body-text").is_displayed():
            print colored("Error: Automated queries error message was found. Captcha could not be served...", 'red')
            return True
    except Exception:
        pass
    # The element is either missing or hidden, so always return an explicit False
    print colored("Automated queries error message not found...", 'green')
    return False
    
#TODO - Test this
# Check if reCaptcha has timed out
# red border thing
# FIXME -- might need to switch to the correct of the two (?) iFrames (?)
def check_if_recaptcha_timed_out(driver):
    #id='recaptcha-anchor' class="recaptcha-checkbox goog-inline-block recaptcha-checkbox-unchecked rc-anchor-checkbox recaptcha-checkbox-expired"
    try:
        # find_element_by_css_class() does not exist, so the old call always raised and
        # this function always reported "not expired"; use the CSS selector lookup instead
        driver.find_element_by_css_selector("#recaptcha-anchor.recaptcha-checkbox.recaptcha-checkbox-expired")
        print colored("reCaptcha has expired.", 'red')
        return True
    except:
        print colored("reCaptcha has not expired.", 'green')
        return False

# TODO -- Make this way faster
# Fucking big reason why signs times out
def check_if_recaptcha_is_solved(driver):
    print "Checking if reCaptcha is solved..."
    try:
        driver.switch_to.default_content()
        recaptcha_iframe = driver.find_element_by_css_selector(".g-recaptcha iframe")
        driver.switch_to.frame(recaptcha_iframe)            
        driver.find_element_by_css_selector("#recaptcha-anchor.recaptcha-checkbox.recaptcha-checkbox-checked")
        #driver.find_element_by_css_class("#recaptcha-anchor.recaptcha-checkbox.recaptcha-checkbox-checked").get_attribute("aria-checked")
        print colored("reCaptcha is solved.", 'green', 'on_yellow')
        driver.switch_to.default_content()
        return True
    except:
        print colored("reCaptcha is not yet solved.", 'red')
        driver.switch_to.default_content()
        iframe = driver.find_element(By.XPATH, "/html/body/div/div[4]/iframe")
        driver.switch_to.frame(iframe)
        return False

# Checks for common reasons that the reCaptcha solver may not be able to continue and returns relevant information
def check_if_recaptcha_solver_is_stuck(driver):
    # Check for 'automated queries' rejection
    if (  check_for_automated_queries_error(driver) == True):
        return True
   
    # FIXME
    if (check_if_recaptcha_timed_out(driver)):
        return True

    print colored("reCaptcha solver is not stuck. Continuing...", 'green')
    return False

############################## VISUAL RECAPTCHA #############################
def solve_visual_captcha(driver):
    return image_recaptcha(driver)

############################## IMAGE RECAPTCHA FUNCTIONS ##############################
TASK_PATH = "recaptcha_solver/captcha-images"

def should_click_image(img, x1, y1, store, threshold=0.95, target="cars"):

    decision = parse_classify_image(img, threshold, target)

    store[(x1,y1)] = decision
    logging.debug(store)

    return decision

def click_tiles(driver, coords, subdir=None):

    # Some recaptchas (generally everything except street signs) will fade out when clicked, after which a new image is loaded in
    # ---> Dynamic -- (.rc-imageselect-tile.rc-imageselect-dynamic-selected)
    # Other recaptchas (almost exclusively street signs) get static check mark and don't fade out
    # ---> Static -- (.rc-imageselect-tile.rc-imageselect-tileselected)
    # These flags are used to determine whether the script should wait for new images to load in or not
    flag_is_static_select = False
    flag_is_dynamic_select = False
    # There are two distinct flags so that it can be determined immediately on the first image click either way, which improves performance

    orig_srcs, new_srcs = {}, {}
    for (x, y) in coords:
        logging.debug("[*] Going to click {} {}".format(x,y))
        tile1 = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//div[@id="rc-imageselect-target"]/table/tbody/tr[{0}]/td[{1}]'.format(x, y))))
        orig_srcs[(x, y)] = driver.find_element(By.XPATH, "//*[@id=\"rc-imageselect-target\"]/table/tbody/tr[{}]/td[{}]/div/div[1]/img".format(x,y)).get_attribute("src")
        new_srcs[(x, y)] = orig_srcs[(x, y)] # to check if image has changed 
        tile1.click()
        if (flag_is_static_select == False and flag_is_dynamic_select == False):
            try:
                driver.find_element_by_css_selector(".rc-imageselect-tile.rc-imageselect-tileselected")
                flag_is_static_select = True
            except:
                flag_is_dynamic_select = True
                #try:
                #    driver.find_element_by_css_selector(".rc-imageselect-tile.rc-imageselect-dynamic-selected")
                #    flag_is_dynamic_select = True
                #except:
                #    print colored("WARNING: How can the recaptcha image be neither of the two possibilities?", 'red', 'on_yellow')
        #wait_between(0.1, 0.5)
        wait_between(0.1, 0.2)

    # TODO -- Check if the images use a checkmark style (rather than fading out and loading in a new image)
    if (flag_is_static_select == True):
        print colored("All images have been checked/selected.", 'blue')
        #pdb.set_trace()
        return None # FIXME - Test this
    else:
        print colored("New images loading in...", 'blue')

    # Set the path for where the images will be downloaded to
    if (subdir != None):
        subdir = os.path.join(TASK_PATH, subdir)
    else:
        subdir = TASK_PATH

    # Start downloading the new images, etc.
    logging.debug("[*] Downloading new inbound image...")
    new_files = {}
    for (x, y) in orig_srcs:

        # If there are connection issues, images may not load in correctly once old images are clicked
        # To prevent the reCaptcha from hanging forever in an infinite loop,
        # throw an error after waiting for over [some amount of time, probably set this to five seconds]
        # TODO -- make this click "verify" instead and move on to the next reCaptcha (?)
        max_seconds_to_wait_raw_value = 7
        max_seconds_to_wait = max_seconds_to_wait_raw_value
        loop_delay          = 0.5
        while new_srcs[(x, y)] == orig_srcs[(x, y)]:
            if(max_seconds_to_wait > 0):
                new_srcs[(x, y)] = driver.find_element(By.XPATH, "//*[@id=\"rc-imageselect-target\"]/table/tbody/tr[{}]/td[{}]/div/div[1]/img".format(x,y)).get_attribute("src")
                time.sleep(loop_delay)
                max_seconds_to_wait -= loop_delay
                #print colored("Remaining wait: "+str(max_seconds_to_wait), 'blue', 'on_cyan')
                pass
            else:
                e = "ERROR: New reCaptcha image could not be found after waiting for "+str(max_seconds_to_wait_raw_value)+" seconds."
                print colored(e, 'red')
                # Raising a bare string is a TypeError; wrap the message in an exception
                raise RuntimeError(e)

        #urllib.urlretrieve(new_srcs[(x, y)], "captcha.jpeg")
        #new_path = TASK_PATH+"/new_output{}{}.jpeg".format(x, y)
        #os.system("mv captcha.jpeg "+new_path)
        new_path = subdir+"/new_output{}{}.jpeg".format(x, y)
        urllib.urlretrieve(new_srcs[(x, y)], new_path)
        #os.system("mv captcha.jpeg "+new_path)

        new_files[(x, y)] = (new_path)
    return new_files

def handle_queue(to_solve_queue, coor_dict, threshold=0.95, target="cars"):
    ts = []
    for (x,y) in to_solve_queue:
        image_file = to_solve_queue[(x, y)]
        t = threading.Thread(target=should_click_image, args=(image_file, x, y, coor_dict, threshold, target))        
        ts.append(t)
        t.start()
    for t in ts:
        t.join()

def image_recaptcha(driver):
    # This is here to check specifically for the "automated queries" error on the first click
    print "Checking for 'automated queries' error before continuing..."
    if (  check_for_automated_queries_error(driver) == True):
        return False

    print colored("Attempt to solve the image reCaptcha has started...\n", 'cyan')
    current_captcha  = 1
    threshold = 0.02 # Default probability threshold for a match (overridden per target below)
    continue_solving = True
    while continue_solving:
        print "Starting reCaptcha #" + str(current_captcha) + "..."
        print colored("Searching for a relatively easy reCaptcha to solve...\n", 'cyan')
        willing_to_solve = False
        while not willing_to_solve:
            target = get_captcha_title(driver)

            max_width, max_height, payload = get_captcha_dimensions_and_payload(driver)

            print colored(       "Target:\t", 'cyan') + colored(target,     'cyan', attrs=['bold']) + \
                  colored("\t" + "Width: \t", 'cyan') + colored(max_width,  'cyan', attrs=['bold']) + \
                  colored("\t" + "Height:\t", 'cyan') + colored(max_height, 'cyan', attrs=['bold']) 

            # If there is no image classifier for the current category, skip the category
            # As of June 14th, 2018, these five categories cover almost all reCaptchas served.
            #if target != "cars" and target != "store front" and target != "bus" and target != "roads": 
            #if target != "street signs":
            if target != "street signs" and target != "cars" and target != "store front" and target != "bus" and target != "roads": 
                print colored("This reCaptcha has no image classifier. Requesting a new reCaptcha...\n", 'magenta')
                reload_captcha(driver)
                #threshold = 0.01 # Reset the threshold for probability of being a car, since there is a new image set
                #threshold = 0.02 # Reset the threshold for probability of being a car, since there is a new image set
                #threshold = 0.25 # Reset the threshold for probability of being a car, since there is a new image set
                current_captcha = current_captcha + 1
            else:
                print colored("This reCaptcha is acceptable. Attempt to solve reCaptcha will begin...\n\n", 'green')
                if (target =="street signs"):
                    threshold = 0.40
                else: 
                    threshold = 0.02
                willing_to_solve = True

        # Consider FIXME-ing -- Random is probably not ideal, but allows for the loop to break 
        # when it hits uncommon edge cases while minimizing the performance hit that occurs
        # from running this fucking function.
        # Main need for performance improvements is with the street signs, so restricted to that
        # target category
        if (target != "street signs" or random.randint(1,5) == 5):
            print "Checking for errors before continuing..."
            if (   check_if_recaptcha_solver_is_stuck(driver)   ):
                return False        
        print colored("reCaptcha to solve has been chosen...", 'green')

        subdir_name = "recaptcha--"+str(int(time.time()))+"--"+target+"--"+str(max_width)+"x"+str(max_height)
        full_task_path = TASK_PATH + "/" + subdir_name
        download_recaptcha_images(driver, subdir_name)
        
        t_dir = os.listdir(full_task_path) 
        t_dir.sort()

        # build queue of files
        print colored("Creating queue of image files to solve...", 'cyan')
        to_solve_queue = {}
        idx = 0 
        for f in [full_task_path+"/"+f for f in t_dir if "output_" in f]:
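            # The crop step numbers tiles row by row, so both coordinates depend on the grid width.
            # Coordinates are 1-indexed to match the XPaths (x is the table row, y is the column).
            x = idx // max_width + 1
            y = idx % max_width + 1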
            #y = idx % 3 + 1  # making coordinates 1 indexed to match xpaths 
            #x = idx / 3 + 1
            to_solve_queue[(x, y)] = f
            idx += 1
        
        logging.debug(to_solve_queue)

        #print colored("Handling/solving of image queue starting...", 'cyan')
        print colored("Actual solving of the reCaptcha images starting...", 'cyan')        

        #threshold = threshold - 0.05 

        coor_dict = {}
        handle_queue(to_solve_queue, coor_dict, threshold, target)  # multithread builds out where to click
        logging.debug(coor_dict)
        #os.system("rm "+full_task_path+"/full_payload.jpeg")
        
        driver.switch_to.default_content()  
        iframe = driver.find_element(By.XPATH, "/html/body/div/div[4]/iframe")
        driver.switch_to.frame(iframe)

        #print colored("Actual solving of the reCaptcha images starting...", 'cyan')        
        continue_solving = True 
        while continue_solving:
            # Collect all the tiles to click in this round
            to_click_tiles = [coords for coords, to_click in coor_dict.items() if to_click]
            new_files = click_tiles(driver, to_click_tiles, subdir_name)
            if (new_files != None):
                handle_queue(new_files, coor_dict, threshold, target)
                # Keep going only while at least one tile is still flagged for clicking
                continue_solving = any(coor_dict.values())
            else:
                continue_solving = False

        #pdb.set_trace()
        print colored("The images that appear to match the category of ", 'cyan') + colored(target, 'cyan', attrs=['bold']) + colored(" have been clicked. Clicking the 'verify' button...", 'cyan')
        #wait_between(1.5, 2.5) # Wait for all the images to fully load in before clicking verify. Otherwise, it always gives a "Please Try Again" error for some reason. JK LOL IGNORE THIS
        #pdb.set_trace() # TODO -- add this back in as an optional parameter to human-verify captchas
        driver.find_element(By.ID, "recaptcha-verify-button").click()
        # wait_between(0.2, 0.5)
        wait_between(0.4, 0.6) # Increased to prevent getting stuck on wrong image (consider a less ghetto solution)
        #if driver.find_element_by_class_name("rc-imageselect-incorrect-response").get_attribute("style") != "display: none":
        
        # FIXME - Ghetto solution
        if (target != "street signs" or random.randint(1,3) == 3):
            if check_if_recaptcha_is_solved(driver) == False:
                print colored("reCaptcha is not yet solved. Continuing with solution...", 'red')
                continue_solving = True
            else:
                #timeout_timer.cancel()
                print colored("Recaptcha should be solved.", 'green', 'on_yellow')
                #time.sleep(10) # FIXME
                return True
        else:
            print colored("Verification step skipped for speed reasons (LMAO). Continuing with solution...", 'yellow')
            continue_solving = True


        # TODO check if captcha changed on hitting verify -- "Please Try Again" instead of "Please select all matching images."
        if (False and target != get_captcha_title(driver) ):
            print "New Captcha was served" # FIXME check for error message directly


def download_training_images(driver):
    print colored("Attempt to solve the image reCaptcha has started...\n", 'cyan')
    continue_solving = True
    while continue_solving:
        print colored("Searching for a relatively easy reCaptcha to solve...\n", 'cyan')
        willing_to_solve = False
        while not willing_to_solve:
            target     = get_captcha_title(driver)
            max_width, max_height, payload = get_captcha_dimensions_and_payload(driver)

            print colored(       "Target:\t", 'cyan') + colored(target,     'cyan', attrs=['bold']) + \
                  colored("\t" + "Width: \t", 'cyan') + colored(max_width,  'cyan', attrs=['bold']) + \
                  colored("\t" + "Height:\t", 'cyan') + colored(max_height, 'cyan', attrs=['bold']) 

            subdir_name = target #"recaptcha--"+str(int(time.time()))+"--"+target+"--"+str(max_width)+"x"+str(max_height)
            download_recaptcha_images(driver, subdir_name, True)

            reload_captcha(driver)


# ##### IMAGE CAPTCHA UTIL ##### #
def reload_captcha(driver):
    reload_button = driver.find_element(By.XPATH, "//*[@id=\"recaptcha-reload-button\"]")
    try:
        reload_button.click()
    except Exception as e:
        # Selenium exceptions do not carry errno/strerror, so report the type and message instead
        print colored("Error clicking the button to reload the captcha -- ({0}): {1}".format(type(e).__name__, e), 'red')
    wait_between(0.2, 0.5)

def get_captcha_title(driver):
    body = driver.find_element(By.CSS_SELECTOR, "body").get_attribute('innerHTML').encode("utf8")
    soup = BeautifulSoup(body, 'html.parser')
    #table = soup.findAll("div", {"id": "rc-imageselect-target"})[0]
    target = soup.findAll("div", {"class": "rc-imageselect-desc"})
    if not target: # find the target
        target = soup.findAll("div", {"class": "rc-imageselect-desc-no-canonical"})
    target = target[0].findAll("strong")[0].get_text()

    return target

def get_captcha_dimensions_and_payload(driver):
    body = driver.find_element(By.CSS_SELECTOR, "body").get_attribute('innerHTML').encode("utf8")
    soup = BeautifulSoup(body, 'html.parser')
    table = soup.findAll("div", {"id": "rc-imageselect-target"})[0]

    trs = table.findAll("tr")
    if (len(trs) > 4):
        #FIXME - Sort of ghetto
        max_height = 4
        #pdb.set_trace()
    else:
        max_height = len(trs)

    max_width = 0
    for tr in trs:
        imgs = tr.findAll("img")
        payload = imgs[0]["src"]
        if len(imgs) > max_width:
            max_width = len(imgs)

    return [max_width, max_height, payload]

def download_recaptcha_images(driver, subdir_name="garbage_heap", rand_file_names=False):
    max_width, max_height, payload = get_captcha_dimensions_and_payload(driver)

    # Pull down captcha to solve and organize directory structure

    print colored("Creating the directory ", 'cyan') + colored(subdir_name, 'cyan', attrs=['bold']) + colored(" to store the images for the current reCaptcha...", 'cyan')
    full_task_path = TASK_PATH + "/" + subdir_name
    print colored("Full path for this task:\t", 'cyan') + colored(full_task_path, 'cyan', attrs=['bold'])
     
    if not os.path.exists(full_task_path):
        os.makedirs(full_task_path)
    elif (rand_file_names == True):
        pass
    else:
        print colored("Directory already exists. Get better error handling, Jesus Christ.", 'red')
        return 0

    print colored("File download starting...", 'cyan')
    # TODO -- download directly to the correct directory (might need to be created first)
    #urllib.urlretrieve(payload, "captcha.jpeg")
    #os.system("mv captcha.jpeg '"+full_task_path+"/full_payload.jpeg'") # FIXME possible overwriting during multi-threaded operation
    urllib.urlretrieve(payload, full_task_path+"/full_payload.jpeg")

    t = ""
    if (rand_file_names == True):
        t = subdir_name+"--"+str(int(time.time()))+"--"
    print colored("Creating distinct images from the main reCaptcha grid...", 'cyan')
    os.system("convert \""+full_task_path+"/full_payload.jpeg\" -crop "+str(max_width)+"x"+str(max_height)+"@ +repage +adjoin \""+full_task_path+"/"+t+"output_%03d.jpg\"")
    print colored("The main grid of images has been split into ", 'cyan') +  colored(str(max_width*max_height), 'cyan', attrs=['bold'])  + colored(" individual images...", 'cyan')

# Returns True if the image classifier estimates that 'image' matches 'target' (e.g., "cars")
# with a confidence level greater than the specified threshold. Otherwise, returns False.
def parse_classify_image(image="output_008.jpg", threshold=0.95, target="cars"):
    t = label_image.classify_image(image, target)

    if ( t[0][0].find("not") == -1):
        success_label  = t[0][0]
        success_chance = t[1][0]
    else:
        success_label  = t[0][1]
        success_chance = t[1][1]

    # FIXME this relies on alphabetical order

    m = "The image " + image + " has a " + "{0:.2f}%".format( success_chance * 100 ) + " chance of matching the label "+str(target)+". Threshold: " + "{0:.2f}%".format( threshold * 100 )
    if ( success_chance > threshold ):
        # Leave this debugging statement in, but don't run it for performance reasons, probably
        #print colored(m, 'green')
        return True

    # Leave this debugging statement in, but don't run it for performance reasons, probably
    #print colored(m, 'red')
    return False

############################## UTIL FUNCTIONS #############################
def show_image(path):
    image_window = tk.Tk()
    img = ImageTk.PhotoImage(Image.open(path))
    panel = tk.Label(image_window, image=img)
    panel.pack(side="bottom", fill="both", expand="yes")
    image_window.mainloop()

# Actually a thing
def wait_between(a, b):
    rand = uniform(a, b)
    sleep(rand)
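
The image-reload wait in the listing above is a bounded polling loop: keep re-reading a value until it changes, and give up once a time budget runs out. Here is a minimal standalone sketch of that pattern, assuming Python 3 (the listing itself is Python 2). The names `wait_for_change` and `ElementTimeout`, and the injectable `sleep` parameter, are hypothetical and not part of the original solver.

```python
import time


class ElementTimeout(RuntimeError):
    """Raised when a polled value does not change before the time budget runs out."""


def wait_for_change(read_value, old_value, timeout=7.0, interval=0.5,
                    sleep=time.sleep):
    """Poll read_value() until it differs from old_value, or raise ElementTimeout.

    All parameter names are assumptions for this sketch. `sleep` is injectable
    so that the loop can be exercised without real delays.
    """
    remaining = timeout
    while remaining > 0:
        current = read_value()
        if current != old_value:
            return current
        sleep(interval)
        remaining -= interval
    raise ElementTimeout(
        "value did not change after waiting for {} seconds".format(timeout))


# Usage with a fake reader that returns the old src twice, then a new one.
_srcs = iter(["old.jpg", "old.jpg", "new.jpg"])
result = wait_for_change(lambda: next(_srcs), "old.jpg", sleep=lambda s: None)
print(result)
```

Raising a real exception type, rather than a bare string, lets the caller decide whether to reload the captcha or abandon the session.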

interface.py

# interface.py
# Config
DEBUG_MODE = False
MAX_THREADS  = 1
DELAY = 10.0
PROXY_TYPE = "NONE" # Assumed default: set_proxy() only configures a proxy when this is not "NONE"

#imports
import selenium
from selenium import webdriver
from selenium.webdriver.common import action_chains, keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

import sys
import os

import traceback
import thread
import threading
from threading import Thread
import pdb

import time
import random
import pprint
import math

import urllib
import csv
import sqlite3

from termcolor import colored, cprint

from screeninfo import get_monitors


# ***** Locally Stored Files ***** #
# reCaptcha Solver
recaptcha_path = os.path.join(  os.getcwd(), 'recaptcha_solver'  )
sys.path.append(recaptcha_path) # FIXME
import recaptcha_solver
sys.path.remove(recaptcha_path)

##############################################################################
####       ___                          _         _   _ _ _ _             ####
####      / _ \___ _ __   ___ _ __ __ _| |  /\ /\| |_(_) (_) |_ _   _     ####
####     / /_\/ _ \ '_ \ / _ \ '__/ _` | | / / \ \ __| | | | __| | | |    ####
####    / /_\\  __/ | | |  __/ | | (_| | | \ \_/ / |_| | | | |_| |_| |    ####
####    \____/\___|_| |_|\___|_|  \__,_|_|  \___/ \__|_|_|_|\__|\__, |    ####
####                                                            |___/     ####
##############################################################################

def import_jquery(driver):
    with open('assets/jquery-3.3.1.min.js', 'r') as jquery_js: 
        jquery = jquery_js.read() # Read the jQuery source from a file
        driver.execute_script(jquery) # Activate the jQuery library in the current page


def clear_cookie_and_session_data(driver):
    # The driver is passed in explicitly, since there is no module-level driver in this file
    print "Clearing cookie and session data..."
    driver.delete_all_cookies()
    print "Cookie and session data have been cleared..."


def set_proxy(driver, ip=False, port=False):
    
    profile = webdriver.FirefoxProfile() 
    if (PROXY_TYPE != "NONE"):
        if (ip==False and port==False):
            proxies = []

            # Fill in proxy details
            proxies.append(  [ "xxx.xxx.xxx.xxx", 00000 ]  ) 
            proxies.append(  [ "xxx.xxx.xxx.xxx", 00000 ]  ) 

            # Connect to a random proxy
            t = random.choice(proxies)
            ip = t[0]; port = t[1]

        profile.set_preference("network.proxy.type", 1)
        profile.set_preference("network.proxy.http",        ip   )
        profile.set_preference("network.proxy.http_port",   port )
        profile.set_preference("network.proxy.ssl",         ip   )
        profile.set_preference("network.proxy.ssl_port",    port )

    profile.set_preference("browser.content.main-window.width", 20)
    profile.set_preference("browser.content.main-window.height", 30)

    profile.update_preferences() 

    driver = webdriver.Firefox(profile, executable_path='assets/selenium-drivers/geckodriver')


    return driver

def refresh_full_session(driver, num_instances=None):
    try:
        driver.quit()
    except Exception:
        # driver is None on the first call, or the previous session is already closed
        pass

    driver = set_proxy(driver)

    # Start the session
    driver.get("about:newtab") 
    driver.implicitly_wait(10) # seconds
    resize_window(driver, num_instances)
    return driver

def test_recaptcha_solver():
    print "Starting test..."
    driver = None
    for i in range(12000):
        print "Loop "+str(i)+"..."
        driver = refresh_full_session(driver)
    
        try:
            recaptcha_solver.recaptcha_solver_demo(driver)
        except Exception:
            traceback.print_exc()
            print "Unknown error during recaptcha solver test..."
        finally:
            # Quit exactly once, whether or not the solver raised
            driver.quit()
    print "End of test..."

# Main
def main_loop(loop_until_all_complete=True):
    choice = test_recaptcha_solver

    try:
        while ( choice() != -1):
            pass
    except (KeyboardInterrupt, SystemExit):
        print "Keyboard interrupt or system exit detected. Killing all threads..."
        cleanup_stop_thread()
        sys.exit()

main_loop()
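
The test loop in interface.py starts a fresh browser session on every iteration, so each session needs to be shut down exactly once, even when the solver throws. Here is a minimal sketch of that cleanup pattern using `try`/`finally`, assuming Python 3. `FakeDriver`, `run_once`, and `make_driver` are hypothetical stand-ins so the example runs without Selenium.

```python
import traceback


class FakeDriver(object):
    """Stand-in for a Selenium driver (hypothetical, for illustration only)."""

    def __init__(self):
        self.quit_calls = 0

    def quit(self):
        self.quit_calls += 1


def run_once(make_driver, task):
    """Run task(driver) in a fresh session and always quit the driver exactly once.

    Returns True if the task finished without raising, False otherwise.
    """
    driver = make_driver()
    try:
        task(driver)
        return True
    except Exception:
        # Log and move on; KeyboardInterrupt and SystemExit still propagate
        traceback.print_exc()
        return False
    finally:
        driver.quit()


def failing_task(driver):
    raise ValueError("simulated solver failure")


drivers = []


def make_driver():
    drivers.append(FakeDriver())
    return drivers[-1]


results = [run_once(make_driver, lambda d: None),
           run_once(make_driver, failing_task)]
print(results, [d.quit_calls for d in drivers])
```

Catching `Exception` rather than using a bare `except:` means Ctrl+C still reaches the outer handler in `main_loop`.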

Implications

Why reCaptcha 3.0 Changes Nothing

At face value, the image-clicking challenges appear to have been removed in reCaptcha 3.0 for user convenience. In practice, they appear to have been removed mainly because image captchas have become increasingly less effective over time. By the time reCaptcha 3.0 launched in 2018, reCaptcha 2.0 was already operating almost entirely based on non-challenge anti-spam factors, such as IP address and user cookies.

Without valid cookie history and/or a top-tier proxy, V2 is already extremely slow to solve (by design) and is arguably borderline unsolvable. A classic example of this was seen by everyone who attempted to solve reCaptcha on Tor during 2018.

Image classification is a solved problem at this point. It seems like they're removing the already pointless image-clicking part and leaving in the rest.

V2 solvers that are halfway decent at cookie farming should work fine on V3.
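
For context on what "leaving in the rest" looks like from the site owner's side: instead of presenting a challenge, V3 hands the site a score between 0.0 and 1.0 in the server-side verification response, and the site decides what to do with it. Here is a minimal sketch of that decision, assuming the response JSON has already been fetched and decoded into a dict. The `success`, `score`, and `action` fields follow Google's documented response format; the function name and the 0.5 cutoff are assumptions for illustration.

```python
def is_probably_human(verify_response, expected_action, min_score=0.5):
    """Decide whether to accept a request from a parsed V3 verification response.

    verify_response is the decoded JSON body as a dict. The function name and
    the default cutoff are hypothetical; each site picks its own threshold.
    """
    if not verify_response.get("success", False):
        return False
    if verify_response.get("action") != expected_action:
        return False
    return verify_response.get("score", 0.0) >= min_score


good = {"success": True, "score": 0.9, "action": "login"}
bad = {"success": True, "score": 0.1, "action": "login"}
print(is_probably_human(good, "login"))
print(is_probably_human(bad, "login"))
```

Nothing in this decision involves an image challenge, which is consistent with the point above: the score is driven by the same non-challenge factors that V2 was already leaning on.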

Should You Use reCaptcha?

There are three types of spam that websites should be protected from:

For the average website (i.e., sites that don’t at least have a few million unique users per month):

reCaptcha is a substantial inconvenience to users (and is also borderline unusable through Tor), and every inconvenience on your site negatively impacts not only usability but also conversion rate. While reCaptcha’s use of many factors arguably makes it a good fit for massive sites like circlejerk comment fanclub, expired username land, and Christopher Poole’s anime fan site, for the average use case reCaptcha is excessive, unnecessary, and intrusive.

Posted by August R. Garcia 2 months ago

Edit History

• [2019-04-22 22:23 PDT] August R. Garcia (2 months ago)
🕓 Posted at 22 April, 2019 22:23 PM PDT


The .tar.gz of sorted reCaptcha photos has been moved to this URL on the 256 Kilobytes server, since the old link expired:

Posted by August R. Garcia on 08 June, 2019 at 01:29 AM PDT
