Unicode Visual Spoofing for Good: Confusable CAPTCHAs

In this blog post, I will show a proof-of-concept method that uses Unicode Visual Spoofing/Lookalikes in a CAPTCHA to help prevent automated bots from scraping pages and auto-submitting data.

Unicode Visual Spoofing/Lookalikes

An in-depth discussion of Unicode and the security challenges it poses is beyond the scope of this post; however, there are a few salient points to mention. The first is the issue of Visual Spoofing. Chris Weber of Casaba Security has an outstanding presentation entitled "Exploiting Unicode-enabled Software" in which he outlines this issue. Here are two applicable points:

Visual Spoofing

  • Over 100,000 assigned characters
  • Many lookalikes within and across scripts


Example IDN Homograph Attack

www.google.com is not www.gooɡle.com

g = Latin U+0067
ɡ = Latin U+0261

The main issue for security is that, unless data is properly canonicalized before security checks, it is possible for attackers to evade detection. Unicode Visual Spoofing can easily be used by criminals in phishing attacks. Even savvy Internet users may be tricked into clicking on links, as these Unicode code points are often visually indistinguishable from one another.
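To make the homograph example concrete, here is a minimal Python sketch that lists the non-ASCII characters in a string along with their code points and Unicode names. The function name describe_non_ascii is hypothetical and only for illustration. Flagging non-ASCII characters is a crude stand-in for real confusable detection, which would need something like Unicode's confusables data.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    findings = []
    for ch in text:
        if ord(ch) > 127:
            name = unicodedata.name(ch, "UNKNOWN")
            findings.append((ch, f"U+{ord(ch):04X}", name))
    return findings


genuine = "www.google.com"
spoofed = "www.goo\u0261le.com"  # U+0261 stands in for the second "g"

# The two hostnames look alike on screen but are different strings.
print(genuine == spoofed)
print(describe_non_ascii(genuine))
print(describe_non_ascii(spoofed))
```

A program comparing the two strings sees them as different even though a person reading them may not, which is the gap the rest of this post relies on.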


The underlying issue outlined above is that computer programs and humans may interpret Unicode characters differently. We can leverage this issue in our favor if we implement the same concept in a different context - CAPTCHAs.

A CAPTCHA (pronounced /ˈkæptʃə/) is a type of challenge-response test used in computing as an attempt to ensure that the response is not generated by a computer. The process usually involves one computer (a server) asking a user to complete a simple test which the computer is able to generate and grade. Because other computers are supposedly unable to solve the CAPTCHA, any user entering a correct solution is presumed to be human. Thus, it is sometimes described as a reverse Turing test, because it is administered by a machine and targeted to a human, in contrast to the standard Turing test that is typically administered by a human and targeted to a machine. A common type of CAPTCHA requires the user to type letters or digits from a distorted image that appears on the screen.

Here is an example of typical CAPTCHA usage where a graphic is used with obscured text characters displayed:

The user must visually decipher the test and input it into the text box.

Turning the Tables: Visual Spoofing in CAPTCHAs

Rather than using an image file with obscured text in it, the concept presented here is to use Unicode Visual Spoofing/Lookalikes to essentially "trick" the user into entering the text that you desire.

Here is an example comment form CAPTCHA that implements this concept by adding an additional field to the end of the form:

<form method="post" action="http://www.example.com/cgi-bin/mt/mt-c.cgi" name="comments_form" id="comments-form" onsubmit="if (this.bakecookie.checked) rememberMe(this)">
    <input type="hidden" name="static" value="1" />
    <input type="hidden" name="entry_id" value="43271" />
    <input type="hidden" name="__lang" value="en" />
    <input type="hidden" name="parent_id" value="" id="comment-parent-id" />
    <div id="comments-open-data">
        <div id="comment-form-name">
            <label for="comment-author">Name</label>
            <input id="comment-author" name="author" size="30" value="" />
        </div>
        <div id="comment-form-email">
            <label for="comment-email">Email Address</label>
            <input id="comment-email" name="email" size="30" value="" />
        </div>
        <div id="comment-form-remember-me">
            <label for="comment-bake-cookie"><input type="checkbox" id="comment-bake-cookie" name="bakecookie" onclick="if (!this.checked) forgetMe(document.comments_form)" value="1" />
                Remember personal info?</label>
        </div>
    </div>
    <div id="comments-open-text">
        <label for="comment-text">Comments (You may use HTML tags for style)</label>
        <textarea id="comment-text" name="text" rows="15" cols="50"></textarea>
    </div>
    <div id="comments-open-footer">
        <!--input type="submit" accesskey="v" name="preview" id="comment-preview" value="Preview" /-->
        <br><label for="challenge_answer">Type the word &#1072;pple below. <strong>(required)</strong>:</label><br /><input type="text" id="challenge_answer" name="challenge_answer" /><br><input type="submit" accesskey="s" name="post" id="comment-submit" value="Submit" />
    </div>
</form>

This HTML adds a new text field called "challenge_answer", whose value is sent along with the standard POST arguments when the form is submitted to the web app. Notice the label text at the end of the form: it uses an encoded Cyrillic small letter "а" (&#1072;, U+0430) instead of a Latin small letter "a" to display the word "apple".

Here is how the form would look to a user in a web browser:

[Screenshot of the rendered comment form]

So the concept is that a malicious SPAM bot program would most likely scrape the raw HTML above and submit either the raw entity &#1072; or the decoded Cyrillic small letter а (U+0430) in the text field, while a human would type a normal Latin small letter "a" when spelling the word "apple".
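The difference between what a scraper extracts and what a person types can be sketched in a few lines of Python. This is only an illustration under assumptions: bot_answer is a hypothetical, deliberately naive scraper that decodes HTML entities and takes the token after "word"; real bots may behave differently.

```python
import html

CHALLENGE_MARKUP = "Type the word &#1072;pple below."
HUMAN_ANSWER = "apple"  # typed on a keyboard: five Latin letters


def bot_answer(markup):
    """Naive scraper (hypothetical): decode entities, return the token after 'word'."""
    words = html.unescape(markup).split()
    return words[words.index("word") + 1]


scraped = bot_answer(CHALLENGE_MARKUP)

# The scraped answer begins with U+0430, so it does not equal the typed answer.
print([f"U+{ord(ch):04X}" for ch in scraped])
print([f"U+{ord(ch):04X}" for ch in HUMAN_ANSWER])
print(scraped == HUMAN_ANSWER)
```

A bot that skips entity decoding and submits the literal string "&#1072;pple" fails the same comparison.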

Implementation/Validation of Confusable CAPTCHA using ModSecurity

We can implement this Confusable CAPTCHA concept dynamically into forms by using new ModSecurity v2.6 capabilities such as Content Modification.

Enabling Content Modification

In order to dynamically modify outbound response bodies in ModSecurity, you must enable the following two directives:

SecContentInjection On
SecStreamOutBodyInspection On

Modifying Outbound Forms

In order to modify the existing HTML form data, you can use the following example ModSecurity rule, which uses the new @rsub operator for data substitution:

SecRule STREAM_OUTPUT_BODY "@rsub s/<input type=\"submit\"/<br><label for=\"challenge_answer\">Type the word &#1072;pple below. <strong>(required)<\/strong>:<\/label><br \/><input type=\"text\" id=\"challenge_answer\" name=\"challenge_answer\" \/><br><input type=\"submit\"/" \
    "phase:4,t:none,nolog,pass"

This rule matches any existing form "Submit" button element and inserts our Confusable CAPTCHA markup immediately before it.
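The substitution the rule performs can be approximated outside ModSecurity with a short Python sketch. This is not how ModSecurity implements @rsub; inject_challenge is a hypothetical helper that only mirrors the search-and-replace on a response body held in a string.

```python
import re

# Markup inserted in front of every submit button (same text as in the rule).
CHALLENGE = (
    '<br><label for="challenge_answer">Type the word &#1072;pple below. '
    '<strong>(required)</strong>:</label><br />'
    '<input type="text" id="challenge_answer" name="challenge_answer" /><br>'
)
SUBMIT = '<input type="submit"'


def inject_challenge(body):
    """Insert the challenge markup before each submit button in the body."""
    # A function replacement keeps the inserted text literal (no escape handling).
    return re.sub(re.escape(SUBMIT), lambda m: CHALLENGE + m.group(0), body)


page = '<form><input type="submit" name="post" value="Submit" /></form>'
print(inject_challenge(page))
```

Because the pattern starts with "<input", the commented-out Preview button in the example form (which begins with "<!--input") is left alone.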

Validating CAPTCHA Data

We now implement two SecRules to validate the CAPTCHA data.

SecRule REQUEST_FILENAME "@streq /cgi-bin/mt/mt-c.cgi" "chain,phase:2,t:none,block,msg:'Comment Post Error: CAPTCHA Challenge Missing.'"
        SecRule &ARGS:CHALLENGE_ANSWER "@eq 0"

SecRule REQUEST_FILENAME "@streq /cgi-bin/mt/mt-c.cgi" "chain,phase:2,t:none,block,msg:'Comment Post Error: Invalid CAPTCHA Challenge Answer.',logdata:'%{args.challenge_answer}'"
        SecRule ARGS:CHALLENGE_ANSWER "!@streq apple"

These rules apply to requests for the comment form's receiving page (/cgi-bin/mt/mt-c.cgi) and ensure that the challenge_answer parameter is present and that it contains exactly the word "apple" with a Latin lowercase "a". If either check fails, the request will be blocked and an alert generated.

Example alert:

[Tue May 10 08:42:30 2011] [error] [client xxx.xxx.xxx.xxx] ModSecurity: Warning. Match of "streq apple" against "ARGS:challenge_answer" required. [file "/usr/local/apache/conf/crs/base_rules/modsecurity_crs_14_customrules.conf"] [line "9"] [msg "Comment Post Error: Invalid CAPTCHA Challenge Answer."] [data "&#1072;pple"] [hostname "www.example.com"] [uri "/cgi-bin/mt/mt-c.cgi"] [unique_id "TckytsCoAW0AAB9vOWoAAAAD"]
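For readers who want to try the same checks in application code, the logic of the two validation rules can be mirrored in Python. This is a rough sketch, not a replacement for the rules: check_challenge is a hypothetical function, and args is assumed to be a plain dict of already-decoded POST parameters.

```python
EXPECTED_ANSWER = "apple"  # Latin letters only


def check_challenge(args):
    """Return an error message for a failed challenge, or None if it passes."""
    # First rule: the challenge_answer parameter must be present.
    if "challenge_answer" not in args:
        return "Comment Post Error: CAPTCHA Challenge Missing."
    # Second rule: the value must match the Latin spelling exactly.
    if args["challenge_answer"] != EXPECTED_ANSWER:
        return "Comment Post Error: Invalid CAPTCHA Challenge Answer."
    return None


print(check_challenge({"text": "nice post"}))
print(check_challenge({"challenge_answer": "&#1072;pple"}))
print(check_challenge({"challenge_answer": "\u0430pple"}))
print(check_challenge({"challenge_answer": "apple"}))
```

The exact string comparison is the whole point: a lookalike answer is rejected even though it renders the same as the expected word.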

Confusable CAPTCHA Effectiveness

Keep in mind that this is simply a proof of concept at this point and it has not yet been field tested. This implementation is not meant as a replacement for programs such as ReCAPTCHA. The idea is that this implementation would stop automated programs from scraping your comment form data and auto-submitting SPAM posts. This concept would obviously be circumvented by CAPTCHA answering services as well.

If you decide to field test this concept, we would love to hear from you.
