Web: Stupid HTML trick to get past content filters
by firestorm_v1 on May.02, 2010, under How-To's, Miscellaneous, Networking, Software
I know it’s been a while since I posted, and I do apologize. Life has definitely not been kind to me where time is concerned, but I have not forgotten anything. I have two major posts coming up, hopefully within the next week. In the meantime, here’s a quick article about a trick I discovered while working on a project with a friend. The project was to see if the content filter in their chat application could be broken, and through a little bit of HTML know-how and some PHP code, I was able to crank out a generator to do just that. Read more to find out the details.
The Challenge:
The trick was to figure out how to get certain “four letter words” past the chat app’s filter and into the main chat window without the word being munged by the system. Most chat applications filter out obscene words through a string matching system that replaces each match with something much less offensive, usually a series of asterisks. The only thing I could use was straight ASCII characters, and I couldn’t use any “img src” HTML tags to do the dirty work (literally).
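To make the target concrete, here is a minimal sketch of the kind of string matching filter described above. It is not the code of any real chat app: the function name `naive_filter`, the blocklist, and the harmless stand-in word “taco” are all assumptions made for illustration.

```python
import re

# Hypothetical blocklist; "taco" is a harmless stand-in for an offending word.
BANNED_WORDS = ["taco"]

def naive_filter(message, banned=BANNED_WORDS):
    """Replace every banned word with asterisks of the same length."""
    for word in banned:
        # Case-insensitive literal match on the raw message text.
        pattern = re.compile(re.escape(word), re.IGNORECASE)
        message = pattern.sub("*" * len(word), message)
    return message

print(naive_filter("I would like a taco"))  # I would like a ****
```

A filter like this only looks at the literal characters in the message, which is exactly the weakness the rest of the article leans on.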
The Analysis:
All HTML code that is rendered is associated with something called a character set (or code page, from the old MS-DOS days). These character sets associate each character with a certain number (often called its ASCII value). Although some characters are standard across character sets (like “a” = 97), characters above 127 (decimal) vary significantly from one set to another. In order to properly convey control characters and non-ASCII characters via the web, urlencoding (also called percent-encoding) was created. It is defined as part of the URL specification and is what browsers use when submitting HTML forms. What this means is that any byte can be represented through the use of the percent sign (%) modifier. The syntax is % followed by the byte value as two hexadecimal digits. The general idea was that if you typed in a Russian name using symbols not found in the Latin alphabet, these symbols could be properly represented on the server side.
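As a quick illustration of urlencoding, the sketch below uses Python’s standard `urllib.parse` module. The sample name is my own choice, and `quote` encodes text as UTF-8 by default, so each Cyrillic letter becomes two percent-encoded bytes.

```python
from urllib.parse import quote, unquote

name = "Иван"  # sample Russian name, chosen only for illustration

encoded = quote(name)      # percent-encode the UTF-8 bytes of the name
decoded = unquote(encoded) # turn the escapes back into the original text

print(encoded)         # %D0%98%D0%B2%D0%B0%D0%BD
print(decoded)         # Иван
print(unquote("%61"))  # a  (hex 61 is decimal 97)
```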
With that in mind, I examined the UTF-8 character set. In this example, I’ll use the word “taco” to represent the offending word.
How it’s done:
The process for this is as follows:
- Find the ASCII value for each character in the word
- Find the hexadecimal value for the ASCII value
- Add “%” in front of that number
- Insert a “null” character somewhere.
For reference, you can use this chart, which gives you both the decimal and the hexadecimal ASCII values.
From the chart, we see the following information:
t = 116 (decimal) or 74 (hex)
a = 97 (decimal) or 61 (hex)
c = 99 (decimal) or 63 (hex)
o = 111 (decimal) or 6f (hex)
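If you don’t have a chart handy, the same values can be computed directly. This is a small sketch using Python’s built-in `ord`; the helper name `ascii_rows` is made up for this example.

```python
def ascii_rows(word):
    """Return (character, decimal value, two-digit hex value) for each character."""
    return [(ch, ord(ch), format(ord(ch), "02x")) for ch in word]

for ch, dec, hx in ascii_rows("taco"):
    print(f"{ch} = {dec} (decimal) or {hx} (hex)")
```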
Using this information, we can then create our string, inserting the % where needed. %74 %61 %63 %6f
Only one item remains. In order to fool some of the more intelligent content filters, you need to put a “null” character in there somewhere. This throws off the content filter because the sequence of characters it sees no longer matches the blocked word. For this, I used character 0B (the vertical tab control code), which does not have a Latin equivalent and does not render in HTML. I used 0B because 08 rendered as a tab in testing.
Knowing this, I inserted the “null” character between the urlencoded “a” and the urlencoded “c”: %74 %61 %0B %63 %6f
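The whole process can be sketched as a small generator. This is not the PHP generator mentioned at the top of the post, just a Python approximation of the same steps. The names `percent_encode_word`, `FILLER`, and `filler_at` are assumptions for this example, and it only handles plain ASCII input.

```python
from typing import Optional
from urllib.parse import unquote

FILLER = "%0B"  # the non-rendering control code (vertical tab) used in this article

def percent_encode_word(word: str, filler_at: Optional[int] = None) -> str:
    """Percent-encode each ASCII character, optionally inserting the filler."""
    if not word.isascii():
        raise ValueError("this sketch only handles ASCII input")
    # Steps 1-3: character -> ASCII value -> two hex digits -> "%" prefix.
    parts = [f"%{ord(ch):02X}" for ch in word]
    # Step 4: insert the filler before the character at index filler_at.
    if filler_at is not None:
        parts.insert(filler_at, FILLER)
    return "".join(parts)

encoded = percent_encode_word("taco", filler_at=2)
print(encoded)                 # %74%61%0B%63%6F
print(repr(unquote(encoded)))  # 'ta\x0bco'
```

Hex digits in percent escapes are not case sensitive, so %6F and %6f mean the same thing. The decoded string shows why a plain string match misses it: the control character sits between “ta” and “co”.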
Testing it out:
All that is needed to test it is to copy and paste the above string into any chat application and hit send. You will need to remove the spaces between the encoded characters; otherwise your application will treat them as renderable characters as well. If it works, you’ll see the word “taco” in your window. Now you know how to get past content filters. If you are in the business of building content filters, you now know one more trick that your filter needs to catch.
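For the filter builders, one possible countermeasure is to normalize the message before matching: decode any percent escapes, drop control characters, and only then run the word check. The sketch below assumes the app decodes percent escapes before display, and the names `normalize` and `filter_message` are hypothetical. Note that it strips all control characters, including tabs and newlines, and that it returns the normalized text rather than the original.

```python
import re
import unicodedata
from urllib.parse import unquote

def normalize(message):
    """Decode percent escapes, then remove control characters (category Cc)."""
    decoded = unquote(message)
    return "".join(ch for ch in decoded if unicodedata.category(ch) != "Cc")

def filter_message(message, banned=("taco",)):
    """Run the asterisk replacement on the normalized text."""
    cleaned = normalize(message)
    for word in banned:
        cleaned = re.sub(re.escape(word), "*" * len(word), cleaned, flags=re.IGNORECASE)
    return cleaned

print(filter_message("%74%61%0B%63%6F"))  # ****
```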
Don’t be a prick!
I posted this information with the hopes that people may find it useful, not so that script kiddies can run around and make asses of themselves. Be smart about how you use this information and last but not least, DON’T BE A PRICK!