PHP Input Validation

Problem

I need to validate input from a textarea field. I want to allow a few tags, like a, i, and b, but everything else needs to be filtered out. The input should also be checked to see that its markup is nested properly.

Overview

User input sanitization and validation is one of those things that just needs to be done. Dealing with a textarea is more complicated than dealing with most inputs, and validating HTML for re-display on a web page is less trivial than it seems at a glance. It’s a pain in the ass, so I like it when I can find someone who has already solved the issue. I found a new solution here: HTML Purifier.

The author, ezyang, offers a detailed study of the issue as well as a short history of the PHP HTML validators that preceded his and where they fall short. It’s well done, an impressive work of engineering and scholarship, but I hesitate to use the class because it’s just so damned large. The dependencies are also a little confusing to me, though I am sure they could be sorted out easily enough by testing the class. On the plus side, the API is simple enough. But I don’t need the kind of comprehensive solution that’s offered here.

The Process

The solution I am looking for involves two parts: (1) sanitizing the input and (2) validating the markup. So I figured I’d take the Reese’s approach and combine two existing classes that taste great together. The data sanitization component is based on the Input Filter class that ezyang critiques, justifiably, as inadequate. It does the job of sanitizing input. Where it falls short is in validating the markup to make sure it’s properly formed. That’s where the second source comes in: Simon Willison’s Safe HTML Checker. This relies on PHP’s native XML parser and is nice and short.

I’m calling my class Input Baffle, where baffle means “A device used to restrain or regulate.” It basically acts as a wrapper (is this the composite pattern?) for these two subordinate classes.
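To make the wrapper idea concrete, here is a minimal sketch of the shape such a class could take. It is not the actual Input Baffle code: the class name InputBaffleSketch, the method process(), and the two stand-in callables are all assumptions for illustration. The callables take the place of InputFilter and SafeHtmlChecker, and strip_tags() is only a crude placeholder for a real sanitizer, since it does nothing about attributes.

```php
<?php
// Illustrative sketch only. The class and method names are hypothetical, and
// the two callables stand in for InputFilter and SafeHtmlChecker.
class InputBaffleSketch
{
    private $sanitize; // callable(string): string
    private $isValid;  // callable(string): bool

    public function __construct(callable $sanitize, callable $isValid)
    {
        $this->sanitize = $sanitize;
        $this->isValid = $isValid;
    }

    // Step 1: sanitize. Step 2: validate the sanitized markup.
    // Returns the sanitized markup, or null when validation fails.
    public function process(string $input): ?string
    {
        $clean = ($this->sanitize)($input);
        return ($this->isValid)($clean) ? $clean : null;
    }
}

// Stand-in collaborators for demonstration: strip_tags() keeps only the
// listed tags, and the validator only checks that <b> tags are balanced.
$baffle = new InputBaffleSketch(
    function (string $s): string {
        return strip_tags($s, '<a><i><b>');
    },
    function (string $s): bool {
        return substr_count($s, '<b>') === substr_count($s, '</b>');
    }
);
```

Passing the two steps in as collaborators keeps the wrapper itself tiny, and either half can be swapped out without touching the other.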

Even with the hard work done for me, the problem still proved thorny and took me several hours to work out all the significant wrinkles. The major complications I encountered included:

1. Parser Iteration in Safe HTML Checker

I set up my class to create a single instance of the SafeHtmlChecker class as a member. SafeHtmlChecker is essentially a wrapper for PHP’s native XML parser. The trouble is that there isn’t a way, or at least not an obvious one, to reuse the parser after it has detected an error. That matters whenever my class validates more than one field in a form, as it does in my unit test. The problem is identified here. The solution was to have each call that validates submitted markup create, and destroy, a separate parser.
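The parser-per-call idea can be sketched roughly as follows. This is not the SafeHtmlChecker code itself: the function name is_well_formed_fragment() and the dummy root wrapper element are assumptions, and the sketch checks well-formedness only, not which tags are allowed.

```php
<?php
// Hypothetical helper (not the actual SafeHtmlChecker code): checks that a
// fragment is well-formed XML, using a fresh parser on every call.
function is_well_formed_fragment(string $fragment): bool
{
    // A new parser per call, so an error in one field cannot affect the next.
    $parser = xml_parser_create('UTF-8');

    // Wrap the fragment in a dummy root element so that plain text and
    // sibling tags still form a single XML document.
    $ok = xml_parse($parser, '<root>' . $fragment . '</root>', true) === 1;

    // On PHP 7 and earlier you would call xml_parser_free($parser) here.
    // On PHP 8 the parser is an object and is released when it goes out of
    // scope, so the explicit call is left out of this sketch.
    return $ok;
}

var_dump(is_well_formed_fragment('<b>bold <i>nested</i></b>'));  // properly nested input
var_dump(is_well_formed_fragment('<b>bold <i>crossed</b></i>')); // crossed tags
var_dump(is_well_formed_fragment('<a href="#">next field</a>')); // new parser, unaffected
```

Because nothing is kept between calls, a malformed first field has no effect on how the second field is judged.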

2. Recursion Overflow in InputFilter

Try running this <<<>>> through InputFilter. Memory overflow. To solve that problem, I added a preprocessor to my class that uses regex to eliminate this and a few other hobgoblins that InputFilter misses.
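A preprocessor of this kind might look something like the sketch below. The function name strip_bracket_runs() and the exact pattern are assumptions, not the regex used in my class, and the sketch covers only the repeated-bracket case rather than the other hobgoblins.

```php
<?php
// Hypothetical preprocessor, not the actual Input Baffle code: removes runs
// of two or more identical angle brackets before the text reaches the filter.
function strip_bracket_runs(string $input): string
{
    // "<{2,}" matches "<<", "<<<", and so on; ">{2,}" does the same for ">".
    // Ordinary markup such as "<b><i>" contains "><" but never "<<" or ">>",
    // so properly written tags pass through untouched.
    $result = preg_replace('/<{2,}|>{2,}/', '', $input);

    // preg_replace() returns null on a regex error; fail closed in that case.
    return $result ?? '';
}

echo strip_bracket_runs('before <<<>>> after'), "\n";
echo strip_bracket_runs('<b><i>still intact</i></b>'), "\n";
```

Running this before the filter means the pathological input never reaches the recursive code at all.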

3. InputFilter vs. the rsnake XSS Cheat Sheet

I discovered the XSS cheat sheet on the HTML Purifier site. It makes an excellent test set, so I added methods (based on HTML Purifier’s smoke test) to run it. HTML Purifier aces the test. InputFilter does not. InputFilter in conjunction with SafeHtmlChecker catches all the exploits except one, so I just added that one to the preprocessor.

Solution

You can find my code, with unit tests, on my Google Code site:

http://klenwell.googlecode.com/svn/trunk/projects/php/kwoss/input_baffle/

My goal ultimately is to convert this to a CakePHP behavior.

I already knew the stakes involved with input filtering and validation. And I knew it was not a trivial problem. Working through this, and referring to ezyang’s more fastidious solution, gave me a much deeper appreciation for how complex it really is.