The simplest way to prevent HTML injection is to keep any third-party content from getting into the site. This isn't quite as easy as it sounds. The first step is to exclude all user input. It's also necessary to guard against altered cookies and URLs: carelessly copying their parameters and content into the page allows client-side or server-side injection.
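As a minimal sketch of the point about URLs, the following Python snippet escapes a query parameter before placing it in a page. The function name `render_greeting` and the `name` parameter are hypothetical, chosen only for illustration; the calls used are from Python's standard library.

```python
from html import escape
from urllib.parse import parse_qs, urlsplit


def render_greeting(url: str) -> str:
    """Build an HTML fragment from the (hypothetical) 'name' query parameter."""
    params = parse_qs(urlsplit(url).query)
    name = params.get("name", ["guest"])[0]
    # escape() replaces &, <, > and quote characters with entities, so the
    # value is displayed as text instead of being parsed as markup.
    return f"<p>Hello, {escape(name, quote=True)}!</p>"


if __name__ == "__main__":
    print(render_greeting("https://example.com/?name=%3Cscript%3Ealert(1)%3C/script%3E"))
```

Escaping at the moment the value is written into the page means the check cannot be bypassed by how the value arrived, whether through a URL, a cookie, or a form field.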
If accepting and posting user input is part of the site's purpose, defense becomes more complicated. The site can disallow all HTML tags and accept only plain text, but the server code has to be on the lookout for tricks. By using quoting mechanisms, devious users can insert tags into the input without ever entering a "<".
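One such trick can be sketched in Python, assuming the application percent-decodes the input at some later stage. The function names here are hypothetical. A check that looks for angle brackets in the raw input passes the payload, yet the decoded text contains a complete tag; escaping the decoded value at output time closes the gap.

```python
from html import escape
from urllib.parse import unquote


def naive_is_plain_text(raw: str) -> bool:
    """Flawed check: inspects the input before it has been decoded."""
    return "<" not in raw and ">" not in raw


def render_comment(raw: str) -> str:
    """Decode first, then escape the result when writing it into the page."""
    decoded = unquote(raw)
    return f"<p>{escape(decoded)}</p>"


if __name__ == "__main__":
    payload = "%3Cscript%3Ealert(1)%3C%2Fscript%3E"
    print(naive_is_plain_text(payload))  # the naive check is fooled
    print(unquote(payload))              # the decoded text is a script tag
    print(render_comment(payload))       # escaped output is inert text
```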
Users like to be able to mark up their text, adding bold and italic typefaces and perhaps inline images and embedded videos. A site that supports this needs to do careful sanitization or filtering of the input. Some tags, such as em and strong, are safe. Others, such as script and applet, definitely aren't.
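The idea of an allowlist of safe tags can be illustrated with a short Python sketch built on the standard library's `html.parser`. The class and function names are hypothetical, and this is a teaching example only: it drops every attribute, does not balance unclosed tags, and is no substitute for a vetted sanitization library.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative allowlist: the two tags the text names as safe.
ALLOWED_TAGS = {"em", "strong"}


class AllowlistFilter(HTMLParser):
    """Keep allowlisted tags (without attributes); escape all text."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data))


def filter_markup(text: str) -> str:
    parser = AllowlistFilter()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(filter_markup('<strong onclick="x()">bold</strong><applet>run</applet>'))
```

Tags outside the allowlist are removed while their text content is kept as escaped plain text; anything not explicitly permitted is treated as unsafe by default.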
It's not just tags that are risky. HTML attributes and CSS properties also need to be restricted. An attribute with an opening quote and no closing quote could eat content which should appear. The position CSS property can move user-provided content into places where it looks like part of the website. If links are allowed, they could lead to dangerous sites.
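One kind of restriction on links can be sketched as a scheme check in Python. The function name and the choice of allowed schemes are assumptions made for illustration. The check only rejects links that are not absolute http or https URLs; it says nothing about whether the destination site itself is trustworthy.

```python
from urllib.parse import urlsplit

# Assumed policy for this sketch: only absolute web links are accepted.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_link(href: str) -> bool:
    """Return True only for absolute http/https URLs that name a host."""
    parts = urlsplit(href.strip())
    return parts.scheme.lower() in ALLOWED_SCHEMES and bool(parts.netloc)


if __name__ == "__main__":
    for candidate in ("https://example.com/page",
                      "javascript:alert(1)",
                      "/relative/path"):
        print(candidate, is_allowed_link(candidate))
```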
Defeating all the tricks a determined attacker can try is difficult. The best approach is to use an HTML sanitization library that has a good reputation. The Java-based OWASP HTML Sanitizer Project is an example which is available at no cost. Reputable CMS and blogging software includes HTML sanitization for user responses. Any page which accepts user input should undergo penetration testing by experts at injection techniques.
Website developers should avoid accepting raw input directly and writing their own filters. It's just too complicated, and the cost of a bug is high.