PHP Encoding for HTML
HTML can contain almost any character. There are a few characters which have special meanings in HTML and should be used with caution.
Reserved HTML characters
The reserved characters are <, >, &, and ". These characters do not have special meanings in all parts of HTML. For example, it is not a problem to use " inside a paragraph of text, but it is a problem to use it inside an HTML attribute value.
<?php $movie = '"Star Wars"'; ?>
<p title="Movie: <?php echo $movie; ?>">I love the movie <?php echo $movie; ?>.</p>
In the example above, the " is a problem for the title attribute but not for the content of the paragraph. Here is what the resulting HTML would look like. HTML would read the title value as "Movie: ".
<p title="Movie: "Star Wars"">I love the movie "Star Wars".</p>
The < and > are the most problematic characters because they indicate the start and end of tags. Using them accidentally could open a new tag or close an existing tag when it is not intended and break the entire page structure.
Allowing these characters in dynamic data also opens up the possibility that additional tags—including form elements and JavaScript—could be inserted in the HTML of the page. This is a common security vulnerability that hackers like to exploit.
HTML encoding
All dynamic values should be encoded (i.e. "transformed") before being used anywhere in HTML. This will ensure that the content does not interfere with the structure of the HTML. Web developers output a lot of dynamic data to HTML, so HTML encoding happens routinely.
Encoding matters for security because embedded JavaScript needs HTML tags to function. Encoding is a primary defense against Cross-Site Scripting (XSS) attacks.
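To make the risk concrete, here is a minimal sketch of what encoding does to an injected script tag. The $comment value is a made-up stand-in for user-submitted data, not something from a real application.

```php
<?php
// Hypothetical user input containing a script tag.
$comment = '<script>alert("xss")</script>';

// Unencoded, the browser would treat this as a real <script> element.
$unsafe = '<p>' . $comment . '</p>';

// Encoded, the reserved characters become entities and are displayed as text.
$safe = '<p>' . htmlspecialchars($comment) . '</p>';

echo $safe;
// <p>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</p>
```

The browser renders the encoded version as visible text instead of executing it.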
There are different types of encoding depending on the context. Encoding for HTML means converting reserved characters into HTML character entities.
HTML character entities are written as &code;, where "code" is an abbreviation or a number to represent each character. There are thousands of HTML character entities, but for encoding special characters, there are only four that matter: &lt; for <, &gt; for >, &amp; for &, and &quot; for ".
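PHP can list these pairs itself. The sketch below uses get_html_translation_table(), passing ENT_COMPAT explicitly so that the result does not depend on the default flags of the PHP version in use.

```php
<?php
// HTML_SPECIALCHARS selects the table used by htmlspecialchars().
// ENT_COMPAT limits it to the four characters discussed above
// (double quotes included, single quotes left alone).
$table = get_html_translation_table(HTML_SPECIALCHARS, ENT_COMPAT);

foreach ($table as $char => $entity) {
    // Prints one "character => entity" pair per line.
    echo $char . ' => ' . $entity . "\n";
}
```

Passing ENT_QUOTES instead of ENT_COMPAT would add the single quote to the table.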
PHP encoding functions
PHP has two built-in functions which can help with HTML encoding. The first encodes only the four reserved characters. The second encodes as much as it can.
htmlspecialchars()
- Encode reserved characters as HTML entities
- Ignores single quotes by default before PHP 8.1 (encodes them by default from PHP 8.1 onward); configurable with flags
- Use for all output inside HTML
htmlentities()
- Encode all possible characters as HTML entities
- Use for safe and pretty output in HTML
<?php $string = 'We have to watch out for < and & as well as " and >'; ?>
<?php echo htmlspecialchars($string); ?>
We have to watch out for &lt; and &amp; as well as &quot; and &gt;
<?php $symbols = "™ ® © • £ ¢ ¥"; ?>
<?php echo 'Symbols: ' . htmlentities($symbols); ?>
Symbols: &trade; &reg; &copy; &bull; &pound; &cent; &yen;
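The difference between the two functions is easiest to see on the same input. This is a small sketch, assuming the source file is saved as UTF-8; the sample text is arbitrary.

```php
<?php
// Arbitrary sample text with an accented letter, a symbol, and a tag.
$text = 'Café © <b>';

// Only the reserved characters are converted.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Every character with a named entity is converted.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special . "\n";   // Café © &lt;b&gt;
echo $entities . "\n";  // Caf&eacute; &copy; &lt;b&gt;
```

Both results display identically in a browser; htmlentities() simply produces output that survives in pages that are not served as UTF-8.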
There are PHP functions which can decode these encoded strings (htmlspecialchars_decode(), html_entity_decode()) but they are almost never needed because the browser does the decoding that matters when it processes the HTML page.
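For the rare cases where decoding is needed, a round trip looks roughly like this. The sample strings are illustrative only.

```php
<?php
// Encode, then decode with the matching function and flags.
$encoded = htmlspecialchars('5 < 6 & "quotes"', ENT_QUOTES, 'UTF-8');
// 5 &lt; 6 &amp; &quot;quotes&quot;

$decoded = htmlspecialchars_decode($encoded, ENT_QUOTES);
// 5 < 6 & "quotes"

// html_entity_decode() also handles named entities beyond the reserved four.
$symbols = html_entity_decode('&copy; &euro;', ENT_QUOTES, 'UTF-8');
// © €

echo $decoded . "\n" . $symbols . "\n";
```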
Pro Tip
Because encoding for HTML is done frequently and because the function name is very long, most PHP developers define a custom function as a shortcut.
<?php
function h($string="") {
  return htmlspecialchars($string);
}
?>
<?php echo h("This is safe for < and >."); ?>
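One possible variation of the shortcut pins the flags and the character set, so the helper behaves the same on every PHP version. The choice of ENT_QUOTES | ENT_SUBSTITUTE and UTF-8 here is an assumption about what a project would want, not part of the original tip.

```php
<?php
// Variant of the h() shortcut with explicit flags and encoding.
// ENT_QUOTES encodes both quote styles; ENT_SUBSTITUTE replaces invalid
// byte sequences instead of returning an empty string.
function h($string = "") {
    return htmlspecialchars((string) $string, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo h("It's safe for < and >.");
// It&#039;s safe for &lt; and &gt;.
```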
Encoding for URLs inside HTML
When outputting a dynamic link to an HTML page, it should be encoded for the URL and also encoded for HTML, because all output should be encoded for HTML.
<?php
$course = 'web security';
$query = 'URL encode & decode';
$label = 'Link label with < and >';
// Encode only the dynamic path segment so the slashes stay intact.
$url = '/courses/' . rawurlencode($course) . '/content';
$url .= '?search=' . urlencode($query);
?>
<a href="<?php echo htmlspecialchars($url); ?>"><?php echo htmlspecialchars($label); ?></a>
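The HTML step matters most when a URL has more than one query parameter, because the & that separates parameters is itself a reserved HTML character. The sketch below uses http_build_query() for the query string; the page parameter is an invented example.

```php
<?php
// Hypothetical query parameters; 'page' is added only to produce a separator.
$params = ['search' => 'URL encode & decode', 'page' => 2];

// Step 1: encode for the URL.
$url = '/courses/' . rawurlencode('web security') . '/content?'
     . http_build_query($params, '', '&');
// /courses/web%20security/content?search=URL+encode+%26+decode&page=2

// Step 2: encode the finished URL for HTML.
$href = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
// /courses/web%20security/content?search=URL+encode+%26+decode&amp;page=2

echo '<a href="' . $href . '">Results</a>';
```

The & inside the search text became %26 in step 1, while the separator & became &amp; in step 2.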
Other HTML sanitizing functions
PHP’s strip_tags() function will remove all HTML and PHP tags from a string. It is an exception to the "don’t remove content" rule because it is well-designed to remove all tags.
<?php $string = '<b>Text</b><br /><a href="#">Link</a>'; echo strip_tags($string); // TextLink ?>
It is possible to whitelist tags which should still be allowed but, as the PHP manual notes, this opens up the possibility for abuse of the tag attributes such as style and onmouseover.
When removing all tags, strip_tags() is as secure as htmlspecialchars(). When tags are allowed, strip_tags() is much less secure than htmlspecialchars() and should be used with caution.
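The attribute problem can be seen in a short sketch. The input string is contrived to carry an event-handler attribute on an allowed tag.

```php
<?php
// Contrived input: an allowed tag carrying a JavaScript event handler.
$input = '<b onmouseover="alert(1)">Hi</b><script>alert(2)</script>';

// No allowed tags: every tag is removed, the text between tags is kept.
$all = strip_tags($input);

// Allowing <b> keeps the whole tag, including its onmouseover attribute.
$some = strip_tags($input, '<b>');

echo $all . "\n";
echo $some . "\n";
```

In the second result the script element is gone, but the onmouseover handler on the b element survives untouched.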
PHP’s filter_var() function will apply a selected filter to a value. Filters are grouped into sanitizing and validating. At first, filters may seem harder to use than simple functions, but they are powerful.
The FILTER_SANITIZE_FULL_SPECIAL_CHARS filter has the same effect as htmlspecialchars() with the ENT_QUOTES flag set.
<?php $string = 'We have to watch out for < and & as well as " and >'; echo filter_var($string, FILTER_SANITIZE_FULL_SPECIAL_CHARS); ?>
Other sanitizing filters include:
- FILTER_SANITIZE_ENCODED: encodes for a URL, like rawurlencode()
- FILTER_SANITIZE_URL: removes all characters not allowed in a URL
- FILTER_SANITIZE_EMAIL: removes characters not allowed in an email address
- FILTER_SANITIZE_STRING: removes tags, like strip_tags() (deprecated as of PHP 8.1)
- FILTER_SANITIZE_NUMBER_INT: removes characters not allowed in integers
- FILTER_SANITIZE_NUMBER_FLOAT: removes characters not allowed in floats
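A few of these filters in action, as a rough sketch with made-up input values. Note that the float filter only keeps the decimal point when FILTER_FLAG_ALLOW_FRACTION is passed.

```php
<?php
// Parentheses are not allowed in the sanitized email, so they are dropped.
$email = filter_var('user(name)@example.com', FILTER_SANITIZE_EMAIL);
// username@example.com

// Only digits and the + and - signs survive.
$int = filter_var('abc-12.5xyz', FILTER_SANITIZE_NUMBER_INT);
// -125

// With FILTER_FLAG_ALLOW_FRACTION the decimal point is kept as well.
$float = filter_var('abc-12.5xyz', FILTER_SANITIZE_NUMBER_FLOAT, FILTER_FLAG_ALLOW_FRACTION);
// -12.5

// Spaces are not allowed in a URL and are removed.
$url = filter_var('http://exa mple.com/pa th', FILTER_SANITIZE_URL);
// http://example.com/path

echo implode("\n", [$email, $int, $float, $url]) . "\n";
```

Sanitizing only removes characters; it does not guarantee that the result is a valid email address, number, or URL.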
Pro Tip
If the filter_var() syntax seems cumbersome, it is possible to wrap the filters in custom functions with names which are easier to remember.
<?php
function sanitize_email($value="") {
  return filter_var($value, FILTER_SANITIZE_EMAIL);
}
?>
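Since filters come in sanitizing and validating groups, a wrapper can also combine the two. The function name clean_email() below is hypothetical, and returning null for invalid input is just one possible convention.

```php
<?php
// Hypothetical helper: sanitize first, then validate the result.
// Returns the cleaned address, or null when it is still not a valid email.
function clean_email($value = "") {
    $sanitized = filter_var($value, FILTER_SANITIZE_EMAIL);
    if (filter_var($sanitized, FILTER_VALIDATE_EMAIL) === false) {
        return null;
    }
    return $sanitized;
}

var_dump(clean_email(' user@example.com '));  // string "user@example.com"
var_dump(clean_email('not an email'));        // NULL
```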
PHP HTML Encode
HTML relies heavily on the use of tags and special characters. Some characters carry a special meaning in HTML, which requires them to be used with caution.
For example, < and >, which mark the start and end of tags, are among the most widely used characters in HTML. Although they do not pose any threat on their own, when misused, they can break the entire web page.
Such HTML characters also pose a significant security risk, especially in dynamic web applications. They can lead to the injection of malicious code such as JavaScript and form elements.
The essence of this guide is to show you how you can use PHP to encode or “sanitize” HTML characters. Encoding such characters in dynamic websites will prevent Cross-Site Scripting and protect the web page from breaking.
What is Encoding?
Encoding refers to the process of converting reserved characters into HTML character entities. HTML character entities are expressed as &value; where the “value” represents an abbreviation or number for each character.
HTML offers a comprehensive collection of entities. However, we need only concern ourselves with four of them for encoding purposes: &lt; for <, &gt; for >, &amp; for &, and &quot; for ".
Let us learn how we can use PHP to encode such characters.
PHP Encoding Functions
PHP has two main functions that you can use to encode HTML characters.
The htmlspecialchars() function encodes the four main characters (above), while the htmlentities() function encodes every character that has an entity equivalent.
Let us learn how to use the two functions.
PHP htmlspecialchars()
This function converts all special or reserved HTML characters to HTML entities. Before PHP 8.1, the function ignores single quotes by default, although you can change this with a flag; from PHP 8.1 onward, single quotes are encoded by default.
The general syntax of the function is as shown:
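For reference, this is the signature as documented for PHP 8.1 and later, followed by a call that spells out every argument. Earlier PHP versions use different default flags, so treat the defaults shown in the comment as version-specific.

```php
<?php
// Documented signature (PHP 8.1+):
// htmlspecialchars(
//     string $string,
//     int $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401,
//     ?string $encoding = null,
//     bool $double_encode = true
// ): string

// The same call with every argument written out explicitly.
$result = htmlspecialchars(
    '<a href="x">T&C</a>',
    ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401,
    'UTF-8',
    true
);

echo $result;
// &lt;a href=&quot;x&quot;&gt;T&amp;C&lt;/a&gt;
```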
The function accepts the string containing the HTML to be encoded. You can also specify flag values that allow you to tweak how the method operates.
PHP also allows you to specify the character set used when converting characters. The PHP manual lists the supported charsets; when the argument is omitted, the default_charset configuration value is used, which is UTF-8 by default.
The following example shows how to use the htmlspecialchars() method.
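A minimal sketch of such a call, assuming an illustrative link stored in $str:

```php
<?php
// Illustrative string containing tags, double quotes, and an ampersand.
$str = '<a href="https://example.com">Terms & Conditions</a>';

$encoded = htmlspecialchars($str);

echo $encoded;
// &lt;a href=&quot;https://example.com&quot;&gt;Terms &amp; Conditions&lt;/a&gt;
```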
The above example will encode the HTML characters specified in the variable $str.
If you want the function to process single and double quotes, you can use a flag as shown in the example below:
<?php
$str = "A single quote as 'and' will be ignored by default";
echo htmlspecialchars($str, ENT_QUOTES);
?>
Once you run the above code, the function will process the single quotes and give an output as shown:
A single quote as &#039;and&#039; will be ignored by default
PHP htmlentities()
Next, we will look at the PHP htmlentities() function. This function converts all applicable characters to HTML entities. It is a good choice when you need to process your HTML safely.
The general syntax of the function is as shown:
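For reference, this is the signature as documented for PHP 8.1 and later, with a call that passes each argument explicitly. The sample string is arbitrary and assumes a UTF-8 source file.

```php
<?php
// Documented signature (PHP 8.1+):
// htmlentities(
//     string $string,
//     int $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401,
//     ?string $encoding = null,
//     bool $double_encode = true
// ): string

$result = htmlentities(
    '© 2024 <Café>',
    ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401,
    'UTF-8',
    true
);

echo $result;
// &copy; 2024 &lt;Caf&eacute;&gt;
```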
The function is very similar to htmlspecialchars() except it processes all characters it can by default.
The following example shows you how to use the htmlentities() function.
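A minimal sketch, assuming an illustrative string in $str and a UTF-8 source file, with the result shown in the trailing comment:

```php
<?php
// Illustrative string mixing tags, symbols, an ampersand, and quotes.
$str = '<p>Price: £5 & up © "Shop"</p>';

$encoded = htmlentities($str, ENT_QUOTES, 'UTF-8');

echo $encoded;
// &lt;p&gt;Price: &pound;5 &amp; up &copy; &quot;Shop&quot;&lt;/p&gt;
```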
The above code should return all the tags and symbols converted to entities.
Similar to the htmlspecialchars() function, it supports flags and encoding charset. Check the documentation to discover more.
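One argument worth knowing is the fourth one, double_encode. The sketch below shows its effect on a string that already contains an entity; the input is made up for illustration.

```php
<?php
// Input that already contains an encoded ampersand.
$already = 'Fish &amp; chips <b>';

// Default behaviour: the existing entity is encoded again.
$twice = htmlspecialchars($already, ENT_QUOTES, 'UTF-8');
// Fish &amp;amp; chips &lt;b&gt;

// With double_encode set to false, existing entities are left alone.
$once = htmlspecialchars($already, ENT_QUOTES, 'UTF-8', false);
// Fish &amp; chips &lt;b&gt;

echo $twice . "\n" . $once . "\n";
```

htmlentities() accepts the same argument in the same position.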
Conclusion
In this guide, you learned the basics of HTML character encoding. You also learned how to use PHP to convert HTML characters into HTML entities.
Thank you for reading and stay tuned for more.
About the author
John Otieno
My name is John and I am a fellow geek like you. I am passionate about all things computers, from hardware and operating systems to programming. My dream is to share my knowledge with the world and help out fellow geeks. Follow my content by subscribing to the LinuxHint mailing list.