What is character encoding html

Declaring character encodings in HTML

Intended audience: HTML authors (using editors or scripting), script developers (PHP, JSP, etc.), Web project managers, and anyone who needs an introduction to how to declare the character encoding of their HTML file.

Question

How should I declare the encoding of my HTML file?

You should always specify the encoding used for an HTML or XML page. If you don’t, you risk that characters in your content are incorrectly interpreted. This is not just an issue of human readability, increasingly machines need to understand your data too. A character encoding declaration is also needed to process non-ASCII characters entered by the user in forms, in URLs generated by scripts, and so forth. This article describes how to do this for an HTML file.

If you need to better understand what characters and character encodings are, see the article Character encodings for beginners . For information about declaring encodings for CSS style sheets, see CSS character encoding declarations .

Quick answer

Always declare the encoding of your document using a meta element with a charset attribute, or using the http-equiv and content attributes (called a pragma directive). The declaration should fit completely within the first 1024 bytes at the start of the file, so it’s best to put it immediately after the opening head tag.

Читайте также:  Java for web scraping

It doesn’t matter which you use, but it’s easier to type the first one. It also doesn’t matter whether you type UTF-8 or utf-8 .

You should always use the UTF-8 character encoding. (Remember that this means you also need to save your content as UTF-8.) See what you should consider if you really cannot use UTF-8.

If you have access to the server settings, you should also consider whether it makes sense to use the HTTP header. Note however that, since the HTTP header has a higher precedence than the in-document meta declarations, content authors should always take into account whether the character encoding is already declared in the HTTP header. If it is, the meta element must be set to declare the same encoding.

You can detect any encodings sent by the HTTP header using the Internationalization Checker.

Details

What about the byte-order mark?

If you have a UTF-8 byte-order mark (BOM) at the start of your file then modern browsers will use that to determine that the encoding of your page is UTF-8. It has a higher precedence than any other declaration, including the HTTP header.

You could skip the meta encoding declaration if you have a BOM, but we recommend that you keep it, since it helps people looking at the source code to ascertain what the encoding of the page is.

Should I declare the encoding in the HTTP header?

Use character encoding declarations in HTTP headers if it makes sense, and if you are able, for any type of content, but in conjunction with an in-document declaration.

Content authors should always ensure that HTTP declarations are consistent with the in-document declarations.

Pros and cons of using the HTTP header

One advantage of using the HTTP header is that user agents can find the character encoding information sooner when it is sent in the HTTP header.

The HTTP header information has the highest priority when it conflicts with in-document declarations other than the byte-order mark. Intermediate servers that transcode the data (ie. convert to a different encoding) could take advantage of this to change the encoding of a document before sending it on to small devices that only recognize a few encodings. It is not clear that this transcoding is much used nowadays. If it is, and it is converting content to non-UTF-8 encodings, it runs a high risk of loss of data, and so is not good practice.

On the other hand, there are a number of potential disadvantages:

  • It may be difficult for content authors to change the encoding information for static files on the server – especially when dealing with an ISP. Authors will need knowledge of and access to the server settings.
  • Server settings may get out of synchronization with the document for one reason or another. This may happen, for example, if you rely on the server default, and that default is changed. This is a very bad situation, since the higher precedence of the HTTP information versus the in-document declaration may cause the document to become unreadable.
  • There are potential problems for both static and dynamic documents if they are not read from a server; for example, if they are saved to a location such as a CD or hard disk. In these cases any encoding information from an HTTP header is not available. Similarly, if the character encoding is only declared in the HTTP header, this information is no longer available for files during editing, or when they are processed by such things as XSLT or scripts, or when they are sent for translation, etc.

So should I use this method?

If serving files via HTTP from a server, it is never a problem to send information about the character encoding of the document in the HTTP header, as long as that information is correct.

On the other hand, because of the disadvantages listed above we recommend that you should always declare the encoding information inside the document as well. An in-document declaration also helps developers, testers, or translation production managers who want to visually check the encoding of a document.

(Some people would argue that it is rarely appropriate to declare the encoding in the HTTP header if you are going to repeat it in the content of the document. In this case, they are proposing that the HTTP header say nothing about the document encoding. Note that this would usually mean taking action to disable any server defaults.)

Working with polyglot and XML formats

XHTML5: An XHTML5 document is served as XML and has XML syntax. XML parsers do not recognise the encoding declarations in meta elements. They only recognise the XML declaration. Here is an example:

X

The idea was that the browser would be able to apply the right encoding to the document it retrieves if no encoding is specified for the document in any other way.

There were always issues with the use of this attribute. Firstly, it is not well supported by major browsers. One reason not to support this attribute is that if browsers do so without special additional rules it would be an XSS attack vector. Secondly, it is hard to ensure that the information is correct at any given time. The author of the document pointed to may well change the encoding of the document without you knowing. If the author still hasn’t specified the encoding of their document, you will now be asking the browser to apply an incorrect encoding. And thirdly, it shouldn’t be necessary anyway if people follow the guidelines in this article and mark up their documents properly. That is a much better approach.

This way of indicating the encoding of a document has the lowest precedence (ie. if the encoding is declared in any other way, this will be ignored). This means that you couldn’t use this to correct incorrect declarations either.

Working with UTF-16

According to the results of a Google sample of several billion pages, less than 0.01% of pages on the Web are encoded in UTF-16. UTF-8 accounted for over 80% of all Web pages, if you include its subset, ASCII, and over 60% if you don’t. You are strongly discouraged from using UTF-16 as your page encoding.

If, for some reason, you have no choice, here are some rules for declaring the encoding. They are different from those for other encodings.

The HTML5 specification forbids the use of the meta element to declare UTF-16, because the values must be ASCII-compatible. Instead you should ensure that you always have a byte-order mark at the very start of a UTF-16 encoded file. In effect, this is the in-document declaration.

Furthermore, if your page is encoded as UTF-16, do not declare your file to be «UTF-16BE» or «UTF-16LE», use «UTF-16» only. The byte-order mark at the beginning of your file will indicate whether the encoding scheme is little-endian or big-endian . (This is because content explicitly encoded as, say, UTF-16BE should not use a byte-order mark; but HTML5 requires a byte-order mark for UTF-16 encoded pages.)

The XML declaration

The XML declaration is defined by the XML standard. It appears at the top of an XML file and supports an encoding declaration . For example:

An XML declaration is required for a document parsed as XML if the encoding of the document is other than UTF-8 or UTF-16 and if the encoding is not provided by a higher level protocol , ie. the HTTP header.

This is significant, because if you decide to omit the XML declaration you must choose either UTF-8 or UTF-16 as the encoding for the page if it is to be used without HTTP!

You should use an XML declaration to specify the encoding of any XHTML 1.x document served as XML.

It can be useful to use an XML declaration for web pages served as XML, even if the encoding is UTF-8 or UTF-16, because an in-document declaration of this kind also helps developers, testers, or translation production managers ascertain the encoding of the file with a visual check of the source code.

Using the XML declaration for XHTML served as HTML. XHTML served as HTML is parsed as HTML, even though it is based on XML syntax, and therefore an XML declaration should not be recognized by the browser. It is for this reason that you should use a pragma directive to specify the encoding when serving XHTML in this way*.

* Conversely, the pragma directive, though valid, is not recognized as an encoding declaration by XML parsers.

On the other hand, the file may also be used at some point as input to other processes that do use XML parsers. This includes such things as XML editors, XSLT transformations, AJAX, etc. In addition, sometimes people use server-side logic to determine whether to serve the file as HTML or XML. For these reasons, if you aren’t using UTF-8 or UTF-16 you should add an XML declaration at the beginning of the markup, even if it is served to the browser as HTML. This would make the top of a file look like this:

If you are using UTF-8 or UTF-16, however, there is no need for the XML declaration, especially as the meta element provides for visual inspection of the encoding by people processing the file.

Catering for older browsers. If anything appears before the DOCTYPE declaration in Internet Explorer 6, the page is rendered in quirks mode. If you are using UTF-8 or UTF-16 you can omit the XML declaration, and you will have no problem.

If, however, you are not using these encodings and Internet Explorer 6 users still count for a significant proportion of your readers, and if your document contains constructs that are affected by the difference between standards mode vs. quirks mode, then this may be an issue. If you want to ensure that your pages are rendered in the same way on all standards-compliant browsers, you will have to add workarounds to your CSS to overcome the differences.

There may also be some other rendering issues associated with an XML declaration, though these are probably only an issue for much older browsers. The XHTML specification warns that processing instructions are rendered on some user agents. Also, some user agents interpret the XML declaration to mean that the document is unrecognized XML rather than HTML, and therefore may not render the document as expected. You should do testing on appropriate user agents to decide whether this will be an issue for you.

Of course, as mentioned above, if you use UTF-8 or UTF-16 you can omit the XML declaration and the file will still work as XML or HTML. This is probably the simplest solution.

Further reading

  • Getting started? Introducing Character Sets and Encodings
  • Tutorial, Handling character encodings in HTML and CSS
  • Related links, Authoring HTML & CSS
    • Characters
    • Declaring the character encoding for HTML
    • Choosing and applying a character encoding
    • Characters

    Источник

Оцените статью