HTML and character encoding

Contents

Handling character encodings in HTML and CSS (tutorial)
Objectives
In a nutshell
Essential background information
Choosing and applying a character encoding
How to declare a character encoding
The byte-order mark (BOM)
Unicode normalization forms
Using character escapes
Characters or markup?
HTML Character Sets
In the Beginning: ASCII
In Windows: Windows-1252
In HTML 4: ISO-8859-1
In HTML5: Unicode UTF-8

Handling character encodings in HTML and CSS (tutorial)

If a browser cannot detect the character encoding used in a page, the content may be unreadable. The information in this tutorial is particularly important for those maintaining and extending a multilingual site, but declaring the character encoding of the document matters for anyone producing HTML or CSS that uses non-ASCII characters: although the page may look fine to you, other people's browser settings can affect readability. This tutorial will give you an understanding of the topic that will help you make the right choices.

Objectives

When you have finished this tutorial you should:

have a clear idea about factors relating to the choice of encoding for HTML documents, and appreciate the value of using Unicode
know when and how to declare the character encoding (charset) for documents using HTML and CSS
understand what the terms byte-order mark and normalization mean, how they can affect you, and how to deal with them
understand when and how to use escapes to represent characters

Intended audience: HTML and CSS content authors. This material is applicable whether you create documents in an editor, or via scripting.

This tutorial gathers together and organizes pointers to articles that, taken together, help you understand how to handle the essential aspects of authoring HTML and CSS related to characters and character encodings.

In a nutshell

Always declare the encoding of your document. Use the HTTP header if you can. Always use an in-document declaration too.

You can use @charset or HTTP headers to declare the encoding of your style sheet, but you only need to do so if your style sheet contains non-ASCII characters and, for some reason, you can’t rely on the encoding of the HTML and the associated style sheet to be the same.
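As a minimal sketch of the style-sheet case, the following Python snippet builds a style sheet that contains a non-ASCII character and checks that the @charset rule is the very first thing in the encoded file, which is where a CSS parser looks for it. The helper name build_stylesheet and the sample rule are illustrative assumptions, not part of any library.

```python
def build_stylesheet(body: str, encoding: str = "utf-8") -> bytes:
    """Return the style sheet as bytes, with @charset as the first rule.

    Hypothetical helper: the @charset rule only counts when it sits at the
    very start of the file, so it is prepended before encoding.
    """
    declaration = f'@charset "{encoding.upper()}";\n'
    return (declaration + body).encode(encoding)


# Sample rule containing a non-ASCII character (a right-pointing arrow).
css_bytes = build_stylesheet('a.more::after { content: " \u2192"; }')

starts_with_declaration = css_bytes.startswith(b'@charset "UTF-8";')
print(starts_with_declaration)  # True
```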

Try to avoid using the byte-order mark in UTF-8, and ensure that your HTML code is saved in Unicode normalization form C (NFC).

Avoid using character escapes, except for invisible or ambiguous characters. And don’t use Unicode control characters when you can use markup instead.

Essential background information

If you are a newcomer to this topic, there are certain foundational concepts you need to understand if you are to follow various parts of the tutorial. If you are familiar with these concepts, you can skip to the next section.

Choosing and applying a character encoding

Content is composed of a sequence of characters. Characters represent letters of the alphabet, punctuation, etc. But content is stored in a computer as a sequence of bytes, which are numeric values. Sometimes more than one byte is used to represent a single character. Like codes used in espionage, the way that the sequence of bytes is converted to characters depends on what key was used to encode the text. In this context, that key is called a character encoding. There are many character encodings to choose from.
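The idea of an encoding as a key can be sketched in a few lines of Python. The sample word is an arbitrary choice; the point is that the same characters become different bytes under different encodings, and that reading bytes with the wrong key produces garbage.

```python
text = "café"

utf8_bytes = text.encode("utf-8")      # 'é' becomes two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # 'é' becomes one byte: 0xE9

# Reading the UTF-8 bytes with the wrong key produces mojibake.
misread = utf8_bytes.decode("latin-1")

print(list(utf8_bytes))    # [99, 97, 102, 195, 169]
print(list(latin1_bytes))  # [99, 97, 102, 233]
print(misread)             # cafÃ©
```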

The article "Choosing & applying a character encoding" offers simple advice on which character encoding to use for your content, and how to apply it.

How to declare a character encoding

You should always specify the encoding used for an HTML or XML page. If you don’t, you risk that characters in your content are incorrectly interpreted. This is not just an issue of human readability: increasingly, machines need to understand your data too. You should also check that you are not specifying different encodings in different places.
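One way to check for conflicting declarations is to compare the charset in the HTTP Content-Type header with the one in the document itself. The sketch below is a simplified illustration: the function names are hypothetical, the regular expressions only handle the common forms, and encoding labels are compared literally, so aliases such as "utf8" are not resolved.

```python
import re
from typing import Optional


def charset_from_header(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)"?', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html: str) -> Optional[str]:
    """Extract the charset from a <meta charset="..."> declaration."""
    match = re.search(r'<meta\s+charset\s*=\s*["\']?([\w.:-]+)', html, re.IGNORECASE)
    return match.group(1).lower() if match else None


def declarations_agree(content_type: str, html: str) -> bool:
    """True when both declarations exist and name the same encoding."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(html)
    return header is not None and header == meta


page = '<!DOCTYPE html><html><head><meta charset="utf-8"><title>t</title></head></html>'
print(declarations_agree("text/html; charset=UTF-8", page))       # True
print(declarations_agree("text/html; charset=ISO-8859-1", page))  # False
```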

The article "Declaring character encodings in HTML" provides quick recommendations for those who just want to be told what to do, and more detailed information for those who need it.

The byte-order mark (BOM)

The byte-order mark, or BOM, is something you will come across when using a Unicode-based character encoding, such as UTF-8 and UTF-16. In some cases you will need to remove the BOM, in others you need to ensure that it is there.
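In UTF-8 the BOM is the three-byte sequence EF BB BF at the very start of a file. The following Python sketch shows how it can be detected and removed; strip_utf8_bom is a hypothetical helper name, while codecs.BOM_UTF8 and the "utf-8-sig" codec come from the standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte-order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "<!DOCTYPE html>".encode("utf-8")

print(with_bom[:3].hex())            # efbbbf
print(strip_utf8_bom(with_bom))      # b'<!DOCTYPE html>'
# The "utf-8-sig" codec drops the BOM while decoding.
print(with_bom.decode("utf-8-sig"))  # <!DOCTYPE html>
```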

Unicode normalization forms

Normalization is something you need to be aware of if you are authoring in UTF-8, be it HTML pages or CSS style sheets, particularly if you are dealing with text in a script that uses accents or other diacritics.
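To see why normalization matters, consider an accented letter that can be stored in two different ways. The Python sketch below uses the standard unicodedata module; the letter chosen is only an example.

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Both render as the same letter, but they are different code point sequences.
print(precomposed == decomposed)  # False

# Converting to normalization form C (NFC) gives the precomposed form.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)         # True
print(len(decomposed), len(nfc))  # 2 1
```

Text that looks identical on screen can therefore fail a string comparison, a search, or a CSS selector match unless both sides use the same normalization form.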

Using character escapes

You can use a character escape to represent any character from the Unicode character set in HTML, XML or CSS using only ASCII characters.
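As a rough illustration, the following Python sketch produces the hexadecimal escape for a character in HTML or XML syntax and in CSS syntax. The two helper functions are hypothetical names introduced here; html.unescape is part of the Python standard library.

```python
import html


def html_escape_char(ch: str) -> str:
    """Hexadecimal numeric character reference for use in HTML or XML."""
    return f"&#x{ord(ch):X};"


def css_escape_char(ch: str) -> str:
    """CSS escape: backslash, hex code point, then a terminating space."""
    return f"\\{ord(ch):X} "


nbsp = "\u00a0"  # NO-BREAK SPACE, invisible in most editors
print(html_escape_char(nbsp))          # &#xA0;
print(css_escape_char(nbsp))           # \A0 (followed by a space)
print(html.unescape("&#xE9;") == "é")  # True
```

An invisible character such as the no-break space is a typical case where an escape makes the source easier to read.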

The article "Using character escapes in markup and CSS" provides best practices for use of escapes, and tells you how to use them when they are needed.

Characters or markup?

Finally, there is a range of control-like Unicode characters, some of which fulfill the same role as markup. The question is, which should you use, and which should you avoid?

HTML Character Sets

To display an HTML page correctly, the browser must know what character set (encoding) to use:

Example
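A minimal sketch of such a declaration, assuming a UTF-8 page: the Python snippet below holds a small HTML document with a meta charset element and checks that the declaration falls within the first 1024 bytes of the file, which is where the HTML specification requires it to be. The page content is invented for illustration.

```python
PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Encoding example</title>
</head>
<body><p>Grüße, 世界</p></body>
</html>
"""

DECLARATION = b'<meta charset="UTF-8">'
page_bytes = PAGE.encode("utf-8")

# The declaration itself is pure ASCII, so it can be found in the raw bytes
# before the encoding of the rest of the page is known.
offset = page_bytes.find(DECLARATION)
declared_early = offset >= 0 and offset + len(DECLARATION) <= 1024
print(declared_early)  # True
```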

The HTML5 specification encourages web developers to use the UTF-8 character set.

This has not always been the case. The character encoding for the early web was ASCII.

Later, from HTML 2.0 to HTML 4.01, ISO-8859-1 was considered the standard character set.

With XML and HTML5, UTF-8 finally arrived and solved a lot of character encoding problems.

In the Beginning: ASCII

Computers store data as binary codes (for example, 01000101).

To standardize the storage of text, the American Standard Code for Information Interchange (ASCII) was created. It assigns a unique binary number to each character it supports: the digits 0-9, the upper and lower case letters (a-z, A-Z), and special characters such as ! $ + - ( ) @ < > , .

Since ASCII uses 7 bits per character, it can only represent 128 different characters.

The biggest weakness of ASCII is that it excludes non-English letters.
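These limits are easy to verify. The short Python sketch below shows the 7-bit arithmetic, the binary code of one letter (the same pattern used as the example above), and what happens when a non-English letter is encoded as ASCII.

```python
print(2 ** 7)                   # 128
print(format(ord("E"), "08b"))  # 01000101
print("E".encode("ascii"))      # b'E'

try:
    "é".encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    # 'é' has no ASCII code, so the encoder refuses it.
    ascii_can_encode = False

print(ascii_can_encode)         # False
```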

ASCII is still in use today, and UTF-8 is backward compatible with it: every ASCII text is also valid UTF-8.

For a closer look, please study our Complete ASCII Reference.

In Windows: Windows-1252

Windows-1252 was the default character set in Windows, up to Windows 95.

It is an extension to ASCII, with added international characters.

It uses a full byte (8 bits) to represent up to 256 different characters.
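A brief Python sketch of what the extra bit buys, using Python's name for this encoding, "cp1252". The euro sign is just one example of a character that Windows-1252 adds to ASCII.

```python
# One byte can take 256 different values.
print(len(bytes(range(256))))  # 256

# The euro sign fits in a single byte in Windows-1252.
euro_byte = "€".encode("cp1252")
print(euro_byte, len(euro_byte))  # b'\x80' 1

# The first 128 values decode exactly as they do in ASCII.
lower_half = bytes(range(128))
ascii_compatible = lower_half.decode("cp1252") == lower_half.decode("ascii")
print(ascii_compatible)  # True
```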

Since Windows-1252 has been the default in Windows, it is supported by all browsers.

In HTML 4: ISO-8859-1

The character set most often used in HTML 4 was ISO-8859-1.

ISO-8859-1 is an extension to ASCII, with added international characters.

Example
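As a small illustration in Python, every character that ISO-8859-1 supports takes exactly one byte, and characters outside its repertoire cannot be encoded at all. The sample word is an arbitrary choice.

```python
word = "déjà"
encoded = word.encode("iso-8859-1")

print(len(word), len(encoded))  # 4 4
print(encoded)                  # b'd\xe9j\xe0'

try:
    "€".encode("iso-8859-1")
    euro_supported = True
except UnicodeEncodeError:
    # The euro sign is not part of ISO-8859-1.
    euro_supported = False

print(euro_supported)           # False
```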

In HTML 4, a character set different from ISO-8859-1 can be specified in the <meta> tag:

Example
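A sketch of what such a declaration looks like and how a program might read it, using the HTMLParser class from the Python standard library. The class name MetaCharsetParser is hypothetical, and ISO-8859-8 is an arbitrary illustrative choice of encoding.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetParser(HTMLParser):
    """Collect the charset named in an HTML 4 style <meta http-equiv> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # The content value looks like "text/html;charset=ISO-8859-8".
        for part in (attributes.get("content") or "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.strip().lower() == "charset" and value:
                self.charset = value.strip().lower()


document = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-8">'
    "</head><body></body></html>"
)
parser = MetaCharsetParser()
parser.feed(document)
print(parser.charset)  # iso-8859-8
```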

All HTML 4 processors also support UTF-8:

Example

When a browser detects ISO-8859-1 it normally defaults to Windows-1252, because Windows-1252 assigns printable characters to most of the 32 positions (0x80 to 0x9F) that ISO-8859-1 reserves for control codes.
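The difference between the two encodings can be checked directly. This Python sketch decodes every possible byte value both ways and lists the positions where the results differ; it relies on Python's own "cp1252" codec, which leaves a few positions undefined.

```python
byte = b"\x80"
print(byte.decode("cp1252"))            # €
print(repr(byte.decode("iso-8859-1")))  # '\x80' (an invisible control code)

differing = []
for value in range(256):
    raw = bytes([value])
    latin1 = raw.decode("iso-8859-1")
    try:
        cp1252 = raw.decode("cp1252")
    except UnicodeDecodeError:
        cp1252 = None  # position left undefined by Python's cp1252 codec
    if cp1252 != latin1:
        differing.append(value)

print(len(differing))                         # 32
print(hex(differing[0]), hex(differing[-1]))  # 0x80 0x9f
```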

In HTML5: Unicode UTF-8

The HTML5 specification encourages web developers to use the UTF-8 character set.

Example

A character set different from UTF-8 can be specified in the <meta> tag:

Example

The Unicode Consortium develops the Unicode Standard, which defines UTF-8 and UTF-16, because the ISO-8859 character sets are limited and not compatible with a multilingual environment.

The Unicode Standard covers (almost) all the characters, punctuation marks, and symbols in the world.

All XML processors support UTF-8 and UTF-16. HTML5 browsers also support legacy encodings such as Windows-1252 and the ISO-8859 family.
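To give a feel for how UTF-8 and UTF-16 cover this much larger repertoire, the Python sketch below counts the bytes each encoding needs for a few sample characters. The characters are arbitrary examples, and "utf-16-le" is used so that no byte-order mark is added to the count.

```python
samples = ["A", "é", "€", "😀"]

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
utf16_lengths = {ch: len(ch.encode("utf-16-le")) for ch in samples}

for ch in samples:
    print(ch, utf8_lengths[ch], utf16_lengths[ch])
# A 1 2
# é 2 2
# € 3 2
# 😀 4 4
```

UTF-8 keeps ASCII characters at one byte each, which is one reason it works well for HTML, where the markup itself is ASCII.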
