HTML page header: UTF-8

Setting the HTTP charset parameter

When a server sends a document to a user agent (e.g., a browser), it also sends information about the document's data format in the Content-Type field of the accompanying HTTP header. This information is expressed using a MIME type label. This article provides a starting point for those needing to set the encoding information in the HTTP header.

The charset parameter

Documents transmitted with HTTP that are of type text, such as text/html, text/plain, etc., can send a charset parameter in the HTTP header to specify the character encoding of the document.

It is very important to always label Web documents explicitly. HTTP 1.1 says that the default charset is ISO-8859-1. But there are too many unlabeled documents in other encodings, so browsers use the reader’s preferred encoding when there is no explicit charset parameter.

The line in the HTTP header typically looks like this:

Content-Type: text/html; charset=utf-8

In theory, any character encoding that has been registered with IANA can be used, but there is no browser that understands all of them. The more widely a character encoding is used, the better the chance that a browser will understand it. A Unicode encoding such as UTF-8 is a good choice for a number of reasons.
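
As a rough illustration of how the charset parameter travels next to the MIME type, the following Python sketch splits a Content-Type value into its two parts using the standard library's email.message.Message. The helper name parse_content_type is made up for this example and is not part of any specification.

```python
from email.message import Message


def parse_content_type(value):
    """Split a Content-Type header value into (MIME type, charset).

    parse_content_type is a hypothetical helper name used only here.
    The charset is None when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = value
    # Both accessors normalize their result to lowercase.
    return msg.get_content_type(), msg.get_content_charset()


if __name__ == "__main__":
    print(parse_content_type("text/html; charset=utf-8"))
    print(parse_content_type("text/plain"))
```

A value without a charset parameter yields None for the charset, which is the situation in which a browser has to fall back on other hints.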

Server setup

How to make the server send out appropriate charset information depends on the server. You will need the appropriate administrative rights to be able to change server settings.

Apache. This can be done via the AddCharset (Apache 1.3.10 and later) or AddType directives, for directories or individual resources (files). With AddDefaultCharset (Apache 1.3.12 and later), it is possible to set the default charset for a whole server. For more information, see the article on Setting ‘charset’ information in .htaccess.

Jigsaw. Use an indexer in JigAdmin to associate extensions with charsets, or set the charset directly on a resource.

IIS 5 and 6. In Internet Services Manager, right-click "Default Web Site" (or the site you want to configure) and go to "Properties" => "HTTP Headers" => "File Types..." => "New Type...". Put in the extension you want to map, separately for each extension; IIS users will probably want to map .htm and .html. Then, for Content type, add "text/html;charset=utf-8" (without the quotes; substitute your desired charset for utf-8; do not leave any spaces anywhere because IIS ignores all text after spaces). For IIS 4, you may have to use "HTTP Headers" => "Creating a Custom HTTP Header" if the above does not work.

Scripting the header

The appropriate header can also be set in server side scripting languages. For example:

Perl. Output the correct header before any part of the actual page. After the last header, use a double linebreak, e.g.:
print "Content-Type: text/html; charset=utf-8\n\n";

Python. Use the same approach as for Perl. In Python 3, print is a function, so the header string goes in parentheses, and no semicolon is needed at the end.
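
A minimal sketch of the Python variant, assuming a CGI-style script whose standard output is sent to the client. The function name build_cgi_response and the sample page are invented for illustration. The body is encoded explicitly so that the bytes sent match the charset announced in the header.

```python
import sys


def build_cgi_response(html):
    """Return the raw bytes of a CGI response: header, blank line, body."""
    # The header announces the same encoding that is used for the body.
    header = "Content-Type: text/html; charset=utf-8\n\n"
    return header.encode("ascii") + html.encode("utf-8")


if __name__ == "__main__":
    page = "<!DOCTYPE html><title>Test</title><p>café</p>"
    # Write bytes directly so the output does not depend on the locale.
    sys.stdout.buffer.write(build_cgi_response(page))
```

Writing bytes through sys.stdout.buffer avoids relying on whatever default encoding the server environment gives to standard output.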

PHP. Use the header() function before generating any content, e.g.:
header('Content-type: text/html; charset=utf-8');

Java Servlets. Use the setContentType method on the ServletResponse before obtaining any object (Stream or Writer) used for output, e.g.:
resource.setContentType("text/html;charset=utf-8");
If you use a Writer, the Servlet automatically takes care of the conversion from Java Strings to the encoding selected.

JSP. Use the page directive, e.g.:
<%@ page contentType="text/html; charset=UTF-8" %>

Output from out.println() or the expression elements (<%= ... %>) is automatically converted to the encoding selected. Also, the page itself is interpreted as being in this encoding.

ASP and ASP.Net. ContentType and charset are set independently, and are properties of the Response object. To set the charset, use e.g.:
<% Response.Charset = "utf-8" %>

In ASP.Net, setting Response.ContentEncoding takes care of both the charset parameter in the HTTP Content-Type header and the actual encoding of the document sent out (which of course have to be the same). The default can be set in the globalization element in Web.config (or Machine.config, which is originally set to UTF-8).

Further reading

  • Setting charset information in .htaccess
  • Checking HTTP Headers
  • Tutorial, Handling character encodings in HTML and CSS
  • Related links, Setting up a server

    Content Encoding: why and how to use the meta charset tag and the Content-Type header

    Improving the speed at which a web page is displayed often means making the browser’s life as easy as possible. When the browser receives an HTTP response, it actually receives bytes, where each byte or sequence of bytes represents a given character. If the browser does not have clear information about the encoding used, it will waste time trying to guess and may fail in some cases.

    Although the Web is intended to be universal, the various human groups that use it have their own specificities. One of these is language, especially written language. All textual content is composed of characters drawn from a repertoire intended for a particular use. Hiragana, for example, is a phonetic writing system intended for the unambiguous transcription of the Japanese language.

    To be able to designate each character unambiguously, we must assign a unique identifier to each of them. The whole set of characters and their identifiers is called a character set. Once this correspondence table has been defined, each character needs to be converted into a sequence of bytes so that we can store it or share it between computers. This is called character encoding.

    Imagine that I use a character set to write text and a corresponding encoding to convert it to bytes, which I later send to you. How would you decode it and read the content without knowing which encoding, or set, I used? You would have to try some of the most common character sets and encodings you know, hoping the result makes sense… What could go wrong?

    Replace a semicolon (;) with a greek question mark (;) in your friend’s JavaScript and watch them pull their hair out over the syntax error.
    Ben Johnson (@benbjohnson), November 16, 2014

    For example, the bit sequence 1100 0011 1010 1001 represents the character “é” in the UTF-8 encoding. If you decode this sequence assuming the Latin-1 encoding rather than UTF-8, you will read “Ã©”.
    In Latin-1, the character “é” is represented by the sequence 1110 1001.
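
The bit patterns quoted above can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name to_bits is invented for the example.

```python
def to_bits(data):
    """Format bytes as space-separated groups of eight bits."""
    return " ".join(f"{byte:08b}" for byte in data)


utf8_bytes = "é".encode("utf-8")      # two bytes: 0xC3 0xA9
latin1_bytes = "é".encode("latin-1")  # one byte: 0xE9

# Decoding the UTF-8 bytes with the wrong encoding yields two characters
# (U+00C3 and U+00A9) instead of the single "é" that was written.
misread = utf8_bytes.decode("latin-1")

if __name__ == "__main__":
    print(to_bits(utf8_bytes))
    print(to_bits(latin1_bytes))
    print(ascii(misread))
```

The same bytes therefore produce different text depending on the encoding assumed by the reader, which is why the encoding has to be announced rather than guessed.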

    When the browser receives bytes from your server, it needs to identify the collection of letters and symbols that were used in writing the text that was converted into these bytes, and the encoding used for this conversion, in order to reverse it. If no information of this kind has been transmitted, the browser will try to find recognizable patterns within the bytes to determine the encoding itself, and eventually try some common charsets, which will take time, delaying further processing of the page.

    To speed up the display of your pages, you must specify the content encoding in your HTTP response.

    How to choose the right character set?

    There was a time when hundreds of character encodings coexisted, all limited and not able to contain enough characters to cover all the languages of the world. Sometimes, no encoding was adequate for all letters in a single language.

    Nowadays, Unicode – a universal character set, defining all the characters necessary to write the majority of languages – has become a standard, no matter what platform, device, application or language you’re targeting. UTF-8 is one of the Unicode encodings and the one that should be used for Web content, according to the W3C:

    Everyone developing content, whether content authors or programmers, should use the UTF-8 character encoding, unless there are very special reasons for using something else. (If you decide to not use UTF-8, you must choose one of the few encodings that are interoperably implemented across all browsers.)
    “Introducing Character Sets and Encodings”, W3C

    Note: if you’re using a database to store your content on the server side, you may be tempted to use its “utf8” charset as well. Beware: on MySQL and MariaDB, “utf8” has historically been an alias for “utf8mb3”, a UTF-8 variant limited to the Basic Multilingual Plane (BMP) that stores a maximum of three bytes per code point. Instead, use “utf8mb4”, which stores up to four bytes per code point. Otherwise, you won’t be able to store some popular characters, such as 🚀, also known as “U+1F680 ROCKET”!
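
To see why the three-byte limit matters, the following Python sketch checks whether a string stays within the Basic Multilingual Plane and counts the UTF-8 bytes of a character. The function names fits_in_utf8mb3 and utf8_byte_length are invented for this example; they only model the three-byte limit and do not talk to a database.

```python
def utf8_byte_length(ch):
    """Return how many bytes a character occupies once encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(text):
    """Return True if every code point lies in the Basic Multilingual Plane.

    Code points up to U+FFFF need at most three bytes in UTF-8, so this
    approximates what a three-byte-per-character column can hold.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    for sample in ("é", "あ", "\U0001F680"):
        print(ascii(sample), utf8_byte_length(sample), fits_in_utf8mb3(sample))
```

The rocket (U+1F680) lies outside the BMP and takes four bytes, so it is the kind of character a three-byte charset cannot hold.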

    How to advertise your character encoding… and the best way to do it.

    Before going any further, let’s take a look at the vocabulary in use.

    Historically, the terms “character encoding”, “character map”, “character set” and “code page” were synonymous in computer science[…]. But now the terms have related but distinct meanings,[…] Regardless, the terms are still used interchangeably, with character set being nearly ubiquitous.
    “Character encoding”, Wikipedia

    We find this use of “character set” or “charset” to designate, in reality, an encoding, in the HTML specifications. We will do the same in the rest of this article.

    One of the easiest ways to specify a charset in an HTML page is to put in a <meta charset="utf-8"> tag, inside the <head> of the document.

    Declaring a character set this way requires certain constraints to be respected, one of them being that the element containing the character encoding declaration must be serialized completely within the first 1024 bytes of the document, to ensure that the browser will receive the information with the first IP packets transiting through the network and can use it to decode the rest of the document. As the charset declaration only helps if the browser meets it early, place it as close as possible to the start of the <head>.
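
The 1024-byte constraint is easy to check mechanically. The Python sketch below is one possible way to do it, using a deliberately loose regular expression rather than a real HTML parser; the names META_CHARSET and meta_charset_within_limit are assumptions made for this example.

```python
import re

# Loose pattern for a <meta ... charset=...> declaration. A real HTML parser
# is more robust; this is only meant for a quick sanity check.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def meta_charset_within_limit(html_bytes, limit=1024):
    """Return True if a meta charset tag ends within the first `limit` bytes."""
    match = META_CHARSET.search(html_bytes)
    if match is None:
        return False
    # The whole element must be inside the limit, so look for its closing ">".
    end_of_tag = html_bytes.find(b">", match.end())
    return end_of_tag != -1 and end_of_tag < limit


if __name__ == "__main__":
    early = b'<!DOCTYPE html><html><head><meta charset="utf-8"></head></html>'
    print(meta_charset_within_limit(early))
```

A long inline comment, script, or style block placed before the declaration is the usual way a page ends up failing this check.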

    If you’re worried about forgetting this, don’t be: it is one of the checks that Dareboost performs for you within our website quality analysis tool. However, you may find yourself in a situation where this declaration is not sufficient, and the browser does not take it into account. Why? Because the Content-Type metadata of the page may indicate another character set and, in the event of a conflict, this information – defined in the page HTTP headers – has priority.
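
The precedence described above can be sketched in Python as follows. This is a simplified model that only looks at the Content-Type header and the meta declaration and ignores other signals such as a byte order mark; the names effective_charset and META_CHARSET are invented for the example.

```python
import re
from email.message import Message

# Loose pattern for a <meta ... charset=...> declaration (illustrative only).
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def effective_charset(content_type_header, html_bytes):
    """Return the charset a browser would be told to use, or None.

    The HTTP header wins; the meta declaration is only a fallback.
    """
    if content_type_header:
        msg = Message()
        msg["Content-Type"] = content_type_header
        declared = msg.get_content_charset()
        if declared:
            return declared
    # Only the first 1024 bytes are scanned for the in-document declaration.
    match = META_CHARSET.search(html_bytes[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


if __name__ == "__main__":
    page = b'<meta charset="utf-8"><p>x</p>'
    print(effective_charset("text/html; charset=ISO-8859-1", page))
```

In the sample call the page declares utf-8, but the header says ISO-8859-1, and the header's value is the one returned.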

    To make sure of the information transmitted through the page metadata, you can use our Timeline / Waterfall feature. Unfold the detailed values of your main document to view the response HTTP headers, including the Content-Type header, containing the encoding metadata.

    Content Type Header in an HTTP response

    To change this HTTP header, you may need the help of the person who manages the server, whether that is your hosting service provider or someone in charge within your organization: the configuration of HTTP headers is very specific to the web server in use, and you’ll need the appropriate administrative rights to modify those server settings.

    On Apache 2.2+, the configuration of UTF-8 as a default character set for your text/plain and text/html files involves the AddDefaultCharset directive:

    AddDefaultCharset utf-8

    For other types of files, you’ll need the AddCharset directive, which maps file extensions to a charset, for example:

    AddCharset utf-8 .css .js

    On nginx, you’ll need to make sure that the ngx_http_charset_module is loaded, then use the charset directive.

    Here too, it is possible to refine the scope so that other types of files than text/html are delivered in utf-8, using the charset_types directive:

    charset_types text/html text/css application/javascript;

    Of course, you can also configure the Content-Type HTTP header from your server-side scripting code. For example, in PHP, you can use the header() network function. Don’t forget to define the Media Type (or MIME type) of the body of the response, in addition to the character set.

    header('Content-type: text/html; charset=utf-8'); 
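
For comparison, here is roughly what the same idea looks like in a Python WSGI application. This is a sketch under assumptions: the function name application and the sample page are invented, and a real deployment would sit behind a WSGI server of your choice.

```python
def application(environ, start_response):
    """Minimal WSGI app that declares both the media type and the charset."""
    page = '<!DOCTYPE html><meta charset="utf-8"><title>Demo</title><p>é</p>'
    # Encode the body with the same encoding announced in the header.
    body = page.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

For a quick local trial, the standard library's wsgiref.simple_server.make_server can serve this function. The point to notice is that the header value and the encode() call name the same encoding.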

    Beware: if your pages are delivered by a Content Delivery Network (CDN), you may have to configure the Content-Type header in the CDN configuration, as a CDN may not pass on the headers it finds on your servers.

    Additional Resources

    • The IANA Media Types list
    • The Unicode website
    • “Character encodings: Essential concepts”, on the W3C Internationalization website
    • Still thinking that Character Encoding is a trivial issue? Look at “Anatomy of a Critical Security Bug”, by Andrew Nacin at Loop Conf 2015.
