- Unicode in JavaScript
- How JavaScript uses Unicode internally
- Using Unicode in a string
- Normalization
- Emojis
- Get the proper length of a string
- ES6 Unicode code point escapes
- Encoding ASCII chars
- Saved searches
- Use saved searches to filter your results more quickly
- License
- javascript-utilities/to-unicode
- Name already in use
- Sign In Required
- Launching GitHub Desktop
- Launching GitHub Desktop
- Launching Xcode
- Launching Visual Studio Code
- Latest commit
- Git stats
- Files
- README.md
Unicode in JavaScript
If not specified otherwise, the browser assumes the source code of any program to be written in the local charset, which varies by country and might give unexpected issues. For this reason, it’s important to set the charset of any JavaScript document.
How do you specify another encoding, in particular UTF-8, the most common file encoding on the web?
If the file contains a BOM character, that has priority on determining the encoding. You can read many different opinions online, some say a BOM in UTF-8 is discouraged, and some editors won’t even add it.
This is what the Unicode standard says:
… Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
This is what the W3C says:
In HTML5 browsers are required to recognize the UTF-8 BOM and use it to detect the encoding of the page, and recent versions of major browsers handle the BOM as expected when used for UTF-8 encoded pages. — https://www.w3.org/International/questions/qa-byte-order-mark
If the file is fetched using HTTP (or HTTPS), the Content-Type header can specify the encoding:
Content-Type: application/javascript; charset=utf-8
If this is not set, the fallback is to check the charset attribute of the script tag:
script src="./app.js" charset="utf-8">
If this is not set, the document charset meta tag is used:
. head> meta charset="utf-8" /> head> .
The charset attribute in both cases is case insensitive (see the spec)
Public libraries should generally avoid using characters outside the ASCII set in their code, to avoid it being loaded by users with an encoding that is different than their original one, and thus create issues.
How JavaScript uses Unicode internally
While a JavaScript source file can have any kind of encoding, JavaScript will then convert it internally to UTF-16 before executing it.
JavaScript strings are all UTF-16 sequences, as the ECMAScript standard says:
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.
Using Unicode in a string
A unicode sequence can be added inside any string using the format \uXXXX :
A sequence can be created by combining two unicode sequences:
Notice that while both generate an accented e, they are two different strings, and s2 is considered to be 2 characters long:
And when you try to select that character in a text editor, you need to go through it 2 times, as the first time you press the arrow key to select it, it just selects half element.
You can write a string combining a unicode character with a plain char, as internally it’s actually the same thing:
const s3 = 'e\u0301' //é s3.length === 2 //true s2 === s3 //true s1 !== s3 //true
Normalization
Unicode normalization is the process of removing ambiguities in how a character can be represented, to aid in comparing strings, for example.
Like in the example above:
const s1 = '\u00E9' //é const s3 = 'e\u0301' //é s1 !== s3
ES6/ES2015 introduced the normalize() method on the String prototype, so we can do:
s1.normalize() === s3.normalize() //true
Emojis
Emojis are fun, and they are Unicode characters, and as such they are perfectly valid to be used in strings:
Emojis are part of the astral planes, outside of the first Basic Multilingual Plane (BMP), and since those points outside BMP cannot be represented in 16 bits, JavaScript needs to use a combination of 2 characters to represent them
The 🐶 symbol, which is U+1F436 , is traditionally encoded as \uD83D\uDC36 (called surrogate pair). There is a formula to calculate this, but it’s a rather advanced topic.
Some emojis are also created by combining together other emojis. You can find those by looking at this list https://unicode.org/emoji/charts/full-emoji-list.html and notice the ones that have more than one item in the unicode symbol column.
👩❤️👩 is created combining 👩 ( \uD83D\uDC69 ), ❤️ ( \u200D\u2764\uFE0F\u200D ) and another 👩 ( \uD83D\uDC69 ) in a single string: \uD83D\uDC69\u200D\u2764\uFE0F\u200D\uD83D\uDC69
There is no way to make this emoji be counted as 1 character.
Get the proper length of a string
You’ll get 8 in return, as length counts the single Unicode code points.
Also, iterating over it is kind of funny:
And curiously, pasting this emoji in a password field it’s counted 8 times, possibly making it a valid password in some systems.
How to get the “real” length of a string containing unicode characters?
One easy way in ES6+ is to use the spread operator:
You can also use the Punycode library by Mathias Bynens:
require('punycode').ucs2.decode('🐶').length //1
(Punycode is also great to convert Unicode to ASCII)
Note that emojis that are built by combining other emojis will still give a bad count:
require('punycode').ucs2.decode('👩❤️👩').length //6 [. '👩❤️👩'].length //6
If the string has combining marks however, this still will not give the right count. Check this Glitch https://glitch.com/edit/#!/node-unicode-ignore-marks-in-length as an example.
(you can generate your own weird text with marks here: https://lingojam.com/WeirdTextGenerator)
Length is not the only thing to pay attention. Also reversing a string is error prone if not handled correctly.
ES6 Unicode code point escapes
ES6/ES2015 introduced a way to represent Unicode points in the astral planes (any Unicode code point requiring more than 4 chars), by wrapping the code in graph parentheses:
The dog 🐶 symbol, which is U+1F436 , can be represented as \u instead of having to combine two unrelated Unicode code points, like we showed before: \uD83D\uDC36 .
But length calculation still does not work correctly, because internally it’s converted to the surrogate pair shown above.
Encoding ASCII chars
The first 128 characters can be encoded using the special escaping character \x , which only accepts 2 characters:
This will only work from \x00 to \xFF , which is the set of ASCII characters.
Saved searches
Use saved searches to filter your results more quickly
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session.
Converts character or string to Hex Unicode
License
javascript-utilities/to-unicode
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Sign In Required
Please sign in to use Codespaces.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching Xcode
If nothing happens, download Xcode and try again.
Launching Visual Studio Code
Your codespace will open once ready.
There was a problem preparing your codespace, please try again.
Latest commit
Git stats
Files
Failed to load latest commit information.
README.md
Converts character or string to Hex Unicode
NodeJS developer dependencies may be installed via NPM.
Note, NPM is not required to utilize this project, only for tracking development dependencies.
For use with GitHub Pages, this repository encourages the use of Git Submodules to track dependencies
Bash Variables
_module_name='to-unicode' _module_https_url="https://github.com/javascript-utilities/to-unicode.git" _module_base_dir='assets/javascript/modules' _module_path="$ /$ "
Bash Submodule Commands
cd "" git checkout gh-pages mkdir -vp "$ " git submodule add -b main\ --name "$ "\ "$ "\ "$ "
Suggested additions for your ReadMe.md file so everyone has a good time with submodules
Clone with the following to avoid incomplete downloads git clone --recurse-submodules Update/upgrade submodules via git submodule update --init --merge --recursive
git add .gitmodules git add "$ " ## Add any changed files too git commit -F- 'EOF' :heavy_plus_sign: Adds `javascript-utilities/to-unicode#1` submodule **Additions** - `.gitmodules`, tracks submodules AKA Git within Git _fanciness_ - `README.md`, updates installation and updating guidance - `_modules_/to-unicode`, Converts character or string to Hex Unicode EOF git push origin gh-pages
🎉 Excellent 🎉 your project is now ready to begin unitizing code from this repository!
This project is compatible with NodeJS and most browsers that support ECMAScript (version 6 or greater)
const toUnicode = require('to-unicode'); var panda_code = toUnicode.fromCharacter('🐼'); console.log(panda_code); //> '1f43c'
Example usage within a web project.
> html lang pl-s">en" dir pl-s">ltr"> head> meta charset pl-s">utf-8"> title>toUnicode Usage Exampletitle> script type pl-s">text/javascript" src pl-s">assets/js/modules/to-unicode.js" differ>script> script type pl-s">text/javascript" src pl-s">assets/js/index.js" differ>script> head> body> span>Prefix: span> input type pl-s">text" id pl-s">client__text--prefix" value pl-s">0x"> br> span>Input: span> input type pl-s">text" id pl-s">client__text--input" value=""> pre id pl-s">client__text--output">pre> body> html>
assets/js/index.js
const text_input__callback = (_event) => const client_input = document.getElementById('client__text--input').value; const client_prefix = document.getElementById('client__text--prefix').value; const output_element = document.getElementById('client__text--output'); const unicode_list = toUnicode.fromString(client_input, client_prefix); console.log(unicode_list); output_element.innerText = unicode_list.join('\n'); >; window.addEventListener('load', () => const client_text_input = document.getElementById('client__text--input'); const client_text_prefix = document.getElementById('client__text--prefix'); client_text_input.addEventListener('input', text_input__callback); client_text_prefix.addEventListener('input', text_input__callback); >);
This repository may not be feature complete and/or fully functional, Pull Requests that add features or fix bugs are certainly welcomed.
- Fork this repository to an account that you have write permissions for.
- Add remote for fork URL. The URL syntax is git@github.com:/.git .
cd ~/git/hub/javascript-utilities/to-unicode git remote add fork git@github.com:NAME>/to-unicode.git
cd ~/git/hub/javascript-utilities/to-unicode git commit -F- 'EOF' :bug: Fixes #42 Issue **Edits** - `` script, fixes some bug reported in issue EOF git push fork main
Note, the -u option may be used to set fork as the default remote, eg. git push fork main however, this will also default the fork remote for pulling from too! Meaning that pulling updates from origin must be done explicitly, eg. git pull origin main
- Then on GitHub submit a Pull Request through the Web-UI, the URL syntax is https://github.com///pull/new/
Note; to decrease the chances of your Pull Request needing modifications before being accepted, please check the dot-github repository for detailed contributing guidelines.
Converts character or string to Hex Unicode Copyright (C) 2020 S0AndS0 This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3 of the License. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details. You should have received a copy of the GNU Affero General Public License along with this program. If not, see .
For further details review full length version of AGPL-3.0 License.