HTML escape and unescape

Escaping HTML

The cgi module that comes with Python has an escape() function (deprecated since Python 3.2 and removed in Python 3.8):

    import cgi

    s = cgi.escape("""& < >""")  # s = "&amp; &lt; &gt;"

However, it doesn't escape characters beyond &, <, and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".

Python 3.2 and later provide an html module with an html.escape() function; html.unescape() was added in Python 3.4. html.escape() differs from cgi.escape() in that quote defaults to True, so both " and ' are escaped:

    import html

    s = html.escape("""& < " ' >""")  # s = '&amp; &lt; &quot; &#x27; &gt;'
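As a quick check of the defaults described above, here is a minimal sketch, assuming Python 3.4 or later so that html.unescape() is available. The sample string is made up for illustration and is not from the original page.

```python
import html

# Illustrative input only; any string containing markup characters will do.
sample = """<a href="x">Tom & 'Jerry'</a>"""

escaped = html.escape(sample)                 # quote=True is the default
no_quotes = html.escape(sample, quote=False)  # only &, < and > are replaced

print(escaped)
print(no_quotes)

# html.unescape() turns both forms back into the original text.
print(html.unescape(escaped) == sample)
print(html.unescape(no_quotes) == sample)
```

With quote=False the result is safe in element content but not inside an attribute value, which is presumably why the newer function escapes quotes by default.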

Here's a small snippet that will let you escape quotes and apostrophes as well:

    html_escape_table = {
        "&": "&amp;",
        '"': "&quot;",
        "'": "&apos;",
        ">": "&gt;",
        "<": "&lt;",
    }

    def html_escape(text):
        """Produce entities within text."""
        return "".join(html_escape_table.get(c, c) for c in text)

You can also use escape() from xml.sax.saxutils to escape HTML. This function should execute faster. It handles &, < and > itself and accepts an optional dictionary of extra replacements. The unescape() function of the same module takes a dictionary mapping in the opposite direction (entity to character) to decode a string.

    from xml.sax.saxutils import escape, unescape

    # escape() and unescape() take care of &, < and >.
    html_escape_table = {
        '"': "&quot;",
        "'": "&apos;",
    }
    html_unescape_table = {v: k for k, v in html_escape_table.items()}

    def html_escape(text):
        return escape(text, html_escape_table)

    def html_unescape(text):
        return unescape(text, html_unescape_table)

Unescaping HTML

Undoing the escaping performed by cgi.escape() isn't directly supported by the library. This can be accomplished using a fairly simple function, however:

    def unescape(s):
        s = s.replace("&lt;", "<")
        s = s.replace("&gt;", ">")
        # this has to be last:
        s = s.replace("&amp;", "&")
        return s

In Python 2, the unescape() method of HTMLParser can also be called directly:

    >>> from HTMLParser import HTMLParser
    >>> HTMLParser.unescape.__func__(HTMLParser, 'ss&copy;')
    u'ss\xa9'

Note that the unescape() function above undoes exactly what cgi.escape() does; it's easy to extend it to undo what the html_escape() function above does. Note the comment that converting the &amp; must be last; this avoids getting strings like "&amp;lt;" wrong.
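The ordering point can be seen with a small comparison. This is only a sketch: the two function names are hypothetical, and the second one uses the wrong order on purpose.

```python
def unescape_amp_last(s):
    """Replace &amp; last, as recommended above."""
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    return s.replace("&amp;", "&")


def unescape_amp_first(s):
    """Deliberately wrong order, shown only for comparison."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    return s.replace("&gt;", ">")


escaped = "&amp;lt;"  # the escaped form of the literal text "&lt;"

print(unescape_amp_last(escaped))   # &lt;
print(unescape_amp_first(escaped))  # <
```

Replacing &amp; first produces a new "&lt;" sequence that the next replacement then decodes a second time, so the text is unescaped twice.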

This approach is simple and fairly efficient, but is limited to supporting the entities given in the list. A more thorough approach would be to perform the same processing as an HTML parser. Using the HTML parser from the standard library is a little more expensive, but many more entity replacements are supported "out of the box." The table of entities which are supported can be found in the htmlentitydefs module from the library; this is not normally used directly, but the htmllib module uses it to support most common entities. It can be used very easily:

    import htmllib

    def unescape(s):
        p = htmllib.HTMLParser(None)
        p.save_bgn()
        p.feed(s)
        return p.save_end()

This version has the additional advantage that it supports character references (things like &#65;) as well as entity references.

A more efficient implementation would simply parse the string for entity and character references directly (and would be a good candidate for the library, if there's really a need for it outside of HTML data).
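One way such a direct parse might look in Python 3 is sketched below. It assumes html.entities (the Python 3 name of htmlentitydefs) as the entity table; the function name decode_references and the exact regular expression are illustrative choices, not part of any library. On Python 3.4 and later, html.unescape() already covers this case and is the more complete option.

```python
import re
from html.entities import name2codepoint

# Matches &name;, decimal &#65; and hexadecimal &#x41; references.
_REFERENCE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def decode_references(text):
    """Decode entity and character references in a single pass."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2].lower() == "#x":
                return chr(int(body[2:], 16))
            if body.startswith("#"):
                return chr(int(body[1:]))
            return chr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            # Unknown names and out-of-range numbers are left untouched.
            return match.group(0)

    return _REFERENCE.sub(replace, text)


print(decode_references("&lt;b&gt; &#65;&#x42; &copy; &bogus;"))
```

Because every reference is decoded in one pass over the input, the ordering problem of the chained replace() calls does not arise: "&amp;lt;" comes out as "&lt;".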

Formal htmlentitydefs

Yet another approach available with recent Python takes advantage of htmlentitydefs:

    import re
    from htmlentitydefs import name2codepoint

    def htmlentitydecode(s):
        return re.sub('&(%s);' % '|'.join(name2codepoint),
                      lambda m: unichr(name2codepoint[m.group(1)]), s)

Builtin HTML/XML escaping via ASCII encoding

A very easy way to transform non-ASCII characters like German umlauts or letters with accents into their HTML equivalents is simply to encode them from unicode to ASCII using the xmlcharrefreplace error handler:

    >>> a = u"äöüßáà"
    >>> a.encode('ascii', 'xmlcharrefreplace')
    '&#228;&#246;&#252;&#223;&#225;&#224;'

Note that this only transforms non-ASCII characters and therefore leaves <, > and & as they are. However, you can combine this technique with cgi.escape().
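A minimal sketch of that combination in Python 3 follows. It assumes html.escape() as the stand-in for cgi.escape(), which is no longer available in current Python versions; the helper name to_ascii_html and the sample text are hypothetical.

```python
import html


def to_ascii_html(text):
    """Escape markup characters, then replace non-ASCII characters."""
    # Escape first, so the "&" of the numeric references added in the
    # second step is not itself turned into "&amp;".
    escaped = html.escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_html('Grüße <b> & "Spaß"'))
```

The result is a pure ASCII string, so it survives transports and templates that cannot be trusted to preserve the original encoding.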

See Also

John J. Lee discusses still more refinements in implementation in this comp.lang.python follow-up.

EscapingHtml (last edited 2016-11-19 12:13:41 by OleskandrGavenko )


HTML/text/JavaScript Escaping/Encoding Script

These scripts are intended to explain how to "hide" HTML and/or javascript from other people who view your page's source code. It is not foolproof, but it does make it more difficult to read and understand the source code. Due to the nature of how these scripts work, the explanation may seem complicated and drawn out, but be patient and it should make sense once you gain a little experience with them. You don't really have to know the ins-and-outs of these scripts, but it does help you understand how and why they work. So, take a seat and I'll do my best to make this seem as un-complicated as possible.

Escape/Unescape

The first section of this page explains how to "escape" any text, HTML, or Javascript to make it generally unreadable to the common user. URL Escape Codes are two-character Hexadecimal (8-bit) values preceded by a % sign. They are used primarily in browser URLs or when making cookies, for characters that otherwise would not work, usually because they are reserved characters (like spaces and the like).

For example, if you had an HTML filename of page one , the escaped URL code would look like page%20one . The %20 is the escaped value for a space. Normally, you would only escape special characters (generally any character other than a-z, A-Z, and 0-9), but the script below actually escapes all the text simply by replacing all characters with their escaped equivalents. So, if you were to fully escape the words page one , it would look like: %70%61%67%65%20%6F%6E%65 . Now, none of the text is easily decipherable even though most of it was made up of normal characters.
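The full escaping described here can be reproduced outside the browser. The following is a rough Python sketch rather than the page's own script: escape_all is a hypothetical helper, and encoding non-ASCII text as UTF-8 bytes is an assumption (the JavaScript escape() function treats such characters differently).

```python
from urllib.parse import unquote


def escape_all(text):
    """Percent-encode every byte, not just the reserved characters."""
    return "".join("%{:02X}".format(byte) for byte in text.encode("utf-8"))


encoded = escape_all("page one")
print(encoded)           # %70%61%67%65%20%6F%6E%65
print(unquote(encoded))  # page one

# Every ASCII character becomes three characters.
print(len(encoded) // len("page one"))  # 3
```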

Since the browser can inherently handle escape codes, this can be used pretty easily without having to add any more script to decipher them. So, if you want the browser to write that escaped text to the page, you could do something like:

    document.write( unescape( '%70%61%67%65%20%6F%6E%65' ) );

All I'm doing here is putting the escaped string in a set of quotes (important!), wrapping that inside the built-in unescape() method, and then wrapping that in a document.write() method. This might seem a little worthless, but you could hide an email address this way to prevent web crawlers from snagging your e-mail address from your webpage to use in mass spam e-mailings, yet allowing visitors to read it fine. Unless, of course, you actually like getting Viagra solicitations. 🙂

For instance, fully escaped Script Asylum no-reply e-mail address would look like this to a web crawler:

document.write( unescape( '%6E%6F%72%65%70%6C%79%40%73%63%72%69%70%74%61%73%79%6C%75%6D%2E%63%6F%6D' ) );

...but would look like this to a visitor:

noreply@scriptasylum.com


Encoding/Decoding

Now, you probably have figured out that you could hide an entire HTML page using the above method; but there are two disadvantages to doing that: Size and ease of "cracking" your code.

When you fully escape an entire page, every single character becomes 3 characters. This will triple the size of your page. Not a big deal if the page is only about 10-50 KBytes in size; but when you have a fairly large page (>100 KBytes), the filesize increases rapidly. This would slow the load time for surfers without a broadband connection.

Also, if someone were to look at your source code, it would be pretty easy to figure out what you are doing. Then they can simply copy & paste the code and make a small script to display the normal content. There is no absolute foolproof way (client-side) to foil someone from viewing your source if they are determined enough; the best you can hope for is to make it as inconvenient as possible.

So, to address both concerns you could encode/decode the text. Again, it won't be foolproof to keep people from stealing your source content if they really want it. I am really using the terms "encode" and "decode" loosely here; what the following script does is not considered actual encoding, but it's easier to say it that way. The encoded output will be a bit longer than the original text, but a lot less than if you had simply escaped it all.

The above section just escapes the text. The section below actually shifts the Unicode values so the result looks like gibberish. Give it a try and you'll see; don't forget to try different Code Key values from the drop-down box.

  1. First, all the text is escaped.
  2. Then the script finds the Unicode values for each character in the string.
  3. Then the script adds whatever the Code Key drop-down box value is to each character's Unicode value.
  4. Then the script derives characters based on the shifted Unicode values.
  5. The Code Key value is also embedded in the encoded text so the script knows how to properly decode the string again.
  6. Finally, it escapes the result one more time to remove any special characters. Now, the output looks totally foreign to someone who cannot un-shift Unicode values in their head. 🙂
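
The six steps above can be sketched in Python as follows. This is not the page's actual script, which is not shown here: urllib.parse.quote() stands in for the browser's escape(), the function names are hypothetical, and storing the Code Key as a single leading digit is an assumption made only for illustration.

```python
from urllib.parse import quote, unquote


def encode(text, key):
    """Escape, shift every code point by key, embed the key, escape again."""
    if not 1 <= key <= 9:
        raise ValueError("this sketch only supports single-digit keys")
    escaped = quote(text, safe="")                         # step 1
    shifted = "".join(chr(ord(c) + key) for c in escaped)  # steps 2 to 4
    return quote(str(key) + shifted, safe="")              # steps 5 and 6


def decode(blob):
    """Reverse encode(): unescape, read the key, shift back, unescape."""
    raw = unquote(blob)
    key, shifted = int(raw[0]), raw[1:]
    return unquote("".join(chr(ord(c) - key) for c in shifted))


secret = encode("<b>page one</b>", 1)
print(secret)
print(decode(secret))
```

As the text says, this is obfuscation rather than encryption: the key travels with the data, so anyone who reads the decoding function can reverse it.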

Once escaped, the function looks like this:

Neat huh? 🙂
Anyway, now you have to make the browser write that part of the script to the page by wrapping it in the document.write() and unescape() methods like this:

Once the script above is encoded using "code key" number 1, it looks like this:

Then, you decode the string and write it to the page by calling the dF() function (which was just unescaped and written to the page in the previous step) passing the string above like this:

  • Javascript Encoder - Designed to encode Javascript only. Useful to only encode and install a script in an already created HTML page.
  • HTML Page Encoder - Designed to encode your whole HTML page. You just enter your HTML sourcecode into one box, select the encoding scheme, and press the "encode" button. The output can be pasted directly into a blank page and saved as an HTML file.


How to escape & unescape HTML characters in string in JavaScript


Escape HTML

Escaping HTML characters in a string means replacing the:

  • less than symbol (<) with &lt;
  • greater than symbol (>) with &gt;
  • double quotes (") with &quot;
  • single quote (') with &#39;
  • ampersand (&) with &amp;

Let’s suppose we have an HTML element as a string:

We can escape the HTML of the string using the replace method of the string.

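function escape(htmlStr) {
    // The ampersand must be replaced first, so that the "&" of the
    // entities added by the later steps is not escaped again.
    return htmlStr.replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&#39;");
}
console.log(escape("<script>alert('hi')</script>"));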

In the code above, we have used regex to globally replace each of the five characters listed earlier with its entity. The replace method returns a new string in which every match of the pattern is replaced with the replacement string.

Unescape HTML

Unescaping HTML in a string does the reverse of what we have done above, by replacing each entity (&lt;, &gt;, &quot;, &#39; and &amp;) with the character it stands for:

function unEscape(htmlStr) {
    htmlStr = htmlStr.replace(/&lt;/g, "<");
    htmlStr = htmlStr.replace(/&gt;/g, ">");
    htmlStr = htmlStr.replace(/&quot;/g, "\"");
    htmlStr = htmlStr.replace(/&#39;/g, "\'");
    // The ampersand is replaced last, so that "&amp;lt;" becomes "&lt;" and not "<".
    htmlStr = htmlStr.replace(/&amp;/g, "&");
    return htmlStr;
}
let unEscapedStr = unEscape(`&lt;script&gt;alert(&#39;hi&#39;)&lt;/script&gt;`);
console.log(unEscapedStr);
