Remove special chars python

Remove all special characters, punctuation and spaces from string

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.


This can be done without regex:

    >>> string = "Special $#! characters spaces 888323"
    >>> ''.join(e for e in string if e.isalnum())
    'Specialcharactersspaces888323'

    S.isalnum() -> bool
    Return True if all characters in S are alphanumeric and there is at least one character in S, False otherwise.

If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that’s the best way to go about it.

@DiegoNavarro except that’s not true, I benchmarked both the isalnum() and regex versions, and the regex one is 50-75% faster

Tried this in Python 3: it accepts Unicode characters, so it's useless to me. Try string = "B223323\§§§$3\u445454" as an example. The result? 'B2233233䑔54'

Additionally: "For 8-bit strings, this method is locale-dependent."! Thus the regex alternative is strictly better!
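If the goal is ASCII letters and digits only, the isalnum() approach can be narrowed rather than abandoned. The following is a minimal sketch, assuming Python 3.7 or later (for str.isascii); the function name keep_ascii_alnum is made up for illustration.

```python
def keep_ascii_alnum(text):
    """Keep only ASCII letters and digits (str.isascii needs Python 3.7+)."""
    return ''.join(ch for ch in text if ch.isascii() and ch.isalnum())


# The CJK character and the accented letters are dropped along with the symbols.
print(keep_ascii_alnum("B223323\\§§§$3\u4454áö54"))  # B223323354
```

Whether non-ASCII letters should survive depends on the data, so this is one option, not a replacement for the regex answers.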

Here is a regex to match a string of characters that are not letters or numbers:

Here is the Python command to do a regex substitution:
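The snippets for this answer appear to be missing from the page. Here is a minimal sketch of what the two sentences describe, assuming the same pattern that Example 2 further down uses ([^A-Za-z0-9]+); the variable names are illustrative only.

```python
import re

# Pattern assumed from Example 2 later on this page:
# one or more characters that are NOT ASCII letters or digits.
pattern = r'[^A-Za-z0-9]+'

my_string = "Special $#! characters   spaces 888323"  # illustrative input
cleaned = re.sub(pattern, '', my_string)
print(cleaned)  # Specialcharactersspaces888323
```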

I guess this doesn't work with modified characters in other languages, like á, ö, ñ, etc. Am I right? If so, what would the regex for those be?

Just add the special characters of that particular language. For example, for German text, re.sub(r"[^A-Za-z0-9 ,._'äöüÄÖÜß-]+", "", sample_text) can be used (the hyphen is placed last in the class so that it is taken literally rather than as a range).
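In Python 3, an alternative to listing each language's letters by hand is to lean on Unicode-aware matching: for str patterns, \w already covers letters such as á, ö and ñ. A sketch follows (the function name strip_non_word is illustrative); [\W_] is used so that underscores are stripped too.

```python
import re


def strip_non_word(text):
    """Remove everything except characters Unicode treats as letters or digits."""
    # \W matches non-word characters; '_' counts as a word character,
    # so it is added to the class explicitly.
    return re.sub(r'[\W_]+', '', text)


print(strip_non_word("El niño_come más, ¿verdad? Größe: 42!"))
# ElniñocomemásverdadGröße42
```

Note that this keeps letters and digits from every script, which may or may not be what a given use case wants.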

    import re
    cleanString = re.sub('\W+','', string )

If you want spaces between words and numbers, substitute '' with ' '

Depends on the context: underscore is very useful for filenames and other identifiers, to the point that I don't treat it as a special character but rather as a sanitised space. I generally use this method myself.

r'\W+': slightly off topic (and very pedantic), but I suggest making it a habit to write all regex patterns as raw strings

TLDR

I timed the provided answers.

The fastest option (Example 3 below) is typically 3x faster than the next fastest of the top answers provided.

Caution should be taken when using this option. Some special characters (e.g. ø) may not be stripped using this method.

After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:

  • string1 = 'Special $#! characters spaces 888323'
  • string2 = 'how much for the maple syrup? $20.99? That s ridiculous. '

Example 1

    ''.join(e for e in string if e.isalnum())

Example 2

    import re
    re.sub('[^A-Za-z0-9]+', '', string)

Example 3

The above results are a product of the lowest returned result from an average of: repeat(3, 2000000)

Example 3 can be 3x faster than Example 1.
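For anyone who wants to rerun the comparison, a rough timeit harness might look like the following. It assumes Example 3 is the \W+ substitution from the earlier answer (its snippet is not shown above), and it uses a much smaller call count than the 2000000 quoted, so absolute numbers will differ by machine and Python version. The helper name best_time is hypothetical.

```python
import re
import timeit

string1 = 'Special $#! characters spaces 888323'


def example1(s):
    return ''.join(e for e in s if e.isalnum())


def example2(s):
    return re.sub('[^A-Za-z0-9]+', '', s)


def example3(s):
    # Assumption: the missing Example 3 is the \W+ variant.
    return re.sub(r'\W+', '', s)


def best_time(func, s, repeat=3, number=20000):
    """Lowest total time over `repeat` runs of `number` calls each."""
    return min(timeit.repeat(lambda: func(s), repeat=repeat, number=number))


if __name__ == '__main__':
    for func in (example1, example2, example3):
        print(func.__name__, best_time(func, string1))
```

Taking the minimum of the repeats follows the usual timeit advice, since higher values mostly reflect interference from other processes.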

@kkurian If you read the beginning of my answer, this is merely a comparison of the previously proposed solutions above. You might want to comment on the originating answer. stackoverflow.com/a/25183802/2560922

Python 2.*

I think just filter(str.isalnum, string) works

    In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
    Out[20]: 'stringwithspecialcharslikeetcs'

Python 3.*

In Python 3, filter() returns an iterator rather than a string (unlike in Python 2 above), so you have to join the result to get a string back:

''.join(filter(str.isalnum, string)) 

or, to pass a list to join (not sure, but it may be a bit faster):

''.join([*filter(str.isalnum, string)]) 

Note: unpacking with [*args] is valid in Python 3.5 and later.

@Alexey correct, in Python 3 map and filter return iterators instead. Still, in Python 3+ I prefer ''.join(filter(str.isalnum, string)) (or, to pass a list to join, ''.join([*filter(str.isalnum, string)])) over the accepted answer.

I'm not certain ''.join(filter(str.isalnum, string)) is an improvement on filter(str.isalnum, string), at least to read. Is this really the Pythreenic (yeah, you can use that) way to do this?

@TheProletariat The point is that filter(str.isalnum, string) alone does not return a string in Python 3, because filter() there returns an iterator rather than the argument's type, unlike in Python 2.

@GrijeshChauhan, I think you should update your answer to include both your Python2 and Python3 recommendations.

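    #!/usr/bin/python
    import re

    strs = "how much for the maple syrup? $20.99? That's ridiculous. "
    print(strs)
    # Inside a character class, | is a literal pipe, so list the characters directly.
    nstr = re.sub(r'[?$.!]', '', strs)
    print(nstr)
    # Drop everything that is not a letter, a digit or a space.
    nestr = re.sub(r'[^a-zA-Z0-9 ]', '', nstr)
    print(nestr)

You can add more special characters to the first character class; each one is replaced by '' (the empty string), i.e. removed.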

Unlike the other regex answers, I would exclude every character that is not what I want, instead of explicitly enumerating what I don't want.

For example, if I want only characters from ‘a to z’ (upper and lower case) and numbers, I would exclude everything else:

    import re
    s = re.sub(r"[^a-zA-Z0-9]","",s)

This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z', with an empty string".

In fact, if you put the special character ^ first inside a character class ([...]), you get the negation of that class.
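A quick illustration of the two roles of ^, using only the standard re module: as the first character inside a character class it negates the class, while outside a class it anchors the match at the start of the string. The sample string is arbitrary.

```python
import re

s = "abc-123"

# ^ as the first character inside [...] negates the class:
# every character that is NOT a lowercase letter is removed.
print(re.sub(r'[^a-z]', '', s))   # abc

# ^ outside a class anchors at the start of the string:
# only a lowercase letter in first position is removed.
print(re.sub(r'^[a-z]', '', s))   # bc-123
```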

Extra tip: if you also need to lowercase the result, lowercase the input first. The regex then becomes simpler (and possibly a little faster), because no uppercase letters remain to be matched.

    import re
    s = re.sub(r"[^a-z0-9]","",s.lower())


Removing special characters from a string in Python

I have text like the string below in Python. How do I remove the ↑? I've tried most of the methods suggested by Google, but none seem to work.

Lorem Ipsum ↑ The results really show what a poisonous 

Did you explicitly mean the arrow char only or any special character? There’s a conflict between the title and the body.


    >>> s = '''Lorem Ipsum ↑ The results really show what a poisonous'''
    >>> s = s.replace('↑', '')
    >>> print(s)
    Lorem Ipsum  The results really show what a poisonous

That works in the interpreter. If your code is in a file, you can declare the encoding of your .py file by placing a line like this at the top:

    # -*- coding: utf-8 -*-

    import string

    s = '''Lorem Ipsum ↑ The results really show what a poisonous'''
    # Keep only alphanumerics, ASCII punctuation and whitespace.
    clean_string = "".join([ch for ch in s if ch.isalnum() or ch in string.punctuation or ch.isspace()])

This removes every character that is not alphanumeric, ASCII punctuation, or whitespace.

Well, what you show here contains the unicode character U+2191. But you forgot to say whether it was a unicode string or a byte string and in the latter case what is the charset.

If it is a unicode string (Python 3 string or Python 2 unicode):

does the trick, whatever your Python version or charset.

If it is a byte string (Python 2 string or Python 3 bytes):

s.replace(u'\u2191'.encode(charset), b'') 

does the trick provided you know what charset you use.
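Both cases can be sketched in Python 3 as follows. The unicode-string line is an assumption made by analogy with the byte-string call (that snippet is not shown above), and charset = 'utf-8' is an illustrative choice, not something the answer specifies.

```python
# Text (unicode) string: replace the escaped code point directly.
s = u'Lorem Ipsum \u2191 The results'
print(s.replace(u'\u2191', u''))

# Byte string: encode the arrow with the same charset as the data.
charset = 'utf-8'  # assumption: must match how the bytes were produced
b = s.encode(charset)
print(b.replace(u'\u2191'.encode(charset), b''))
```

Writing the arrow as \u2191 keeps the source file pure ASCII, so the result does not depend on how the .py file itself is decoded.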

I always prefer this kind of escaped input for non-ASCII characters, because the charset used to read the Python source may not be the charset used when the program is run (that is what the # -*- coding= . -*- line is meant for).

I use this script in python for replacing and removing characters:

    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-
    # Script for replacing characters in a plain text file (Python 3)
    from functools import reduce  # reduce is not a builtin in Python 3

    # (old, new) pairs, applied in order
    diccionario = [("perros", "gatos"), ("↑", "")]

    with open('input.txt', 'r', encoding='utf-8') as original:
        data = original.read()

    # Apply each replacement to the running result.
    salida = reduce(lambda a, kv: a.replace(*kv), diccionario, data)

    with open('output.txt', 'w', encoding='utf-8') as final:
        final.write(salida)

In this example, I am replacing the word "perros" with "gatos" and removing the ↑ symbol. Make sure that the file you are processing is saved with UTF-8 encoding.


How to remove special characters in a string in Python 3?

I'm very confused about how to remove not only special characters but also some of the letters between the special characters. Can anyone suggest a way to do that?


You can use the html module and BeautifulSoup to get the text without the escaped tags:

    from html import unescape
    from bs4 import BeautifulSoup

    s = "<b><i><u>Charming boutique selling trendy casual &amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp; jewelry.</u></i></b>"
    soup = BeautifulSoup(unescape(s), 'lxml')
    print(soup.text)

    Charming boutique selling trendy casual & dressy apparel for women, some plus sized items, swimwear, shoes & jewelry.

When I checked the page source, the string shows up with escaped characters, but when I printed it on the command line, it shows up like this: Charming boutique selling trendy casual & dressy apparel for women, some plus sized items, swimwear, shoes & jewelry. Your solution removes all the tags, but &amp; doesn't get removed. I tried to use the replace function, but that doesn't work either.

    import re

    string = '<b><i><u>Charming boutique selling trendy casual &amp; dressy apparel for women, some plus sized items, swimwear, shoes &amp; jewelry.</u></i></b>'
    # Remove simple opening and closing tags such as <b> and </u>.
    string = re.sub(r'</?[a-z]+>', '', string)
    # Turn the escaped ampersand back into a literal one.
    string = string.replace('&amp;', '&')
    print(string)
    # prints: Charming boutique selling trendy casual & dressy apparel for women, some plus sized items, swimwear, shoes & jewelry.

Your string that you want to change looks like it was HTML that’s been escaped a few times over, so my solution only works for that kind of thing.

I used regex to replace the tags with empty strings, and also I replaced the escape for an ampersand with a literal & .

Hopefully this is what you’re looking for, let me know if you have any troubles.
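When the number of escaping rounds is unknown, one stdlib-only approach is to unescape until the text stops changing and then strip the simple tags. This is a sketch with a made-up helper name (clean_escaped_html) and an arbitrary cap on rounds; the tag regex only handles bare tags like <b> and is not a general HTML parser.

```python
import re
from html import unescape


def clean_escaped_html(text, max_rounds=5):
    """Unescape repeatedly, then drop simple tags such as <b> or </u>."""
    for _ in range(max_rounds):
        unescaped = unescape(text)
        if unescaped == text:   # nothing left to unescape
            break
        text = unescaped
    return re.sub(r'</?[a-zA-Z]+>', '', text)


# Illustrative input that has been escaped three times over.
raw = '&amp;lt;b&amp;gt;casual &amp;amp;amp; dressy&amp;lt;/b&amp;gt;'
print(clean_escaped_html(raw))  # casual & dressy
```

Repeated unescaping can go too far if the text legitimately contains a literal entity such as &amp;lt;, so it only suits input known to be over-escaped.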
