Reading Russian language data from CSV
\xea is the windows-1251 (cp1251) encoding of к. Therefore, you need to decode with windows-1251, not UTF-8.
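You can sanity-check this in a Python 3 shell:

b"\xea".decode("cp1251")   # -> 'к'
b"\xea".decode("utf-8")    # -> raises UnicodeDecodeError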
In Python 2.7, the csv module does not support Unicode input properly; see the "Unicode" section of https://docs.python.org/2/library/csv.html
They propose a simple workaround (the UTF8Recoder helper below comes from the same docs page):
import csv
import codecs

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8.
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self
This would allow you to do:
def loadCsv(filename):
    lines = UnicodeReader(open(filename, "rb"),
                          delimiter=";",
                          encoding="windows-1251")
    # If you really need lists, uncomment the next line.
    # It lets you fetch exact lines by doing `line_12 = lines[12]`:
    # return list(lines)
    # Returning the reader itself gives you an "iterator", so the file is
    # read lazily; use this if you'll just do `for line in lines`:
    return lines
If you try to print the dataset, you'll get a representation of a list of lists, where the outer list is rows and the inner lists are columns. Any encoded bytes or literals will be represented with \x or \u escapes. To print the values, do:
for csv_line in loadCsv("myfile.csv"):
    print u", ".join(csv_line)
If you need to write your results to another file (fairly typical), you could do:
import io

with io.open("my_output.txt", "w", encoding="utf-8") as my_output:
    for csv_line in loadCsv("myfile.csv"):
        my_output.write(u", ".join(csv_line))
This will automatically convert/encode your output to UTF-8.
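Note that all of this machinery is only needed on Python 2. On Python 3, open() decodes for you, so a minimal sketch (assuming the same windows-1251, semicolon-delimited file) is simply:

import csv

def load_csv(filename):
    # newline="" is what the csv docs recommend when handing a file to csv.reader
    with open(filename, encoding="windows-1251", newline="") as f:
        return [row for row in csv.reader(f, delimiter=";")]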
Solution 2
import pandas as pd

# Pass the encoding by keyword: the second positional argument of read_csv
# is the separator, so pd.read_csv(path_file, "cp1251") would not do what
# you want.
df = pd.read_csv(path_file, encoding="cp1251")
Or with the standard library:

import csv

with open(path_file, encoding="cp1251", errors="ignore") as source_file:
    reader = csv.reader(source_file, delimiter=",")
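To materialise the rows as the two-dimensional array the question asks for, consume the reader inside the with block; a small sketch:

import csv

with open(path_file, encoding="cp1251", errors="ignore") as source_file:
    reader = csv.reader(source_file, delimiter=";")  # the sample data uses ";"
    dataset = [row for row in reader]  # dataset[i][j] is row i, column j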
Solution 3
Could your .csv be in another encoding, not UTF-8? (Given the error message, it almost certainly is.) Try other Cyrillic encodings such as Windows-1251, CP866, or KOI8-R.
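A small sketch of that trial-and-error process (the codec names are the standard Python aliases; this only proves a file can be decoded without errors, not that the result is meaningful, so eyeball the output too):

CANDIDATES = ["utf-8", "cp1251", "cp866", "koi8_r"]

def guess_encoding(filename):
    # Return the first candidate encoding that decodes the whole file cleanly.
    for enc in CANDIDATES:
        try:
            with open(filename, encoding=enc) as f:
                f.read()
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding("krish(csv3).csv"))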
The original question (Erba Aitbayev)
I have a .csv file with rows like these:

2-комнатная квартира РДТ', мкр Тастак-3, Аносова — Толе би;Алматы
2-комнатная квартира БГР', мкр Таугуль, Дулати (Навои) — Токтабаева;Алматы
2-комнатная квартира ЦФМ', мкр Тастак-2, Тлендиева — Райымбека;Алматы
The delimiter is the ; symbol. I want to read the data and put it into an array. I tried to read it using this code:
import csv

def loadCsv(filename):
    lines = csv.reader(open(filename, "rb"), delimiter=";")
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [str(x) for x in dataset[i]]
    return dataset
mydata = loadCsv('krish(csv3).csv')
print mydata

This prints:
[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3, \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc, \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2, \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]
I also tried codecs:

import codecs

with codecs.open('krish(csv3).csv', 'r', encoding='utf8') as f:
    text = f.read()
    print text
but it fails with:

newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte
What is the problem? And when using codecs, how do I specify the delimiter? I just want to read the data from the file and put it into a two-dimensional array.
A comment suggested a lazy, generator-based variant:

return ([f.decode('cp1251') if isinstance(f, bytes) else f for f in row]
        for row in csv.reader(open(filename, "rb"), delimiter=";"))
read_csv fails to read file if there are cyrillic symbols in filename #17773
Code Sample, a copy-pastable example if possible
import pandas

cyrillic_filename = "./файл_1.csv"

# 'c' engine fails:
df = pandas.read_csv(cyrillic_filename, engine="c", encoding="cp1251")

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-18-9cb08141730c> in <module>()
      2
      3 cyrillic_filename = "./файл_1.csv"
----> 4 df = pandas.read_csv(cyrillic_filename , engine="c", encoding="cp1251")

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654
--> 655         return _read(filepath_or_buffer, kwds)
    656
    657     parser_f.__name__ = name

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    403
    404     # Create the parser.
--> 405     parser = TextFileReader(filepath_or_buffer, **kwds)
    406
    407     if chunksize or iterator:

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    762             self.options['has_index_names'] = kwds['has_index_names']
    763
--> 764         self._make_engine(self.engine)
    765
    766     def close(self):

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
    983     def _make_engine(self, engine='c'):
    984         if engine == 'c':
--> 985             self._engine = CParserWrapper(self.f, **self.options)
    986         else:
    987             if engine == 'python':

d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1603         kwds['allow_leading_cols'] = self.index_col is not False
   1604
-> 1605         self._reader = parsers.TextReader(src, **kwds)
   1606
   1607         # XXX

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:4209)()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source (pandas\_libs\parsers.c:8895)()

OSError: Initializing from file failed

# 'python' engine works:
df = pandas.read_csv(cyrillic_filename, engine="python", encoding="cp1251")
df.size
>> 172440

# 'c' engine works if filename can be encoded to utf-8
latin_filename = "./file_1.csv"
df = pandas.read_csv(latin_filename, engine="c", encoding="cp1251")
df.size
>> 172440
Problem description
The ‘c’ engine should read files with non-UTF-8 (e.g. Cyrillic) filenames, just as the ‘python’ engine does.
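Until this is fixed, a common workaround is to open the file yourself and hand the file object to read_csv, so the C engine never has to deal with the non-ASCII path (a sketch against the API shown above):

import pandas as pd

cyrillic_filename = "./файл_1.csv"

# read_csv accepts file objects; opening the file ourselves sidesteps the
# C engine's filename handling on Windows.
with open(cyrillic_filename, "rb") as f:
    df = pd.read_csv(f, encoding="cp1251")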
Expected Output
File content is read into a dataframe.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.2
scipy: 0.19.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.0.0
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
3 Ways to Handle non UTF-8 Characters in Pandas
So we’ve all gotten that error: you download a CSV from the web, or get emailed it from your manager who wants analysis done ASAP, and you find a card in your Kanban labelled URGENT AFF, so you open up VSCode, import pandas, and type the following: pd.read_csv('some_important_file.csv'). Now, instead of the actual import happening, you get a near-uninterpretable stack trace. What does that even mean?! And what the heck is utf-8?

As a brief primer/crash course: your computer (like all computers) stores everything as bits, or series of ones and zeros. To represent human-readable things (think letters) with ones and zeros, the American Standards Association (now ANSI) came up with the ASCII mappings. These map bytes (binary bits) to codes (in base-10, so numbers) which represent various characters. For example, 00111111 is the binary for 63, which is the code for ?. These letters then come together to form words, which form sentences.

The number of unique characters that ASCII can handle is limited by the number of unique bytes available: standard ASCII uses 7 bits for 128 characters, and even extended 8-bit variants top out at 256 unique characters, which is nowhere close to handling every single character from every single language. This is where Unicode comes in: Unicode assigns a "code point" in hexadecimal to each character. For example, U+1F602 maps to 😂. This allows for over a million possible code points, far broader than the original ASCII.
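You can poke at these mappings directly in Python; a quick illustration:

# 0b00111111 is 63 in decimal, and code point 63 is '?'
print(0b00111111)      # 63
print(chr(63))         # ?
print(ord("?"))        # 63

# Unicode code points go far beyond ASCII:
print(chr(0x1F602))    # 😂
print(hex(ord("😂")))  # 0x1f602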
UTF-8
UTF-8 translates Unicode characters to a unique binary string, and vice versa. However, UTF-8, as its name suggests, uses 8-bit code units and a variable-width scheme to save memory: common characters (like ASCII letters) take a single byte, while rarer characters take two, three, or four. This is similar in spirit to Huffman coding, which represents the most-used characters or tokens with the shortest codes. It is intuitive in the sense that we can afford to assign the least-used characters to longer byte sequences, as they appear less often. If every character were sent as 4 bytes instead, every text file you have would take up to four times the space.
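A concrete way to see the variable width (Python 3):

# ASCII characters cost one byte in UTF-8; rarer characters cost more.
print(len("?".encode("utf-8")))    # 1 byte
print(len("к".encode("utf-8")))    # 2 bytes (Cyrillic)
print(len("😂".encode("utf-8")))   # 4 bytes (outside the Basic Multilingual Plane)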
Caveat
However, not every file you receive is encoded in UTF-8: there are other Unicode encodings (such as UTF-16) and plenty of legacy 8-bit encodings still in the wild. This raises a key limitation, especially in the field of data science: sometimes we can’t process the non-UTF-8 characters, don’t need them, or need to save on space. Therefore, here are three ways I handle non-UTF-8 characters when reading into a Pandas dataframe:
Find the correct Encoding Using Python
Pandas, by default, assumes utf-8 encoding every time you do pandas.read_csv, and figuring out the correct encoding can feel like staring into a crystal ball. Your best bet is to start with vanilla Python:
with open('file_name.csv') as f:
    print(f)
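This prints the file object, whose encoding attribute is the encoding Python picked from your locale, not something detected from the file’s bytes, so treat it as a hint rather than a diagnosis. The output looks something like this (it will vary by platform; the filename is the placeholder from above):

<_io.TextIOWrapper name='file_name.csv' mode='r' encoding='cp1252'>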