Create a UTF-8 CSV file in Python
I can’t create a UTF-8 CSV file in Python. I’m trying to read its docs, and in the examples section, it says:
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8:
values = (unicode("Ñ", "utf-8"), unicode("é", "utf-8"))
f = codecs.open('eggs.csv', 'w', encoding="utf-8")
writer = UnicodeWriter(f)
writer.writerow(values)

line 159, in writerow
    self.stream.write(data)
  File "/usr/lib/python2.6/codecs.py", line 686, in write
    return self.writer.write(data)
  File "/usr/lib/python2.6/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 22: ordinal not in range(128)
Can someone please shed some light so I can understand what the hell I am doing wrong, since I set the encoding everywhere before calling the UnicodeWriter class?
import codecs
import csv
import cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
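For comparison, here is a minimal sketch of the same task under the assumption that Python 3 is available (the question itself targets Python 2.6). In Python 3 the csv module works on text directly, so no UnicodeWriter wrapper is needed. The helper name write_utf8_csv and the temporary directory are illustrative choices, not part of the original question. In the Python 2 code above, the traceback is consistent with UnicodeWriter handing already-encoded bytes to a codecs.open stream that expects unicode. Opening the file with a plain open('eggs.csv', 'wb') is the commonly suggested fix there.

```python
import csv
import os
import tempfile


def write_utf8_csv(path, rows):
    """Write rows of text to a UTF-8 encoded CSV file (Python 3 only)."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(rows)


# Usage: write the two values from the question, then read back the raw bytes.
path = os.path.join(tempfile.mkdtemp(), "eggs.csv")
write_utf8_csv(path, [("Ñ", "é")])
with open(path, "rb") as f:
    raw = f.read()
```

The encoding is stated once, when the file is opened, and every later step deals only in str objects.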
Python CSV Reader Encoding
pandas and csv are two popular libraries for handling CSV files. The csv module ships with Python's standard library, whereas pandas must be installed before you can use it.
When it comes to data manipulation and analysis, pandas reigns supreme because it offers many functions and attributes for such tasks. This article focuses on how these libraries handle encoding when reading CSV files.
Reading CSV data
When reading CSV files (using pandas or csv), the library performs the following steps: decoding, parsing, data conversion (optional), and data fetching.
Decoding: To read a file, the library must first convert a sequence of bytes into characters of a particular charset. This step can be challenging because the library might not know the file’s encoding. If it cannot recognize the encoding, or runs into byte sequences that it cannot decode, it may raise an exception at this point.
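The decoding step can be illustrated without any file at all. The following is a small sketch, assuming Python 3; the sample word and the variable names are chosen here for illustration only.

```python
# The bytes as they would sit on disk in a UTF-8 encoded file.
raw = "Straße".encode("utf-8")

# Decoding with the matching charset recovers the original text.
text = raw.decode("utf-8")

# Decoding with a charset that cannot interpret those bytes raises an error.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(text)
print(ascii_error)
```

The exception object records where decoding failed, which helps when tracking down the offending bytes in a larger file.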
With Python 3 and modern systems handling encodings better, decoding mostly happens seamlessly, without our having to specify the encoding explicitly when loading CSV files. However, the encoding still matters when we want to filter out unwanted characters in a CSV file or, in some cases, get the data in the form we need.
We will save the following simple data into a UTF-8 encoded CSV file named “streets10.csv” and use it for our examples, considering two encodings: ASCII and UTF-8. ASCII covers only 128 characters, which is enough for basic English text, whereas UTF-8 can encode every Unicode character.
Name | Streets |
Bob | NazarethkirtchStraße |
Alex | St Äbràhâm |
Table 1: Example data set that we will use in our example. It is saved with UTF-8 encoding as “streets10.csv”.
In the above data, the following characters are non-ASCII characters: ß, Ä, à, and â. Any attempt to read the CSV file with ASCII encoding will result in decoding errors because of these characters.
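The behaviour described above can be reproduced with the standard csv module alone. This is a sketch under a few assumptions: Python 3, a temporary directory for the file, and a hypothetical helper named read_rows. pandas.read_csv accepts an encoding argument in the same spirit, but it is not used here so that the example runs without extra installs.

```python
import csv
import os
import tempfile

rows = [["Name", "Streets"],
        ["Bob", "NazarethkirtchStraße"],
        ["Alex", "St Äbràhâm"]]

# Save the table as UTF-8 (a temporary directory keeps the example tidy).
path = os.path.join(tempfile.mkdtemp(), "streets10.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)


def read_rows(path, encoding, errors="strict"):
    """Read every row of a CSV file using the given encoding."""
    with open(path, encoding=encoding, errors=errors, newline="") as f:
        return list(csv.reader(f))


# Reading with the encoding the file was saved in works as expected.
utf8_rows = read_rows(path, "utf-8")

# Reading as ASCII fails on the first non-ASCII byte.
try:
    read_rows(path, "ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

# errors="ignore" silently drops the bytes that ASCII cannot decode.
filtered_rows = read_rows(path, "ascii", errors="ignore")
```

Dropping undecodable bytes is one way to filter out unwanted characters, at the cost of losing information: “Straße” becomes “Strae”.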
Trouble with UTF-8 CSV input in Python
This seems like it should be an easy fix, but so far a solution has eluded me. I have a single column csv file with non-ascii chars saved in utf-8 that I want to read in and store in a list. I’m attempting to follow the principle of the «Unicode Sandwich» and decode upon reading the file in:
import codecs
import csv

with codecs.open('utf8file.csv', 'rU', encoding='utf-8') as file:
    input_file = csv.reader(file, delimiter=",", quotechar='|')
    list = []
    for row in input_file:
        list.extend(row)

This produces the dreaded ‘codec can’t encode characters in position, ordinal not in range(128)’ error. I’ve also tried adapting a solution from this answer, which returns a similar error:
def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'inputs\encode.csv'
reader = unicode_csv_reader(open(filename))
target_list = []
for field1 in reader:
    target_list.extend(field1)

def unicode_csv_reader(utf8_data, dialect=csv.excel):
    csv_reader = csv.reader(utf_8_encoder(utf8_data), dialect)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

filename = 'inputs\encode.csv'
reader = unicode_csv_reader(open(filename))
target_list = []
for field1 in reader:
    target_list.extend(field1)
Clearly I’m missing something. Most of the questions that I’ve seen regarding this problem seem to predate Python 2.7, so an update here might be useful.
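As a point of reference, here is a minimal sketch of the "Unicode Sandwich" for this task, assuming Python 3 rather than the Python 2.7 used in the question. In Python 2 the csv module works on byte strings, which is consistent with the "can't encode" error when it is fed the unicode lines produced by codecs.open. The function name read_single_column and the sample values are made up for illustration.

```python
import csv
import os
import tempfile


def read_single_column(path, encoding="utf-8"):
    """Return every cell of a CSV file as one flat list of str."""
    values = []
    # Decode once, at the file boundary; csv.reader then yields str cells.
    with open(path, encoding=encoding, newline="") as f:
        for row in csv.reader(f, delimiter=",", quotechar="|"):
            values.extend(row)
    return values


# Usage with a small stand-in for 'utf8file.csv' (sample values are made up).
path = os.path.join(tempfile.mkdtemp(), "utf8file.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("café\nniño\nStraße\n")

values = read_single_column(path)
```

The result is named values rather than list, so the built-in list type is not shadowed.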
Reading a UTF8 CSV file with Python
I am trying to read a CSV file with accented characters with Python (only French and/or Spanish characters). Based on the Python 2.5 documentation for the csvreader (http://docs.python.org/library/csv.html), I came up with the following code to read the CSV file since the csvreader supports only ASCII.
def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

filename = 'output.csv'
reader = unicode_csv_reader(open(filename))
try:
    products = []
    for field1, field2, field3 in reader:
        ...
0665000FS10120684,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Bleu
0665000FS10120689,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Gris
0665000FS10120687,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx de Canon avec trépied (SD1200IS) - Vert
...
Traceback (most recent call last):
  File ".\Test.py", line 53, in <module>
    for field1, field2, field3 in reader:
  File ".\Test.py", line 40, in unicode_csv_reader
    for row in csv_reader:
  File ".\Test.py", line 46, in utf_8_encoder
    yield line.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 68: ordinal not in range(128)
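The traceback is consistent with a common Python 2 pitfall. The lines coming from open(filename) are already byte strings, so calling line.encode('utf-8') first decodes them implicitly as ASCII and fails at the first accented byte. The sketch below shows the "decode exactly once" idea under the assumption of Python 3. It uses an in-memory byte stream in place of 'output.csv', and two of the sample rows from the question.

```python
import csv
import io

# Two of the sample rows from the question, as UTF-8 bytes on disk would look.
raw = (
    "0665000FS10120684,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx "
    "de Canon avec trépied (SD1200IS) - Bleu\n"
    "0665000FS10120689,SD1200IS,Appareil photo numérique PowerShot de 10 Mpx "
    "de Canon avec trépied (SD1200IS) - Gris\n"
).encode("utf-8")

# Wrap the byte stream so it is decoded exactly once, then parse it as text.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")

products = []
for field1, field2, field3 in csv.reader(text_stream):
    products.append((field1, field2, field3))
```

With a real file, open(filename, encoding="utf-8", newline="") plays the role of the TextIOWrapper, and no separate encoder or decoder generator is required.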