Python CSV reader: reading one row at a time
Okay, I have a CSV file with several lines (more than 40k currently). Because of the large number of lines, I need to read them one by one and run a sequence of operations on each. That is the first question. The second is: how do I read the CSV file and encode it as UTF-8, following the example in the csv documentation? Even when I use the UTF8Recoder class, the output of my print is \xe9 s\xf3 . Could someone help me solve this problem?
    import preprocessing
    import pymongo
    import csv, codecs, cStringIO
    from pymongo import MongoClient
    from unicodedata import normalize
    from preprocessing import PreProcessing

    class UTF8Recoder:
        def __init__(self, f, encoding):
            self.reader = codecs.getreader(encoding)(f)

        def __iter__(self):
            return self

        def next(self):
            return self.reader.next().encode("utf-8")

    class UnicodeReader:
        def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
            f = UTF8Recoder(f, encoding)
            self.reader = csv.reader(f, dialect=dialect, **kwds)

        def next(self):
            '''next() -> unicode
            This function reads and returns the next line as a Unicode string.
            '''
            row = self.reader.next()
            return [unicode(s, "utf-8") for s in row]

        def __iter__(self):
            return self

    with open('data/MyCSV.csv', 'rb') as csvfile:
        reader = UnicodeReader(csvfile)
        #writer = UnicodeWriter(fout,quoting=csv.QUOTE_ALL)
        for row in reader:
            print row

    def status_processing(corpus):
        myCorpus = preprocessing.PreProcessing()
        myCorpus.text = corpus
        print "Starting. "
        myCorpus.initial_processing()
        print "Done."
        print "----------------------------"
Edit 1: The solution from Mr. S Ringne works. But now I cannot run the operations inside my function. Here’s the new code:
    for csvfile in pd.read_csv('data/AracajuAgoraNoticias_facebook_statuses.csv', encoding='utf-8', sep=',', header='infer', engine='c', chunksize=2):
        def status_processing(csvfile):
            myCorpus = preprocessing.PreProcessing()
            myCorpus.text = csvfile
            print "Starting initial processing. "
            myCorpus.initial_processing()
            print "Done."
            print "----------------------------"

    def main():
        status_processing(csvfile)

    main()
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
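For comparison, here is a minimal sketch of the row-at-a-time approach using only the standard library. It assumes Python 3, where open() decodes the file itself, so a wrapper like UTF8Recoder is not needed. The file name sample.csv, its contents, and the process_status function are placeholders invented for illustration; process_status merely stands in for the real PreProcessing work.

```python
import csv
import os
import tempfile

# Hypothetical sample data, written with a BOM to mimic a spreadsheet export.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write("id,message\n1,é só um teste\n2,segunda linha\n")


def process_status(text):
    """Placeholder for the real per-row work (for example PreProcessing)."""
    return text.upper()


results = []
# encoding="utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
with open(path, encoding="utf-8-sig", newline="") as csvfile:
    reader = csv.reader(csvfile)
    header = next(reader)      # the first row holds the column names
    for row in reader:         # rows are produced one at a time
        results.append(process_status(row[1]))

print(header)
print(results)
```

Because the reader is consumed lazily, only one row is held in memory at a time, which is what matters for a file with tens of thousands of lines.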
Reading rows from a CSV file in Python [duplicate]
I know how to read the file in and print each column (for ex. ['Year', '1', '2', '3', etc]). But what I actually want to do is read the rows, which would be like this ['Year', 'Dec', 'Jan'] and then ['1', '50', '60'] and so on. And then I would like to store those numbers ['1', '50', '60'] into variables so I can total them later, for ex.: Year_1 = ['50', '60'] . Then I can do sum(Year_1) = 110 . How would I go about doing that in Python 3?
10 Answers
    import csv

    with open("test.csv", "r") as f:
        reader = csv.reader(f, delimiter="\t")
        for i, line in enumerate(reader):
            print('line[{}] = {}'.format(i, line))

    line[0] = ['Year:', 'Dec:', 'Jan:']
    line[1] = ['1', '50', '60']
    line[2] = ['2', '25', '50']
    line[3] = ['3', '30', '30']
    line[4] = ['4', '40', '20']
    line[5] = ['5', '10', '10']

How would I make it so it prints the lines separately and not all together (ex. line 0 = ['Year:', 'Dec:', 'Jan:'])? I tried print (line[0]) but it didn’t work.
I was getting the following error in python3: iterator should return strings, not bytes (did you open the file in text mode?) and solved it by changing rb to rt .
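The difference described in the comment above can be reproduced with a small sketch. It assumes Python 3; the temporary file and its contents are made up for the demonstration.

```python
import csv
import os
import tempfile

# Hypothetical two-line CSV file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with open(path, "w", newline="") as f:
    f.write("Year,Dec,Jan\n1,50,60\n")

# Binary mode: in Python 3 the reader receives bytes and raises csv.Error.
binary_failed = False
with open(path, "rb") as f:
    try:
        next(csv.reader(f))
    except csv.Error:
        binary_failed = True

# Text mode with newline="" is what the csv module documentation recommends.
with open(path, "r", newline="") as f:
    first_row = next(csv.reader(f))

print(binary_failed, first_row)
```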
@J0ANMM Good callout. This answer was written at a time when Python 3 did not have as wide adoption and was thus implicitly targeted to Python 2. I will update the answer accordingly.
You could do something like this:
    with open("data1.txt") as f:
        lis = [line.split() for line in f]   # create a list of lists
        for i, x in enumerate(lis):          # print the list items
            print("line{0} = {1}".format(i, x))

    # output
    # line0 = ['Year:', 'Dec:', 'Jan:']
    # line1 = ['1', '50', '60']
    # line2 = ['2', '25', '50']
    # line3 = ['3', '30', '30']
    # line4 = ['4', '40', '20']
    # line5 = ['5', '10', '10']

    with open("data1.txt") as f:
        for i, line in enumerate(f):
            print("line {0} = {1}".format(i, line.split()))

    # output
    # line 0 = ['Year:', 'Dec:', 'Jan:']
    # line 1 = ['1', '50', '60']
    # line 2 = ['2', '25', '50']
    # line 3 = ['3', '30', '30']
    # line 4 = ['4', '40', '20']
    # line 5 = ['5', '10', '10']

    with open('data1.txt') as f:
        print("{0}".format(f.readline().split()))
        for x in f:
            x = x.split()
            print("{0} = {1}".format(x[0], sum(map(int, x[1:]))))

    # output
    # ['Year:', 'Dec:', 'Jan:']
    # 1 = 110
    # 2 = 75
    # 3 = 60
    # 4 = 60
    # 5 = 20
Ok got it, now how can I find the element in lis[0]? For example, I need to total the month numbers (50+60) so for year 1 it would be 110. lis[0][0] doesn’t work for me. That was my main goal.
Sorry, I thought once I could read the columns I could figure it out myself. But your edited method isn’t working for my "actual" file for some reason. See: i.imgur.com/EORK2.png. What I was trying to do is store each of the totals in a variable, so year1 = 110, etc. I’m not trying to just print it out, sorry for being so vague. I thought it would’ve been easier to do when I posted the question.
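For the goal described in the comment above, keeping each year's total rather than printing it, one option is a dictionary keyed by year instead of separate variables such as year1. This is only a sketch: it assumes Python 3 and tab-separated data, and the in-memory sample stands in for the real file. Note that sum() needs numbers, so the strings are converted with int() first.

```python
import csv
import io

# Hypothetical tab-separated data matching the sample output in this thread.
data = "Year:\tDec:\tJan:\n1\t50\t60\n2\t25\t50\n3\t30\t30\n"

totals = {}
reader = csv.reader(io.StringIO(data), delimiter="\t")
header = next(reader)                            # skip the header row
for row in reader:
    year, *months = row                          # first column is the year
    totals[year] = sum(int(m) for m in months)   # convert strings before summing

print(totals["1"])
```

The totals can then be looked up later by year, for example totals["2"].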
Reading it columnwise is harder?
Anyway this reads the line and stores the values in a list:
    for line in open("csvfile.csv"):
        csv_row = line.split()  # returns a list ["1","50","60"]

    # pip install pandas
    import pandas as pd
    df = pd.read_table("csvfile.csv", sep=" ")
Size hasn’t got anything to do with it. It is with your program, which we would need to see to help you further 🙂
The easiest way is this:

    from csv import reader

    # open file in read mode
    with open('file.csv', 'r') as read_obj:
        # pass the file object to reader() to get the reader object
        csv_reader = reader(read_obj)
        # Iterate over each row in the csv using reader object
        for row in csv_reader:
            # row variable is a list that represents a row in csv
            print(row)

    # output:
    # ['Year:', 'Dec:', 'Jan:']
    # ['1', '50', '60']
    # ['2', '25', '50']
    # ['3', '30', '30']
    # ['4', '40', '20']
    # ['5', '10', '10']

    import csv

    with open('filepath/filename.csv', "rt", encoding='ascii') as infile:
        read = csv.reader(infile)
        for row in read:
            print(row)
This will solve your problem. Don’t forget to give the encoding.
    # This program reads columns in a csv file
    import csv

    ifile = open('years.csv', "r")
    reader = csv.reader(ifile)

    # initialization and declaration of variables
    rownum = 0
    year = 0
    dec = 0
    jan = 0
    total_years = 0

    for row in reader:
        if rownum == 0:
            header = row  # work with header row if you like
        else:
            colnum = 0
            for col in row:
                if colnum == 0:
                    year = float(col)
                if colnum == 1:
                    dec = float(col)
                if colnum == 2:
                    jan = float(col)
                colnum += 1
        # end of if structure

        # now we can process results
        if rownum != 0:
            print(year, dec, jan)
            total_years = total_years + year
            print(total_years)

        # time to go after the next row/bar
        rownum += 1

    ifile.close()

A bit late but nonetheless. You need to create and identify the csv file named "years.csv":

    Year Dec Jan
    1 50 60
    2 25 50
    3 30 30
    4 40 20
    5 10 10
Example:
    import pandas as pd

    data = pd.read_csv('data.csv')

    # read row line by line
    for d in data.values:
        # read column by index
        print(d[2])
CSV read specific row
I have a CSV file with 100 rows. How do I read specific rows? I want to read say the 9th line or the 23rd line etc?
6 Answers
You could use a list comprehension to filter the file like so:
    import csv

    with open('file.csv') as fd:
        reader = csv.reader(fd)
        interestingrows = [row for idx, row in enumerate(reader) if idx in (28, 62)]
    # now interestingrows contains the 28th and the 62nd row after the header

How do I read a single line? With the input below I get an error: interestingrows=[row for idx, row in enumerate(myreader) if idx in (28)] raises TypeError: argument of type 'int' is not iterable
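The error in the comment above comes from Python syntax rather than from the csv module: (28) is just the integer 28 in parentheses, while a one-element tuple needs a trailing comma, (28,). A short sketch, using a made-up 40-line in-memory file:

```python
import csv
import io

# Hypothetical 40-line file with one field per line: row0, row1, ...
data = "".join("row{}\n".format(i) for i in range(40))

print(type((28)).__name__)    # parentheses alone do not make a tuple
print(type((28,)).__name__)   # the trailing comma does

reader = csv.reader(io.StringIO(data))
wanted = [row for idx, row in enumerate(reader) if idx in (28,)]

# Equivalent, and arguably clearer when only one row is needed:
reader = csv.reader(io.StringIO(data))
same = [row for idx, row in enumerate(reader) if idx == 28]
```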
Use list to grab all the rows at once as a list. Then access your target rows by their index/offset in the list. For example:
    #!/usr/bin/env python
    import csv

    with open('source.csv') as csv_file:
        csv_reader = csv.reader(csv_file)
        rows = list(csv_reader)

    print(rows[8])
    print(rows[22])
You simply skip the necessary number of rows:
    import csv

    with open("test.csv") as infile:
        r = csv.reader(infile)
        for i in range(8):   # count from 0 to 7
            next(r)          # and discard the rows
        row = next(r)        # "row" contains row number 9 now
You could read all of them and then use normal lists to find them.
    import csv

    with open('bigfile.csv') as longishfile:
        reader = csv.reader(longishfile)
        rows = [r for r in reader]

    print(rows[9])
    print(rows[88])
If you have a massive file, this can kill your memory but if the file’s got less than 10,000 lines you shouldn’t run into any big slowdowns.
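If memory is a concern, a single row can be fetched without building the whole list. The sketch below uses itertools.islice from the standard library; read_row is a hypothetical helper name, and the in-memory data stands in for a real file.

```python
import csv
import io
from itertools import islice


def read_row(csv_file, index):
    """Return the row at zero-based position index, or None if the file is shorter."""
    reader = csv.reader(csv_file)
    # islice skips the first `index` rows without storing them.
    return next(islice(reader, index, index + 1), None)


# Hypothetical 100-row data: each row holds a number and its square.
data = "".join("{},{}\n".format(i, i * i) for i in range(100))

ninth = read_row(io.StringIO(data), 8)       # the 9th line of the file
missing = read_row(io.StringIO(data), 500)   # past the end of the file
```

The file is still read from the start up to the requested row, but earlier rows are discarded as they go by.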
You can do something like this :
    import csv

    with open('raw_data.csv') as csvfile:
        readCSV = list(csv.reader(csvfile, delimiter=','))
        row_you_want = readCSV[index_of_row_you_want]

Maybe this could help you: using pandas you can easily do it with loc.

    '''
    Reading the 3rd record using pandas -> (loc)
    Note: the index starts from 0,
    so to read the third record use 3-1 -> 2.
    loc[[2], :] -> [2] selects the third row and : selects all of its columns
    '''
    import pandas as pd

    df = pd.read_csv('employee_details.csv')
    df.loc[[2], :]
How to read a specific row of a csv file in python?
I have searched like crazy trying to find specifically how to read a row in a csv file. I need to read a random row out of 1000, each of which has 3 columns. The first column has an email. I need to put in a random email, and get columns 2 and 3 out. (Python 2.7, csv file) Example:
    Name  Date  Color
    Ray   May   Gray
    Alex  Apr   Green
    Ann   Jun   Blue
    Kev   Mar   Gold
    Rob   May   Black
Instead of column 1 row 3, I need [Ann], her whole row. This is a CSV file, with over 1000 names. I have to put in her name and output her whole row. What I have tried
    from collections import namedtuple
    Entry = namedtuple('Entry', 'Name, Date, Color')
    file_location = "C:/Users/abriman/Desktop/Book.csv"
    ss_dict = {}
    spreadsheet = file_location = "C:/Users/abriman/Desktop/Book.csv"
    for row in spreadsheet:
        entry = Entry(*tuple(row))
        ss_dict['Ann']

    Traceback (most recent call last):
      File "", line 2, in
    TypeError: __new__() takes exactly 4 arguments (2 given)

3 Answers
You’re on the right track. First issue: you’re never opening the file located at file_location . Thus, when you iterate for row in spreadsheet: , you’re iterating over the characters of spreadsheet , which are the characters of file_location , which are the characters of "C:/Users/..." . So the first thing you want to do is actually open the file:
spreadsheet = open(file_location, 'r')
You still have another issue in your loop. When you iterate over a file in a for loop, you get back the lines of the file. So, at each iteration, row will be a line, e.g. "Ray May Gray" . When you call tuple() on that, you’re going to get a tuple that looks like ('R', 'a', 'y', ' ', ' ', 'M', ...) . What you need to do is construct your tuple by splitting the line on whitespace, for example with row.split() .
Then, you need to add your entry to the dictionary ss_dict , keyed by the name.
Finally, you can read out the value of ss_dict[‘Ann’] , but this should be outside your loop — if you do it inside your loop, you may be trying to read the value of ss_dict[‘Ann’] before it has been set. All in all, your code should look like this:
    from collections import namedtuple
    Entry = namedtuple('Entry', 'Name, Date, Color')
    file_location = "C:/Users/abriman/Desktop/Book.csv"
    ss_dict = {}
    spreadsheet = open(file_location, 'r') #
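The listing above is cut off in the source. What follows is not the answerer's original code but a sketch of how the steps described in this answer could fit together: open the file, split each line on whitespace, store each entry under its name, and read ss_dict['Ann'] after the loop. It assumes the file is whitespace-separated with exactly three fields per line, and it writes a temporary stand-in for Book.csv so that it can run anywhere.

```python
import os
import tempfile
from collections import namedtuple

Entry = namedtuple('Entry', 'Name, Date, Color')

# Stand-in for Book.csv: a temporary file holding some of the sample rows.
file_location = os.path.join(tempfile.mkdtemp(), "Book.csv")
with open(file_location, "w") as f:
    f.write("Name Date Color\nRay May Gray\nAlex Apr Green\nAnn Jun Blue\n")

ss_dict = {}
with open(file_location, 'r') as spreadsheet:   # actually open the file
    for row in spreadsheet:
        entry = Entry(*row.split())             # split the line on whitespace
        ss_dict[entry.Name] = entry             # key each entry by its name

print(ss_dict['Ann'])                           # read the value outside the loop
```

In this sketch the header line is stored too, under the key 'Name', which is harmless for lookups by a real name. A blank line or a line with more or fewer than three fields would raise a TypeError when building the Entry.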
Incidentally, the reason you're getting your error message there is that when you do for row in spreadsheet: with spreadsheet being a string, row is just a character, as I mentioned, and so tuple(row) is just a tuple containing one character, and hence is of length 1, so that you're only passing one argument rather than three when you do *tuple(row) .
All that said, you might want to consider looking at the csv module, which is part of the standard library, and is precisely designed for reading csv files. It will probably make your life easier in the long run.
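As an illustration of that suggestion, here is a minimal sketch using csv.DictReader. It assumes the data is comma-separated with a header row, which the sample in the question does not confirm, and it uses Python 3 with in-memory data in place of the real file.

```python
import csv
import io

# Hypothetical comma-separated version of the sample data.
data = "Name,Date,Color\nRay,May,Gray\nAlex,Apr,Green\nAnn,Jun,Blue\n"

# DictReader uses the first row as field names and yields one dict per row.
rows_by_name = {row["Name"]: row for row in csv.DictReader(io.StringIO(data))}

ann = rows_by_name["Ann"]
print(ann["Date"], ann["Color"])
```

Building the dictionary once makes every later lookup by name a constant-time operation, which suits a file with over 1000 names.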