Text File Parsing with Python
I am trying to parse a series of text files and save them as CSV files using Python (2.7.3). All text files have a 4 line long header which needs to be stripped out. The data lines have various delimiters including » (quote), — (dash), : column, and blank space. I found it a pain to code it in C++ with all these different delimiters, so I decided to try it in Python hearing it is relatively easier to do compared to C/C++. I wrote a piece of code to test it for a single line of data and it works, however, I could not manage to make it work for the actual file. For parsing a single line I was using the text object and «replace» method. It looks like my current implementation reads the text file as a list, and there is no replace method for the list object. Being a novice in Python, I got stuck at this point. Any input would be appreciated! Thanks!
# function for parsing the data def data_parser(text, dic): for i, j in dic.iteritems(): text = text.replace(i,j) return text # open input/output files inputfile = open('test.dat') outputfile = open('test.csv', 'w') my_text = inputfile.readlines()[4:] #reads to whole text file, skipping first 4 lines # sample text string, just for demonstration to let you know how the data looks like # my_text = '"2012-06-23 03:09:13.23",4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,"NAN",-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636' # dictionary definition 0-, 1- etc. are there to parse the date block delimited with dashes, and make sure the negative numbers are not effected reps = txt = data_parser(my_text, reps) outputfile.writelines(txt) inputfile.close() outputfile.close()
You should attach a copy of the file you need to parse and the expected output, that way it will be easier to help you.
Parsing text with Python
I hate parsing files, but it is something that I have had to do at the start of nearly every project. Parsing is not easy, and it can be a stumbling block for beginners. However, once you become comfortable with parsing files, you never have to worry about that part of the problem. That is why I recommend that beginners get comfortable with parsing files early on in their programming education. This article is aimed at Python beginners who are interested in learning to parse text files.
In this article, I will introduce you to my system for parsing files. I will briefly touch on parsing files in standard formats, but what I want to focus on is the parsing of complex text files. What do I mean by complex? Well, we will get to that, young padawan.
For reference, the slide deck that I use to present on this topic is available here. All of the code and the sample text that I use is available in my Github repo here.
Why parse files?
First, let us understand what the problem is. Why do we even need to parse files? In an imaginary world where all data existed in the same format, one could expect all programs to input and output that data. There would be no need to parse files. However, we live in a world where there is a wide variety of data formats. Some data formats are better suited to different applications. An individual program can only be expected to cater for a selection of these data formats. So, inevitably there is a need to convert data from one format to another for consumption by different programs. Sometimes data is not even in a standard format which makes things a little harder.
Parse Analyse (a string or text) into logical syntactic components.
I don’t like the above Oxford dictionary definition. So, here is my alternate definition.
Parse Convert data in a certain format into a more usable format.
The big picture
With that definition in mind, we can imagine that our input may be in any format. So, the first step, when faced with any parsing problem, is to understand the input data format. If you are lucky, there will be documentation that describes the data format. If not, you may have to decipher the data format for yourselves. That is always fun.
Once you understand the input data, the next step is to determine what would be a more usable format. Well, this depends entirely on how you plan on using the data. If the program that you want to feed the data into expects a CSV format, then that’s your end product. For further data analysis, I highly recommend reading the data into a pandas DataFrame .
If you a Python data analyst then you are most likely familiar with pandas. It is a Python package that provides the DataFrame class and other functions to do insanely powerful data analysis with minimal effort. It is an abstraction on top of Numpy which provides multi-dimensional arrays, similar to Matlab. The DataFrame is a 2D array, but it can have multiple row and column indices, which pandas calls MultiIndex , that essentially allows it to store multi-dimensional data. SQL or database style operations can be easily performed with pandas (Comparison with SQL). Pandas also comes with a suite of IO tools which includes functions to deal with CSV, MS Excel, JSON, HDF5 and other data formats.
Although, we would want to read the data into a feature-rich data structure like a pandas DataFrame , it would be very inefficient to create an empty DataFrame and directly write data to it. A DataFrame is a complex data structure, and writing something to a DataFrame item by item is computationally expensive. It’s a lot faster to read the data into a primitive data type like a list or a dict . Once the list or dict is created, pandas allows us to easily convert it to a DataFrame as you will see later on. The image below shows the standard process when it comes to parsing any file.
Parsing text in standard format
If your data is in a standard format or close enough, then there is probably an existing package that you can use to read your data with minimal effort.
For example, let’s say we have a CSV file, data.txt:
You can handle this easily with pandas.
Parsing a text file into a list in python
I’m completely new to Python, and I’m trying to read in a txt file that contains a combination of words and numbers. I can read in the txt file just fine, but I’m struggling to get the string into a format I can work with.
import matplotlib.pyplot as plt import numpy as np from numpy import loadtxt f= open("/Users/Jennifer/Desktop/test.txt", "r") lines=f.readlines() Data = [] list=lines[3] i=4 while i
I want a list that contains all the elements in lines 3-12 (starting from 0), which is all numbers. When I do print lines[1], I get the data from that line. When I do print lines, or print lines[3:12], I get each character preceded by \x00. For example, the word "Plate" becomes: ['\x00P\x00l\x00a\x00t\x00e. Using lines = [line.strip() for line in f] gets the same result. When I try to put individual lines together in the while loop above, I get the error "AttributeError: 'str' object has no attribute 'append'." How can I get a selection of lines from a txt file into a list? Thank you so much. Edit: The txt file looks like this: BLOCKS= 1 Plate: Phosphate Noisiness Assay 2000x 1.3 PlateFormat Endpoint Absorbance Raw FALSE 1 1 650 1 12 96 1 8
Temperature(¡C) 1 2 3 4 5 6 7 8 9 10 11 12
21.4 0.4977 0.5074 0.5183 0.5128 0.5021 0.5114 0.4993 0.5308 0.4837 0.5286 0.5231 0.5227
0.488 0.4742 0.5011 0.4868 0.4976 0.4845 0.4848 0.5179 0.4772 0.5363 0.5109 0.5197
0.4882 0.4913 0.4941 0.5188 0.4766 0.4914 0.495 0.5172 0.4826 0.5039 0.504 0.5451
0.4771 0.4875 0.523 0.4851 0.4757 0.4767 0.4918 0.5212 0.4742 0.5153 0.5027 0.5235
0.4474 0.4841 0.5193 0.4755 0.4649 0.4883 0.5165 0.5223 0.4799 0.5269 0.5091 0.5191
0.4721 0.4794 0.501 0.4467 0.4785 0.4792 0.4894 0.511 0.4778 0.5223 0.4888 0.5273
0.4122 0.4454 0.314 0.2747 0.4621 0.4416 0.3716 0.2534 0.4497 0.5778 0.2319 0.1038
0.4479 0.5368 0.3046 0.3115 0.4745 0.5116 0.3689 0.3915 0.4803 0.5209 0.1981 0.1062 ~End Original Filename: 2013-08-06 Phosphate Noisiness; Date Last Saved: 8/6/2013 7:00:55 PM Update I used this code:
f= open("/Users/Jennifer/Desktop/test.txt", "r") file_list = f.readlines() first_twelve = file_list[3:11] data = [x.replace('\t',' ') for x in first_twelve] data = [x.replace('\x00','') for x in data] data = [x.replace(' \r\n','') for x in data] print data
to get this result: [' 21.4 0.4977 0.5074 0.5183 0.5128 0.5021 0.5114 0.4993 0.5308 0.4837 0.5286 0.5231 0.5227 ', ' 0.488 0.4742 0.5011 0.4868 0.4976 0.4845 0.4848 0.5179 0.4772 0.5363 0.5109 0.5197 ', ' 0.4882 0.4913 0.4941 0.5188 0.4766 0.4914 0.495 0.5172 0.4826 0.5039 0.504 0.5451 ', ' 0.4771 0.4875 0.523 0.4851 0.4757 0.4767 0.4918 0.5212 0.4742 0.5153 0.5027 0.5235 ', ' 0.4474 0.4841 0.5193 0.4755 0.4649 0.4883 0.5165 0.5223 0.4799 0.5269 0.5091 0.5191 ', ' 0.4721 0.4794 0.501 0.4467 0.4785 0.4792 0.4894 0.511 0.4778 0.5223 0.4888 0.5273 ', ' 0.4122 0.4454 0.314 0.2747 0.4621 0.4416 0.3716 0.2534 0.4497 0.5778 0.2319 0.1038 ', ' 0.4479 0.5368 0.3046 0.3115 0.4745 0.5116 0.3689 0.3915 0.4803 0.5209 0.1981 0.1062 '] Which is (correct me if I'm wrong, very new to Python!) a list of lists, which I should be able to work with. Thank you so much to everyone who responded.
Parse a .txt file
How can i parsr this file and get the functions with type FUNC ? Also,from this txt how can i parse and extract .o name ? How can i get them by column wise parsing or else how. I need an immediate help. Waiting for an appropriate solution as usual
3 Answers 3
for line in open('thefile.txt'): fields = line.split('|') if len(fields) < 4: continue if fields[3].trim() != 'FUNC': continue dowhateveryouwishwith(line, fields)
Seriously? In a markup-language environment like SO, where markdown or HTML is the norm, you're critiquing the exact number of indent spaces?
It would be nice if this solution could be adapted to solve the other part of the OP's question: how to track the .o file. I like this answer's brevity compared to my own but it doesn't solve the total problem.
I think this might cost less than the use of regexes though i am not totally clear on what you are trying to accomplish
symbolList=[] for line in open('datafile.txt','r'): if '.o' in line: tempname=line.split()[-1][0:-2] pass if 'FUNC' not in line: pass else: symbolList.append((tempname,line.split('|')[0]))
I have learned from other posts it is cheaper and better to wrap up all of the data when you are reading through a file the first time. Thus if you wanted to wrap up the whole datafile in one pass then you could do the following instead
fullDict=<> for line in open('datafile.txt','r'): if '.o' in line: tempname=line.split()[-1][0:-2] if '|' not in line: pass else: tempDict=<> dataList=[dataItem.strip() for dataItem in line.strip().split('|')] name=dataList[0].strip() tempDict['Value']=dataList[1] tempDict['Class']=dataList[2] tempDict['Type']=dataList[3] tempDict['Size']=dataList[4] tempDict['Line']=dataList[5] tempDict['Section']=dataList[6] tempDict['o.name']=tempname fullDict[name]=tempDict tempDict=<>
Then if you want the Func type you would use the following:
funcDict=<> for record in fullDict: if fullDict[record]['Type']=='FUNC': funcDict[record]=fullDict[record]
Sorry for being so obsessive but I am trying to get a better handle on creating list comprehensions and I decided that this was worthy of a shot