Working with .data files in Python

Contents

How to read .data files in Python?
What is a .data file?
Identifying data inside .data files
1. Testing: Text file
2. Testing: Binary File
3. Using Pandas to read .data files
What are the other types of formats to store data?
1. JSON Files
2. Pickle
Conclusion

How to read .data files in Python?

While entering and collecting data for training models, we sometimes come across .data files.

This file extension is used by a few software packages to store data. One example is Analysis Studio, which specializes in statistical analysis and data mining.

Working with a .data file is fairly simple: it mostly comes down to identifying how the data is laid out, and then using the appropriate Python commands to read it.

What is a .data file?

.data files were developed as a means to store data.

Much of the time, data in this format is stored as either comma-separated values or tab-separated values.

In addition, the file may be stored either as text or as binary, and each case calls for a different way of opening it.

We will be working with CSV-style content in this article, but let us first identify whether the content of the file is text or binary.

Identifying data inside .data files

.data files come in two variations: the content is stored either as text or as binary.

To find out which one we have, we need to open the file and test it ourselves.

1. Testing: Text file

Most .data files are text files, and reading text files in Python is simple.

File handling is built into Python, so there is no need to import any module to work with files.

The way to open, read, and write to a file in Python is as follows:

    # reading from the file
    file = open("biscuits.data", "r")
    content = file.read()
    file.close()
    print(content)

    # writing to the file (the "w" mode overwrites any existing content)
    file = open("biscuits.data", "w")
    file.write("Chocolate Chip")
    file.close()

2. Testing: Binary File

A .data file could also be a binary file. This means that the way we access the file needs to change.

We open the file in a binary mode instead: rb (read binary) for reading and wb (write binary) for writing.

    # reading from the file
    file = open("biscuits.data", "rb")
    content = file.read()  # returns a bytes object
    file.close()

    # writing to the file (binary mode expects bytes, not str)
    file = open("biscuits.data", "wb")
    file.write(b"Oreos")
    file.close()

File operations are relatively easy to understand in Python and are worth looking into if you wish to see the different file access modes and methods to access them.

Either one of these approaches should work, and should provide you with a method to retrieve the information regarding the contents stored inside the .data file.
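The text-or-binary test described above can also be automated. The following is a minimal sketch of one common heuristic, not a definitive check: it reads the first bytes of the file and treats a NUL byte or a UTF-8 decoding failure as a sign of binary content. The function name looks_like_text, the sample size, and the demo file names are assumptions made for illustration.

```python
import codecs
import os
import tempfile


def looks_like_text(path, sample_size=4096, encoding="utf-8"):
    """Guess whether a file holds text by decoding its first bytes.

    Heuristic only: a NUL byte or a decoding error is treated as "binary".
    """
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if b"\x00" in sample:
        return False
    # An incremental decoder with final=False tolerates a multi-byte
    # character that happens to be cut off at the end of the sample.
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


# Demo on two throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = os.path.join(tmp_dir, "text.data")
    binary_path = os.path.join(tmp_dir, "binary.data")
    with open(text_path, "w", encoding="utf-8") as f:
        f.write("name,price\nbiscuit,40\n")
    with open(binary_path, "wb") as f:
        f.write(b"\xff\xfe\xfd")  # not valid UTF-8
    text_result = looks_like_text(text_path)
    binary_result = looks_like_text(binary_path)

print(text_result, binary_result)
```

A file that passes this check can be opened in "r" mode; otherwise "rb" is the safer choice. Text stored in an encoding other than UTF-8 may be misclassified, so the encoding argument may need adjusting.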

Now that we know which format the file is stored in, we can use pandas to create a DataFrame from the CSV content.

3. Using Pandas to read .data files

Once the type of content has been checked, a simple way to extract the data from these files is the read_csv() function provided by pandas.

    import pandas as pd

    # reading csv files
    data = pd.read_csv('file.data', sep=",")
    print(data)

    # reading tsv files
    data = pd.read_csv('otherfile.data', sep="\t")
    print(data)

This method also converts the data into a DataFrame automatically.

Below is the output for a sample CSV file that was saved as a .data file and read using the same code as given above.

        Series reference  Description                                         Period   Previously published  Revised
    0   PPIQ.SQU900000    PPI output index - All industries                   2020.06  1183                  1184
    1   PPIQ.SQU900001    PPI output index - All industries excl OOD          2020.06  1180                  1181
    2   PPIQ.SQUC76745    PPI published output commodity - Transport sup...   2020.06  1400                  1603
    3   PPIQ.SQUCC3100    PPI output index level 3 - Wood product manufa...   2020.06  1169                  1170
    4   PPIQ.SQUCC3110    PPI output index level 4 - Wood product manufa...   2020.06  1169                  1170
    ..  ...               ...                                                 ...      ...                   ...
    73  PPIQ.SQNMN2100    PPI input index level 3 - Administrative and s...   2020.06  1194                  1195
    74  PPIQ.SQNRS211X    PPI input index level 4 - Repair & maintenance      2020.06  1126                  1127
    75  FPIQ.SEC14        Farm expenses price index - Dairy farms - Freight   2020.06  1102                  1120
    76  FPIQ.SEC99        Farm expenses price index - Dairy farms - All ...   2020.06  1067                  1068
    77  FPIQ.SEH14        Farm expenses price index - All farms - Freight     2020.06  1102                  1110

    [78 rows x 5 columns]

As you can see, it has indeed given us a DataFrame as an output.
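When it is not known in advance whether a .data file uses commas or tabs, the standard library's csv.Sniffer can make a guess from a sample of the text. The sketch below is a heuristic under stated assumptions: guess_delimiter and both sample strings are hypothetical and are not taken from the file shown above.

```python
import csv


def guess_delimiter(sample, candidates=",\t"):
    """Return the delimiter csv.Sniffer considers most likely for the sample.

    Raises csv.Error if no candidate fits the sample consistently.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Hypothetical samples: the same small table, once with commas, once with tabs.
comma_sample = "name,price,stock\napple,40,12\nbanana,60,7\n"
tab_sample = "name\tprice\tstock\napple\t40\t12\nbanana\t60\t7\n"

comma_guess = guess_delimiter(comma_sample)
tab_guess = guess_delimiter(tab_sample)

print(repr(comma_guess), repr(tab_guess))
```

The returned character can then be passed as the sep argument of read_csv(). In practice the sample would be the first few kilobytes read from the file rather than a hard-coded string.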

What are the other types of formats to store data?

Sometimes, the default method to store data just doesn’t cut it. So, what are the alternatives to working with file storage?

1. JSON Files

JSON is a convenient format for storing structured information, and Python's built-in json module makes working with it feel nearly seamless.

However, to work with it in Python, you'll need to import the json module in your script.

Once you have a JSON-compatible structure, storing it is a simple file operation with json.dump(). (Its sibling json.dumps() returns a string and does not write to a file.)

    import json

    # dumping the structure into the file as JSON
    with open("file.json", "w") as f:
        json.dump(['foo', ], f)

    # you can also sort the keys and pretty-print the output using this module
    with open("file.json", "w") as f:
        json.dump(['foo', ], f, indent=4, sort_keys=True)

Note that we are dumping into the file using the variable f.

The equivalent function to retrieve information from a JSON file is json.load().

    with open('file.json') as f:
        data = json.load(f)

This provides us with the structure and information of the JSON object inside the file.
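To see the write and read steps working together, here is a small round-trip sketch. The record dictionary and the temporary file location are hypothetical, chosen only for illustration. It also shows one limitation worth knowing: JSON has no tuple type, so a tuple comes back as a list.

```python
import json
import os
import tempfile

# Hypothetical sample data; any JSON-compatible structure works here.
record = {"name": "biscuits", "flavours": ["chocolate chip", "oat"], "stock": 12}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.json")

    # Write the structure to disk as pretty-printed JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=4, sort_keys=True)

    # Read it back into a Python object.
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

# The dictionary survives the round trip unchanged.
print(restored == record)

# A tuple does not: JSON arrays are always loaded as lists.
as_list = json.loads(json.dumps((1, 2)))
print(type(as_list).__name__)
```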

2. Pickle

Normally, when you write information to a file, it is stored as a raw string. The object loses its properties, and we need to reconstruct it from the string in Python.

The pickle module addresses this issue: it serializes and de-serializes Python object structures so that they can be stored in a file.

This means that you can store a list through pickle, and when it is loaded by the pickle module later, you will not lose any of the properties of the list object.

To use it, we need to import the pickle module. There is no need to install it, as it is part of the Python standard library.

Let us create a dictionary to use in the pickle operations that follow.

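    # placeholder values: the original values are not shown here
    apple = {"name": "Apple"}
    banana = {"name": "Banana"}
    orange = {"name": "Orange"}

    fruitShop = {}
    fruitShop["apple"] = apple
    fruitShop["banana"] = banana
    fruitShop["orange"] = orange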

Working with the pickle module is just about as simple as working with JSON.

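    import pickle

    # the 'wb' mode writes to the file in binary format, replacing any
    # previous content ('ab' would append another pickled object, and a
    # single load() call would then return only the first one)
    file = open('fruitPickles', 'wb')
    # the dump method writes the object to the file in a serialized format
    pickle.dump(fruitShop, file)
    file.close()

    # now, we can read the object back from the file through the load function
    # (only unpickle files that come from a source you trust)
    file = open('fruitPickles', 'rb')
    fruitShop = pickle.load(file)
    file.close()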
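The claim that pickle keeps an object's properties can be checked with a short in-memory round trip. This is a minimal sketch with a hypothetical basket dictionary. It uses pickle.dumps() and pickle.loads(), the bytes-based counterparts of dump() and load(), so no file is needed.

```python
import pickle

# Hypothetical sample object containing a tuple and a set,
# two types that a plain string or JSON cannot represent directly.
basket = {"fruits": ("apple", "banana"), "sizes": {1, 2, 3}}

# Storing the object as a raw string loses its structure: we only get text.
as_string = str(basket)
print(type(as_string).__name__)

# Pickling and unpickling returns an equal object with the same types.
restored = pickle.loads(pickle.dumps(basket))
print(restored == basket)
print(type(restored["fruits"]).__name__, type(restored["sizes"]).__name__)
```

One caveat from the pickle documentation: unpickling data from an untrusted source can execute arbitrary code, so pickle is best kept for files you created yourself.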

Conclusion

You now know what .data files are and how to work with them. You also know some of the other options available for storing and retrieving data.

Look into our other articles for an in-depth tutorial on each of these modules – File Handling, Pickle, and JSON.
