Python threading file processing

Multithreaded File Saving in Python

Saving files one-by-one can become painfully slow when you need to save thousands of them. The hope is that multithreading can be used to speed up file saving.

In this tutorial, you will explore how to save thousands of files using multithreading.

Save Files One-By-One (slowly)

Saving a file in Python is relatively straightforward.

First, we can open the file and get a file handle using the built-in open() function. This is best used via the context manager so that the file is closed automatically once we are finished with it.

When calling open() we specify the path of the file to open and the mode in which to open it, such as ‘w’ for writing ASCII data.

Once open, we can write data to the file. When we exit the context manager block, the file is closed automatically for us.

Tying this together, the save_file() function below takes a file path and ASCII data and writes it to a file.
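A minimal sketch of what save_file() might look like, based on the description above:

# save ASCII data to a file (sketch based on the description above)
def save_file(filepath, data):
    # open the file in write mode via the context manager
    with open(filepath, 'w') as handle:
        # write the contents; the file is closed automatically on exit
        handle.write(data)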

Next we need to create some data to write.

To keep things simple we will generate CSV data files composed of lines of numbers separated by commas.

We will seed each file with a unique integer and use that in the calculation of each numeric value, combined with the line number and column number. This is to ensure that the content of each file is unique in order to simulate a moderately real-world scenario.

The generate_file() function below will generate a CSV file in memory and return it as a string that we might later save to a file.

The function takes an identifier which is a numerical seed for the values saved in the file. It also uses default arguments that specify the number of values to create per line (defaults to 10) and the number of lines to create in the file (defaults to 5,000).

This will result in just under one megabyte of CSV data per file.
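A sketch of generate_file() consistent with the description is below; the exact formula used to derive each value from the identifier, line number and column number is an assumption, so the resulting file size may differ slightly:

# generate a CSV file in memory and return it as a string (sketch)
def generate_file(identifier, n_values=10, n_lines=5000):
    lines = []
    for i in range(n_lines):
        # derive each value from the file's seed, the line number and the column number
        # (this particular formula is illustrative)
        values = [str(1.0 / (identifier + i + j + 1)) for j in range(n_values)]
        lines.append(','.join(values))
    # join all lines into a single block of CSV data
    return '\n'.join(lines)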

Next we can create a function that defines the task of generating the data for one file and saving it to disk.

It will first call the generate_file() function to generate data.

Next, we will use the os.path.join() function to create a relative file path for the file and construct a unique filename that includes the unique identifier used to seed the content of the file.

Finally, the file contents are saved to the file path and progress of the task is reported.

The generate_and_save() function below ties this together, taking the directory path where we wish to save files and the unique identifier for the file as arguments.
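A sketch of generate_and_save(), reusing the save_file() and generate_file() sketches above; the data-NNNN.csv filename pattern matches the output files reported below:

import os

# generate the data for one file and save it to disk (sketch)
def generate_and_save(path, identifier):
    # generate the CSV data in memory
    data = generate_file(identifier)
    # construct a unique filename that includes the identifier
    filepath = os.path.join(path, f'data-{identifier:04d}.csv')
    # save the contents to the file and report progress
    save_file(filepath, data)
    print(f'.saved {filepath}')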

The final step involves creating the sub-directory where we wish to save all of the files.

First, we can call the os.makedirs() function to create the directory, e.g. tmp/ relative to the current working directory from which the Python script is run.

Next, we can create a fixed number of files, one-by-one by calling the generate_and_save() function.

The main() function implements this and uses default arguments to specify the sub-directory where to save files (e.g. tmp/) and the number of files to create (e.g. 5,000).

Tying this together, the complete example of creating 5,000 files sequentially is listed below.
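The remaining piece is the main() function and the entry point; together with the three functions sketched above, they form a complete, runnable approximation of the sequential program:

# create all files sequentially (sketch)
def main(path='tmp', n_files=5000):
    # create the output sub-directory if it does not already exist
    os.makedirs(path, exist_ok=True)
    # generate and save each file one-by-one
    for i in range(n_files):
        generate_and_save(path, i)

# entry point
if __name__ == '__main__':
    main()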

Running the example first creates the unique content for each file, then saves the content with a unique filename from tmp/data-0000.csv to tmp/data-4999.csv.

Each file is just under one megabyte, and given that we have 5,000 files, that is about 5 gigabytes of data.

The program will take a moment to run and the running time will depend on the speed of your hardware, especially your CPU and hard drive speeds.

On my system, it completes in about 63.4 seconds.

How long did it take to run on your system?
Let me know in the comments below.

Next, let’s consider how we might speed this program up using concurrency.


How to Save Files Concurrently

We might think of the task of saving files to disk as being composed of two elements:

  1. Preparing the data in memory (e.g. generating it in this case).
  2. Writing data from main memory to disk.

Preparing the data that will be saved is a CPU-bound task. This means that it will be performed as fast as the CPU can run.

We can generate data for files concurrently, but the content of each file is about one megabyte and will consume main memory. For example, generating all 5,000 files before saving them would occupy at minimum five gigabytes of main memory, and likely two or more times that.

Saving data to file is IO-bound. This means that the task is limited by the speed of the hardware in transferring data from main memory to the disk. It will run as fast as the disk can lay down bytes, which may be relatively fast for solid-state drives (SSDs) and slower for those with moving parts (spindle disks).

Writing data to disk is likely the slowest part of the program.

Threads would be appropriate for writing files concurrently. This is because it is computationally cheap to create hundreds or thousands of threads and each thread will release the global interpreter lock when performing IO operations like writing to a file, allowing other threads to progress.

Nevertheless, I suspect that most modern hard drives do not support parallel writes. That is, they can probably only write one file (or one byte or word of data) to disk at a time.

This means that even if we can make writing files concurrent using threads, it will very likely not offer any speed-up.

That being said, it does take a little bit of time to prepare the data for each file in memory before writing. It may be possible to focus on this aspect and make it parallel using processes.

Processes would be appropriate for making the data generation concurrent, as each process will be independent (e.g. have an independent global interpreter lock), allowing each to potentially run on a separate CPU core, as chosen by the underlying operating system.

Finally, it may be possible to hybridise the thread and process approaches.

Communication between processes is computationally expensive, so generating data in one process and sending it to another for saving would likely not be viable.

Nevertheless, it may be possible to have data generation occur within each process and have that process save files somewhat concurrently using its own additional worker threads. Whether this is effective really depends on how much effort is required to generate the files compared to saving them.

Next, let’s start to explore the use of threads to speed up file creation and saving.

Save Files Concurrently With Threads

The ThreadPoolExecutor provides a pool of worker threads that we can use to multithread the saving of thousands of data files.

Each file in the tmp/ directory will represent a task that will require the file to be created and saved to disk by a call to our generate_and_save() function.

Adapting our program to use the ThreadPoolExecutor is relatively straightforward.

First, we can create the thread pool and specify the number of worker threads. We will use 100 in this case, chosen arbitrarily.
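A sketch of how the main() function might be adapted, submitting one generate_and_save() task per file to a pool of 100 worker threads and reusing the functions from the sequential version:

import os
from concurrent.futures import ThreadPoolExecutor

# create all files concurrently with a pool of worker threads (sketch)
def main(path='tmp', n_files=5000):
    # create the output sub-directory if it does not already exist
    os.makedirs(path, exist_ok=True)
    # create the thread pool with 100 workers (chosen arbitrarily)
    with ThreadPoolExecutor(100) as executor:
        # submit one generate-and-save task per file
        _ = [executor.submit(generate_and_save, path, i) for i in range(n_files)]
        # exiting the context manager waits for all tasks to complete

# entry point
if __name__ == '__main__':
    main()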


Processing large data files with Python multithreading

We spend a lot of time waiting for some data preparation task to finish (the destiny of data scientists, you would say). Well, we can speed things up. Here are two techniques that will come in handy: memory-mapped files and multithreading.

The data

I recently had to extract terms and term frequencies from the Google Books Ngram corpus and found myself wondering if there are ways to speed up the task. The corpus consists of twenty-six files totalling 24GB of data. Each of the files I was interested in contains a term and other metadata, tab-separated. The brute-force approach of reading these files as pandas data frames was … slow. Since we wanted only the unique terms and their match counts, I thought I would try to make it faster 🙂

Memory mapped files

This technique is not new. It has been around for a long time and originated in Unix (before Linux!). Briefly, mmap bypasses the usual I/O buffering by mapping the contents of a file into pages of memory. This works very well for computers with plenty of memory, which is mostly fine with today’s desktops and laptops, where having 32GB of RAM is no longer in the esoteric department. The Python mmap module mimics most of the Unix functionality and offers a handy readline() method to extract the bytes one line at a time.

import mmap

# map the entire file into memory
mm = mmap.mmap(fp.fileno(), 0)
# iterate over the mapping, one line at a time
for line in iter(mm.readline, b""):
    # convert the bytes to a utf-8 string and split the tab-separated fields
    fields = line.decode("utf-8").split("\t")
    term = fields[0]

Here, fp is a file pointer that was previously opened with the r+b access mode. There you go: with this simple tweak you have made file reading twice as fast (well, the exact improvement will depend on a lot of things, such as the disk hardware).

Multithreading

The next technique that always helps in making things faster is adding parallelism. In our case, the task was I/O bound. That is a good fit for scaling up, i.e. adding threads. You will find good discussions elsewhere on when it is better to scale out instead (multiprocessing).

Python 3 has a great standard library module, concurrent.futures, for managing a pool of threads and dynamically assigning tasks to them, all with an incredibly simple API.

from concurrent.futures import ThreadPoolExecutor

# use the default number of worker threads: min(32, os.cpu_count() + 4) since Python 3.8
with ThreadPoolExecutor() as threads:
    t_res = threads.map(process_file, files)

The default value of max_workers for ThreadPoolExecutor is min(32, os.cpu_count() + 4) as of Python 3.8 (earlier versions used five threads per CPU core). The map() API takes a function and applies it to each member of a list, running the calls as threads become available. Wow. That simple. In less than fifty minutes I had converted the 24GB input into a handy 75MB dataset to be analysed with pandas. Voilà.

The complete code is on GitHub. Comments and remarks are always welcome.

PS: I added a progress bar with tqdm for each thread. I really don’t know how they manage to avoid scrambling of the lines on the screen … It works like a charm.

UPDATE: Two years later, this came up 🙂

