- How to merge multiple CSV files with Python
- Steps to merge multiple CSV(identical) files with Python
- Step 1: Import modules and set the working directory
- Step 2: Match CSV files by pattern
- Step 3: Combine all files in the list and export as CSV
- Full Code
- Steps to merge multiple CSV(identical) files with Python with trace
- Combine multiple CSV files when the columns are different
- Bonus: Merge multiple files with Windows/Linux
- Linux
- Windows
- How To Merge Large CSV files Into A Single File With Python
- Dive into Python and learn how to automate tasks like merging chunky CSV or Excel files using few lines of code.
- Suggested On-Demand Courses:
- Introduction
- How to combine CSV files using Python?
- Combining Multiple CSV Files together
- Different Ways to Combine CSV Files in Python
- Method 1: append()
- Method 2: concat()
- Method 3: merge()
- Conclusion
- References
How to merge multiple CSV files with Python
In this guide, I’ll show you several ways to merge/combine multiple CSV files into a single one by using Python (it’ll work as well for text and other files). There will be bonus — how to merge multiple CSV files with one liner for Linux and Windows. Finally with a few lines of code you will be able to combine hundreds of files with full control of loaded data — you can convert all the CSV files into a Pandas DataFrame and then mark each row from which CSV file is coming.
Steps to merge multiple CSV(identical) files with Python
Note: that we assume — all files have the same number of columns and identical information inside
Short code example — concatenating all CSV files in Downloads folder:
import pandas as pd import glob path = r'~/Downloads' all_files = glob.glob(path + "/*.csv") all_files
Step 1: Import modules and set the working directory
First we will start with loading the required modules for the program and selecting working folder:
import os, glob import pandas as pd path = "/home/user/data/"
Step 2: Match CSV files by pattern
Next step is to collect all files needed to be combined. This will be done by:
all_files = glob.glob(os.path.join(path, "data_*.csv"))
The next code: data_*.csv match only files:
You can customize the selection for your needs having in mind that regex matching is used.
Step 3: Combine all files in the list and export as CSV
The final step is to load all selected files into a single DataFrame and converted it back to csv if needed:
df_merged = (pd.read_csv(f, sep=',') for f in all_files) df_merged = pd.concat(df_from_each_file, ignore_index=True) df_merged.to_csv( "merged.csv")
Note that you may change the separator by: sep=’,’ or change the headers and rows which to be loaded
You can find more about converting DataFrame to CSV file here: pandas.DataFrame.to_csv
Full Code
Below you can find the full code which can be used for merging multiple CSV files.
import os, glob import pandas as pd path = "/home/user/data/" all_files = glob.glob(os.path.join(path, "data_*.csv")) df_from_each_file = (pd.read_csv(f, sep=',') for f in all_files) df_merged = pd.concat(df_from_each_file, ignore_index=True) df_merged.to_csv( "merged.csv")
Steps to merge multiple CSV(identical) files with Python with trace
Now let’s say that you want to merge multiple CSV files into a single DataFrame but also to have a column which represents from which file the row is coming. Something like:
row | col | col2 | file |
---|---|---|---|
1 | A | B | data_201901.csv |
2 | C | D | data_201902.csv |
This can be achieved very easy by small change of the code above:
import os, glob import pandas as pd path = "/home/user/data/" all_files = glob.glob(os.path.join(path, "*.csv")) all_df = [] for f in all_files: df = pd.read_csv(f, sep=',') df['file'] = f.split('/')[-1] all_df.append(df) merged_df = pd.concat(all_df, ignore_index=True, sort=True)
In this example we iterate over all selected files, then we extract the files names and create a column which contains this name.
Combine multiple CSV files when the columns are different
Sometimes the CSV files will differ for some columns or they might be the same only in the wrong order to be wrong. In this example you can find how to combine CSV files without identical structure:
import os, glob import pandas as pd path = "/home/user/data/" all_files = glob.glob(os.path.join(path, "*.csv")) all_df = [] for f in all_files: df = pd.read_csv(f, sep=',') f['file'] = f.split('/')[-1] all_df.append(df) merged_df = pd.concat(all_df, ignore_index=True, , sort=True)
Pandas will align the data by this method: pd.concat . In case of a missing column the rows for a given CSV file will contain NaN values:
row | col | col2 | col_201901 | file |
---|---|---|---|---|
1 | A | B | AA | data_201901.csv |
2 | C | D | NaN | data_201902.csv |
If you need to compare two csv files for differences with Python and Pandas you can check: Python Pandas Compare Two CSV files based on a Column
Bonus: Merge multiple files with Windows/Linux
Linux
Sometimes it’s enough to use the tools coming natively from your OS or in case of huge files. Using python to concatenate multiple huge files might be challenging. In this case for Linux it can be used:
sed 1d data_*.csv > merged.csv
In this case we are working in the current folder by matching all files starting with data_ . This is important because if you try to execute something like:
You will try to merge the newly output file as well which may cause issues. Another important note is that this will skip the first lines or headers of each file. In order to include headers you can do:
sed -n 1p data_1.csv > merged.csv sed 1d data_*.csv >> merged.csv
If the commands above are not working for you then you can try with the next two. The first one will merge all csv files but have problems if the files ends without new line:
head -n 1 1.csv > combined.out && tail -n+2 -q *.csv >> merged.out
The second one will merge the files and will add new line at the end of them:
head -n 1 1.csv > combined.out for f in *.csv; do tail -n 2 "$f"; printf "\n"; done >> merged.out
Windows
The Windows equivalent on this will be:
C:\> copy data_*.csv merged.csv
How To Merge Large CSV files Into A Single File With Python
Dive into Python and learn how to automate tasks like merging chunky CSV or Excel files using few lines of code.
Suggested On-Demand Courses:
Many of you contacted me asking for valuable resources to automate Excel (and in general spreadsheets) tasks with Python. Below I share four courses that I would recommend:
- Intermediate Python (Nano-Degree)ORData Analysts | Python + SQL (Nano-Degree)VERY HIGH quality courses for people committed to learn more advanced Python!→Obtain 70% Discount Through This Link
- Python Programming For Excel Users (Numpy & Pandas)
- Python For Spreadsheet Users (Pandas & Others)
- Python For Data Analysis & Visualisation (Pandas, Matplotlib, Seaborn)
Hope you’ll find them useful too! Now enjoy the article 😀
Introduction
Believe it or not, in 2022 there are still companies that hire external data consultants to perform tasks that would require minimal effort (even for a newbie) using a small Python script.
“…in 2022 there are still companies that hire external data consultants to perform tasks that would require minimal effort using a small Python script.”
Funnily enough, the same consultants pretend to use some black magic to perform straightforward jobs and charge unbelievably high fees. Money that could definitely be invested more wisely.
For instance, picture this: a big sales team having to merge multiple CSV or Excel files each month, coming from different departments, to create a unified performance report.
Despite these files often come with a similar format, at times they are so chunky, that a manual copy and paste is not even an option and could also lead to errors or missing data.
If this sounds familiar and you wish to learn how to automate such tasks with Python, you are in the right place!
How to combine CSV files using Python?
Often while working with CSV files, we need to deal with large datasets. Depending on the requirements of the data analysis, we may find that all the required data is not present in a single CSV file. Then the need arises to merge multiple files to get the desired data. However, copy-pasting the required columns from one file to another and that too from large datasets is not the best way to around it.
To solve this problem, we will learn how to use the append , merge and concat methods from Pandas to combine CSV files.
Combining Multiple CSV Files together
To begin with, let’s create sample CSV files that we will be using.
Notice that, all three files have the same columns or headers i.e. ‘name’, ‘age’ and ‘score’. Also, file 1 and file 3 have a common entry for the ‘name’ column which is Sam, but the rest of the values are different in these files.
Note that, in the below examples we are considering that all the CSV files are in the same folder as your Python code file. If this is not the case for you, please specify the paths accordingly while trying out the examples by yourself.
All the examples were executed in a Jupyter notebook.
Different Ways to Combine CSV Files in Python
Before starting, we will be creating a list of the CSV files that will be used in the examples below as follows:
import glob # list all csv files only csv_files = glob.glob('*.<>'.format('csv')) csv_files
['csv_file_1.csv', 'csv_file_2.csv', 'csv_file_3.csv']
Method 1: append()
Let’s look at the append method here to merge the three CSV files.
import pandas as pd df_csv_append = pd.DataFrame() # append the CSV files for file in csv_files: df = pd.read_csv(file) df_csv_append = df_csv_append.append(df, ignore_index=True) df_csv_append
The append method, as the name suggests, appends each file’s data frame to the end of the previous one. In the above code, we first create a data frame to store the result named df_csv_append. Then, we iterate through the list and read each CSV file and append it to the data frame df_csv_append.
Method 2: concat()
Another method used to combine CSV files is the Pandas concat() method. This method requires a series of objects as a parameter, hence we first create a series of the data frame objects of each CSV file and then apply the concat() method to it.
import pandas as pd df_csv_concat = pd.concat([pd.read_csv(file) for file in csv_files ], ignore_index=True) df_csv_concat
An easier-to-understand way of writing this code is:
l = [] for f in csv_files: l.append(pd.read_csv(f)) df_res = pd.concat(l, ignore_index=True) df_res
Both the above codes when executed produce the same output as shown below.
Notice that the resulting data frame is the same as that of the append() method.
Method 3: merge()
The merge method is used to join very large data frames. A join can be performed on two data frames at a time. We can specify the key based on which the join is to be performed.
It is a good practice to choose a key that is unique for each entry in the data frame, in order to avoid duplication of rows. We can also specify the type of join we wish to perform i.e. either of ‘inner’, ‘outer’, ‘left’, ‘right’ or ‘cross’ join.
We need to first read each CSV file into a separate data frame.
import pandas as pd df1 = pd.read_csv('csv_file_1.csv') df2 = pd.read_csv('csv_file_2.csv') df3 = pd.read_csv('csv_file_3.csv')
Joining df1 and df2:
df_merged = df1.merge(df2, how='outer') df_merged
Joining df1 and df3 based on the key ‘name’.
df_merged = df1.merge(df3, on="name", how='outer') df_merged
df1 and df3, both have an entry for the name ‘Sam’ and the age and score values for both of them are different. Hence, in the resulting data frame, there are columns for representing the entries from both df1 and df3. Since John and Bob are not common in the data frames df1 and df3, their values are NaN wherever applicable.
Conclusion
In this article, we learned about the Pandas methods namely concat, merge and append and how to use them to combine CSV files using Python.