This article describes which is the fastest way to read CSV among csv (Python standard module) , NumPy, pandas, and Excel VBA.
Conclusion
pandas is the fastest. csv (Python standard module) is twice as slow. NumPy and VBA are 20x slower.
Experimental method
- Measure the time 10 times to read a CSV file with Python standard module csv and put it in a variable. Measure this five times.
- Same with NumPy
- Same with pandas
- Measure the time 10 times to read a CSV file with VBA and put it in a EXCEL. Measure this five times.
CSV file to read
- 200,000 rows of data 0,1,2,3,4,5,6,7,8,9
- File size 4.0MB
This is what the file looks like when opened in Excel.
Code
Python standard module csv
#!/usr/bin/python
from time import time
import csv
file_name = r"C:\Users\yoshihiko\Desktop\yoshihiko.csv"
iteration = 10
def load_csv_BY_csv_reader(file):
result = []
for n in range(iteration):
with open(file, "r", encoding="utf-8") as f:
reader = csv.reader(f)
for row in reader:
result.append(row)
return result
start_time = time()
result = load_csv_BY_csv_reader(file_name)
elapsed_time = time() - start_time
print("{:.3f}".format(elapsed_time), "秒")
NumPy
#!/usr/bin/python
from time import time
import numpy as np
file_name = r"C:\Users\yoshihiko\Desktop\yoshihiko.csv"
iteration = 10
def load_csv_BY_numpy(file):
for n in range(iteration):
csv = np.loadtxt(file, delimiter=',')
return csv
start_time = time()
result = load_csv_BY_numpy(file_name)
elapsed_time = time() - start_time
print("{:.3f}".format(elapsed_time), "秒")
pandas
#!/usr/bin/python
from time import time
import pandas as pd
file_name = r"C:\Users\yoshihiko\Desktop\yoshihiko.csv"
iteration = 10
def load_csv_BY_pandas(file):
for n in range(iteration):
csv = pd.read_csv(file)
return csv
start_time = time()
result = load_csv_BY_pandas(file_name)
elapsed_time = time() - start_time
print("{:.3f}".format(elapsed_time), "秒")
VBA
Option Explicit
Private Sub read_csv()
Dim csv_path As String: csv_path = "C:\Users\yoshihiko\Desktop\yoshihiko.csv"
Dim csv_file_name As String: csv_file_name = "yoshihiko.csv"
Dim csv_sheet_name As String: csv_sheet_name = "yoshihiko"
Dim self_file_name As String: self_file_name = Application.ActiveWorkbook.Name
Dim i As Integer
Dim iterate As Integer: iterate = 10
Dim start_time As Variant
Dim elapsed_time As Variant
Application.ScreenUpdating = False
start_time = Timer
For i = 1 To iterate
Workbooks(self_file_name).Activate
Workbooks.Open csv_path
Workbooks(self_file_name).Worksheets("Sheet1").Range("A1:J20000") = _
Workbooks(csv_file_name).Worksheets(csv_sheet_name).Range("A1:J20000").Value
Workbooks(csv_file_name).Close
Next i
elapsed_time = Timer - start_time
Application.ScreenUpdating = True
MsgBox elapsed_time & "sec"
End Sub
PC
CPU : CORE i7 7th Gen ( i7-7500U 2.70GHz up to 3.50 GHz )
Result
- pandas is the fastest with an average of 1.001 seconds.
- Based on pandas, csv is 2x slower, NumPy and VBA 20x slower.
Trial Number | csv | NumPy | pandas | VBA |
1 | 2.445 | 20.861 | 1.157 | 24.836 |
2 | 2.140 | 19.784 | 1.012 | 22.086 |
3 | 2.124 | 19.460 | 0.943 | 20.719 |
4 | 2.094 | 20.080 | 0.954 | 26.750 |
5 | 2.100 | 20.484 | 0.940 | 21.180 |
average | 2.181 | 20.134 | 1.001 | 23.114 |
unit : second
Summary
pandas is the fastest.
リンク
リンク
コメント