There was a recent uproar on Twitter over a (now deleted) tweet by a data science interviewer that divided the Python community: pandas.read_csv() vs. the built-in csv module.
It is perfectly fine not to use the built-in csv module. Here are three solid reasons to use pandas.read_csv() instead (without delving into a feature-level comparison):
1. Fewer Lines of Code
Dataset - Download
For the above dataset, the simplest way of reading it using the csv module is:
import csv
data = []
with open('ford_escort.csv', newline='') as f:
    # skipinitialspace=True ignores the space that follows each comma
    reader = csv.reader(f, skipinitialspace=True)
    for row in reader:
        data.append(row)
Now, let us read the same CSV file using pandas.read_csv().
import pandas
df = pandas.read_csv('ford_escort.csv', skipinitialspace=True)
According to the Zen of Python, which lists the guiding principles for Python's design, "Simple is better than complex" and "Flat is better than nested". pandas.read_csv() follows both: there are fewer lines of code (and so less room for error), and the call involves no nesting.
2. Automatic Type Detection
The csv module treats the file as plain text, so every value in each row is stored as a string.
dtypes = [type(item) for item in data[1]]
print(dtypes)
Output:
[<class 'str'>, <class 'str'>, <class 'str'>]
In contrast, pandas.read_csv() automatically infers a suitable type for each column.
print(df.dtypes)
Output:
Year                   int64
Mileage (thousands)    int64
Price                  int64
dtype: object
It also supports easy type conversion of any column using the dtype argument.
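As a minimal sketch of the dtype argument, the snippet below reads a tiny in-memory CSV instead of ford_escort.csv. The three rows and the names sample and df_typed are hypothetical, made up only to mirror the column layout shown above.

```python
import io
import pandas

# Hypothetical three-row sample shaped like the article's dataset;
# the values are illustrative and not taken from ford_escort.csv.
sample = io.StringIO(
    "Year, Mileage (thousands), Price\n"
    "1998, 27, 9991\n"
    "1997, 17, 9925\n"
    "1998, 28, 10491\n"
)

# dtype maps column names to the types pandas should use for them.
# Columns that are not listed (here, the mileage) are still inferred.
df_typed = pandas.read_csv(
    sample,
    skipinitialspace=True,
    dtype={"Year": str, "Price": "float64"},
)

print(df_typed.dtypes)
print(df_typed["Price"].to_list())
```

Here Year is kept as text and Price is read as a float, while the mileage column is left to automatic detection.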
3. Columnar Datasets
CSV is one of the oldest and most widely used data serialisation formats. It is often used to store columnar data, where one might need to access specific columns after reading the file.
To extract a column (for example, Price) from the data list built above, we have to write a list comprehension:
price = [int(row[2]) for row in data[1:]]
print(price)
Output:
[9991, 9925, 10491, 10990, 9493, 9991, 10490, 9491, 9491, 9990, 9491, 9990, 9990, 9390, 9990, 9990, 9990, 8990, 7990, 5994, 5994, 5500, 11000]
In contrast, pandas.read_csv() returns a DataFrame, from which one can directly access any column using the [ ] (indexing) operator.
price = df["Price"].to_list()
print(price)
Output:
[9991, 9925, 10491, 10990, 9493, 9991, 10490, 9491, 9491, 9990, 9491, 9990, 9990, 9390, 9990, 9990, 9990, 8990, 7990, 5994, 5994, 5500, 11000]
This flexibility of accessing data in any direction makes pandas.read_csv() better suited for real-world applications.
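To illustrate what accessing data "in any direction" can look like, here is a small sketch that reads by column, by row, and by both at once. It uses a hypothetical three-row in-memory sample rather than ford_escort.csv, and the names sample and df_demo are assumptions made for this example.

```python
import io
import pandas

# Hypothetical sample with the same column layout as the article's dataset;
# the values are illustrative only.
sample = io.StringIO(
    "Year, Mileage (thousands), Price\n"
    "1998, 27, 9991\n"
    "1997, 17, 9925\n"
    "1998, 28, 10491\n"
)
df_demo = pandas.read_csv(sample, skipinitialspace=True)

# Column-wise: one column as a plain Python list.
prices = df_demo["Price"].to_list()

# Row-wise: the first record as a dictionary keyed by column name.
first_row = df_demo.iloc[0].to_dict()

# Both directions at once: prices of the rows whose Year is 1998.
prices_1998 = df_demo.loc[df_demo["Year"] == 1998, "Price"].to_list()

print(prices)
print(first_row)
print(prices_1998)
```

With the csv module, each of these would need its own loop or comprehension over the list of rows.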