3 Solid Reasons Why You Should Use pandas.read_csv()

There was a recent uproar on Twitter over a (now deleted) tweet by a data science interviewer that divided the Python community: pandas.read_csv() vs. the built-in csv module.

It is perfectly fine not to use the built-in csv module. Here are three solid reasons to use pandas.read_csv() instead (without delving into a feature-level comparison):

1. Fewer Lines of Code

Dataset - Download

For the above dataset, the simplest way of reading it using the csv module is:

import csv
data = []
with open('ford_escort.csv', newline='') as f:
    reader = csv.reader(f, skipinitialspace=True)
    for row in reader:
        data.append(row)

Now, let us access the same CSV file using pandas.read_csv().

import pandas
df = pandas.read_csv('ford_escort.csv', skipinitialspace=True)

According to the Zen of Python, which lists the guiding principles for Python's design, "Simple is better than complex" and "Flat is better than nested". pandas.read_csv() follows both: there are fewer lines of code (and so less room for error), and the call is not nested inside a with block and a loop.

2. Automatic Type Detection

The csv module treats the file as plain text, so each row is returned as a list of strings.

dtypes = [type(item) for item in data[1]]
print(dtypes)

Output:

[<class 'str'>, <class 'str'>, <class 'str'>]

In contrast, pandas.read_csv() automatically infers a suitable type for each column.

print(df.dtypes)

Output:

Year                   int64
Mileage (thousands)    int64
Price                  int64
dtype: object

It also supports explicit type conversion of any column through the dtype argument.
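As a minimal sketch of the dtype argument, the snippet below overrides the inferred types for two columns. The inline CSV text is a made-up stand-in for ford_escort.csv (only the column names come from the output above), and io.StringIO is used purely so the example runs without the file.

```python
import io

import pandas as pd

# Illustrative stand-in for ford_escort.csv; these rows are made up.
csv_text = """Year, Mileage (thousands), Price
1998, 27, 9991
1997, 17, 9925
1998, 28, 10491
"""

# dtype maps column names to the types pandas should use instead of
# the ones it would infer (int64 for every column here).
df = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,
    dtype={"Year": str, "Price": "float64"},
)
print(df.dtypes)
```

Columns not listed in the mapping, such as Mileage (thousands), keep their inferred type.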

3. Columnar Datasets

CSV is one of the oldest and most widely used data serialisation formats. It is often used to store columnar data, where one might need to access specific columns after reading the file.

To extract a column (for example Price) from the above data, we have to write a list comprehension that skips the header row and converts each value:

price = [int(row[2]) for row in data[1:]]
print(price)

Output:

[9991, 9925, 10491, 10990, 9493, 9991, 10490, 9491, 9491, 9990, 9491, 9990, 9990, 9390, 9990, 9990, 9990, 8990, 7990, 5994, 5994, 5500, 11000]

In contrast, pandas.read_csv() returns a DataFrame, from which any column can be accessed directly by name using the [ ] (indexing) operator.

price = df["Price"].to_list()
print(price)

Output:

[9991, 9925, 10491, 10990, 9493, 9991, 10490, 9491, 9491, 9990, 9491, 9990, 9990, 9390, 9990, 9990, 9990, 8990, 7990, 5994, 5994, 5500, 11000]

This flexibility of accessing data in any direction makes pandas.read_csv() better suited to real-world applications.
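To illustrate what "any direction" can look like in practice, here is a small sketch that reads by column, by row, and by a condition on a column. The inline CSV text is again a made-up stand-in for ford_escort.csv, and the variable names are only illustrative.

```python
import io

import pandas as pd

# Illustrative stand-in for ford_escort.csv; these rows are made up.
csv_text = """Year, Mileage (thousands), Price
1998, 27, 9991
1997, 17, 9925
1998, 28, 10491
"""

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Column-wise: one column as a plain Python list.
prices = df["Price"].to_list()

# Row-wise: the first data row as a dict keyed by column name.
first_row = df.iloc[0].to_dict()

# Condition on a column: only the rows priced below 10000.
cheaper = df[df["Price"] < 10000]

print(prices)
print(first_row)
print(cheaper)
```

With the csv module, each of these would need its own loop or list comprehension over the list of rows.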