Unleash the Power of Pandas: Making Data Analysis Easier



What is Pandas?

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. It has functions for analyzing, cleaning, exploring, and manipulating data.

 


Pandas Series

A Pandas Series is a one-dimensional, array-like object that holds a sequence of values of a single type (similar to the NumPy types) together with an associated array of data labels, called its index. The simplest Series is formed from only an array of data. Example:

import pandas as pd
a = [1, 7, 2]
pd_series = pd.Series(a)
print(pd_series)
Output:
0    1
1    7
2    2
dtype: int64

If nothing else is specified, the values are labeled with their position: the first value has index 0, the second has index 1, and so on. With the index argument, you can supply your own labels. Example:

import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
Output:
x    1
y    7
z    2
dtype: int64
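
Once a Series has labels, those labels can be used to look values up. The following is a minimal sketch reusing the same values and labels as the example above; the names y_value and subset are only illustrative.

```python
import pandas as pd

# Same data as above, labeled "x", "y", "z"
myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

# Look up a single value by its label
y_value = myvar["y"]
print(y_value)

# Select several labels at once; the result is another Series
subset = myvar[["x", "z"]]
print(subset)
```

Passing one label returns a single value, while passing a list of labels returns a smaller Series that keeps those labels as its index.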

A Series can be converted to a dictionary with its to_dict method. Example:

myvar.to_dict()
Output:
{'x': 1, 'y': 7, 'z': 2}
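
The conversion also works in the other direction: a dictionary can be passed to pd.Series, in which case the keys become the index labels. A short sketch, assuming made-up keys ("day1" and so on) that are not part of the original data:

```python
import pandas as pd

# Hypothetical sample data: dictionary keys become the Series index
calories = {"day1": 420, "day2": 380, "day3": 390}
s = pd.Series(calories)
print(s)

# Converting back yields a dictionary equal to the one we started with
round_trip = s.to_dict()
print(round_trip == calories)
```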

Pandas DataFrames

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns. Example:

import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
# Load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Output:
   calories  duration
0       420        50
1       380        40
2       390        45

As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows. For large DataFrames, the head method returns only the first five rows by default; similarly, tail returns the last five rows. Example:

print(df.loc[0]) # returns a Pandas Series
print(df.loc[[0, 1]]) # returns a Pandas DataFrame
Output:
calories    420
duration     50
Name: 0, dtype: int64
   calories  duration
0       420        50
1       380        40
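
The head and tail methods mentioned above can be sketched the same way. The three-row DataFrame used so far is too small to show the effect, so this example assumes a made-up ten-row DataFrame named df_big:

```python
import pandas as pd

# Hypothetical data: ten rows, so head() and tail() return a strict subset
df_big = pd.DataFrame({"value": range(10)})

first_rows = df_big.head()   # first five rows by default
last_rows = df_big.tail()    # last five rows by default
last_two = df_big.tail(2)    # pass a number to change how many rows come back

print(first_rows)
print(last_two)
```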

If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order. If you pass a column that isn’t contained in the dictionary, it will appear with missing values in the result. Example:

df = pd.DataFrame(data, columns=['duration', 'calories', 'total'])
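
To make the effect visible, here is a self-contained sketch that builds the DataFrame with the columns argument and checks the extra column. It reuses the calories and duration data from earlier; the isna() check is an addition for illustration.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

# "total" is not a key in data, so pandas fills that column with NaN
df = pd.DataFrame(data, columns=["duration", "calories", "total"])
print(df)

# True: every value in the "total" column is missing
print(df["total"].isna().all())
```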

Load Files Into a DataFrame

If your data sets are stored in a file, Pandas can load them into a DataFrame. The example below uses a CSV file called 'data.csv'.

import pandas as pd

# Load a comma separated file (CSV file) into a DataFrame:
df = pd.read_csv('data.csv')

print(df)

Output:
     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]

You can also use pd.read_json('data.json') to read a JSON file. This is useful because big data sets are often stored or extracted as JSON, whose structure resembles a Python dictionary.
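
As a rough sketch of read_json, the example below reads JSON from an in-memory string instead of a file so that it runs on its own. The JSON text and the column values are made up for illustration; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical JSON text standing in for the contents of a file such as data.json
json_text = '[{"Duration": 60, "Pulse": 110}, {"Duration": 45, "Pulse": 109}]'

# read_json accepts a path or a file-like object; orient="records" matches
# the list-of-objects layout used here
df_json = pd.read_json(io.StringIO(json_text), orient="records")
print(df_json)
```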

Analyzing DataFrames

One of the most commonly used methods for getting a quick overview of a DataFrame is head(). The head() method returns the headers and a specified number of rows, starting from the top.

There is also a tail() method for viewing the last rows of the DataFrame. The tail() method returns the headers and a specified number of rows, starting from the bottom.

To learn more about the data set, use the DataFrame's info() method, which reports the number of rows, each column's name, non-null count, and data type, and the memory usage. Example:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(3))
print(df.info())


Output:
   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None
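
The Non-Null Count column above shows 164 of 169 values for Calories, which means five cells are empty. One way to get the number of missing values per column directly is to combine isna() with sum(). This sketch uses a small made-up DataFrame (df_small) rather than data.csv so that it runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for data.csv with one missing "Calories" value
df_small = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Calories": [409.1, None, 282.4],
})

# isna() marks each empty cell as True; sum() counts the True values per column
missing_per_column = df_small.isna().sum()
print(missing_per_column)
```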

Data Cleaning

Data cleaning means fixing bad data in your data set. Bad data can be empty cells, data in the wrong format, wrong data, or duplicates. The example below deals with empty cells.

import pandas as pd

df = pd.read_csv('data.csv')

# Option 1: build a copy without the rows that contain empty cells
cleaned = df.dropna()

# Option 2: keep every row and replace empty "Calories" cells with the column mean
x = df["Calories"].mean()
df["Calories"] = df["Calories"].fillna(x)
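
Dropping and filling are alternatives: once the rows with empty cells are dropped, there is nothing left to fill. The sketch below compares the two on a small made-up DataFrame (df_small), since the contents of data.csv are not reproduced here:

```python
import pandas as pd

# Hypothetical data with one empty "Calories" cell
df_small = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [400.0, None, 200.0],
})

# Dropping removes the whole row that has the empty cell
dropped = df_small.dropna()
print(len(dropped))

# Filling keeps every row; mean() ignores the empty cell when averaging
mean_calories = df_small["Calories"].mean()
filled = df_small.fillna({"Calories": mean_calories})
print(filled["Calories"].tolist())
```

Both calls return a new DataFrame and leave df_small unchanged, which makes it easy to compare the results before committing to one approach.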

Discovering Duplicates

Duplicate rows are rows that have been registered more than one time. To discover duplicates, we can use the duplicated() method.

The duplicated() method returns a Boolean value for each row: True if the row repeats an earlier row, and False otherwise.

To remove duplicates, use the drop_duplicates() method. Example:

import pandas as pd

df = pd.read_csv('data.csv')

df.drop_duplicates(inplace = True)
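
To see what duplicated() reports before anything is removed, here is a minimal sketch on a made-up three-row DataFrame (df_dupes) in which the last row repeats the first:

```python
import pandas as pd

# Hypothetical data: the third row is an exact copy of the first
df_dupes = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 109, 110],
})

# One Boolean per row; only the repeated occurrence is flagged
flags = df_dupes.duplicated()
print(flags.tolist())  # [False, False, True]

# drop_duplicates() keeps the first occurrence of each row
unique_rows = df_dupes.drop_duplicates()
print(len(unique_rows))
```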

Data Correlations

A useful feature of the Pandas module is the corr() method. The corr() method calculates the correlation (by default, the Pearson correlation coefficient) between each pair of numeric columns in your data set.

import pandas as pd

df = pd.read_csv('data.csv')

print(df.corr())

Output:
          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.155408  0.009403  0.922721
Pulse    -0.155408  1.000000  0.786535  0.025120
Maxpulse  0.009403  0.786535  1.000000  0.203814
Calories  0.922721  0.025120  0.203814  1.000000

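The result of the corr() method is a table of correlation coefficients, one for each pair of columns, that shows how strongly the two columns are related. Each value ranges from -1 to 1: a value near 1 means the columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship. In the output above, Duration and Calories have a correlation of 0.922721, so longer workouts go with more calories burned.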
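
The two ends of that range can be illustrated with a small made-up DataFrame (df_corr) in which one column rises exactly in step with another and a third falls exactly in step. The column names and values are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical data with perfectly linear relationships
df_corr = pd.DataFrame({
    "minutes": [10, 20, 30, 40],
    "calories": [100, 200, 300, 400],   # rises in step with minutes
    "remaining": [50, 40, 30, 20],      # falls as minutes rise
})

corr = df_corr.corr()
print(corr.round(2))
```

Here minutes and calories correlate at 1, while minutes and remaining correlate at -1. Real data, like the output above, usually falls somewhere in between.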

Conclusion

The Pandas library is a valuable tool to have in Python. It offers data structures and operations for manipulating numerical data and time series, and it is popular mainly because it makes importing and analyzing data much easier. Pandas is fast and helps users work productively.
