Unleash the Power of Pandas: Making Data Analysis Easier

July 19, 2023

Pandas Series

A Pandas Series is a one-dimensional, array-like object that holds a sequence of values of a single type (the types closely mirror NumPy types) together with an associated array of data labels, called its index. The simplest Series is built from nothing more than a list or array of data. Example:

import pandas as pd

a = [1, 7, 2]

pd_series = pd.Series(a)

print(pd_series)

output:
0    1
1    7
2    2
dtype: int64

If nothing else is specified, the values are labeled with their position: the first value has index 0, the second value has index 1, and so on. With the index argument, you can supply your own labels. Example:

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

output:
x    1
y    7
z    2
dtype: int64
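
Once a Series has its own labels, those labels can be used to look values up. The short sketch below reuses the same Series as above; the variable names y_value and subset are only illustrative.

```python
import pandas as pd

a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])

# Look up a single value by its label
y_value = myvar["y"]
print(y_value)

# Look up several values at once by passing a list of labels;
# the result is a smaller Series that keeps those labels
subset = myvar[["x", "z"]]
print(subset)
```

Passing one label gives back a single value, while passing a list of labels gives back another Series.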

A Series can be converted to a dictionary with its to_dict method. The index labels become the keys. Example:

myvar.to_dict()

Output:
{'x': 1, 'y': 7, 'z': 2}
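
The conversion also works in the other direction: a Series can be built from a dictionary, in which case the keys become the index labels. This is a minimal sketch with made-up values; the names calories, from_dict and round_trip are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: dictionary keys will become the Series index
calories = {"day1": 420, "day2": 380, "day3": 390}

from_dict = pd.Series(calories)
print(from_dict)

# Converting back with to_dict gives a dictionary equal to the original
round_trip = from_dict.to_dict()
print(round_trip)
```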

Pandas DataFrames

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns. Example:

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)

Output:
       calories   duration
  0       420        50
  1       380        40
  2       390        45

As you can see from the result above, the DataFrame is like a table with rows and columns.

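Pandas uses the loc attribute to return one or more specified rows by label. A single label returns that row as a Series, while a list of labels returns a DataFrame. For large DataFrames, the head method returns only the first five rows by default. Similarly, tail returns the last five rows. Example: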

print(df.loc[0]) # returns a Pandas Series
print(df.loc[[0, 1]]) # returns a Pandas DataFrame

output:
  calories    420
  duration     50
  Name: 0, dtype: int64

       calories  duration
  0       420        50
  1       380        40

If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order. If you pass a column that isn’t contained in the dictionary, it will appear with missing values (NaN) in the result. Example:

df = pd.DataFrame(data, columns=['duration','calories','total'])
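
To see the effect of the columns argument, here is a small runnable sketch that reuses the calories and duration data from earlier. The 'total' column is not in the dictionary, so it should come out filled with missing values.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

# Ask for the columns in a custom order, plus one that does not exist in data
df = pd.DataFrame(data, columns=["duration", "calories", "total"])
print(df)

# Check that every value in the extra column is missing
print(df["total"].isna().all())
```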

Load Files Into a DataFrame

If your data set is stored in a file, Pandas can load it into a DataFrame. The example below uses a CSV file called ‘data.csv’. You can download data.csv to follow along.

import pandas as pd

# Load a comma separated file (CSV file) into a DataFrame:
df = pd.read_csv('data.csv')

print(df) 

output:
         Duration  Pulse  Maxpulse  Calories
  0          60    110       130     409.1
  1          60    117       145     479.0
  2          60    103       135     340.0
  3          45    109       175     282.4
  4          45    117       148     406.0
  ..        ...    ...       ...       ...
  164        60    105       140     290.8
  165        60    110       145     300.4
  166        60    115       145     310.2
  167        75    120       150     320.4
  168        75    125       150     330.4
  
  [169 rows x 4 columns]

You can also use pd.read_json('data.json') to read a JSON file. This is useful because large data sets are often stored or exported as JSON, whose structure is similar to a Python dictionary.
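
Since no data.json file is provided here, the sketch below reads JSON from an in-memory string instead of a file. The JSON content is made up for illustration, and io.StringIO is used only so the example runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical JSON text: each column maps row labels to values
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

# read_json accepts a file path or a file-like object such as StringIO
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```

With a real file, the call would simply be pd.read_json('data.json'), as mentioned above.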

Analyzing DataFrames

One of the most used methods for getting a quick overview of a DataFrame is the head() method. It returns the column headers and a specified number of rows, starting from the top (five rows if no number is given).

There is also a tail() method for viewing the last rows of the DataFrame. It returns the column headers and a specified number of rows, starting from the bottom.

The DataFrame object also has a method called info() that gives you more information about the data set, such as the type of each column and how many non-null values it holds. Example:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(3))
print(df.info())  # info() prints its report itself and returns None, so "None" is printed last


output:
       Duration  Pulse  Maxpulse  Calories
  0        60    110       130     409.1
  1        60    117       145     479.0
  2        60    103       135     340.0

  <class 'pandas.core.frame.DataFrame'>
  RangeIndex: 169 entries, 0 to 168
  Data columns (total 4 columns):
   #   Column    Non-Null Count  Dtype  
  ---  ------    --------------  -----  
   0   Duration  169 non-null    int64  
   1   Pulse     169 non-null    int64  
   2   Maxpulse  169 non-null    int64  
   3   Calories  164 non-null    float64
  dtypes: float64(1), int64(3)
  memory usage: 5.4 KB
  None
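
The tail() method and the default behaviour of head() can be tried without data.csv. The sketch below uses a small made-up DataFrame as a stand-in; the values and the names first_rows and last_two are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for data.csv with seven rows
df = pd.DataFrame({
    "Duration": [60, 60, 60, 45, 45, 60, 75],
    "Pulse": [110, 117, 103, 109, 117, 102, 120],
})

# With no argument, head() returns the first five rows
first_rows = df.head()
print(first_rows)

# tail(2) returns the last two rows, keeping their original index labels
last_two = df.tail(2)
print(last_two)
```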

Data Cleaning

Data cleaning means fixing bad data in your data set. Bad data could be empty cells, data in the wrong format, wrong data, or duplicates. The example below deals with empty cells:

import pandas as pd

df = pd.read_csv('data.csv') 

# Option 1: remove rows that contain empty cells (returns a new DataFrame)
cleaned = df.dropna()

# Option 2: replace empty cells in the "Calories" column with the column mean
x = df["Calories"].mean()
df["Calories"] = df["Calories"].fillna(x)
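
Dropping rows and filling values are two different choices, so it helps to see them side by side. This is a minimal sketch on a made-up DataFrame with one empty cell; the names dropped, filled and mean_calories are only illustrative.

```python
import pandas as pd

# Hypothetical data with one missing value in "Calories"
df = pd.DataFrame({
    "Duration": [60, 45, 60, 45],
    "Calories": [400.0, None, 300.0, 200.0],
})

# Choice 1: drop every row that contains an empty cell
dropped = df.dropna()
print(dropped)

# Choice 2: keep all rows and fill the empty cell with the column mean
mean_calories = df["Calories"].mean()  # mean() skips missing values
filled = df.copy()
filled["Calories"] = filled["Calories"].fillna(mean_calories)
print(filled)
```

Dropping loses a row of otherwise valid data, while filling keeps the row but inserts an estimated value. Which one fits depends on the data set.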

Discovering Duplicates

Duplicate rows are rows that have been registered more than once. To discover duplicates, we can use the duplicated() method.

The duplicated() method returns a Boolean value for each row: True if the row repeats an earlier row, otherwise False.

To remove duplicates, use the drop_duplicates() method. Example:

import pandas as pd

df = pd.read_csv('data.csv')

df.drop_duplicates(inplace = True)
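
To show what duplicated() reports before anything is removed, here is a small sketch on a made-up DataFrame in which the third row repeats the first. The data and the names flags and deduped are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0
df = pd.DataFrame({
    "Duration": [60, 45, 60, 45],
    "Pulse": [110, 109, 110, 117],
})

# One Boolean per row: True marks a row that repeats an earlier one
flags = df.duplicated()
print(flags)

# drop_duplicates keeps the first occurrence and removes the repeats
deduped = df.drop_duplicates()
print(deduped)
```

Note that row 3 is not a duplicate: it shares its Duration with row 1, but the Pulse differs, and by default all columns must match.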

Pandas — Data Correlations

A great aspect of the Pandas module is the corr() method. The corr() method calculates the pairwise correlation between the numeric columns in your data set, using the Pearson correlation coefficient by default.

import pandas as pd

df = pd.read_csv('data.csv')

print(df.corr())

output:
            Duration     Pulse  Maxpulse  Calories
  Duration  1.000000 -0.155408  0.009403  0.922721
  Pulse    -0.155408  1.000000  0.786535  0.025120
  Maxpulse  0.009403  0.786535  1.000000  0.203814
  Calories  0.922721  0.025120  0.203814  1.000000

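The result of the corr() method is a table of numbers that represent how strong the relationship between two columns is. Each number varies from -1 to 1: 1 means a perfect positive relationship, -1 means a perfect negative relationship, and values near 0 mean little or no linear relationship. In the output above, Duration and Calories are strongly related (0.922721), while Duration and Maxpulse are barely related (0.009403).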
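
The two ends of that range are easier to read with data where the answer is known in advance. The sketch below uses made-up columns: one that rises exactly with x and one that falls exactly as x rises. The column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical columns with perfectly linear relationships to x
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],          # rises whenever x rises
    "negative_x": [-1, -2, -3, -4],    # falls whenever x rises
})

corr = df.corr()
print(corr)

# Pick single values out of the correlation table by row and column label
print(corr.loc["x", "double_x"])    # close to 1
print(corr.loc["x", "negative_x"])  # close to -1
```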

Conclusion

The Pandas library is a valuable tool to have in Python. It offers data structures and operations for manipulating numerical data and time series, and it is popular mainly because it makes importing and analyzing data much easier. Pandas is fast and offers high performance and productivity for its users.
