Unleash the Power of Pandas: Making Data Analysis Easier
Pandas Series
A Pandas Series is a one-dimensional array-like object that holds a sequence of values of a single type (similar to the NumPy types) together with an associated array of data labels, called its index. The simplest Series is built from just an array of data. Example:
import pandas as pd
a = [1, 7, 2]
pd_series = pd.Series(a)
print(pd_series)
output:
0 1
1 7
2 2
dtype: int64
If nothing else is specified, the values are labeled with their position: the first value has index 0, the second has index 1, and so on. With the index argument, you can assign your own labels. Example:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
output:
x 1
y 7
z 2
dtype: int64
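Once a Series has labels, those labels can be used to look values up. The following is a minimal sketch that reuses the data and labels from the example above; the variable names y_value and subset are only illustrative.

```python
import pandas as pd

# Same data and labels as in the example above
myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

# Look up a single value by its label
y_value = myvar["y"]
print(y_value)  # 7

# Pass a list of labels to select several values; the result is again a Series
subset = myvar[["x", "z"]]
print(subset)
```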
A Series can be converted to a dictionary with its to_dict() method. Example:
myvar.to_dict()
Output:
{'x': 1, 'y': 7, 'z': 2}
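The conversion also works in the other direction: a dictionary can be passed to pd.Series, and its keys become the index labels. A small sketch of the round trip, assuming a made-up dictionary named calories:

```python
import pandas as pd

# Hypothetical data: the dictionary keys become the index labels
calories = {"day1": 420, "day2": 380, "day3": 390}
s = pd.Series(calories)
print(s)

# Converting back gives a dictionary with the same keys and values
round_trip = s.to_dict()
print(round_trip)
```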
Pandas DataFrames
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns. Example:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
# Load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Output:
calories duration
0 420 50
1 380 40
2 390 45
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified rows. For large DataFrames, the head() method returns only the first five rows by default; similarly, tail() returns the last five. Example:
print(df.loc[0]) # returns a Pandas Series
print(df.loc[[0, 1]]) # returns a Pandas DataFrame
output:
calories 420
duration 50
Name: 0, dtype: int64
calories duration
0 420 50
1 380 40
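Because loc selects by label, it also works when the rows carry names instead of the default numbers. The sketch below assumes the same data as above and invents the row labels "day1" to "day3" purely for illustration.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# The row labels below are made up for this sketch
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Select one row by its label; the result is a Series
row = df.loc["day2"]
print(row)

# Select several rows by passing a list of labels; the result is a DataFrame
rows = df.loc[["day1", "day3"]]
print(rows)
```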
If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order. If you pass a column that isn’t contained in the dictionary, it will appear with missing values in the result. Example:
df = pd.DataFrame(data, columns=['duration', 'calories', 'total'])
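To see the effect of the columns argument, here is a self-contained sketch using the same dictionary as before. The 'total' column is not in the dictionary, so it should come back filled with missing values (NaN).

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# Columns appear in the requested order; 'total' has no data behind it
df = pd.DataFrame(data, columns=["duration", "calories", "total"])
print(df)

# Every entry in the 'total' column is missing
print(df["total"].isna().all())  # True
```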
Load Files Into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame. The example below uses a downloadable CSV file called ‘data.csv’.
import pandas as pd
# Load a comma separated file (CSV file) into a DataFrame:
df = pd.read_csv('data.csv')
print(df)
output:
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4
[169 rows x 4 columns]
You can also use pd.read_json('data.json') to read a JSON file. This is useful because big data sets are often stored or extracted as JSON, whose structure is similar to a Python dictionary.
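Since no JSON file accompanies this article, the sketch below feeds read_json an in-memory string through io.StringIO instead of a file path. The JSON content is invented for illustration; it uses the column-oriented layout that read_json expects by default.

```python
import io

import pandas as pd

# Hypothetical JSON text: each column maps row labels to values
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

# read_json accepts a file path or a file-like object such as StringIO
df = pd.read_json(io.StringIO(json_text))
print(df)
```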
Analyzing DataFrames
One of the most used methods for getting a quick overview of a DataFrame is the head() method. It returns the headers and a specified number of rows (five by default), starting from the top.
There is also a tail() method for viewing the last rows of the DataFrame. It returns the headers and a specified number of rows, starting from the bottom.
The DataFrame object also has a method called info(), which gives you more information about the data set. Example:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(3))
print(df.info())  # info() prints its report itself and returns None, hence the final "None" in the output
output:
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Duration 169 non-null int64
1 Pulse 169 non-null int64
2 Maxpulse 169 non-null int64
3 Calories 164 non-null float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None
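The example above only calls head(3). As a rough sketch of how the default row count and tail() behave, the snippet below builds a small made-up DataFrame with ten rows rather than reading data.csv.

```python
import pandas as pd

# Ten hypothetical rows: 10, 20, ..., 100
df = pd.DataFrame({"Duration": range(10, 110, 10)})

# Without an argument, head() returns the first five rows
first_default = df.head()
print(first_default)

# tail(2) returns the last two rows
last_two = df.tail(2)
print(last_two)
```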
Data Cleaning
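Data cleaning means fixing bad data in your data set. Bad data can be empty cells, data in the wrong format, wrong data, or duplicates. Empty cells can be handled by removing the affected rows or by filling in a replacement value. Example: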
import pandas as pd
df = pd.read_csv('data.csv')
# Option 1: build a new DataFrame without the rows that contain empty cells
cleaned = df.dropna()
# Option 2: keep all rows and replace empty "Calories" cells with the column mean
x = df["Calories"].mean()
df["Calories"] = df["Calories"].fillna(x)
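To make the difference between dropping and filling visible, here is a small sketch on an invented three-row DataFrame with one empty "Calories" cell. The names dropped, filled and mean_calories are only illustrative.

```python
import pandas as pd

# Hypothetical data with one missing "Calories" value
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [400.0, None, 200.0]})

# Dropping: the row with the empty cell disappears
dropped = df.dropna()
print(len(dropped))  # 2

# Filling: mean() skips the missing value, so the mean here is 300.0
mean_calories = df["Calories"].mean()
filled = df.assign(Calories=df["Calories"].fillna(mean_calories))
print(filled)
```

Dropping rows is simpler but loses the other values in those rows, while filling keeps every row at the cost of inserting an estimated value.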
Discovering Duplicates
Duplicate rows are rows that have been registered more than once. To discover duplicates, use the duplicated() method, which returns a Boolean value for each row: True if the row repeats an earlier one, otherwise False.
To remove duplicates, use the drop_duplicates() method. Example:
import pandas as pd
df = pd.read_csv('data.csv')
df.drop_duplicates(inplace = True)
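The following sketch shows what duplicated() reports before anything is removed. It uses a tiny made-up DataFrame in which the third row repeats the first.

```python
import pandas as pd

# Hypothetical data: the third row is an exact copy of the first
df = pd.DataFrame({"Duration": [60, 45, 60], "Pulse": [110, 109, 110]})

# One Boolean per row; only the repeated occurrence is flagged
mask = df.duplicated()
print(mask.tolist())  # [False, False, True]

# drop_duplicates() keeps the first occurrence of each row
deduped = df.drop_duplicates()
print(deduped)
```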
Pandas — Data Correlations
A useful feature of Pandas is the corr() method, which calculates the relationship (the Pearson correlation coefficient by default) between each pair of columns in your data set. Example:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.corr())
output:
Duration Pulse Maxpulse Calories
Duration 1.000000 -0.155408 0.009403 0.922721
Pulse -0.155408 1.000000 0.786535 0.025120
Maxpulse 0.009403 0.786535 1.000000 0.203814
Calories 0.922721 0.025120 0.203814 1.000000
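The result of the corr() method is a table of numbers that indicate how strong the relationship between two columns is. Each number ranges from -1 to 1: values near 1 or -1 mean a strong positive or negative relationship, and values near 0 mean little or no linear relationship. In the output above, Duration and Calories correlate at 0.922721, so longer workouts tend to burn more calories.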
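The two ends of that range are easier to picture with data built for the purpose. In this sketch, the invented column cost rises in step with hours while remaining falls as hours rises, so the correlations should come out at 1 and -1 (up to floating-point rounding).

```python
import pandas as pd

# Hypothetical columns with perfectly linear relationships
df = pd.DataFrame({
    "hours": [1, 2, 3, 4],
    "cost": [10, 20, 30, 40],       # rises in step with hours
    "remaining": [40, 30, 20, 10],  # falls as hours rises
})

corr = df.corr()
print(corr)

# Read a single pairwise value from the correlation table
print(corr.loc["hours", "cost"])
print(corr.loc["hours", "remaining"])
```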
Conclusion
The Pandas library is a valuable tool to have in Python. It offers data structures and operations for manipulating numerical data and time series, and it is popular mainly because it makes importing and analyzing data much easier. Pandas is fast and offers high performance and productivity for its users.