Data Visualization with Python

July 19, 2023

Data Visualization with Python

Introduction

Data visualization is one of the key components in being able to understand data, as a data scientist. Why is it so important? In this blog, I would show you how data visualization can help you identify all kinds of interesting parts of your data such as outliers, spikes, groupings, tendencies, and more, that can help you realize how your data looks graphically.

In this blog, I would like to use Matplotlib and Seaborn which are quite popular libraries for data visualization in Python.

Visualization Distribution

There are a couple of ways to dig into data, among these is by looking at its distribution, or how the data is organized along an axis.

Loading dataset

The example below uses a diabetes prediction dataset in a CSV file format. You can download it here

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data/diabetes_prediction_dataset.csv')
df.head()

let’s filter a person with an age greater than 30, then plot a histogram for the distribution

filtered_df= df[(df['age']>30)]
filtered_df['age'].plot(kind='hist',bins=10)

Create a 2D histogram showing the relationship between age and blood glucose level.

y = filtered_df['age']
x = filtered_df['blood_glucose_level']

fig, ax = plt.subplots(tight_layout=True)
hist = ax.hist2d(x,y)

let’s work with the filtered dataset, and create a labelled and stacked histogram superimposing blood status with Hemoglobin A1c test

x1 = filtered_df.loc[filtered_df.gender=='Male', 'HbA1c_level']
x2 = filtered_df.loc[filtered_df.gender=='Female', 'HbA1c_level']

kwargs = dict(alpha=0.5, bins=20)

plt.hist(x1, **kwargs, color='red', label='low')
plt.hist(x2, **kwargs, color='orange', label='high')

plt.gca().set(title=' Status', ylabel='Hemoglobin A1c test')
plt.legend();

It is now time to use work with Seaborn, to create a smooth plot about HbA1c_level by showing data density

sns.kdeplot(filtered_df['HbA1c_level'])
plt.show()

let us show the blood glucose level mass density per smoking history:

sns.kdeplot(
   data=filtered_df, x="blood_glucose_level", hue="smoking_history",
   fill=True, common_norm=False, palette="crest",
   alpha=.3, linewidth=0,
)

Create a 2D kdeplot comparing heart_disease and hypertension with hue showing diabetes status

sns.kdeplot(data=filtered_df, x="hypertension", y="heart_disease", hue="diabetes")

Search This Blog

Easy tutorial

Data Visualization with Python

Introduction

Visualization Distribution

Comments

Popular Posts

Tensorflow

An introduction to Machine Learning