Why Exploratory Data Analysis?
Some advantages of Exploratory Data Analysis include:
Improve understanding of variables by extracting summary statistics such as the mean, minimum, and maximum values.
Discover errors, outliers, and missing values in the data.
Identify patterns by visualizing data in graphs such as box plots, scatter plots, and histograms.
Hence, the main goal is to understand the data better and use tools effectively to gain valuable insights or draw conclusions.
The Advantages of Exploratory Data Analysis
Example in Python
Fisher's Iris dataset is used to demonstrate EDA tasks in the following code blocks.
The dataset contains 150 records with five attributes: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), and class (which represents the flower species).
# Importing libraries
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Loading data for analysis
iris_data = load_iris()
# Creating a dataframe
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target
print(iris_dataframe.head())
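Before moving on, it can help to confirm that the data frame matches the description above. The following is a minimal sketch, not part of the original walkthrough: it checks the shape and column types, and maps the integer class codes to species names. The `species` column is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this sketch is self-contained
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# Number of records and attributes
n_rows, n_cols = iris_dataframe.shape
print(n_rows, n_cols)

# Data type of each column
print(iris_dataframe.dtypes)

# Map integer class codes to species names
# ('species' is a hypothetical column added only for this example)
code_to_name = dict(enumerate(iris_data.target_names))
iris_dataframe['species'] = iris_dataframe['class'].map(code_to_name)

# Number of records per species
species_counts = iris_dataframe['species'].value_counts()
print(species_counts)
```

Checking the per-class counts early shows whether the classes are balanced, which affects how later plots and statistics should be read.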
Statistics
The first step in data analysis is to inspect summary statistics of the data and decide whether it needs preprocessing to make it more consistent.
Describe
The describe() method of a pandas data frame reports key statistics for each numeric column, such as count, min, max, mean, standard deviation, and quartiles.
For example, to verify the minimum and maximum values in our data, we can invoke the describe() method:
# Summary of numerical variables
print(iris_dataframe.describe())
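The output of describe() is itself a data frame, so individual statistics can be selected from it. The sketch below is one possible way to pull out only the min and max rows. It also computes per-class means with groupby(), which is an addition for illustration and not a step from the original walkthrough.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this sketch is self-contained
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# describe() returns a data frame indexed by statistic name
summary = iris_dataframe.describe()

# Keep only the minimum and maximum of the four measurement columns
min_max = summary.loc[['min', 'max'], iris_data.feature_names]
print(min_max)

# Mean of each measurement within each class (illustrative addition)
per_class_mean = iris_dataframe.groupby('class')[iris_data.feature_names].mean()
print(per_class_mean)
```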
Data cleaning
Removing nulls
To identify the number of nulls within each column, we can invoke the isnull() method on each column of the pandas data frame and sum the results.
If null values are found within a column, they can be replaced with the column mean using the fillna() method:
# Retrieving number of nulls in each column
print("Number of nulls in each column:")
print(iris_dataframe.isnull().sum())
# Filling null values with the column mean (assigning the result back avoids chained in-place modification)
iris_dataframe['sepal length (cm)'] = iris_dataframe['sepal length (cm)'].fillna(iris_dataframe['sepal length (cm)'].mean())
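The Iris data loaded from scikit-learn contains no missing values, so the fill step above has no visible effect on it. To see the technique at work, the sketch below injects a few NaN values into a copy of the data. The copy (`demo`) and the choice of rows 0 to 2 are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this sketch is self-contained
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

# Work on a copy and blank out three values (illustrative only)
demo = iris_dataframe.copy()
demo.loc[[0, 1, 2], 'sepal length (cm)'] = np.nan

# Count nulls per column
null_counts = demo.isnull().sum()
print(null_counts)

# mean() skips NaN by default, so it is computed from the observed values
col_mean = demo['sepal length (cm)'].mean()

# Replace the nulls with the column mean and assign the result back
demo['sepal length (cm)'] = demo['sepal length (cm)'].fillna(col_mean)

# Total number of nulls remaining in the copy
remaining_nulls = int(demo.isnull().sum().sum())
print(remaining_nulls)
```

Mean imputation keeps the column mean unchanged but reduces its variance, so it is worth checking how many values were filled before relying on later statistics.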
Data visualizations
Raw statistical values are hard to interpret at a glance. Visualizations make it easier to understand the data and detect patterns.
Here, we visualize our data using histograms, box plots, and scatter plots.
Histogram
We will plot the frequency of sepal length and sepal width of the flowers within our dataset. This helps us understand the underlying distribution:
# Histogram for sepal length and sepal width
fig = plt.figure(figsize= (10,5))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('sepal length (cm)')
ax1.set_ylabel('Count')
iris_dataframe['sepal length (cm)'].hist(ax=ax1)
ax2 = fig.add_subplot(122)
ax2.set_xlabel('sepal width (cm)')
ax2.set_ylabel('Count')
iris_dataframe['sepal width (cm)'].hist(ax=ax2)
plt.show()
Histograms for Sepal Length and Width (cm)
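The counts behind a histogram can also be inspected numerically. The sketch below uses numpy.histogram with 10 bins, which is the default bin count for the pandas hist() method, so the bin edges should correspond to those in the plot. This is offered as a companion check, not as part of the original walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this sketch is self-contained
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Bin counts and bin edges for sepal width, using 10 equal-width bins
counts, edges = np.histogram(iris_dataframe['sepal width (cm)'], bins=10)

# Print each bin's range alongside the number of flowers that fall in it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {count}")
```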
Box plot
We can look for outliers in the sepal width feature of our dataset and then decide whether or not to remove them:
# Creating a box plot
iris_dataframe.boxplot(column='sepal width (cm)', by='class')
title_boxplot = 'sepal width (cm) by class'
plt.title(title_boxplot)
plt.suptitle('')  # Remove the automatic "Boxplot grouped by class" super-title
plt.ylabel('sepal width (cm)')
plt.show()
Box Plot for Sepal Width (cm)
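The points drawn beyond the whiskers of a box plot can also be listed directly. By default, matplotlib places the whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies the same rule per class. The helper `iqr_outliers` is a hypothetical function written for this example, not a pandas or matplotlib API.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this sketch is self-contained
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target


def iqr_outliers(values, k=1.5):
    """Return the entries of a Series lying more than k * IQR outside the quartiles.

    Hypothetical helper for illustration; k=1.5 mirrors the default whisker rule.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Outlying sepal width values within each class
outliers_by_class = {
    cls: iqr_outliers(group['sepal width (cm)'])
    for cls, group in iris_dataframe.groupby('class')
}

for cls, outliers in outliers_by_class.items():
    print(f"class {cls}: {outliers.tolist()}")
```

Listing the flagged rows makes it easier to judge whether they are measurement errors or simply unusual flowers before deciding to drop them.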
Scatter plot
For each class of flowers within our dataset, we can examine how petal length and petal width are related to each other:
# Scatter plot of petal length and petal width for different classes
colors = ['red' if label == 0 else 'blue' if label == 1 else 'green' for label in iris_data.target]
plt.scatter(iris_dataframe['petal length (cm)'], iris_dataframe['petal width (cm)'], color=colors)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.show()
Scatter Plot for Petal Length vs. Petal Width
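The relationship seen in a scatter plot can be summarized with a correlation coefficient. The sketch below computes the Pearson correlation between petal length and petal width, both overall and within each class. It is an illustrative addition and assumes a linear relationship is a reasonable summary for this data.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the data frame so this sketch is self-contained
iris_data = load_iris()
iris_dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_dataframe['class'] = iris_data.target

length_col = 'petal length (cm)'
width_col = 'petal width (cm)'

# Pearson correlation across all flowers (Series.corr defaults to Pearson)
overall_corr = iris_dataframe[length_col].corr(iris_dataframe[width_col])
print(f"overall: {overall_corr:.3f}")

# Pearson correlation within each class
per_class_corr = {
    cls: group[length_col].corr(group[width_col])
    for cls, group in iris_dataframe.groupby('class')
}
for cls, corr in per_class_corr.items():
    print(f"class {cls}: {corr:.3f}")
```

Comparing the overall value with the per-class values shows how much of the apparent relationship comes from differences between species and how much holds within a single species.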