What is exploratory data analysis?

Posted on: May 31, 2023


Exploratory data analysis (EDA) is a preliminary method of data analysis used to help people analyse, understand, characterise, and summarise data.

During the exploratory analysis phase, data analysts can use EDA techniques and machine learning algorithms to support their explorations, and often use data visualisation methods to help identify and illustrate data:

trends
patterns
relationships, correlations, and connections
outliers
anomalies.

Through this data exploration, data analysts can then interrogate datasets to form hypotheses, answer questions, and spot missing values.
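As a minimal sketch of spotting missing values, the snippet below counts how many entries are missing in each column. The records, column names, and the count_missing helper are hypothetical, and None is assumed to mark a missing value. It uses only the Python standard library.

```python
# Hypothetical records; None is assumed to mark a missing value.
records = [
    {"age": 34, "city": "Leeds", "income": 41000},
    {"age": None, "city": "Stoke", "income": 38500},
    {"age": 29, "city": None, "income": None},
    {"age": None, "city": "Derby", "income": 52000},
]


def count_missing(rows):
    """Return a dict mapping each column name to its number of missing values."""
    columns = {key for row in rows for key in row}
    return {
        col: sum(1 for row in rows if row.get(col) is None)
        for col in sorted(columns)
    }


missing = count_missing(records)
print(missing)
```

A column with a high count is a candidate for closer inspection during data cleaning, before any summary statistics are trusted.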

The EDA process typically begins with data collection and data cleaning before datasets are analysed. Following data mining and analysis, the main characteristics are summarised, statistical modelling can occur, and data can be illustrated through graphical visualisations.

After this has occurred, confirmatory data analysis can commence.

What are the four primary types of exploratory data analysis?

While there are several types of exploratory data analysis, experts in data analysis generally agree that there are four primary types.

Univariate non-graphical

Univariate non-graphical analysis is known as the simplest form of data analysis. This is because the data it analyses has a single variable, so it doesn’t have to consider relationships between data variables. Instead, it aims to describe data and find patterns within it.
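A univariate non-graphical summary can be sketched with Python's built-in statistics module. The sample values below are hypothetical (imagine delivery times in days), and the choice of statistics is an illustrative assumption rather than a fixed recipe.

```python
import statistics

# Hypothetical sample of a single numeric variable.
values = [2, 3, 3, 4, 5, 5, 5, 7, 9, 12]

# Describe the centre, spread, and range of the one variable.
summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "stdev": statistics.stdev(values),  # sample standard deviation
    "min": min(values),
    "max": max(values),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For tabular data, the pandas method DataFrame.describe() produces a comparable summary in one call.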

Univariate graphical

Univariate graphical analysis offers a more thorough understanding of the data being analysed, presenting quantitative data in a graphical format.

There are several different types of univariate graphical visualisations used in univariate analysis, including:

Histograms – group numerical data into intervals (bins) and use one bar per bin to illustrate the distribution of the data.
Stem-and-leaf plots – also known as a stem-and-leaf display, these visualisations show all data values as well as the shape of their distribution.
Box plots – developed by American mathematician and statistician John Tukey in the 20th century, box plots offer a visual representation of the median and quartiles within datasets. This includes the minimum value, lower quartile, median, upper quartile, and maximum value. Through this visualisation, data scientists and statisticians can pinpoint things like skewness, which measures distribution asymmetry in data. Another element of box plots is what’s known as whiskers, which are lines that extend from the box plot to indicate variability outside the upper and lower quartiles.
Violin plots – visualise the distribution of numerical data. They are similar to box plots but depict the shape of the distribution as well as the summary statistics.
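The numbers behind a box plot can be computed without drawing anything. The sketch below derives the five-number summary for a hypothetical sample, then applies the common 1.5 × IQR convention for whisker ends and outliers. That convention, the box_plot_stats helper, and the "inclusive" quantile method are assumptions made for illustration; plotting libraries may use slightly different quartile rules.

```python
import statistics

# Hypothetical sample with one unusually large value.
values = [2, 3, 3, 4, 5, 5, 5, 7, 9, 30]


def box_plot_stats(data):
    """Return the five-number summary plus whisker ends and outliers."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the height of the box
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        # Whiskers stop at the most extreme points still inside the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }


stats = box_plot_stats(values)
print(stats)
```

Under this convention, a median that sits closer to one quartile than the other, or whiskers of very different lengths, hints at skewness in the data.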

Multivariate non-graphical

Multivariate non-graphical analysis is similar to univariate non-graphical analysis, with the key difference being that it considers more than one variable. It can show the relationship between two or more variables in data through cross-tabulation or statistics.
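Cross-tabulation simply counts how often each combination of two categorical variables occurs. A minimal sketch, using hypothetical survey rows and a hypothetical cross_tabulate helper built on the standard library:

```python
from collections import Counter

# Hypothetical survey rows: (region, preferred_channel).
rows = [
    ("North", "email"), ("North", "phone"), ("North", "email"),
    ("South", "phone"), ("South", "phone"), ("South", "email"),
    ("North", "email"),
]


def cross_tabulate(pairs):
    """Count each (row, column) combination and return a nested dict table."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}


table = cross_tabulate(rows)
for region, cells in table.items():
    print(region, cells)
```

In pandas, the crosstab function builds the same kind of table directly from two columns.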

Multivariate graphical

Multivariate graphical analysis uses graphics to display relationships between multiple variables. In multivariate analysis, this is done through either grouping or faceting:

Grouping maps the values of the first two variables on the x-axis and the y-axis. Additional variables can then be mapped using other visual characteristics, such as colour, shape, and line type. This graphical method means data scientists can plot the data for multiple groups in a single graph.
Faceting graphs, such as bar charts, consist of several separate plots – one for each level of a third variable, or combination of variables.

Exploratory data analysis techniques and tools

Python

Python is one of the most commonly used programming languages in the exploratory analysis of data points.

Online libraries

pandas is an open-source software library for data analysis in Python.
Matplotlib is a data visualisation and graphical plotting library that can create static, animated, and interactive visualisations in Python.
Seaborn is a Python data visualisation library that provides a high-level interface for drawing statistical graphics.

Scatterplot

A scatterplot is one of the more common graphical techniques used for multivariate data. It plots data points on a horizontal and a vertical axis to show how strongly one variable is related to another.

Clustering and dimension reduction

Clustering and dimension reduction techniques can create graphical data displays even where there are many variables.

Bivariate visualisations and summary statistics

Bivariate visualisations and summary statistics are used to assess the relationship between each variable in a dataset and a target variable.

K-means clustering

K-means is a clustering method that assigns data points to K groups, where K is the number of clusters chosen in advance, based on each point’s distance from each group’s centroid. The data points closest to a particular centroid are assigned to the same cluster.
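The assign-then-update loop at the heart of K-means can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not a production one: the k_means function, its parameters, the random choice of starting centroids, and the sample points are all hypothetical.

```python
import math
import random


def k_means(points, k, iterations=20, seed=0):
    """Cluster points into k groups; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stopped changing
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = k_means(points, k=2)
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Results can depend on the starting centroids, so library implementations typically run several initialisations and keep the best one.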

Predictive models

Predictive models analyse data to predict outcomes. One predictive model example is linear regression analysis, which predicts the value of one variable based on the value of another variable.
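A simple linear regression with one predictor can be fitted by ordinary least squares. The sketch below uses hypothetical data (study hours against test scores) and a hypothetical fit_line helper to show the arithmetic.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariation of x with y, and variation of x, about their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # predicted score for 6 hours of study
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
print(f"predicted score for 6 hours: {predicted:.1f}")
```

Predictions far outside the range of the observed data, such as 50 hours here, should be treated with caution because the linear pattern may not hold.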

Correlation heat map

A correlation heat map is a visual graphic that shows how strongly each pair of variables in a dataset is correlated, typically by colouring each cell of a correlation matrix.
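The matrix behind a correlation heat map can be sketched as follows, using the Pearson correlation coefficient. The choice of Pearson, the helper functions, and the data are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return a nested dict holding the correlation for every pair of columns."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Hypothetical dataset with three numeric variables.
data = {
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 64, 68],
    "absences": [9, 7, 6, 4, 1],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

With pandas and Seaborn, DataFrame.corr() computes this matrix and seaborn.heatmap() colours it.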

What is the difference between exploratory data analysis and confirmatory data analysis?

Exploratory data analysis typically precedes confirmatory data analysis. While the exploratory analysis generates hypotheses, spots missing values, and so on, the confirmatory analysis tests whether a hypothesis is supported by the data.

What is the difference between descriptive data analysis and exploratory data analysis?

Descriptive data analysis often precedes exploratory data analysis in the data analysis process. The descriptive analysis summarises data, including patterns and measurements, while exploratory analysis goes deeper, identifying relationships and correlations between data variables.

Explore data analysis techniques in greater depth

Use data analysis to lead organisations to better decision-making, insight, and competitive advantage by studying the 100% online MSc Management with Data Analytics at Keele University. This flexible, part-time programme has been designed for leaders and aspiring leaders who are aiming to progress into more senior roles and want to develop a firm understanding of the strategic and operational challenges in running an organisation, particularly through the lens of harnessing data for success.

One of the key modules on this master’s degree explores visualisation for data analytics, providing you with a comprehensive understanding of the use of data analytics within areas such as health, security, science, and business. The module also equips you with a variety of data science visualisation techniques to enable you to make sense of the emergence and growth of big data.