Hands-on Exercise 5B

Published

February 5, 2024

Modified

February 8, 2024

6 Visual Correlation Analysis

6.1 Overview and Learning Outcomes

This hands-on exercise is based on Chapter 6 of the R for Visual Analytics book.

The correlation coefficient is a popular statistic used to measure the direction and strength of the linear relationship between two variables. Its value ranges between -1.0 and 1.0.

  • A value of 1.0 shows a perfect positive linear relationship between the two variables.

  • A value of -1.0 shows a perfect inverse (negative) linear relationship between the two variables.

  • A value of 0.0 shows no linear relationship between the two variables.
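As a quick illustration of these three cases, the sketch below uses the base R cor() function on small made-up vectors (the vectors are assumptions for illustration, not part of the exercise data).

```r
# Small illustrative vector (made up for this sketch)
x <- -5:5

r_pos  <- cor(x, 2 * x + 1)   # y increases linearly with x
r_neg  <- cor(x, -3 * x + 4)  # y decreases linearly with x
r_zero <- cor(x, x^2)         # symmetric curve: x and y are related, but not linearly

round(c(positive = r_pos, negative = r_neg, none = r_zero), 3)
```

The last case is a reminder that a coefficient of 0.0 rules out a linear relationship only; the two variables may still be related in a non-linear way.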

When multivariate data are used, the correlation coefficients of the pairwise comparisons are displayed in a table known as a correlation matrix, or plotted as a scatterplot matrix.

There are three broad reasons for computing a correlation matrix:

  1. Reveal the pairwise relationships between high-dimensional variables.

  2. Input into other analyses - e.g., correlation matrices are commonly used as inputs for exploratory factor analysis, confirmatory factor analysis, structural equation models, and linear regression when excluding missing values pairwise.

  3. Check other analyses - e.g., high correlations among the predictors of a linear regression suggest that its estimates will be unreliable.
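A minimal sketch of a correlation matrix in R is given below. It assumes the built-in mtcars data as a stand-in for any numeric data frame and uses the cor() function in the stats package.

```r
# mtcars ships with base R; it stands in for any numeric data frame
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Pearson correlation matrix; "pairwise.complete.obs" drops missing
# values pair by pair rather than row by row
cor_mat <- cor(vars, use = "pairwise.complete.obs")

round(cor_mat, 2)

# Properties worth checking: symmetric, with ones on the diagonal
isSymmetric(cor_mat)
all.equal(unname(diag(cor_mat)), rep(1, ncol(cor_mat)))
```

Because the matrix is symmetric, half of it carries all the information, which is why many displays show only the upper or lower triangle.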

When the dataset is large, both in terms of the number of observations and the number of variables, a corrgram tends to be used to visually explore and analyse the structure and the patterns of relations among variables. It is designed based on two main schemes:

  1. Rendering the value of a correlation to depict its sign and magnitude; and

  2. Reordering the variables in a correlation matrix so that “similar” variables are positioned adjacently, facilitating perception.

The learning outcomes are:

  • Create a correlation matrix using the pairs() function in the graphics package.

  • Plot a corrgram using the corrplot package.

  • Create an interactive correlation matrix using the plotly package.

6.2 Getting Started

6.2.1 Installing and Loading Required Libraries

In this hands-on exercise, the following R packages are used:

  • tidyverse (i.e. readr, tidyr, dplyr) for performing data science tasks such as importing, tidying, and wrangling data;

  • ggstatsplot for plotting statistics; and

  • corrplot for plotting correlograms.

The code chunk below uses the p_load() function in the pacman package to check if the packages are installed. If yes, they are then loaded into the R environment. If no, they are installed, then loaded into the R environment.

pacman::p_load(tidyverse, ggstatsplot,
               corrplot)

6.2.2 Importing Data

The dataset for this hands-on exercise is imported into the R environment using the read_csv() function in the readr package and stored as the R object, wine. It contains data regarding the Wine Quality Data Set of the UCI Machine Learning Repository.

wine = read_csv("data/wine_quality.csv")

The tibble data frame, wine, has 13 columns and 6,497 rows. Apart from quality and type, the variables are continuous numerical data.

6.3 Building Correlation Matrix: pairs()

6.3.1 Building Basic Correlation Matrix

The pairs() function in the graphics package is used to plot a scatter plot matrix (11 by 11).

pairs(wine[,1:11])

The required input of the pairs() function can be a matrix or a data frame. In the plot below, columns 2 to 12 of wine are used to build the scatterplot matrix. The variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.

pairs(wine[,2:12])

6.3.2 Drawing Lower Corner

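The plot may be customised via the arguments of the pairs() function. For example, it is common practice to show either the upper half or the lower half of the correlation matrix instead of both, given that it is symmetric. Setting upper.panel = NULL keeps only the lower half, while setting lower.panel = NULL keeps only the upper half.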

pairs(wine[,2:12], upper.panel = NULL)

pairs(wine[,2:12], lower.panel = NULL)

6.3.3 Adding Correlation Coefficients

To show the correlation coefficient of each pair of variables instead of a scatter plot, a custom panel function, panel.cor(), is defined and passed to pairs(). It prints the absolute value of each coefficient and uses a larger font for stronger correlations.

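panel.cor = function(x, y, digits=2, prefix="", cex.cor, ...) {
  # Save the panel's coordinate system and restore it on exit
  usr = par("usr")
  on.exit(par(usr = usr))
  # Use a unit square so the text can be centred at (0.5, 0.5)
  par(usr = c(0, 1, 0, 1))
  # Absolute correlation, ignoring rows with missing values
  r = abs(cor(x, y, use="complete.obs"))
  txt = format(c(r, 0.123456789), digits=digits)[1]
  txt = paste(prefix, txt, sep="")
  if(missing(cex.cor)) cex.cor = 0.8/strwidth(txt)
  # Scale the font size with the strength of the correlation
  text(0.5, 0.5, txt, cex = cex.cor * (1 + r) / 2)
}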

pairs(wine[,2:12], 
      upper.panel = panel.cor)
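Because the panel function above prints absolute values, the sign of each correlation is not visible. The sketch below is one possible variant that keeps the sign and uses colour to encode it. The function name panel.cor.signed, the colours, and the use of the built-in mtcars data are assumptions made for illustration.

```r
# Variant of the panel function that prints the signed coefficient.
# panel.cor.signed is a hypothetical name used only for this sketch.
panel.cor.signed <- function(x, y, digits = 2, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))          # restore the panel's coordinates on exit
  par(usr = c(0, 1, 0, 1))         # unit square makes centring the text easy
  r   <- cor(x, y, use = "complete.obs")
  txt <- formatC(r, digits = digits, format = "f")
  # Font size grows with |r|; colour encodes the sign
  text(0.5, 0.5, txt,
       cex = 1 + 1.5 * abs(r),
       col = if (r < 0) "firebrick" else "steelblue")
}

# Built-in data stands in for the wine data here
pairs(mtcars[, c("mpg", "disp", "hp", "wt")],
      lower.panel = panel.smooth,    # scatter plots with a smoother
      upper.panel = panel.cor.signed)
```

Pairing scatter plots in one half with coefficients in the other uses both halves of the symmetric matrix without repeating information.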

6.4 Visualising Correlation Matrix: ggcorrmat()

A major limitation of the scatterplot matrix is that the scatter plots appear very cluttered when the number of observations is relatively large (i.e. more than 500 observations). To overcome this problem, the corrgram data visualisation technique suggested by D. J. Murdoch and E. D. Chow (1996) and Friendly, M. (2002) will be used.

Several R packages provide functions to plot corrgrams, including corrplot and corrgram. Other packages, such as ggstatsplot, also provide functions for building corrgrams.

6.4.1 Basic Plot

One of the advantages of using the ggcorrmat() function in the ggstatsplot package over other methods of visualising a correlation matrix is its ability to provide a comprehensive yet professional statistical report.

ggcorrmat(
  data = wine, 
  cor.vars = 1:11)

6.4.2 Improved Plot

A title and a subtitle are added to enhance the plot. Also, the variables in the matrix are reordered using hierarchical clustering (hc.order = TRUE).

ggcorrmat(
  data = wine, 
  cor.vars = 1:11,
  ggcorrplot.args = list(outline.color = "black", 
                         hc.order = TRUE,
                         tl.cex = 10),
  title    = "Correlogram for wine dataset",
  subtitle = "Four pairs are not significant at p < 0.05"
)

Note:

  • The “cor.vars” argument specifies the variables used to compute the correlation matrix needed to build the corrgram.

  • The “ggcorrplot.args” argument provides additional (mostly aesthetic) arguments that will be passed to the ggcorrplot::ggcorrplot() function. The list should avoid any of the following arguments since they are already being used internally: corr, method, p.mat, sig.level, ggtheme, colors, lab, pch, legend.title, digits.

Also, specific components of the plot, such as the font size of the x-axis, y-axis, and the statistical report, can be controlled via the “ggplot.component” argument of ggcorrmat() as follows.

ggplot.component = list(
    theme(text=element_text(size=5),
      axis.text.x = element_text(size = 8),
      axis.text.y = element_text(size = 8)))
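The fragment above is an argument rather than a complete call. A minimal sketch of how it might sit inside a full ggcorrmat() call is shown below; the built-in mtcars data and the four variables chosen are assumptions for illustration, and the ggplot2 and ggstatsplot packages are assumed to be installed.

```r
library(ggplot2)
library(ggstatsplot)

# Built-in data stands in for the wine data here
p <- ggcorrmat(
  data     = mtcars,
  cor.vars = c(mpg, disp, hp, wt),
  # ggplot2 components appended to the finished plot
  ggplot.component = list(
    theme(text        = element_text(size = 5),
          axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 8))
  )
)
p
```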

6.4.3 Building Multiple Plots

Since the ggstatsplot package is an extension of ggplot2, it also supports faceting. This feature is available via the grouped_ggcorrmat() function.

grouped_ggcorrmat(
  data = wine,
  cor.vars = 1:11,
  grouping.var = type,
  type = "robust",
  p.adjust.method = "holm",
  plotgrid.args = list(ncol = 2),
  ggcorrplot.args = list(outline.color = "black", 
                         hc.order = TRUE,
                         tl.cex = 10),
  annotation.args = list(
    tag_levels = "a",
    title = "Correlogram for wine dataset",
    subtitle = "The measures are: alcohol, sulphates, fixed acidity, citric acid, chlorides, residual sugar, density, free sulfur dioxide and volatile acidity",
    caption = "Dataset: UCI Machine Learning Repository"))

Note:

  • “grouping.var” argument is for building a facet plot.

  • Behind the function, the patchwork package is used to create the multiplot. The “plotgrid.args” argument provides a list of additional arguments passed to patchwork::wrap_plots, except for guides argument which is already separately specified earlier.

  • Likewise, the “annotation.args” argument is calling the plot annotation arguments of patchwork package.

6.5 Visualising Correlation Matrix: corrplot Package

The vignette “An Introduction to corrplot Package” provides a basic understanding of the corrplot package.

6.5.1 Basic Plot

The cor() function in the stats package is used to compute the correlation matrix of wine.

wine.cor = cor(wine[, 1:11])

Then, the corrplot() function in the corrplot package is used to plot the corrgram using the default settings.

corrplot(wine.cor)

Note:

  • The default visual object used to plot the corrgram is a circle.

  • The default layout is a symmetric matrix.

  • The default colour scheme is diverging blue-red. Blue colours represent pairs of variables with positive correlation coefficients and red colours represent pairs of variables with negative correlation coefficients.

  • The intensity of the colour (i.e., saturation) represents the strength of the correlation coefficient. Darker colours indicate a relatively stronger linear relationship between the paired variables. Lighter colours indicate a relatively weaker linear relationship.

6.5.2 Working with Visual Geometrics

In the corrplot package, there are seven visual geometrics (parameter method) that can be used to encode the attribute values. They are: circle, square, ellipse, number, shade, color and pie. The default is circle, which can be changed via the “method” argument.

corrplot(wine.cor, 
         method = "ellipse") 

6.5.3 Working with Layout

The corrplot() function supports three layout types, namely: “full”, “upper” or “lower”. The default is “full”, which displays the full matrix. The default setting can be changed via the “type” argument.

corrplot(wine.cor, 
         method = "ellipse", 
         type="lower")

The default layout of the corrgram can be further customised. For example, the “diag” and “tl.col” arguments are used to turn off the diagonal cells and to change the colour of the axis text labels to black, respectively.

corrplot(wine.cor, 
         method = "ellipse", 
         type="lower",
         diag = FALSE,
         tl.col = "black")

6.5.4 Working with Mixed Layout

With the corrplot package, it is possible to design a corrgram with a visual matrix in one half and a numerical matrix in the other half. To create a corrgram with such a mixed layout, the corrplot.mixed() function, a wrapper function for the mixed visualisation style, will be used.

corrplot.mixed(wine.cor, 
               lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Note:

  • The “lower” and “upper” arguments are used to define the visualisation method used. In this case ellipse is used to map the lower half of the corrgram and numerical matrix (i.e. number) is used to map the upper half of the corrgram.

  • The “tl.pos” argument is used to specify the placement of the axis label.

  • The “diag” argument is used to specify the glyph on the principal diagonal of the corrgram.

6.5.5 Combining Corrgram with Significance Test

In statistical analysis, we are also interested in knowing which pairs of variables have statistically significant correlation coefficients.

A corrgram can be combined with a significance test. The cor.mtest() function is used to compute the p-values and confidence intervals for each pair of variables from the data themselves (not from the correlation matrix). The resulting matrix of p-values is then passed to the “p.mat” argument.

wine.sig = cor.mtest(wine[, 1:11], conf.level= .95)

corrplot(wine.cor,
         method = "number",
         type = "lower",
         diag = FALSE,
         tl.col = "black",
         tl.srt = 45,
         p.mat = wine.sig$p,
         sig.level = .05)

The corrgram marks with a cross every pair whose p-value exceeds “sig.level”, so pairs that are not statistically significant at the 0.05 level can be identified at a glance. With a sample as large as this one (6,497 observations), even weak correlations can be statistically significant.
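To make explicit what a matrix of pairwise p-values contains, the base R sketch below builds one with cor.test() applied to the raw observations. The helper name pairwise_p, the choice to set the diagonal to 0, and the use of the built-in mtcars data are assumptions made for illustration; this is not the corrplot implementation.

```r
# pairwise_p() is a hypothetical helper written for this sketch
pairwise_p <- function(data) {
  k <- ncol(data)
  # Diagonal is left at 0: a variable is trivially correlated with itself
  p <- matrix(0, nrow = k, ncol = k,
              dimnames = list(names(data), names(data)))
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      # cor.test() tests H0: true correlation equals zero (Pearson by default)
      p[i, j] <- p[j, i] <- cor.test(data[[i]], data[[j]])$p.value
    }
  }
  p
}

# Built-in data stands in for the wine data here
p_mat <- pairwise_p(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
round(p_mat, 4)

# Pairs that are NOT significant at the 5% level
which(p_mat > 0.05 & upper.tri(p_mat), arr.ind = TRUE)
```

The key point is that the test needs the observations, since the p-value depends on the sample size as well as on the size of the coefficient.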

6.5.6 Reordering Corrgram

Matrix reordering is very important for mining the hidden structure and patterns in a corrgram.

By default, the attributes of a corrgram follow the order of the variables in the correlation matrix (i.e. “original”). The default setting can be overridden using the “order” argument.

Currently, the corrplot package supports four sorting methods:

  • “AOE” is for the angular order of the eigenvectors. See Michael Friendly (2002) for details.

  • “FPC” for the first principal component order.

  • “hclust” for hierarchical clustering order, and “hclust.method” for the agglomeration method to be used.

    • “hclust.method” should be one of “ward”, “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”.

  • “alphabet” for alphabetical order.

More algorithms can be found in the seriation package.

corrplot.mixed(wine.cor, 
               lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               order="AOE",
               tl.col = "black")

6.5.7 Reordering Correlation Matrix with hclust Method

When the “order” argument is set to “hclust”, the “addrect” argument can be used to draw rectangles around groups of variables in the corrgram based on the results of hierarchical clustering.

corrplot(wine.cor, 
         method = "ellipse", 
         tl.pos = "lt",
         tl.col = "black",
         order="hclust",
         hclust.method = "ward.D",
         addrect = 3)
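The idea behind clustering-based reordering can be sketched in base R as follows. This is an illustration under stated assumptions (1 - r as the dissimilarity, the built-in mtcars data as a stand-in) and is not necessarily identical to what corrplot does internally.

```r
# Built-in data stands in for the wine data here
cor_mat <- cor(mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")])

# Treat 1 - r as a dissimilarity: highly correlated variables are "close"
hc <- hclust(as.dist(1 - cor_mat), method = "ward.D")

# Variable order along the dendrogram, and a three-cluster cut
ord      <- hc$order
clusters <- cutree(hc, k = 3)

# Reordered matrix: similar variables now sit next to each other
round(cor_mat[ord, ord], 2)
clusters[ord]
```

Cutting the tree into three groups mirrors the role of addrect = 3 above: each rectangle corresponds to one cluster of mutually similar variables.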

6.6 References

  • Michael Friendly (2002). “Corrgrams: Exploratory displays for correlation matrices”. The American Statistician, 56, 316–324.

  • D.J. Murdoch, E.D. Chow (1996). “A graphical display of large correlation matrices”. The American Statistician, 50, 178–180.

  • ggscatmat() and ggpairs() functions of GGally.

  • corrgram calculates correlation of variables and displays the results graphically. Included panel functions can display points, shading, ellipses, and correlation values with confidence intervals.

~~~ End of Hands-on Exercise 5B ~~~