Hands-on Exercise 5D

Published

February 5, 2024

Modified

February 1, 2024

15 Visual Multivariate Analysis with Parallel Coordinates Plot

15.1 Overview and Learning Outcomes

This hands-on exercise is based on Chapter 15 of the R for Visual Analytics book.

A parallel coordinates plot is a data visualisation specially designed for visualising and analysing multivariate, numerical data. It is ideal for comparing multiple variables and seeing the relationships between them, for example the variables contributing to a happiness index.

Parallel coordinates were popularised by Alfred Inselberg in the 1970s as a way to visualise high-dimensional data. This data visualisation technique is more often found in academic and scientific communities than in business and consumer data visualisations. As pointed out by Stephen Few (2006), “This certainly isn’t a chart that you would present to the board of directors or place on your web site for the general public. In fact, the strength of parallel coordinates isn’t in their ability to communicate some truth in the data to others, but rather in their ability to bring meaningful multivariate patterns and comparisons to light when used interactively for analysis.” For example, a parallel coordinates plot can be used to characterise clusters detected during customer segmentation.

The learning outcomes are:

  • Plot a static parallel coordinates plot using the ggparcoord() function in the GGally package;

  • Plot an interactive parallel coordinates plot using the parcoords package; and

  • Plot an interactive parallel coordinates plot using the parallelPlot package.

15.2 Getting Started

15.2.1 Installing and Loading Required Libraries

In this hands-on exercise, the following R packages are used:

  • tidyverse (i.e. readr, tidyr, dplyr) for performing data science tasks such as importing, tidying, and wrangling data;

  • GGally (ggplot2 extension) for plotting pairwise plot matrices, scatterplot matrices, parallel coordinates plots, and survival plots, along with several functions for plotting networks; and

  • parallelPlot for plotting interactive parallel coordinates plots.

The code chunk below uses the p_load() function in the pacman package to check if the packages are installed. If yes, they are then loaded into the R environment. If no, they are installed, then loaded into the R environment.

pacman::p_load(tidyverse, GGally,
               parallelPlot)
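The check-install-load behaviour described above can be approximated in base R. The sketch below is only an illustration of the idea, not how pacman is implemented; the helper names find_missing() and load_or_install() are hypothetical.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install whatever is missing, then attach every package.
load_or_install <- function(pkgs) {
  missing_pkgs <- find_missing(pkgs)
  if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Packages that ship with R are always installed, so nothing is reported missing.
base_missing <- find_missing(c("stats", "utils"))

# Usage for this exercise (commented out because it may trigger installation):
# load_or_install(c("tidyverse", "GGally", "parallelPlot"))
```

In practice, p_load() remains the more convenient choice; the sketch only makes the three steps explicit.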

15.2.2 Importing Data

The dataset for this hands-on exercise is imported into the R environment using the read_csv() function in the readr package and stored as the R object wh. The data is from the World Happiness Report 2018.

wh = read_csv("data/WHData-2018.csv")

The tibble data frame, wh, has 12 columns and 156 rows. Other than the “Country” and “Region” variables, the remaining variables are continuous numerical data.

15.3 Plotting Static Parallel Coordinates Plot

In this sub-section, the ggparcoord() function in the GGally package is used to plot static parallel coordinates plots.

15.3.1 Basic Plot

A basic static plot is plotted below.

ggparcoord(data = wh, 
           columns = c(7:12))

Note: Only two arguments, “data” and “columns”, are used. The “data” argument is used to map the data object (i.e. wh) and the “columns” argument is used to select the columns, by position, for preparing the parallel coordinates plot.
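Because the columns are selected by position, it can help to confirm which variables those positions refer to before plotting. The sketch below uses a small toy data frame as a stand-in for wh; its column names are hypothetical and are not those of WHData-2018.csv.

```r
# Toy stand-in for wh (hypothetical columns, not the real dataset).
toy <- data.frame(
  Country = c("A", "B", "C"),
  Region  = c("R1", "R1", "R2"),
  score   = c(7.1, 5.4, 4.2),
  gdp     = c(1.3, 0.9, 0.6)
)

cols <- 3:4

# Names of the variables that the positions refer to.
selected_names <- names(toy)[cols]

# TRUE only when every selected column is numeric.
all_numeric <- all(vapply(toy[cols], is.numeric, logical(1)))
```

The same two checks can be run on wh with cols set to 7:12.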

15.3.2 Adding Boxplot

The basic parallel coordinates plot reveals little about the World Happiness measures. Further arguments are therefore added to the ggparcoord() function to reveal more insights.

ggparcoord(data = wh, 
           columns = c(7:12), 
           groupColumn = 2,
           scale = "uniminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Parallel Coordinates Plot of World Happiness Variables")

Note:

  • The “groupColumn” argument is used to group the observations (i.e. parallel lines) using a single variable (i.e. “Region”) and colour the parallel coordinates lines by “Region”.

  • The “scale” argument is used to scale the variables in the parallel coordinates plot using the “uniminmax” method. The method scales each variable separately so that its minimum is zero and its maximum is one.

  • The “alphaLines” argument is used to reduce the intensity of the line colour to 0.2. The permissible values range from 0 to 1.

  • The “boxplot” argument is set to “TRUE” to add the boxplot. The default is FALSE.

  • The “title” argument is used to provide a title for the parallel coordinates plot.
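To make the “uniminmax” scaling concrete, the sketch below applies the same min-max idea to a small vector in base R. The function name rescale01() is hypothetical and is not part of GGally; it only mirrors the description given in the notes.

```r
# Hypothetical helper: rescale a numeric vector so its minimum is 0 and its maximum is 1.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(2, 4, 10)
scaled <- rescale01(x)
scaled  # 0.00 0.25 1.00
```

Scaling each variable this way places all axes on a common 0 to 1 range, so variables measured in different units can be compared on the same plot.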

15.3.3 Adding Facet

Since the ggparcoord() function is developed by extending the ggplot2 package, some ggplot2 functions can be added when plotting a parallel coordinates plot.

The facet_wrap() function in the ggplot2 package is used below to plot 10 small multiple parallel coordinates plots, one for each geographical region. However, one of the aesthetic defects of the plot is that some of the variable names overlap on the x-axis.

ggparcoord(data = wh, 
           columns = c(7:12), 
           groupColumn = 2,
           scale = "uniminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plots of World Happiness Variables by Region") +
  facet_wrap(~ Region)

15.3.4 Rotating x-axis Text Label

To make the x-axis text labels readable, they are rotated by 30 degrees using the theme() function in the ggplot2 package. The “axis.text.x” argument is set with the value of “element_text(angle = 30)”.

ggparcoord(data = wh, 
           columns = c(7:12), 
           groupColumn = 2,
           scale = "uniminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plots of World Happiness Variables by Region") +
  facet_wrap(~ Region) + 
  theme(axis.text.x = element_text(angle = 30))

15.3.5 Adjusting Rotated x-axis Text Label

Rotating the x-axis text labels by 30 degrees causes them to overlap with the plot. This can be avoided by adjusting the text position with the “hjust” argument of element_text(), i.e. by setting “axis.text.x” to “element_text(angle = 30, hjust = 1)” in the theme() function.

ggparcoord(data = wh, 
           columns = c(7:12), 
           groupColumn = 2,
           scale = "uniminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plots of World Happiness Variables by Region") +
  facet_wrap(~ Region) + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

15.4 Plotting Interactive Parallel Coordinates Plot: parallelPlot Package

The parallelPlot package is specially designed to plot parallel coordinates plots using the htmlwidgets package and d3.js (JavaScript library). In this sub-section, the parallelPlot package is used to build interactive parallel coordinates plots.

15.4.1 Basic Plot

The parallelPlot() function in the parallelPlot package is used to make the basic plot below. Some of the axis labels are too long.

wh = wh %>%
  select("Happiness score", c(7:12))
parallelPlot(wh,
             width = 320,
             height = 250)

15.4.2 Rotating Axis Label

The “rotateTitle” argument is added to avoid overlapping axis labels.

One of the useful interactive features of a plot made with the parallelPlot() function is the ability to click on a variable of interest, which then colours the lines according to that variable, shown as different colour intensities across the plot.

parallelPlot(wh,
             rotateTitle = TRUE)

15.4.3 Changing Colour Scheme

The default blue colour scheme can be changed using the “continuousCS” argument.

parallelPlot(wh,
             continuousCS = "YlOrRd",
             rotateTitle = TRUE)

15.4.4 Adding Histogram

A histogram along the axis of each variable can be plotted using the “histoVisibility” argument.

histoVisibility = rep(TRUE, ncol(wh))
parallelPlot(wh,
             rotateTitle = TRUE,
             histoVisibility = histoVisibility)
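The rep(TRUE, ncol(wh)) call above builds one logical value per column. Assuming “histoVisibility” accepts such a vector with one entry per data column (as the code above suggests), histograms could be limited to selected variables by building the vector from the column names. The sketch below uses a toy data frame with hypothetical column names as a stand-in for wh.

```r
# Toy stand-in for wh (hypothetical columns, not the real dataset).
toy <- data.frame(
  score   = c(7.1, 5.4),
  gdp     = c(1.3, 0.9),
  freedom = c(0.6, 0.4)
)

# One logical per column: TRUE where a histogram is wanted.
show_hist <- names(toy) %in% c("score", "freedom")
show_hist  # TRUE FALSE TRUE
```

The resulting vector would then be passed as histoVisibility = show_hist in place of the all-TRUE vector.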

~~~ End of Hands-on Exercise 5D ~~~