Chapter 2, data visualization using the grammar of graphics

This chapter explains the grammar of graphics, which is a powerful model for describing a large class of data visualizations. After reading this chapter, you will be able to

State the advantages of the grammar of graphics relative to previous plotting systems
Install the animint2 R package
Translate plot sketches into ggplot code in R
Render ggplots on web pages using animint2
Create multi-layer ggplots
Create multi-panel ggplots

History and purpose of the grammar of graphics

Most computer systems for data analysis provide functions for creating plots to visualize patterns in data. The oldest systems provide very general functions for drawing basic plot components such as lines and points (e.g. the graphics and grid packages in R). If you use one of these general systems, then it is your job to put the components together to form a meaningful, interpretable plot. The advantage of general systems is that they impose few limitations on what kinds of plots can be created. The disadvantage is that general systems typically do not provide functions for automating common plotting tasks (axes, panels, legends).

To overcome the disadvantages of these general plotting systems, charting packages such as lattice were developed (Sarkar, 2008). Such packages have several pre-defined chart types, and provide a dedicated function for creating each chart type. For example, lattice provides the bwplot function for making box and whisker plots. The advantage of such systems is that they make it much easier to create entire plots, including a legend and panels. The disadvantage is that the set of chart types is fixed in advance, so it is not easy to create more complex graphics that do not fit one of the pre-defined types.

Newer plotting systems based on the grammar of graphics are situated between these two extremes. Wilkinson proposed the grammar of graphics in order to describe and create a large class of plots (Wilkinson, 2005). Wickham later implemented several ideas from the grammar of graphics in the ggplot2 R package (Wickham, 2009). The ggplot2 package has several advantages with respect to previous plotting systems.

Like general plotting systems, and unlike lattice, ggplot2 imposes few limitations on the types of plots that can be created (there are no pre-defined chart types).
Unlike general plotting systems, and like lattice, ggplot2 makes it easy to include common plot elements such as axes, panels, and legends.
Since ggplot2 is based on the grammar of graphics, an explicit mapping of data variables to visual properties is required. Later in this chapter, we will explain how this mapping allows sketches of plot ideas to be directly translated into R code.

Finally, all of the previously discussed plotting systems are intended for creating static graphics, which can be viewed equally well on a computer screen or on paper. However, the main topic of this manual is animint2, an R package for interactive graphics. In contrast to static graphics, interactive graphics are best viewed on a computer with a mouse and keyboard that can be used to interact with the plot.

Since many concepts from static graphics are also useful in interactive graphics, the animint2 package is implemented as an extension/fork of ggplot2. In this chapter we will introduce the main features of ggplot2 which will also be useful for interactive plot design in later chapters.

In 2013, we created the animint package, which depends on the ggplot2 package. However, during 2014-2017, the ggplot2 package introduced many changes that were incompatible with the interactive grammar of animint. Therefore, in 2018 we created the animint2 package, which copies/forks the relevant parts of the ggplot2 package. Now animint2 can be used without having ggplot2 installed. In fact, it is recommended to use animint2 without attaching (via library) ggplot2. However, it is fine to use animint2 along with packages that import/load ggplot2. For an example, see Chapter 16, which uses the penaltyLearning package (which imports ggplot2).

Installing and attaching animint2

To install the most recent release of animint2 from CRAN,

if(!requireNamespace("animint2"))install.packages("animint2")

To install an even more recent development version of animint2 from GitHub,

if(!requireNamespace("animint2")){
  if(!requireNamespace("remotes"))install.packages("remotes")
  remotes::install_github("tdhock/animint2")
}

Once you have installed animint2, you can load and attach all of its exported functions via:

library(animint2)
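
The recommendation above (attach animint2, but not ggplot2) can be checked from the R console. The following is a minimal sketch using only base R functions; the variable names has.animint2 and ggplot2.attached are chosen here for illustration.

```r
# Is animint2 installed? requireNamespace returns TRUE or FALSE,
# and loads the namespace without attaching the package.
has.animint2 <- requireNamespace("animint2", quietly=TRUE)
if(has.animint2){
  # Report which version is installed.
  print(packageVersion("animint2"))
}else{
  message("animint2 is not installed; see the install commands above.")
}
# Is ggplot2 attached to the search path (for example via library(ggplot2))?
ggplot2.attached <- "package:ggplot2" %in% search()
if(ggplot2.attached){
  message("ggplot2 is attached; consider restarting R and attaching only animint2.")
}
```

Packages that merely import ggplot2 do not add package:ggplot2 to the search path, so this check does not flag them.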

Translating plot sketches into ggplots

This section explains how to translate a plot sketch into R code. We use a data set from the World Bank as an example, and we begin by loading and looking at a subset of these data.

data(WorldBank, package="animint2")
WorldBank$Region <- sub(" (all income levels)", "", WorldBank$region, fixed=TRUE)
WorldBankSome <- subset(WorldBank, year %% 5 == 0)
table(WorldBankSome$year)

## 
## 1960 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 
##  214  214  214  214  214  214  214  214  214  214  214

head(WorldBankSome)

##     iso2c country year fertility.rate life.expectancy population
## 266    AD Andorra 1960             NA              NA      13414
## 271    AD Andorra 1965             NA              NA      18551
## 276    AD Andorra 1970             NA              NA      24279
## 281    AD Andorra 1975             NA              NA      30706
## 286    AD Andorra 1980             NA              NA      36063
## 291    AD Andorra 1985             NA              NA      44597
##     GDP.per.capita.Current.USD 15.to.25.yr.female.literacy iso3c
## 266                         NA                          NA   AND
## 271                         NA                          NA   AND
## 276                   3238.089                          NA   AND
## 281                   7168.399                          NA   AND
## 286                  12377.714                          NA   AND
## 291                   7775.028                          NA   AND
##                                        region          capital longitude
## 266 Europe & Central Asia (all income levels) Andorra la Vella    1.5218
## 271 Europe & Central Asia (all income levels) Andorra la Vella    1.5218
## 276 Europe & Central Asia (all income levels) Andorra la Vella    1.5218
## 281 Europe & Central Asia (all income levels) Andorra la Vella    1.5218
## 286 Europe & Central Asia (all income levels) Andorra la Vella    1.5218
## 291 Europe & Central Asia (all income levels) Andorra la Vella    1.5218
##     latitude               income        lending                Region
## 266  42.5075 High income: nonOECD Not classified Europe & Central Asia
## 271  42.5075 High income: nonOECD Not classified Europe & Central Asia
## 276  42.5075 High income: nonOECD Not classified Europe & Central Asia
## 281  42.5075 High income: nonOECD Not classified Europe & Central Asia
## 286  42.5075 High income: nonOECD Not classified Europe & Central Asia
## 291  42.5075 High income: nonOECD Not classified Europe & Central Asia

tail(WorldBankSome)

##       iso2c  country year fertility.rate life.expectancy population
## 13011    ZW Zimbabwe 1985          6.224        61.44741    8860331
## 13016    ZW Zimbabwe 1990          5.187        60.52893   10461782
## 13021    ZW Zimbabwe 1995          4.369        53.09839   11639364
## 13026    ZW Zimbabwe 2000          3.862        44.61798   12503652
## 13031    ZW Zimbabwe 2005          3.606        43.86159   12710589
## 13036    ZW Zimbabwe 2010          3.290        49.86088   13076978
##       GDP.per.capita.Current.USD 15.to.25.yr.female.literacy iso3c
## 13011                   636.2357                          NA   ZWE
## 13016                   839.6100                          NA   ZWE
## 13021                   610.9673                          NA   ZWE
## 13026                   535.0403                          NA   ZWE
## 13031                   452.7890                          NA   ZWE
## 13036                   568.4275                    99.55316   ZWE
##                                       region capital longitude latitude
## 13011 Sub-Saharan Africa (all income levels)  Harare   31.0672 -17.8312
## 13016 Sub-Saharan Africa (all income levels)  Harare   31.0672 -17.8312
## 13021 Sub-Saharan Africa (all income levels)  Harare   31.0672 -17.8312
## 13026 Sub-Saharan Africa (all income levels)  Harare   31.0672 -17.8312
## 13031 Sub-Saharan Africa (all income levels)  Harare   31.0672 -17.8312
## 13036 Sub-Saharan Africa (all income levels)  Harare   31.0672 -17.8312
##           income lending             Region
## 13011 Low income   Blend Sub-Saharan Africa
## 13016 Low income   Blend Sub-Saharan Africa
## 13021 Low income   Blend Sub-Saharan Africa
## 13026 Low income   Blend Sub-Saharan Africa
## 13031 Low income   Blend Sub-Saharan Africa
## 13036 Low income   Blend Sub-Saharan Africa

dim(WorldBankSome)

## [1] 2354   16

The WorldBank data set consists of measures such as fertility rate and life expectancy for each country over the period 1960-2010. The code above prints the first and last few rows, and the dimension of the data table, for the subset of every 5 years (2354 rows and 16 columns).
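
The two data preparation steps used when loading the data (the modulo filter on year, and the substitution that creates the Region column) can be tried on small inputs. The sketch below uses toy vectors, not the real WorldBank table; the names toy.years, kept.years, toy.region, and toy.Region are made up for this example, and the region labels are copied from the printed output in this chapter.

```r
# Toy vector of years, standing in for the year column.
toy.years <- 1960:2010
# year %% 5 == 0 keeps the years that are multiples of 5.
kept.years <- toy.years[toy.years %% 5 == 0]
print(kept.years)
# Toy region labels, one with the suffix and one without.
toy.region <- c(
  "Europe & Central Asia (all income levels)",
  "South Asia")
# fixed=TRUE treats the pattern as a literal string, so the
# parentheses are not interpreted as regular expression syntax.
toy.Region <- sub(" (all income levels)", "", toy.region, fixed=TRUE)
print(toy.Region)
```

Labels that do not contain the suffix are returned unchanged, which is why the substitution can be applied to the whole column at once.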

Suppose that we are interested to see if there is any relationship between life expectancy and fertility rate. We could fix one year, then use those two data variables in a scatterplot. Consider the figure below which sketches the main components of that data visualization.

World Bank scatterplot

The sketch above shows life expectancy on the horizontal (x) axis, fertility rate on the vertical (y) axis, and a legend for the region. These elements of the sketch can be directly translated into R code using the following method. First, we need to construct a data table that has one row for every country in 1975, and columns named life.expectancy, fertility.rate, and Region. The WorldBank data already has these columns (Region was added when we loaded the data), so all we need to do is consider the subset for the year 1975:

WorldBank1975 <- subset(WorldBank, year==1975)
head(WorldBank1975)

##     iso2c              country year fertility.rate life.expectancy population
## 281    AD              Andorra 1975             NA              NA      30706
## 334    AE United Arab Emirates 1975          6.009        66.18539     532742
## 387    AF          Afghanistan 1975          7.692        37.25712   12551790
## 440    AG  Antigua and Barbuda 1975             NA              NA      69253
## 493    AL              Albania 1975          4.417        68.32583    2426592
## 546    AM              Armenia 1975          2.745        70.52751    2825650
##     GDP.per.capita.Current.USD 15.to.25.yr.female.literacy iso3c
## 281                  7168.3987                          NA   AND
## 334                 27631.8985                    56.27697   ARE
## 387                   188.5521                          NA   AFG
## 440                         NA                          NA   ATG
## 493                         NA                          NA   ALB
## 546                         NA                          NA   ARM
##                                             region          capital longitude
## 281      Europe & Central Asia (all income levels) Andorra la Vella    1.5218
## 334 Middle East & North Africa (all income levels)        Abu Dhabi   54.3705
## 387                                     South Asia            Kabul   69.1761
## 440  Latin America & Caribbean (all income levels)     Saint John's  -61.8456
## 493      Europe & Central Asia (all income levels)           Tirane   19.8172
## 546      Europe & Central Asia (all income levels)          Yerevan    44.509
##     latitude               income        lending                     Region
## 281  42.5075 High income: nonOECD Not classified      Europe & Central Asia
## 334  24.4764 High income: nonOECD Not classified Middle East & North Africa
## 387  34.5228           Low income            IDA                 South Asia
## 440  17.1175  Upper middle income           IBRD  Latin America & Caribbean
## 493  41.3317  Upper middle income           IBRD      Europe & Central Asia
## 546  40.1596  Lower middle income          Blend      Europe & Central Asia

The code above prints the data for 1975, which clearly has the appropriate columns, and one row for each country. The next step is to use the notes in the sketch to code a ggplot with a corresponding aes or aesthetic mapping of data variables to visual properties:

scatter <- ggplot()+
  geom_point(
    mapping=aes(x=life.expectancy, y=fertility.rate, color=Region),
    data=WorldBank1975)
scatter

## Warning: Removed 27 rows containing missing values (geom_point).

The aes function is called with names for visual properties (x, y, color) and values for the corresponding data variables (life.expectancy, fertility.rate, Region). This mapping is applied to the variables in the WorldBank1975 data table, in order to create the visual properties of the geom_point. The ggplot was saved as the scatter object, which when printed on the R command line shows the plot on a graphics device. Note that we automatically have a Region color legend.
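
The warning about removed rows arises because some countries have missing values for life expectancy or fertility rate in 1975. As a rough sketch, the code below counts such rows in a toy table whose values are copied from the first six rows of WorldBank1975 printed above. The names toy1975, is.drawable, and n.missing are made up for this example; the same expression could be applied to the full WorldBank1975 table.

```r
# Toy table: first six rows of the 1975 subset, three columns only.
toy1975 <- data.frame(
  country=c(
    "Andorra", "United Arab Emirates", "Afghanistan",
    "Antigua and Barbuda", "Albania", "Armenia"),
  life.expectancy=c(NA, 66.18539, 37.25712, NA, 68.32583, 70.52751),
  fertility.rate=c(NA, 6.009, 7.692, NA, 4.417, 2.745))
# A point can only be drawn if both x and y are present.
is.drawable <- complete.cases(
  toy1975[, c("life.expectancy", "fertility.rate")])
n.missing <- sum(!is.drawable)
print(n.missing)
```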

Rendering ggplots on web pages using animint

This section explains how the animint2 package can be used to render ggplots on web pages. The ggplot from the previous section can be rendered with animint2, by using the animint function.

animint(scatter)

If, when you run the code above, the animint does not render in your web browser (for example, if you see a blank web page), then please consult our wiki FAQ, which will help you find a solution. Internally, the animint function creates a list of class animint, and then R runs the print.animint function via the S3 object system. The animint2 package implements a compiler that takes the list as input, and outputs a web page with a data visualization. The compiler is the animint2dir function, which compiles the animint list to a directory of data and code files that can be rendered in a web browser. It is called automatically by the print.animint function.

When viewed in a web browser, the animint plot should look mostly the same as static versions produced by standard R graphics devices. One difference is that the region legend is interactive: clicking a legend entry will hide or show the points of that color.

Exercise: try changing the aes mapping of the ggplot, and then making a new animint. Quantitative variables like population are best shown using the x/y axes or point size. Qualitative variables like lending are best shown using point color or fill.

Multi-layer data visualization (multiple geoms)

Multi-layer data visualization is useful when you want to display several different geoms or data sets in the same plot. For example, consider the following sketch which adds a geom_path to the previous data visualization.

multi-layer WorldBank data viz

Note how the sketch above includes two different geoms (point and path). The two geoms share a common definition of the x, y, and color aesthetics, but have different data sets. Below we translate this sketch into R code.

WorldBankBefore1975 <- subset(WorldBank, 1970 <= year & year <= 1975)
two.layers <- scatter+
  geom_path(aes(
    x=life.expectancy, 
    y=fertility.rate, 
    color=Region,
    group=country),
    data=WorldBankBefore1975)
(viz.two.layers <- animint(two.layers))

Note that we save the return value of the animint function to the viz.two.layers object (which is also printed due to the parentheses). In this manual we will often use variable names that start with viz to denote animint data visualization objects, which are in fact lists of ggplots and options.

The plot above shows a data visualization with 2 geoms/layers:

the geom_point shows the life expectancy, fertility rate, and region of all countries in 1975.
the geom_path shows the same variables for the previous 5 years.

The addition of the geom_path shows how the countries changed over time. In particular, it shows that most countries moved to the right and down, meaning higher life expectancy and lower fertility rate. However, there are some exceptions. For example, the two East Asian countries in the bottom left suffered a decrease in life expectancy over this period. And there are some countries which showed an increased fertility rate.

Exercise: try changing the region legend to an income legend. Hint: you need to use the same aes(color=income) specification for all geoms. You may want to use scale_color_manual with a sequential color palette, see RColorBrewer::display.brewer.all(type="seq") and read the appendix for more details.

Can we add the names of the countries to the data viz? Below, we add another layer with a text label for each country’s name.

three.layers <- two.layers+
  geom_text(aes(
    x=life.expectancy, 
    y=fertility.rate, 
    color=Region,
    label=country),
    data=WorldBank1975)
animint(three.layers)

This data viz is not so easy to read, since there are so many overlapping text labels. The interactive region legend helps a little, by allowing the user to hide data from selected regions. However, it would be even better if the user could show and hide the text for individual countries. That type of interaction can be achieved using the showSelected and clickSelects aesthetics which we explain in Chapters 3-4.

For now, we move on to discuss a major strength of animint: data visualization with multiple linked plots.

Multi-plot data visualization

Multi-plot data visualization is useful when you want to show some related data sets using more than one aesthetic mapping. In interactive data visualization, one plot is often used to display a summary, and another plot is used to display details. For example, consider a data visualization with two plots: a time series with World Bank data from 1960-2010 (summary), and a scatterplot with data from 1975 (details). We sketch the time series plot below.

WorldBank data viz with two plots

Note how the sketch above can be directly translated into the R code below. First we copy the existing viz list (viz.two.layers), then we assign a ggplot to a new element named timeSeries.

viz.two.plots <- viz.two.layers
viz.two.plots$timeSeries <- ggplot()+
  geom_line(aes(
    x=year, 
    y=fertility.rate, 
    color=Region, 
    group=country),
    data=WorldBankSome)

That results in a named list of two elements (both elements are ggplots with class gganimint).

summary(viz.two.plots)

##            Length Class     Mode
## plot1      9      gganimint list
## timeSeries 9      gganimint list

This data visualization list can be printed/rendered by typing its name. Since the list contains two ggplots, animint2 renders the data viz as two linked plots.
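
Because an animint data viz is a list, ordinary R list operations apply to it. The sketch below uses character strings as stand-ins for ggplots (an assumption made so that the example runs without any plotting package; toy.viz.one and toy.viz.two are hypothetical names) to show that adding an element to a copy leaves the original list unchanged.

```r
# Stand-in for a viz with one plot element.
toy.viz.one <- list(plot1="scatterplot placeholder")
# Copy the list, then add a second named element to the copy.
toy.viz.two <- toy.viz.one
toy.viz.two$timeSeries <- "time series placeholder"
# The original still has one element; the copy has two.
print(names(toy.viz.one))
print(names(toy.viz.two))
```

This copy-then-modify pattern makes it possible to keep several versions of a data viz in the same R session.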

viz.two.plots

The data visualization above contains two ggplots, which each map different data variables to the horizontal x axis. The time series uses aes(x=year), and shows a summary of fertility rate values over all years. The scatterplot uses aes(x=life.expectancy), and shows details of the relationship between fertility rate and life expectancy during 1975.

Try clicking a legend entry in either the scatterplot or the time series above. You should see the data and legends in both plots update simultaneously. Since aes(color=Region) was specified in both plots, animint creates a single shared selector variable called Region. Clicking either legend has the effect of updating the set of selected regions, and so animint updates the legends and data in both plots accordingly. This is the main mechanism that animint uses to create interactive data visualizations with linked plots, and will be discussed in more detail in the next two chapters.

Exercise: use animint to create a data viz with three plots, by creating a list with three ggplots. For example, you could add a time series of another data variable such as life.expectancy or population.

Note that both ggplots map the fertility rate variable to the y axis. However, since they are separate plots, the ranges of their y axes are computed separately. That means that even when the two plots are rendered side-by-side, the two y axes are not exactly aligned. That is a problem, since the data visualization would be easier to decode if each unit of vertical space showed the same amount of fertility rate in both plots. To achieve that effect, we use facets in the next section.
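
To see why separately computed axes differ, consider the sketch below. It uses a few fertility rate values copied from the tables printed earlier in this chapter (toy vectors with made-up names, not the full data), and compares the range of each subset with the range of their union, which is what a shared axis would need to cover.

```r
# Fertility rates from the first rows of the 1975 subset (NA = missing).
scatter.fertility <- c(NA, 6.009, 7.692, NA, 4.417, 2.745)
# Fertility rates for Zimbabwe, every 5 years from 1985 to 2010.
series.fertility <- c(6.224, 5.187, 4.369, 3.862, 3.606, 3.290)
# Each separate plot computes its limits from its own data.
scatter.range <- range(scatter.fertility, na.rm=TRUE)
series.range <- range(series.fertility, na.rm=TRUE)
# A shared axis must cover both data sets.
shared.range <- range(scatter.fertility, series.fertility, na.rm=TRUE)
print(rbind(scatter.range, series.range, shared.range))
```

In this toy example the two separate ranges differ, so the same fertility rate value would be drawn at a different height in each plot.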

Multi-panel data visualization (facets)

Panels or facets are sub-plots that show related data visualizations. One of the main strengths of ggplots is that different kinds of multi-panel plots are relatively easy to create. Multi-panel data visualization is useful for two different purposes:

You want to align the axes of several related plots containing different geoms. This facilitates comparison between several different geoms, and is a technique that is also useful for interactive data visualization.
You want to divide the data from one geom into several panels. This facilitates comparison between data subsets, and is less useful for interactive data visualization (interactivity can often be used instead, to achieve the same effect of comparing data subsets).

Different geoms in each panel (aligned axes)

We begin by explaining how facets are useful to align the axes of related plots. Consider the sketch below which contains a plot with two panels.

World Bank aligned plot

Note that the two panels plot different geoms using a panel-specific aesthetic mapping. The point and path in the left panel have x=life.expectancy, and the line in the right panel has x=year. Also note that we specified facet=x.var, so we need to add a variable called x.var to each of the three data sets. We translate this sketch to the R code below.

add.x.var <- function(df, x.var){
  data.frame(df, x.var=factor(x.var, c("life expectancy", "year")))
}
(viz.aligned <- animint(
  scatter=ggplot()+
    theme_bw()+
    theme_animint(width=600)+
    geom_point(aes(
      x=life.expectancy, y=fertility.rate, color=Region),
      data=add.x.var(WorldBank1975, "life expectancy"))+
    geom_path(aes(
      x=life.expectancy, y=fertility.rate, color=Region,
      group=country),
      data=add.x.var(WorldBankBefore1975, "life expectancy"))+
    geom_line(aes(
      x=year, y=fertility.rate, color=Region, group=country),
      data=add.x.var(WorldBankSome, "year"))+
    xlab("")+
    facet_grid(. ~ x.var, scales="free")))

The data visualization above contains a single ggplot with two panels and three layers. The left panel shows the geom_point and geom_path, and the right panel shows the geom_line. The panels have a shared axis for fertility rate, which ensures that the lines in the time series panel can be directly compared with the points and paths in the scatterplot panel.

Note that we used the add.x.var function to add an x.var variable to each data set, and then we used that variable in facet_grid(scales="free"). We call this the addColumn then facet idiom, which is generally useful for creating a multi-panel data visualization with aligned axes. In particular, if we wanted to change the order of the panels in the data visualization, we would only need to edit the order of the factor levels in the definition of add.x.var.
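
The effect of add.x.var can be inspected on a small input. The sketch below repeats the definition from above and applies it to a toy data frame (toy.df and toy.year are made-up names; the two rows are copied from the Zimbabwe output printed earlier), to show that the new column is a factor with both levels defined.

```r
# Same definition as above.
add.x.var <- function(df, x.var){
  data.frame(df, x.var=factor(x.var, c("life expectancy", "year")))
}
# Toy data with two rows.
toy.df <- data.frame(
  year=c(1985, 1990),
  fertility.rate=c(6.224, 5.187))
toy.year <- add.x.var(toy.df, "year")
print(toy.year)
# Both levels are present even though only "year" is used in this
# table, so every data set agrees on the same panel order.
print(levels(toy.year$x.var))
```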

Also note that theme_bw means to use black panel borders and white panel backgrounds. A related theme setting, panel.margin=0, means to use no space between panels; it is not set in the code above, but it can be added via theme(). Eliminating the space between panels means that more space will be used for the panels, which serves to emphasize the data. We call this the Space saving facets idiom, which is generally useful in any ggplot with facets.

In the data viz above, the text labels overlap a bit. As an exercise for the reader, this can be fixed by any of the following:

using the breaks argument of scale_x_continuous()
reducing text size with theme() and element_text()
increasing plot width using theme_animint(), see Chapter 6 for more info.

Same geoms in each panel (compare data subsets)

The second reason for using plots with multiple panels in a data visualization is to compare subsets of observations. This facilitates comparison between data subsets, and can be used in at least two different situations:

One geom’s data set has too many observations to display informatively in one panel.
You want to compare different subsets of data that is plotted for one geom.

For example, consider the sketch below.

World Bank panels plot

Note that the three panels plot the same two geoms (point and path). Since facet=show.year, and there are three panels shown, we will need to create data tables which have three values for the show.year variable. The geom_point has data for just 3 years, and the geom_path has data for 18 years (6 years for each of the 3 values of show.year). The code below creates these two data sets for three years of the WorldBank data set.

show.point.list <- list()
show.path.list <- list()
for(show.year in c(1975, 1985, 1995)){
  show.point.list[[paste(show.year)]] <- data.frame(
    show.year, subset(WorldBank, year==show.year))
  show.path.list[[paste(show.year)]] <- data.frame(
    show.year, subset(WorldBank, show.year - 5 <= year & year <= show.year))
}
show.point <- do.call(rbind, show.point.list)
show.path <- do.call(rbind, show.path.list)

We used a for loop over three values of show.year, the variable which we will use later in facet_grid. For each value of show.year, we store a data subset as a named element of a list. After the for loop, we use do.call with rbind to combine the data subsets. This is an example of the list of data tables idiom, which is generally useful for interactive data visualization.
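
As a small-scale check of this idiom, the sketch below runs the same loop on a toy table with a single made-up country and one row per year (toy.data is hypothetical, standing in for WorldBank). It then counts the rows per panel in each combined table.

```r
# Toy stand-in for WorldBank: one country, one row per year.
toy.data <- data.frame(country="A", year=1960:2010)
toy.point.list <- list()
toy.path.list <- list()
for(show.year in c(1975, 1985, 1995)){
  # Points: only the year shown in the panel.
  toy.point.list[[paste(show.year)]] <- data.frame(
    show.year, subset(toy.data, year==show.year))
  # Paths: the years from show.year - 5 through show.year.
  toy.path.list[[paste(show.year)]] <- data.frame(
    show.year, subset(toy.data, show.year - 5 <= year & year <= show.year))
}
toy.point <- do.call(rbind, toy.point.list)
toy.path <- do.call(rbind, toy.path.list)
# Rows per panel, for this single country.
print(table(toy.point$show.year))
print(table(toy.path$show.year))
```

Each combined table has a show.year column that identifies the panel, which is the column that facet_grid needs.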

Below, we facet on the show.year variable to create a data visualization with three panels.

animint(
  scatter=ggplot()+
    geom_point(aes(
      x=life.expectancy, y=fertility.rate, color=Region),
      data=show.point)+
    geom_path(aes(
      x=life.expectancy, y=fertility.rate, color=Region,
      group=country),
      data=show.path)+
    facet_grid(. ~ show.year)+
    theme_bw())

The data visualization above contains a single ggplot with three panels. It shows more of the WorldBank data set than the previous visualizations which showed only the data from 1975. However, it still only shows a relatively small data subset. You may be tempted to try using a panel to display every year (not just 1975, 1985, and 1995). However, beware that this type of multi-panel data visualization is useful mainly when there are only a few data subsets. With more than about 10 panels, it becomes difficult to see all the data at once, and thus difficult to make meaningful comparisons.

Instead of showing all of the data at once, we can create an animated data visualization that shows the viewer different data subsets over time. In the next chapter, we will show how the new showSelected keyword can be used to achieve animation, and reveal more details of this data set.

Chapter summary and exercises

This chapter presented the basics of static data visualization using ggplot2. We showed how animint can be used to render a list of ggplots in a web browser. We explained two features of ggplot2 that make it ideal for data visualization: multi-layer and multi-panel graphics.

Exercises:

What are the three main advantages of ggplot2 relative to previous plotting systems such as grid and lattice?
What is the purpose of multi-layer graphics?
What are the two different reasons for creating multi-panel graphics? Which of these two types is useful with interactivity?
Let us define “A < B” to mean that “one B can contain several A.” Which of the following statements is true?
- ggplot < panel
- panel < ggplot
- ggplot < animint
- animint < ggplot
- layer < panel
- panel < layer
- layer < ggplot
- ggplot < layer
In the viz.aligned facets, why is it important to use the scales="free" argument?
In viz.aligned we showed a ggplot with a scatterplot panel on the left and a time series panel on the right. Make another version of the data visualization with the time series panel on the left and the scatterplot panel on the right.
In viz.aligned the scatterplot displays fertility rate and life expectancy, but the time series displays only fertility rate. Make another version of the data visualization that shows both time series. Hint: use both horizontal and vertical panels in facet_grid.
Use aes(size=population) in the scatterplot to show the population of each country. Hint: scale_size_animint(pixel.range=c(5, 10)) means that circles with a radius of 5/10 pixels should be used to represent the minimum/maximum population.
Create a multi-panel data visualization that shows each year of the WorldBank data set in a separate panel. What are the limitations of using static graphics to visualize these data?
Create viz.aligned using a plotting system that is not based on the grammar of graphics. For example, you can use functions from the graphics package in R (plot, points, lines, etc), or matplotlib in Python. What are some advantages of ggplot2 and animint?

Next, Chapter 3 explains the showSelected keyword, which indicates a variable to use for subsetting the data before plotting.

The animint2 Manual by Toby Dylan Hocking
