Chapter 2, data visualization using the grammar of graphics
This chapter explains the grammar of graphics, which is a powerful model for describing a large class of data visualizations. After reading this chapter, you will be able to
- State the advantages of the grammar of graphics relative to previous plotting systems
- Install the animint2 R package
- Translate plot sketches into ggplot code in R
- Render ggplots on web pages using animint2
- Create multi-layer ggplots
- Create multi-panel ggplots
History and purpose of the grammar of graphics
Most computer systems for data analysis provide functions for creating plots to visualize patterns in data. The oldest systems provide very general functions for drawing basic plot components such as lines and points (e.g. the graphics
and grid
packages in R). If you use one of these general systems, then it is your job to put the components together to form a meaningful, interpretable plot. The advantage of general systems is that they impose few limitations on what kinds of plots can be created. The disadvantage is that general systems typically do not provide functions for automating common plotting tasks (axes, panels, legends).
To overcome the disadvantages of these general plotting systems, charting packages such as lattice
were developed (Sarkar, 2008). Such packages have several pre-defined chart types, and provide a dedicated function for creating each chart type. For example, lattice
provides the bwplot
function for making box and whisker plots. The advantage of such systems is that they make it much easier to create entire plots, including a legend and panels. The disadvantage is the set of pre-defined chart types, which means that it is not easy to create more complex graphics.
Newer plotting systems based on the grammar of graphics are situated between these two extremes. Wilkinson proposed the grammar of graphics in order to describe and create a large class of plots (Wilkinson, 2005). Wickham later implemented several ideas from the grammar of graphics in the ggplot2
R package (Wickham, 2009). The ggplot2
package has several advantages with respect to previous plotting systems.
- Like general plotting systems, and unlike
lattice
,ggplot2
imposes few limitations on the types of plots that can be created (there are no pre-defined chart types). - Unlike general plotting systems, and like
lattice
,ggplot2
makes it easy to include common plot elements such as axes, panels, and legends. - Since
ggplot2
is based on the grammar of graphics, an explicit mapping of data variables to visual properties is required. Later in this chapter, we will explain how this mapping allows sketches of plot ideas to be directly translated into R code.
Finally, all of the previously discussed plotting systems are intended for creating static graphics, which can be viewed equally well on a computer screen or on paper. However, the main topic of this manual is animint2
, an R package for interactive graphics. In contrast to static graphics, interactive graphics are best viewed on a computer with a mouse and keyboard that can be used to interact with the plot.
Since many concepts from static graphics are also useful in interactive graphics, the animint2
package is implemented as an extension/fork of ggplot2
. In this chapter we will introduce the main features of ggplot2
which will also be useful for interactive plot design in later chapters.
In 2013, we created the animint package, which depends on the ggplot2 package. However during 2014-2017, the ggplot2 package introduced many changes that were incompatible with the interactive grammar of animint. Therefore in 2018 we created the animint2 package which copies/forks the relevant parts of the ggplot2 package. Now animint2 can be used without having ggplot2 installed. In fact, it is recommended to use animint2 without attaching (via library) ggplot2. However it is fine to use animint2 along with packages that import/load ggplot2. For an example, see Chapter 16, which uses the penaltyLearning package (which imports ggplot2).
Installing and attaching animint2
To install the most recent release of animint2
from CRAN,
if(!requireNamespace("animint2"))install.packages("animint2")
To install an even more recent development version of animint2
from GitHub,
if(!requireNamespace("animint2")){
if(!requireNamespace("remotes"))install.packages("remotes")
remotes::install_github("tdhock/animint2")
}
Once you have installed animint2, you can load and attach all of its exported functions via:
library(animint2)
Translating plot sketches into ggplots
This section explains how to translate a plot sketch into R code. We use a data set from the World Bank as an example, and we begin by loading and looking at a subset these data.
data(WorldBank, package="animint2")
WorldBank$Region <- sub(" (all income levels)", "", WorldBank$region, fixed=TRUE)
WorldBankSome <- subset(WorldBank, year %% 5 == 0)
table(WorldBankSome$year)
##
## 1960 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010
## 214 214 214 214 214 214 214 214 214 214 214
head(WorldBankSome)
## iso2c country year fertility.rate life.expectancy population
## 266 AD Andorra 1960 NA NA 13414
## 271 AD Andorra 1965 NA NA 18551
## 276 AD Andorra 1970 NA NA 24279
## 281 AD Andorra 1975 NA NA 30706
## 286 AD Andorra 1980 NA NA 36063
## 291 AD Andorra 1985 NA NA 44597
## GDP.per.capita.Current.USD 15.to.25.yr.female.literacy iso3c
## 266 NA NA AND
## 271 NA NA AND
## 276 3238.089 NA AND
## 281 7168.399 NA AND
## 286 12377.714 NA AND
## 291 7775.028 NA AND
## region capital longitude
## 266 Europe & Central Asia (all income levels) Andorra la Vella 1.5218
## 271 Europe & Central Asia (all income levels) Andorra la Vella 1.5218
## 276 Europe & Central Asia (all income levels) Andorra la Vella 1.5218
## 281 Europe & Central Asia (all income levels) Andorra la Vella 1.5218
## 286 Europe & Central Asia (all income levels) Andorra la Vella 1.5218
## 291 Europe & Central Asia (all income levels) Andorra la Vella 1.5218
## latitude income lending Region
## 266 42.5075 High income: nonOECD Not classified Europe & Central Asia
## 271 42.5075 High income: nonOECD Not classified Europe & Central Asia
## 276 42.5075 High income: nonOECD Not classified Europe & Central Asia
## 281 42.5075 High income: nonOECD Not classified Europe & Central Asia
## 286 42.5075 High income: nonOECD Not classified Europe & Central Asia
## 291 42.5075 High income: nonOECD Not classified Europe & Central Asia
tail(WorldBankSome)
## iso2c country year fertility.rate life.expectancy population
## 13011 ZW Zimbabwe 1985 6.224 61.44741 8860331
## 13016 ZW Zimbabwe 1990 5.187 60.52893 10461782
## 13021 ZW Zimbabwe 1995 4.369 53.09839 11639364
## 13026 ZW Zimbabwe 2000 3.862 44.61798 12503652
## 13031 ZW Zimbabwe 2005 3.606 43.86159 12710589
## 13036 ZW Zimbabwe 2010 3.290 49.86088 13076978
## GDP.per.capita.Current.USD 15.to.25.yr.female.literacy iso3c
## 13011 636.2357 NA ZWE
## 13016 839.6100 NA ZWE
## 13021 610.9673 NA ZWE
## 13026 535.0403 NA ZWE
## 13031 452.7890 NA ZWE
## 13036 568.4275 99.55316 ZWE
## region capital longitude latitude
## 13011 Sub-Saharan Africa (all income levels) Harare 31.0672 -17.8312
## 13016 Sub-Saharan Africa (all income levels) Harare 31.0672 -17.8312
## 13021 Sub-Saharan Africa (all income levels) Harare 31.0672 -17.8312
## 13026 Sub-Saharan Africa (all income levels) Harare 31.0672 -17.8312
## 13031 Sub-Saharan Africa (all income levels) Harare 31.0672 -17.8312
## 13036 Sub-Saharan Africa (all income levels) Harare 31.0672 -17.8312
## income lending Region
## 13011 Low income Blend Sub-Saharan Africa
## 13016 Low income Blend Sub-Saharan Africa
## 13021 Low income Blend Sub-Saharan Africa
## 13026 Low income Blend Sub-Saharan Africa
## 13031 Low income Blend Sub-Saharan Africa
## 13036 Low income Blend Sub-Saharan Africa
dim(WorldBankSome)
## [1] 2354 16
The WorldBank
data set consist of measures such as fertility rate and life expectancy for each country over the period 1960-2010. The code above prints the first and last few rows, and the dimension of the data table, for the subset of every 5 years (2354 rows and 16 columns).
Suppose that we are interested to see if there is any relationship between life expectancy and fertility rate. We could fix one year, then use those two data variables in a scatterplot. Consider the figure below which sketches the main components of that data visualization.
The sketch above shows life expectancy on the horizontal (x) axis, fertility rate on the vertical (y) axis, and a legend for the region. These elements of the sketch can be directly translated into R code using the following method. First, we need to construct a data table that has one row for every country in 1975, and columns named life.expectancy
, fertility.rate
, and region
. The WorldBank
data already has these columns, so all we need to do is consider the subset for the year 1975:
WorldBank1975 <- subset(WorldBank, year==1975)
head(WorldBank1975)
## iso2c country year fertility.rate life.expectancy population
## 281 AD Andorra 1975 NA NA 30706
## 334 AE United Arab Emirates 1975 6.009 66.18539 532742
## 387 AF Afghanistan 1975 7.692 37.25712 12551790
## 440 AG Antigua and Barbuda 1975 NA NA 69253
## 493 AL Albania 1975 4.417 68.32583 2426592
## 546 AM Armenia 1975 2.745 70.52751 2825650
## GDP.per.capita.Current.USD 15.to.25.yr.female.literacy iso3c
## 281 7168.3987 NA AND
## 334 27631.8985 56.27697 ARE
## 387 188.5521 NA AFG
## 440 NA NA ATG
## 493 NA NA ALB
## 546 NA NA ARM
## region capital longitude
## 281 Europe & Central Asia (all income levels) Andorra la Vella 1.5218
## 334 Middle East & North Africa (all income levels) Abu Dhabi 54.3705
## 387 South Asia Kabul 69.1761
## 440 Latin America & Caribbean (all income levels) Saint John's -61.8456
## 493 Europe & Central Asia (all income levels) Tirane 19.8172
## 546 Europe & Central Asia (all income levels) Yerevan 44.509
## latitude income lending Region
## 281 42.5075 High income: nonOECD Not classified Europe & Central Asia
## 334 24.4764 High income: nonOECD Not classified Middle East & North Africa
## 387 34.5228 Low income IDA South Asia
## 440 17.1175 Upper middle income IBRD Latin America & Caribbean
## 493 41.3317 Upper middle income IBRD Europe & Central Asia
## 546 40.1596 Lower middle income Blend Europe & Central Asia
The code above prints the data for 1975, which clearly has the appropriate columns, and one row for each country. The next step is to use the notes in the sketch to code a ggplot with a corresponding aes
or aesthetic mapping of data variables to visual properties:
scatter <- ggplot()+
geom_point(
mapping=aes(x=life.expectancy, y=fertility.rate, color=Region),
data=WorldBank1975)
scatter
## Warning: Removed 27 rows containing missing values (geom_point).
The aes
function is called with names for visual properties (x
, y
, color
) and values for the corresponding data variables (life.expectancy
, fertility.rate
, region
). This mapping is applied to the variables in the WorldBank1975
data table, in order to create the visual properties of the geom_point
. The ggplot was saved as the scatter
object, which when printed on the R command line shows the plot on a graphics device. Note that we automatically have a region
color legend.
Rendering ggplots on web pages using animint
This section explains how the animint2
package can be used to render ggplots on web pages. The ggplot from the previous section can be rendered with animint2, by using the animint
function.
animint(scatter)
If, when you run the code above, the animint does not render in your web browser for some reason (for example if you see a blank web page), then please consult our wiki FAQ which will help you find a solution. Internally, the animint
function creates a list of class animint, and then R runs the print.animint
function via the S3 object system. The animint2
package implements a compiler that takes the list as input, and outputs a web page with a data visualization. The compiler is the animint2dir
function, which compiles the animint scatter.viz
list to a directory of data and code files that can be rendered in a web browser. It is activated automatically by the print.animint
function.
When viewed in a web browser, the animint plot should look mostly the same as static versions produced by standard R graphics devices. One difference is that the region legend is interactive: clicking a legend entry will hide or show the points of that color.
Exercise: try changing the aes
mapping of the ggplot, and then making a new animint. Quantitative variables like population
are best shown using the x
/y
axes or point size
. Qualitative variables like lending
are best shown using point color
or fill
.
Multi-layer data visualization (multiple geoms)
Multi-layer data visualization is useful when you want to display several different geoms or data sets in the same plot. For example, consider the following sketch which adds a geom_path
to the previous data visualization.
Note how the sketch above includes two different geoms (point and path). The two geoms share a common definition of the x
, y
, and color
aesthetics, but have different data sets. Below we translate this sketch into R code.
WorldBankBefore1975 <- subset(WorldBank, 1970 <= year & year <= 1975)
two.layers <- scatter+
geom_path(aes(
x=life.expectancy,
y=fertility.rate,
color=Region,
group=country),
data=WorldBankBefore1975)
(viz.two.layers <- animint(two.layers))
Note that we save the return value of the animint
function to the viz.two.layers
object (which is also printed due to the parentheses). In this manual we will often use variable names that start with viz
to denote animint data visualization objects, which are in fact lists of ggplots and options.
The plot above shows a data visualization with 2 geoms/layers:
- the
geom_point
shows the life expectancy, fertility rate, and region of all countries in 1975. - the
geom_path
shows the same variables for the previous 5 years.
The addition of the geom_path
shows how the countries changed over time. In particular, it shows that most countries moved to the right and down, meaning higher life expectancy and lower fertility rate. However, there are some exceptions. For example, the two East Asian countries in the bottom left suffered a decrease in life expectancy over this period. And there are some countries which showed an increased fertility rate.
Exercise: try changing the region
legend to an income
legend. Hint: you need to use the same aes(color=income)
specification for all geoms. You may want to use scale_color_manual
with a sequential color palette, see RColorBrewer::display.brewer.all(type="seq")
and read the appendix for more details.
Can we add the names of the countries to the data viz? Below, we add another layer with a text label for each country’s name.
three.layers <- two.layers+
geom_text(aes(
x=life.expectancy,
y=fertility.rate,
color=Region,
label=country),
data=WorldBank1975)
animint(three.layers)
This data viz is not so easy to read, since there are so many overlapping text labels. The interactive region legend helps a little, by allowing the user to hide data from selected regions. However, it would be even better if the user could show and hide the text for individual countries. That type of interaction can be achieved using the showSelected and clickSelects aesthetics which we explain in Chapters 3-4.
For now, we move on to discuss a major strength of animint: data visualization with multiple linked plots.
Multi-plot data visualization
Multi-plot data visualization is useful when you want to show some related data sets using more than one aesthetic mapping. In interactive data visualization, one plot is often used to display a summary, and another plot is used to display details. For example, consider a data visualization with two plots: a time series with World Bank data from 1960-2010 (summary), and a scatterplot with data from 1975 (details). We sketch the time series plot below.
Note how the sketch above can be directly translated into the R code below. First we copy the existing viz list (viz.two.layers
), then we assign a ggplot to a new element named timeSeries
.
viz.two.plots <- viz.two.layers
viz.two.plots$timeSeries <- ggplot()+
geom_line(aes(
x=year,
y=fertility.rate,
color=Region,
group=country),
data=WorldBankSome)
That results in a named list of two elements (both elements are ggplots with class gganimint
).
summary(viz.two.plots)
## Length Class Mode
## plot1 9 gganimint list
## timeSeries 9 gganimint list
This data visualization list can be printed/rendered by typing its name. Since the list contains two ggplots, animint2
renders the data viz as two linked plots.
viz.two.plots
The data visualization above contains two ggplots, which each map different data variables to the horizontal x
axis. The time series uses aes(x=year)
, and shows a summary of fertility rate values over all years. The scatterplot uses aes(x=life.expectancy)
, and shows details of the relationship between fertility rate and life expectancy during 1975.
Try clicking a legend entry in either the scatterplot or the time series above. You should see the data and legends in both plots update simultaneously. Since aes(color=Region)
was specified in both plots, animint creates a single shared selector variable called Region
. Clicking either legend has the effect of updating the set of selected regions, and so animint updates the legends and data in both plots accordingly. This is the main mechanism that animint uses to create interactive data visualizations with linked plots, and will be discussed in more detail in the next two chapters.
Exercise: use animint to create a data viz with three plots, by creating a list with three ggplots. For example, you could add a time series of another data variable such as life.expectancy
or population
.
Note that both ggplots map the fertility rate variable to the y axis. However, since they are separate plots, the ranges of their y axes are computed separately. That means that even when the two plots are rendered side-by-side, the two y axis are not exactly aligned. That is a problem since it would make it easier to decode the data visualization if each unit of vertical space was used to show the same amount of fertility rate. To achieve that effect, we use facets in the next section.
Multi-panel data visualization (facets)
Panels or facets are sub-plots that show related data visualizations. One of the main strengths of ggplots is that different kinds of multi-panel plots are relatively easy to create. Multi-panel data visualization is useful for two different purposes:
- You want to align the axes of several related plots containing different geoms. This facilitates comparison between several different geoms, and is a technique that is also useful for interactive data visualization.
- You want to divide the data from one geom into several panels. This facilitates comparison between data subsets, and is less useful for interactive data visualization (interactivity can often be used instead, to achieve the same effect of comparing data subsets).
Different geoms in each panel (aligned axes)
We begin by explaining the how facets are useful to align the axes of related plots. Consider the sketch below which contains a plot with two panels.
Note that the two panels plot different geoms using a panel-specific aesthetic mapping. The point and path in the left panel have x=life.expectancy
, and the line in the right panel has x=year
. Also note that we specified facet=x.var
, so we need to add a variable called x.var
to each of the three data sets. We translate this sketch to the R code below.
add.x.var <- function(df, x.var){
data.frame(df, x.var=factor(x.var, c("life expectancy", "year")))
}
(viz.aligned <- animint(
scatter=ggplot()+
theme_bw()+
theme_animint(width=600)+
geom_point(aes(
x=life.expectancy, y=fertility.rate, color=Region),
data=add.x.var(WorldBank1975, "life expectancy"))+
geom_path(aes(
x=life.expectancy, y=fertility.rate, color=Region,
group=country),
data=add.x.var(WorldBankBefore1975, "life expectancy"))+
geom_line(aes(
x=year, y=fertility.rate, color=Region, group=country),
data=add.x.var(WorldBankSome, "year"))+
xlab("")+
facet_grid(. ~ x.var, scales="free")))
The data visualization above contains a single ggplot with two panels and three layers. The left panel shows the geom_point
and geom_path
, and the right panel shows the geom_line
. The panels have a shared axis for fertility rate, which ensures that the lines in the time series panel can be directly compared with the points and paths in the scatterplot panel.
Note that we used the add.x.var
function to add a x.var
variable to each data set, and then we used that variable in facet_grid(scales="free")
. We call this the addColumn then facet idiom, which is generally useful for creating a multi-panel data visualization with aligned axes. In particular, if we wanted to change the order of the panels in the data visualization, we would only need to edit the order of the factor levels in the definition of add.x.var
.
Also note that theme_bw
means to use black panel borders and white panel backgrounds, and panel.margin=0
means to use no space between panels. Eliminating the space between panels means that more space will be used for the panels, which serves to emphasize the data. We call this the Space saving facets idiom, which is generally useful in any ggplot with facets.
In the data viz above, the text labels overlap a bit, which can be fixed by either (exercise for the reader)
- using
breaks
argument ofscale_x_continuous()
- reducing text size with
theme()
andelement_text()
- increasing plot width using
theme_animint()
, see Chapter 6 for more info.
Same geoms in each panel (compare data subsets)
The second reason for using plots with multiple panels in a data visualization is to compare subsets of observations. This facilitates comparison between data subsets, and can be used in at least two different situations:
- One geom’s data set has too many observations to display informatively in one panel.
- You want to compare different subsets of data that is plotted for one geom.
For example, consider the sketch below.
Note that the three panels plot the same two geoms (point and path). Since facet=show.year
, and there are three panels shown, we will need to create data tables which have three values for the show.year
variable. The geom_point
has data for just 3 years, and the geom_path
has data for 15 years (but 3 values of show.year
). The code below creates these two data sets for three years of the WorldBank data set.
show.point.list <- list()
show.path.list <- list()
for(show.year in c(1975, 1985, 1995)){
show.point.list[[paste(show.year)]] <- data.frame(
show.year, subset(WorldBank, year==show.year))
show.path.list[[paste(show.year)]] <- data.frame(
show.year, subset(WorldBank, show.year - 5 <= year & year <= show.year))
}
show.point <- do.call(rbind, show.point.list)
show.path <- do.call(rbind, show.path.list)
We used a for loop over three values of show.year
, the variable which we will use later in facet_grid
. For each value of show.year
, we store a data subset as a named element of a list. After the for loop, we use do.call
with rbind
to combine the data subsets. This is an example of the list of data tables idiom, which is generally useful for interactive data visualization.
Below, we facet on the show.year
variable to create a data visualization with three panels.
animint(
scatter=ggplot()+
geom_point(aes(
x=life.expectancy, y=fertility.rate, color=Region),
data=show.point)+
geom_path(aes(
x=life.expectancy, y=fertility.rate, color=Region,
group=country),
data=show.path)+
facet_grid(. ~ show.year)+
theme_bw())
The data visualization above contains a single ggplot with three panels. It shows more of the WorldBank data set than the previous visualizations which showed only the data from 1975. However, it still only shows a relatively small data subset. You may be tempted to try using a panel to display every year (not just 1975, 1985, and 1995). However, beware that this type of multi-panel data visualization is especially useful if there are only a few data subsets. With more than about 10 panels, it becomes difficult to see all the data at once, and thus difficult to make meaningful comparisons.
Instead of showing all of the data at once, we can instead create an animated data visualization that shows the viewer different data subsets over time. In the next chapter, we will show how the new showSelected
keyword can be used to achieve animation, and reveal more details of this data set.
Chapter summary and exercises
This chapter presented the basics of static data visualization using ggplot2. We showed how animint can be used to render a list of ggplots in a web browser. We explained two features of ggplot2 that make it ideal for data visualization: multi-layer and multi-panel graphics.
Exercises:
- What are the three main advantages of
ggplot2
relative to previous plotting systems such asgrid
andlattice
? - What is the purpose of multi-layer graphics?
- What are the two different reasons for creating multi-panel graphics? Which of these two types is useful with interactivity?
Let us define “A < B” to mean that “one B can contain several A.” Which of the following statements is true?
- ggplot < panel
- panel < ggplot
- ggplot < animint
- animint < ggplot
- layer < panel
- panel < layer
- layer < ggplot
- ggplot < layer
- In the
viz.aligned
facets, why is it important to use thescales="free"
argument? - In
viz.aligned
we showed a ggplot with a scatterplot panel on the left and a time series panel on the right. Make another version of the data visualization with the time series panel on the left and the scatterplot panel on the right. - In
viz.aligned
the scatterplot displays fertility rate and life expectancy, but the time series displays only fertility rate. Make another version of the data visualization that shows both time series. Hint: use both horizontal and vertical panels infacet_grid
. - Use
aes(size=population)
in the scatterplot to show the population of each country. Hint:scale_size_animint(pixel.range=c(5, 10)
means that circles with a radius of 5/10 pixels should be used represent the minimum/maximum population. - Create a multi-panel data visualization that shows each year of the
WorldBank
data set in a separate panel. What are the limitations of using static graphics to visualize these data? Create
viz.aligned
using a plotting system that is not based on the grammar of graphics. For example, you can use functions from thegraphics
package in R (plot
,points
,lines
, etc), or matplotlib in Python. What are some advantages of ggplot2 and animint?
Next, Chapter 3 explains the showSelected
keyword, which indicates a variable to use for subsetting the data before plotting.