Introduction

The webpage for this session is available at:

In this session, we are going to learn some basics about cleaning data in R. The folder for this session is available at https://tinyurl.com/45vxsawu.

For Session 2 you will need:

FileA_RMarkdown_uOttawabiblio.rmd

This is the same notebook that I will be showing, with the code removed.

You do not have to use this file; you can also work in a completely new notebook or R script.

data/

SciHub_SampleData.csv

SciHubDOI.csv

There are other files

FileB_MarkDown_uOttawabiblio.rmd

This is the same file as above, but with the code already there.

FileB_MarkDown_uOttawabiblio.nb.html

This is the HTML file of the completed notebook.

notebook_images/

These are just the images that are in the notebook.

For Session 3 you will need:

data_visualization.rmd

This is the same notebook that I will be showing with some of the code removed

data/

scatter.csv

bar_plot.csv

synthetic_gatecounts.csv

We won’t actually work with this one, but it’s in the folder.

RStudio Orientation

At the in-person session, I would now give an overview of RStudio. If you are going through this at a later date, you can watch this video.

When you first open R you should see this:

[Image: RLanding]

Once you open a file, you should see this.

[Image: RFile]

The above images are from the RDM Jumpstart Program. They also have introductory lessons on R, which are available here.

There are 3 key features of R

R can do operations

125+65

## [1] 190

45*76

## [1] 3420

8959/32

## [1] 279.9688

You can assign values to objects, then do operations on those objects.

x=3 
y=6
x*y

## [1] 18

These values can be characters

test_string="uOttawaBiblio"
print(test_string)

## [1] "uOttawaBiblio"

An object can also hold multiple values. Strictly speaking, R calls what c() creates a vector, but in this lesson we will informally call them lists.

test_number_list=c(2,4,6,7,8,3)
test_character_list=c("Spring","Summer","Fall","Winter")
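As a small sketch of what you can do once several values are stored together, the chunk below reuses the two objects defined above. It is written with <-, the more common R assignment operator, which behaves like = here.

```r
# The same two objects as above
test_number_list <- c(2, 4, 6, 7, 8, 3)
test_character_list <- c("Spring", "Summer", "Fall", "Winter")

length(test_number_list)    # how many values are stored
test_number_list[1]         # the first value (R starts counting at 1)
test_character_list[2:3]    # the second and third values
test_number_list * 2        # an operation is applied to every value
is.list(test_number_list)   # FALSE, because c() builds a vector rather than a true R list
```

Square brackets pick values out by position, which is handy for quickly checking what an object contains.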

They can also be dataframes

df=read.csv("data/testfile.csv")

R has functions, and the functions are in packages.

We have seen functions already. print() and read.csv() are base R functions (aka default), meaning they come with R and do not need to be installed. The function is the thing outside the brackets, and you perform the function on the argument, which is inside the brackets.

So, for the example above, the function was print(), and the argument was the object test_string.
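To make the function/argument idea a bit more concrete, here is a minimal sketch using three other base R functions (nchar(), toupper() and round()). They are not part of the original lesson and were chosen only for illustration.

```r
test_string <- "uOttawaBiblio"

# One argument: the object we want the function to act on
string_length <- nchar(test_string)   # counts the characters
shouting <- toupper(test_string)      # converts to upper case

# Two arguments: the second one is named, which makes the call easier to read
rounded <- round(8959 / 32, digits = 2)

string_length
shouting
rounded
```

Many functions take more than one argument, and naming them (as with digits = 2) means you do not have to remember the order.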

To get extra functions, you need to download packages. Read more about functions and packages here.

Set Up

Working Directory

First, we are going to set ourselves up in a working directory.

Note: if you downloaded the whole folder and opened one of the provided files, ignore the advice about where to save things. It should all be organized already.

Save the R notebook or R script file somewhere that makes sense. This should be the same location where you have the data stored for this session. See the example below.
Select "Session" from the top menu bar, then "Set Working Directory", then "To Source File Location". The directory should now be printed at the top of the console. See the example below.
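If you would like to confirm from code that the working directory is where you think it is, a quick check along these lines may help. It assumes the folder layout described in the Introduction (a data/ folder sitting next to your notebook).

```r
# Where is R currently looking for files?
current_dir <- getwd()
current_dir

# Is the Session 2 data file visible from here? TRUE means the relative path will work
data_found <- file.exists("data/SciHub_SampleData.csv")
data_found

# List whatever is inside the data folder (empty if the folder is not found)
list.files("data")
```

If data_found is FALSE, the working directory is probably not set to the folder that contains data/.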

Installing Tidyverse

The following examples use functions from tidyverse. tidyverse is a collection of packages whose functions are so commonly used for analyses that people decided to make sure you could download them all at once AND that they would be highly interoperable. You can learn more about tidyverse here.

There are two ways to get a package for the first time. The first is to run install.packages() with the package name, in quotes, inside the brackets. The second is to go over to the panel on the lower right, click the "Packages" tab, then "Install", and type "tidyverse".

You do not have to install packages every time, but you do need to load them every time using library().

Let’s load our package:

#this is how you install using code, this is equivalent to going through the Packages panel. I've commented it out since I don't actually need to install 
#install.packages("tidyverse") 
#Loading the package
library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Upload Data

Now we can load our data and assign it the name scihub_df. Then we take a look at the first few rows using the head() function. There is also a tail() function to see the last rows. For more info on uploading data and the different formats you can use, check out this link.

I have elected to locate my data by specifying a file path. You could also do it like scihub_df=read.csv(file.choose()) to open up a file explorer.

#upload the dataset, it's located in the data folder
scihub_df=read.csv("data/SciHub_SampleData.csv") 
#show the first 6 rows
head(scihub_df)

##             Timestamp                            DOI IP.identifier
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2       7809386
## 2 2017-09-07 22:58:24      10.1080/14786430601032386       1358764
## 3 2017-05-02 09:59:00              10.1021/la501330j       6039317
## 4 2017-07-09 09:07:20              10.1063/1.4913415       5997924
## 5 2017-05-03 08:40:56              10.1021/jp809992g        858831
## 6 2017-05-03 22:11:34              10.1021/ja025109g        858831
##   User.identifier Country.according.to.GeoIP City.according.to.GeoIP Latitude
## 1        16866302                     Canada            Boucherville 45.59137
## 2        33577860                     Canada                 Toronto 43.65323
## 3         9158745                     Canada                 Toronto 43.65323
## 4        19896736                     Canada                 Toronto 43.65323
## 5         9278539                     Canada                 Toronto 43.65323
## 6         9370108                     Canada                 Toronto 43.65323
##   Longitude
## 1 -73.43641
## 2 -79.38318
## 3 -79.38318
## 4 -79.38318
## 5 -79.38318
## 6 -79.38318
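The dots in column names such as IP.identifier are added by read.csv(), which by default rewrites headers so that they are valid R names. The sketch below shows this on a tiny made-up CSV. The header wording with spaces is an assumption about what the original file looks like, not something taken from it.

```r
# A two-line CSV held in a string; the header text is hypothetical
csv_text <- paste(
  "Timestamp,IP identifier,Country according to GeoIP",
  "2017-06-26 21:46:59,7809386,Canada",
  sep = "\n"
)

# Default behaviour: spaces in the header become dots
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written
original_names <- names(read.csv(text = csv_text, check.names = FALSE))

default_names
original_names
```

Keeping the original headers is possible, but names with spaces then have to be wrapped in backticks every time you use them, which is one reason to rename columns instead.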

Rename Columns

Looks good, but from experience, those column names might make life difficult later (the dots are where read.csv() replaced characters such as spaces in the original headers). Let’s rename them to something shorter, without spaces or dots. We can then check to make sure the names were changed properly and we didn’t mess anything up.

For more examples of how to rename columns check out this link.

We can then use the names() function to see what the names of the columns are.

#change the names of scihub_df. The vector of new names needs to be the same length as the number of columns 
colnames(scihub_df)=c("Timestamp",
                 "DOI",
                 "IP_ID",
                 "User_ID",
                 "Country_GeoIP",
                 "City_GeoIP",
                 "Latitude",
                 "Longitude")
#just print the names of columns to confirm they are the new names 
names(scihub_df)

## [1] "Timestamp"     "DOI"           "IP_ID"         "User_ID"      
## [5] "Country_GeoIP" "City_GeoIP"    "Latitude"      "Longitude"
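Assigning to colnames() replaces every name at once. If you only want to change a few columns, one alternative is rename() from dplyr (part of tidyverse), sketched here on a small made-up dataframe that borrows two of the original column names.

```r
library(dplyr)

# Toy dataframe with two of the original (dotted) column names
toy <- data.frame(
  IP.identifier = c(7809386, 1358764),
  User.identifier = c(16866302, 33577860)
)

# rename() takes new_name = old_name pairs and leaves other columns untouched
toy_renamed <- toy %>%
  rename(IP_ID = IP.identifier, User_ID = User.identifier)

names(toy_renamed)
```

Because each new name is paired with its old name, this approach does not depend on the order of the columns.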

Session 2 - Data Tidying

Basic Tidying and Analyses

Selecting Columns

tidyverse uses something called "pipes", which look like %>% (or |>, the pipe built into R 4.1 and later). A pipe tells R to use the output of the previous step as the first input of the next function. Let’s see an example.

Let’s say we only want a subset of the columns in "scihub_df", not all 8. We can use the select() function to get those.

#create new dataframe based on scihub_df, just selecting the 3 columns we want 
scihub_df_reduced=scihub_df%>%
  select(Timestamp,DOI,City_GeoIP)#just selecting these three columns 
#preview the first 6 rows so we can see if it did what we think it did 
head(scihub_df_reduced)

##             Timestamp                            DOI   City_GeoIP
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2 Boucherville
## 2 2017-09-07 22:58:24      10.1080/14786430601032386      Toronto
## 3 2017-05-02 09:59:00              10.1021/la501330j      Toronto
## 4 2017-07-09 09:07:20              10.1063/1.4913415      Toronto
## 5 2017-05-03 08:40:56              10.1021/jp809992g      Toronto
## 6 2017-05-03 22:11:34              10.1021/ja025109g      Toronto
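To see what the pipe is doing, it can help to write the same step with and without it. This is a small sketch on a made-up three-row dataframe (the values are copied from the preview above). It also shows that select() can drop columns with a minus sign.

```r
library(dplyr)

toy <- data.frame(
  Timestamp = c("2017-06-26 21:46:59", "2017-09-07 22:58:24", "2017-05-02 09:59:00"),
  DOI = c("10.1002/14651858.CD003392.pub2", "10.1080/14786430601032386", "10.1021/la501330j"),
  City_GeoIP = c("Boucherville", "Toronto", "Toronto"),
  Latitude = c(45.59137, 43.65323, 43.65323)
)

# With the pipe: toy is passed in as the first argument of select()
with_pipe <- toy %>% select(Timestamp, DOI, City_GeoIP)

# Without the pipe: exactly the same call, written the long way
without_pipe <- select(toy, Timestamp, DOI, City_GeoIP)

# Dropping a column instead of listing the ones to keep
dropped <- toy %>% select(-Latitude)

identical(with_pipe, without_pipe)
```

The pipe becomes more useful as the number of steps grows, since each step reads top to bottom instead of inside out.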

Filtering Rows

We could also go the other way, and only take certain rows. Let’s say we only wanted rows where the city was “Ottawa”. We can use the filter() function to find those. We can then use the print() function to see our new dataframe in the console.

Note: this is case sensitive

#making a df that is just for Ottawa
scihub_df_ottawa=scihub_df%>% #using the same original dataset
  filter(City_GeoIP=="Ottawa") #select only the rows with "Ottawa" (case sensitive) in the City_GeoIP column
#print the whole dataset since it's small
print(scihub_df_ottawa)

##             Timestamp                       DOI    IP_ID  User_ID Country_GeoIP
## 1 2017-03-26 03:00:42           10.2307/1547968  4587502  6727298        Canada
## 2 2017-07-21 16:20:54 10.1017/S1049096516001633 10172999 23057469        Canada
##   City_GeoIP Latitude Longitude
## 1     Ottawa 45.42153 -75.69719
## 2     Ottawa 45.42153 -75.69719
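Because the match is case sensitive, a row spelled "ottawa" would be missed. The sketch below uses a made-up dataframe (the lower-case spelling is invented for illustration) to show one possible way around that with tolower(), plus a filter on more than one city using %in%.

```r
library(dplyr)

toy <- data.frame(
  City_GeoIP = c("Ottawa", "ottawa", "Toronto", "Ottawa"),
  Latitude = c(45.42153, 45.42153, 43.65323, 45.42153)
)

# Exact match: the lower-case spelling is not picked up
exact <- toy %>% filter(City_GeoIP == "Ottawa")

# Compare in lower case so that capitalization no longer matters
any_case <- toy %>% filter(tolower(City_GeoIP) == "ottawa")

# Keep rows matching any value in a set of options
two_cities <- toy %>% filter(City_GeoIP %in% c("Ottawa", "Toronto"))

nrow(exact)
nrow(any_case)
nrow(two_cities)
```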

Summarizing Groups

There are a lot of basic things we can do. Let’s try getting a summary of how many times each city appears in the dataset. We’re going to use the "scihub_df_reduced" set (the one where we used select() to pick certain columns).

We’re going to start by using the group_by() function. The group_by() function creates groups based on a certain column, and then all subsequent operations (e.g. summing, averaging, counting) are done on a per-group basis. Learn more about group_by() here.

city_summary=scihub_df_reduced%>% #using the dataset with 3 columns 
  group_by(City_GeoIP)%>% #make the groups based on city 
  count() #count how many went into each group 
#see first 6 rows (they are automatically sorted alphabetically by grouping variable (aka City_GeoIP))
head(city_summary)

## # A tibble: 6 × 2
## # Groups:   City_GeoIP [6]
##   City_GeoIP       n
##   <chr>        <int>
## 1 Ajax            12
## 2 Baddeck          2
## 3 Baie-Comeau      2
## 4 Beaconsfield    15
## 5 Boucherville    10
## 6 Bracebridge      1

If you want to do a little sanity check, the sum of everything in column n should be 1000.
We can double check like this using the sum() function:

sum(city_summary$n)

## [1] 1000
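The same kind of summary can be written in a few equivalent ways. This is a sketch on a made-up six-row dataframe, showing the explicit group_by() plus summarise() form, the count() shortcut, and sorting by the most common city.

```r
library(dplyr)

toy <- data.frame(
  City_GeoIP = c("Toronto", "Ajax", "Toronto", "Ottawa", "Toronto", "Ajax")
)

# The explicit version: make groups, then count the rows in each with n()
long_way <- toy %>%
  group_by(City_GeoIP) %>%
  summarise(n = n())

# count() can take the grouping column directly
short_way <- toy %>% count(City_GeoIP)

# sort = TRUE puts the largest groups first instead of alphabetical order
most_common <- toy %>% count(City_GeoIP, sort = TRUE)

most_common

# Same sanity check as above: the counts should add up to the number of rows
sum(short_way$n) == nrow(toy)
```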

Fixing Typos

Did anyone notice anything about the summarized data?

Yes, we have two different spellings for Montréal.

Let’s fix it.

We’re not going to actually make a new dataset; we’re just going to edit what we already did by adding a new line before the group_by(), where we use a function called mutate(). mutate() is a very versatile function and can be used for a lot of different applications. You can read more about that here.

One thing you can do with mutate() is use a "nested function", which is where you have a function inside another function. In this case we are going to use the replace() function.

The replace() function is formatted like this: replace("column that we need to edit", "which values in the column need to be edited", "what we want the new value to be")

Note: there are a lot of different ways to fix typos in data sets, this is just one of many.

city_summary=scihub_df_reduced%>% #3 column dataset 
  mutate(City_GeoIP = replace(City_GeoIP, City_GeoIP == "Montréal", "Montreal"))%>% #fixing the error 
  group_by(City_GeoIP)%>%#set groups based on the city, same process as above :) 
  count()

If you remember, before we had 76 rows in city_summary; now we have 75, because the two spellings have been merged into one group.
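One way to spot this kind of typo, and to confirm that a fix worked, is to look at the distinct values before and after. The sketch below does that on a made-up dataframe and uses if_else() from dplyr as one possible alternative to replace(); it is not the method used in the lesson.

```r
library(dplyr)

toy <- data.frame(
  City_GeoIP = c("Montreal", "Montréal", "Ottawa", "Montréal")
)

# Listing the distinct values in alphabetical order puts similar spellings side by side
sort(unique(toy$City_GeoIP))

# if_else(condition, value if TRUE, value if FALSE)
fixed <- toy %>%
  mutate(City_GeoIP = if_else(City_GeoIP == "Montréal", "Montreal", City_GeoIP))

# The number of distinct city names should drop by one
n_distinct(toy$City_GeoIP)
n_distinct(fixed$City_GeoIP)
```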

Dates

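Notice that we have a timestamp column, which has both the date and the time. That could be useful, but maybe we just want the date. To do this, we are going to use a package called lubridate, which is specifically designed for working with date formats. As the startup message above shows, lubridate is already attached by library(tidyverse), so loading it again here is harmless.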

library(lubridate) #loading a package

We actually have a few ways we could do this.
1. Use lubridate functions
2. Separate using the space as a delimiter.
3. Extract the first 10 characters of each row into its own column

Let’s do the 1st option. We are going to do another nested function with mutate() using the ymd_hms() function from lubridate

scihub_df_reduced_date=scihub_df%>% #start with the original dataset
  select(Timestamp,DOI,City_GeoIP)%>% #select the columns we need 
  mutate(Timestamp=ymd_hms(Timestamp))%>% #make sure the time is interpreted in the correct format 
  mutate(Date=date(Timestamp)) #extract the date 

head(scihub_df_reduced_date) #preview the top 6

##             Timestamp                            DOI   City_GeoIP       Date
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2 Boucherville 2017-06-26
## 2 2017-09-07 22:58:24      10.1080/14786430601032386      Toronto 2017-09-07
## 3 2017-05-02 09:59:00              10.1021/la501330j      Toronto 2017-05-02
## 4 2017-07-09 09:07:20              10.1063/1.4913415      Toronto 2017-07-09
## 5 2017-05-03 08:40:56              10.1021/jp809992g      Toronto 2017-05-03
## 6 2017-05-03 22:11:34              10.1021/ja025109g      Toronto 2017-05-03
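Once a timestamp has been parsed with ymd_hms(), lubridate can pull out other pieces besides the date. This is a minimal sketch on a single timestamp taken from the first row of the data.

```r
library(lubridate)

# Parse the text into a proper date-time (UTC is assumed when no time zone is given)
stamp <- ymd_hms("2017-06-26 21:46:59")

stamp_date <- date(stamp)    # just the date
stamp_year <- year(stamp)    # just the year
stamp_hour <- hour(stamp)    # just the hour
month(stamp, label = TRUE)   # the month as a name instead of a number
wday(stamp, label = TRUE)    # the day of the week

stamp_date
stamp_year
stamp_hour
```

Any of these can be used inside mutate() in the same way date() was used above, for example to group downloads by month.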

The Separate function

Let’s try it using the separate() function to split the date from the time (Option 2).

scihub_df_reduced_time=scihub_df%>% #same selection procedure as above 
  select(Timestamp,DOI,City_GeoIP)%>%
  separate(Timestamp, c("Date", "Time"), " ") #separate the date and time based on the space (the blank in between the quotes) and call the two new columns "Date" and "Time" 

head(scihub_df_reduced_time)

##         Date     Time                            DOI   City_GeoIP
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2 Boucherville
## 2 2017-09-07 22:58:24      10.1080/14786430601032386      Toronto
## 3 2017-05-02 09:59:00              10.1021/la501330j      Toronto
## 4 2017-07-09 09:07:20              10.1063/1.4913415      Toronto
## 5 2017-05-03 08:40:56              10.1021/jp809992g      Toronto
## 6 2017-05-03 22:11:34              10.1021/ja025109g      Toronto
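Option 3 from the list of approaches earlier in this section (extracting the first 10 characters) is not demonstrated in the lesson. A minimal sketch with base R's substr(), using two timestamps from the data, might look like this.

```r
timestamps <- c("2017-06-26 21:46:59", "2017-09-07 22:58:24")

# substr(text, start, stop): keep characters 1 to 10, which is the YYYY-MM-DD part
date_text <- substr(timestamps, 1, 10)

# The result is still text; as.Date() turns it into a real date
dates <- as.Date(date_text)

date_text
class(dates)
```

This only works because every timestamp has the date in exactly the same position, so it is less robust than the lubridate approach.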

There is also a paste() function in R. It’s very similar to the CONCATENATE function in Excel, and you can learn more about it here.
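As a quick sketch of paste(), here is how the date and time pieces from the first row could be glued back together. paste0() and the sep argument are shown as well.

```r
date_part <- "2017-06-26"
time_part <- "21:46:59"

# paste() joins its arguments with a space by default
rejoined <- paste(date_part, time_part)

# sep changes what goes between the pieces
with_t <- paste(date_part, time_part, sep = "T")

# paste0() joins with nothing in between
no_gap <- paste0(date_part, time_part)

rejoined
with_t
no_gap
```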

Finally, you will probably want to save your work after everything. To do this, we can use the write.csv() function.

The format for this is write.csv(data, filepath). After running this, you can check the file location to see if a new file has appeared.

write.csv(scihub_df_reduced,"data/scihub_df_reduced.csv")
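One thing to be aware of is that write.csv() also writes the row numbers as an extra first column unless you tell it not to. The sketch below shows the difference on a made-up dataframe, writing to a temporary folder so that it does not touch your data/ folder.

```r
toy <- data.frame(
  DOI = c("10.1021/la501330j", "10.1063/1.4913415"),
  City_GeoIP = c("Toronto", "Toronto")
)

# Temporary file locations, used here only for illustration
path_default <- file.path(tempdir(), "toy_default.csv")
path_no_rows <- file.path(tempdir(), "toy_no_rows.csv")

write.csv(toy, path_default)                     # row numbers are written too
write.csv(toy, path_no_rows, row.names = FALSE)  # only the actual columns

names_default <- names(read.csv(path_default))
names_no_rows <- names(read.csv(path_no_rows))

names_default   # has an extra first column holding the row numbers
names_no_rows   # matches the original dataframe
```

If you plan to read the file back into R or open it in a spreadsheet, row.names = FALSE usually gives the cleaner result.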

Bonus Content if we get time

Joins

So, we have this information about DOI, but what if we want more information? Luckily we have the title and other publication information available from Zotero, and we can export a csv from Zotero and “join” it to our existing dataset.

This csv is going to have a lot of columns. But maybe we only want DOI (Column 9), Title (Column 5) and Publication Year (Column 3). Before, when we selected, we used the names of the columns, but we can also select based on the column number.

Notice that we were able to pipe the read.csv() immediately into the select().

zotero=read.csv("data/SciHubDOI.csv")%>%
  select(9,5,3) #selecting based on position rather than name 
head(zotero)

##                       DOI
## 1       10.1021/jp809992g
## 2   10.1093/beheco/arx008
## 3       10.1149/1.2069301
## 4       10.1002/dap.30253
## 5 10.1126/science.aaa9092
## 6  10.1002/anie.201605430
##                                                                                                               Title
## 1 Spectroscopic Studies of Pristine and Fluorinated Nano-ZrO<sub>2</sub> in Photostimulated Heterogeneous Processes
## 2                                                                           Why is the giant panda black and white?
## 3    Solid‐State NMR Studies of Ions in Protective Coatings: II . Lithium and Cesium Ions in Polybutadiene Coatings
## 4                                                      How to learn and use your institution's student voting rates
## 5                                                                            Boreal forest health and global change
## 6                                   From Alkanes to Carboxylic Acids: Terminal Oxygenation by a Fungal Peroxygenase
##   Publication.Year
## 1             2009
## 2             2017
## 3             1992
## 4             2016
## 5             2015
## 6             2016

Now, let’s join the datasets together. We are using left_join() here, but there are lots of different types of joins that you can learn more about here.

scihub_zotero=scihub_df_reduced%>%
  left_join(zotero,by="DOI") #telling it to join the dataset zotero by the values in column DOI 

head(scihub_zotero)

##             Timestamp                            DOI   City_GeoIP
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2 Boucherville
## 2 2017-09-07 22:58:24      10.1080/14786430601032386      Toronto
## 3 2017-05-02 09:59:00              10.1021/la501330j      Toronto
## 4 2017-07-09 09:07:20              10.1063/1.4913415      Toronto
## 5 2017-05-03 08:40:56              10.1021/jp809992g      Toronto
## 6 2017-05-03 22:11:34              10.1021/ja025109g      Toronto
##                                                                                                                                                                              Title
## 1                                                                                                                 Breast stimulation for cervical ripening and induction of labour
## 2                                                                                                         Adsorption characteristics of parent and copper-sputtered RD silica gels
## 3 Micropatterned Ferrocenyl Monolayers Covalently Bound to Hydrogen-Terminated Silicon Surfaces: Effects of Pattern Size on the Cyclic Voltammetry and Capacitance Characteristics
## 4                                                                              Conduction of molecular electronic devices: Qualitative insights through atom-atom polarizabilities
## 5                                                                Spectroscopic Studies of Pristine and Fluorinated Nano-ZrO<sub>2</sub> in Photostimulated Heterogeneous Processes
## 6                                Structural Basis for BABIM Inhibition of Botulinum Neurotoxin Type B Protease [ <i>J. Am. Chem. Soc.</i> <b>2000</b> , <i>122</i> , 11268−11269].
##   Publication.Year
## 1             2005
## 2             2007
## 3             2014
## 4             2015
## 5             2009
## 6             2002
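To illustrate what "left" means in left_join(), here is a sketch on two tiny made-up dataframes, compared against inner_join(). The DOIs and the 2009 publication year come from the output above; the pairing of rows and cities is invented.

```r
library(dplyr)

downloads <- data.frame(
  DOI = c("10.1021/jp809992g", "10.1021/la501330j", "10.1021/jp809992g"),
  City_GeoIP = c("Toronto", "Toronto", "Ottawa")
)

# Publication info is only available for one of the two DOIs
publication_info <- data.frame(
  DOI = "10.1021/jp809992g",
  Publication.Year = 2009
)

# left_join keeps every row of the left table; unmatched rows get NA
left <- downloads %>% left_join(publication_info, by = "DOI")

# inner_join keeps only the rows that have a match in both tables
inner <- downloads %>% inner_join(publication_info, by = "DOI")

left
inner
```

With left_join() you never lose download records, which is usually what you want when the second table is just extra information.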

Pivots

We’re going to combine a few things we have seen so far.
1. Making lists.
2. group_by(), but this time we will have TWO groupings.
3. filter(), but this time with a list of options and not just one.

We’re going to start with our reduced set. Let’s refresh on what it looks like.

head(scihub_df_reduced)

##             Timestamp                            DOI   City_GeoIP
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2 Boucherville
## 2 2017-09-07 22:58:24      10.1080/14786430601032386      Toronto
## 3 2017-05-02 09:59:00              10.1021/la501330j      Toronto
## 4 2017-07-09 09:07:20              10.1063/1.4913415      Toronto
## 5 2017-05-03 08:40:56              10.1021/jp809992g      Toronto
## 6 2017-05-03 22:11:34              10.1021/ja025109g      Toronto

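We have 3 columns: Timestamp, DOI and City_GeoIP. But maybe we want to see how often each DOI comes up in each city and then organize the information so we have 1 column for each city.

For the sake of not creating a huge dataset, we’re going to only include certain cities. Let’s define those using a list. Note that scihub_df_reduced still contains both spellings of Montréal (the fix above was only applied while building city_summary), so only the rows spelled "Montreal" will match.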

cities_list=c("Ottawa","Toronto","Montreal","Burnaby")

Now that we know what we’re working with, we can string everything together. The final line is pivot_wider(). It will be easier to explain what it does after you have seen the final product.

scihub_pivot=scihub_df_reduced%>%
  group_by(City_GeoIP,DOI)%>% #group by city and DOI, so we'll get a summary of the doi count per city 
  count()%>%
  filter(City_GeoIP %in% cities_list)%>% #filter, but only keep values that appear in cities_list
  pivot_wider(id_cols=DOI,names_from=City_GeoIP,values_from=n)#here is the pivot, we say that the rows should be based on DOI, the new column names are going to be the city, and the values in the cells are the counts of that DOI in that city

head(scihub_pivot)

## # A tibble: 6 × 5
## # Groups:   DOI [6]
##   DOI                           Burnaby Montreal Ottawa Toronto
##   <chr>                           <int>    <int>  <int>   <int>
## 1 10.1021/jp011934s                   1       NA     NA       1
## 2 10.1126/science.197.4307.967        1       NA     NA       3
## 3 10.1002/wcc.81                     NA        2     NA       4
## 4 10.1016/0006-8993(77)90423-1       NA        1     NA      NA
## 5 10.1016/S2214-109X(16)30188-7      NA        3     NA      NA
## 6 10.1037/a0017364                   NA        1     NA      NA
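The NA values in the output mean that a DOI never appeared in that city. If you would rather see a 0 there, pivot_wider() has a values_fill argument. The sketch below uses a made-up four-row dataframe with invented DOIs.

```r
library(dplyr)
library(tidyr)

toy <- data.frame(
  City_GeoIP = c("Ottawa", "Toronto", "Toronto", "Burnaby"),
  DOI = c("10.1/a", "10.1/a", "10.1/a", "10.1/b")   # hypothetical DOIs
)

wide <- toy %>%
  count(DOI, City_GeoIP) %>%        # one row per DOI and city combination
  pivot_wider(
    id_cols = DOI,                  # one row per DOI
    names_from = City_GeoIP,        # one column per city
    values_from = n,                # cells hold the counts
    values_fill = 0                 # combinations that never occur become 0 instead of NA
  )

wide
```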

Session 3 - Data Visualization

This is going to be an overview of making some basic plots in ggplot. We will cover:

Scatter plots
Bar plots
Box plots
Violin plots
The aesthetics that go with them

On this page, I have gone through and collected the material that I have found most helpful when learning ggplot. All of these will be linked as we go through. If you want any more information on a particular topic, those would be great places to start.

To start, I cannot recommend the R Graph Gallery enough. It is the first place I go when I need inspiration, and is one of the most extensive resources for R graphing on the internet.

First step will be to load all the libraries we might need. Make sure these are installed (if you don’t know how to install packages, look here or see the earlier part of this lesson here).

ggplot2 is part of the tidyverse suite of packages. I like to install the whole thing at once in case I need to do any data tidying before plotting. An optional install for this lesson is patchwork. It is a nice package for laying out multiple plots, but it’s not necessary for today’s lesson.

library(tidyverse)
library(patchwork)

## Warning: package 'patchwork' was built under R version 4.4.3

A few notes on ggplot

The ggplot cheatsheet
ggplot has
- data
- aesthetics
  - size
  - transparency
  - colour
  - fill
- geometries
  - geom_point()
  - geom_bar()
  - geom_boxplot()
  - geom_text()
  - geom_violin()
  - and many more

Basic Anatomy of a ggplot

Much like how we could assign values to variables, and then call up those variables and perform operations on them, we can also assign plots to variables. You do this in the same way. Below, I assign the plot to the variable p. If you assign a plot to a variable, you will have to call the variable in order for the plot to appear.

p=ggplot(data,aes(aes1,aes2))+ #data is your dataframe; these are global aesthetics that will apply to all the geoms (required)
    geom_X(aes(aes1,aes2))+ #X=point|bar|violin|etc, you can have many `geom`s in one plot (1 required); an aes() here only applies to this geom
    theme() # a lot of your specifications will go here (not required)
  
p #this is how you get your plot to show up

You could also just get the plot to show up automatically if you don’t set it to an object:

ggplot(data,aes(aes1,aes2))+ 
    geom_X(aes(aes1,aes2))+ 
    theme()

# a plot would appear here if there was actually any data here
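The template above is not runnable on its own, so here is a minimal concrete version. The dataframe toy and its columns budget and collection are made up for illustration (the numbers are loosely based on the scatter data used later).

```r
library(ggplot2)

# A tiny hypothetical dataset
toy <- data.frame(
  budget = c(61, 395, 768, 842),
  collection = c(16128, 108214, 94965, 161352)
)

p <- ggplot(toy, aes(x = budget, y = collection)) +  # data and global aesthetics
  geom_point() +                                     # one geom: draw the rows as points
  theme_minimal()                                    # an optional built-in theme

p  # calling the object is what makes the plot appear
```

Swapping geom_point() for another geom, or adding a second one with +, changes how the same data and aesthetics are drawn.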

Scatter Plots

Scatter plots are an excellent first plot to start off with. There are lots of ways to manipulate scatter plots to give very informative figures, which you will see farther down on this page.

Further information on making scatter plots can be found here.

First things first: load the data. What I have written in the chunk below may not work for you if you have downloaded the data separately and stored it in a different folder. You can also do scatter=read.csv(file.choose()) to open a file navigation window and select the file from there.

As with anything, it’s always a good idea to look at the data and make sure it uploaded properly before you start plotting. This also makes sure you know what the column names are. You can see more about using head() in the code from Session 2.

This is a synthetic dataset looking at universities, their enrollment numbers, library budget and collection size. Note: I was asked how you could incorporate AI into learning R. One thing I decided to try was whether it could make me a dataset to practice with, so that is where this came from.

scatter=read.csv("data/scatter.csv")
head(scatter)

##           Uni_Name     Country Enrollment Lib_Budget Collection_Size
## 1 University of A0 Netherlands      35828     768.49           94965
## 2 University of A1       Japan      20711     394.83          108214
## 3 University of A2       India       5420      61.16           16128
## 4 University of A3      Sweden      33216     841.94          161352
## 5 University of A4       Italy       2301      44.13            9833
## 6 University of A5          UK      47236    1384.20          282199

Basic Scatter Plot

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+# set my x and y values to the appropriate columns 
  geom_point()#specify that I want it shown as a scatter plot

Scatter plot with trendline

Sometimes you want to add a trend line, or line of best fit. For more information on how to get a line of best fit, see the documentation for geom_smooth().

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point()+#the first two lines are the same as above
  geom_smooth(method="lm")#now I add another geom to specify that I would like a trendline added. method="lm" is an argument I needed to specify to say what type of trend line I wanted.

## `geom_smooth()` using formula = 'y ~ x'
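The "lm" in method="lm" stands for linear model, and the message about formula = 'y ~ x' is the formula it fits. If you want the actual slope and intercept behind the line, you can fit the same kind of model yourself with lm(). This sketch uses made-up numbers that fall exactly on a straight line so the answer is easy to check by eye.

```r
# Hypothetical data where Collection_Size is exactly 1 + 2 * Lib_Budget
toy <- data.frame(
  Lib_Budget = c(1, 2, 3, 4),
  Collection_Size = c(3, 5, 7, 9)
)

# Same formula shape as the geom_smooth() message: y ~ x
fit <- lm(Collection_Size ~ Lib_Budget, data = toy)

# The first value is the intercept, the second is the slope
fit_coefficients <- coef(fit)
fit_coefficients
```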

Scatter plot with different aes for the points

When you have a lot of nice metadata associated with the variables you are plotting, it is nice to incorporate it into your figures. You can normally change:

Shape
Colour
Fill
Alpha (aka transparency)
Size

Let’s change the colour of the points based on Country and size of the points based on Enrollment

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))#the top line is the same, but now in geom_point() I have added aesthetics, one for colour and one for size, and specified the column that should be used for each.

Now, let’s change the shape of the points based on Country, but keep the size the same.

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(shape=Country,size=Enrollment))# same as above, but I have changed colour to shape.

You can see we get a lot of strange shapes. You can specify the shapes you want by using a number code. You can find all those here

If you want to change the shape of a point, but don’t want the shape to be dependent on the value of a variable, then you can place the argument outside of the aes() for that geom.

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment),shape=15) #notice that all the other aesthetics are inside the aes(), but shape=15 is outside, so it will apply to all points. I decided on shape 15 after consulting the link I added above.

What’s the difference between fill and colour? Also, why are there different versions of the same shape?

Basically, there are some shapes that have fill AND colour, so you can change the colour of the outline AND the entire shape, others may only have one or the other. Again, the link above explains which shapes have what properties.
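As a sketch of the difference, the three plots below draw the same made-up points as three kinds of circle: shape 1 (outline only), shape 16 (solid, one colour) and shape 21 (an outline colour plus a separate fill). The colours are arbitrary choices.

```r
library(ggplot2)

toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 3))

# Shape 1: an open circle, only colour has an effect
outline_only <- ggplot(toy, aes(x, y)) +
  geom_point(shape = 1, colour = "blue", size = 5)

# Shape 16: a solid circle, still controlled by colour alone
solid <- ggplot(toy, aes(x, y)) +
  geom_point(shape = 16, colour = "blue", size = 5)

# Shape 21: colour sets the outline and fill sets the inside
filled <- ggplot(toy, aes(x, y)) +
  geom_point(shape = 21, colour = "black", fill = "orange", size = 5)

outline_only
solid
filled
```

Shapes 21 to 25 are the ones that respond to both colour and fill.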

Bar Charts and Violin Plots

For this section we need the file bar_plot.csv. Again to upload it, what I have written in this chunk may not work for you. You may have to do something along the lines of bar=read.csv(file.choose()) and then select the bar_plot.csv from wherever you saved it on your computer.

The data set we will be using is looking at how much time (in minutes) students in different years of school spend in the library.

Bar Chart

More information on making bar charts can be found here

bar=read.csv("data/bar_plot.csv")
head(bar)

##   student  year  time minutes
## 1       1 year4 week2     225
## 2       2 year4 week2     450
## 3       3 year4 week2     255
## 4       4 year4 week2      75
## 5       5 year4 week2      75
## 6       6 year4 week2     210

You’ll notice that this data is showing individual observations, but we have the option to automatically plot the mean of each group. As we see below, plotting individually is not very helpful or easy to read.

ggplot(bar, aes(x=time,y=minutes,fill=year))+
  geom_bar(position="dodge2",stat = "identity") #"dodge2" is telling R to plot individual values side by side, "identity" is saying to leave the values as they appear in the csv and don't do any calculations on them.

Luckily, this kind of thing is really easy to fix, and we can have it automatically plot summary statistics for us, and we will do this for the rest of the plots in this section.

All of these will have stat="summary" and fun="mean" in the geom_bar(), this is how we can make sure we are plotting the means of each category.

Grouped Bar Chart

ggplot(bar, aes(x=time,y=minutes,fill=year))+ # same first line as the others 
  geom_bar(position="dodge",stat = "summary",fun="mean")# in here, I have specified the position, the fact I want a summary statistic plotted, and the specific statistic I want run, in this case the mean.

Stacked Bar Chart

Changing the position will allow us to have different types of bar charts. To get it stacked, use position="stack".

ggplot(bar, aes(x=time,y=minutes,fill=year))+
  geom_bar(position="stack",stat = "summary", fun= "mean")# I changed the position to stack from dodge, so now the bars will stack on top of each other.

Percent Bar Chart

ggplot(bar, aes(x=time,y=minutes,fill=year))+
  geom_bar(position="fill",stat = "summary", fun= "mean")# again, the only thing I changed is the position.

To summarize:

grouped bar chart: position="dodge"
stacked bar chart: position="stack"
percent bar chart: position="fill"
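If you want to see the numbers that stat="summary" with fun="mean" is plotting, you can compute the same group means yourself with group_by() and summarise() from Session 2. This is a sketch on a made-up four-row dataframe with the same column names as bar_plot.csv.

```r
library(dplyr)

toy <- data.frame(
  year = c("year1", "year1", "year4", "year4"),
  time = c("week2", "week2", "week2", "week2"),
  minutes = c(60, 120, 225, 450)
)

# One mean per combination of time and year, which is one bar in the grouped chart
bar_means <- toy %>%
  group_by(time, year) %>%
  summarise(mean_minutes = mean(minutes), .groups = "drop")

bar_means
```

Checking the summary table against the bar heights is a quick way to confirm the plot is showing what you think it is.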

Box and Whisker Plot

Box and whisker plots are considered an improvement over the bar plot because they give a better idea of the spread of the data. You can see the median, quartiles and outliers, which are not evident in the bar plot.

More information on making box and whisker plots can be found here

ggplot(bar, aes(x=time,y=minutes,fill=year))+ #notice that this is the exact same first line as the barplot
  geom_boxplot()
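The line in the middle of each box is the median rather than the mean. As a rough sketch of the statistics a box plot is built from, here they are computed in base R for the six minutes values shown in the preview of the data above.

```r
# The six values from the head(bar) output
minutes <- c(225, 450, 255, 75, 75, 210)

minutes_mean <- mean(minutes)       # what the bar charts above were plotting
minutes_median <- median(minutes)   # the line inside the box

# The box itself runs from the 25% to the 75% quantile
minutes_quartiles <- quantile(minutes, probs = c(0.25, 0.5, 0.75))

minutes_mean
minutes_median
minutes_quartiles
```

The exact quartile values drawn by geom_boxplot() can differ slightly for small groups, because there is more than one convention for computing them.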

Violin Plot

Violin plots are sometimes considered another level up from the box and whisker plot (so to keep track: bar<box<violin) since they give a better (more visual) idea of how the points are distributed.

More information on violin plots can be found here

ggplot(bar, aes(x=time,y=minutes,fill=year))+
  geom_violin()#again, this is the only thing we have changed

You may have noticed that for all of these, the x-axis categories are in the order final, midterm, week2 (alphabetical order). While not a big deal, it would be nice if they were week2, midterm, final. We are going to get into how to change that later. For now, we will stick to the basic plots.

Layering Plots

You can also combine different types of plots and layer them on top of each other. For example, maybe we like the overview that a violin plot gives, but still want to see the actual numbers.

ggplot(bar, aes(x=time,y=minutes))+ # I removed the fill argument here, just so there is not too much going on and I can more clearly demonstrate my point
  geom_violin(alpha=0.8)+ #this is the first appearance of the transparency (alpha) argument, I set it to 0.8 (which can be understood as 80%) so we can better see the impact of the order in which we specify the geoms
  geom_jitter(width = 0.15)# geom_jitter() is very similar to geom_point(), except it adds a small random shift to each point so that fewer points are plotted directly on top of each other. The width controls how much horizontal shift is allowed

Let's look at the impact of the order in which we called the layers. In the above example, we called geom_violin() before we called geom_jitter(). Let's try the other way around.

ggplot(bar, aes(x=time,y=minutes))+
  geom_jitter(width = 0.15)+
  geom_violin(alpha=0.8)

Because I made the violins transparent (that's what the alpha is for), we can see that the violins were placed on top of the points, instead of the points being on top of the violins.
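
You can also check the drawing order without looking at the picture, because a ggplot object stores its layers in the order you added them. This is a minimal sketch with made-up data; toy and the helper layer_order() are names I invented for the illustration. Note that geom_jitter() shows up as a point geom, since it is geom_point() with a jittered position.

```r
library(ggplot2)

# Hypothetical toy data standing in for the workshop's bar data
toy <- data.frame(
  time    = rep(c("week2", "midterm", "final"), each = 4),
  minutes = c(480, 450, 470, 430, 420, 400, 410, 390, 380, 360, 370, 350)
)

violin_first <- ggplot(toy, aes(x = time, y = minutes)) +
  geom_violin(alpha = 0.8) +
  geom_jitter(width = 0.15)

jitter_first <- ggplot(toy, aes(x = time, y = minutes)) +
  geom_jitter(width = 0.15) +
  geom_violin(alpha = 0.8)

# Hypothetical helper: list the geoms in the order they will be drawn
layer_order <- function(p) {
  unname(vapply(p$layers, function(l) class(l$geom)[1], character(1)))
}

layer_order(violin_first)  # first entry is drawn first, so it ends up underneath
layer_order(jitter_first)
```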

Customizing your Plots

This is the same plot we made above; we haven’t added any customizations yet. Let’s start by renaming the x and y axes. By default, they take the name of the column the data comes from.

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")
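
xlab() and ylab() are shortcuts. If you prefer, labs() can set several labels in one call. This is a minimal sketch with a made-up data frame (toy_scatter is hypothetical and only borrows the column names from the workshop's scatter data).

```r
library(ggplot2)

# Hypothetical toy data with the same column names as the scatter data
toy_scatter <- data.frame(
  Lib_Budget      = c(1.2, 3.4, 2.1, 5.0),
  Collection_Size = c(200, 650, 400, 900)
)

# labs() sets the axis labels and the title in a single call
labelled <- ggplot(toy_scatter, aes(x = Lib_Budget, y = Collection_Size)) +
  geom_point() +
  labs(x = "Library Budget",
       y = "Collection Size",
       title = "Budget and Collection Size")
```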

Themes

Themes are a quick way to change a lot of the visual aspects of your plot at one time. There are a lot of different themes you can use when making your plots. You can see a description of them all here, and I’ll show you some examples below.

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  theme_bw()#this is the "black and white" theme

We can see that the background is no longer grey, and the gridlines are light grey.

Now I’m going to show a bunch of different themes, and I’m also going to add a title to each of them so that we know which plot is which. Notice that all the plots are assigned to objects, so they will not be displayed automatically when the code runs.

scatter_bw=ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  ggtitle("theme_bw")+
  theme_bw()+
  theme(legend.position = "none")
scatter_light=ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  ggtitle("theme_classic")+
  theme_classic()+
  theme(legend.position = "none")
scatter_dark=ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  ggtitle("theme_dark")+
  theme_dark()+
  theme(legend.position = "none")
scatter_minimal=ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  ggtitle("theme_minimal")+
  theme_minimal()+
  theme(legend.position = "none")
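
The four plots above repeat the same first four lines. One way to cut down on the repetition, sketched here with made-up data, is to save the shared part as an object and add only what differs. The names toy_scatter, base_plot, toy_bw and toy_minimal are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as the scatter data
toy_scatter <- data.frame(
  Lib_Budget      = c(1.2, 3.4, 2.1, 5.0),
  Collection_Size = c(200, 650, 400, 900),
  Country         = c("India", "Italy", "Japan", "UK"),
  Enrollment      = c(12000, 30000, 18000, 25000)
)

# Build the shared part of the plot once
base_plot <- ggplot(toy_scatter, aes(x = Lib_Budget, y = Collection_Size)) +
  geom_point(aes(colour = Country, size = Enrollment)) +
  xlab("Library Budget") +
  ylab("Collection Size")

# Then add only what differs. The complete theme goes before theme(),
# because theme_bw() and friends replace earlier theme() tweaks.
toy_bw      <- base_plot + ggtitle("theme_bw") + theme_bw() + theme(legend.position = "none")
toy_minimal <- base_plot + ggtitle("theme_minimal") + theme_minimal() + theme(legend.position = "none")
```

Adding to base_plot creates a new plot each time, so base_plot itself stays untitled and unthemed.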

To compare all these at once, I am now going to show you a library called patchwork.

Patchwork

Patchwork is a lovely package that allows you to very simply arrange your plots in whatever manner you like. See the documentation here. You can see an example of how you would use it below.

You’ll notice that I used the objects that each plot was assigned to.

(scatter_bw|scatter_light)/(scatter_dark|scatter_minimal)
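
Here is a small self-contained sketch of the same idea with three made-up plots, in case you want to experiment with layouts before using your own plots. The data and plot names are hypothetical; plot_annotation() is the patchwork function for adding an overall title.

```r
library(ggplot2)
library(patchwork)

# Hypothetical toy data
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

p1 <- ggplot(toy, aes(x, y)) + geom_point() + ggtitle("points")
p2 <- ggplot(toy, aes(x, y)) + geom_line()  + ggtitle("line")
p3 <- ggplot(toy, aes(x, y)) + geom_col()   + ggtitle("columns")

# | puts plots side by side, / stacks them on top of each other
combined <- (p1 | p2) / p3 +
  plot_annotation(title = "Three views of the same toy data")
```

Typing combined on its own line draws two plots on the top row and one wide plot underneath.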

Now that we’ve seen patchwork, we’ll go back to customizing. Let’s say we want to make more specific changes, such as removing the gridlines in the background.

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  ggtitle("Removing All Gridlines")+
  theme_bw()+
  theme(panel.grid = element_blank())# we have a general theme function where we can specify that we don't want any grids

We can also be more specific, and only remove the vertical lines by specifying the "x" grid lines. The same could be done for the horizontal ones by specifying "y".

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  ggtitle("Removing All Vertical Lines")+
  theme_bw()+
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x=element_blank())

Modifying the Legend

There are a lot of ways you can change the legend. You will probably need to do some research for your specific use case. There are options in theme(), options in guides(), and a lot of others.

This is normally where I go when I first start troubleshooting my legend problems.

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  ggtitle("Renaming the Legends")+
  guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+ # I have added a guides() function to specify the titles of both of my legends. By default they just use the column name.
  theme_bw()+
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())
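
Two other legend changes I find myself looking up often are hiding one legend and moving the other. This is only a sketch with made-up data (toy_scatter and the new legend title are hypothetical); guides(size = "none") and theme(legend.position = "bottom") are the ggplot2 pieces being demonstrated.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as the scatter data
toy_scatter <- data.frame(
  Lib_Budget      = c(1.2, 3.4, 2.1, 5.0),
  Collection_Size = c(200, 650, 400, 900),
  Country         = c("India", "Italy", "Japan", "UK"),
  Enrollment      = c(12000, 30000, 18000, 25000)
)

legend_plot <- ggplot(toy_scatter, aes(x = Lib_Budget, y = Collection_Size)) +
  geom_point(aes(colour = Country, size = Enrollment)) +
  guides(colour = guide_legend(title = "Country of institution"),
         size = "none") +            # drop the size legend entirely
  theme_bw() +
  theme(legend.position = "bottom")  # move the remaining legend below the plot
```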

Colouring

There are A LOT of different ways you can colour your plots, so I advise you to explore the options when the time comes. There are a lot of pre-set colour schemes that look great and are easy to implement. You can find resources on them here

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  scale_colour_viridis_d()+ #this is where I specified the colours I wanted 
  ggtitle("Scatterplot with Viridis Colouring")+
  guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
  theme_bw()+
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  scale_colour_brewer(palette = "Paired")+  # I have changed from viridis to another palette, and selected which palette I wanted from a list. 
  ggtitle("Scatterplot with Colour Brewer")+
  guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
  theme_bw()+
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

If you’re going to get into manual colouring look here to figure out what kind of colours are available.

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  scale_colour_manual(values=c("orangered3","slateblue","lightseagreen","orchid3","sienna2","dodgerblue"))+
  ggtitle("Scatterplot with Manual Colours")+
  guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
  theme_bw()+
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

If you are hoping to apply a specific colour to a specific category you can do it like this.

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  scale_colour_manual(values=c(India="orangered3",Italy="slateblue",Japan="lightseagreen",Netherlands="orchid3",Sweden="sienna2",UK="dodgerblue"))+
  ggtitle("Scatterplot with Manual Colours-Assigned to Country")+
  guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
  theme_bw()+
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())
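
When you assign colours by name, the names have to match the categories in your data exactly. A category with no matching name typically ends up in the default grey for missing values. Here is a small base R sketch, with made-up values, for checking the match before you plot; countries, country_colours and missing_colours are hypothetical names.

```r
# Hypothetical country column and a named colour vector that forgot one country
countries <- c("India", "Italy", "Japan", "Italy", "UK")
country_colours <- c(India = "orangered3", Italy = "slateblue", Japan = "lightseagreen")

# Categories in the data that have no colour assigned
missing_colours <- setdiff(unique(countries), names(country_colours))
missing_colours

# Check that every colour name is one that R recognises
all(country_colours %in% colours())
```

If missing_colours is not empty, add those categories to the named vector before passing it to scale_colour_manual().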

Facets

Sometimes when you have a lot of data it can be useful to facet your plots, which splits the data into one small panel per category. This is really easy, as you can see below. The different options for facets can be seen here

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
  theme_bw()+
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())+
  facet_grid(~Country)

You can see that faceting made our x-axis labels difficult to read. Luckily, this is one of the many elements we can fix in theme().

ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))+
  xlab("Library Budget")+
  ylab("Collection Size")+
  guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
  theme_bw()+
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        axis.text.x = element_text(angle=45,vjust=0.5))+
  facet_grid(~Country)
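
Another way to deal with squeezed panels is facet_wrap(), which wraps the panels onto several rows instead of putting them all in a single row. This is a sketch with made-up data (toy_scatter and its values are hypothetical); ncol controls how many panels go on each row.

```r
library(ggplot2)

# Hypothetical toy data: one library per country
toy_scatter <- data.frame(
  Lib_Budget      = c(1.2, 3.4, 2.1, 5.0, 4.2, 2.8),
  Collection_Size = c(200, 650, 400, 900, 750, 500),
  Country         = c("India", "Italy", "Japan", "Netherlands", "Sweden", "UK")
)

# Six panels arranged as two rows of three
wrapped <- ggplot(toy_scatter, aes(x = Lib_Budget, y = Collection_Size)) +
  geom_point() +
  facet_wrap(~Country, ncol = 3)
```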

Changing Factor Levels

Remember that in our bar plot above, the categories for the sleep dataset weren’t in the right order. We are going to reset the factor levels so that they plot in the order we want.

We’ll start with a basic plot of the old factor levels

bar_ofl=ggplot(bar, aes(x=time,y=minutes,fill=year))+geom_bar(position="dodge",stat="identity")

Now let’s reset the factor levels, then do the same plot again, and then compare the two

bar$time <- factor(bar$time,levels = c("week2","midterm","final"))
bar_rfl=ggplot(bar, aes(x=time,y=minutes,fill=year))+geom_bar(position="dodge",stat="identity")
(bar_ofl|bar_rfl)
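
To see what factor() is doing without any plotting, here is a small base R sketch with made-up values (time, default_order, time_ordered and misspelt are hypothetical names). It also shows a common pitfall: any value that does not appear in levels is silently turned into NA.

```r
# Hypothetical values of the time column
time <- c("final", "midterm", "week2", "midterm")

# Without levels, the order is alphabetical
default_order <- factor(time)
levels(default_order)

# With levels, the order is whatever we specify
time_ordered <- factor(time, levels = c("week2", "midterm", "final"))
levels(time_ordered)

# Pitfall: a misspelt level ("week 2" with a space) turns those values into NA
misspelt <- factor(time, levels = c("week 2", "midterm", "final"))
sum(is.na(misspelt))
```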

Troubleshooting

check the order of your layers
check the class of the columns you are trying to use using class()
check the factor levels with levels()
to check how many of each category you have use table(df$col)
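
Here is what those checks look like on a small made-up data frame (df and its values are hypothetical). The example includes a common import problem, numbers stored as text, which class() catches right away.

```r
# Hypothetical data frame where the numbers were read in as text
df <- data.frame(
  time    = c("week2", "midterm", "final", "final"),
  minutes = c("480", "450", "420", "400"),
  stringsAsFactors = FALSE
)

# class() shows that minutes is text, so it would not plot as a number
class_before <- class(df$minutes)
class_before

# Convert it and check again
df$minutes <- as.numeric(df$minutes)
class(df$minutes)

# levels() returns NULL for a character column, which tells us it is not a factor yet
levels(df$time)

# table() counts how many rows there are in each category
counts <- table(df$time)
counts
```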

Other Fun Examples

R also allows you to make animated visualizations. For example, we can look at the number of library visitors over a period of time, using synthetic data created by Melissa and Chantal for the Power BI session.

This was another opportunity for me to explore whether AI could code. I had tried a few years ago and it wasn’t perfect, so I tried again with this application. The code ChatGPT gave me looked ok, but didn’t actually produce the image. I still had to troubleshoot, and it was not an issue I would have been able to fix quickly if I were a beginner.

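# load these packages if they are not already loaded earlier in the notebook
library(dplyr)     # %>%, select(), mutate(), arrange(), group_by()
library(lubridate) # ymd_hm(), date()
library(gganimate) # transition_time(), ease_aes(), animate()
data=read.csv("data/synthetic_gatecounts.csv")%>%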
  dplyr::select(name,date,visitors)%>%
  mutate(Timestamp=ymd_hm(date))%>% #make sure the time is interpreted in the correct format 
  mutate(Date=date(Timestamp))%>% #extract the date 
  dplyr::select(name,Date,visitors)

# Calculate cumulative visitors
data_cumulative <- data %>%
  arrange(name, Date) %>%
  group_by(name) %>%
  mutate(cumulative_visitors = cumsum(visitors)) %>%
  ungroup()

# Create animated bar plot
plot <- ggplot(data_cumulative, aes(x = reorder(name, -cumulative_visitors), 
                                    y = cumulative_visitors, 
                                    fill = name)) +
  geom_col(show.legend = FALSE) +
  labs(title = 'Cumulative Visitors by Place',
       subtitle = 'Date: {frame_time}',
       x = 'Name',
       y = 'Cumulative Visitors') +
  scale_y_continuous(labels = scales::comma) +
  transition_time(Date) +
  ease_aes('linear') +
  theme_minimal(base_size = 16)

# Animate
animate(plot, nframes = 100, fps = 10, width = 800, height = 600)
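
If the cumulative step in the code above is hard to follow, it can help to run it on a tiny table where you can check the answer by hand. This is a sketch with made-up numbers; toy_counts and toy_cumulative are hypothetical names, and the column names mirror the ones used above.

```r
library(dplyr)

# Hypothetical gate counts for two places over two days
toy_counts <- data.frame(
  name     = c("A", "B", "A", "B"),
  Date     = as.Date(c("2025-01-01", "2025-01-01", "2025-01-02", "2025-01-02")),
  visitors = c(10, 5, 20, 15)
)

# Same steps as above: sort, group by place, then take a running total within each place
toy_cumulative <- toy_counts %>%
  arrange(name, Date) %>%
  group_by(name) %>%
  mutate(cumulative_visitors = cumsum(visitors)) %>%
  ungroup()

toy_cumulative
```

Place A goes from 10 to 30 and place B from 5 to 20, because the running total restarts for each name.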

R Part 3: Basic Data Tidying

Emma Garlock

July 10th, 2025