The webpage for this session is available at:
In this session, we are going to learn some basics about cleaning data in R. The folder for this session is available at https://tinyurl.com/45vxsawu.
For Session 2 you will need:
- FileA_RMarkdown_uOttawabiblio.rmd
  - This is the same notebook that I will be showing, but with the code removed
  - It’s not necessary for you to use this file; you can also work in a completely new notebook or R script
- data/
  - SciHub_SampleData.csv
  - SciHubDOI.csv
There are other files:
- FileB_MarkDown_uOttawabiblio.rmd
  - This is the same file as above, but with the code already there
- FileB_MarkDown_uOttawabiblio.nb.html
  - This is the HTML file of the completed notebook
- notebook_images/
  - These are just the images that appear in the notebook
For Session 3 you will need:
- data_visualization.rmd
  - This is the same notebook that I will be showing, with some of the code removed
- data/
  - scatter.csv
  - bar_plot.csv
  - synthetic_gatecounts.csv
    - We won’t actually work with this one, but it’s in the folder
At the in-person session, I would now give an overview of RStudio. If you are going through this at a later date, you can watch this video.
When you first open R you should see this:
Once you open a file, you should see this.
The above images are from the RDM Jumpstart Program. They also have introductory lessons on R, which are available here.
There are three key features of R, which the examples below walk through:
125+65
## [1] 190
45*76
## [1] 3420
8959/32
## [1] 279.9688
x=3
y=6
x*y
## [1] 18
test_string="uOttawaBiblio"
print(test_string)
## [1] "uOttawaBiblio"
test_number_list=c(2,4,6,7,8,3)
test_character_list=c("Spring","Summer","Fall","Winter")
df=read.csv("data/testfile.csv")
We have already seen a function: print(). Both print() and read.csv() are base R functions (i.e. they come with R by default). The function is the thing outside the brackets, and you perform the function on the argument, which goes inside the brackets. So, for the example above, the function was print() and the argument was test_string.
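As a quick illustration (not part of the session files), here is one more base R function with its argument inside the brackets; round() also takes an optional second argument called digits:
round(3.14159, digits = 2) #the function is round(), the arguments are 3.14159 and digits=2; this returns 3.14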
To get extra functions, you need to download packages. Read more about functions and packages here.
First, we are going to set ourselves up in a working directory.
Note: if you downloaded the whole folder and opened one of the provided files, you can ignore the advice about where to save things; it should all be organized already.
Save the R notebook or R script file somewhere that makes sense; this should be the same location where you have the data stored for this session. See the example below.
Select "Session"
from the top menu bar, then
"Set working directory"
then
"to Source file location"
. The directory should now be
printed on the top of the console. See the example below.
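If you prefer code over menus, the equivalent base R functions are getwd() and setwd(); the path below is just a placeholder that you would swap for your own folder:
getwd() #prints the current working directory to the console
#setwd("path/to/your/session/folder") #uncomment and edit to set the working directory manually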
The following examples are going to be done using functions from tidyverse. tidyverse is a collection of packages containing functions that are so commonly used for analyses that people decided to make sure you could download them all at once AND that they would be highly interoperable. You can learn more about tidyverse here.
There are two ways to get a package for the first time: the first is to run install.packages() with the package name in the brackets; the second is to go to the panel on the lower right, hit the "Packages" tab, then Install, and type "tidyverse". You do not have to install packages every time, but you do need to load them every time using library().
Let's load our package:
#this is how you install using code, this is equivalent to going through the Packages panel. I've commented it out since I don't actually need to install
#install.packages("tidyverse")
#Loading the package
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Now we can load our data and assign it the name scihub_df. Then we'll take a look at the first few rows using the head() function. There is also a tail() function to see the last rows. For more info on uploading data and the different formats you can use, check out this link.
I have elected to locate my data by specifying a file path. You could also do it like scihub_df=read.csv(file.choose()) to open up a file explorer.
#upload the dataset; it's located in the data folder
scihub_df=read.csv("data/SciHub_SampleData.csv")
#show the first 6 rows
head(scihub_df)
## Timestamp DOI IP.identifier
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2 7809386
## 2 2017-09-07 22:58:24 10.1080/14786430601032386 1358764
## 3 2017-05-02 09:59:00 10.1021/la501330j 6039317
## 4 2017-07-09 09:07:20 10.1063/1.4913415 5997924
## 5 2017-05-03 08:40:56 10.1021/jp809992g 858831
## 6 2017-05-03 22:11:34 10.1021/ja025109g 858831
## User.identifier Country.according.to.GeoIP City.according.to.GeoIP Latitude
## 1 16866302 Canada Boucherville 45.59137
## 2 33577860 Canada Toronto 43.65323
## 3 9158745 Canada Toronto 43.65323
## 4 19896736 Canada Toronto 43.65323
## 5 9278539 Canada Toronto 43.65323
## 6 9370108 Canada Toronto 43.65323
## Longitude
## 1 -73.43641
## 2 -79.38318
## 3 -79.38318
## 4 -79.38318
## 5 -79.38318
## 6 -79.38318
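As an aside, the tail() function mentioned above works the same way; for example:
tail(scihub_df, 3) #show the last 3 rows instead of the first 6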
The head() output looks good, but from experience, those column names might make life difficult later; let's rename them to something shorter, without spaces. We can then check to make sure the names were changed properly and we didn't mess anything up.
For more examples of how to rename columns check out this link.
We can then use the names() function to see what the names of the columns are.
#change the names of scihub_df. The list needs to be the same length as the number of columns
colnames(scihub_df)=c("Timestamp",
"DOI",
"IP_ID",
"User_ID",
"Country_GeoIP",
"City_GeoIP",
"Latitude",
"Longitude")
#just print the names of columns to confirm they are the new names
names(scihub_df)
## [1] "Timestamp" "DOI" "IP_ID" "User_ID"
## [5] "Country_GeoIP" "City_GeoIP" "Latitude" "Longitude"
tidyverse uses something called "pipes", which look like %>% or |>, and which tell R to automatically use the last output as the input for the next function. Let's see an example.
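But first, as a minimal illustration of what a pipe does on its own, these two lines do exactly the same thing:
head(scihub_df) #without a pipe: the dataframe is given as the argument of head()
scihub_df %>% head() #with a pipe: the dataframe is passed along as the input to head()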
Let's say we only want a subset of the columns in scihub_df, not all 8. We can use the select() function to get those.
#create new dataframe based on scihub_df, just selecting the 3 columns we want
scihub_df_reduced=scihub_df%>%
select(Timestamp,DOI,City_GeoIP)#just selecting these three columns
#preview the first 6 rows so we can see if it did what we think it did
head(scihub_df_reduced)
## Timestamp DOI City_GeoIP
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2 Boucherville
## 2 2017-09-07 22:58:24 10.1080/14786430601032386 Toronto
## 3 2017-05-02 09:59:00 10.1021/la501330j Toronto
## 4 2017-07-09 09:07:20 10.1063/1.4913415 Toronto
## 5 2017-05-03 08:40:56 10.1021/jp809992g Toronto
## 6 2017-05-03 22:11:34 10.1021/ja025109g Toronto
We could also go the other way, and only take certain rows. Let's say we only wanted rows where the city was "Ottawa"; we can use the filter() function to find those. We can then use the print() function to see our new dataframe in the console.
Note: this is case sensitive.
#making a df that is just for Ottawa
scihub_df_ottawa=scihub_df%>% #using the same original dataset
filter(City_GeoIP=="Ottawa") #select only the rows with "Ottawa" (case sensitive) int he City_GeoIP column
#print the whole dataset since it's small
print(scihub_df_ottawa)
## Timestamp DOI IP_ID User_ID Country_GeoIP
## 1 2017-03-26 03:00:42 10.2307/1547968 4587502 6727298 Canada
## 2 2017-07-21 16:20:54 10.1017/S1049096516001633 10172999 23057469 Canada
## City_GeoIP Latitude Longitude
## 1 Ottawa 45.42153 -75.69719
## 2 Ottawa 45.42153 -75.69719
There are a lot of basic things we can do. Let's just try getting a summary of how many times each city appears in the dataset. We're going to use the scihub_df_reduced set (the one where we used select() to pick certain columns).
We're going to start by using the group_by() function. The group_by() function creates groups based on a certain column, and then all subsequent operations (e.g. summing, averaging, counting) are done on a per-group basis. Learn more about group_by() here.
city_summary=scihub_df_reduced%>% #using the dataset with 3 columns
group_by(City_GeoIP)%>% #make the groups based on city
count() #count how many went into each group
#see first 6 rows (they are automatically sorted alphabetically by grouping variable (aka City_GeoIP))
head(city_summary)
## # A tibble: 6 × 2
## # Groups: City_GeoIP [6]
## City_GeoIP n
## <chr> <int>
## 1 Ajax 12
## 2 Baddeck 2
## 3 Baie-Comeau 2
## 4 Beaconsfield 15
## 5 Boucherville 10
## 6 Bracebridge 1
If you want to do a little sanity check, the sum of everything in column n should be 1000. We can double-check using the sum() function:
sum(city_summary$n)
## [1] 1000
Did anyone notice anything about the summarized data?
Yes, we have two different spellings for Montréal.
Let's fix it.
We're not going to actually make a new dataset; we're just going to edit what we already did, by adding a new line before the group_by() where we use a function called mutate(). mutate() is a very versatile function and can be used for a lot of different applications. You can read more about that here.
One thing you can do with mutate() is called a "nested function": this is where you have a function inside another function. In this case we are going to use the replace() function.
The replace() function is formatted like this:
The replace()
function is formatted like this:
replace("column that we need to edit","what values in the column need to be edited,"What we want the new value to be")
Note: there are a lot of different ways to fix typos in datasets; this is just one of many.
city_summary=scihub_df_reduced%>% #3 column dataset
mutate(City_GeoIP = replace(City_GeoIP, City_GeoIP == "Montréal", "Montreal"))%>% #fixing the error
group_by(City_GeoIP)%>%#set groups based on the city, same process as above :)
count()
If you remember, before we had 76 observations, now we have 75.
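If you want to check that in code, nrow() counts the rows of a dataframe (here, one row per city):
nrow(city_summary) #should now be 75; it was 76 before we fixed the Montréal spelling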
Notice that we have a Timestamp column; this has both the date and the time. That could be useful, but maybe we just want the date. To do this, we are going to load a new package called lubridate, which is specifically used for working with date formats.
library(lubridate) #loading a package
We actually have a few ways we could do this:
1. Use lubridate functions
2. Separate using the space as a delimiter
3. Extract the first 10 characters of each row into its own column
Let's do the 1st option. We are going to do another nested function with mutate(), using the ymd_hms() function from lubridate.
scihub_df_reduced_date=scihub_df%>% #start with the original dataset
select(Timestamp,DOI,City_GeoIP)%>% #select the columns we need
mutate(Timestamp=ymd_hms(Timestamp))%>% #make sure the time is interpreted in the correct format
mutate(Date=date(Timestamp)) #extract the date
head(scihub_df_reduced_date) #preview the top 6
## Timestamp DOI City_GeoIP Date
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2 Boucherville 2017-06-26
## 2 2017-09-07 22:58:24 10.1080/14786430601032386 Toronto 2017-09-07
## 3 2017-05-02 09:59:00 10.1021/la501330j Toronto 2017-05-02
## 4 2017-07-09 09:07:20 10.1063/1.4913415 Toronto 2017-07-09
## 5 2017-05-03 08:40:56 10.1021/jp809992g Toronto 2017-05-03
## 6 2017-05-03 22:11:34 10.1021/ja025109g Toronto 2017-05-03
Let's try it using the separate() function to get the time (Option 2).
scihub_df_reduced_time=scihub_df%>% #same selection procedure as above
select(Timestamp,DOI,City_GeoIP)%>%
  separate(Timestamp, c("Date", "Time"), " ") #separate the date and time based on the space (the blank in between the quotes) and call the two new columns "Date" and "Time"
head(scihub_df_reduced_time)
## Date Time DOI City_GeoIP
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2 Boucherville
## 2 2017-09-07 22:58:24 10.1080/14786430601032386 Toronto
## 3 2017-05-02 09:59:00 10.1021/la501330j Toronto
## 4 2017-07-09 09:07:20 10.1063/1.4913415 Toronto
## 5 2017-05-03 08:40:56 10.1021/jp809992g Toronto
## 6 2017-05-03 22:11:34 10.1021/ja025109g Toronto
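For completeness, option 3 (taking the first 10 characters of the timestamp) could look something like this; it is just a sketch using str_sub() from stringr (loaded as part of tidyverse), and the dataframe name is only for illustration:
scihub_df_reduced_chr=scihub_df%>%
  select(Timestamp,DOI,City_GeoIP)%>%
  mutate(Date=str_sub(Timestamp,1,10)) #keep characters 1 to 10, e.g. "2017-06-26"; note this stays a character column rather than a true Date
head(scihub_df_reduced_chr)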
There is also a paste() function in R. It's very similar to the concatenate function in Excel, and you can learn more about it here.
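For example, paste() glues its arguments together into one string, using a space as the default separator:
paste("2017-06-26", "21:46:59") #gives "2017-06-26 21:46:59"
paste0("Year", 2017) #paste0() is the same idea but with no separator, giving "Year2017"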
Finally, you will probably want to save your work after everything. To do this, we can use the write.csv() function. The format for this is write.csv(data, filepath). After running this, you can check the file location to see if a new file has appeared.
write.csv(scihub_df_reduced,"data/scihub_df_reduced.csv")
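One optional tweak: by default, write.csv() adds an extra first column of row numbers to the file. If you don't want that, you can set row.names=FALSE:
write.csv(scihub_df_reduced,"data/scihub_df_reduced.csv",row.names=FALSE) #same as above, minus the row number column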
So, we have this information about DOI, but what if we want more information? Luckily we have the title and other publication information available from Zotero, and we can export a csv from Zotero and “join” it to our existing dataset.
This csv is going to have a lot of columns. But maybe we only want DOI (Column 9), Title (Column 5) and Publication Year (Column 3). Before when we selected, we used the names of the columns, but we can also select based on the column number.
Notice that we were able to pipe the read.csv() immediately into the select().
zotero=read.csv("data/SciHubDOI.csv")%>%
select(9,5,3) #selecting based on position rather than name
head(zotero)
## DOI
## 1 10.1021/jp809992g
## 2 10.1093/beheco/arx008
## 3 10.1149/1.2069301
## 4 10.1002/dap.30253
## 5 10.1126/science.aaa9092
## 6 10.1002/anie.201605430
## Title
## 1 Spectroscopic Studies of Pristine and Fluorinated Nano-ZrO<sub>2</sub> in Photostimulated Heterogeneous Processes
## 2 Why is the giant panda black and white?
## 3 Solid‐State NMR Studies of Ions in Protective Coatings: II . Lithium and Cesium Ions in Polybutadiene Coatings
## 4 How to learn and use your institution's student voting rates
## 5 Boreal forest health and global change
## 6 From Alkanes to Carboxylic Acids: Terminal Oxygenation by a Fungal Peroxygenase
## Publication.Year
## 1 2009
## 2 2017
## 3 1992
## 4 2016
## 5 2015
## 6 2016
Now, let's join the datasets together. We are using left_join() here, but there are lots of different types of joins, which you can learn more about here.
scihub_zotero=scihub_df_reduced%>%
left_join(zotero,by="DOI") #telling it to join the dataset zotero by the values in column DOI
head(scihub_zotero)
## Timestamp DOI City_GeoIP
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2 Boucherville
## 2 2017-09-07 22:58:24 10.1080/14786430601032386 Toronto
## 3 2017-05-02 09:59:00 10.1021/la501330j Toronto
## 4 2017-07-09 09:07:20 10.1063/1.4913415 Toronto
## 5 2017-05-03 08:40:56 10.1021/jp809992g Toronto
## 6 2017-05-03 22:11:34 10.1021/ja025109g Toronto
## Title
## 1 Breast stimulation for cervical ripening and induction of labour
## 2 Adsorption characteristics of parent and copper-sputtered RD silica gels
## 3 Micropatterned Ferrocenyl Monolayers Covalently Bound to Hydrogen-Terminated Silicon Surfaces: Effects of Pattern Size on the Cyclic Voltammetry and Capacitance Characteristics
## 4 Conduction of molecular electronic devices: Qualitative insights through atom-atom polarizabilities
## 5 Spectroscopic Studies of Pristine and Fluorinated Nano-ZrO<sub>2</sub> in Photostimulated Heterogeneous Processes
## 6 Structural Basis for BABIM Inhibition of Botulinum Neurotoxin Type B Protease [ <i>J. Am. Chem. Soc.</i> <b>2000</b> , <i>122</i> , 11268−11269].
## Publication.Year
## 1 2005
## 2 2007
## 3 2014
## 4 2015
## 5 2009
## 6 2002
We're going to combine a few things we have seen so far:
1. Making lists
2. group_by(), but this time we will have TWO groupings
3. filter(), but this time with a list of options and not just one
We’re going to start with our reduced set. Let’s refresh on what it looks like.
head(scihub_df_reduced)
## Timestamp DOI City_GeoIP
## 1 2017-06-26 21:46:59 10.1002/14651858.CD003392.pub2 Boucherville
## 2 2017-09-07 22:58:24 10.1080/14786430601032386 Toronto
## 3 2017-05-02 09:59:00 10.1021/la501330j Toronto
## 4 2017-07-09 09:07:20 10.1063/1.4913415 Toronto
## 5 2017-05-03 08:40:56 10.1021/jp809992g Toronto
## 6 2017-05-03 22:11:34 10.1021/ja025109g Toronto
We have 3 columns: Timestamp, DOI and City_GeoIP. But maybe we want to see how often each DOI comes up in each city, and then organize the information so we have one column for each city.
For the sake of not creating a huge dataset, we're going to only include certain cities. Let's define those using a list.
cities_list=c("Ottawa","Toronto","Montreal","Burnaby")
Now that we know what we're working with, we can string everything together. The final line is pivot_wider(); it will be easier to explain what it does after you have seen the final product.
scihub_pivot=scihub_df_reduced%>%
group_by(City_GeoIP,DOI)%>% #group by city and DOI, so we'll get a summary of the doi count per city
count()%>%
filter(City_GeoIP %in% cities_list)%>% #filter, but only keep values that appear in cities_list
pivot_wider(id_cols=DOI,names_from=City_GeoIP,values_from=n)#here is the pivot, we say that the rows should be based on DOI, the new column names are going to be the city, and the values in the cells are the counts of that DOI in that city
head(scihub_pivot)
## # A tibble: 6 × 5
## # Groups: DOI [6]
## DOI Burnaby Montreal Ottawa Toronto
## <chr> <int> <int> <int> <int>
## 1 10.1021/jp011934s 1 NA NA 1
## 2 10.1126/science.197.4307.967 1 NA NA 3
## 3 10.1002/wcc.81 NA 2 NA 4
## 4 10.1016/0006-8993(77)90423-1 NA 1 NA NA
## 5 10.1016/S2214-109X(16)30188-7 NA 3 NA NA
## 6 10.1037/a0017364 NA 1 NA NA
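The NA values just mean that a DOI never appeared in that city. If you would rather show 0 there, pivot_wider() has a values_fill argument; this is a variation on the code above, not part of the original notebook, and the object name is only for illustration:
scihub_pivot_zeros=scihub_df_reduced%>%
  group_by(City_GeoIP,DOI)%>%
  count()%>%
  filter(City_GeoIP %in% cities_list)%>%
  pivot_wider(id_cols=DOI,names_from=City_GeoIP,values_from=n,values_fill=0) #fill the missing city/DOI combinations with 0 instead of NA
head(scihub_pivot_zeros)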
This is going to be an overview of making some basic plots in ggplot. We will cover a handful of common plot types (scatter, bar, box and violin plots) and some common customizations.
On this page, I have gone through and collected the material that I have found most helpful when learning ggplot. All of these resources will be linked as we go through. If you want any more information on a particular topic, those would be great places to start.
To start, I cannot recommend the R Graph Gallery enough. It is the first place I go when I need inspiration, and is one of the most extensive resources for R graphing on the internet.
The first step will be to load all the libraries we might need. Make sure these are installed (if you don't know how to install packages, look here, or see the earlier part of this lesson here).
ggplot2 is part of the tidyverse suite of packages; I like to install the whole thing at once in case I need to do any data tidying before plotting. An optional install for this lesson is patchwork, a nice package for laying out multiple plots, but it's not necessary for today's lesson.
library(tidyverse)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.4.3
Some of the geoms available in ggplot2 include:
- geom_point()
- geom_bar()
- geom_boxplot()
- geom_text()
- geom_violin()
Much like how we could assign values to variables, and then call up those variables and perform operations on them, we can also assign plots to variables. You do this in the same way; below you can see how I assign the plot to the variable p. If you assign a plot to a variable, you will have to call the variable in order for the plot to appear.
p=ggplot(data,aes(aes1,aes2))+ #the data, plus the global aesthetics that will apply to all the geoms (required)
  geom_X(aes(aes1,aes2))+ #X=point|bar|violin|etc, you can have many `geom`s in one plot (1 required); aesthetics here are optional if the global ones already cover what you need
  theme() # a lot of your specifications will go here (not required)
p #this is how you get your plot to show up
You could also just get the plot to show up automatically if you don't assign it to an object.
ggplot(data,aes(aes1,aes2))+
  geom_X(aes(aes1,aes2))+
  theme()
# a plot would appear here if there was actually any data here
Scatter plots are an excellent first plot to start off with. There are lots of ways to manipulate scatter plots to give very informative figures, which you will see farther down on this page.
Further information on making scatter plots can be found here.
First things first, load the data. What I have written in the chunk below may not work for you if you have downloaded the data separately and stored it in a different folder. You can also do scatter=read.csv(file.choose()) to open a file navigation window and select it from there.
As with anything, it's always a good idea to look at the data and make sure it uploaded properly before you start plotting. This also makes sure you know what the column names are. You can see more about using head() in the code from Session 2.
This is a synthetic dataset looking at universities, their enrollment numbers, library budget and collection size. Note: I was asked how you could incorporate AI into learning R; one thing I decided to try was whether it could make me a dataset to practice with, so that is where this came from.
scatter=read.csv("data/scatter.csv")
head(scatter)
## Uni_Name Country Enrollment Lib_Budget Collection_Size
## 1 University of A0 Netherlands 35828 768.49 94965
## 2 University of A1 Japan 20711 394.83 108214
## 3 University of A2 India 5420 61.16 16128
## 4 University of A3 Sweden 33216 841.94 161352
## 5 University of A4 Italy 2301 44.13 9833
## 6 University of A5 UK 47236 1384.20 282199
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+# set my x and y values to the appropriate columns
geom_point()#specify that I want it shown as a scatter plot
Sometimes you want to add a trend line, or line of best fit. For more information on how to get a line of best fit, see the documentation for geom_smooth().
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point()+#the first two lines are the same as above
geom_smooth(method="lm")#now I add another geom to specify that I would like a trendline added. method="lm" is an argument I needed to specify to say what type of trend line I wanted.
## `geom_smooth()` using formula = 'y ~ x'
When you have a lot of nice metadata associated with the variables you are plotting, it is nice to incorporate it into your figures. You can normally change things like the colour, size and shape of the points.
Let's change the colour of the points based on Country and the size of the points based on Enrollment.
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment))#the top line is the same, but now in the geom_point() I have added aesthetics, one for colour and one for size, and specified the column that should be used for each.
Now, let's change the shape of the points based on Country, but keep the size the same.
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(shape=Country,size=Enrollment))# same as above, but I have changed colour to shape.
You can see we get a lot of strange shapes. You can specify the shapes you want by using a number code. You can find all those here
If you want to change the shape of a point, but don't want the shape to be dependent on the value of a variable, then you can place the argument outside of the aes() for that geom.
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
  geom_point(aes(colour=Country,size=Enrollment),shape=15) #notice that all the other aesthetics have been inside the aes(), but shape=15 is outside, so it will apply to all points. I decided on shape 15 after consulting the link above.
What's the difference between fill and colour? Also, why are there different versions of the same shape?
Basically, some shapes have fill AND colour, so you can change the colour of the outline AND the colour of the inside; others may only have one or the other. Again, the link above explains which shapes have which properties.
For this section we need the file bar_plot.csv. Again, to upload it, what I have written in this chunk may not work for you. You may have to do something along the lines of bar=read.csv(file.choose()) and then select bar_plot.csv from wherever you saved it on your computer.
The data set we will be using is looking at how much time (in minutes) students in different years of school spend in the library.
More information on making bar charts can be found here
bar=read.csv("data/bar_plot.csv")
head(bar)
## student year time minutes
## 1 1 year4 week2 225
## 2 2 year4 week2 450
## 3 3 year4 week2 255
## 4 4 year4 week2 75
## 5 5 year4 week2 75
## 6 6 year4 week2 210
You’ll notice that this data is showing individual observations, but we have the option to automatically plot the mean of each group. As we see below, plotting individually is not very helpful or easy to read.
ggplot(bar, aes(x=time,y=minutes,fill=year))+
geom_bar(position="dodge2",stat = "identity") #"dodge2" is telling R to plot individual values side by size, "identity" is saying to leave the values as they appear in the csv and don't do any calculations on them.
Luckily, this kind of thing is really easy to fix: we can have it automatically plot summary statistics for us, and we will do this for the rest of the plots in this section. All of these will have stat="summary" and fun="mean" in the geom_bar(); this is how we can make sure we are plotting the means of each category.
ggplot(bar, aes(x=time,y=minutes,fill=year))+ # same first line as the others
geom_bar(position="dodge",stat = "summary",fun="mean")# in here, I have specified the position, the fact I want a summary statistic plotted, and the specific statistic I want run, in this case the mean.
Changing the position will allow us to have different types of bar charts. To get it stacked, use position="stack".
ggplot(bar, aes(x=time,y=minutes,fill=year))+
geom_bar(position="stack",stat = "summary", fun= "mean")# I changed the position to stack from dodge, so now the bars will stack on top of each other.
ggplot(bar, aes(x=time,y=minutes,fill=year))+
geom_bar(position="fill",stat = "summary", fun= "mean")# again, the only thing I changed is the position.
To summarize:
- position="dodge" puts the bars side by side
- position="stack" stacks the bars on top of each other
- position="fill" stacks the bars and scales each stack to the same height, so you see proportions
Box and whisker plots are considered an improvement over the bar plot because they give a better idea of the spread of the data. You can see the median, quartiles and outliers; these are not evident with the bar plot.
More information on making box and whisker plots can be found here
ggplot(bar, aes(x=time,y=minutes,fill=year))+ #notice that this is the exact same first line as the barplot
geom_boxplot()
Violin plots are sometimes considered another level up from the box and whisker plot (so to keep track: bar < box < violin), since they give a better (more visual) idea of how the points are distributed.
More information on violin plots can be found here
ggplot(bar, aes(x=time,y=minutes,fill=year))+
geom_violin()#again, this is the only thing we have changed
You may have noticed that for all of these, the x-axis categories are in the order final, midterm, week2. While not a big deal, it would be nice if they were week2, midterm, final. We are going to get into how to change that later. For now, we will stick to the basic plots.
You can also combine different types of plots and layer them on top of each other. For example, maybe we like the overview that a violin plot gives, but still want to see the actual numbers.
ggplot(bar, aes(x=time,y=minutes))+ # I removed the fill argument here, just so there is not too much going on and I can more clearly demonstrate my point
  geom_violin(alpha=0.8)+ #this is the first appearance of the alpha (opacity) argument; I set it to 0.8 (i.e. 80% opaque, slightly see-through) so we can better see the impact of the order in which we specify the geoms
  geom_jitter(width = 0.15)# geom_jitter() is very similar to geom_point(), except it adds a small random offset so points are not drawn directly on top of each other. The width is just how big of an area I want them to be jittered over
Let's look at the impact of the order in which we called the layers. In the above example, we called geom_violin() before geom_jitter(); let's try it the other way around.
ggplot(bar, aes(x=time,y=minutes))+
geom_jitter(width = 0.15)+
geom_violin(alpha=0.8)
Because I made the violins semi-transparent (that's what the alpha is for), we can see that the violins were placed on top of the points, instead of the points being on top of the violins.
This is the same plot we made above; we haven't added any customizations. Let's start by just renaming the x and y axes. By default, they just take the name of the column the data comes from.
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")
Themes are a quick way to change a lot of the visual aspects of your plot at one time. There are a lot of different themes you can use when making your plots. You can see a description of them all here, and I'll show you some examples below.
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
theme_bw()#this is the "black and white" theme
We can see that the background is no longer grey, and the gridlines are light grey.
Now I’m going to show a bunch of different themes, and I’m also going to add a title to each of them so that we know which plot is which. Notice here that all the plots are assigned to objects, so they are not going to automatically populate.
scatter_bw=ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
ggtitle(("theme_bw"))+
theme_bw()+
theme(legend.position = "none")
scatter_light=ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
ggtitle("theme_classic")+
theme_classic()+
theme(legend.position = "none")
scatter_dark=ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
ggtitle("theme_dark")+
theme_dark()+
theme(legend.position = "none")
scatter_minimal=ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
ggtitle("theme_minimal")+
theme_minimal()+
theme(legend.position = "none")
To compare all these at once, I am now going to show you a library called patchwork.
patchwork is a lovely package that allows you to very simply arrange your plots in whatever manner you like. See the documentation here.
You can see an example of how you would use it below. You'll notice that I used the objects that each plot was assigned to.
(scatter_bw|scatter_light)/(scatter_dark|scatter_minimal)
Now that we've seen patchwork, we'll go back to customizing things. Let's say we want to make more specific changes, like removing the gridlines in the background.
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
ggtitle("Removing All Gridlines")+
theme_bw()+
theme(panel.grid = element_blank())# we have a general theme function where we can specify that we don't want any grids
We can also be more specific, and only remove the vertical lines by specifying the "x" grid lines. The same could be done for the horizontal ones by specifying "y".
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
ggtitle("Removing All Vertical Lines")+
theme_bw()+
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x=element_blank())
### Modifying the legend
There are a lot of ways you can change the legend. You will probably need to do some research for your specific use case. There are options in theme(), options in guides(), and a lot of others.
This is normally where I go when I first start troubleshooting my legend problems.
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
ggtitle("Renaming the Legends")+
guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+ # I have added a guide() function to specify the titles of both of my legends. As default they just do the column name.
theme_bw()+
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
There are A LOT of different ways you can colour your plots, so I advise you to explore this when the time comes. But there are a lot of pre-set colour schemes that look great and are easy to implement; you can find resources on them here.
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
scale_colour_viridis_d()+ #this is where I specified the colours I wanted
ggtitle("Scatterplot with Viridis Colouring")+
guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
theme_bw()+
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
scale_colour_brewer(palette = "Paired")+ # I have changed from viridis to another palette, and selected which palette I wanted from a list.
ggtitle("Scatterplot with Colour Brewer")+
guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
theme_bw()+
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
If you’re going to get into manual colouring look here to figure out what kind of colours are available.
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
scale_colour_manual(values=c("orangered3","slateblue","lightseagreen","orchid3","sienna2","dodgerblue"))+
ggtitle("Scatterplot with Manual Colours")+
guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
theme_bw()+
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
If you are hoping to apply a specific colour to a specific category you can do it like this.
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
scale_colour_manual(values=c(India="orangered3",Italy="slateblue",Japan="lightseagreen",Netherlands="orchid3",Sweden="sienna2",UK="dodgerblue"))+
ggtitle("Scatterplot with Manual Colours-Assigned to Continent")+
guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
theme_bw()+
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
Sometimes when you have a lot of data it can be useful to facet your plots. This is really easy, as you can see below. The different options for facets can be seen here.
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
theme_bw()+
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())+
facet_grid(~Country)
You can see that the facet made our x-axis labels difficult to read. Luckily, this is one of the many elements we can fix in theme().
ggplot(scatter,aes(x=Lib_Budget,y=Collection_Size))+
geom_point(aes(colour=Country,size=Enrollment))+
xlab("Library Budget")+
ylab("Collection Size")+
guides(colour=guide_legend(title="Country"),size=guide_legend(title="Enrollment"))+
theme_bw()+
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle=45,vjust=0.5))+
facet_grid(~Country)
Remember up above with our bar plot we saw that the categories for the time variable weren't in the right order. We are going to reset the factor levels and then it should plot properly.
We’ll start with a basic plot of the old factor levels
bar_ofl=ggplot(bar, aes(x=time,y=minutes,fill=year))+geom_bar(position="dodge",stat="identity")
Now let's reset the factor levels, do the same plot again, and then compare the two.
bar$time <- factor(bar$time,levels = c("week2","midterm","final"))
bar_rfl=ggplot(bar, aes(x=time,y=minutes,fill=year))+geom_bar(position="dodge",stat="identity")
(bar_ofl|bar_rfl)
class() #check what type of data a column or object is
levels() #see the levels of a factor
table(df$col) #count how many times each value appears in a column
R also allows you to make animated visualizations. For example, we can look at the number of library visitors over a period of time, using synthetic data created by Melissa and Chantal for the Power BI session.
This was another opportunity for me to explore whether AI could code. I had tried a few years ago and it wasn't perfect, so I tried again with this application. The code ChatGPT gave me looked OK, but it didn't actually produce the image; I still had to troubleshoot, and it was not an issue I would have been able to fix quickly if I was a beginner.
library(gganimate) #this package needs to be installed and loaded for transition_time(), ease_aes() and animate() below
data=read.csv("data/synthetic_gatecounts.csv")%>%
dplyr::select(name,date,visitors)%>%
mutate(Timestamp=ymd_hm(date))%>% #make sure the time is interpreted in the correct format
mutate(Date=date(Timestamp))%>% #extract the date
dplyr::select(name,Date,visitors)
# Calculate cumulative visitors
data_cumulative <- data %>%
arrange(name, Date) %>%
group_by(name) %>%
mutate(cumulative_visitors = cumsum(visitors)) %>%
ungroup()
# Create animated bar plot
plot <- ggplot(data_cumulative, aes(x = reorder(name, -cumulative_visitors),
y = cumulative_visitors,
fill = name)) +
geom_col(show.legend = FALSE) +
labs(title = 'Cumulative Visitors by Place',
subtitle = 'Date: {frame_time}',
x = 'Name',
y = 'Cumulative Visitors') +
scale_y_continuous(labels = scales::comma) +
transition_time(Date) +
ease_aes('linear') +
theme_minimal(base_size = 16)
# Animate
animate(plot, nframes = 100, fps = 10, width = 800, height = 600)
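If the animation renders and you want to keep it, gganimate's anim_save() writes the most recent animation to a file; the filename here is just an example:
#anim_save("cumulative_visitors.gif", animation = last_animation()) #commented out so nothing is written unless you want it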