Several staff have commented that the UG teaching materials can be time-consuming to use as a reference when supporting project students.
This page is intended as a quick reference when helping students with their data handling and analysis.
Included on this page are short examples of everything we teach students in stages 1 to 4, including the project option workshops, plus some extensions/additions which you may find useful to introduce to project students. Extensions (i.e. parts not taught in the programme) are clearly marked.
Before you start
Most of the code here requires loading the tidyverse. If your student has a could not find function "x" error it's almost certainly that. For example, in a fresh session this fails:
read_csv('x')
You need to load tidyverse first:
library(tidyverse)
Once tidyverse is loaded, the code above would instead produce the error: Error: 'x' does not exist in current working directory ('/Users/ben/dev/discourse/r-guides').
Our students often struggle with path-related issues, so make sure they are either:
- Using an RStudio project and keeping their data in the same directory as their script, or
- Using an R Markdown (.Rmd) file, again with their data in the same directory as the file.
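#extension
If path problems keep recurring, the here package (not taught in the programme) can help: it builds paths relative to the project root, so scripts work whatever the working directory happens to be. A minimal sketch; 'data' and 'filename.csv' are placeholder names, not files from the teaching materials:

```r
library(tidyverse)
# the here package resolves paths relative to the RStudio project root
library(here)

# 'data/filename.csv' is a placeholder path -- substitute the student's real file
mydata <- read_csv(here('data', 'filename.csv'))
```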
Code review
When reviewing students' scripts ensure:
- They load tidyverse (and BayesFactor) at the top of the file
- They don't load packages again further down the file (this can cause weirdness)
- They include sufficient vertical white space to make the file readable
- They break up long lines of code (e.g. > 100 characters) with a line break, ideally after every pipe (%>%)
Reading data
Read CSV
Read a CSV file stored in the current working directory:
mydata <- read_csv('filename.csv')
Download a file from the internet
#extension
This downloads supplementary data from a paper in PlosOne (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0212482#sec016)
download.file('https://ndownloader.figstatic.com/files/14532419', 'egger.sav')
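If students re-run their scripts often, it can be worth skipping the download when a local copy already exists. A sketch (not part of the taught material); the mode = 'wb' argument writes the file as binary, which avoids corrupting .sav or .xlsx files on Windows:

```r
# only download if we don't already have a local copy
if (!file.exists('egger.sav')) {
  # mode = 'wb' writes the file as binary, which matters on Windows
  download.file('https://ndownloader.figstatic.com/files/14532419',
                'egger.sav', mode = 'wb')
}
```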
Read all the csv files stored in a directory:
If you have a folder full of data files you can list them like so:
list.files('multifile-data/')
## [1] "participant1.csv" "participant10.csv" "participant2.csv" "participant3.csv" "participant4.csv"
## [6] "participant5.csv" "participant6.csv" "participant7.csv" "participant8.csv" "participant9.csv"
You can read all the csv files like so:
combined.data <- tibble(file = list.files('multifile-data/')) %>%
  rowwise() %>%
  do({
    fn <- .$file
    read_csv(paste0("multifile-data/", fn)) %>%
      mutate(file = fn)
  })
## Parsed with column specification:
## cols(
## participant = col_double(),
## rt = col_double(),
## condition = col_double()
## )
## (the same message is repeated for each of the 10 files)
# by-participant summary
combined.data %>%
  group_by(participant) %>%
  summarise(mean(rt), sd(rt))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 3
## participant `mean(rt)` `sd(rt)`
## <dbl> <dbl> <dbl>
## 1 1 250. 198.
## 2 2 243. 189.
## 3 3 250. 200.
## 4 4 244. 190.
## 5 5 251. 200.
## 6 6 249. 204.
## 7 7 243. 197.
## 8 8 257. 199.
## 9 9 246. 205.
## 10 10 259. 202.
Read from SPSS, Excel
#extension
We don't yet teach students this, but the rio package has a flexible interface for importing SPSS (.sav), Excel and many other formats.
egger <- rio::import('egger.sav')

egger %>%
  select(ID:height_cm) %>%
  glimpse
## Rows: 142
## Columns: 6
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25…
## $ vpn <dbl> 3, 4, 5, 6, 7, 8, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,…
## $ condition <dbl> 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4,…
## $ age_in_years <dbl> 7.916667, 7.583333, 7.416667, 7.916667, 8.083333, 7.666667, 8.250000, 7.333333, 7.333333,…
## $ gender <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1,…
## $ height_cm <dbl> 127.0, 121.0, 119.0, 123.0, 129.0, 127.5, 122.0, 121.0, 132.0, 135.0, 130.0, 126.0, 132.0…
Reading different file types works the same way; rio picks the importer based on the file extension:
rio::import('sleep.xlsx') %>%
  glimpse
## New names:
## * `` -> ...1
## Rows: 241
## Columns: 12
## $ ...1 <chr> "1", "2", …
## $ uniqueid <chr> "60120734e…
## $ `Start time` <dttm> 2020-11-2…
## $ `Completion time` <dttm> 2020-11-2…
## $ `My sleep is affected by my study commitments` <chr> "Agree", "…
## $ `I achieve good quality sleep` <chr> "Agree", "…
## $ `My electronic device usage negatively affects my sleep` <chr> "Disagree"…
## $ `Tiredness interferes with my concentration` <chr> "Strongly …
## $ `My sleep is disturbed by external factors e.g. loud cars, housemates, lights, children...` <chr> "Strongly …
## $ `I often achieve eight hours of sleep` <chr> "Disagree"…
## $ `I regularly stay up past 11pm` <chr> "Disagree"…
## $ `My other personal commitments affect my sleep` <chr> "Disagree"…
If you are reading an existing SPSS file which has factor (category) labels that you would like to preserve, you can also use the haven package. The code below converts all labelled category data to factors, which avoids some errors later when pivoting or mutating:
egger.labelled <- haven::read_sav('egger.sav') %>%
  mutate_if(haven::is.labelled, factor)

egger.labelled %>%
  select(ID:height_cm) %>%
  glimpse
## Rows: 142
## Columns: 6
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25…
## $ vpn <dbl> 3, 4, 5, 6, 7, 8, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,…
## $ condition <fct> 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4,…
## $ age_in_years <dbl> 7.916667, 7.583333, 7.416667, 7.916667, 8.083333, 7.666667, 8.250000, 7.333333, 7.333333,…
## $ gender <fct> 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1,…
## $ height_cm <dbl> 127.0, 121.0, 119.0, 123.0, 129.0, 127.5, 122.0, 121.0, 132.0, 135.0, 130.0, 126.0, 132.0…
Visually there is not much difference, but the variables of type <dbl+lbl> (haven's labelled data type) are now <fct> (boring old R factors). If you have any other trouble with SPSS labels, read the haven documentation.
Fix file path errors
read_csv('doesntexist.csv')
If the file you want to read in can't be found, check the current working directory (also displayed in the error message), or list the files in the directory to check what is actually there:
getwd()
## [1] "/Users/ben/dev/cheatR"
list.files()
## [1] "_build.R" "cheatR.Rproj"
## [3] "complete-cheat.rmd" "docs"
## [5] "egger.sav" "index.rmd"
## [7] "lifesavr_cheatsheet.rmd" "multifile-data"
## [9] "README.md" "references.bib"
## [11] "sleep.xlsx" "staff-cheatsheet_files"
## [13] "staff-project-students-cheatsheet.rmd"
Renaming columns
Use rename for simple cases:
mtcars %>%
  select(cyl, mpg) %>%
  rename(MPG = mpg, Cylinders = cyl) %>%
  head()
## Cylinders MPG
## Mazda RX4 6 21.0
## Mazda RX4 Wag 6 21.0
## Datsun 710 4 22.8
## Hornet 4 Drive 6 21.4
## Hornet Sportabout 8 18.7
## Valiant 6 18.1
Use back-ticks to wrap the new name if you want to include spaces:
mtcars %>%
  select(mpg) %>%
  rename(`Miles per gallon` = mpg) %>%
  head()
## Miles per gallon
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
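#extension
For renaming many columns at once, recent versions of dplyr (1.0+) also provide rename_with, which applies a function to the column names. A short sketch (not taught in the programme):

```r
library(tidyverse)

mtcars %>%
  select(mpg, cyl) %>%
  # apply a function (here toupper) to every selected column name
  rename_with(toupper) %>%
  head()
```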
Tidying messy names
#extension
The janitor package has a really great function to tidy messy names as they come from online survey systems. This is a real example:
rio::import('sleep.xlsx') %>%
  glimpse
## New names:
## * `` -> ...1
## Rows: 241
## Columns: 12
## $ ...1 <chr> "1", "2", …
## $ uniqueid <chr> "60120734e…
## $ `Start time` <dttm> 2020-11-2…
## $ `Completion time` <dttm> 2020-11-2…
## $ `My sleep is affected by my study commitments` <chr> "Agree", "…
## $ `I achieve good quality sleep` <chr> "Agree", "…
## $ `My electronic device usage negatively affects my sleep` <chr> "Disagree"…
## $ `Tiredness interferes with my concentration` <chr> "Strongly …
## $ `My sleep is disturbed by external factors e.g. loud cars, housemates, lights, children...` <chr> "Strongly …
## $ `I often achieve eight hours of sleep` <chr> "Disagree"…
## $ `I regularly stay up past 11pm` <chr> "Disagree"…
## $ `My other personal commitments affect my sleep` <chr> "Disagree"…
rio::import('sleep.xlsx') %>%
  janitor::clean_names() %>%
  glimpse
## New names:
## * `` -> ...1
## Rows: 241
## Columns: 12
## $ x1 <chr> "1", "2", "3", "4",…
## $ uniqueid <chr> "60120734ec", "d0a9…
## $ start_time <dttm> 2020-11-27 14:21:2…
## $ completion_time <dttm> 2020-11-27 14:22:4…
## $ my_sleep_is_affected_by_my_study_commitments <chr> "Agree", "Somewhat …
## $ i_achieve_good_quality_sleep <chr> "Agree", "Agree", "…
## $ my_electronic_device_usage_negatively_affects_my_sleep <chr> "Disagree", "Strong…
## $ tiredness_interferes_with_my_concentration <chr> "Strongly agree", "…
## $ my_sleep_is_disturbed_by_external_factors_e_g_loud_cars_housemates_lights_children <chr> "Strongly agree", "…
## $ i_often_achieve_eight_hours_of_sleep <chr> "Disagree", "Somewh…
## $ i_regularly_stay_up_past_11pm <chr> "Disagree", "Disagr…
## $ my_other_personal_commitments_affect_my_sleep <chr> "Disagree", "Neithe…
This makes working with the dataset much easier:
- Everything is lower case, so no guessing whether you need to capitalise names
- No spaces (so no escaping of spaces needed with back ticks)
- No special characters which can mess things up
Renaming programmatically
#extension
You can do similar things yourself using string replacements and the set_names function.
# here the dot refers to the data coming from the pipe
rio::import('sleep.xlsx') %>% names(.)
## New names:
## * `` -> ...1
## [1] "...1"
## [2] "uniqueid"
## [3] "Start time"
## [4] "Completion time"
## [5] "My sleep is affected by my study commitments"
## [6] "I achieve good quality sleep"
## [7] "My electronic device usage negatively affects my sleep"
## [8] "Tiredness interferes with my concentration"
## [9] "My sleep is disturbed by external factors e.g. loud cars, housemates, lights, children..."
## [10] "I often achieve eight hours of sleep"
## [11] "I regularly stay up past 11pm"
## [12] "My other personal commitments affect my sleep"
So this can be more flexible, but it still needs some more work to remove special characters like parentheses:
rio::import('sleep.xlsx') %>%
  set_names(tolower(str_replace_all(names(.), " ", "_"))) %>%
  glimpse
## New names:
## * `` -> ...1
## Rows: 241
## Columns: 12
## $ ...1 <chr> "1", "2", …
## $ uniqueid <chr> "60120734e…
## $ start_time <dttm> 2020-11-2…
## $ completion_time <dttm> 2020-11-2…
## $ my_sleep_is_affected_by_my_study_commitments <chr> "Agree", "…
## $ i_achieve_good_quality_sleep <chr> "Agree", "…
## $ my_electronic_device_usage_negatively_affects_my_sleep <chr> "Disagree"…
## $ tiredness_interferes_with_my_concentration <chr> "Strongly …
## $ `my_sleep_is_disturbed_by_external_factors_e.g._loud_cars,_housemates,_lights,_children...` <chr> "Strongly …
## $ i_often_achieve_eight_hours_of_sleep <chr> "Disagree"…
## $ i_regularly_stay_up_past_11pm <chr> "Disagree"…
## $ my_other_personal_commitments_affect_my_sleep <chr> "Disagree"…
Tables/descriptive statistics
In stages 1 and 2 students learn how to use group_by and summarise to make tables of descriptive statistics, and the better-tables worksheet recommends the same approach:
mtcars %>%
  group_by(cyl) %>%
  summarise(MPG = mean(mpg), SD = sd(mpg))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
## cyl MPG SD
## <dbl> <dbl> <dbl>
## 1 4 26.7 4.51
## 2 6 19.7 1.45
## 3 8 15.1 2.56
Summarising multiple variables
#extension
It's also possible to use pivot_longer to simplify summarising multiple variables. We don't show this technique explicitly, but we do teach pivot_longer and pivot_wider, so students could combine the ideas for themselves:
mtcars %>%
  select(cyl, mpg, wt, disp, drat) %>%
  pivot_longer(-cyl) %>%
  group_by(name) %>%
  summarise(M = mean(value), SD = sd(value), Med = median(value), IQR = IQR(value))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 5
## name M SD Med IQR
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 disp 231. 124. 196. 205.
## 2 drat 3.60 0.535 3.70 0.840
## 3 mpg 20.1 6.03 19.2 7.38
## 4 wt 3.22 0.978 3.32 1.03
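#extension
In dplyr 1.0 and later, across() gives a similar multi-variable summary without reshaping; a sketch (we don't teach this):

```r
library(tidyverse)

# summarise several columns with several functions in one call;
# output columns are named like mpg_M, mpg_SD, etc.
mtcars %>%
  summarise(across(c(mpg, wt, disp, drat), list(M = mean, SD = sd)))
```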
Comparing variables in tables
#extension
Using pivot_longer and pivot_wider you can compare variables side by side in a table:
mtcars %>%
  select(cyl, am, mpg, wt) %>%
  # note: to exclude multiple columns (cyl and am) add a hyphen in front of each name
  pivot_longer(c(-cyl, -am)) %>%
  # recode am to have nice labels in the table below (in mtcars, 0 = automatic, 1 = manual)
  mutate(am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))) %>%
  group_by(am, name) %>%
  summarise(M = mean(value), SD = sd(value)) %>%
  pivot_wider(names_from = am, values_from = c(M, SD)) %>%
  # replace underscores in column names with spaces
  set_names(str_replace(names(.), "_", " "))
## `summarise()` regrouping output by 'am' (override with `.groups` argument)
## # A tibble: 2 x 5
##   name  `M Automatic` `M Manual` `SD Automatic` `SD Manual`
##   <chr>         <dbl>      <dbl>          <dbl>       <dbl>
## 1 mpg           17.1       24.4           3.83        6.17
## 2 wt             3.77       2.41          0.777       0.617
See reshaping data below for a guide to pivot_longer and pivot_wider. See also the various techniques for renaming columns above.
Reshaping data
We cover reshaping in the within-subject differences worksheet, but only briefly. For MSc students we cover it here: [TODO ADD LINK]
There are two functions to reshape data:
pivot_longer
pivot_wider
Reshaping to long form
Select only what you need
It's best to select only the data you need before reshaping. Although not strictly necessary, this:
- makes it easier to check the output is as desired
- avoids an error where columns are of different types

If you try to pivot_longer a mix of numeric and character (text) data it will fail. If you want to keep text columns, exclude them from the pivoting (see below for how).
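#extension
If you genuinely do need to pivot a mix of numeric and text columns, one workaround (not taught) is to coerce everything to character during the pivot using the values_transform argument (tidyr 1.1+). A sketch with made-up data:

```r
library(tidyverse)

# a small made-up table with one numeric and one character column
mixed <- tibble(id = 1:2, score = c(10, 20), group = c("a", "b"))

# coercing all values to character lets pivot_longer combine the two types
mixed %>%
  pivot_longer(cols = -id, values_transform = list(value = as.character))
```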
Simplest case: make everything long
mtcars %>%
  select(mpg, wt) %>%
  pivot_longer(cols = everything())
## # A tibble: 64 x 2
## name value
## <chr> <dbl>
## 1 mpg 21
## 2 wt 2.62
## 3 mpg 21
## 4 wt 2.88
## 5 mpg 22.8
## 6 wt 2.32
## 7 mpg 21.4
## 8 wt 3.22
## 9 mpg 18.7
## 10 wt 3.44
## # … with 54 more rows
In the code above, writing cols= is optional, but it makes things explicit and should be encouraged.
Exclude some variables but keep them as index columns
Imagine we have one column which we'd like to keep as an index. We can exclude it from pivoting by writing a hyphen in front of the variable name:
mtcars %>%
  select(cyl, mpg, wt) %>%
  pivot_longer(cols = -cyl) %>%
  sample_n(6)
## # A tibble: 6 x 3
## cyl name value
## <dbl> <chr> <dbl>
## 1 8 mpg 18.7
## 2 4 wt 1.62
## 3 6 mpg 21.4
## 4 8 wt 3.84
## 5 4 mpg 24.4
## 6 8 wt 3.17
If we have two columns to exclude we need to specify them in a vector using c():
mtcars %>%
  select(cyl, am, mpg, wt) %>%
  pivot_longer(cols = c(-cyl, -am)) %>%
  sample_n(6)
## # A tibble: 6 x 4
## cyl am name value
## <dbl> <dbl> <chr> <dbl>
## 1 8 1 mpg 15
## 2 8 0 mpg 18.7
## 3 4 1 mpg 32.4
## 4 4 0 mpg 22.8
## 5 8 0 wt 3.52
## 6 8 0 wt 3.44
You can also do this the other way, by specifying only which columns you do want to pivot:
mtcars %>%
  select(cyl, am, mpg, wt) %>%
  pivot_longer(cols = c(mpg, wt)) %>%
  sample_n(6)
## # A tibble: 6 x 4
## cyl am name value
## <dbl> <dbl> <chr> <dbl>
## 1 4 1 wt 1.84
## 2 8 0 mpg 16.4
## 3 4 0 wt 3.15
## 4 8 1 mpg 15.8
## 5 8 0 wt 3.73
## 6 8 0 mpg 15.2
Reshape to wider form
Use pivot_wider. With a simple table like this:
mpg.summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(M = mean(mpg))
## `summarise()` ungrouping output (override with `.groups` argument)
mpg.summary
## # A tibble: 3 x 2
## cyl M
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
We can spread the data wide to enable comparison (and the table is more in line with expected APA format):
mpg.summary %>%
  pivot_wider(names_from = cyl, values_from = M)
## # A tibble: 1 x 3
## `4` `6` `8`
## <dbl> <dbl> <dbl>
## 1 26.7 19.7 15.1
This is especially useful when you have more rows in the table:
diamonds %>%
  # this is a quick way of renaming color to be capitalised
  group_by(cut, Color = color) %>%
  summarise(M = mean(price)) %>%
  pivot_wider(names_from = cut, values_from = M) %>%
  pander::pander()
## `summarise()` regrouping output by 'cut' (override with `.groups` argument)
Color | Fair | Good | Very Good | Premium | Ideal |
---|---|---|---|---|---|
D | 4291 | 3405 | 3470 | 3631 | 2629 |
E | 3682 | 3424 | 3215 | 3539 | 2598 |
F | 3827 | 3496 | 3779 | 4325 | 3375 |
G | 4239 | 4123 | 3873 | 4501 | 3721 |
H | 5136 | 4276 | 4535 | 5217 | 3889 |
I | 4685 | 5079 | 5256 | 5946 | 4452 |
J | 4976 | 4574 | 5104 | 6295 | 4918 |
Data Visualisation
In stage 1 and 2 students are taught scatter, density and boxplots:
Scatter plots
iris %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()
#extension
Adding jitter to a plot can be useful. We don't teach it, but I recommend using geom_jitter rather than geom_point. This is especially helpful with survey data, which is often not truly continuous; adding jitter reveals points which overlap on scale boundaries:
iris %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
  geom_jitter() +
  ggtitle("Geom jitter reveals overlapping points in non-scale data.")
Color dimensions
With a categorical color dimension:
diamonds %>%
  ggplot(aes(x = carat, y = price, color = clarity)) +
  geom_point()
Or a continuous color dimension:
iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Petal.Length)) +
  geom_jitter()
Smoothed lines
We show students how to add a smoothed line:
iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter() +
  geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And also how to use a linear fit for the lines:
iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter() +
  geom_smooth(se = F, method = lm)
## `geom_smooth()` using formula 'y ~ x'
Facets/panels
Create a grid of plots using a single faceting variable:
diamonds %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point() +
  facet_wrap(~clarity)
#extension
Or a two-way grid of facets using two variables:
diamonds %>%
  ggplot(aes(x = carat, y = price)) +
  # alpha=.1 makes points partly transparent, which makes it easier to see
  # where most of the density is in large datasets. The term alpha refers to
  # the alpha channel in computer graphics, which controls transparency.
  geom_point(alpha = .1) +
  facet_grid(cut ~ clarity)
Combining two plots
#extension
You can use the cowplot package if you want to combine two plots and can't use faceting.
plot1 <- diamonds %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(alpha = .1) +
  facet_wrap(~clarity)

plot2 <- iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter(alpha = .7) +
  geom_smooth(se = F, method = lm)

cowplot::plot_grid(plot1, plot2)
## `geom_smooth()` using formula 'y ~ x'
Boxplots (and similar)
Where the x axis is a categorical variable we teach students to use a boxplot (in preference to bar plots):
diamonds %>%
  ggplot(aes(x = cut, y = log(price))) +
  geom_boxplot()
#extension
If you don’t like boxplots you can use a point-range plot. Bar charts are not recommended because of the ‘within-bar bias’, whereby readers perceive points inside the bar (e.g. the bottom of an error bar) as more likely than those outside the bar (e.g. the top of an error bar), even though this is not the case.
diamonds %>%
  ggplot(aes(x = cut, y = price)) +
  # set the error bars to the 95% CI (the default is the SE)
  # could also use the median_hilow function here
  stat_summary(fun.data = mean_cl_normal)
#extension
Combining boxplots and facets can be useful for experimental data:
warpbreaks %>%
  ggplot(aes(wool, breaks)) +
  geom_boxplot() +
  # note: the dot here means 'no horizontal facet'
  facet_grid(. ~ tension)
#extension
Some people like combining violin plots with boxplots. Day 9 in these data is a good example of why:
lme4::sleepstudy %>%
  ggplot(aes(factor(Days), Reaction)) +
  geom_violin() +
  geom_boxplot(width = .2)
Labelling axes
#extension
- xlab() and ylab() change the x and y axis labels
- Color labels can also be changed
- Labels in facets are fiddly, but also possible
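As a minimal sketch using built-in data (the label text here is made up for illustration):

```r
library(tidyverse)

iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter() +
  xlab("Sepal length (cm)") +
  ylab("Sepal width (cm)") +
  # labs() can rename the colour legend (and other aesthetics) in one call
  labs(color = "Iris species")
```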
Statistics
Students are taught Bayesian techniques first, and supervisors should tend to encourage this too. Start by loading the BayesFactor package:
library(BayesFactor)
Model ‘formula’
Most model functions in R accept a formula which describes the outcome and predictor variables (we avoid using dependent and independent variable nomenclature because researchers frequently misuse it when some or all variables are observed rather than manipulated).
Some functions, like t-test, have an alternative method for wide-format data, but this is relatively rare and it is better for students to get used to using model formulae.
A formula always has three parts:
y ~ x
- The 'left hand side', y, which is the outcome
- The tilde symbol, ~, which means "is predicted by"
- The 'right hand side', x, which is the predictor(s)
You can add multiple additive predictors using the +
symbol:
y ~ x1 + x2 ...
Or add interactions between variables with the *
:
y ~ x1 * x2 * x3
If you write it this way all 2nd and 3rd-level interactions would be included.
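You can check this expansion for yourself: y ~ x1 * x2 is shorthand for y ~ x1 + x2 + x1:x2, so the two forms fit identical models. For example, with built-in data:

```r
# '*' expands to all main effects plus the interaction
m1 <- lm(mpg ~ wt * hp, data = mtcars)
m2 <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# identical parameter names confirm the expansion
names(coef(m1))
names(coef(m2))
```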
Some more complex models require more than one variable on the left hand side, or can combine more than once formula (e.g. for SEM) but we don’t teach these on any of the programmes.
Compare two groups
Two independent samples:
Bayesian
ttestBF(formula= mpg ~ am, data=mtcars)
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 86.58973 ±0%
##
## Against denominator:
## Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS
Frequentist
t.test(mpg ~ am, data=mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
Compare two paired samples
#extension
Note: students aren't taught the paired-samples t-test; instead they are shown how to use a within-subjects Anova with 2 levels of a single factor. This is statistically equivalent and will give you the same answer. See below for how to do it.
If you really want to do a t-test, we need to make an example dataset in the right format first (if you know of a built-in example in R, let me know):
paired.data <- lme4::sleepstudy %>%
  filter(Days == 1 | Days == 9) %>%
  mutate(Days = paste0('day', Days)) %>%
  pivot_wider(names_from = Days, values_from = Reaction)
paired.data
## # A tibble: 18 x 3
## Subject day1 day9
## <fct> <dbl> <dbl>
## 1 308 259. 466.
## 2 309 205. 237.
## 3 310 194. 248.
## 4 330 300. 354.
## 5 331 285 372.
## 6 332 243. 254.
## 7 333 290. 362.
## 8 334 276. 377.
## 9 335 274. 237.
## 10 337 314. 459.
## 11 349 230. 352.
## 12 350 243. 389.
## 13 351 300. 348.
## 14 352 298. 389.
## 15 369 268. 367.
## 16 370 235. 372.
## 17 371 272. 369.
## 18 372 273. 364.
Bayesian
with(paired.data, ttestBF(day1, day9, paired=TRUE))
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 3966.803 ±0%
##
## Against denominator:
## Null, mu = 0
## ---
## Bayes factor type: BFoneSample, JZS
Frequentist
with(paired.data, t.test(day1, day9, paired=TRUE))
##
## Paired t-test
##
## data: day1 and day9
## t = -6.5205, df = 17, p-value = 5.236e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -114.29724 -58.41369
## sample estimates:
## mean of the differences
## -86.35547
More than 2 groups, everything between subjects
Optionally with interactions too:
Bayesian
wool.anova <- anovaBF(breaks ~ wool * tension, data=warpbreaks)
wool.anova
## Bayes factor analysis
## --------------
## [1] wool : 0.8184672 ±0.01%
## [2] tension : 21.451 ±0.01%
## [3] wool + tension : 22.42743 ±1.19%
## [4] wool + tension + wool:tension : 69.26559 ±1.71%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS
- Lines 1 and 2 of the output can be read as 'main effects'.
- Line 3 is the combined effect of wool and tension.
- Line 4 should not be interpreted on its own. Instead, compare it to line 3:
wool.anova[4] / wool.anova[3]
## Bayes factor analysis
## --------------
## [1] wool + tension + wool:tension : 3.088432 ±2.08%
##
## Against denominator:
## breaks ~ wool + tension
## ---
## Bayes factor type: BFlinearModel, JZS
This is the BF10 in favour of the interaction.
Frequentist
If you don't have interactions it won't matter, but if you do have interactions you probably want to report type 3 sums of squares.
car::Anova(lm(breaks ~ wool * tension, data=warpbreaks), type=3)
## Anova Table (Type III tests)
##
## Response: breaks
## Sum Sq Df F value Pr(>F)
## (Intercept) 17866.8 1 149.2757 2.426e-16 ***
## wool 1200.5 1 10.0301 0.0026768 **
## tension 2468.5 2 10.3121 0.0001881 ***
## wool:tension 1002.8 2 4.1891 0.0210442 *
## Residuals 5745.1 48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: although most people don't, you should probably correct for multiple comparisons after running a two-way Anova. From Cramer et al. (2016):
Many psychologists do not realize that exploratory use of the popular multiway analysis of variance harbors a multiple-comparison problem. In the case of two factors, three separate null hypotheses are subject to test (i.e., two main effects and one interaction). Consequently, the probability of at least one Type I error (if all null hypotheses are true) is 14 % rather than 5 %, if the three tests are independent. We explain the multiple-comparison problem and demonstrate that researchers almost never correct for it. To mitigate the problem, we describe four remedies: the omnibus F test, control of the familywise error rate, control of the false discovery rate, and preregistration of the hypotheses.
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., … & Wagenmakers, E. J. (2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic bulletin & review, 23(2), 640-647.
Gelman, Hill and Yajima (2012) explain why this isn’t a problem for the Bayesian approach above.
Repeated measures/Within subjects Anova
Bayesian
The BayesFactor package is fussy about a few things:
- Missing data in the outcome (remove non-complete rows with filter)
- Non-factor variables (convert numbers or strings to factors or ordered factors first)
sleep <- lme4::sleepstudy %>%
  mutate(Days = ordered(Days))
The BF for an effect of Days is very large. Note also that the ordering of days is ignored in this model:
anovaBF(Reaction ~ Days + Subject, whichRandom="Subject", data=sleep)
## Bayes factor analysis
## --------------
## [1] Days + Subject : 5.10746e+17 ±0.31%
##
## Against denominator:
## Reaction ~ Subject
## ---
## Bayes factor type: BFlinearModel, JZS
Adding interactions etc. works as for the between-subjects models shown above. Make sure you test the interaction properly: you probably want to compare it to the next most complicated model, not the null model.
Frequentist
For our stage 4 guide to traditional Anova see: https://www.andywills.info/rminr/more-on-anova.html (this page is quite concise already so not repeated here).
#extension
We don’t teach this, but this alternative method can be useful for calculating effect size measures:
ez::ezANOVA(sleep, dv = Reaction, wid = Subject, within = Days) %>%
  pander::pander()
ANOVA:
Effect | DFn | DFd | F | p | p<.05 | ges |
---|---|---|---|---|---|---|
Days | 9 | 153 | 18.7 | 8.995e-21 | * | 0.2927 |

Mauchly's Test for Sphericity:
Effect | W | p | p<.05 |
---|---|---|---|
Days | 0.000219 | 5.155e-08 | * |

Sphericity Corrections:
Effect | GGe | p[GG] | p[GG]<.05 | HFe | p[HF] | p[HF]<.05 |
---|---|---|---|---|---|---|
Days | 0.369 | 5.463e-09 | * | 0.4694 | 7.077e-11 | * |
See the ezANOVA help file for the horrible details of specifying more complex designs.
Regression and Ancova
Ancova: Between subjects factor with continuous covariate
Bayesian
This assumes you want to test the effect of a factor (Species) conditional on a covariate (Sepal.Width):
# calculate a base model with the covariate
h0 <- lmBF(Petal.Length ~ Sepal.Width, data=iris)
# a model of interest with the factor added
h1 <- lmBF(Petal.Length ~ Sepal.Width * Species, data=iris)
To get the BF10 for effect of the factor, conditional on the covariate:
h1 / h0
## Bayes factor analysis
## --------------
## [1] Sepal.Width * Species : 7.447974e+86 ±3.43%
##
## Against denominator:
## Petal.Length ~ Sepal.Width
## ---
## Bayes factor type: BFlinearModel, JZS
Frequentist
iris.ancova <- lm(Petal.Length ~ Sepal.Width * Species, data=iris)
car::Anova(iris.ancova)
## Anova Table (Type II tests)
##
## Response: Petal.Length
## Sum Sq Df F value Pr(>F)
## Sepal.Width 3.89 1 26.1913 9.726e-07 ***
## Species 355.76 2 1198.2894 < 2.2e-16 ***
## Sepal.Width:Species 1.96 2 6.5973 0.001814 **
## Residuals 21.38 144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#extension
If you want to see the covariate slopes and simple effects (i.e. the dummy-coded parameters):
iris.ancova %>%
  broom::tidy() %>%
  pander::pander(caption="Ancova covariate and dummy coded parameters.")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1.183 | 0.5007 | 2.362 | 0.01949 |
Sepal.Width | 0.08141 | 0.1452 | 0.5607 | 0.5759 |
Speciesversicolor | 0.752 | 0.6998 | 1.075 | 0.2844 |
Speciesvirginica | 2.328 | 0.7151 | 3.256 | 0.001411 |
Sepal.Width:Speciesversicolor | 0.758 | 0.2277 | 3.329 | 0.001108 |
Sepal.Width:Speciesvirginica | 0.6049 | 0.2241 | 2.699 | 0.007776 |
Multiple regression
Multiple predictors and hierarchical steps
Bayesian
We need to compare two models to test a single parameter. First run models with and then without the parameter you’re interested in:
h1 <- lmBF(mpg ~ wt + hp, data=mtcars)
h0 <- lmBF(mpg ~ wt, data=mtcars)
Then test if it improves the model (it does here) by dividing their Bayes Factors to get the Bayes Factor for the difference in the models:
h1 / h0
## Bayes factor analysis
## --------------
## [1] wt + hp : 17.27075 ±0.01%
##
## Against denominator:
## mpg ~ wt
## ---
## Bayes factor type: BFlinearModel, JZS
#extension
This also makes it easy to test blocks of predictors, i.e. hierarchical regression (regression in 'steps' if you're coming from SPSS):
step1 <- lmBF(rating ~ complaints, data=attitude)
step2 <- lmBF(rating ~ complaints + learning + raises + critical, data=attitude)
step2 / step1
## Bayes factor analysis
## --------------
## [1] complaints + learning + raises + critical : 0.02517264 ±0.01%
##
## Against denominator:
## rating ~ complaints
## ---
## Bayes factor type: BFlinearModel, JZS
This shows evidence against the block of three additional predictors added in step 2.
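#extension
Because Bayes Factors are ratios, you can invert the comparison to quantify the evidence *in favour of* the simpler model (not taught in the programme). Using the values above, 1/0.025 is roughly 40, i.e. strong evidence for step 1 over step 2:

```r
# BF for step1 over step2 (the reciprocal of the BF shown above)
step1 / step2
```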
Frequentist
mpg.m <- lm(mpg ~ wt + disp, data=mtcars)
Tests of individual parameters:
summary(mpg.m)
##
## Call:
## lm(formula = mpg ~ wt + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4087 -2.3243 -0.7683 1.7721 6.3484
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.96055 2.16454 16.151 4.91e-16 ***
## wt -3.35082 1.16413 -2.878 0.00743 **
## disp -0.01773 0.00919 -1.929 0.06362 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.917 on 29 degrees of freedom
## Multiple R-squared: 0.7809, Adjusted R-squared: 0.7658
## F-statistic: 51.69 on 2 and 29 DF, p-value: 2.744e-10
Or get test results as a dataframe:
# the `statistic` column is the t statistic here
broom::tidy(mpg.m)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 35.0 2.16 16.2 4.91e-16
## 2 wt -3.35 1.16 -2.88 7.43e- 3
## 3 disp -0.0177 0.00919 -1.93 6.36e- 2
#extension
Testing steps:
step1 <- lm(rating ~ complaints, data=attitude)
step2 <- lm(rating ~ complaints + learning + raises + critical, data=attitude)
anova(step2, step1)
## Analysis of Variance Table
##
## Model 1: rating ~ complaints + learning + raises + critical
## Model 2: rating ~ complaints
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 1253.2
## 2 28 1369.4 -3 -116.16 0.7724 0.5204
Interactions in regression
Bayesian
# note + vs * in formula
h1 <- lmBF(mpg ~ wt * hp, data=mtcars)
h0 <- lmBF(mpg ~ wt + hp, data=mtcars)

h1 / h0
## Bayes factor analysis
## --------------
## [1] wt * hp : 30.43604 ±0%
##
## Against denominator:
## mpg ~ wt + hp
## ---
## Bayes factor type: BFlinearModel, JZS
Frequentist
# note + vs * in formula
h1 <- lm(mpg ~ wt * hp, data=mtcars)
h0 <- lm(mpg ~ wt + hp, data=mtcars)
anova(h1, h0)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt * hp
## Model 2: mpg ~ wt + hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 129.76
## 2 29 195.05 -1 -65.286 14.088 0.0008108 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Or just:
anova(h1)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## wt 1 847.73 847.73 182.923 8.428e-14 ***
## hp 1 83.27 83.27 17.969 0.0002207 ***
## wt:hp 1 65.29 65.29 14.088 0.0008108 ***
## Residuals 28 129.76 4.63
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Quadratic/polynomial terms in regression
#extension
This looks a bit non-linear so we might want to test polynomial effects:
mtcars %>%
  ggplot(aes(wt, mpg)) +
  geom_jitter()
Bayesian
# it's annoying but we have to add the column to the df first
# because lmBF won't allow arithmetic inside its formula
mtcars.poly <- mtcars %>%
  mutate(wt_2 = poly(wt, 2))

h1 <- lmBF(mpg ~ wt + wt_2, data=mtcars.poly)
h0 <- lmBF(mpg ~ wt, data=mtcars.poly)
h1 / h0
## Bayes factor analysis
## --------------
## [1] wt + wt_2 : 9.502005 ±0.01%
##
## Against denominator:
## mpg ~ wt
## ---
## Bayes factor type: BFlinearModel, JZS
Frequentist
h1 <- lm(mpg ~ poly(wt, 2), data=mtcars)
summary(h1) # look at the p value for the second poly term
##
## Call:
## lm(formula = mpg ~ poly(wt, 2), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.483 -1.998 -0.773 1.462 6.238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0906 0.4686 42.877 < 2e-16 ***
## poly(wt, 2)1 -29.1157 2.6506 -10.985 7.52e-12 ***
## poly(wt, 2)2 8.6358 2.6506 3.258 0.00286 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.651 on 29 degrees of freedom
## Multiple R-squared: 0.8191, Adjusted R-squared: 0.8066
## F-statistic: 65.64 on 2 and 29 DF, p-value: 1.715e-11
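#extension
To visualise the quadratic fit alongside the raw data, `geom_smooth` accepts the same polynomial formula (a quick sketch, not taught in the programme):

```r
mtcars %>%
  ggplot(aes(wt, mpg)) +
  geom_point() +
  # fit and draw the same quadratic model as above
  geom_smooth(method = lm, formula = y ~ poly(x, 2))
```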
Model statistics
Extract R2 and other statistics from any linear model:
iris.ancova %>%
  broom::glance() %>%
  pivot_longer(everything())
## # A tibble: 11 x 2
## name value
## <chr> <dbl>
## 1 r.squared 9.54e- 1
## 2 adj.r.squared 9.52e- 1
## 3 sigma 3.85e- 1
## 4 statistic 5.97e+ 2
## 5 p.value 2.44e-94
## 6 df 6.00e+ 0
## 7 logLik -6.67e+ 1
## 8 AIC 1.47e+ 2
## 9 BIC 1.69e+ 2
## 10 deviance 2.14e+ 1
## 11 df.residual 1.44e+ 2
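#extension
`glance()` reports AIC and BIC, but base R can also tabulate these directly for two or more models you want to compare (lower is better). A sketch, reusing the `mtcars` models from the polynomial example:

```r
h1 <- lm(mpg ~ poly(wt, 2), data=mtcars)
h0 <- lm(mpg ~ wt, data=mtcars)

# one row per model, with df and the information criterion
AIC(h0, h1)
BIC(h0, h1)
```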
Plotting model results
Bayesian
#extension
You can plot the posterior density of model parameters. It can help to standardise first:
# standardise first
mtcars.z <- scale(mtcars) %>% as_tibble()
h1.z <- lmBF(mpg ~ wt + hp, data=mtcars.z)
## Warning: data coerced from tibble to data frame
chains <- posterior(h1.z, iterations = 5000, progress = FALSE) %>% as_tibble()
chains %>%
  pivot_longer(c(wt, hp)) %>%
  ggplot(aes(value)) +
  geom_density(aes(y=..scaled..)) +
  facet_wrap(~name, ncol=1) +
  geom_vline(xintercept = 0)
Or plot intervals:
chains %>%
  pivot_longer(c(wt, hp)) %>%
  ggplot(aes(name, value)) +
  # 95% highest density posterior interval (see Kruschke's book);
  # where these don't cross zero the parameter is 'significant'
  stat_summary(fun.data=tidybayes::mean_hdci) +
  coord_flip() +
  geom_hline(yintercept = 0)
To see more of this kind of exploration/plotting/inference with parameter estimates see also this guide (TODO based on MSc materials).
Frequentist
TODO
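One possible approach in the meantime (a sketch only, not part of the taught materials): plot frequentist estimates with 95% confidence intervals via `broom::tidy`, mirroring the Bayesian interval plot above. This reuses the standardised `mtcars.z` data from the Bayesian example:

```r
lm(mpg ~ wt + hp, data=mtcars.z) %>%
  broom::tidy(conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(term, estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  coord_flip() +
  geom_hline(yintercept = 0)
```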
Other data handling tasks
Converting continuous and categorical data
Chopping up a continuous variable into segments (e.g. for ages)
#extension
mtcars %>%
  select(wt) %>%
  mutate(wt_categ = cut(wt, 5)) %>%
  sample_n(10)
## wt wt_categ
## 1 3.440 (3.08,3.86]
## 2 5.424 (4.64,5.43]
## 3 2.320 (2.3,3.08]
## 4 3.845 (3.08,3.86]
## 5 2.780 (2.3,3.08]
## 6 3.840 (3.08,3.86]
## 7 3.460 (3.08,3.86]
## 8 2.770 (2.3,3.08]
## 9 1.513 (1.51,2.3]
## 10 3.570 (3.08,3.86]
Or with pre-set breaks:
mtcars %>%
  select(wt) %>%
  mutate(wt_categ = cut(wt, breaks = c(-1, 1, 3, 5, Inf))) %>%
  sample_n(10)
## wt wt_categ
## 1 1.935 (1,3]
## 2 1.513 (1,3]
## 3 2.465 (1,3]
## 4 3.190 (3,5]
## 5 3.440 (3,5]
## 6 3.570 (3,5]
## 7 2.320 (1,3]
## 8 2.620 (1,3]
## 9 1.615 (1,3]
## 10 3.730 (3,5]
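#extension
The `labels` argument to `cut` gives the categories readable names instead of interval notation (the labels below are just illustrative; there must be one per interval):

```r
mtcars %>%
  select(wt) %>%
  mutate(wt_categ = cut(wt,
                        breaks = c(-1, 1, 3, 5, Inf),
                        labels = c('very light', 'light', 'heavy', 'very heavy'))) %>%
  sample_n(10)
```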