Several staff have commented that the UG teaching materials can be time-consuming to use as a reference when supporting project students.
This page is intended as a quick reference when helping students with their data handling and analysis.
Included on this page are short examples of everything we teach students in stages 1 to 4, including the project option workshops, plus some extensions/additions which you may find useful to introduce to project students. Extensions (i.e. parts not taught in the programme) are clearly marked.
Before you start
Most of the code here requires loading the tidyverse. If your student has a could not find function "x" error it's almost certainly that. For example, in a fresh session this fails:
read_csv('x')
You need to load tidyverse first:
library(tidyverse)
Once tidyverse is loaded, the code above would instead produce the error: Error: 'x' does not exist in current working directory ('/Users/ben/dev/discourse/r-guides').
Our students often struggle with path-related issues, so make sure they are either:
- Using an RStudio project and keeping their data in the same directory as their script, or
- Using an R Markdown (.Rmd) file, again with their data in the same directory as the file.
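#extension
If path problems keep recurring, the here package (not taught in the programme) can help: it builds paths relative to the project root, so scripts work whatever the working directory happens to be. A minimal sketch; 'data' and 'filename.csv' are placeholder names, not files from the teaching materials:

```r
library(tidyverse)
# the here package resolves paths relative to the RStudio project root
library(here)

# 'data/filename.csv' is a placeholder path -- substitute the student's real file
mydata <- read_csv(here('data', 'filename.csv'))
```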
Code review
When reviewing students' scripts ensure:
- They load tidyverse (and BayesFactor) at the top of the file
- They don't load packages again further down the file (this can cause weirdness)
- They include sufficient vertical white space to make the file readable
- They break up long lines of code (e.g. > 100 characters) with a line break, ideally after every pipe (%>%)
Reading data
Read CSV
Read a CSV file stored in the current working directory:
mydata <- read_csv('filename.csv')
Download a file from the internet
#extension
This downloads supplementary data from a paper in PlosOne (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0212482#sec016)
download.file('https://ndownloader.figstatic.com/files/14532419', 'egger.sav')
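If students re-run their scripts often, it can be worth skipping the download when a local copy already exists. A sketch (not part of the taught material); the mode = 'wb' argument writes the file as binary, which avoids corrupting .sav or .xlsx files on Windows:

```r
# only download if we don't already have a local copy
if (!file.exists('egger.sav')) {
  # mode = 'wb' writes the file as binary, which matters on Windows
  download.file('https://ndownloader.figstatic.com/files/14532419',
                'egger.sav', mode = 'wb')
}
```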
Read all the csv files stored in a directory:
If you have a folder full of data files you can list them like so:
list.files('multifile-data/')
## [1] "participant1.csv" "participant10.csv" "participant2.csv" "participant3.csv" "participant4.csv"
## [6] "participant5.csv" "participant6.csv" "participant7.csv" "participant8.csv" "participant9.csv"
You can read all the csv files like so:
combined.data <- tibble(file = list.files('multifile-data/')) %>%
  rowwise() %>%
  do({
    fn <- .$file
    read_csv(paste0("multifile-data/", fn)) %>%
      mutate(file = fn)
  })
## Parsed with column specification:
## cols(
## participant = col_double(),
## rt = col_double(),
## condition = col_double()
## )
## (the same message is repeated for each of the 10 files)
# by-participant summary
combined.data %>%
  group_by(participant) %>%
  summarise(mean(rt), sd(rt))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 3
## participant `mean(rt)` `sd(rt)`
## <dbl> <dbl> <dbl>
## 1 1 250. 198.
## 2 2 243. 189.
## 3 3 250. 200.
## 4 4 244. 190.
## 5 5 251. 200.
## 6 6 249. 204.
## 7 7 243. 197.
## 8 8 257. 199.
## 9 9 246. 205.
## 10 10 259. 202.
Read from SPSS, Excel
#extension
We don't yet teach students this, but the rio package has a flexible interface for importing SPSS (.sav), Excel and many other formats.
egger <- rio::import('egger.sav')

egger %>%
  select(ID:height_cm) %>%
  glimpse
## Rows: 142
## Columns: 6
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25…
## $ vpn <dbl> 3, 4, 5, 6, 7, 8, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,…
## $ condition <dbl> 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4,…
## $ age_in_years <dbl> 7.916667, 7.583333, 7.416667, 7.916667, 8.083333, 7.666667, 8.250000, 7.333333, 7.333333,…
## $ gender <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1,…
## $ height_cm <dbl> 127.0, 121.0, 119.0, 123.0, 129.0, 127.5, 122.0, 121.0, 132.0, 135.0, 130.0, 126.0, 132.0…
Reading different file types works the same way; rio picks the importer based on the file extension:
rio::import('sleep.xlsx') %>%
  glimpse
## New names:
## * `` -> ...1
## Rows: 241
## Columns: 12
## $ ...1 <chr> "1", "2", …
## $ uniqueid <chr> "60120734e…
## $ `Start time` <dttm> 2020-11-2…
## $ `Completion time` <dttm> 2020-11-2…
## $ `My sleep is affected by my study commitments` <chr> "Agree", "…
## $ `I achieve good quality sleep` <chr> "Agree", "…
## $ `My electronic device usage negatively affects my sleep` <chr> "Disagree"…
## $ `Tiredness interferes with my concentration` <chr> "Strongly …
## $ `My sleep is disturbed by external factors e.g. loud cars, housemates, lights, children...` <chr> "Strongly …
## $ `I often achieve eight hours of sleep` <chr> "Disagree"…
## $ `I regularly stay up past 11pm` <chr> "Disagree"…
## $ `My other personal commitments affect my sleep` <chr> "Disagree"…
If you are reading an existing SPSS file which has factor (category) labels that you would like to preserve, you can also use the haven package. The code below converts all labelled category data to factors, which avoids some errors later when pivoting or mutating:
egger.labelled <- haven::read_sav('egger.sav') %>%
  mutate_if(haven::is.labelled, factor)

egger.labelled %>%
  select(ID:height_cm) %>%
  glimpse
## Rows: 142
## Columns: 6
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25…
## $ vpn <dbl> 3, 4, 5, 6, 7, 8, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,…
## $ condition <fct> 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4,…
## $ age_in_years <dbl> 7.916667, 7.583333, 7.416667, 7.916667, 8.083333, 7.666667, 8.250000, 7.333333, 7.333333,…
## $ gender <fct> 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1,…
## $ height_cm <dbl> 127.0, 121.0, 119.0, 123.0, 129.0, 127.5, 122.0, 121.0, 132.0, 135.0, 130.0, 126.0, 132.0…
Visually there is not much difference, but the variables of type <dbl+lbl> (haven's labelled data type) are now <fct> (boring old R factors). If you have any other trouble with SPSS labels, read the haven documentation.
Fix file path errors
read_csv('doesntexist.csv')
If the file you want to read in can't be found, check the current working directory (also displayed in the error message), or list the files in the directory to check what is actually there:
getwd()
## [1] "/Users/ben/dev/cheatR"
list.files()
## [1] "_build.R" "cheatR.Rproj"
## [3] "complete-cheat.rmd" "docs"
## [5] "egger.sav" "index.rmd"
## [7] "lifesavr_cheatsheet.rmd" "multifile-data"
## [9] "README.md" "references.bib"
## [11] "sleep.xlsx" "staff-cheatsheet_files"
## [13] "staff-project-students-cheatsheet.rmd"
Renaming columns
Use rename for simple cases:
mtcars %>%
  select(cyl, mpg) %>%
  rename(MPG = mpg, Cylinders = cyl) %>%
  head()
## Cylinders MPG
## Mazda RX4 6 21.0
## Mazda RX4 Wag 6 21.0
## Datsun 710 4 22.8
## Hornet 4 Drive 6 21.4
## Hornet Sportabout 8 18.7
## Valiant 6 18.1
Use back-ticks to wrap the new name if you want to include spaces:
mtcars %>%
  select(mpg) %>%
  rename(`Miles per gallon` = mpg) %>%
  head()
## Miles per gallon
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
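#extension
For renaming many columns at once, recent versions of dplyr (1.0+) also provide rename_with, which applies a function to the column names. A short sketch (not taught in the programme):

```r
library(tidyverse)

mtcars %>%
  select(mpg, cyl) %>%
  # apply a function (here toupper) to every selected column name
  rename_with(toupper) %>%
  head()
```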
Tidying messy names
#extension
The janitor package has a really great function to tidy messy names as they come from online survey systems. This is a real example:
rio::import('sleep.xlsx') %>%
  glimpse
## New names:
## * `` -> ...1
## Rows: 241
## Columns: 12
## $ ...1 <chr> "1", "2", …
## $ uniqueid <chr> "60120734e…
## $ `Start time` <dttm> 2020-11-2…
## $ `Completion time` <dttm> 2020-11-2…
## $ `My sleep is affected by my study commitments` <chr> "Agree", "…
## $ `I achieve good quality sleep` <chr> "Agree", "…
## $ `My electronic device usage negatively affects my sleep` <chr> "Disagree"…
## $ `Tiredness interferes with my concentration` <chr> "Strongly …
## $ `My sleep is disturbed by external factors e.g. loud cars, housemates, lights, children...` <chr> "Strongly …
## $ `I often achieve eight hours of sleep` <chr> "Disagree"…
## $ `I regularly stay up past 11pm` <chr> "Disagree"…
## $ `My other personal commitments affect my sleep` <chr> "Disagree"…
rio::import('sleep.xlsx') %>%
  janitor::clean_names() %>%
  glimpse
## New names:
## * `` -> ...1
## Rows: 241
## Columns: 12
## $ x1 <chr> "1", "2", "3", "4",…
## $ uniqueid <chr> "60120734ec", "d0a9…
## $ start_time <dttm> 2020-11-27 14:21:2…
## $ completion_time <dttm> 2020-11-27 14:22:4…
## $ my_sleep_is_affected_by_my_study_commitments <chr> "Agree", "Somewhat …
## $ i_achieve_good_quality_sleep <chr> "Agree", "Agree", "…
## $ my_electronic_device_usage_negatively_affects_my_sleep <chr> "Disagree", "Strong…
## $ tiredness_interferes_with_my_concentration <chr> "Strongly agree", "…
## $ my_sleep_is_disturbed_by_external_factors_e_g_loud_cars_housemates_lights_children <chr> "Strongly agree", "…
## $ i_often_achieve_eight_hours_of_sleep <chr> "Disagree", "Somewh…
## $ i_regularly_stay_up_past_11pm <chr> "Disagree", "Disagr…
## $ my_other_personal_commitments_affect_my_sleep <chr> "Disagree", "Neithe…
This makes working with the dataset much easier:
- Everything is lower case, so no guessing whether you need to capitalise names
- No spaces (so no escaping of spaces needed with back ticks)
- No special characters which can mess things up
Renaming programmatically
#extension
You can do similar things yourself using string replacements and the set_names function.
# here the dot refers to the data coming from the pipe
rio::import('sleep.xlsx') %>% names(.)
## New names:
## * `` -> ...1
## [1] "...1"
## [2] "uniqueid"
## [3] "Start time"
## [4] "Completion time"
## [5] "My sleep is affected by my study commitments"
## [6] "I achieve good quality sleep"
## [7] "My electronic device usage negatively affects my sleep"
## [8] "Tiredness interferes with my concentration"
## [9] "My sleep is disturbed by external factors e.g. loud cars, housemates, lights, children..."
## [10] "I often achieve eight hours of sleep"
## [11] "I regularly stay up past 11pm"
## [12] "My other personal commitments affect my sleep"
So this can be more flexible, but it still needs some more work to remove special characters like parentheses:
rio::import('sleep.xlsx') %>%
  set_names(tolower(str_replace_all(names(.), " ", "_"))) %>%
  glimpse
## New names:
## * `` -> ...1
## Rows: 241
## Columns: 12
## $ ...1 <chr> "1", "2", …
## $ uniqueid <chr> "60120734e…
## $ start_time <dttm> 2020-11-2…
## $ completion_time <dttm> 2020-11-2…
## $ my_sleep_is_affected_by_my_study_commitments <chr> "Agree", "…
## $ i_achieve_good_quality_sleep <chr> "Agree", "…
## $ my_electronic_device_usage_negatively_affects_my_sleep <chr> "Disagree"…
## $ tiredness_interferes_with_my_concentration <chr> "Strongly …
## $ `my_sleep_is_disturbed_by_external_factors_e.g._loud_cars,_housemates,_lights,_children...` <chr> "Strongly …
## $ i_often_achieve_eight_hours_of_sleep <chr> "Disagree"…
## $ i_regularly_stay_up_past_11pm <chr> "Disagree"…
## $ my_other_personal_commitments_affect_my_sleep <chr> "Disagree"…
Tables/descriptive statistics
In stages 1 and 2 students learn how to use group_by and summarise to make tables of descriptive statistics, and the better-tables worksheet recommends the same approach:
mtcars %>%
  group_by(cyl) %>%
  summarise(MPG = mean(mpg), SD = sd(mpg))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
## cyl MPG SD
## <dbl> <dbl> <dbl>
## 1 4 26.7 4.51
## 2 6 19.7 1.45
## 3 8 15.1 2.56
Summarising multiple variables
#extension
It's also possible to use pivot_longer to simplify summarising multiple variables. We don't show this technique explicitly, but we do teach pivot_longer and pivot_wider, so students could combine the ideas for themselves:
mtcars %>%
  select(cyl, mpg, wt, disp, drat) %>%
  pivot_longer(-cyl) %>%
  group_by(name) %>%
  summarise(M = mean(value), SD = sd(value), Med = median(value), IQR = IQR(value))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 5
## name M SD Med IQR
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 disp 231. 124. 196. 205.
## 2 drat 3.60 0.535 3.70 0.840
## 3 mpg 20.1 6.03 19.2 7.38
## 4 wt 3.22 0.978 3.32 1.03
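#extension
In dplyr 1.0 and later, across() gives a similar multi-variable summary without reshaping; a sketch (we don't teach this):

```r
library(tidyverse)

# summarise several columns with several functions in one call;
# output columns are named like mpg_M, mpg_SD, etc.
mtcars %>%
  summarise(across(c(mpg, wt, disp, drat), list(M = mean, SD = sd)))
```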
Comparing variables in tables
#extension
Using pivot_longer and pivot_wider you can compare variables side by side in a table:
mtcars %>%
  select(cyl, am, mpg, wt) %>%
  # note: to exclude multiple columns (cyl and am) add a hyphen in front of each name
  pivot_longer(c(-cyl, -am)) %>%
  # recode am to have nice labels in the table below (in mtcars, 0 = automatic, 1 = manual)
  mutate(am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))) %>%
  group_by(am, name) %>%
  summarise(M = mean(value), SD = sd(value)) %>%
  pivot_wider(names_from = am, values_from = c(M, SD)) %>%
  # replace underscores in column names with spaces
  set_names(str_replace(names(.), "_", " "))
## `summarise()` regrouping output by 'am' (override with `.groups` argument)
## # A tibble: 2 x 5
##   name  `M Automatic` `M Manual` `SD Automatic` `SD Manual`
##   <chr>         <dbl>      <dbl>          <dbl>       <dbl>
## 1 mpg           17.1       24.4           3.83        6.17
## 2 wt             3.77       2.41          0.777       0.617
See reshaping data below for a guide to pivot_longer and pivot_wider. See also the various techniques for renaming columns above.
Reshaping data
We cover reshaping in the within-subject differences worksheet, but only briefly. For MSc students we cover it here: [TODO ADD LINK]
There are two functions to reshape data:
pivot_longer
pivot_wider
Reshaping to long form
Select only what you need
It's best to select only the data you need before reshaping. Although not strictly necessary, this:
- makes it easier to check the output is as desired
- avoids an error where columns are of different types

If you try to pivot_longer a mix of numeric and character (text) data it will fail. If you want to keep text columns, exclude them from the pivoting (see below for how).
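#extension
If you genuinely do need to pivot a mix of numeric and text columns, one workaround (not taught) is to coerce everything to character during the pivot using the values_transform argument (tidyr 1.1+). A sketch with made-up data:

```r
library(tidyverse)

# a small made-up table with one numeric and one character column
mixed <- tibble(id = 1:2, score = c(10, 20), group = c("a", "b"))

# coercing all values to character lets pivot_longer combine the two types
mixed %>%
  pivot_longer(cols = -id, values_transform = list(value = as.character))
```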
Simplest case: make everything long
mtcars %>%
  select(mpg, wt) %>%
  pivot_longer(cols = everything())
## # A tibble: 64 x 2
## name value
## <chr> <dbl>
## 1 mpg 21
## 2 wt 2.62
## 3 mpg 21
## 4 wt 2.88
## 5 mpg 22.8
## 6 wt 2.32
## 7 mpg 21.4
## 8 wt 3.22
## 9 mpg 18.7
## 10 wt 3.44
## # … with 54 more rows
In the code above, writing cols= is optional, but it makes things explicit and should be encouraged.
Exclude some variables but keep them as index columns
Imagine we have one column which we'd like to keep as an index. We can exclude it from pivoting by writing a hyphen in front of the variable name:
mtcars %>%
  select(cyl, mpg, wt) %>%
  pivot_longer(cols = -cyl) %>%
  sample_n(6)
## # A tibble: 6 x 3
## cyl name value
## <dbl> <chr> <dbl>
## 1 8 mpg 18.7
## 2 4 wt 1.62
## 3 6 mpg 21.4
## 4 8 wt 3.84
## 5 4 mpg 24.4
## 6 8 wt 3.17
If we have two columns to exclude we need to specify them in a vector using c():
mtcars %>%
  select(cyl, am, mpg, wt) %>%
  pivot_longer(cols = c(-cyl, -am)) %>%
  sample_n(6)
## # A tibble: 6 x 4
## cyl am name value
## <dbl> <dbl> <chr> <dbl>
## 1 8 1 mpg 15
## 2 8 0 mpg 18.7
## 3 4 1 mpg 32.4
## 4 4 0 mpg 22.8
## 5 8 0 wt 3.52
## 6 8 0 wt 3.44
You can also do this the other way, by specifying only which columns you do want to pivot:
mtcars %>%
  select(cyl, am, mpg, wt) %>%
  pivot_longer(cols = c(mpg, wt)) %>%
  sample_n(6)
## # A tibble: 6 x 4
## cyl am name value
## <dbl> <dbl> <chr> <dbl>
## 1 4 1 wt 1.84
## 2 8 0 mpg 16.4
## 3 4 0 wt 3.15
## 4 8 1 mpg 15.8
## 5 8 0 wt 3.73
## 6 8 0 mpg 15.2
Reshape to wider form
Use pivot_wider. With a simple table like this:
mpg.summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(M = mean(mpg))
## `summarise()` ungrouping output (override with `.groups` argument)
mpg.summary
## # A tibble: 3 x 2
## cyl M
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
We can spread the data wide to enable comparison (and the table is more in line with expected APA format):
mpg.summary %>%
  pivot_wider(names_from = cyl, values_from = M)
## # A tibble: 1 x 3
## `4` `6` `8`
## <dbl> <dbl> <dbl>
## 1 26.7 19.7 15.1
This is especially useful when you have more rows in the table:
diamonds %>%
  # this is a quick way of renaming color to be capitalised
  group_by(cut, Color = color) %>%
  summarise(M = mean(price)) %>%
  pivot_wider(names_from = cut, values_from = M) %>%
  pander::pander()
## `summarise()` regrouping output by 'cut' (override with `.groups` argument)
Color | Fair | Good | Very Good | Premium | Ideal |
---|---|---|---|---|---|
D | 4291 | 3405 | 3470 | 3631 | 2629 |
E | 3682 | 3424 | 3215 | 3539 | 2598 |
F | 3827 | 3496 | 3779 | 4325 | 3375 |
G | 4239 | 4123 | 3873 | 4501 | 3721 |
H | 5136 | 4276 | 4535 | 5217 | 3889 |
I | 4685 | 5079 | 5256 | 5946 | 4452 |
J | 4976 | 4574 | 5104 | 6295 | 4918 |
Data Visualisation
In stage 1 and 2 students are taught scatter, density and boxplots:
Scatter plots
iris %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()
#extension
Adding jitter to a plot can be useful. We don't teach it, but I recommend using geom_jitter rather than geom_point. This is especially helpful with survey data, which is often not truly continuous; adding jitter reveals points which overlap on scale boundaries:
iris %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
  geom_jitter() +
  ggtitle("Geom jitter reveals overlapping points in non-scale data.")
Color dimensions
With a categorical color dimension:
diamonds %>%
  ggplot(aes(x = carat, y = price, color = clarity)) +
  geom_point()
Or a continuous color dimension:
iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Petal.Length)) +
  geom_jitter()
Smoothed lines
We show students how to add a smoothed line:
iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter() +
  geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And also how to use a linear fit for the lines:
iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter() +
  geom_smooth(se = F, method = lm)
## `geom_smooth()` using formula 'y ~ x'
Facets/panels
Create a grid of plots using a single faceting variable:
diamonds %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point() +
  facet_wrap(~clarity)
#extension
Or a two-way grid of facets using two variables:
diamonds %>%
  ggplot(aes(x = carat, y = price)) +
  # alpha=.1 makes points partly transparent, which makes it easier to see
  # where most of the density is in large datasets. The term alpha refers to
  # the alpha channel in computer graphics, which controls transparency.
  geom_point(alpha = .1) +
  facet_grid(cut ~ clarity)
Combining two plots
#extension
You can use the cowplot package if you want to combine two plots and can't use faceting.
plot1 <- diamonds %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(alpha = .1) +
  facet_wrap(~clarity)

plot2 <- iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter(alpha = .7) +
  geom_smooth(se = F, method = lm)

cowplot::plot_grid(plot1, plot2)
## `geom_smooth()` using formula 'y ~ x'
Boxplots (and similar)
Where the x axis is a categorical variable we teach students to use a boxplot (in preference to bar plots):
diamonds %>%
  ggplot(aes(x = cut, y = log(price))) +
  geom_boxplot()
#extension
If you don’t like boxplots you can use a point-range plot. Bar charts are not recommended because of the ‘within-bar bias’, whereby readers perceive points inside the bar (e.g. the bottom of an error bar) as more likely than those outside the bar (e.g. the top of an error bar), even though this is not the case.
diamonds %>%
  ggplot(aes(x = cut, y = price)) +
  # set the error bars to the 95% CI (the default is the SE)
  # could also use the median_hilow function here
  stat_summary(fun.data = mean_cl_normal)
#extension
Combining boxplots and facets can be useful for experimental data:
warpbreaks %>%
  ggplot(aes(wool, breaks)) +
  geom_boxplot() +
  # note: the dot here means 'no horizontal facet'
  facet_grid(. ~ tension)
#extension
Some people like combining violin plots with boxplots. Day 9 in these data is a good example of why:
lme4::sleepstudy %>%
  ggplot(aes(factor(Days), Reaction)) +
  geom_violin() +
  geom_boxplot(width = .2)
Labelling axes
#extension
- xlab() and ylab() change the x and y axis labels
- Color labels can also be changed
- Labels in facets are fiddly, but also possible
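As a minimal sketch using built-in data (the label text here is made up for illustration):

```r
library(tidyverse)

iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_jitter() +
  xlab("Sepal length (cm)") +
  ylab("Sepal width (cm)") +
  # labs() can rename the colour legend (and other aesthetics) in one call
  labs(color = "Iris species")
```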
Statistics
Students are taught Bayesian techniques first, and supervisors should tend to encourage this too. Start by loading the BayesFactor package:
library(BayesFactor)
Model ‘formula’
Most model functions in R accept a formula which describes the outcome and predictor variables (we avoid using dependent and independent variable nomenclature because researchers frequently misuse it when some or all variables are observed rather than manipulated).
Some functions, like t-test, have an alternative method for wide-format data, but this is relatively rare and it is better for students to get used to using model formulae.
A formula always has three parts:
y ~ x
- The 'left hand side', y, which is the outcome
- The tilde symbol, ~, which means "is predicted by"
- The 'right hand side', x, which is the predictor(s)
You can add multiple additive predictors using the +
symbol:
y ~ x1 + x2 ...
Or add interactions between variables with the *
:
y ~ x1 * x2 * x3
If you write it this way all 2nd and 3rd-level interactions would be included.
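You can check this expansion for yourself: y ~ x1 * x2 is shorthand for y ~ x1 + x2 + x1:x2, so the two forms fit identical models. For example, with built-in data:

```r
# '*' expands to all main effects plus the interaction
m1 <- lm(mpg ~ wt * hp, data = mtcars)
m2 <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# identical parameter names confirm the expansion
names(coef(m1))
names(coef(m2))
```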
Some more complex models require more than one variable on the left hand side, or can combine more than once formula (e.g. for SEM) but we don’t teach these on any of the programmes.
Compare two groups
Two independent samples:
Bayesian
ttestBF(formula= mpg ~ am, data=mtcars)
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 86.58973 ±0%
##
## Against denominator:
## Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS
Frequentist
t.test(mpg ~ am, data=mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
Compare two paired samples
#extension
Note: students aren't taught the paired-samples t-test; instead they are shown how to use a within-subjects Anova with 2 levels of a single factor. This is statistically equivalent and will give you the same answer. See below for how to do it.
If you really want to do a t-test, we need to make an example dataset in the right format first (if you know of a built-in example in R, let me know):
paired.data <- lme4::sleepstudy %>%
  filter(Days == 1 | Days == 9) %>%
  mutate(Days = paste0('day', Days)) %>%
  pivot_wider(names_from = Days, values_from = Reaction)
paired.data
## # A tibble: 18 x 3
## Subject day1 day9
## <fct> <dbl> <dbl>
## 1 308 259. 466.
## 2 309 205. 237.
## 3 310 194. 248.
## 4 330 300. 354.
## 5 331 285 372.
## 6 332 243. 254.
## 7 333 290. 362.
## 8 334 276. 377.
## 9 335 274. 237.
## 10 337 314. 459.
## 11 349 230. 352.
## 12 350 243. 389.
## 13 351 300. 348.
## 14 352 298. 389.
## 15 369 268. 367.
## 16 370 235. 372.
## 17 371 272. 369.
## 18 372 273. 364.
Bayesian
with(paired.data, ttestBF(day1, day9, paired=TRUE))
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 3966.803 ±0%
##
## Against denominator:
## Null, mu = 0
## ---
## Bayes factor type: BFoneSample, JZS
Frequentist
with(paired.data, t.test(day1, day9, paired=TRUE))
##
## Paired t-test
##
## data: day1 and day9
## t = -6.5205, df = 17, p-value = 5.236e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -114.29724 -58.41369
## sample estimates:
## mean of the differences
## -86.35547
More than 2 groups, everything between subjects
Optionally with interactions too:
Bayesian
wool.anova <- anovaBF(breaks ~ wool * tension, data=warpbreaks)
wool.anova
## Bayes factor analysis
## --------------
## [1] wool : 0.8184672 ±0.01%
## [2] tension : 21.451 ±0.01%
## [3] wool + tension : 22.42743 ±1.19%
## [4] wool + tension + wool:tension : 69.26559 ±1.71%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS
- Lines 1 and 2 of the output can be read as 'main effects'.
- Line 3 is the combined effect of wool and tension.
- Line 4 should not be interpreted on its own. Instead, compare it to line 3:
wool.anova[4] / wool.anova[3]
## Bayes factor analysis
## --------------
## [1] wool + tension + wool:tension : 3.088432 ±2.08%
##
## Against denominator:
## breaks ~ wool + tension
## ---
## Bayes factor type: BFlinearModel, JZS
This is the BF10 in favour of the interaction.
Frequentist
If you don't have interactions it won't matter, but if you do have interactions you probably want to report type 3 sums of squares.
car::Anova(lm(breaks ~ wool * tension, data=warpbreaks), type=3)
## Anova Table (Type III tests)
##
## Response: breaks
## Sum Sq Df F value Pr(>F)
## (Intercept) 17866.8 1 149.2757 2.426e-16 ***
## wool 1200.5 1 10.0301 0.0026768 **
## tension 2468.5 2 10.3121 0.0001881 ***
## wool:tension 1002.8 2 4.1891 0.0210442 *
## Residuals 5745.1 48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: although most people don't, you should probably correct for multiple comparisons after running a two-way Anova. From Cramer et al. (2016):
Many psychologists do not realize that exploratory use of the popular multiway analysis of variance harbors a multiple-comparison problem. In the case of two factors, three separate null hypotheses are subject to test (i.e., two main effects and one interaction). Consequently, the probability of at least one Type I error (if all null hypotheses are true) is 14 % rather than 5 %, if the three tests are independent. We explain the multiple-comparison problem and demonstrate that researchers almost never correct for it. To mitigate the problem, we describe four remedies: the omnibus F test, control of the familywise error rate, control of the false discovery rate, and preregistration of the hypotheses.
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., … & Wagenmakers, E. J. (2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic bulletin & review, 23(2), 640-647.
Gelman, Hill and Yajima (2012) explain why this isn’t a problem for the Bayesian approach above.
Repeated measures/Within subjects Anova
Bayesian
The BayesFactor package is fussy about a few things:
- Missing data in the outcome (remove non-complete rows with filter)
- Non-factor variables (convert numbers or strings to factors or ordered factors first)
sleep <- lme4::sleepstudy %>%
  mutate(Days = ordered(Days))
The BF for an effect of Days is very large. Note also that the ordering of days is ignored in this model:
anovaBF(Reaction ~ Days + Subject, whichRandom="Subject", data=sleep)
## Bayes factor analysis
## --------------
## [1] Days + Subject : 5.10746e+17 ±0.31%
##
## Against denominator:
## Reaction ~ Subject
## ---
## Bayes factor type: BFlinearModel, JZS
Adding interactions etc. works as for the between-subjects models shown above. Make sure you test the interaction properly: you probably want to compare it to the next most complicated model, not the null model.
Frequentist
For our stage 4 guide to traditional Anova see: https://www.andywills.info/rminr/more-on-anova.html (this page is quite concise already so not repeated here).
#extension
We don’t teach this, but this alternative method can be useful for calculating effect size measures:
ez::ezANOVA(sleep, dv = Reaction, wid = Subject, within = Days) %>%
  pander::pander()
ANOVA:
Effect | DFn | DFd | F | p | p<.05 | ges |
---|---|---|---|---|---|---|
Days | 9 | 153 | 18.7 | 8.995e-21 | * | 0.2927 |

Mauchly's Test for Sphericity:
Effect | W | p | p<.05 |
---|---|---|---|
Days | 0.000219 | 5.155e-08 | * |

Sphericity Corrections:
Effect | GGe | p[GG] | p[GG]<.05 | HFe | p[HF] | p[HF]<.05 |
---|---|---|---|---|---|---|
Days | 0.369 | 5.463e-09 | * | 0.4694 | 7.077e-11 | * |
See the ezANOVA help file for the horrible details of specifying more complex designs.
Regression and Ancova
Ancova: Between subjects factor with continuous covariate
Bayesian
This assumes you want to test the effect of a factor (Species) conditional on a covariate (Sepal.Width):
# calculate a base model with the covariate
h0 <- lmBF(Petal.Length ~ Sepal.Width, data=iris)
# a model of interest with the factor added
h1 <- lmBF(Petal.Length ~ Sepal.Width * Species, data=iris)
To get the BF10 for effect of the factor, conditional on the covariate:
h1 / h0
## Bayes factor analysis
## --------------
## [1] Sepal.Width * Species : 7.447974e+86 ±3.43%
##
## Against denominator:
## Petal.Length ~ Sepal.Width
## ---
## Bayes factor type: BFlinearModel, JZS
Frequentist
iris.ancova <- lm(Petal.Length ~ Sepal.Width * Species, data=iris)
car::Anova(iris.ancova)
## Anova Table (Type II tests)
##
## Response: Petal.Length
## Sum Sq Df F value Pr(>F)
## Sepal.Width 3.89 1 26.1913 9.726e-07 ***
## Species 355.76 2 1198.2894 < 2.2e-16 ***
## Sepal.Width:Species 1.96 2 6.5973 0.001814 **
## Residuals 21.38 144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#extension
If you want to see the covariate slopes and simple effects (i.e. the dummy-coded parameters):
iris.ancova %>%
  broom::tidy() %>%
  pander::pander(caption="Ancova covariate and dummy coded parameters.")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1.183 | 0.5007 | 2.362 | 0.01949 |
Sepal.Width | 0.08141 | 0.1452 | 0.5607 | 0.5759 |
Speciesversicolor | 0.752 | 0.6998 | 1.075 | 0.2844 |
Speciesvirginica | 2.328 | 0.7151 | 3.256 | 0.001411 |
Sepal.Width:Speciesversicolor | 0.758 | 0.2277 | 3.329 | 0.001108 |
Sepal.Width:Speciesvirginica | 0.6049 | 0.2241 | 2.699 | 0.007776 |
Multiple regression
Multiple predictors and hierarchical steps
Bayesian
We need to compare two models to test a single parameter. First run models with and then without the parameter you’re interested in:
h1 <- lmBF(mpg ~ wt + hp, data=mtcars)
h0 <- lmBF(mpg ~ wt, data=mtcars)
Then test if it improves the model (it does here) by dividing their Bayes Factors to get the Bayes Factor for the difference in the models:
h1 / h0
## Bayes factor analysis
## --------------
## [1] wt + hp : 17.27075 ±0.01%
##
## Against denominator:
## mpg ~ wt
## ---
## Bayes factor type: BFlinearModel, JZS
#extension
This also makes it easy to test blocks of predictors, i.e. hierarchical regression (regression in 'steps' if you're coming from SPSS):
step1 <- lmBF(rating ~ complaints, data=attitude)
step2 <- lmBF(rating ~ complaints + learning + raises + critical, data=attitude)
step2 / step1
## Bayes factor analysis
## --------------
## [1] complaints + learning + raises + critical : 0.02517264 ±0.01%
##
## Against denominator:
## rating ~ complaints
## ---
## Bayes factor type: BFlinearModel, JZS
This shows evidence against the block of three additional predictors added in step 2.
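#extension
Because Bayes Factors are ratios, you can invert the comparison to quantify the evidence *in favour of* the simpler model (not taught in the programme). Using the values above, 1/0.025 is roughly 40, i.e. strong evidence for step 1 over step 2:

```r
# BF for step1 over step2 (the reciprocal of the BF shown above)
step1 / step2
```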
Frequentist
mpg.m <- lm(mpg ~ wt + disp, data=mtcars)
Tests of individual parameters:
summary(mpg.m)
##
## Call:
## lm(formula = mpg ~ wt + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4087 -2.3243 -0.7683 1.7721 6.3484
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.96055 2.16454 16.151 4.91e-16 ***
## wt -3.35082 1.16413 -2.878 0.00743 **
## disp -0.01773 0.00919 -1.929 0.06362 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.917 on 29 degrees of freedom
## Multiple R-squared: 0.7809, Adjusted R-squared: 0.7658
## F-statistic: 51.69 on 2 and 29 DF, p-value: 2.744e-10
Or get test results as a dataframe:
# the `statistic` column is the t statistic here
broom::tidy(mpg.m)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 35.0 2.16 16.2 4.91e-16
## 2 wt -3.35 1.16 -2.88 7.43e- 3
## 3 disp -0.0177 0.00919 -1.93 6.36e- 2
#extension
Testing steps:
step1 <- lm(rating ~ complaints, data=attitude)
step2 <- lm(rating ~ complaints + learning + raises + critical, data=attitude)
anova(step2, step1)
## Analysis of Variance Table
##
## Model 1: rating ~ complaints + learning + raises + critical
## Model 2: rating ~ complaints
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 1253.2
## 2 28 1369.4 -3 -116.16 0.7724 0.5204
Interactions in regression
Bayesian
# note + vs * in formula
h1 <- lmBF(mpg ~ wt * hp, data=mtcars)
h0 <- lmBF(mpg ~ wt + hp, data=mtcars)

h1 / h0
## Bayes factor analysis
## --------------
## [1] wt * hp : 30.43604 ±0%
##
## Against denominator:
## mpg ~ wt + hp
## ---
## Bayes factor type: BFlinearModel, JZS
Frequentist
# note + vs * in formula
h1 <- lm(mpg ~ wt * hp, data=mtcars)
h0 <- lm(mpg ~ wt + hp, data=mtcars)
anova(h1, h0)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt * hp
## Model 2: mpg ~ wt + hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 129.76
## 2 29 195.05 -1 -65.286 14.088 0.0008108 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Or just:
anova(h1)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## wt 1 847.73 847.73 182.923 8.428e-14 ***
## hp 1 83.27 83.27 17.969 0.0002207 ***
## wt:hp 1 65.29 65.29 14.088 0.0008108 ***
## Residuals 28 129.76 4.63
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Quadratic/polynomial terms in regression
#extension
This looks a bit non-linear so we might want to test polynomial effects:
mtcars %>%
  ggplot(aes(wt, mpg)) +
  geom_jitter()
Bayesian
# it's annoying but we have to add the column to the df first
# because lmBF won't allow arithmetic inside its formula
mtcars.poly <- mtcars %>%
  mutate(wt_2 = poly(wt, 2))

h1 <- lmBF(mpg ~ wt + wt_2, data=mtcars.poly)
h0 <- lmBF(mpg ~ wt, data=mtcars.poly)
h1 / h0
## Bayes factor analysis
## --------------
## [1] wt + wt_2 : 9.502005 ±0.01%
##
## Against denominator:
## mpg ~ wt
## ---
## Bayes factor type: BFlinearModel, JZS
Frequentist
h1 <- lm(mpg ~ poly(wt, 2), data=mtcars)
summary(h1) # look at the p value for the second poly term
##
## Call:
## lm(formula = mpg ~ poly(wt, 2), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.483 -1.998 -0.773 1.462 6.238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0906 0.4686 42.877 < 2e-16 ***
## poly(wt, 2)1 -29.1157 2.6506 -10.985 7.52e-12 ***
## poly(wt, 2)2 8.6358 2.6506 3.258 0.00286 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.651 on 29 degrees of freedom
## Multiple R-squared: 0.8191, Adjusted R-squared: 0.8066
## F-statistic: 65.64 on 2 and 29 DF, p-value: 1.715e-11
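#extension
To visualise the quadratic fit alongside the raw data, `geom_smooth` accepts the same polynomial formula (a quick sketch, not taught in the programme):

```r
mtcars %>%
  ggplot(aes(wt, mpg)) +
  geom_point() +
  # fit and draw the same quadratic model as above
  geom_smooth(method = lm, formula = y ~ poly(x, 2))
```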
Model statistics
Extract R2 and other statistics from any linear model:
iris.ancova %>%
  broom::glance() %>%
  pivot_longer(everything())
## # A tibble: 11 x 2
## name value
## <chr> <dbl>
## 1 r.squared 9.54e- 1
## 2 adj.r.squared 9.52e- 1
## 3 sigma 3.85e- 1
## 4 statistic 5.97e+ 2
## 5 p.value 2.44e-94
## 6 df 6.00e+ 0
## 7 logLik -6.67e+ 1
## 8 AIC 1.47e+ 2
## 9 BIC 1.69e+ 2
## 10 deviance 2.14e+ 1
## 11 df.residual 1.44e+ 2
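#extension
`glance()` reports AIC and BIC, but base R can also tabulate these directly for two or more models you want to compare (lower is better). A sketch, reusing the `mtcars` models from the polynomial example:

```r
h1 <- lm(mpg ~ poly(wt, 2), data=mtcars)
h0 <- lm(mpg ~ wt, data=mtcars)

# one row per model, with df and the information criterion
AIC(h0, h1)
BIC(h0, h1)
```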
Plotting model results
Bayesian
#extension
You can plot the posterior density of model parameters. It can help to standardise first:
# standardise first
mtcars.z <- scale(mtcars) %>% as_tibble()
h1.z <- lmBF(mpg ~ wt + hp, data=mtcars.z)
## Warning: data coerced from tibble to data frame
chains <- posterior(h1.z, iterations = 5000, progress = FALSE) %>% as_tibble()
chains %>%
  pivot_longer(c(wt, hp)) %>%
  ggplot(aes(value)) +
  geom_density(aes(y=..scaled..)) +
  facet_wrap(~name, ncol=1) +
  geom_vline(xintercept = 0)
Or plot intervals:
chains %>%
  pivot_longer(c(wt, hp)) %>%
  ggplot(aes(name, value)) +
  # 95% highest density posterior interval (see Kruschke's book);
  # where these don't cross zero the parameter is 'significant'
  stat_summary(fun.data=tidybayes::mean_hdci) +
  coord_flip() +
  geom_hline(yintercept = 0)
To see more of this kind of exploration/plotting/inference with parameter estimates see also this guide (TODO based on MSc materials).
Frequentist
TODO
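One possible approach in the meantime (a sketch only, not part of the taught materials): plot frequentist estimates with 95% confidence intervals via `broom::tidy`, mirroring the Bayesian interval plot above. This reuses the standardised `mtcars.z` data from the Bayesian example:

```r
lm(mpg ~ wt + hp, data=mtcars.z) %>%
  broom::tidy(conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(term, estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  coord_flip() +
  geom_hline(yintercept = 0)
```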
Other data handling tasks
Converting continuous and categorical data
Chopping up a continuous variable into segments (e.g. for ages)
#extension
mtcars %>%
  select(wt) %>%
  mutate(wt_categ = cut(wt, 5)) %>%
  sample_n(10)
## wt wt_categ
## 1 3.440 (3.08,3.86]
## 2 5.424 (4.64,5.43]
## 3 2.320 (2.3,3.08]
## 4 3.845 (3.08,3.86]
## 5 2.780 (2.3,3.08]
## 6 3.840 (3.08,3.86]
## 7 3.460 (3.08,3.86]
## 8 2.770 (2.3,3.08]
## 9 1.513 (1.51,2.3]
## 10 3.570 (3.08,3.86]
Or with pre-set breaks:
mtcars %>%
  select(wt) %>%
  mutate(wt_categ = cut(wt, breaks = c(-1, 1, 3, 5, Inf))) %>%
  sample_n(10)
## wt wt_categ
## 1 1.935 (1,3]
## 2 1.513 (1,3]
## 3 2.465 (1,3]
## 4 3.190 (3,5]
## 5 3.440 (3,5]
## 6 3.570 (3,5]
## 7 2.320 (1,3]
## 8 2.620 (1,3]
## 9 1.615 (1,3]
## 10 3.730 (3,5]
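#extension
The `labels` argument to `cut` gives the categories readable names instead of interval notation (the labels below are just illustrative; there must be one per interval):

```r
mtcars %>%
  select(wt) %>%
  mutate(wt_categ = cut(wt,
                        breaks = c(-1, 1, 3, 5, Inf),
                        labels = c('very light', 'light', 'heavy', 'very heavy'))) %>%
  sample_n(10)
```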