9  Intro to R

R is an open source (i.e. free!) statistical analysis and graphic software that can be run on any operating system - Windows, Mac, and even Linux. R is also a syntax-only program; there are no drop-down menus. An added benefit to the open source nature of the program is that anyone can contribute to it and it is what I term ‘infinitely Google-able’. Since so many people have openly worked on and with it, when it throws an error code, simply pasting that error message in your search bar will yield very helpful results.

R is maintained by the R Core development team, and is updated regularly. You’ll want to periodically update your copy of R to make sure you stay up-to-date, but older versions will continue to run.

9.1 R and R Studio

R is a programming language, and has its own software. However, many users find it much easier to interface with R using R Studio. It is a GUI wrapper for R that allows you to save script files, view plots, and see what objects you have created, among other things.

To download R, visit https://cran.r-project.org/, and select the correct link for your operating system. Windows users will select the “base” option. R Studio can be found at https://posit.co/download/rstudio-desktop/. You will see on this page that it directs you to install R if you have not already. If you have, proceed on to installing R Studio.

See the videos https://youtu.be/0FUB-CGRFR4 and https://youtu.be/DQe-kNqm3L4 for download instructions for R and R studio.

9.2 R Studio Interface

The R Studio interface has four panels. The ‘main’ panel (upper left-hand side of the screen) is the script file you are working on. This panel can also hold a number of other file types (e.g. .qmd, text, or even Python). The Console is directly below your script file. This is where output will appear if you are running an inference test, or where an error message will pop up if R has an issue with your syntax. On the right-hand side are your Environment (upper right) and your Plots (lower right). The Environment shows all the objects you have created, either by reading in data and assigning it to an object, or by simply saying x <- 12 (that is, x = 12). The Plots panel will be empty until you create a plot. When you do, it will appear in the plot window, and can then be exported or modified further. These two sections have other tabs, but they are not immediately pertinent to our class, so I am omitting them for now.

Video: https://youtu.be/Ur1kUAsfAU4

9.3 Syntax Rules

A few rules of R syntax:

  • R doesn’t really care about spacing; it will interpret x<-2 the same as x <- 2. I personally prefer spacing for human readability, but every person has their own syntax style.
  • R is case-sensitive: id and ID are two different things
  • Comments start with a “#” symbol (more on that below)
  • Commands do not need a specific punctuation to end
  • You can run a line of code by using “ctrl” + “Enter” with your cursor anywhere in the line. Or, you can highlight a series of commands, and run it all at once.
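As a quick sketch of the first two rules (the object names and values here are made up for illustration):

```r
# Spacing does not change the result
x<-2
y <- 2
identical(x, y)   # TRUE

# Case matters: id and ID are two separate objects
id <- 1
ID <- 99
id == ID          # FALSE
```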

9.3.1 Comments

When you start a line with a “#” symbol, it is a comment to yourself or anyone reading your code - R does not run these lines. You should ALWAYS comment your code with what you are doing and why, so when you go back later, you’ll remember what you were doing. This also makes it easy for someone else to follow your code.

#A comment always starts with the pound (or hashtag) symbol
sum(x) #A comment can also come after a section of code and in the same line

Note: R does not require any specific punctuation to signify the end of a comment!

9.4 Packages

While base R can do most things that you need, there are additional packages that have been developed that have functions beyond what base R can do. These can be a specialized set of functions for a specific task (e.g. haven), a nicer way to visualize your data (e.g. ggplot2), or a collection of functions that make wrangling data easier (e.g. tidyverse). Packages need to be installed from CRAN using the install.packages() function:

#To install the package tidyverse:
install.packages("tidyverse")

Packages only need to be installed once on a single machine, but if you switch between machines (say, between a desktop and a laptop, or between a lab computer and a personal computer), you will need to have the packages you are using installed on each computer.

Once you have installed a package, you need to let R know you want to use the functions from that package. You do this by using the library function: library(tidyverse). It can be useful to have all the packages you will be using at the top of your code file; this lets you and anyone using your code know what is needed. However, there is nothing ‘wrong’ with simply calling the package before the first instance of its use in your code.
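One common pattern, sketched here, is to install a package only if it is missing and then load it. I am using stats as a stand-in because it ships with R, so the sketch runs without downloading anything; substitute the package you actually need (e.g. "tidyverse"). The object names pkg and installed are just for illustration.

```r
# "stats" ships with R; swap in the package you need
pkg <- "stats"

# requireNamespace() returns TRUE if the package is already installed
installed <- requireNamespace(pkg, quietly = TRUE)

# Only install when the package is missing
if (!installed) {
  install.packages(pkg)
}

# character.only = TRUE lets library() take the name from a variable
library(pkg, character.only = TRUE)
```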

If you wanted more information about a package, you can type ?tidyverse after you have called it (for more information about tidyverse; if you are looking up a different package, substitute the package name you are looking up for “tidyverse”). Information about the package, its authors, and helpful links will show up in the ‘Help’ pane.

9.5 Operators

9.5.1 Math and Equality

In the table below are some of the most common math and equality/logic operators. Of note, “square root” can be accomplished in two ways: 1) by raising the number to the power of 0.5 (4^0.5 = 2), or 2) by using the sqrt() function (sqrt(4) = 2). Also, to preserve order of operations, parentheses should be used liberally - to signify the numerator and denominator of a fraction, for example. If you have two parenthetical statements, R will not automatically know to multiply them; you will need to separate them with a * if you want them to be multiplied. For example: (5-1)*(4-1) rather than (5-1)(4-1).

Symbol Function
+ Addition
- Subtraction
* Multiplication
/ Division
^ Exponent
& And
| Or
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
! Not
!= Not equal
== Is equal
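Here is a short sketch that tries several of these operators. The object names both and either are mine, chosen only for illustration.

```r
# Two ways to take a square root
4^0.5                # 2
sqrt(4)              # 2

# Parentheses need an explicit * between them
(5 - 1) * (4 - 1)    # 12

# Logical operators return TRUE or FALSE
both   <- (5 > 3) & (2 > 3)   # FALSE: only one side is true
either <- (5 > 3) | (2 > 3)   # TRUE: at least one side is true
5 != 4               # TRUE
!(5 == 5)            # FALSE
```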

When you have several values, you can combine them into a single object by using c(). As an example, if I wanted the numbers 1, 3, and 5 together in one object (a vector, described in the Data Types section below), I would combine them with c(1, 3, 5).

9.5.2 Assignment

One symbol you will frequently see is <-. The <- is an assignment operator, and you can use it to assign a dataset to an object, a number to a variable (N <- 250), a set of values to an object (fruits <- c("apple", "orange", "banana")), and much more. The way you would literally read “NAME” <- “ITEM or FUNCTION” is “NAME” gets “ITEM or FUNCTION”: whatever is on the right is assigned to the name on the left. Or, more understandably, you can think of it as “ITEM or FUNCTION” is called “NAME”.

NOTE: the name you choose for your object should be something that reminds you what it holds (e.g. “fruits”), something short (e.g. “df” for “dataframe”), and NOT the name of a function in R (e.g. “mean” would be a bad choice). It should also either be all one word, or have its words separated by underscores (all_students).

When you run the line of code that assigns something to an object, it looks as though nothing has happened. There is no output in your console, no new tabs open, etc. However, if you look over in your environment window you will see your new object, and some information about it.
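A minimal sketch of this behavior; the object total is made up for illustration:

```r
N <- 250                                   # a number
fruits <- c("apple", "orange", "banana")   # a set of words

# Nothing prints on assignment; type the name to see the contents
N
fruits

# Wrapping an assignment in parentheses assigns and prints in one step
(total <- N + 50)   # prints 300
```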

9.6 Strings vs. Numbers

R differentiates between strings (i.e., words) and numbers by surrounding strings in quotes, as seen in install.packages("ggplot2"). If ‘ggplot2’ were not surrounded by quotes, R would look for an object named ggplot2 rather than treat it as a package name. Another example is in the creation of a vector: fruits <- c("apples", "oranges", "bananas"). This is a vector of words. If you run that line of code and look in your environment window, it will tell you that it is a character vector (“chr”), and give you a preview. Strings are a different color in your code to visually differentiate them from code, comments, and numbers. While it is possible to use either single or double quotes to denote strings, best practice is to use double quotes.

Numbers, on the other hand, are not surrounded by quotes. A vector of numbers would be created like this: odd <- c(1, 3, 5, 7). After running that line of code, you will see that ‘odd’ is in your environment as numeric (“num”), and it gives you a preview of what is contained in the vector. R only labels values as integer (“int”; only whole numbers) when they are created as integers, for example with 1:4 or with the L suffix, as in c(1L, 3L). You can perform arithmetic on numbers, which you cannot do with characters. Numbers are a different color in your code to visually differentiate them from code, comments, and strings.
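A small sketch of the difference, using class() to ask R how it stored each object. The odd_int object and its L suffix are my additions for illustration.

```r
fruits  <- c("apples", "oranges", "bananas")
odd     <- c(1, 3, 5, 7)
odd_int <- c(1L, 3L, 5L, 7L)   # the L suffix asks for integers

class(fruits)    # "character"
class(odd)       # "numeric"
class(odd_int)   # "integer"

# Arithmetic works on numbers
odd * 2          # 2 6 10 14

# fruits * 2 would stop with an error: you cannot multiply text
```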

9.7 Object Orientation

R is different from SPSS in that it is an object-oriented language. What that means is that pretty much everything is an object. Object names do not have any length requirements, but should start with a letter. They can contain any combination of letters, numbers, periods, or underscores (but no other special characters). As with everything in R, object names are case sensitive! If you named your column Score but are trying to find the mean of score, R will throw an error (or return an incorrect value!).
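To see the case-sensitivity problem in action, here is a sketch with a made-up dataframe called scores:

```r
# Hypothetical data with a capitalized column name
scores <- data.frame(Score = c(80, 90, 100))

mean(scores$Score)      # 90

# There is no lowercase "score" column, so R hands back NULL
is.null(scores$score)   # TRUE
```

Because scores$score is empty rather than an error, taking its mean gives NA with a warning instead of the value you wanted.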

9.8 Data Types

There are a number of different ways to store data as well as multiple types of data; we will discuss just a few here. As an important note: in R, NA indicates a missing value.

The main types of data we will work with are numeric, integer, logical, and character. The data type will be indicated in the environment pane using abbreviations: num, int, logi, and chr. We can use the functions is.numeric(), is.integer(), is.character(), and is.logical() to check what type of data we have, when we are working with data of all one type. These functions return TRUE or FALSE, depending on whether the data is the requested type or not. The function class() will identify the type of data we are working with. We can also turn one type of data into another (in some cases) by using as.numeric(), as.character(), etc.
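As a sketch of checking and converting types, suppose some ages were read in as text. The objects age_text, age_num, and bad are hypothetical.

```r
age_text <- c("4", "9", "11")   # numbers stored as text
class(age_text)                 # "character"
is.numeric(age_text)            # FALSE

# Convert the text to numbers so we can do math on it
age_num <- as.numeric(age_text)
class(age_num)                  # "numeric"
mean(age_num)                   # 8

# Text that is not a number becomes NA; R also warns, silenced here
bad <- suppressWarnings(as.numeric("purple"))
is.na(bad)                      # TRUE
```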

9.8.1 Vectors

A vector is a sequence of elements that are all the same type: all numbers, all characters, etc. As an example, look at the code below. Note the missing value signified by NA. Also note that NA is not in quotes, even in a character vector.

#Create a vector
gend <- c("M", "F", "F", NA, "M")

#Check type
is.character(gend) #Asking if it is character (TRUE = yes)
[1] TRUE
is.numeric(gend) #Asking if it numeric (FALSE = no)
[1] FALSE
class(gend)
[1] "character"
is.vector(gend) #Asking if it is a vector
[1] TRUE

9.8.2 Lists

Lists are like vectors, but can contain a mix of data types.

#Make a list
ex <- list(1, 4, "purple", TRUE, 8)

#Print the list to the console
ex
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] "purple"

[[4]]
[1] TRUE

[[5]]
[1] 8
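To pull a single element out of a list, use double square brackets. This sketch also shows what happens if the same values go into c() instead: everything is coerced to one type.

```r
ex <- list(1, 4, "purple", TRUE, 8)

# Each element keeps its own type
ex[[3]]          # "purple"
class(ex[[3]])   # "character"
class(ex[[4]])   # "logical"

# The same values in c() are all coerced to character
class(c(1, 4, "purple", TRUE, 8))   # "character"
```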

9.8.3 Factors

Specifying data as a factor is indicating that it represents levels or categories of a variable. Going back to our gend vector, we can see that it is a vector, but we can coerce it into a factor. This can be useful for future statistical testing where you need levels of a factor to perform the test.

#Turn a vector into a factor
gend_factor <- factor(gend)

#See what those levels are
gend_factor
[1] M    F    F    <NA> M   
Levels: F M
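A few helper functions are handy once you have a factor. This is only a sketch of the ones I reach for most:

```r
gend <- c("M", "F", "F", NA, "M")
gend_factor <- factor(gend)

levels(gend_factor)    # "F" "M"
nlevels(gend_factor)   # 2

# Count how many entries fall in each level; NA is left out by default
table(gend_factor)
```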

9.8.4 Matrices

A matrix is a collection of values arranged in a two-dimensional layout. Importantly, however, matrices are NOT dataframes. The main difference lies in the fact that all values in a matrix must be of the same data type - all logical, numeric, character, etc.
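A minimal sketch of building a matrix with the matrix() function. The objects m and mixed are made up for illustration.

```r
# Six values arranged in 2 rows and 3 columns, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m

dim(m)     # 2 3 (rows, then columns)
m[2, 3]    # 6 (row 2, column 3)

# Mixing types forces everything to a single type
mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
class(mixed[1, 1])   # "character"
```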

9.8.5 Dataframes

Dataframes are the most common data format you will be working with. There are a wide range of things that can be done with them, but we will focus on just a few below. As we’ve seen before, we can load in a dataset from either a pre-existing R dataset or an external source, and assign that to an object in R:

#Assign a pre-existing dataset to a dataframe object
df <- women 

#Assign a file to a dataframe object
data1 <- read.csv("file\\path\\here.csv")

You can also create a dataframe from a series of vectors using a data.frame() function. This is often handy when you have created or calculated a series of vectors, and wish to combine them into a dataframe.

#Make some vectors first
id <- 1:5
fruit <- c("apple", "orange", "banana", "kiwi", "watermelon")
age <- c(4, 9, 11, 8, 7)

#Combine them into a dataframe
peds <- data.frame(id, fruit, age)

This can also work to combine different columns from different dataframes. Of note: they have to be the same length, or R is liable to get cranky. You will also want to name them, so R doesn’t just use the default ugly name. In the example below, if we did not specify id = peds$id and instead just said peds$id, R would build the column name from the expression and call the column peds.id. The $ symbol is described in the Selecting Elements section below.

example <- data.frame(id = peds$id, #id is the new column name
                      minutes = treatment$min, 
                      allergy = history$alg)
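The treatment and history dataframes are not created anywhere in this chapter, so here is a self-contained sketch of the same idea. The values in treatment and history are invented purely so the code runs.

```r
# Three small dataframes; treatment and history hold made-up values
peds      <- data.frame(id = 1:5, age = c(4, 9, 11, 8, 7))
treatment <- data.frame(min = c(30, 45, 20, 60, 15))
history   <- data.frame(alg = c(TRUE, FALSE, FALSE, TRUE, FALSE))

# Pull one column from each and give every column a clean name
example <- data.frame(id = peds$id,
                      minutes = treatment$min,
                      allergy = history$alg)
names(example)               # "id" "minutes" "allergy"

# Without a name, R builds one from the expression
names(data.frame(peds$id))   # "peds.id"
```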

9.9 Functions

A function tells R to do something. You can see a number of functions in the text above, with a variety of outcomes. Functions can install packages (install.packages()), set a working directory (setwd()), calculate a mean (mean()), or perform a t-test (t.test()). Notice with all of the functions given as examples, and indeed all functions, there is the function name, followed by a set of parentheses. Most functions take at least one argument. Some need only one, like the package name in install.packages(). Some typically take more than one, like t.test(). When a function can take multiple arguments, its help page (?t.test, for example) shows the function and its arguments written out generically. Arguments are parameters you can set for the function, like which column of a dataframe to use to calculate the mean, or if you want a t-test to be one- or two-tailed.
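As a sketch of how arguments change what a function does, compare mean() with and without its na.rm argument. The scores vector is made up for illustration.

```r
scores <- c(10, 20, NA)

# One argument: the missing value makes the result NA
mean(scores)                 # NA

# A second argument tells mean() to drop missing values first
mean(scores, na.rm = TRUE)   # 15

# args() lists the arguments a function accepts
args(sd)                     # function (x, na.rm = FALSE)

# A few functions need no arguments at all
Sys.Date()
```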

9.10 Useful Functions to Examine Data

Earlier, I assigned a dataset to an object df. It’s a toy dataset, with only two columns and 15 entries. However, it will be used to illustrate a number of different ways to look at your data outside of clicking on it in the environment pane and visually looking at the dataset itself.

9.10.1 names()

If you forget (or just don’t know) the column names in your dataframe, you can get them using the names() function:

#Get column names
names(df)
[1] "height" "weight"

From the output, we see that we have two columns, and they are named “height” and “weight”. This can be especially useful when you think you are using the right column name, but R is throwing an error.

9.10.2 str()

The str() function provides the structure of a dataset. It returns information about the type of data we have, number of observations, the number of variables, lists the column names, and gives a preview of their entries.

#Look at the structure of our dataframe
str(df)
'data.frame':   15 obs. of  2 variables:
 $ height: num  58 59 60 61 62 63 64 65 66 67 ...
 $ weight: num  115 117 120 123 126 129 132 135 139 142 ...

9.10.3 head() and tail()

Some dataset operations that come in handy after first loading in data are looking at the first or last 6 rows. After performing an operation or creating a variable, it is wise to check that what you think you did actually worked correctly. If you wanted to look at the first 6 rows, you would use the head() function, whereas if you wanted to look at the last 6 rows you would use the tail() function. These are both used in place of printing your entire dataset to the console.

#Looking at the first 6 rows of the dataset
head(df)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
#Looking at the last 6 rows of the dataset
tail(df)
   height weight
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164
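Six rows is only the default. Both functions accept an n argument if you want more or fewer rows, as in this sketch (first3 and last2 are names I made up):

```r
df <- women

# First 3 rows instead of the default 6
first3 <- head(df, n = 3)
first3

# Last 2 rows
last2 <- tail(df, n = 2)
last2
```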

9.11 Selecting Elements

9.11.1 Specific Columns

Sometimes, you want to perform an operation on just one column of your dataframe. To reference a specific column, you will make use of the $ operator: df$name would be interpreted as you want the column “name” from the dataframe “df”. We can also reference a column by its place in the dataframe: column 1, column 2, etc. We would do this using the df[row,column] convention. That is to say, if we wanted all rows of the first column, we would do df[,1]. We are referencing the dataframe df, saying we want all rows by leaving that part blank, and saying we want column 1. Both of these column selection options return the same values, and it is often a matter of personal preference which you choose when selecting a single column.

#Select the height column
df$height
 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
#Select the first column.  
df[,1]
 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
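The same df[row,column] convention also selects rows, and you can use a column name in place of its position. A brief sketch, with object names of my own choosing:

```r
df <- women

by_name     <- df$height        # column by name with $
by_position <- df[, 1]          # column by position
by_brackets <- df[["height"]]   # column by name with double brackets

identical(by_name, by_position)   # TRUE
identical(by_name, by_brackets)   # TRUE

# Row 1, all columns
df[1, ]

# Rows 1 through 3 of the weight column
df[1:3, "weight"]                 # 115 117 120
```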

9.11.2 which()

The which() function can be useful when subsetting data and you want to specify certain qualifiers.

#Subset df, but only if height is greater than 60 inches
df1 <- df[which(df$height > 60),]

#We only want height greater than 60 and weight greater than 140
df2 <- df[which(df$height > 60 & df$weight > 140),]
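It can help to see what which() returns on its own: the row numbers where the condition is TRUE. Those row numbers are what go in the row slot of df[row,column]. A sketch using the same women data:

```r
df <- women

# which() gives the positions of the rows that meet the condition
which(df$height > 70)   # 14 15

# Using those positions to keep only the matching rows
df1 <- df[which(df$height > 60), ]
nrow(df1)               # 12

df2 <- df[which(df$height > 60 & df$weight > 140), ]
nrow(df2)               # 6
```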

9.11.3 subset()

Subset is a function that does what it advertises - subsets data! This time, subset() is a function that takes the dataframe as one of the arguments:

#Subset data as above
#First, only want height greater than 60
df3 <- subset(df, subset = height > 60, select = c("height", "weight"))

Notice the arguments: first is the dataframe we want to subset, then what variable(s) and condition(s), and lastly what variables we wish to retain.
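For comparison, here is a sketch of the second which() example rewritten with subset(), plus one that keeps only a single column. The names df4 and df5 are made up.

```r
df <- women

# Height greater than 60 AND weight greater than 140, all columns kept
df4 <- subset(df, subset = height > 60 & weight > 140)
nrow(df4)     # 6

# Height greater than 70, keeping only the weight column
df5 <- subset(df, subset = height > 70, select = weight)
df5$weight    # 159 164
```

Inside subset() you can write height instead of df$height, because the function already knows which dataframe you mean.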

9.12 Sequencing

If you have many values to input, typing them out individually can be time-consuming and error prone. For example, you wouldn’t want to have to type out the numbers 1 through 5000 counting by ones individually! If you had a case like that, you could make use of the seq() function, which creates a sequence of numbers.

#Make a sequence of numbers by using the colon operator
numbers <- 1:10

#Make the same sequence by using the seq() function
numbers2 <- seq(1, 10)

The above will create a vector called numbers of the numbers 1 through 10, inclusive of both 1 and 10. For the example above, if we needed to go from 1 to 5000, we would simply adjust our ending number: 1:5000 or seq(1, 5000). Note that wrapping the colon form in seq() only works when the sequence starts at 1: seq(5:10) returns 1 through 6, not 5 through 10, so prefer 5:10 or seq(5, 10). You can also use the seq() function to count by a value other than one: by 10s, or only odd or even numbers (counting by 2). We accomplish this by adding an additional argument to the seq() function: by = x. In the parentheses after seq, we would give our starting value, ending value, and by what interval we want R to generate numbers: seq(start, end, by = interval).

#Count by 10s
numbers_v2 <- seq(10, 100, by = 10)
numbers_v2
 [1]  10  20  30  40  50  60  70  80  90 100
#Count by 2s
odd_v2 <- seq(1, 197, by = 2) #Not printing this out - perhaps for obvious reasons!

While numbers_v2 was printed as an example, you will typically not print your vector to the console, but rather perform an operation on it, add it to your dataframe, or just save it for later calculations.
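A few more sequence patterns, sketched here with made-up object names. The length.out argument is not covered above; it asks seq() for a set number of evenly spaced values instead of a fixed step.

```r
# The colon operator and seq() agree when you give a start and an end
a <- 5:10
b <- seq(5, 10)
all(a == b)                 # TRUE

# Careful: seq() around a colon range counts positions, not values
c_pos <- seq(5:10)          # 1 2 3 4 5 6

# Five evenly spaced values between 0 and 1
quarters <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# Check how long a sequence is without printing it
odd_v2 <- seq(1, 197, by = 2)
length(odd_v2)              # 99
```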