10  Setting a Working Directory

Setting a working directory can be nice; you won’t have to type out file paths for everything you want to read in as long as they are all in the same folder. For example, if all your files for Homework 1 are stored in a “HW1” folder, you can set your working directory to that folder, and just type in file names rather than full file paths. You can set a working directory at the beginning of your session, and it will stay the same until you change it.

#Set a working directory
setwd("file\\path\\goes\\here")

A shortcut to get the file path for PC users is to navigate to your folder, hold down shift, then right-click on the folder. On the menu that pops up, select “copy as path”. You can then paste this into your setwd() function. PC users will need to add the double \ in order for R to be able to navigate to the location; Mac users can use the /. PC users also have the option of changing all the \ to /.

If you’re not sure what your working directory is set to and would like to check, you can use getwd(), and the file path will print to your console.

#Check working directory
getwd()

10.1 Reading in Outside Files

It is quite easy to bring in a variety of different data types to R from common programs such as Excel, SAS, and SPSS. Some file types (eg. SAS) will require packages to be installed first, while others (eg. .csv) can be read in by base R. Below are the most common file types you will encounter, and how to read them in. The easiest way to do this is to have the data files in your working directory. If they are not in your working directory, you will need to type in the full file path.

10.1.1 Excel (.xlsx, .xls)

install.packages("readxl") #Install package if you have not done so
library(readxl) #Call the package

#Pay attention to file extension!  Be sure what you type matches your file.
#.xlsx
df_excel_new <- read_excel("your_filename.xlsx") 

#.xls (file not in working directory)
df_excel_old <- read_excel("file\\path\\goes\\here\\data.xls")  

Excel files can also be read in by a specific sheet, if you have multiple sheets in one Excel file:

#Read in by sheet number
df_sheet2 <- read_excel("filename.xlsx", sheet = 2)

#Read in by sheet number
#Make sure to exactly match capitalization!
df_stud <- read_excel("filename.xls", sheet = Students) 

10.1.2 Delimited files (.tsv, .csv)

By default, R will use the first row of your imported data as column names. If you do not want that, you would include col_names = FALSE after your file name (example given below).

#To read in .tsv files, you need the readr package
#This can also be used to read in .csv files
#NOTE: if you have installed tidyverse, readr is contained within that package, and you won't need to reinstall it!
install.packages("readr")
library(readr)

#Read in TSV file; file in working directory
df_TSV <- read_tsv("filename.tsv") 

#Reading in a .csv file using the readr package
#Telling it that the first row should NOT be treated as column names 
df_CSV <- read_csv("filename.csv", col_names = FALSE)

#.csv files can also be read in using base R, with no external packages needed
df_CSV_v2 <- read.csv("filename.csv") 

10.1.3 SAS

install.packages("haven")
library(haven)

df_SAS <- read_sas("filename.sas7bdat") 

10.1.4 SPSS

install.packages("haven")
library(haven)

df_SPSS <- read_sav("filename.sav")

10.1.5 Fixed Width Data

To read in fixed width data, you would use the read.fwf() function. As with other software programs, you also need to specify the widths of each column as well as column names. However, in R, specifying the column widths may look a little different. You not only specify the width of each variable, but also specify how many spaces to skip between each variable with a negative number (see example below). Notice also how both widths and col.names are arguments fed to the read.fwf() function, and since there is more than one argument fed to them, they use the c() designation, meaning collect all these things and send it on.

df_fwd <- read.fwf("filename.dat", 
                   widths = c(4, -2, 1, -2, 20),
                   col.names = c("ID", "gender", "practice"))

10.1.6 Space Delimited

Space delimited data is read in using read.table(), and specifying the separator (sep =), if there is a header, and what the column names are.

df_space <- read.table("filename.dat",
                       sep = " ",
                       header = FALSE,
                       col.names = c("ID", "gender", "practice"))

10.1.7 Missing Data in Delimited Files

By default, R recognizes NA as missing data, regardless of if it falls in character or numeric data. However, if missing data was represented using other symbols or numbers (e.g. -999 or .), then you would need to specify that.

#Read in data and specify missing
df_missing <- read.table(file = "filename.txt",
                         skip = 2, #Telling it to skip the first two lines of text
                         header = TRUE, #Variable names are included
                         sep = ",", #This is a comma-delimited file
                         na.strings = c(-999, ".")) #We want -999 and . to be recognized as missing values

10.1.8 Same Prefix Variables

Sometimes you may have a series of questions that have a similar pattern: dep1 through dep10, for example. Or Q1 through Q10, etc. Rather than type each one out individually when specifying your column names, you can use the paste0() function to create a character vector.

Using Q1 through Q10 as an example, we could do the following:

Qvars <- paste0("Q", 1:10)

Qvars
 [1] "Q1"  "Q2"  "Q3"  "Q4"  "Q5"  "Q6"  "Q7"  "Q8"  "Q9"  "Q10"

What happens is that “Q” is attached (pasted) to the numbers 1 through 10. We could then use the vector Qvars in our col.names() argument. Or, we can even use the paste0() function within the col.names() argument. Both are shown below:

#Using the vector of variable names within the argument
df_space2 <- read.table("filename.dat",
                       sep = " ",
                       header = FALSE,
                       col.names = c("ID", "gender", Qvars))

#Using paste0 within the column names argument
df_space3 <- read.table("filename.dat",
                       sep = " ",
                       header = FALSE,
                       col.names = c("ID", "gender", paste0("Q", 1:10)))

10.2 Change Variable Name

Sometimes you get in data, but you need (or want!) to change some things about it. Perhaps you need to change column names to match other data, or so you better remember what it represents. There are two ways to change a variable (ie: column) name: reference it by number or reference it by name. Both ways will make use of the names() function.

#Reminder of what df is
df <- women

#Rename height to 'height(in)'
names(df)[1] <- "height(in)"

#Check our work
head(df)
  height(in) weight
1         58    115
2         59    117
3         60    120
4         61    123
5         62    126
6         63    129

In the function, we are calling the names of the variables, as we did earlier, with names(df). We are then saying that the first entry in that vector([1]) should be replaced with “height(in)” (<- "height(in)").

We can also change variable names by referencing its name.

#Change 'height(in)' back to 'height'
names(df)[names(df) == "height(in)"] <- "height"

#Check our work
head(df)
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129

Just like when we changed the name using the column number, we start by calling the names of all the variables with names(df). Then, we are saying that within that vector of names ([names(df)) we want the column exactly named (==) ‘height(in)’ ("height(in)"]). Lastly, we now want that name to be replaced with ‘height’ (<- "height").

10.3 Basic Statistical Functions

Many basic functions are intuitively named: mean, median, minimum, maximum, and standard deviation for example. You can individually call these functions on a single column of a dataframe:

#Assign data
df <- women

#Mean
mean(df$height)
[1] 65
#Median
median(df$height)
[1] 65
#Minimum
min(df$height)
[1] 58
#Maximum
max(df$height)
[1] 72
#Standard Deviation
sd(df$height)
[1] 4.472136

You might run these to check your data, to verify coding of a new column, or because you just need that value for your analysis. We can also assign the output to an object rather than just having it print to the console, for later use.

m <- mean(df$height)

We could also get all of this information by using the summary() function. This takes as input your dataframe, and returns out a table with the minimum (Min.), 1st quantile (1st Qu.), Median, Mean, 3rd quantile (3rd Qu.), and the maximum (Max.) for each numeric column.

summary(df)
     height         weight     
 Min.   :58.0   Min.   :115.0  
 1st Qu.:61.5   1st Qu.:124.5  
 Median :65.0   Median :135.0  
 Mean   :65.0   Mean   :136.7  
 3rd Qu.:68.5   3rd Qu.:148.0  
 Max.   :72.0   Max.   :164.0  

10.4 Generating a Count Table

If you wanted to know the count of each unique entry of a variable, you would use the table() function. This generates a count of how many entries are the same in a given variable. To illustrate this, we can use another built-in dataset, mtcars.

#Load in dataset, assign to "cars"
cars <- mtcars

#See what the variables are
names(cars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
#One of the variables in this dataset is "cyl"
#Ask for a count of the cyl
table(cars$cyl)

 4  6  8 
11  7 14 

From this, we can see that 11 entries have 4 cylinders, 7 have 6 cylinders, and 14 have 8 cylinders.

10.5 Saving Files Out

R can save files out in a variety of formats as well. You will need to specify which dataframe you want to save out, where you’d like it saved, what the new file should be called, and in what format you’ll be saving it in. Note that unless you specify a file path, R will save the file in your working directory.

#Save as .csv
write.table(df,
            "filename.csv",
            sep = ",",
            quote = FALSE, #I do not want "" around my string entries
            col.names = TRUE, #I do want it to bring the column names
            row.names = FALSE) #But I don't want the row names

#Save as .tsv
write.table(df,
            "filename.csv",
            sep = "/t",
            quote = FALSE, #I do not want "" around my string entries
            col.names = FALSE, #I don't want it to bring the column names
            row.names = FALSE) #Or the row names

To save Excel files, you’ll need the writexl package.

#Call the package
library(writexl)

#Save the file
write_xlsx(df,
           "filename.xlsx")

And writing SAS and SPSS files both require the haven package.

#Call the package
library(haven)

#Save out as SAS file
write_sas(df,
          "filename.sas7bdat")

#Save as an SPSS file
write_sav(df,
          "filename.sav")