#Set a working directory
setwd("file\\path\\goes\\here")
10 Setting a Working Directory
Setting a working directory can be nice; you won’t have to type out file paths for everything you want to read in as long as they are all in the same folder. For example, if all your files for Homework 1 are stored in a “HW1” folder, you can set your working directory to that folder, and just type in file names rather than full file paths. You can set a working directory at the beginning of your session, and it will stay the same until you change it.
A shortcut to get the file path for PC users is to navigate to your folder, hold down shift, then right-click on the folder. On the menu that pops up, select “copy as path”. You can then paste this into your setwd()
function. PC users will need to add the double \ in order for R to be able to navigate to the location; Mac users can use the /. PC users also have the option of changing all the \ to /.
If you’re not sure what your working directory is set to and would like to check, you can use getwd()
, and the file path will print to your console.
#Check working directory
getwd()
10.1 Reading in Outside Files
It is quite easy to bring in a variety of different data types to R from common programs such as Excel, SAS, and SPSS. Some file types (eg. SAS) will require packages to be installed first, while others (eg. .csv) can be read in by base R. Below are the most common file types you will encounter, and how to read them in. The easiest way to do this is to have the data files in your working directory. If they are not in your working directory, you will need to type in the full file path.
10.1.1 Excel (.xlsx, .xls)
install.packages("readxl") #Install package if you have not done so
library(readxl) #Call the package
#Pay attention to file extension! Be sure what you type matches your file.
#.xlsx
<- read_excel("your_filename.xlsx")
df_excel_new
#.xls (file not in working directory)
<- read_excel("file\\path\\goes\\here\\data.xls") df_excel_old
Excel files can also be read in by a specific sheet, if you have multiple sheets in one Excel file:
#Read in by sheet number
<- read_excel("filename.xlsx", sheet = 2)
df_sheet2
#Read in by sheet number
#Make sure to exactly match capitalization!
<- read_excel("filename.xls", sheet = Students) df_stud
10.1.2 Delimited files (.tsv, .csv)
By default, R will use the first row of your imported data as column names. If you do not want that, you would include col_names = FALSE
after your file name (example given below).
#To read in .tsv files, you need the readr package
#This can also be used to read in .csv files
#NOTE: if you have installed tidyverse, readr is contained within that package, and you won't need to reinstall it!
install.packages("readr")
library(readr)
#Read in TSV file; file in working directory
<- read_tsv("filename.tsv")
df_TSV
#Reading in a .csv file using the readr package
#Telling it that the first row should NOT be treated as column names
<- read_csv("filename.csv", col_names = FALSE)
df_CSV
#.csv files can also be read in using base R, with no external packages needed
<- read.csv("filename.csv") df_CSV_v2
10.1.3 SAS
install.packages("haven")
library(haven)
<- read_sas("filename.sas7bdat") df_SAS
10.1.4 SPSS
install.packages("haven")
library(haven)
<- read_sav("filename.sav") df_SPSS
10.1.5 Fixed Width Data
To read in fixed width data, you would use the read.fwf()
function. As with other software programs, you also need to specify the widths of each column as well as column names. However, in R, specifying the column widths may look a little different. You not only specify the width of each variable, but also specify how many spaces to skip between each variable with a negative number (see example below). Notice also how both widths
and col.names
are arguments fed to the read.fwf()
function, and since there is more than one argument fed to them, they use the c()
designation, meaning collect all these things and send it on.
<- read.fwf("filename.dat",
df_fwd widths = c(4, -2, 1, -2, 20),
col.names = c("ID", "gender", "practice"))
10.1.6 Space Delimited
Space delimited data is read in using read.table()
, and specifying the separator (sep =
), if there is a header, and what the column names are.
<- read.table("filename.dat",
df_space sep = " ",
header = FALSE,
col.names = c("ID", "gender", "practice"))
10.1.7 Missing Data in Delimited Files
By default, R recognizes NA
as missing data, regardless of if it falls in character or numeric data. However, if missing data was represented using other symbols or numbers (e.g. -999 or .), then you would need to specify that.
#Read in data and specify missing
<- read.table(file = "filename.txt",
df_missing skip = 2, #Telling it to skip the first two lines of text
header = TRUE, #Variable names are included
sep = ",", #This is a comma-delimited file
na.strings = c(-999, ".")) #We want -999 and . to be recognized as missing values
10.1.8 Same Prefix Variables
Sometimes you may have a series of questions that have a similar pattern: dep1 through dep10, for example. Or Q1 through Q10, etc. Rather than type each one out individually when specifying your column names, you can use the paste0()
function to create a character vector.
Using Q1 through Q10 as an example, we could do the following:
<- paste0("Q", 1:10)
Qvars
Qvars
[1] "Q1" "Q2" "Q3" "Q4" "Q5" "Q6" "Q7" "Q8" "Q9" "Q10"
What happens is that “Q” is attached (pasted) to the numbers 1 through 10. We could then use the vector Qvars in our col.names()
argument. Or, we can even use the paste0()
function within the col.names()
argument. Both are shown below:
#Using the vector of variable names within the argument
<- read.table("filename.dat",
df_space2 sep = " ",
header = FALSE,
col.names = c("ID", "gender", Qvars))
#Using paste0 within the column names argument
<- read.table("filename.dat",
df_space3 sep = " ",
header = FALSE,
col.names = c("ID", "gender", paste0("Q", 1:10)))
10.2 Change Variable Name
Sometimes you get in data, but you need (or want!) to change some things about it. Perhaps you need to change column names to match other data, or so you better remember what it represents. There are two ways to change a variable (ie: column) name: reference it by number or reference it by name. Both ways will make use of the names()
function.
#Reminder of what df is
<- women
df
#Rename height to 'height(in)'
names(df)[1] <- "height(in)"
#Check our work
head(df)
height(in) weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
In the function, we are calling the names of the variables, as we did earlier, with names(df)
. We are then saying that the first entry in that vector([1]
) should be replaced with “height(in)” (<- "height(in)"
).
We can also change variable names by referencing its name.
#Change 'height(in)' back to 'height'
names(df)[names(df) == "height(in)"] <- "height"
#Check our work
head(df)
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
Just like when we changed the name using the column number, we start by calling the names of all the variables with names(df)
. Then, we are saying that within that vector of names ([names(df)
) we want the column exactly named (==
) ‘height(in)’ ("height(in)"]
). Lastly, we now want that name to be replaced with ‘height’ (<- "height"
).
10.3 Basic Statistical Functions
Many basic functions are intuitively named: mean, median, minimum, maximum, and standard deviation for example. You can individually call these functions on a single column of a dataframe:
#Assign data
<- women
df
#Mean
mean(df$height)
[1] 65
#Median
median(df$height)
[1] 65
#Minimum
min(df$height)
[1] 58
#Maximum
max(df$height)
[1] 72
#Standard Deviation
sd(df$height)
[1] 4.472136
You might run these to check your data, to verify coding of a new column, or because you just need that value for your analysis. We can also assign the output to an object rather than just having it print to the console, for later use.
<- mean(df$height) m
We could also get all of this information by using the summary()
function. This takes as input your dataframe, and returns out a table with the minimum (Min.), 1st quantile (1st Qu.), Median, Mean, 3rd quantile (3rd Qu.), and the maximum (Max.) for each numeric column.
summary(df)
height weight
Min. :58.0 Min. :115.0
1st Qu.:61.5 1st Qu.:124.5
Median :65.0 Median :135.0
Mean :65.0 Mean :136.7
3rd Qu.:68.5 3rd Qu.:148.0
Max. :72.0 Max. :164.0
10.4 Generating a Count Table
If you wanted to know the count of each unique entry of a variable, you would use the table()
function. This generates a count of how many entries are the same in a given variable. To illustrate this, we can use another built-in dataset, mtcars
.
#Load in dataset, assign to "cars"
<- mtcars
cars
#See what the variables are
names(cars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
#One of the variables in this dataset is "cyl"
#Ask for a count of the cyl
table(cars$cyl)
4 6 8
11 7 14
From this, we can see that 11 entries have 4 cylinders, 7 have 6 cylinders, and 14 have 8 cylinders.
10.5 Saving Files Out
R can save files out in a variety of formats as well. You will need to specify which dataframe you want to save out, where you’d like it saved, what the new file should be called, and in what format you’ll be saving it in. Note that unless you specify a file path, R will save the file in your working directory.
#Save as .csv
write.table(df,
"filename.csv",
sep = ",",
quote = FALSE, #I do not want "" around my string entries
col.names = TRUE, #I do want it to bring the column names
row.names = FALSE) #But I don't want the row names
#Save as .tsv
write.table(df,
"filename.csv",
sep = "/t",
quote = FALSE, #I do not want "" around my string entries
col.names = FALSE, #I don't want it to bring the column names
row.names = FALSE) #Or the row names
To save Excel files, you’ll need the writexl
package.
#Call the package
library(writexl)
#Save the file
write_xlsx(df,
"filename.xlsx")
And writing SAS and SPSS files both require the haven
package.
#Call the package
library(haven)
#Save out as SAS file
write_sas(df,
"filename.sas7bdat")
#Save as an SPSS file
write_sav(df,
"filename.sav")