11 Working with Files in R

Setting a working directory can be nice; you won’t have to type out file paths for everything you want to read in as long as they are all in the same folder. For example, if all your files for Homework 1 are stored in a “HW1” folder, you can set your working directory to that folder, and just type in file names rather than full file paths. You can set a working directory at the beginning of your session, and it will stay the same until you change it.

#Set a working directory
setwd("file\\path\\goes\\here")

A shortcut to get the file path for PC users is to navigate to your folder, hold down shift, then right-click on the folder. On the menu that pops up, select “copy as path”. You can then paste this into your setwd() function. PC users will need to add the double \ in order for R to be able to navigate to the location; Mac users can use the /. PC users also have the option of changing all the \ to /.

If you’re not sure what your working directory is set to and would like to check, you can use getwd(), and the file path will print to your console.

#Check working directory
getwd()

11.1 Projects

Another way to manage your files, and arguably easier if the data is used only for one analysis, is to create a project within RStudio. A project is associated with a working directory (i.e., folder), and can be created within an existing folder or as a new folder. To create a new project within RStudio, you would go to “File” -> “New Project…” and select where you would like your project created. This will create a project file (.Rproj) within the new folder that you can use to open the project directly later on. It will also automatically load the project into a new RStudio session. From there, you can start a new script file. If you later add your data to that folder, you can call it within the project by referencing only the file name, rather than the whole path. You may, of course, bring in other data by using the full file path.

Please watch the video below for a walk-through of this process; it’s challenging to envision otherwise!

Video: Creating a Project

11.2 Reading in Outside Files

It is quite easy to bring in a variety of different data types to R from common programs such as Excel, SAS, and SPSS. Some file types (eg. SAS) will require packages to be installed first, while others (eg. .csv) can be read in by base R. Below are the most common file types you will encounter, and how to read them in. The easiest way to do this is to have the data files in your working directory. If they are not in your working directory, you will need to type in the full file path.

11.2.1 Excel (.xlsx, .xls)

install.packages("readxl") #Install package if you have not done so
library(readxl) #Call the package

#Pay attention to file extension!  Be sure what you type matches your file.
#.xlsx
df_excel_new <- read_excel("your_filename.xlsx") 

#.xls (file not in working directory)
df_excel_old <- read_excel("file\\path\\goes\\here\\data.xls")

Excel files can also be read in by a specific sheet, if you have multiple sheets in one Excel file:

#Read in by sheet number
df_sheet2 <- read_excel("filename.xlsx", sheet = 2)

#Read in by sheet number
#Make sure to exactly match capitalization!
df_stud <- read_excel("filename.xls", sheet = Students)

11.2.2 Delimited files (.tsv, .csv)

By default, R will use the first row of your imported data as column names. If you do not want that, you would include col_names = FALSE after your file name (example given below).

#To read in .tsv files, you need the readr package
#This can also be used to read in .csv files
#NOTE: if you have installed tidyverse, readr is contained within that package, and you won't need to reinstall it!
install.packages("readr")
library(readr)

#Read in TSV file; file in working directory
df_TSV <- read_tsv("filename.tsv") 

#Reading in a .csv file using the readr package
#Telling it that the first row should NOT be treated as column names 
df_CSV <- read_csv("filename.csv", col_names = FALSE)

#.csv files can also be read in using base R, with no external packages needed
df_CSV_v2 <- read.csv("filename.csv")

11.2.3 SAS

install.packages("haven")
library(haven)

df_SAS <- read_sas("filename.sas7bdat")

11.2.4 SPSS

install.packages("haven")
library(haven)

df_SPSS <- read_sav("filename.sav")

11.2.5 Fixed Width Data

To read in fixed width data, you would use the read.fwf() function. As with other software programs, you also need to specify the widths of each column as well as column names. However, in R, specifying the column widths may look a little different. You not only specify the width of each variable, but also specify how many spaces to skip between each variable with a negative number (see example below). Notice also how both widths and col.names are arguments fed to the read.fwf() function, and since there is more than one argument fed to them, they use the c() designation, meaning collect all these things and send it on.

df_fwd <- read.fwf("filename.dat", 
                   widths = c(4, -2, 1, -2, 20),
                   col.names = c("ID", "gender", "practice"))

11.2.6 Space Delimited

Space delimited data is read in using read.table(), and specifying the separator (sep =), if there is a header, and what the column names are.

df_space <- read.table("filename.dat",
                       sep = " ",
                       header = FALSE,
                       col.names = c("ID", "gender", "practice"))

11.2.7 Missing Data in Delimited Files

By default, R recognizes NA as missing data, regardless of if it falls in character or numeric data. However, if missing data was represented using other symbols or numbers (e.g. -999 or .), then you would need to specify that with na.strings = c().

#Read in data and specify missing
df_missing <- read.table(file = "filename.txt",
                         skip = 2, #Telling it to skip the first two lines of text
                         header = TRUE, #Variable names are included
                         sep = ",", #This is a comma-delimited file
                         na.strings = c(-999, ".")) #We want -999 and . to be recognized as missing values

11.2.8 Same Prefix Variables

Sometimes you may have a series of questions that have a similar pattern: dep1 through dep10, for example. Or Q1 through Q10, etc. Rather than type each one out individually when specifying your column names, you can use the paste0() function to create a character vector.

Using Q1 through Q10 as an example, we could do the following:

Qvars <- paste0("Q", 1:10)

Qvars

 [1] "Q1"  "Q2"  "Q3"  "Q4"  "Q5"  "Q6"  "Q7"  "Q8"  "Q9"  "Q10"

What happens is that “Q” is attached (pasted) to the numbers 1 through 10. We could then use the vector Qvars in our col.names() argument. Or, we can even use the paste0() function within the col.names() argument. Both are shown below:

#Using the vector of variable names within the argument
df_space2 <- read.table("filename.dat",
                       sep = " ",
                       header = FALSE,
                       col.names = c("ID", "gender", Qvars))

#Using paste0 within the column names argument
df_space3 <- read.table("filename.dat",
                       sep = " ",
                       header = FALSE,
                       col.names = c("ID", "gender", paste0("Q", 1:10)))

11.3 Change Variable Name

Sometimes you get in data, but you need (or want!) to change some things about it. Perhaps you need to change column names to match other data, or so you better remember what it represents. There are two ways to change a variable (ie: column) name: reference it by number or reference it by name. Both ways will make use of the names() function.

#Reminder of what df is
df <- women

#Rename height to 'height_in'
names(df)[1] <- "height_in"

#Check our work
head(df)

  height_in weight
1        58    115
2        59    117
3        60    120
4        61    123
5        62    126
6        63    129

In the function, we are calling the names of the variables, as we did earlier, with names(df). We are then saying that the first entry in that vector([1]) should be replaced with “height(in)” (<- "height_in").

We could also rename all the variables in the dataframe (not advised with large dataframes!):

#Rename all variables 
names(df) <- c("inches", "pounds")

We can also change variable names by referencing its name.

#Change 'height(in)' back to 'height'
names(df)[names(df) == "height_in"] <- "height"

#Check our work
head(df)

  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129

Just like when we changed the name using the column number, we start by calling the names of all the variables with names(df). Then, we are saying that within that vector of names ([names(df)) we want the column exactly named (==) ‘height(in)’ ("height_in"]). Lastly, we now want that name to be replaced with ‘height’ (<- "height").

11.3.1 Advanced renaming: `tidyverse`

As with subsetting, the tidyverse has a way to rename variables using the rename() function. rename() takes the arguments (new name = old name). For our toy dataframe, it might look something like this:

#Rename variables
df_rename <- df %>%
  rename(height_in = height,
         weight_lbs = weight)

Video: Renaming Variables

11.4 Variable and Value Labels

As mentioned in Chapter 5, variable and value labels help the humans behind the data understand what the variables mean, and what the levels of a factor are. This is also a place where special characters can be used that shouldn’t be used in variable names. For example, “height” is a fine variable name, but “height(in)” is not - you shouldn’t use parenthesis in variable names. However, adding a variable label of “height(in)” or even “height in inches” is perfectly acceptable! Value labels are, again, for human readability. R does not need you to recode “M” and “F” into “0” and “1” - it automatically creates the numbers behind the scenes. If your data was read in as 0/1 though, you may prefer to see M/F.

11.5 Variable Labels

Note

If you first do variable labels, then value labels, you will lose your variable labels!

Variable labels can be done with the Hmisc package (as can value labels). There are other packages that help you label your variables, but since Hmisc can do both, I am going to focus on that one.

There are two different approaches we can take to this. One is to only label the variables we need to. We will use the label() function within the Hmisc package to do this. Below, you can see that I am labeling mpg, disp, and qsec. To do this, I am using the label() function, and feeding into it the specific variable I am labeling. Then, I am assigning text to that label. There is no output for this function; you will instead see the labels when you click on the dataframe in the environment window.

#Read in data
cars <- mtcars

#Call package
library(Hmisc)

#Label variables
label(cars$mpg) <- "Miles per gallon (US)"
label(cars$disp) <- "Displacement (cu. in.)"
label(cars$qsec) <- "1/4 mile time"

Another option is to label the entire dataset (again, not recommended for big datasets!). We will use the same label() function, but this time pass it a list of labels. Notice how this time we are feeding the entire dataframe to label() rather than just a specific variable. Additionally, notice how the list of variable labels is within a list() function (so it is, indeed, a list) and that the variable name comes first, followed by the label.

#Read in data
cars <- mtcars

#Label variables
label(cars) <- list(mpg = "miles per gallon (US)", 
                    cyl = "cylinders", 
                    disp = "Displacement (cu. in.)", 
                    hp = "Gross horsepower",
                    drat = "Rear axle ratio", 
                    wt = "weight (1000lbs)", 
                    qsec = "1/4 mile time", 
                    vs = "Engine", 
                    am = "Transmission",
                    gear = "Number forward gears", 
                    carb = "Carburetors")

11.6 Value Labels

Value labels are handy when you have numbers representing levels of something. However, the way R handles data and labeling means that to label values, you are converting that variable to a factor. Sometimes this is just fine (i.e., gender, race, building, etc.). Other times, though, it is less okay (i.e., Likert scale responses). Once converted to a factor, R will not allow any mathematical operations to be performed on the variable. So, for example, if you labeled a skills inventory that captured responses on a Likert scale, but then wanted to create a sum score, R would not allow that. One way around this is to create a new variable that is the labeled factor version. This one can be used for visualizations and tables, but the original, numeric, variable can be used if a sum score or mean needs to be created.

As an example, let’s say that the number of cylinders a car has are levels of a categorical variable. We want to label them as “no”, “maybe”, and “yes” rather than 4, 6, and 8. We are going to assign these labels to a new variable, cyl_factor, because we also want to create a total sum of the cylinders later. We will use the factor() function, and provide three arguments. First, we are telling it where to look (cars$cyl). Then, we need to provide the levels of the variable with levels = c(4, 6, 8). Finally, we provide what labels we want the levels to have with labels = c("no", "maybe", "yes").

#Label values of cylinder as a new variable
cars$cyl_factor <- factor(cars$cyl,
                          levels = c(4, 6, 8),
                          labels = c("no", "maybe", "yes"))

After we run this, we can look at our dataframe and see that the values for cyl_factor are now no/maybe/yes rather than 4/6/8. We can use this to create more readable tables or graphs, while maintaining the original cyl variable for creating a sum score.

#Look at first few rows of dataframe
head(cars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb cyl_factor
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4      maybe
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4      maybe
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1         no
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1      maybe
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2        yes
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1      maybe

#Make a table
table(cars$cyl_factor)


   no maybe   yes 
   11     7    14

Video: Variable and Value Labels

11.7 Basic Statistical Functions

Many basic functions are intuitively named: mean, median, minimum, maximum, and standard deviation for example. You can individually call these functions on a single column of a dataframe:

#Assign data
df <- women

#Mean
mean(df$height)

[1] 65

#Median
median(df$height)

[1] 65

#Minimum
min(df$height)

[1] 58

#Maximum
max(df$height)

[1] 72

#Standard Deviation
sd(df$height)

[1] 4.472136

You might run these to check your data, to verify coding of a new column, or because you just need that value for your analysis. We can also assign the output to an object rather than just having it print to the console, for later use.

m <- mean(df$height)

We could also get all of this information by using the summary() function. This takes as input your dataframe, and returns out a table with the minimum (Min.), 1st quantile (1st Qu.), Median, Mean, 3rd quantile (3rd Qu.), and the maximum (Max.) for each numeric column.

summary(df)

     height         weight     
 Min.   :58.0   Min.   :115.0  
 1st Qu.:61.5   1st Qu.:124.5  
 Median :65.0   Median :135.0  
 Mean   :65.0   Mean   :136.7  
 3rd Qu.:68.5   3rd Qu.:148.0  
 Max.   :72.0   Max.   :164.0

11.7.1 Using `Range` to find out-of-range values

As a quick note, the range() function, or even looking at the min/max in summary() will be very useful in determining if any of your variables have values outside of an expected or specified range. For example, if you were running a study on women between the ages of 50 and 75, you would expect the range of any age variable to fall between 50 and 75. If you find you have an age of 45 or 999, then you have some decisions to make. An age of 45 is perfectly plausible for humans, and that might mean someone entered the study who shouldn’t have. An age of 999, on the other hand, is not plausible for humans. This is likely a “missing” identifier that someone used from another statistical program. This should be changed to NA in R. We will discuss how to do that in later chapters.

11.7.2 Using Median to find Outliers

Another use of the summary statistics, particularly the median and quartiles, is to determine if a value is an outlier. Recall from Chapter 5, one way to calculate an outlier is using the IQR (interquartile range): if a data point is more than 1.5 * IQR above the third quartile or 1.5 * IQR below the first quartile it could be considered an outlier. The summary() function provides all necessary values to calculate this.

Video: Descriptive Statistics

11.8 Generating a Count Table

If you wanted to know the count of each unique entry of a variable, you would use the table() function. This generates a count of how many entries are the same in a given variable. To illustrate this, we can use another built-in dataset, mtcars.

#Load in dataset, assign to "cars"
cars <- mtcars

#See what the variables are
names(cars)

 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

#One of the variables in this dataset is "cyl"
#Ask for a count of the cyl
table(cars$cyl)


 4  6  8 
11  7 14

From this, we can see that 11 entries have 4 cylinders, 7 have 6 cylinders, and 14 have 8 cylinders.

11.8.1 Proportions

Count tables are useful, but what if we would rather know the proportion of cars that have 6 cylinders? We could do the math ourself, or we could use a second function around table() to generate a proportion table. Notice how we still have to call table(), but it is now an argument in the prop.table() function.

#Get a proportion table
prop.table(table(cars$cyl))


      4       6       8 
0.34375 0.21875 0.43750

From the output, we see that 0.22 (after rounding) of cars have 6 cylinders.

11.8.2 Percents

We can take the above a step further, and get the percent of cars that has 6 cylinders - not that the math is terribly challenging, but it’s nice to have our data in the form we want! Additionally, if your data changes, you won’t have to re-do the math, you can just re-run your code. All we need to do to our prop.table() function from above is multiply it by 100:

#Get percents
prop.table(table(cars$cyl)) * 100


     4      6      8 
34.375 21.875 43.750

Now, our output is indicating the percent of the total dataset that has each number of cylinders. If this is something we need to use in other calculations, we can assign it to an object.

#Assign percents to object
car_cyl_percents <- prop.table(table(cars$cyl)) * 100

And, so we know, we can check how R saved that:

#Check structure of car_cyl_percents
str(car_cyl_percents)

 'table' num [1:3(1d)] 34.4 21.9 43.8
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:3] "4" "6" "8"

We see that is saves out as a table: ('table') with only the percents entered (num [1:3(1d)] 34.4 21.9 43.8). We may find more utility in that as a dataframe, where we can reference specific columns, rows, or even cells. To change it into a data frame, we will use the as.data.frame() function.

Note

You must assign this to an object - it can be a new one or reusing the same name. If you don’t, it will not reflect in your data. I have made this error more times than I care to admit, and only catch it when I’m trying to do an operation on what I think is a dataframe only to find out it is, in fact, something else entirely.

#Change table into a dataframe
car_cyl_percents_df <- as.data.frame(car_cyl_percents)

Video: Tables

11.9 Saving Files Out

R can save files out in a variety of formats as well. You will need to specify which dataframe you want to save out, where you’d like it saved, what the new file should be called, and in what format you’ll be saving it in. Note that unless you specify a file path, R will save the file in your working directory.

#Save as .csv
write.table(df,
            "filename.csv",
            sep = ",",
            quote = FALSE, #I do not want "" around my string entries
            col.names = TRUE, #I do want it to bring the column names
            row.names = FALSE) #But I don't want the row names

#Save as .tsv
write.table(df,
            "filename.csv",
            sep = "/t",
            quote = FALSE, #I do not want "" around my string entries
            col.names = FALSE, #I don't want it to bring the column names
            row.names = FALSE) #Or the row names

To save Excel files, you’ll need the writexl package.

#Call the package
library(writexl)

#Save the file
write_xlsx(df,
           "filename.xlsx")

And writing SAS and SPSS files both require the haven package.

#Call the package
library(haven)

#Save out as SAS file
write_sas(df,
          "filename.sas7bdat")

#Save as an SPSS file
write_sav(df,
          "filename.sav")

11.10 Citations

When you are writing up a methods or results section, you should always be citing your statistical software, and R is no different. What is potentially slightly different about R is that you should not only cite R (and the version), but also any packages you used to analyze your data.

Citing R is thankfully very easy; you can make use of the citation() function. No arguments needed, just run that function as-is. To get your R version, you will run version$version.string. To cite it, you might write something like this:

Data manipulation and statistical analyses were performed using R Statistical Software (v4.5.1; R Core Team 2025).

#Get R citation
citation()


To cite R in publications use:

  R Core Team (2022). R: A language and environment for statistical
  computing. R Foundation for Statistical Computing, Vienna, Austria.
  URL https://www.R-project.org/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2022},
    url = {https://www.R-project.org/},
  }

We have invested a lot of time and effort in creating R, please cite it
when using it for data analysis. See also 'citation("pkgname")' for
citing R packages.

#Get R version
version$version.string

[1] "R version 4.2.1 (2022-06-23 ucrt)"

To cite a package, you can also use citation(), you will just provide the package name as an argument. For example, if I wanted to cite tableone, since I used it to create a table 1, I would do the following to see how the developers want the package cited:

#Get citation for tableone
citation("tableone")


To cite package 'tableone' in publications use:

  Yoshida K, Bartel A (2022). _tableone: Create 'Table 1' to Describe
  Baseline Characteristics with or without Propensity Score Weights_. R
  package version 0.13.2,
  <https://CRAN.R-project.org/package=tableone>.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {tableone: Create 'Table 1' to Describe Baseline Characteristics with or
without Propensity Score Weights},
    author = {Kazuki Yoshida and Alexander Bartel},
    year = {2022},
    note = {R package version 0.13.2},
    url = {https://CRAN.R-project.org/package=tableone},
  }

For a more comprehensive walk-though around citing R and R packages, please see How to Cite R and R Packages

11.1 Projects

11.2 Reading in Outside Files

11.2.1 Excel (.xlsx, .xls)

11.2.2 Delimited files (.tsv, .csv)

11.2.3 SAS

11.2.4 SPSS

11.2.5 Fixed Width Data

11.2.6 Space Delimited

11.2.7 Missing Data in Delimited Files

11.2.8 Same Prefix Variables

11.3 Change Variable Name

11.3.1 Advanced renaming: tidyverse

11.4 Variable and Value Labels

11.5 Variable Labels

11.6 Value Labels

11.7 Basic Statistical Functions

11.7.1 Using Range to find out-of-range values

11.7.2 Using Median to find Outliers

11.8 Generating a Count Table

11.8.1 Proportions

11.8.2 Percents

11.9 Saving Files Out

11.10 Citations

11.3.1 Advanced renaming: `tidyverse`

11.7.1 Using `Range` to find out-of-range values