13  Variable Manipulations

13.1 Creating New Variables

To create a new variable in an existing dataframe, you simply name it as part of the dataframe, and indicate what you would like that variable to be. Let’s read in a dataframe containing height and weight information for 15 women, and add a BMI variable.

#First, read in the data
#It's an existing dataset within R, so I'm just calling it in
df <- women

#Look at column names
names(df)
[1] "height" "weight"
#Make a new variable, BMI
df$BMI <- (df$weight*703)/(df$height^2)

You can see that all we did was name it (“BMI”), and indicate what we wanted in it. We could also add a fake ID column.

#Add an id column
df$id <- c(1:15)
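
The same assignment pattern works with any vectorized expression. As a small sketch (the column names height_cm and weight_kg are hypothetical, chosen only for illustration), the block below adds metric versions of the two columns and builds the ID from the number of rows instead of a hard-coded 15, so the code keeps working if the number of rows changes.

```r
# Start from the built-in women dataset, as above
df <- women

# Hypothetical metric columns (names are illustrative only)
df$height_cm <- df$height * 2.54        # inches to centimeters
df$weight_kg <- df$weight * 0.45359237  # pounds to kilograms

# ID column that adapts to the number of rows
df$id <- seq_len(nrow(df))

head(df, 3)
```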

13.2 Recoding

We can also take existing variables and change their type - numeric to character, for example. Below, we will go through a number of different recodings you might want to undertake.

13.2.1 Create a Numeric Variable from a Numeric Variable

If we wanted the id variable created above to start at 101 rather than 1, we could simply add 100 to it and store the result as a new variable. Similarly, if we had a Likert scale item that needed to be reverse scored, we could perform that operation as well.

#Create a dataframe as an example
#ID column plus a likert scale item
data_ex <- data.frame(
  id = c(1:10),
  i1 = sample(c(1:5), size = 10, replace = TRUE)
)

#Look at the dataframe
data_ex
   id i1
1   1  3
2   2  5
3   3  3
4   4  2
5   5  3
6   6  1
7   7  4
8   8  1
9   9  5
10 10  2
#Now, reverse score the item
data_ex$Ri1 <- 6 - data_ex$i1

#See our new variable
data_ex
   id i1 Ri1
1   1  3   3
2   2  5   1
3   3  3   3
4   4  2   4
5   5  3   3
6   6  1   5
7   7  4   2
8   8  1   5
9   9  5   1
10 10  2   4

We can see that this creates a new variable, “Ri1”, that is i1 but reverse scored. i1 was numeric, and Ri1 is also numeric.
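
Shifting an ID column follows the same pattern as the reverse scoring above. This is a minimal sketch; the variable name id100 is hypothetical and used only for illustration.

```r
# Small example data frame with an ID column starting at 1
data_ex <- data.frame(id = 1:10)

# New numeric variable: the same IDs shifted up by 100
# (id100 is a hypothetical name used only for this sketch)
data_ex$id100 <- data_ex$id + 100

# Smallest and largest of the new IDs
range(data_ex$id100)
```

Because the result is stored in a new column, the original id values remain available for checking.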

13.2.1.1 Reverse Scoring

With Likert-type scales, any sum score created is interpreted as “higher numbers indicate higher amounts of the construct being measured”. If you had a depression scale, for example, higher values would indicate higher depression. However, sometimes items are worded in such a way that, initially, lower values indicate more of the construct. Using the depression example, if you had an item “I have no problem taking care of basic hygiene every day”, someone with severe depression might answer “1 - Strongly Disagree”. This item would need to be reverse scored before creating a sum score.

An easy way to reverse score items is to add 1 to the number of choices, and then subtract the actual answer from that number. For example, with a scale with 7 scale points, 7 + 1 = 8. So you would do 8 - actual answer to get the reverse scored items. Someone who had initially answered 7 would be reverse scored to a 1, and someone who had initially answered a 3 would be reverse scored to a 5.

When reverse scoring, always create a new variable. This both preserves the raw data and allows you to check your work with a crosstabs table. In this table, you should only have values in the cells that are reverse scored (in our 7 scale point example above, we would only expect values in 1/7, 2/6, 3/5, etc.). The crosstabs table should always be used to check your work to make sure your operation was performed as intended.
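
The 7-point rule and the crosstab check described above can be sketched as follows. The item data are simulated and the names item, item_r, and n_points are assumptions made for this illustration; set.seed() is only there so the sketch is reproducible.

```r
# Hypothetical 7-point item, simulated for illustration
set.seed(123)
item <- sample(1:7, size = 50, replace = TRUE)

# Reverse score: (number of scale points + 1) - actual answer
n_points <- 7
item_r <- (n_points + 1) - item

# Crosstab of original against reverse-scored values;
# counts should appear only where the two values add up to 8
table(item, item_r)
```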

13.2.2 Create Categorical Numeric Variable from a Numeric Variable

We can also create a categorical numeric variable from a numeric variable. Going back to the height/weight dataset, we can create a new variable, wt2, that takes on a value of 0 if weight is less than 130, and a value of 1 if weight is 130 or more. Here, we make use of the ifelse() function. We are saying “If weight is greater than or equal to 130 (df$weight >= 130), then assign a 1 to our new variable (the 1). Otherwise, assign a zero (the 0).” We could have made it assign any number, or even a string.

#Create a categorical numeric variable
df$wt2 <- ifelse(df$weight >= 130, 1, 0)

#Look at the data
df
   height weight      BMI id wt2
1      58    115 24.03240  1   0
2      59    117 23.62856  2   0
3      60    120 23.43333  3   0
4      61    123 23.23811  4   0
5      62    126 23.04318  5   0
6      63    129 22.84883  6   0
7      64    132 22.65527  7   1
8      65    135 22.46272  8   1
9      66    139 22.43274  9   1
10     67    142 22.23791 10   1
11     68    146 22.19680 11   1
12     69    150 22.14871 12   1
13     70    154 22.09429 13   1
14     71    159 22.17358 14   1
15     72    164 22.23997 15   1
#Check our work
table(df$weight, df$wt2)
     
      0 1
  115 1 0
  117 1 0
  120 1 0
  123 1 0
  126 1 0
  129 1 0
  132 0 1
  135 0 1
  139 0 1
  142 0 1
  146 0 1
  150 0 1
  154 0 1
  159 0 1
  164 0 1

We can also nest ifelse() statements to categorize into 3 (or more!) different categories. Here, we are saying if weight is 140 or above, wt3 should be a 2. If not, then evaluate the second ifelse() statement, which is saying that if weight is 125 or above, assign a 1 to wt3, otherwise assign a 0.

#Create a second categorical variable
df$wt3 <- ifelse(df$weight >= 140, 2, ifelse(df$weight >= 125, 1, 0))

#Look at the data
df
   height weight      BMI id wt2 wt3
1      58    115 24.03240  1   0   0
2      59    117 23.62856  2   0   0
3      60    120 23.43333  3   0   0
4      61    123 23.23811  4   0   0
5      62    126 23.04318  5   0   1
6      63    129 22.84883  6   0   1
7      64    132 22.65527  7   1   1
8      65    135 22.46272  8   1   1
9      66    139 22.43274  9   1   1
10     67    142 22.23791 10   1   2
11     68    146 22.19680 11   1   2
12     69    150 22.14871 12   1   2
13     70    154 22.09429 13   1   2
14     71    159 22.17358 14   1   2
15     72    164 22.23997 15   1   2
#And check with a crosstab
table(df$weight, df$wt3)
     
      0 1 2
  115 1 0 0
  117 1 0 0
  120 1 0 0
  123 1 0 0
  126 0 1 0
  129 0 1 0
  132 0 1 0
  135 0 1 0
  139 0 1 0
  142 0 0 1
  146 0 0 1
  150 0 0 1
  154 0 0 1
  159 0 0 1
  164 0 0 1
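
Nested ifelse() calls become hard to read as the number of categories grows. One alternative, not used in the text above, is base R's cut() function. The sketch below reproduces the same three categories; the name wt3_cut is hypothetical.

```r
df <- women

# Nested ifelse() version from the text
df$wt3 <- ifelse(df$weight >= 140, 2, ifelse(df$weight >= 125, 1, 0))

# cut() version: intervals are [-Inf, 125), [125, 140), [140, Inf)
# right = FALSE closes each interval on the left, matching the >= tests
# labels = FALSE returns integer codes 1, 2, 3, so subtract 1 to get 0, 1, 2
df$wt3_cut <- cut(df$weight, breaks = c(-Inf, 125, 140, Inf),
                  right = FALSE, labels = FALSE) - 1

# Check that the two versions agree for every row
all(df$wt3 == df$wt3_cut)
```

With cut(), adding a category only means adding one more break point, rather than another level of nesting.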

13.2.3 Create Factor Variable from a Numeric Variable

We can use the same pattern to create a factor variable. The ifelse() call first produces a character (string) variable, which we then coerce into a factor.

#Sort on height
df$ht2 <- ifelse(df$height >= 70, "Tall", "Short")

#Coerce into a factor
df$ht2 <- as.factor(df$ht2)

#Check to see if the new variable is now a factor
str(df$ht2)
 Factor w/ 2 levels "Short","Tall": 1 1 1 1 1 1 1 1 1 1 ...
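
Note that as.factor() orders the levels alphabetically, which is why "Short" comes first in the output above. If a different order is wanted, factor() with an explicit levels argument is one option. A minimal sketch, with ht2_f as a hypothetical variable name:

```r
df <- women
df$ht2 <- ifelse(df$height >= 70, "Tall", "Short")

# as.factor() orders levels alphabetically: "Short", "Tall"
levels(as.factor(df$ht2))

# factor() lets you set the level order yourself
# (ht2_f is a hypothetical name used only for this sketch)
df$ht2_f <- factor(df$ht2, levels = c("Tall", "Short"))
levels(df$ht2_f)

# Counts per level
table(df$ht2_f)
```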

13.2.4 Create Character Variable from Integer Variable

We can use the as.factor() function with numeric variables, too, if the numbers represent levels of something. This is handy if you are performing an ANOVA or another test that requires your grouping variable to be a factor.

#Turn wt3 into a factor
df$wt3 <- as.factor(df$wt3)
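
When the numeric codes stand for named groups, factor() can attach labels at the same time. The sketch below assumes the wt3 coding from earlier; the label text and the name wt3_f are hypothetical, chosen only for illustration.

```r
df <- women
df$wt3 <- ifelse(df$weight >= 140, 2, ifelse(df$weight >= 125, 1, 0))

# Labels are hypothetical, chosen only for illustration
df$wt3_f <- factor(df$wt3, levels = c(0, 1, 2),
                   labels = c("Under 125", "125 to 139", "140 and over"))

# Confirm the type and check the mapping with a crosstab
str(df$wt3_f)
table(df$wt3, df$wt3_f)
```

Storing the labelled factor in a new column keeps the original numeric codes available for the crosstab check.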

13.2.5 Create Numeric Variable from Character Variable

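We can create a numeric variable from a character (or factor) variable pretty easily as well. This will be more common with things like numbers used to indicate gender or race groups, but we will use our height example here. I am showing two different approaches. Both give the same outcome here, although Approach 1 leaves NA for any value that is not explicitly listed, while Approach 2 assigns 0 to everything that is not "Tall".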

#Turn tall and short into numbers
#Approach 1
df$ht3[df$ht2 == "Tall"] <- 1
df$ht3[df$ht2 == "Short"] <- 0

#Approach 2
df$ht4 <- ifelse(df$ht2 == "Tall", 1, 0)

#Look at the results
df
   height weight      BMI id wt2 wt3   ht2 ht3 ht4
1      58    115 24.03240  1   0   0 Short   0   0
2      59    117 23.62856  2   0   0 Short   0   0
3      60    120 23.43333  3   0   0 Short   0   0
4      61    123 23.23811  4   0   0 Short   0   0
5      62    126 23.04318  5   0   1 Short   0   0
6      63    129 22.84883  6   0   1 Short   0   0
7      64    132 22.65527  7   1   1 Short   0   0
8      65    135 22.46272  8   1   1 Short   0   0
9      66    139 22.43274  9   1   1 Short   0   0
10     67    142 22.23791 10   1   2 Short   0   0
11     68    146 22.19680 11   1   2 Short   0   0
12     69    150 22.14871 12   1   2 Short   0   0
13     70    154 22.09429 13   1   2  Tall   1   1
14     71    159 22.17358 14   1   2  Tall   1   1
15     72    164 22.23997 15   1   2  Tall   1   1
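
A common pitfall when the source variable is a factor, as ht2 is here, is calling as.numeric() on it directly: that returns the internal level codes rather than a 0/1 coding. The sketch below shows the difference; ht_num is a hypothetical variable name.

```r
df <- women
df$ht2 <- as.factor(ifelse(df$height >= 70, "Tall", "Short"))

# as.numeric() on a factor returns the internal level codes,
# here 1 for "Short" and 2 for "Tall", not 0 and 1
unique(as.numeric(df$ht2))

# Comparing first and then converting gives the 0/1 coding
# (ht_num is a hypothetical name)
df$ht_num <- as.integer(df$ht2 == "Tall")

# Check the mapping with a crosstab
table(df$ht2, df$ht_num)
```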

13.2.6 Create Character Variable from Character Variable

We can take the same approach to create character variables from character variables. Let’s say we’d rather have “T” and “S” in place of the spelled out words above.

#Create an abbreviated height variable
df$ht5 <- ifelse(df$ht2 == "Tall", "T", "S")

13.3 Create Sum or Mean Variable

Creating calculations using variables is also straightforward. We can make use of the rowSums() and rowMeans() functions to create sums and means by row. This can be useful if you have a series of items that you are trying to sum for each person, or get the average response for. Below, I am adding some fictional responses to our earlier single item data, then illustrating the two functions mentioned above.

#Add some items
data_ex$i2 <- sample(c(1:5), size = 10, replace = TRUE)
data_ex$i3 <- sample(c(1:5), size = 10, replace = TRUE)
data_ex$i4 <- sample(c(1:5), size = 10, replace = TRUE)

#See what we have
data_ex
   id i1 Ri1 i2 i3 i4
1   1  3   3  2  1  4
2   2  5   1  4  2  1
3   3  3   3  5  3  4
4   4  2   4  5  3  1
5   5  3   3  1  2  2
6   6  1   5  4  5  2
7   7  4   2  1  5  4
8   8  1   5  3  5  1
9   9  5   1  5  3  3
10 10  2   4  4  4  2
#Create a sum variable
data_ex$sum <- rowSums(data_ex[,c("Ri1", "i2", "i3", "i4")])

#Could also do it like this
data_ex$sum2 <- rowSums(data_ex[,3:6])

#And a mean variable
data_ex$avg <- rowMeans(data_ex[,c("Ri1", "i2", "i3", "i4")])

Notice how when I use rowSums() and rowMeans(), I am specifying which variables I want the operation performed on. Since I do not have anything before the comma, I am saying “please perform this on all rows, but just the columns I have listed”. You can list them by name, or by index number.
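
When the same set of items is used more than once, it can help to store the column names in a vector and reuse it. The sketch below rebuilds a small example from scratch; the name items is hypothetical, and set.seed() is only there to make the simulated responses reproducible.

```r
# Simulated responses for four 5-point items
set.seed(42)
data_ex <- data.frame(
  id = 1:10,
  i1 = sample(1:5, size = 10, replace = TRUE),
  i2 = sample(1:5, size = 10, replace = TRUE),
  i3 = sample(1:5, size = 10, replace = TRUE),
  i4 = sample(1:5, size = 10, replace = TRUE)
)
data_ex$Ri1 <- 6 - data_ex$i1  # reverse-scored version of i1

# Store the item names once and reuse them (items is a hypothetical name)
items <- c("Ri1", "i2", "i3", "i4")
data_ex$sum <- rowSums(data_ex[, items])
data_ex$avg <- rowMeans(data_ex[, items])

# Sanity check: the mean is the sum divided by the number of items
all.equal(data_ex$avg, data_ex$sum / length(items))
```

Selecting by name also keeps working if columns are later added or reordered, which is not true of index numbers.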

13.3.1 Missing Values

When using rowSums() or rowMeans(), by default you will only get a value for a row when all of the selected columns have valid entries. If a row has a missing value (i.e., NA) in any of those columns, then both rowSums() and rowMeans() will return NA for that row. To have these functions return a value ignoring any missing data, you need to add the argument na.rm = TRUE. For example, let’s add some missing data to our data example from above and calculate two new sum columns: one only for complete data, one for any data we have.

#Make up some data
data_ex2 <- data.frame(
  id = c(1:10),
  i1 = sample(c(1:5), size = 10, replace = TRUE),
  i2 = sample(c(1:5), size = 10, replace = TRUE),
  i3 = sample(c(1:5), size = 10, replace = TRUE),
  i4 = sample(c(1:5, NA), size = 10, replace = TRUE))

#See what we have
data_ex2
   id i1 i2 i3 i4
1   1  4  1  4  1
2   2  3  3  5  5
3   3  2  1  1  2
4   4  3  1  4  1
5   5  5  5  5  5
6   6  4  4  2 NA
7   7  4  4  2  3
8   8  1  4  5  2
9   9  2  4  1  1
10 10  2  5  1  3
#Create a sum variable only for complete data
data_ex2$sum <- rowSums(data_ex2[,c("i1", "i2", "i3", "i4")])

#Create a sum variable for any data
data_ex2$sum2 <- rowSums(data_ex2[,c("i1", "i2", "i3", "i4")],
                         na.rm = TRUE)
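
Because the example above relies on sample(), a missing value is not guaranteed to appear on every run. The sketch below uses a small hand-made data frame (data_na and the other names are hypothetical) so the effect of na.rm is always visible, and also counts how many items each person answered.

```r
# Tiny hand-made example so the missing value is guaranteed
data_na <- data.frame(
  i1 = c(4, 2, 5),
  i2 = c(3, NA, 5),
  i3 = c(1, 2, 5)
)

# Default behaviour: NA for the row with a missing item
sum_complete <- rowSums(data_na)

# na.rm = TRUE: sum whatever is present
sum_any <- rowSums(data_na, na.rm = TRUE)

# na.rm = TRUE with rowMeans(): average over the answered items only
avg_any <- rowMeans(data_na, na.rm = TRUE)

# Number of answered items per row
n_answered <- rowSums(!is.na(data_na))

data.frame(sum_complete, sum_any, avg_any, n_answered)
```

Keep in mind that a sum computed with na.rm = TRUE treats a missing item as if it contributed nothing, so people with missing items will tend to have lower sums. Checking n_answered alongside the sum makes that visible.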