To create a new variable in an existing dataframe, you simply name it as part of the dataframe, and indicate what you would like that variable to be. Let’s read in a dataframe containing height and weight information for 15 women, and add a BMI variable.
#First, read in the data
#It's an existing dataset within R, so I'm just calling it in
df <- women
#Look at column names
names(df)
[1] "height" "weight"
#Make a new variable, BMI
df$BMI <- (df$weight*703)/(df$height^2)
You can see that all we did was name it (“BMI”), and indicate what we wanted in it. We could also add a fake ID column.
#Add an id column
df$id <- c(1:15)
We can also take existing variables and change their type - numeric to character, for example. Below, we will go through a number of different recodings you might want to undertake.
If we wanted our id variable created above to start at 101 rather than 1, we can simply add 100 to it, and create a new variable in the process. Or, if we had a Likert scale item that needed to be reverse scored, we could perform that operation as well.
#Create a dataframe as an example
#ID column plus a likert scale item
data_ex <- data.frame(
  id = c(1:10),
  i1 = sample(c(1:5), size = 10, replace = TRUE)
)
#Look at the dataframe
data_ex
id i1
1 1 5
2 2 2
3 3 4
4 4 5
5 5 5
6 6 5
7 7 3
8 8 3
9 9 1
10 10 3
#Now, reverse score the item
data_ex$Ri1 <- 6 - data_ex$i1
#See our new variable
data_ex
id i1 Ri1
1 1 5 1
2 2 2 4
3 3 4 2
4 4 5 1
5 5 5 1
6 6 5 1
7 7 3 3
8 8 3 3
9 9 1 5
10 10 3 3
We can see that this creates a new variable, “Ri1”, that is i1 but reverse scored. i1 was numeric, and Ri1 is also numeric.
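The other recoding mentioned above, shifting the id column by a constant, follows the same pattern. Here is a minimal sketch, assuming the built-in women dataset; demo is a scratch copy so the running df is left untouched, and id100 is a made-up column name for illustration.

```r
#Scratch copy of the built-in dataset (demo is a hypothetical name)
demo <- women
demo$id <- 1:15
#Add 100 to every id and store the result in a new column
#(id100 is a hypothetical name chosen for this sketch)
demo$id100 <- demo$id + 100
#The original id column is unchanged; the new one runs from 101 to 115
range(demo$id100)
```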
We can also create a categorical numeric variable from a numeric variable. Going back to the height/weight dataset, we can create a new variable, wt2, that takes on a value of 0 if weight is less than 130, and a value of 1 if weight is 130 or more. Here, we make use of an ifelse() statement. We are saying “If weight is greater than or equal to 130 (the df$weight >= 130), then assign a 1 to our new variable (the 1). Otherwise, assign a zero (the 0).” We could have made it assign any number, or even a string.
#Create a categorical numeric variable
df$wt2 <- ifelse(df$weight >= 130, 1, 0)
#Look at the data
df
height weight BMI id wt2
1 58 115 24.03240 1 0
2 59 117 23.62856 2 0
3 60 120 23.43333 3 0
4 61 123 23.23811 4 0
5 62 126 23.04318 5 0
6 63 129 22.84883 6 0
7 64 132 22.65527 7 1
8 65 135 22.46272 8 1
9 66 139 22.43274 9 1
10 67 142 22.23791 10 1
11 68 146 22.19680 11 1
12 69 150 22.14871 12 1
13 70 154 22.09429 13 1
14 71 159 22.17358 14 1
15 72 164 22.23997 15 1
#Check our work
table(df$weight, df$wt2)
0 1
115 1 0
117 1 0
120 1 0
123 1 0
126 1 0
129 1 0
132 0 1
135 0 1
139 0 1
142 0 1
146 0 1
150 0 1
154 0 1
159 0 1
164 0 1
We can also nest ifelse() statements to categorize into 3 (or more!) different categories. Here, we are saying if weight is 140 or above, wt3 should be a 2. If not, then evaluate the second ifelse() statement, which is saying that if weight is 125 or above, assign a 1 to wt3, otherwise assign a 0.
#Create a second categorical variable
df$wt3 <- ifelse(df$weight >= 140, 2, ifelse(df$weight >= 125, 1, 0))
#Look at the data
df
height weight BMI id wt2 wt3
1 58 115 24.03240 1 0 0
2 59 117 23.62856 2 0 0
3 60 120 23.43333 3 0 0
4 61 123 23.23811 4 0 0
5 62 126 23.04318 5 0 1
6 63 129 22.84883 6 0 1
7 64 132 22.65527 7 1 1
8 65 135 22.46272 8 1 1
9 66 139 22.43274 9 1 1
10 67 142 22.23791 10 1 2
11 68 146 22.19680 11 1 2
12 69 150 22.14871 12 1 2
13 70 154 22.09429 13 1 2
14 71 159 22.17358 14 1 2
15 72 164 22.23997 15 1 2
#And check with a crosstab
table(df$weight, df$wt3)
0 1 2
115 1 0 0
117 1 0 0
120 1 0 0
123 1 0 0
126 0 1 0
129 0 1 0
132 0 1 0
135 0 1 0
139 0 1 0
142 0 0 1
146 0 0 1
150 0 0 1
154 0 0 1
159 0 0 1
164 0 0 1
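When the number of categories grows, nested ifelse() calls get hard to read. One possible alternative, which is not the approach used in this tutorial, is base R's cut(). The sketch below assumes the same cut points (125 and 140); demo and wt3_cut are made-up names, and note that cut() returns a factor rather than a numeric variable.

```r
#Scratch copy of the built-in dataset (demo is a hypothetical name)
demo <- women
#Nested ifelse(), as in the text
demo$wt3 <- ifelse(demo$weight >= 140, 2, ifelse(demo$weight >= 125, 1, 0))
#Alternative sketch with cut(): right = FALSE makes each interval include
#its lower bound, which matches the >= comparisons above
demo$wt3_cut <- cut(demo$weight,
                    breaks = c(-Inf, 125, 140, Inf),
                    labels = c("0", "1", "2"),
                    right = FALSE)
#Crosstab the two codings; all counts should fall on the diagonal
table(demo$wt3, demo$wt3_cut)
```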
We can use the same pattern to create a string variable, and then coerce it into a factor.
#Sort on height
df$ht2 <- ifelse(df$height >= 70, "Tall", "Short")
#Coerce into a factor
df$ht2 <- as.factor(df$ht2)
#Check to see if the new variable is now a factor
str(df$ht2)
Factor w/ 2 levels "Short","Tall": 1 1 1 1 1 1 1 1 1 1 ...
We can use the as.factor() function with numeric variables, too, if the numbers are representing levels of something. This is handy if you are performing an ANOVA or other test that requires your grouping variable to be a factor.
#Turn wt3 into a factor
df$wt3 <- as.factor(df$wt3)
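as.factor() keeps the numbers themselves as the level names. If descriptive level names would read better in test output, factor() with its levels and labels arguments is one option. This is a sketch only: the label text and the names demo and wt3_f are invented here, not part of the original example.

```r
#Scratch copy of the built-in dataset (demo is a hypothetical name)
demo <- women
demo$wt3 <- ifelse(demo$weight >= 140, 2, ifelse(demo$weight >= 125, 1, 0))
#Convert to a factor and attach a label to each numeric code
#(the labels "Low", "Middle", "High" are made up for illustration)
demo$wt3_f <- factor(demo$wt3,
                     levels = c(0, 1, 2),
                     labels = c("Low", "Middle", "High"))
#Check the structure and crosstab the codes against the labels
str(demo$wt3_f)
table(demo$wt3, demo$wt3_f)
```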
We can create a numeric variable from a character variable pretty easily as well. This will be more common with things like numbers used to indicate gender or race groups, but we will use our height example here. I am showing two different approaches - both will give the same outcome.
#Turn tall and short into numbers
#Approach 1
#Start with an empty column so the assignments below do not depend on row order
df$ht3 <- NA
df$ht3[df$ht2 == "Tall"] <- 1
df$ht3[df$ht2 == "Short"] <- 0
#Approach 2
df$ht4 <- ifelse(df$ht2 == "Tall", 1, 0)
#Look at the results
df
height weight BMI id wt2 wt3 ht2 ht3 ht4
1 58 115 24.03240 1 0 0 Short 0 0
2 59 117 23.62856 2 0 0 Short 0 0
3 60 120 23.43333 3 0 0 Short 0 0
4 61 123 23.23811 4 0 0 Short 0 0
5 62 126 23.04318 5 0 1 Short 0 0
6 63 129 22.84883 6 0 1 Short 0 0
7 64 132 22.65527 7 1 1 Short 0 0
8 65 135 22.46272 8 1 1 Short 0 0
9 66 139 22.43274 9 1 1 Short 0 0
10 67 142 22.23791 10 1 2 Short 0 0
11 68 146 22.19680 11 1 2 Short 0 0
12 69 150 22.14871 12 1 2 Short 0 0
13 70 154 22.09429 13 1 2 Tall 1 1
14 71 159 22.17358 14 1 2 Tall 1 1
15 72 164 22.23997 15 1 2 Tall 1 1
We can take the same approach to create character variables from character variables. Let’s say we’d rather have “T” and “S” in place of the spelled out words above.
#Create an abbreviated height variable
df$ht5 <- ifelse(df$ht2 == "Tall", "T", "S")
Creating calculations using variables is also straightforward. We can use the rowSums() and rowMeans() functions to create sums and means by row. This can be useful if you have a series of items that you want to sum for each person, or get the average response for. Below, I am adding some fictional responses to our earlier single-item data, then illustrating the two functions mentioned above.
#Add some items
data_ex$i2 <- sample(c(1:5), size = 10, replace = TRUE)
data_ex$i3 <- sample(c(1:5), size = 10, replace = TRUE)
data_ex$i4 <- sample(c(1:5), size = 10, replace = TRUE)
#See what we have
data_ex
id i1 Ri1 i2 i3 i4
1 1 5 1 4 3 5
2 2 2 4 3 2 3
3 3 4 2 1 3 1
4 4 5 1 1 5 2
5 5 5 1 1 2 2
6 6 5 1 2 5 3
7 7 3 3 3 1 2
8 8 3 3 3 2 5
9 9 1 5 4 4 2
10 10 3 3 5 3 4
#Create a sum variable
data_ex$sum <- rowSums(data_ex[,c("Ri1", "i2", "i3", "i4")])
#Could also do it like this
data_ex$sum2 <- rowSums(data_ex[,3:6])
#And a mean variable
data_ex$avg <- rowMeans(data_ex[,c("Ri1", "i2", "i3", "i4")])
Notice how when I use rowSums() and rowMeans(), I am specifying which variables I want the operation performed on. Since I do not have anything before the comma, I am asking for the operation on all rows, but only the columns I have listed. You can list them by name, or by index number.
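To confirm that selecting columns by name and by index number gives the same answer, the two results can be compared directly. This is a self-contained sketch: demo_items, by_name, and by_index are made-up names, and the seed is an assumption added only so the random items are reproducible.

```r
#Hypothetical seed, only so this sketch is reproducible
set.seed(1)
#Rebuild a small item dataframe like the one above (demo_items is a made-up name)
demo_items <- data.frame(
  id = 1:10,
  i1 = sample(1:5, size = 10, replace = TRUE)
)
demo_items$Ri1 <- 6 - demo_items$i1
demo_items$i2 <- sample(1:5, size = 10, replace = TRUE)
demo_items$i3 <- sample(1:5, size = 10, replace = TRUE)
demo_items$i4 <- sample(1:5, size = 10, replace = TRUE)
#Select the same four columns by name and by position (columns 3 to 6)
by_name <- rowSums(demo_items[, c("Ri1", "i2", "i3", "i4")])
by_index <- rowSums(demo_items[, 3:6])
#TRUE when both selections produce the same sums
identical(by_name, by_index)
```

Selecting by name is usually the safer choice, since index numbers silently point at different columns if the column order changes.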
We can rename a variable using the names() function.
#Rename height to height (in)
names(df)[names(df) == "height"] <- "height (in)"
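One consequence of a name like "height (in)" is that it contains a space and parentheses, so it can no longer be typed directly after $. The sketch below shows two ways to reach the renamed column; demo is a scratch copy with a made-up name, used so the running df is not touched.

```r
#Scratch copy of the built-in dataset (demo is a hypothetical name)
demo <- women
#Rename height to height (in), as in the text
names(demo)[names(demo) == "height"] <- "height (in)"
names(demo)
#With $, a name containing spaces or parentheses needs backticks
mean(demo$`height (in)`)
#Double brackets with the quoted name work as well
mean(demo[["height (in)"]])
```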