11  Basic Statistical Functions in R

#|echo: FALSE
df <- women

df2 <- mtcars

library(palmerpenguins)
Warning: package 'palmerpenguins' was built under R version 4.3.3
df3 <- palmerpenguins::penguins_raw

11.1 Descriptives

There are a number of different functions we can use to gather descriptive statistics; some are included in base R and some are associated with other packages.

Function Package? Action
table() No Obtains frequencies for categorical variables; can also generate cross-tabulations between multiple categorical variables
CrossTable() gmodels Gives cross-tabulations between multiple categorical variables
mean(), sd(), median(), min(), and max() No Performs advertised function
summary() No Gives descriptive statistics for continuous variables or frequencies for categorical variables.  Can be used on a single variable or entire dataframe.
describe() psych Gives descriptive statistics for variables - most appropriate with continuous variables.  Can be used on a single variable or entire dataframe.
describeBy() psych Similar to describe(), but allows to give descriptive statistics for different levels of a categorical variable.
CreateTableOne() tableone Gives frequencies/percents for categorical variables and means/sds for continuous variables.
cor(), cor.test() No Calculates the correlation between two continuous variables; cor.test() will give the significance test.
corr.test() psych Gives correlations and significance tests.

11.1.1 psych package

There is a package called psych that contains many useful functions, one of which is describe(). Similar to summary(), it provides a number of different values. However, it provides many more than summary: item name, item number, number of valid cases (n), mean, standard deviation, trimmed mean, median, median absolute deviation (mad), minimum, maximum, skew, kurtosis, and standard error.

#Call the package
library(psych)  

#Get descriptive information on a dataframe
describe(df)
       vars  n   mean    sd median trimmed   mad min max range skew kurtosis
height    1 15  65.00  4.47     65   65.00  5.93  58  72    14 0.00    -1.44
weight    2 15 136.73 15.50    135  136.31 17.79 115 164    49 0.23    -1.34
         se
height 1.15
weight 4.00

Also in the psych package is the function describeBy() which allows you to get descriptive statistics by different levels of a categorical variable.

#Using a different dataset: mtcars
#Asking for descriptive statistics for mpg and wt columns only
describeBy(df2[, c("mpg", "wt")], group = df2$cyl)

 Descriptive statistics by group 
group: 4
    vars  n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
mpg    1 11 26.66 4.51   26.0   26.44 6.52 21.40 33.90 12.50 0.26    -1.65 1.36
wt     2 11  2.29 0.57    2.2    2.27 0.54  1.51  3.19  1.68 0.30    -1.36 0.17
------------------------------------------------------------ 
group: 6
    vars n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
mpg    1 7 19.74 1.45  19.70   19.74 1.93 17.80 21.40  3.60 -0.16    -1.91 0.55
wt     2 7  3.12 0.36   3.21    3.12 0.36  2.62  3.46  0.84 -0.22    -1.98 0.13
------------------------------------------------------------ 
group: 8
    vars  n mean   sd median trimmed  mad   min   max range  skew kurtosis   se
mpg    1 14 15.1 2.56  15.20   15.15 1.56 10.40 19.20  8.80 -0.36    -0.57 0.68
wt     2 14  4.0 0.76   3.76    3.95 0.41  3.17  5.42  2.25  0.99    -0.71 0.20

11.2 Frequencies

To get frequencies for a single categorical variable, we can use the table() function. As an example, we will use the Palmer Penguins dataset, and ask for a table by species.

#Get frequency of penguin species
table(df3$Species)

      Adelie Penguin (Pygoscelis adeliae) 
                                      152 
Chinstrap penguin (Pygoscelis antarctica) 
                                       68 
        Gentoo penguin (Pygoscelis papua) 
                                      124 

If we wanted proportions instead, we could use the prop.table() function on the table created above. And then, we could round that to get percentages.

#Save the table as an object
table1 <- table(df3$Species)

#Get proportions
prop.table(table1)

      Adelie Penguin (Pygoscelis adeliae) 
                                0.4418605 
Chinstrap penguin (Pygoscelis antarctica) 
                                0.1976744 
        Gentoo penguin (Pygoscelis papua) 
                                0.3604651 
#We could also just nest the functions
prop.table(table(df3$Species))

      Adelie Penguin (Pygoscelis adeliae) 
                                0.4418605 
Chinstrap penguin (Pygoscelis antarctica) 
                                0.1976744 
        Gentoo penguin (Pygoscelis papua) 
                                0.3604651 
#Turn it into percents
round(100*prop.table(table1))

      Adelie Penguin (Pygoscelis adeliae) 
                                       44 
Chinstrap penguin (Pygoscelis antarctica) 
                                       20 
        Gentoo penguin (Pygoscelis papua) 
                                       36 

11.3 Crosstabs

The table() function also allows you to get cross-tabulations between multiple variables. Our penguin dataset also has an island variable - let’s get the cross-tabulation of species and island:

#Crosstabs between species and region
table(df3$Species, df3$Island)
                                           
                                            Biscoe Dream Torgersen
  Adelie Penguin (Pygoscelis adeliae)           44    56        52
  Chinstrap penguin (Pygoscelis antarctica)      0    68         0
  Gentoo penguin (Pygoscelis papua)            124     0         0

We also might just want the number of cases of Adelie penguins:

table(df3$Species == "Adelie Penguin (Pygoscelis adeliae)")

FALSE  TRUE 
  192   152 

Running the code above tells us that there are 152 (under the TRUE column) that are Adelie penguins, and 192 that are not.

The CrossTable() function from the gmodels package performs much the same operation as the table() function, but displays the results differently:

#Repeat the above with CrossTable
#Install the package if we don't have it:
#install.packages("gmodels")

#Then call the package
library(gmodels)

#Get a crosstab between species and region
CrossTable(df3$Species, df3$Island)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  344 

 
                                          | df3$Island 
                              df3$Species |    Biscoe |     Dream | Torgersen | Row Total | 
------------------------------------------|-----------|-----------|-----------|-----------|
      Adelie Penguin (Pygoscelis adeliae) |        44 |        56 |        52 |       152 | 
                                          |    12.313 |     0.027 |    36.661 |           | 
                                          |     0.289 |     0.368 |     0.342 |     0.442 | 
                                          |     0.262 |     0.452 |     1.000 |           | 
                                          |     0.128 |     0.163 |     0.151 |           | 
------------------------------------------|-----------|-----------|-----------|-----------|
Chinstrap penguin (Pygoscelis antarctica) |         0 |        68 |         0 |        68 | 
                                          |    33.209 |    77.157 |    10.279 |           | 
                                          |     0.000 |     1.000 |     0.000 |     0.198 | 
                                          |     0.000 |     0.548 |     0.000 |           | 
                                          |     0.000 |     0.198 |     0.000 |           | 
------------------------------------------|-----------|-----------|-----------|-----------|
        Gentoo penguin (Pygoscelis papua) |       124 |         0 |         0 |       124 | 
                                          |    66.463 |    44.698 |    18.744 |           | 
                                          |     1.000 |     0.000 |     0.000 |     0.360 | 
                                          |     0.738 |     0.000 |     0.000 |           | 
                                          |     0.360 |     0.000 |     0.000 |           | 
------------------------------------------|-----------|-----------|-----------|-----------|
                             Column Total |       168 |       124 |        52 |       344 | 
                                          |     0.488 |     0.360 |     0.151 |           | 
------------------------------------------|-----------|-----------|-----------|-----------|

 

11.4 Basic Statistical Operations

We can get basic descriptive statistics using a variety of functions, conveniently named for what they do. We’ll look at flipper length here, across all penguins. We can also add in na.rm = TRUE to allow calculation even with missing data. You’ll notice that since the variable name has a parenthetical, we are enclosing it in back-tick marks: `. This is the button next to the number 1 on your keyboard, and is only necessary due to the “(mm)” in the variable name. With other variables (e.g. “Species”), it is not necessary.

#Calculate a mean
mean(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 200.9152
#Calculate the standard deviation
sd(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 14.06171
#Calculate the variance
var(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 197.7318
#Calculate the median
median(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 197
#Calculate the minimum value
min(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 172
#Calculate the maximum value
max(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 231
#Calculate the range
range(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 172 231

11.4.1 Summary

We also might want those values, but don’t want to call them each individually. The summary() function will give us similar information, and can determine what values to give based on variable type.

#Continuous variable
summary(df3$`Flipper Length (mm)`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  172.0   190.0   197.0   200.9   213.0   231.0       2 
#And a categorical one
#First, tell R we'd like species to be a factor
df3$Species <- as.factor(df3$Species)

#Then run summary
summary(df3$Species)
      Adelie Penguin (Pygoscelis adeliae) 
                                      152 
Chinstrap penguin (Pygoscelis antarctica) 
                                       68 
        Gentoo penguin (Pygoscelis papua) 
                                      124 

11.4.2 CreateTableOne

Another useful function is CreateTableOne() from the tableone package. It also knows what information to provide based on variable type (though you may need to inform R that you’d like categorical variables treated as factors).

#Install the package if necessary
#install.packages("tableone")

#Call the package
library(tableone)
Warning: package 'tableone' was built under R version 4.3.3
#Get our variable names
names(df3)
 [1] "studyName"           "Sample Number"       "Species"            
 [4] "Region"              "Island"              "Stage"              
 [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
[10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
[13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
[16] "Delta 13 C (o/oo)"   "Comments"           
#Create a table
t <- CreateTableOne(data = df3,
                    vars = c("Species", "Flipper Length (mm)", "Body Mass (g)"))

#Print the table
t
                                              
                                               Overall         
  n                                                344         
  Species (%)                                                  
     Adelie Penguin (Pygoscelis adeliae)           152 (44.2)  
     Chinstrap penguin (Pygoscelis antarctica)      68 (19.8)  
     Gentoo penguin (Pygoscelis papua)             124 (36.0)  
  Flipper Length (mm) (mean (SD))               200.92 (14.06) 
  Body Mass (g) (mean (SD))                    4201.75 (801.95)
#We can also show the levels
print(t, showAllLevels = TRUE)
                                 
                                  level                                    
  n                                                                        
  Species (%)                     Adelie Penguin (Pygoscelis adeliae)      
                                  Chinstrap penguin (Pygoscelis antarctica)
                                  Gentoo penguin (Pygoscelis papua)        
  Flipper Length (mm) (mean (SD))                                          
  Body Mass (g) (mean (SD))                                                
                                 
                                  Overall         
  n                                   344         
  Species (%)                         152 (44.2)  
                                       68 (19.8)  
                                      124 (36.0)  
  Flipper Length (mm) (mean (SD))  200.92 (14.06) 
  Body Mass (g) (mean (SD))       4201.75 (801.95)

11.4.3 Correlations

There are a few different approaches to finding a correlation. The cor() function will provide a correlation, but no significance test. cor.test() will give both the correlation and the significance test.

#First, we need to drop missing values, since we can't within the funtion
df3a <- na.omit(df3)

#Calculate a correlation
cor(df3a$`Flipper Length (mm)`, df3a$`Body Mass (g)`)
[1] 0.7624367
#Correlation plus significance test
cor.test(df3a$`Flipper Length (mm)`, df3a$`Body Mass (g)`)

    Pearson's product-moment correlation

data:  df3a$`Flipper Length (mm)` and df3a$`Body Mass (g)`
t = 6.6655, df = 32, p-value = 1.604e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5716627 0.8750009
sample estimates:
      cor 
0.7624367 

If we wanted to obtain correlations, but only for Adelie penguins, we can subset our data:

#Make a new df with just Adelie penguins
sub_df3 <- df3a[df3a$Species == "Adelie Penguin (Pygoscelis adeliae)", ]

#Run the correlation
cor.test(sub_df3$`Flipper Length (mm)`, sub_df3$`Body Mass (g)`)

    Pearson's product-moment correlation

data:  sub_df3$`Flipper Length (mm)` and sub_df3$`Body Mass (g)`
t = 1.6599, df = 11, p-value = 0.1252
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1372942  0.8010097
sample estimates:
      cor 
0.4475467 

11.5 By Function

We can also use the by() function to get diferent descriptive statistics by levels of a categorical variable. Below, we are getting the mean of Body Mass by Species. We could replace mean with cor, or even summary.

#Get mean of body mass by species
by(df3a$`Body Mass (g)`, df3a$Species, mean)
df3a$Species: Adelie Penguin (Pygoscelis adeliae)
[1] 3640.385
------------------------------------------------------------ 
df3a$Species: Chinstrap penguin (Pygoscelis antarctica)
[1] 3580.357
------------------------------------------------------------ 
df3a$Species: Gentoo penguin (Pygoscelis papua)
[1] 4910.714