12 More statistics in R

#|echo: FALSE
df <- women

df2 <- mtcars

library(palmerpenguins)

Warning: package 'palmerpenguins' was built under R version 4.2.3

df3 <- palmerpenguins::penguins_raw

12.1 Descriptives

There are a number of different functions we can use to gather descriptive statistics; some are included in base R and some are associated with other packages.

Function	Package?	Action
table()	No	Obtains frequencies for categorical variables; can also generate cross-tabulations between multiple categorical variables
CrossTable()	gmodels	Gives cross-tabulations between multiple categorical variables
mean(), sd(), median(), min(), and max()	No	Performs advertised function
summary()	No	Gives descriptive statistics for continuous variables or frequencies for categorical variables. Can be used on a single variable or entire dataframe.
describe()	psych	Gives descriptive statistics for variables - most appropriate with continuous variables. Can be used on a single variable or entire dataframe.
describeBy()	psych	Similar to describe(), but allows to give descriptive statistics for different levels of a categorical variable.
CreateTableOne()	tableone	Gives frequencies/percents for categorical variables and means/sds for continuous variables.
cor(), cor.test()	No	Calculates the correlation between two continuous variables; cor.test() will give the significance test.
corr.test()	psych	Gives correlations and significance tests.

12.1.1 psych package

There is a package called psych that contains many useful functions, one of which is describe(). Similar to summary(), it provides a number of different values. However, it provides many more than summary: item name, item number, number of valid cases (n), mean, standard deviation, trimmed mean, median, median absolute deviation (mad), minimum, maximum, skew, kurtosis, and standard error.

#Call the package
library(psych)

Warning: package 'psych' was built under R version 4.2.3

#Get descriptive information on a dataframe
describe(df)

       vars  n   mean    sd median trimmed   mad min max range skew kurtosis
height    1 15  65.00  4.47     65   65.00  5.93  58  72    14 0.00    -1.44
weight    2 15 136.73 15.50    135  136.31 17.79 115 164    49 0.23    -1.34
         se
height 1.15
weight 4.00

Also in the psych package is the function describeBy() which allows you to get descriptive statistics by different levels of a categorical variable.

#Using a different dataset: mtcars
#Asking for descriptive statistics for mpg and wt columns only
describeBy(df2[, c("mpg", "wt")], group = df2$cyl)


 Descriptive statistics by group 
group: 4
    vars  n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
mpg    1 11 26.66 4.51   26.0   26.44 6.52 21.40 33.90 12.50 0.26    -1.65 1.36
wt     2 11  2.29 0.57    2.2    2.27 0.54  1.51  3.19  1.68 0.30    -1.36 0.17
------------------------------------------------------------ 
group: 6
    vars n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
mpg    1 7 19.74 1.45  19.70   19.74 1.93 17.80 21.40  3.60 -0.16    -1.91 0.55
wt     2 7  3.12 0.36   3.21    3.12 0.36  2.62  3.46  0.84 -0.22    -1.98 0.13
------------------------------------------------------------ 
group: 8
    vars  n mean   sd median trimmed  mad   min   max range  skew kurtosis   se
mpg    1 14 15.1 2.56  15.20   15.15 1.56 10.40 19.20  8.80 -0.36    -0.57 0.68
wt     2 14  4.0 0.76   3.76    3.95 0.41  3.17  5.42  2.25  0.99    -0.71 0.20

12.2 Frequencies

To get frequencies for a single categorical variable, we can use the table() function. As an example, we will use the Palmer Penguins dataset, and ask for a table by species.

#Get frequency of penguin species
table(df3$Species)


      Adelie Penguin (Pygoscelis adeliae) 
                                      152 
Chinstrap penguin (Pygoscelis antarctica) 
                                       68 
        Gentoo penguin (Pygoscelis papua) 
                                      124

If we wanted proportions instead, we could use the prop.table() function on the table created above. And then, we could round that to get percentages.

#Save the table as an object
table1 <- table(df3$Species)

#Get proportions
prop.table(table1)


      Adelie Penguin (Pygoscelis adeliae) 
                                0.4418605 
Chinstrap penguin (Pygoscelis antarctica) 
                                0.1976744 
        Gentoo penguin (Pygoscelis papua) 
                                0.3604651

#We could also just nest the functions
prop.table(table(df3$Species))


      Adelie Penguin (Pygoscelis adeliae) 
                                0.4418605 
Chinstrap penguin (Pygoscelis antarctica) 
                                0.1976744 
        Gentoo penguin (Pygoscelis papua) 
                                0.3604651

#Turn it into percents
round(100*prop.table(table1))


      Adelie Penguin (Pygoscelis adeliae) 
                                       44 
Chinstrap penguin (Pygoscelis antarctica) 
                                       20 
        Gentoo penguin (Pygoscelis papua) 
                                       36

12.3 Crosstabs

The table() function also allows you to get cross-tabulations between multiple variables. Our penguin dataset also has an island variable - let’s get the cross-tabulation of species and island:

#Crosstabs between species and region
table(df3$Species, df3$Island)

                                           
                                            Biscoe Dream Torgersen
  Adelie Penguin (Pygoscelis adeliae)           44    56        52
  Chinstrap penguin (Pygoscelis antarctica)      0    68         0
  Gentoo penguin (Pygoscelis papua)            124     0         0

We also might just want the number of cases of Adelie penguins:

table(df3$Species == "Adelie Penguin (Pygoscelis adeliae)")


FALSE  TRUE 
  192   152

Running the code above tells us that there are 152 (under the TRUE column) that are Adelie penguins, and 192 that are not.

Video (repost from previous chapter): Tables

12.3.1 Proportion and Percent

As before, we can also determine the proportion and percent of the total dataset for a crosstabs table.

#Crosstabs between species and region
table(df3$Species, df3$Island)

                                           
                                            Biscoe Dream Torgersen
  Adelie Penguin (Pygoscelis adeliae)           44    56        52
  Chinstrap penguin (Pygoscelis antarctica)      0    68         0
  Gentoo penguin (Pygoscelis papua)            124     0         0

#Proportion between species and region
prop.table(table(df3$Species, df3$Island))

                                           
                                               Biscoe     Dream Torgersen
  Adelie Penguin (Pygoscelis adeliae)       0.1279070 0.1627907 0.1511628
  Chinstrap penguin (Pygoscelis antarctica) 0.0000000 0.1976744 0.0000000
  Gentoo penguin (Pygoscelis papua)         0.3604651 0.0000000 0.0000000

#Percent between species and region
prop.table(table(df3$Species, df3$Island)) * 100

                                           
                                              Biscoe    Dream Torgersen
  Adelie Penguin (Pygoscelis adeliae)       12.79070 16.27907  15.11628
  Chinstrap penguin (Pygoscelis antarctica)  0.00000 19.76744   0.00000
  Gentoo penguin (Pygoscelis papua)         36.04651  0.00000   0.00000

The CrossTable() function from the gmodels package performs much the same operation as the table() function, but displays the results differently:

#Repeat the above with CrossTable
#Install the package if we don't have it:
#install.packages("gmodels")

#Then call the package
library(gmodels)

#Get a crosstab between species and region
CrossTable(df3$Species, df3$Island)


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  344 

 
                                          | df3$Island 
                              df3$Species |    Biscoe |     Dream | Torgersen | Row Total | 
------------------------------------------|-----------|-----------|-----------|-----------|
      Adelie Penguin (Pygoscelis adeliae) |        44 |        56 |        52 |       152 | 
                                          |    12.313 |     0.027 |    36.661 |           | 
                                          |     0.289 |     0.368 |     0.342 |     0.442 | 
                                          |     0.262 |     0.452 |     1.000 |           | 
                                          |     0.128 |     0.163 |     0.151 |           | 
------------------------------------------|-----------|-----------|-----------|-----------|
Chinstrap penguin (Pygoscelis antarctica) |         0 |        68 |         0 |        68 | 
                                          |    33.209 |    77.157 |    10.279 |           | 
                                          |     0.000 |     1.000 |     0.000 |     0.198 | 
                                          |     0.000 |     0.548 |     0.000 |           | 
                                          |     0.000 |     0.198 |     0.000 |           | 
------------------------------------------|-----------|-----------|-----------|-----------|
        Gentoo penguin (Pygoscelis papua) |       124 |         0 |         0 |       124 | 
                                          |    66.463 |    44.698 |    18.744 |           | 
                                          |     1.000 |     0.000 |     0.000 |     0.360 | 
                                          |     0.738 |     0.000 |     0.000 |           | 
                                          |     0.360 |     0.000 |     0.000 |           | 
------------------------------------------|-----------|-----------|-----------|-----------|
                             Column Total |       168 |       124 |        52 |       344 | 
                                          |     0.488 |     0.360 |     0.151 |           | 
------------------------------------------|-----------|-----------|-----------|-----------|

12.4 Statistical Operations

12.5 CreateTableOne

Another useful function is CreateTableOne() from the tableone package. It creates….a Table 1! It also knows what information to provide based on variable type (though you may need to inform R that you’d like categorical variables treated as factors).

#Install the package if necessary
#install.packages("tableone")

#Call the package
library(tableone)

Warning: package 'tableone' was built under R version 4.2.3

#Get our variable names
names(df3)

 [1] "studyName"           "Sample Number"       "Species"            
 [4] "Region"              "Island"              "Stage"              
 [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
[10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
[13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
[16] "Delta 13 C (o/oo)"   "Comments"

#Create a table
t <- CreateTableOne(data = df3,
                    vars = c("Species", "Flipper Length (mm)", "Body Mass (g)"))

#Print the table
t

                                              
                                               Overall         
  n                                                344         
  Species (%)                                                  
     Adelie Penguin (Pygoscelis adeliae)           152 (44.2)  
     Chinstrap penguin (Pygoscelis antarctica)      68 (19.8)  
     Gentoo penguin (Pygoscelis papua)             124 (36.0)  
  Flipper Length (mm) (mean (SD))               200.92 (14.06) 
  Body Mass (g) (mean (SD))                    4201.75 (801.95)

#We can also show the levels
print(t, showAllLevels = TRUE)

                                 
                                  level                                    
  n                                                                        
  Species (%)                     Adelie Penguin (Pygoscelis adeliae)      
                                  Chinstrap penguin (Pygoscelis antarctica)
                                  Gentoo penguin (Pygoscelis papua)        
  Flipper Length (mm) (mean (SD))                                          
  Body Mass (g) (mean (SD))                                                
                                 
                                  Overall         
  n                                   344         
  Species (%)                         152 (44.2)  
                                       68 (19.8)  
                                      124 (36.0)  
  Flipper Length (mm) (mean (SD))  200.92 (14.06) 
  Body Mass (g) (mean (SD))       4201.75 (801.95)

12.6 Correlations

There are a few different approaches to finding a correlation. The cor() function will provide a correlation, but no significance test. cor.test() will give both the correlation and the significance test.

#First, we need to drop missing values, since we can't within the funtion
df3a <- na.omit(df3)

#Calculate a correlation
cor(df3a$`Flipper Length (mm)`, df3a$`Body Mass (g)`)

[1] 0.7624367

#Correlation plus significance test
cor.test(df3a$`Flipper Length (mm)`, df3a$`Body Mass (g)`)


    Pearson's product-moment correlation

data:  df3a$`Flipper Length (mm)` and df3a$`Body Mass (g)`
t = 6.6655, df = 32, p-value = 1.604e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5716627 0.8750009
sample estimates:
      cor 
0.7624367

If we wanted to obtain correlations, but only for Adelie penguins, we can subset our data:

#Make a new df with just Adelie penguins
sub_df3 <- df3a[df3a$Species == "Adelie Penguin (Pygoscelis adeliae)", ]

#Run the correlation
cor.test(sub_df3$`Flipper Length (mm)`, sub_df3$`Body Mass (g)`)


    Pearson's product-moment correlation

data:  sub_df3$`Flipper Length (mm)` and sub_df3$`Body Mass (g)`
t = 1.6599, df = 11, p-value = 0.1252
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.1372942  0.8010097
sample estimates:
      cor 
0.4475467

12.7 By Function

We can also use the by() function to get different descriptive statistics by levels of a categorical variable. Below, we are getting the mean of Body Mass by Species. We could replace mean with cor, or even summary.

#Get mean of body mass by species
by(df3a$`Body Mass (g)`, df3a$Species, mean)

df3a$Species: Adelie Penguin (Pygoscelis adeliae)
[1] 3640.385
------------------------------------------------------------ 
df3a$Species: Chinstrap penguin (Pygoscelis antarctica)
[1] 3580.357
------------------------------------------------------------ 
df3a$Species: Gentoo penguin (Pygoscelis papua)
[1] 4910.714

12.8 T-test

The t.test() function is used to run all t-tests: a one-sample t-test, repeated measures t-test, and independent samples t-test.

12.8.1 One-Sample T-test

The major assumption in a one-sample t-test is that our data are normally distrubuted. Graphical methods will be addressed in a later chapter. You can also use the Shapiro-Wilk test of normality via the shapiro.test() function. If it is not significant (i.e., p > 0.05), then it indicates that the data do not have a distribution that is significantly different than a normal distrbution.

#Shapiro-Wilk test
shapiro.test(trees$Height)

To run a one-sample t-test, you need to indicate where the data are, and what value you are comparing to (mu). If you only include those two things, you will be running a two-tailed test, determining if the mean of the data is different than the comparison value.

#Run a one sample t test
t.test(trees$Height, #Column in a data frame (or a vector)
       mu = 72) #Comparison value

However, if your hypothesis has a direction (i.e., the mean is greater than/less than the comparison value), then you will run a one-tailed test. You indicate this by adding a third argument of alternative =, then either “greater” or “less”

#One sample t test, Ha is greater
t.test(trees$Height, #Column in a data frame (or a vector)
       mu = 72, #Comparison value
       alternative = "greater") #Here, the alternative hypothesis is that the mean is greater than the comparison value

If you wanted an effect size, you can use the effectsize package and the cohens_d() function:

#Call package
library(effectsize)

#Get Cohen's d; comparing to a value of 72
cohens_d(trees$Height, 
         mu = 72)

12.8.2 Repeated Measures T-test

With a repeated measures t-test, you assign one time point to x and another to y, as seen below. Important to note is that R will do x - y. You should interpret any results with that in mind.

#Run a repeated measures t-test
t.test(x = mice$after, 
       y = mice$before, 
       paired = TRUE) #Indicates repeated measures

12.8.3 Independent Samples T-test

Recall that the assumptions for an independent samples t-test are that your variables are normally distributed for each group, no outliers, and that you have homogeneity of variance. Since normality is most often analyzed graphically, we will address that in the graphing chapter. Homogeneity of variance is tested either with the leveneTest() function, whether you are using the Levene’s Test or the Brown-Forsyth Test. This function is within the car package, which you will need to call first.

In the code below, you can see how to obtain both. I assign each to an object, which I can then print. You can also just run leveneTest() without assigning it to an object, and have it print straight to your console.

#Call the car package
library(car)

#Perform Levene's Test
LT <- leveneTest(extra ~ group, #What variables you intend to compare 
                 data = df, #Where your data is
                 center = "mean") #This indicates a Levene's test

#Perform Brown-Forsythe test
BFT <- leveneTest(extra ~ group, #What variables you intend to compare 
                  data = df, #Where your data is
                  center = "median") #This indicates a Brown-Forsythe test

#Print both of them
print(LT)
print(BFT)

Based on the results of your assumptions, you can now run a t-test using the t.test() function as seen below:

#Run an independent samples t-test
t.test(outcome ~ group, #Variables you are using
       data = df, #Where your data is
       var.equal = TRUE) #If variances are equal or not