#|echo: FALSE
<- women
df
<- mtcars
df2
library(palmerpenguins)
Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':
penguins, penguins_raw
<- palmerpenguins::penguins_raw df3
#|echo: FALSE
<- women
df
<- mtcars
df2
library(palmerpenguins)
Attaching package: 'palmerpenguins'
The following objects are masked from 'package:datasets':
penguins, penguins_raw
<- palmerpenguins::penguins_raw df3
There are a number of different functions we can use to gather descriptive statistics; some are included in base R and some are associated with other packages.
Function | Package? | Action |
---|---|---|
table() | No | Obtains frequencies for categorical variables; can also generate cross-tabulations between multiple categorical variables |
CrossTable() | gmodels | Gives cross-tabulations between multiple categorical variables |
mean(), sd(), median(), min(), and max() | No | Performs advertised function |
summary() | No | Gives descriptive statistics for continuous variables or frequencies for categorical variables. Can be used on a single variable or entire dataframe. |
describe() | psych | Gives descriptive statistics for variables - most appropriate with continuous variables. Can be used on a single variable or entire dataframe. |
describeBy() | psych | Similar to describe(), but allows to give descriptive statistics for different levels of a categorical variable. |
CreateTableOne() | tableone | Gives frequencies/percents for categorical variables and means/sds for continuous variables. |
cor(), cor.test() | No | Calculates the correlation between two continuous variables; cor.test() will give the significance test. |
corr.test() | psych | Gives correlations and significance tests. |
There is a package called psych
that contains many useful functions, one of which is describe()
. Similar to summary()
, it provides a number of different values. However, it provides many more than summary: item name, item number, number of valid cases (n), mean, standard deviation, trimmed mean, median, median absolute deviation (mad), minimum, maximum, skew, kurtosis, and standard error.
#Call the package
library(psych)
#Get descriptive information on a dataframe
describe(df)
vars n mean sd median trimmed mad min max range skew kurtosis
height 1 15 65.00 4.47 65 65.00 5.93 58 72 14 0.00 -1.44
weight 2 15 136.73 15.50 135 136.31 17.79 115 164 49 0.23 -1.34
se
height 1.15
weight 4.00
Also in the psych
package is the function describeBy()
which allows you to get descriptive statistics by different levels of a categorical variable.
#Using a different dataset: mtcars
#Asking for descriptive statistics for mpg and wt columns only
describeBy(df2[, c("mpg", "wt")], group = df2$cyl)
Descriptive statistics by group
group: 4
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 11 26.66 4.51 26.0 26.44 6.52 21.40 33.90 12.50 0.26 -1.65 1.36
wt 2 11 2.29 0.57 2.2 2.27 0.54 1.51 3.19 1.68 0.30 -1.36 0.17
------------------------------------------------------------
group: 6
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 7 19.74 1.45 19.70 19.74 1.93 17.80 21.40 3.60 -0.16 -1.91 0.55
wt 2 7 3.12 0.36 3.21 3.12 0.36 2.62 3.46 0.84 -0.22 -1.98 0.13
------------------------------------------------------------
group: 8
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 14 15.1 2.56 15.20 15.15 1.56 10.40 19.20 8.80 -0.36 -0.57 0.68
wt 2 14 4.0 0.76 3.76 3.95 0.41 3.17 5.42 2.25 0.99 -0.71 0.20
To get frequencies for a single categorical variable, we can use the table()
function. As an example, we will use the Palmer Penguins dataset, and ask for a table by species.
#Get frequency of penguin species
table(df3$Species)
Adelie Penguin (Pygoscelis adeliae)
152
Chinstrap penguin (Pygoscelis antarctica)
68
Gentoo penguin (Pygoscelis papua)
124
If we wanted proportions instead, we could use the prop.table()
function on the table created above. And then, we could round that to get percentages.
#Save the table as an object
<- table(df3$Species)
table1
#Get proportions
prop.table(table1)
Adelie Penguin (Pygoscelis adeliae)
0.4418605
Chinstrap penguin (Pygoscelis antarctica)
0.1976744
Gentoo penguin (Pygoscelis papua)
0.3604651
#We could also just nest the functions
prop.table(table(df3$Species))
Adelie Penguin (Pygoscelis adeliae)
0.4418605
Chinstrap penguin (Pygoscelis antarctica)
0.1976744
Gentoo penguin (Pygoscelis papua)
0.3604651
#Turn it into percents
round(100*prop.table(table1))
Adelie Penguin (Pygoscelis adeliae)
44
Chinstrap penguin (Pygoscelis antarctica)
20
Gentoo penguin (Pygoscelis papua)
36
The table()
function also allows you to get cross-tabulations between multiple variables. Our penguin dataset also has an island variable - let’s get the cross-tabulation of species and island:
#Crosstabs between species and region
table(df3$Species, df3$Island)
Biscoe Dream Torgersen
Adelie Penguin (Pygoscelis adeliae) 44 56 52
Chinstrap penguin (Pygoscelis antarctica) 0 68 0
Gentoo penguin (Pygoscelis papua) 124 0 0
We also might just want the number of cases of Adelie penguins:
table(df3$Species == "Adelie Penguin (Pygoscelis adeliae)")
FALSE TRUE
192 152
Running the code above tells us that there are 152 (under the TRUE column) that are Adelie penguins, and 192 that are not.
Video (repost from previous chapter): Tables
As before, we can also determine the proportion and percent of the total dataset for a crosstabs table.
#Crosstabs between species and region
table(df3$Species, df3$Island)
Biscoe Dream Torgersen
Adelie Penguin (Pygoscelis adeliae) 44 56 52
Chinstrap penguin (Pygoscelis antarctica) 0 68 0
Gentoo penguin (Pygoscelis papua) 124 0 0
#Proportion between species and region
prop.table(table(df3$Species, df3$Island))
Biscoe Dream Torgersen
Adelie Penguin (Pygoscelis adeliae) 0.1279070 0.1627907 0.1511628
Chinstrap penguin (Pygoscelis antarctica) 0.0000000 0.1976744 0.0000000
Gentoo penguin (Pygoscelis papua) 0.3604651 0.0000000 0.0000000
#Percent between species and region
prop.table(table(df3$Species, df3$Island)) * 100
Biscoe Dream Torgersen
Adelie Penguin (Pygoscelis adeliae) 12.79070 16.27907 15.11628
Chinstrap penguin (Pygoscelis antarctica) 0.00000 19.76744 0.00000
Gentoo penguin (Pygoscelis papua) 36.04651 0.00000 0.00000
The CrossTable()
function from the gmodels
package performs much the same operation as the table()
function, but displays the results differently:
#Repeat the above with CrossTable
#Install the package if we don't have it:
#install.packages("gmodels")
#Then call the package
library(gmodels)
#Get a crosstab between species and region
CrossTable(df3$Species, df3$Island)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 344
| df3$Island
df3$Species | Biscoe | Dream | Torgersen | Row Total |
------------------------------------------|-----------|-----------|-----------|-----------|
Adelie Penguin (Pygoscelis adeliae) | 44 | 56 | 52 | 152 |
| 12.313 | 0.027 | 36.661 | |
| 0.289 | 0.368 | 0.342 | 0.442 |
| 0.262 | 0.452 | 1.000 | |
| 0.128 | 0.163 | 0.151 | |
------------------------------------------|-----------|-----------|-----------|-----------|
Chinstrap penguin (Pygoscelis antarctica) | 0 | 68 | 0 | 68 |
| 33.209 | 77.157 | 10.279 | |
| 0.000 | 1.000 | 0.000 | 0.198 |
| 0.000 | 0.548 | 0.000 | |
| 0.000 | 0.198 | 0.000 | |
------------------------------------------|-----------|-----------|-----------|-----------|
Gentoo penguin (Pygoscelis papua) | 124 | 0 | 0 | 124 |
| 66.463 | 44.698 | 18.744 | |
| 1.000 | 0.000 | 0.000 | 0.360 |
| 0.738 | 0.000 | 0.000 | |
| 0.360 | 0.000 | 0.000 | |
------------------------------------------|-----------|-----------|-----------|-----------|
Column Total | 168 | 124 | 52 | 344 |
| 0.488 | 0.360 | 0.151 | |
------------------------------------------|-----------|-----------|-----------|-----------|
Another useful function is CreateTableOne()
from the tableone
package. It creates….a Table 1! It also knows what information to provide based on variable type (though you may need to inform R that you’d like categorical variables treated as factors).
#Install the package if necessary
#install.packages("tableone")
#Call the package
library(tableone)
#Get our variable names
names(df3)
[1] "studyName" "Sample Number" "Species"
[4] "Region" "Island" "Stage"
[7] "Individual ID" "Clutch Completion" "Date Egg"
[10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
[13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
[16] "Delta 13 C (o/oo)" "Comments"
#Create a table
<- CreateTableOne(data = df3,
t vars = c("Species", "Flipper Length (mm)", "Body Mass (g)"))
#Print the table
t
Overall
n 344
Species (%)
Adelie Penguin (Pygoscelis adeliae) 152 (44.2)
Chinstrap penguin (Pygoscelis antarctica) 68 (19.8)
Gentoo penguin (Pygoscelis papua) 124 (36.0)
Flipper Length (mm) (mean (SD)) 200.92 (14.06)
Body Mass (g) (mean (SD)) 4201.75 (801.95)
#We can also show the levels
print(t, showAllLevels = TRUE)
level
n
Species (%) Adelie Penguin (Pygoscelis adeliae)
Chinstrap penguin (Pygoscelis antarctica)
Gentoo penguin (Pygoscelis papua)
Flipper Length (mm) (mean (SD))
Body Mass (g) (mean (SD))
Overall
n 344
Species (%) 152 (44.2)
68 (19.8)
124 (36.0)
Flipper Length (mm) (mean (SD)) 200.92 (14.06)
Body Mass (g) (mean (SD)) 4201.75 (801.95)
There are a few different approaches to finding a correlation. The cor()
function will provide a correlation, but no significance test. cor.test()
will give both the correlation and the significance test.
#First, we need to drop missing values, since we can't within the funtion
<- na.omit(df3)
df3a
#Calculate a correlation
cor(df3a$`Flipper Length (mm)`, df3a$`Body Mass (g)`)
[1] 0.7624367
#Correlation plus significance test
cor.test(df3a$`Flipper Length (mm)`, df3a$`Body Mass (g)`)
Pearson's product-moment correlation
data: df3a$`Flipper Length (mm)` and df3a$`Body Mass (g)`
t = 6.6655, df = 32, p-value = 1.604e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5716627 0.8750009
sample estimates:
cor
0.7624367
If we wanted to obtain correlations, but only for Adelie penguins, we can subset our data:
#Make a new df with just Adelie penguins
<- df3a[df3a$Species == "Adelie Penguin (Pygoscelis adeliae)", ]
sub_df3
#Run the correlation
cor.test(sub_df3$`Flipper Length (mm)`, sub_df3$`Body Mass (g)`)
Pearson's product-moment correlation
data: sub_df3$`Flipper Length (mm)` and sub_df3$`Body Mass (g)`
t = 1.6599, df = 11, p-value = 0.1252
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1372942 0.8010097
sample estimates:
cor
0.4475467
We can also use the by()
function to get different descriptive statistics by levels of a categorical variable. Below, we are getting the mean of Body Mass by Species. We could replace mean with cor, or even summary.
#Get mean of body mass by species
by(df3a$`Body Mass (g)`, df3a$Species, mean)
df3a$Species: Adelie Penguin (Pygoscelis adeliae)
[1] 3640.385
------------------------------------------------------------
df3a$Species: Chinstrap penguin (Pygoscelis antarctica)
[1] 3580.357
------------------------------------------------------------
df3a$Species: Gentoo penguin (Pygoscelis papua)
[1] 4910.714
The t.test()
function is used to run all t-tests: a one-sample t-test, repeated measures t-test, and independent samples t-test.
The major assumption in a one-sample t-test is that our data are normally distrubuted. Graphical methods will be addressed in a later chapter. You can also use the Shapiro-Wilk test of normality via the shapiro.test()
function. If it is not significant (i.e., p > 0.05), then it indicates that the data do not have a distribution that is significantly different than a normal distrbution.
#Shapiro-Wilk test
shapiro.test(trees$Height)
To run a one-sample t-test, you need to indicate where the data are, and what value you are comparing to (mu
). If you only include those two things, you will be running a two-tailed test, determining if the mean of the data is different than the comparison value.
#Run a one sample t test
t.test(trees$Height, #Column in a data frame (or a vector)
mu = 72) #Comparison value
However, if your hypothesis has a direction (i.e., the mean is greater than/less than the comparison value), then you will run a one-tailed test. You indicate this by adding a third argument of alternative =
, then either “greater” or “less”
#One sample t test, Ha is greater
t.test(trees$Height, #Column in a data frame (or a vector)
mu = 72, #Comparison value
alternative = "greater") #Here, the alternative hypothesis is that the mean is greater than the comparison value
If you wanted an effect size, you can use the effectsize
package and the cohens_d()
function:
#Call package
library(effectsize)
#Get Cohen's d; comparing to a value of 72
cohens_d(trees$Height,
mu = 72)
With a repeated measures t-test, you assign one time point to x
and another to y
, as seen below. Important to note is that R will do x - y. You should interpret any results with that in mind.
#Run a repeated measures t-test
t.test(x = mice$after,
y = mice$before,
paired = TRUE) #Indicates repeated measures
Recall that the assumptions for an independent samples t-test are that your variables are normally distributed for each group, no outliers, and that you have homogeneity of variance. Since normality is most often analyzed graphically, we will address that in the graphing chapter. Homogeneity of variance is tested either with the leveneTest()
function, whether you are using the Levene’s Test or the Brown-Forsyth Test. This function is within the car
package, which you will need to call first.
In the code below, you can see how to obtain both. I assign each to an object, which I can then print. You can also just run leveneTest()
without assigning it to an object, and have it print straight to your console.
#Call the car package
library(car)
#Perform Levene's Test
<- leveneTest(extra ~ group, #What variables you intend to compare
LT data = df, #Where your data is
center = "mean") #This indicates a Levene's test
#Perform Brown-Forsythe test
<- leveneTest(extra ~ group, #What variables you intend to compare
BFT data = df, #Where your data is
center = "median") #This indicates a Brown-Forsythe test
#Print both of them
print(LT)
print(BFT)
Based on the results of your assumptions, you can now run a t-test using the t.test()
function as seen below:
#Run an independent samples t-test
t.test(outcome ~ group, #Variables you are using
data = df, #Where your data is
var.equal = TRUE) #If variances are equal or not