#|echo: FALSE
<- women
df
<- mtcars
df2
library(palmerpenguins)
Warning: package 'palmerpenguins' was built under R version 4.3.3
<- palmerpenguins::penguins_raw df3
#|echo: FALSE
<- women
df
<- mtcars
df2
library(palmerpenguins)
Warning: package 'palmerpenguins' was built under R version 4.3.3
<- palmerpenguins::penguins_raw df3
There are a number of different functions we can use to gather descriptive statistics; some are included in base R and some are associated with other packages.
Function | Package? | Action |
---|---|---|
table() | No | Obtains frequencies for categorical variables; can also generate cross-tabulations between multiple categorical variables |
CrossTable() | gmodels | Gives cross-tabulations between multiple categorical variables |
mean(), sd(), median(), min(), and max() | No | Performs advertised function |
summary() | No | Gives descriptive statistics for continuous variables or frequencies for categorical variables. Can be used on a single variable or entire dataframe. |
describe() | psych | Gives descriptive statistics for variables - most appropriate with continuous variables. Can be used on a single variable or entire dataframe. |
describeBy() | psych | Similar to describe(), but allows to give descriptive statistics for different levels of a categorical variable. |
CreateTableOne() | tableone | Gives frequencies/percents for categorical variables and means/sds for continuous variables. |
cor(), cor.test() | No | Calculates the correlation between two continuous variables; cor.test() will give the significance test. |
corr.test() | psych | Gives correlations and significance tests. |
There is a package called psych
that contains many useful functions, one of which is describe()
. Similar to summary()
, it provides a number of different values. However, it provides many more than summary: item name, item number, number of valid cases (n), mean, standard deviation, trimmed mean, median, median absolute deviation (mad), minimum, maximum, skew, kurtosis, and standard error.
#Call the package
library(psych)
#Get descriptive information on a dataframe
describe(df)
vars n mean sd median trimmed mad min max range skew kurtosis
height 1 15 65.00 4.47 65 65.00 5.93 58 72 14 0.00 -1.44
weight 2 15 136.73 15.50 135 136.31 17.79 115 164 49 0.23 -1.34
se
height 1.15
weight 4.00
Also in the psych
package is the function describeBy()
which allows you to get descriptive statistics by different levels of a categorical variable.
#Using a different dataset: mtcars
#Asking for descriptive statistics for mpg and wt columns only
describeBy(df2[, c("mpg", "wt")], group = df2$cyl)
Descriptive statistics by group
group: 4
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 11 26.66 4.51 26.0 26.44 6.52 21.40 33.90 12.50 0.26 -1.65 1.36
wt 2 11 2.29 0.57 2.2 2.27 0.54 1.51 3.19 1.68 0.30 -1.36 0.17
------------------------------------------------------------
group: 6
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 7 19.74 1.45 19.70 19.74 1.93 17.80 21.40 3.60 -0.16 -1.91 0.55
wt 2 7 3.12 0.36 3.21 3.12 0.36 2.62 3.46 0.84 -0.22 -1.98 0.13
------------------------------------------------------------
group: 8
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 14 15.1 2.56 15.20 15.15 1.56 10.40 19.20 8.80 -0.36 -0.57 0.68
wt 2 14 4.0 0.76 3.76 3.95 0.41 3.17 5.42 2.25 0.99 -0.71 0.20
To get frequencies for a single categorical variable, we can use the table()
function. As an example, we will use the Palmer Penguins dataset, and ask for a table by species.
#Get frequency of penguin species
table(df3$Species)
Adelie Penguin (Pygoscelis adeliae)
152
Chinstrap penguin (Pygoscelis antarctica)
68
Gentoo penguin (Pygoscelis papua)
124
If we wanted proportions instead, we could use the prop.table()
function on the table created above. And then, we could round that to get percentages.
#Save the table as an object
<- table(df3$Species)
table1
#Get proportions
prop.table(table1)
Adelie Penguin (Pygoscelis adeliae)
0.4418605
Chinstrap penguin (Pygoscelis antarctica)
0.1976744
Gentoo penguin (Pygoscelis papua)
0.3604651
#We could also just nest the functions
prop.table(table(df3$Species))
Adelie Penguin (Pygoscelis adeliae)
0.4418605
Chinstrap penguin (Pygoscelis antarctica)
0.1976744
Gentoo penguin (Pygoscelis papua)
0.3604651
#Turn it into percents
round(100*prop.table(table1))
Adelie Penguin (Pygoscelis adeliae)
44
Chinstrap penguin (Pygoscelis antarctica)
20
Gentoo penguin (Pygoscelis papua)
36
The table()
function also allows you to get cross-tabulations between multiple variables. Our penguin dataset also has an island variable - let’s get the cross-tabulation of species and island:
#Crosstabs between species and region
table(df3$Species, df3$Island)
Biscoe Dream Torgersen
Adelie Penguin (Pygoscelis adeliae) 44 56 52
Chinstrap penguin (Pygoscelis antarctica) 0 68 0
Gentoo penguin (Pygoscelis papua) 124 0 0
We also might just want the number of cases of Adelie penguins:
table(df3$Species == "Adelie Penguin (Pygoscelis adeliae)")
FALSE TRUE
192 152
Running the code above tells us that there are 152 (under the TRUE column) that are Adelie penguins, and 192 that are not.
The CrossTable()
function from the gmodels
package performs much the same operation as the table()
function, but displays the results differently:
#Repeat the above with CrossTable
#Install the package if we don't have it:
#install.packages("gmodels")
#Then call the package
library(gmodels)
#Get a crosstab between species and region
CrossTable(df3$Species, df3$Island)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 344
| df3$Island
df3$Species | Biscoe | Dream | Torgersen | Row Total |
------------------------------------------|-----------|-----------|-----------|-----------|
Adelie Penguin (Pygoscelis adeliae) | 44 | 56 | 52 | 152 |
| 12.313 | 0.027 | 36.661 | |
| 0.289 | 0.368 | 0.342 | 0.442 |
| 0.262 | 0.452 | 1.000 | |
| 0.128 | 0.163 | 0.151 | |
------------------------------------------|-----------|-----------|-----------|-----------|
Chinstrap penguin (Pygoscelis antarctica) | 0 | 68 | 0 | 68 |
| 33.209 | 77.157 | 10.279 | |
| 0.000 | 1.000 | 0.000 | 0.198 |
| 0.000 | 0.548 | 0.000 | |
| 0.000 | 0.198 | 0.000 | |
------------------------------------------|-----------|-----------|-----------|-----------|
Gentoo penguin (Pygoscelis papua) | 124 | 0 | 0 | 124 |
| 66.463 | 44.698 | 18.744 | |
| 1.000 | 0.000 | 0.000 | 0.360 |
| 0.738 | 0.000 | 0.000 | |
| 0.360 | 0.000 | 0.000 | |
------------------------------------------|-----------|-----------|-----------|-----------|
Column Total | 168 | 124 | 52 | 344 |
| 0.488 | 0.360 | 0.151 | |
------------------------------------------|-----------|-----------|-----------|-----------|
We can get basic descriptive statistics using a variety of functions, conveniently named for what they do. We’ll look at flipper length here, across all penguins. We can also add in na.rm = TRUE
to allow calculation even with missing data. You’ll notice that since the variable name has a parenthetical, we are enclosing it in back-tick marks: `. This is the button next to the number 1 on your keyboard, and is only necessary due to the “(mm)” in the variable name. With other variables (e.g. “Species”), it is not necessary.
#Calculate a mean
mean(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 200.9152
#Calculate the standard deviation
sd(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 14.06171
#Calculate the variance
var(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 197.7318
#Calculate the median
median(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 197
#Calculate the minimum value
min(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 172
#Calculate the maximum value
max(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 231
#Calculate the range
range(df3$`Flipper Length (mm)`, na.rm = TRUE)
[1] 172 231
We also might want those values, but don’t want to call them each individually. The summary()
function will give us similar information, and can determine what values to give based on variable type.
#Continuous variable
summary(df3$`Flipper Length (mm)`)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
172.0 190.0 197.0 200.9 213.0 231.0 2
#And a categorical one
#First, tell R we'd like species to be a factor
$Species <- as.factor(df3$Species)
df3
#Then run summary
summary(df3$Species)
Adelie Penguin (Pygoscelis adeliae)
152
Chinstrap penguin (Pygoscelis antarctica)
68
Gentoo penguin (Pygoscelis papua)
124
Another useful function is CreateTableOne()
from the tableone
package. It also knows what information to provide based on variable type (though you may need to inform R that you’d like categorical variables treated as factors).
#Install the package if necessary
#install.packages("tableone")
#Call the package
library(tableone)
Warning: package 'tableone' was built under R version 4.3.3
#Get our variable names
names(df3)
[1] "studyName" "Sample Number" "Species"
[4] "Region" "Island" "Stage"
[7] "Individual ID" "Clutch Completion" "Date Egg"
[10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
[13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
[16] "Delta 13 C (o/oo)" "Comments"
#Create a table
<- CreateTableOne(data = df3,
t vars = c("Species", "Flipper Length (mm)", "Body Mass (g)"))
#Print the table
t
Overall
n 344
Species (%)
Adelie Penguin (Pygoscelis adeliae) 152 (44.2)
Chinstrap penguin (Pygoscelis antarctica) 68 (19.8)
Gentoo penguin (Pygoscelis papua) 124 (36.0)
Flipper Length (mm) (mean (SD)) 200.92 (14.06)
Body Mass (g) (mean (SD)) 4201.75 (801.95)
#We can also show the levels
print(t, showAllLevels = TRUE)
level
n
Species (%) Adelie Penguin (Pygoscelis adeliae)
Chinstrap penguin (Pygoscelis antarctica)
Gentoo penguin (Pygoscelis papua)
Flipper Length (mm) (mean (SD))
Body Mass (g) (mean (SD))
Overall
n 344
Species (%) 152 (44.2)
68 (19.8)
124 (36.0)
Flipper Length (mm) (mean (SD)) 200.92 (14.06)
Body Mass (g) (mean (SD)) 4201.75 (801.95)
There are a few different approaches to finding a correlation. The cor()
function will provide a correlation, but no significance test. cor.test()
will give both the correlation and the significance test.
#First, we need to drop missing values, since we can't within the funtion
<- na.omit(df3)
df3a
#Calculate a correlation
cor(df3a$`Flipper Length (mm)`, df3a$`Body Mass (g)`)
[1] 0.7624367
#Correlation plus significance test
cor.test(df3a$`Flipper Length (mm)`, df3a$`Body Mass (g)`)
Pearson's product-moment correlation
data: df3a$`Flipper Length (mm)` and df3a$`Body Mass (g)`
t = 6.6655, df = 32, p-value = 1.604e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5716627 0.8750009
sample estimates:
cor
0.7624367
If we wanted to obtain correlations, but only for Adelie penguins, we can subset our data:
#Make a new df with just Adelie penguins
<- df3a[df3a$Species == "Adelie Penguin (Pygoscelis adeliae)", ]
sub_df3
#Run the correlation
cor.test(sub_df3$`Flipper Length (mm)`, sub_df3$`Body Mass (g)`)
Pearson's product-moment correlation
data: sub_df3$`Flipper Length (mm)` and sub_df3$`Body Mass (g)`
t = 1.6599, df = 11, p-value = 0.1252
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1372942 0.8010097
sample estimates:
cor
0.4475467
We can also use the by()
function to get diferent descriptive statistics by levels of a categorical variable. Below, we are getting the mean of Body Mass by Species. We could replace mean with cor, or even summary.
#Get mean of body mass by species
by(df3a$`Body Mass (g)`, df3a$Species, mean)
df3a$Species: Adelie Penguin (Pygoscelis adeliae)
[1] 3640.385
------------------------------------------------------------
df3a$Species: Chinstrap penguin (Pygoscelis antarctica)
[1] 3580.357
------------------------------------------------------------
df3a$Species: Gentoo penguin (Pygoscelis papua)
[1] 4910.714