6 Examining your data
This chapter details some of the first steps to take after reading in your data. Oftentimes, you will not have nice, pristine data that neatly fits all the statistical assumptions you need it to. Instead, you may have missing values, out-of-range values, or data that is not normally distributed. Additionally, you may want to see what your data looks like: What is the mean? How many observations are there? Are two variables correlated? By the end of this chapter, you should feel more confident performing these tasks and be better acquainted with your data.
6.1 Dataset Commands
Each time you read in data, or re-read it, it opens in a new dataset window. It is good practice (and expected on assignments!) to name, activate, and close the datasets you have been using within your syntax file.
If you just read in a dataset, notice how the data editor window says something like ‘Untitled2[DataSet1]’ in the upper left-hand corner. “DataSet1” is not terribly informative for us. Also, if you have read in multiple datasets, any commands you run will be performed on the most recently read one - this can cause mistakes or confusing errors! It is a good habit to get into to NAME and ACTIVATE your datasets prior to performing actions on them. In the Syntax window, you can also see which dataset is active by looking next to the magnifying glass on the toolbar. It will say “Active DataSet:” and have a drop-down menu listing all the datasets available at that point in time.
To name your dataset, use the DATASET NAME
command: DATASET NAME insurance.
Notice how we do not need quotation marks or an EXECUTE. We do still need the ‘.’ at the end, though!
To activate the dataset, you use a similar pattern: DATASET ACTIVATE insurance.
When you are done with your dataset(s), you can close them: DATASET CLOSE insurance.
NOTE: Closing the dataset will not save any changes you have made. If you want your changes saved, use a SAVE OUTFILE command to save them to a .sav file.
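Putting these commands together, a full session might look like the sketch below (the file paths and dataset name here are hypothetical):

GET FILE = ‘C:\data\insurance.sav’.
DATASET NAME insurance.
DATASET ACTIVATE insurance.
SAVE OUTFILE = ‘C:\data\insurance_clean.sav’.
DATASET CLOSE insurance.

Any cleaning steps you perform would go between the ACTIVATE and the SAVE, so your saved file reflects your changes while the original file stays untouched.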
6.2 Variable and Value Labels
6.2.1 Variable Labels
While less evident in your syntax, you can add variable labels that explain shortened variable names in the Variable View window. Since SPSS only allows variable names up to xxx characters, you may need to use shorter names than would make sense to a human. For example, if you had a survey with a variable for each response, a variable named “q1”, “R1”, or “year” might fit SPSS’s requirements, but wouldn’t mean much to a human reading it. Rather than remembering what question 1 is, you can add a variable label. Looking in Variable View, you could then see that “q1” is “How many years have you been in the program?” (as an example).
The VARIABLE LABELS command is used, giving the variable name first followed by the label. Keep in mind that while variable names don’t need to be in quotes, the label, which is a string, does need to be in quotes:
VARIABLE LABELS year "How many years have you been in the program?".
EXECUTE.
You can combine multiple variables and labels in one VARIABLE LABELS command:
VARIABLE LABELS
program "What program are you in?"
year "How many years have you been in the program?"
gender "What is your gender?".
EXECUTE.
6.2.2 Value Labels
Another common data situation is representing categories with numbers rather than strings. Male = 1, Female = 2 might be an example, or 1, 2, 3, and 4 representing “Strongly Agree”, “Agree”, “Disagree”, and “Strongly Disagree”. NOTE: Value labels cannot exceed 60 characters in length.
This is accomplished with the ADD VALUE LABELS
command:
ADD VALUE LABELS gender 1 "Male" 2 "Female".
You can also add labels to multiple variables under one ADD VALUE LABELS command by adding a / between each variable:
ADD VALUE LABELS
gender 1 "Male" 2 "Female"/
independent 1 "Yes" 2 "No"/
social 1 "Strongly Agree" 4 "Strongly Disagree".
EXECUTE.
You might also encounter instances where there are short string abbreviations rather than numbers:
ADD VALUE LABELS
gender "M" "Male" "F" "Female" "NB" "Nonbinary" "NA" "Prefer not to answer".
EXECUTE.
Notice how, since the abbreviations are strings (M, F, NB, NA), they are also in quotes. Also, capitalization matters! ‘M’ is NOT the same thing as ‘m’.
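If you want to double-check that your labels were applied, the DISPLAY DICTIONARY command prints each variable’s name, variable label, and value labels to the output window:

DISPLAY DICTIONARY.

(You can also spot-check them in Variable View.)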
6.3 Sorting Data
Sorting data in SPSS is much less scary than sorting data in Excel - SPSS knows to keep your rows intact! When sorting your data, you may specify which variable (or variables) you wish to sort on, as well as if you want ascending or descending order.
Command | Options | Function |
---|---|---|
SORT CASES BY | (A) and (D) | Tells which variables to sort by and in what order; these can be string or numeric |
If we had a variable age (a numeric variable) that we wished to sort in descending order, we would do:
SORT CASES BY age (D).
Notice that while the command must end in a period, it does not require an EXECUTE. statement at the end.
We can also sort by more than one variable. If we first wanted to sort on region (a string variable), then by age, we would do:
SORT CASES BY region (D) age (A).
This would give us data first sorted by descending alphabetical region, and then within each region, sorted by ascending age.
6.4 Deleting and Renaming variables
To delete a variable from the dataset, you would use the DELETE VARIABLES
command. If we decided that we did not want age as a variable, we could delete it as follows:
DELETE VARIABLES age.
EXECUTE.
We could list more than one variable here, if we had more than one we wished to delete.
If, instead, we wanted to rename our region variable to US_region, we would use the RENAME VARIABLES
command:
RENAME VARIABLES region = US_region.
EXECUTE.
Notice how we don’t need variable names in quotes, and we start with the name we have (region) and end with the name we want (US_region). We can string multiple variables in a row to ‘batch’ rename:
RENAME VARIABLES region = US_region med = dep_med status = vacc_status.
EXECUTE.
6.5 Screening Data
6.5.1 Overview
The commands we have been using thus far do not produce output - they are actions we perform on our data. We have been reading in data, saving it out, adding value and variable labels, sorting data, etc. These are all useful actions to know, but they are mostly ‘silent’ in that no output is created. However, once we have our data read in, one of the first things we want to do is screen it: Are the values for each variable within the possible range for that variable? That is followed by getting to know our data better: Is there a lot of missing data? Are there outliers? What do the descriptive statistics look like? We will address each of those questions, and more, throughout this chapter. The commands we will use are called procedural commands, and as such do not require an EXECUTE. at the end of the command.
Let’s first take the question of ‘are values for each variable within the possible range for that variable?’. If you have a 5-point Likert scale, yet have values of 6 and 0, those would be considered out of range. The same goes if you have binary gender represented by 0 (male) and 1 (female), but have entries of 2 or 3. Or, for continuous data: maybe you have blood pressure readings that you know were taken on living individuals, yet you have a reading of 378/72. In all these examples, the data may have been entered wrong, read into SPSS incorrectly, or may simply be wrong (someone bubbled in the wrong answer). Diagnosing the reason for the out-of-range data is important, as that will impact how you handle it. If the data is plain wrong, do you change it to missing, or just assume that the ‘6’ that was bubbled in on the Likert scale was really meant to be a ‘5’ and fix it? For that blood pressure, maybe you’d go back to the patient sheet or to the physician for clarification.
Regardless of the cause, if you do make changes to your data (i.e., changing out-of-range values to missing) you should save your data to a new file to avoid accidental modifications to the original data.
Screening your data is one part of cleaning your data. A suggested work flow for this process is:
- Screening your data (i.e. identifying out-of-range values)
- Editing any out-of-range values discovered in step 1
- Exploring variable distributions, obtaining descriptive statistics and/or correlations, and looking for outliers
- Editing any outliers and making any transformations (if necessary)
Commands that we will be using to clean our data include FREQUENCIES, DESCRIPTIVES, EXAMINE, MEANS TABLES, CROSSTABS, and GRAPH. Each of these will give us some form of output - either in graphical form, tabular form, or number form.
6.5.2 Frequencies
The command we can use to screen categorical or continuous variables is FREQUENCIES. The command and some pertinent subcommands are shown below:
Command | Subcommand | Options | Function |
---|---|---|---|
FREQUENCIES VARIABLES = | | (List variables here) | Names the variables to be tabulated, in the order in which you want them considered. The other subcommands you select determine the output. The basic specification gives a frequency table. |
| /FORMAT = | AVALUE (sorts in ascending order of values; default), DVALUE (sorts in descending order of values), AFREQ (sorts in ascending order of frequency), DFREQ (sorts in descending order of frequency), LIMIT(n) (suppresses tables with more than n categories), NOTABLE (suppresses all frequency tables) | Controls how (and whether) the frequency table is displayed. |
| /MISSING = INCLUDE | | Overrides the default. Includes user-missing values in statistics and plots. |
| /BARCHART | (there are options available that we will not be addressing) | Produces a bar chart for each variable named in the initial command. |
| /HISTOGRAM | (there are options available that we will not be addressing) | Produces a histogram for each named numeric variable. |
| /PERCENTILES = | (Put percentages here) | Defines desired percentiles to be reported (e.g. 25, 50, 75). |
| /NTILES = | (Number) | Calculates the percentiles that divide the distribution into the specified number of groups. |
| /STATISTICS = | MEAN, SEMEAN (std. error of the mean), MEDIAN, MODE, STDDEV, VARIANCE, SKEWNESS, KURTOSIS, RANGE, MINIMUM, MAXIMUM, SUM, ALL (all available statistics), NONE (no statistics) | Returns the requested statistics for numeric variables. The default is mean, standard deviation, minimum, and maximum. Missing values are not included in calculations by default. |
At its most basic, we can use this command with no subcommands: FREQUENCIES VARIABLES = age.
Reminder: Since this is a procedure, we do not need EXECUTE.
NOTE: You may occasionally see some commands shortened: FREQUENCIES VARIABLES as Freq var or freq var, EXECUTE as EXE or Exe, DELETE VARIABLES as DELETE VARS, etc. The functionality stays exactly the same, but when you use the shortened form, you lose the tell-tale colors in your syntax. I personally prefer the ‘long form’, as I find the color changes very useful for quickly identifying bugs or typos. The shortened forms are not wrong, however, and you may encounter them when working with collaborators.
We can also use this command on a series of response variables: FREQUENCIES VARIABLES = R1 to R10.
From the output, we can determine whether our responses are all within the possible range, as well as whether we have any missing data (and if so, how much). If we find out-of-range values, we then need to decide how to edit our data - and document whatever changes we decide to make! The syntax file is a great place to keep these decisions, by using comments within your syntax file. This keeps your decisions in the same place as your data cleaning steps, available for reference later if needed.
FREQUENCIES can provide more than just a frequency table - see the chart above for some of the useful subcommands. If we wanted statistics, we could add that subcommand:
FREQUENCIES VARIABLES = R1
/STATISTICS.
This would give us the default statistics: mean, standard deviation, minimum, and maximum. If we only wanted the mean, we could specify it as such:
FREQUENCIES VARIABLES = R1
/STATISTICS = MEAN.
We can also request percentiles:
FREQUENCIES VARIABLES = R1
/STATISTICS
/PERCENTILES = 25 50 70 80.
The code above would give us the default statistics as well as the requested percentiles. Keep in mind that the 25th percentile is the value below which 25% of the scores (or values) fall. I requested the 25th, 50th, 70th, and 80th percentiles above - you can modify that to as few or as many as you wish.
An alternative is the /NTILES subcommand. Here, you give how many equal groups you want the distribution divided into (e.g., 4 gives quartiles, reporting the 25th, 50th, and 75th percentiles; 10 gives deciles.)
FREQUENCIES VARIABLES = R1
/STATISTICS
/NTILES = 10.
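For a continuous variable with many distinct values, the frequency table itself can get unwieldy. One option, sketched here with our age variable, is to suppress the table with the /FORMAT subcommand from the chart above while still requesting the default statistics:

FREQUENCIES VARIABLES = age
/FORMAT = NOTABLE
/STATISTICS.

This gives you the mean, standard deviation, minimum, and maximum without pages of one-count rows.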
6.5.2.1 Univariate Outliers
6.5.2.2 Handling Missing Data
There will be a more extensive discussion of missing data in the next chapter. For now, if you find you have out-of-range values and wish to convert them to user-defined missing values, you would use the MISSING VALUES command.
For string variables: MISSING VALUES variable_name_here ("value" "value").
For numeric variables: MISSING VALUES variable_name_here (value value).
If you have both string and numeric out-of-range values, you can combine them all in one MISSING VALUES command. Just remember to put the string values within double quotes.
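As a concrete sketch, suppose screening turned up stray codes in two hypothetical variables: social (numeric Likert item with out-of-range 0s and 6s) and gender (string with an undefined code "X"). One command can declare all of them user-missing:

MISSING VALUES social (0 6) gender ("X").
EXECUTE.

Remember to record in a comment why those values were set to missing.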
6.5.3 Descriptives
This command gives descriptive statistics for continuous variables. The command has the following structure:
Command | Subcommand | Options | Function |
---|---|---|---|
DESCRIPTIVES VARIABLES = | | You can also create new z-scores here, by naming the new variable in parentheses after the original. This selectively creates z-scores, and does not require the /SAVE subcommand. | Computes univariate statistics for the listed variables. This command does not sort values into a frequency table. Variables are listed after the command. |
| /SAVE | | Creates new variables in your active dataset containing standardized values (z-scores) of the listed variables. |
| /STATISTICS = | MEAN, SEMEAN, STDDEV, VARIANCE, KURTOSIS, SKEWNESS, RANGE, MIN, MAX, SUM | Requests specific statistics to be displayed. The default is mean, stddev, min, and max. |
| /MISSING = | INCLUDE (includes user-missing values) | Controls missing values. The default is to exclude cases with missing values on a variable-by-variable basis. |
| /SORT = | Any of the statistics. (A) is ascending, (D) is descending. | Allows you to list variables in ascending or descending order of variable name or of the numeric value of a statistic. By default, they are listed in the order they appear in /VARIABLES. |
DESCRIPTIVES gives similar information to the /STATISTICS subcommand of the FREQUENCIES command, but unlike FREQUENCIES it does not offer the option to get a frequency table, nor does it calculate the median, mode, or percentiles. However, it does allow for sorting of the output as well as calculation of z-scores. You will choose between the two commands depending on your specific data needs. Indeed, you may end up running both commands, selecting different subcommands for each.
You may also see this command written in short-hand: desc var = rather than DESCRIPTIVES VARIABLES =. Both are correct.
Let’s say you wanted descriptive statistics on responses to a 10-item Likert scale, with questions q1 through q10. We could simply ask for the default:
DESCRIPTIVES VARIABLES = q1 to q10.
Or, maybe you only wanted the mean, and for the output to be sorted in descending order of the mean:
DESCRIPTIVES VARIABLES = q1 to q10
/STATISTICS = MEAN
/SORT = MEAN(D).
6.5.3.1 Z-scores
An advantage of DESCRIPTIVES over FREQUENCIES is that it will calculate z-scores (standardized scores) of specified variables. If you use the /SAVE subcommand, it will calculate a z-score for every variable in the /VARIABLES list, naming each new variable by adding a ‘z’ in front of the original name. You can also select which variables you wish to have z-scores of in the variable list.
DESCRIPTIVES VARIABLES = q1 (q1_z) q2 to q8 q9 (q9_z) q10 (q10_z).
The above would create z-scores of q1, q9, and q10 but not of the rest.
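By contrast, if you wanted z-scores for every variable in the list, the /SAVE subcommand handles the naming for you (following the naming rule above, this creates standardized versions of q1 through q10):

DESCRIPTIVES VARIABLES = q1 to q10
/SAVE.

The new standardized variables appear at the end of your active dataset.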
6.5.4 Examine
The EXAMINE command gives a large amount of information about your variables, and is only intended for use with continuous variables. A nice addition of this command over FREQUENCIES and DESCRIPTIVES is that it provides a stem-and-leaf plot as well as a box plot by default, and can be asked for tests of normality as well as normal plots and histograms.
Command | Subcommand | Options | Function |
---|---|---|---|
EXAMINE VARIABLES = | | | Computes univariate descriptive statistics and gives plots and tests of normality for the listed variables. |
| /COMPARE = | GROUPS (default), VARIABLES | Controls how boxplots are displayed; GROUPS displays boxplots for all cells together, allowing comparisons across cells for a single dependent variable. VARIABLES groups boxplots for all dependent variables. |
| /PERCENTILES | | Displays a percentiles table. If omitted, no percentiles are included. Default breaks are (5, 10, 25, 50, 75, 90, and 95); otherwise you can specify which you want. |
| /PLOT = | STEMLEAF (default), BOXPLOT (default), NPPLOT (tests for normality), HISTOGRAM, SPREADLEVEL, ALL, NONE | Displays the requested plots. |
| /STATISTICS = | DESCRIPTIVES (default) | Gives statistics. You cannot specify which particular statistics you would/would not like displayed. |
| /CINTERVAL | 95 (default); can provide any other number | Controls the confidence interval displayed when statistics are displayed. |
| /MISSING = | LISTWISE (default), EXCLUDE (default), NOREPORT (default), PAIRWISE, INCLUDE | Determines how missing values are treated. |
If we want just the defaults, we can do: EXAMINE VARIABLES = age.
Adding /PLOT NPPLOT would give us normal and detrended Q-Q plots as well as tests of normality:
EXAMINE VARIABLES = age
/PLOT STEMLEAF HISTOGRAM NPPLOT.
Notice how I also included “STEMLEAF” in the plot list? If I had not, I would have only received the NPPLOT. Both the histogram and the NPPLOT output give us information about the distribution of the variable age.
We can also check for meeting the assumption of homogeneity of variance with this command by including “SPREADLEVEL” in the PLOT subcommand:
EXAMINE VARIABLES = age by gender
/PLOT SPREADLEVEL.
Since we are checking for homogeneity of variance, I also included how age was being grouped: by gender in this instance.
6.6 Statistical Tests
Another part of screening your data may involve some simple statistical tests - are variables that should be correlated actually correlated? What is the mean of a particular variable? Does the mean seem to differ across levels of another variable? You also may be ready to perform a t-test after ensuring your data is in the proper shape, or a t-test may be an initial step before running more complicated analyses. Below, you will see brief explanations of some common statistical tests: means, chi-square, t-tests, correlations, and multiple regression. Note: I do not go into great detail with any of these tests, leaving them instead to your dedicated statistics courses.
6.6.1 Means
We can get means with the EXAMINE command, but it comes with quite a bit of additional output. Using the MEANS TABLES command produces less output and gives us more control over which other statistics are produced. If we just wanted the mean, we would do:
MEANS TABLES = Share.
The output will only include the mean, N, and standard deviation by default.
We can also request means broken down by levels of another categorical variable. If we add the /CELLS
subcommand, we can specify what statistics we want reported out.
MEANS TABLES = Share by Gender by Coverage
/CELLS = count mean stddev.
In the output, each gender is reported, and within each gender are the values for private and public coverage. Note that MEANS TABLES can be used to get the mean of a continuous variable, as well as the mean of a continuous variable broken down by the levels of one (or more!) categorical variables.
6.6.2 Crosstabs
CROSSTABS is used when you want the frequencies of one categorical variable across levels of another categorical variable. Using the same variables as above, we could get the number of males and females for each type of insurance coverage:
CROSSTABS TABLES = Gender by Coverage.
We are not limited to just two variables - we can look at more than two variables if we want:
CROSSTABS TABLES = Gender by Coverage by State.
This gives us the count of gender, broken down by coverage type, and then coverage type broken down by state.
Given that this looks an awful lot like a chi-square set-up (you’d be right!), we can ask for a chi-square test of independence by adding a /STATISTICS subcommand. As a reminder, this tells us whether the observed frequencies differ from what would be expected if the two variables were unrelated. The null hypothesis here is that the two variables are not related, and both variables must be categorical.
CROSSTABS TABLES = Gender by Coverage
/STATISTICS = chisq
/CELLS = count expected sresid.
The information after the /CELLS subcommand asks for the observed cell frequencies (count), the expected frequencies if the two variables were not related (expected), and the standardized (Pearson) residuals (sresid). SPSS will throw a warning if any cells have expected counts less than 5. As a reminder, if any cells have expected counts less than 5, the results of the chi-square test should be viewed with suspicion.
6.6.3 T-tests
While this is perhaps moving past examining your data and into running statistical tests, we can also perform t-tests in SPSS. The base command is T-TEST, which gains extra subcommands depending on what type of t-test we want to perform.
If we had a grouping variable iv with membership indicated by (0, 1), and a dependent variable creatively named outcome, our independent samples t-test would look like so:
T-TEST GROUPS = iv(0 1)
/VARIABLES = outcome
/CRITERIA = CIN (.95).
If our grouping variable was a string rather than numeric, we would specify each level by writing it out, and making sure to put it in quotes: gender (“Male” “Female”).
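As a sketch, the string-grouped version of the same test (variable names hypothetical) would be:

T-TEST GROUPS = gender("Male" "Female")
/VARIABLES = outcome
/CRITERIA = CIN (.95).

The only change from the numeric version is that each group value is written out in quotes.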
Notice how we are also requesting a 95% confidence interval with the subcommand /CRITERIA = CIN (.95). What do you think you would change to make it a 90% confidence interval?
A one-sample t-test to determine if the mean of our variable var was significantly different from 3.8 would be performed with the following modifications:
T-TEST
/TESTVAL = 3.8
/VARIABLES = var.
6.6.4 Correlations
Briefly, you would get a correlation using the command CORR VAR = and listing the variables you want correlated. Remember: for a correlation, variables need to be continuous! Another important note regards the /MISSING subcommand. If you do not specify /MISSING = listwise, the default is pairwise deletion (not a great thing).
CORR VAR = var1 var2 var3
/MISSING = listwise.
6.6.5 Regression
Recognizing that there could be an entire book written on the ins and outs of multiple regression, I will just leave you with a basic command and leave the finer details to other courses.
REG
/DEPENDENT outcome
/METHOD = enter var1 var2.
Above is a multiple regression with outcome as the dependent variable and var1 and var2 as two independent variables that are being entered into the equation at the same time.
6.7 Graphs
Graphs in SPSS leave a bit to be desired, but are still useful if you are just using them to screen and examine your data. If you are trying to create publication-quality visualizations, another program might be more flexible. Excel and R both offer good visualization options, with R being my personal favorite due to the extreme flexibility and customization options.
The initial command to create a graph is, unsurprisingly, GRAPH
. The subcommands will then specify what type of visualization you are looking for, as well as allowing you to put an informative title on your visualization.
Command | Subcommand | Options | Function |
---|---|---|---|
GRAPH | | | Initiates a graph. The following subcommands specify the graph type and variables. |
| /TITLE = ‘your title’ | | Adds a title to the graph. |
| /BAR = vars | (SIMPLE) (GROUPED) (STACKED) | Initiates a bar graph. The parenthetical options go before the = and indicate what type of bar graph. Grouped and simple are the defaults, depending on variable specification. Variables are defined after the =. |
| /LINE = vars | (we’re using the default) | Creates a line graph with the specified variables. |
| /HISTOGRAM = vars | | Creates a histogram of the specified variables. |
| /SCATTERPLOT = vars | | Creates a scatterplot of the specified variables. |
| /INTERVAL CI (95) | | Adds error bars to the chart. |
6.7.1 Bar Graphs
If we just wanted a bar graph, we could do that with:
GRAPH
/BAR = program.
If we wanted to add a title to our bar graph, we would use the /TITLE subcommand:
GRAPH
/BAR = program
/TITLE = 'Number of students in each medical program'.
We can also use other functions within the subcommands. For example, if we wanted the average insurance share broken down both by private vs. public coverage and by gender, we would do the following:
GRAPH
/BAR = MEAN(Share) BY Coverage BY Gender
/TITLE = 'Average market share of different types of health insurance by gender.'.
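The /INTERVAL subcommand from the chart above can be added to a summary bar graph like this one to draw 95% confidence-interval error bars (a sketch using the same variables):

GRAPH
/BAR = MEAN(Share) BY Coverage
/INTERVAL CI (95)
/TITLE = 'Average share by coverage type, with 95% confidence intervals'.

Error bars are a quick visual check of how much the group means might overlap.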
6.7.2 Scatterplots
To create a scatterplot, we would use the /SCATTERPLOT subcommand, and include which two variables we are examining the relationship between.
GRAPH
/SCATTERPLOT = program WITH age.
Notice how the two variables are separated by WITH. This specifies that we want to see if there is any relationship between these two variables: one is the x value and the other the y value. We can also add in grouping, as above:
GRAPH
/SCATTERPLOT = program WITH age by gender.
This would give us a scatterplot showing the relationship between program and age for males and females.
6.7.3 Histograms
A histogram is useful to see the distributional characteristics of your variable (i.e. Is this variable normally distributed?). The y-axis of a histogram will always be a count of the cases at a particular x value.
GRAPH
/HISTOGRAM = age.
To make it easier to compare your data to a normal distribution, you can add (NORMAL)
after HISTOGRAM
to super-impose a normal curve on the graph:
GRAPH
/HISTOGRAM(NORMAL) = age.
6.7.4 Line graphs
We can also make line graphs to see how one variable varies over levels of another. For example, we could see how the patient to clinician ratios varied in Virginia over the past decade:
GRAPH
/LINE = MEAN(Ratio) BY Year.
Notice how I included MEAN
around Ratio - this is because otherwise SPSS will assume you want a count. Try removing MEAN
and see what happens. Not quite what we were hoping for, right?