5  Examining your data

This chapter will detail some of the first steps to take after reading in your data, regardless of the statistical software you are using. These steps should be performed whether you use Excel, SPSS, R, or any other software of choice. Future chapters will cover specifically how to perform these steps in SPSS and R, since the exact functions will differ. However, the process is the same.

Oftentimes, you will not have nice, pristine data that neatly fits all the statistical assumptions you need it to. Instead, you may have missing values, out-of-range values, or data that is not normally distributed. Additionally, you may want to see what your data looks like: What is the mean? How many observations are there? Are two variables correlated? By the end of this chapter, you should feel more confident about how to become better acquainted with your data.

5.1 Gathering Data

If you are going to be analyzing data, you should also be involved as much as you can when the data are gathered. This can help alleviate future headaches that could have been prevented with your input. Some things to keep in mind:

  • Use identification numbers, and keep them simple! There is nothing wrong with starting to number participants at 1 and continuing on for as long as you need.
  • Use open-ended format questions cautiously. While asking “How would you describe your gender?” in an open-ended format allows participants to describe what best fits them, it also opens up the possibility of answers such as “F, female, Female, femael, girl, etc.”. Each of those would be treated as a different gender by a program. Using a multiple-choice format would eliminate that problem. You could still have an open-ended option for those who do not feel their gender is represented by your options, but it cuts down on the variability.
  • Categories for numeric variables may be tempting to use if that is what you’re currently interested in (e.g. What is your age? 6-10, 11-15, 16-20, etc.). However, if, in the future, you are interested in the same data but with a different age breakdown, you won’t have that data. It is better to gather numeric data in a continuous format and, if absolutely necessary, use syntax or a formula to make the groups later. You’ll always have the original data to go back to if you want to change your analysis!
  • Missing data should be coded in such a way as to not be mistaken for actual variable values! For example, some people may choose to code missing data as 99 or 999; however, if that is read in under an age column, odd things may happen with your analysis. Both 99 and 999 (one more plausibly than the other) are “valid” ages, at least according to a computer program that is only looking for numbers in that column. A more useful identifier is a period (.) or NA. A period is automatically recognized by many programs as missing data, and NA is recognized by R as missing data (see the sketch after this list).
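
As a brief illustration, here is how missing-data codes might be handled in R (the file name survey.csv and the age column are hypothetical; SPSS handles this through its missing-values settings instead):

dat <- read.csv("survey.csv", na.strings = c("99", "999", "."))  # treat these codes as missing on the way in
                                                                 # careful: na.strings applies to every column
mean(dat$age)                 # returns NA if any missing values remain
mean(dat$age, na.rm = TRUE)   # excludes the missing values from the calculation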

5.1.1 Data Types

Data are stored in variables, and variables contain a set of values. Gender, score, and anxiety_level are all variables. The values contained in them might be female and male; 29, 20, 10, and 38; and 9, 3, 5, and 2, respectively. The values are what go in the ‘cells’ under each variable.

Variables are typically one of two formats: numeric or string. Numeric variables, as you may suspect, only contain numbers. If a numeric variable has a value that contains a string, it will not be interpreted as a valid value. String variables are those that contain numbers, characters, or a combination of the two. “Female” is a string value, as are “6-10” and “2A”. The type of variable will influence which analyses can and cannot be done with it, and whether the variable type needs to be changed prior to an analysis.
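
To see why this matters, consider a small sketch in R (values invented for illustration):

score  <- c(29, 20, 10, 38)       # a numeric variable
gender <- c("F", "M", "F", "M")   # a string (character) variable
mean(score)                # 24.25 - arithmetic works on numeric variables
mean(gender)               # warning: argument is not numeric; returns NA
as.numeric(c("10", "2A"))  # "2A" cannot be converted, so it becomes NA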

5.2 Variable and Value Labels

One important thing to keep in mind as you are analyzing your data is that there are two different entities examining the data: a human (you!) and a computer. Some of the things we do are for the computer (e.g., making sure a factor is accurately coded as a factor) and other things we do for the human (e.g., comments!). Adding variable or value labels is another action we do for the humans - this helps us, and anyone else who may be looking at our data, keep the variables and levels of a variable straight in our brains.

5.2.1 Variable Labels

A variable label is a human-understandable description of the variable. Some statistical programs limit the length of a variable name, which leads to some creatively named variables. Other times, it may be easier to type “pop1” rather than “population_one_under_18_midwest” in your syntax. Using a variable label lets you give a long description of a variable so you, or anyone using your code, can understand what each variable is.
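
R does not have built-in variable labels the way SPSS does, but a common workaround is to store the description as an attribute (packages such as labelled and haven formalize this). A minimal sketch with a hypothetical variable:

pop1 <- c(1200, 950, 1100)
attr(pop1, "label") <- "Population under 18, Midwest region"  # attach a human-readable description
attr(pop1, "label")                                           # retrieve it later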

5.2.2 Value Labels

Value labels are similar to variable labels, but rather than variable descriptions, they provide descriptions of the values a variable can take. For example, is it that 1 = Strongly Agree and 5 = Strongly Disagree, or the other way around? Is M = male or M = Monday? What, exactly, does an education level of 1 mean? Value labels will provide all this information.
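
In R, value labels are usually handled by converting the coded values to a factor. A sketch with hypothetical codes:

agreement <- c(1, 5, 2, 4, 1)
agreement <- factor(agreement, levels = 1:5,
                    labels = c("Strongly Agree", "Agree", "Neutral",
                               "Disagree", "Strongly Disagree"))
table(agreement)   # output now shows the labels rather than the bare codes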

5.3 Getting Acquainted with Your Data

You will typically start your examination process by first getting data into your statistical program. Then, you will likely add value and variable labels to help keep things straight, sort your data, etc. These are all useful actions to know, but they are mostly housekeeping-type tasks. Once we have our data read in and labeled, one of the first things we want to do is screen our data: Are values for each variable within the possible range for that variable? That is followed by getting to know our data better: Is there a lot of missing data? Are there outliers? What do the descriptive statistics look like? These are all questions that need to be answered prior to performing any statistical analysis. We will address each of those questions, and more, throughout this chapter.

Let’s first take the question of ‘are values for each variable within the possible range for that variable?’. If you have a 5-point Likert scale, yet have values of 6 or 0, those would be considered out of range. The same goes if you have binary gender represented by 0 (male) and 1 (female), but have entries of 2 or 3. Or, for continuous data: maybe you have blood pressure readings that you know were taken on living individuals, yet you have a reading of 378/72. With all of these examples, the data may have been entered incorrectly, read into the program incorrectly, or may simply be wrong (someone bubbled in the wrong answer). Diagnosing the reason for the out-of-range data is important, as that will impact how you handle it. If the data is plain wrong, do you change it to missing, or just assume that the ‘6’ that was bubbled in on the Likert scale was really meant to be a ‘5’ and fix it? For that blood pressure, maybe you’d like to go back to the patient sheet or to the physician for some clarification. An important thing to note here is that how you decide to handle out-of-range data may vary by discipline (or even by variable). If a value cannot be easily confirmed, a good rule of thumb is to change it to missing. This avoids those values inappropriately impacting statistical tests down the road.
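
As a sketch of what this screening might look like in R (the data frame dat and the column likert are hypothetical):

range(dat$likert, na.rm = TRUE)          # quick check of the minimum and maximum
which(dat$likert < 1 | dat$likert > 5)   # which rows are out of range?
dat$likert[dat$likert < 1 | dat$likert > 5] <- NA  # if unconfirmable, set to missing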

Regardless of the cause, if you do make changes to your data (e.g., changing out-of-range values to missing) you should save your data to a new file to avoid accidental modifications to the original data. Additionally, if you are using a program that has comments (e.g., SPSS or R), ADD A COMMENT! State which cases had out-of-range values and what you did. This will save you headaches later.

Screening your data is the first step to cleaning your data. But there are other steps to take as well. A suggested workflow for this process is:

  1. Screening your data (i.e., identifying out-of-range values)
  2. Editing any out-of-range values discovered in Step 1
  3. Exploring variable distributions, obtaining descriptive statistics and/or correlations, and looking for outliers
  4. Editing any outliers and making any transformations (if necessary)

Step 3 (exploring distributions) will help foreshadow any potential issues with statistical tests, as well as begin to address assumptions (e.g., is my variable normally distributed? Does it need to be?). This is also what will help you build a “Table 1” for your eventual results section.
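
A first pass at Step 3 in R might look something like this (data frame and variable names hypothetical):

summary(dat)          # minimum, maximum, quartiles, and mean for each numeric variable
table(dat$gender)     # counts for a categorical variable
colSums(is.na(dat))   # how much missing data does each variable have?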

5.3.1 Outliers

If you have outliers, you must decide what to do with them. Is an outlier someone at the edge of the distribution? Do you include them in your analysis? Additionally, the context of your data matters when determining whether a data point is an outlier or not. A commonly used example is height. Someone who is 6’8” tall would likely be an outlier if the sample is taken from the general population. However, if the sample was taken from NBA basketball players, that same individual would probably fall neatly into the distribution. This is another reason you need to screen your data first - what does the sample distribution look like, and does that make sense?

The question of what to do with your outlier(s) will generally depend on three factors:

  1. The field/context of your analyses and the research question. One approach you might take is to perform your analyses twice: once with and once without outliers to evaluate their impact on the conclusions. If results change drastically due to the outliers, this should caution you against making overambitious claims.
  2. Whether your planned analyses are robust to the presence of outliers or not. For example, the slope of a simple linear regression may significantly vary with just one outlier, whereas non-parametric tests such as the Wilcoxon test are usually robust to outliers.
  3. How distant the outliers are from other observations. Some observations considered outliers are actually not very extreme compared to all other observations, while other potential outliers may be quite distant from the rest of the observations.

One commonly used rule for (roughly) symmetric data in determining whether a data point is an outlier uses quartiles. Specifically, if a data point is more than 1.5 * IQR above the third quartile or 1.5 * IQR below the first quartile, it could be considered an outlier.
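
A sketch of that rule in R:

x <- c(2, 4, 4, 5, 6, 7, 30)   # 30 looks suspicious
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- IQR(x)                  # the same as q3 - q1
x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]   # flags 30 as a potential outlier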

5.4 Statistical Tests

Another part of screening your data may involve some simple statistical summaries and tests: Are variables that should be correlated actually correlated? What is the mean of a particular variable? Does the mean seem to differ across levels of another variable? You also may be ready to perform a t-test after ensuring your data is in the proper shape, or a t-test may be an initial step before running more complicated analyses. Some common statistics and tests you may run include means, chi-square tests, t-tests, and correlations.
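
In R, these might look like the following (variable and data frame names hypothetical):

mean(dat$score, na.rm = TRUE)                       # mean of one variable
cor(dat$score, dat$anxiety, use = "complete.obs")   # correlation between two variables
t.test(score ~ gender, data = dat)                  # does mean score differ by gender?
chisq.test(table(dat$gender, dat$group))            # association between two categorical variables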

Note

I do not go into great detail with any of these tests, leaving them instead to your dedicated statistics courses.

5.5 Visualization as Screening

When screening your data, it is sometimes easiest to use a ‘quick-and-dirty’ visualization to get an idea of what it looks like. Histograms very quickly show if a variable is unimodal or bimodal, skewed or normally distributed. Bar plots can show counts or frequencies of categorical variable levels. When we do these visualizations, we are not creating publication-quality visualizations. This is part of our screening process, and is typically just used for our own information and to guide our next steps. If you find one of these visualizations would be helpful (or even required) as part of a results section, you can always go back and pretty it up. There is no need to spend excessive time in these first examinations selecting the perfect color palette, witty title, or more detailed axis labels than the default.
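
In R, the base graphics functions are well suited to this quick-and-dirty approach (variable names hypothetical):

hist(dat$score)              # shape: unimodal or bimodal? skewed or symmetric?
barplot(table(dat$gender))   # counts for a categorical variable
boxplot(dat$score)           # a fast visual check for outliers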

5.6 File Types

While screening your data is crucial, you must first have data to screen! Often, the first struggle is getting the data into your program - which may include figuring out what type of data you are working with in the first place. With both SPSS and R, the function you use to read in your data will depend on what type of data you are working with, so it is important to get it right!

Not all the data you will bring into your software program will be from Excel - you can also bring in text files that have the data separated in a uniform fashion, or even data that is written continuously, one record right after another.

Fixed width data have one record per observation (one line per unit of measurement), and the variables are located in specific columns. Columns can easily be seen in a text editor, or counted in any program by moving your cursor over the characters one by one.

Fixed width data might look like this:

101 M  1011011001
102 M  0010101110
103 F  0000000000
104 F  1111111111
105 M  1111100000

Here, ID is in columns 1-3, gender is in column 5, and responses to a 10-question survey are in columns 8-17. Notice, also, that the variable names are not contained within the data itself; this is something you would need to know or have a key for.
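
In R, data like this could be read with read.fwf(), where the widths match the column positions just described (file name hypothetical):

dat <- read.fwf("survey.txt",
                widths = c(3, -1, 1, -2, rep(1, 10)),   # negative widths skip the spacer columns
                col.names = c("id", "gender", paste0("q", 1:10)))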

5.6.1 Delimiters

Delimiters are characters that separate the values in your data. The types we will discuss are comma-delimited, tab-delimited, and space-delimited data.

Tab-delimited data (tabs show up as → in text editors)

101→M→1→0→1→1→0→1→1→0→0→1
102→M→0→0→1→0→1→0→1→1→1→0
103→F→0→0→0→0→0→0→0→0→0→0
104→F→1→1→1→1→1→1→1→1→1→1
105→M→1→1→1→1→1→0→0→0→0→0

Comma-delimited data

101,M,1,0,1,1,0,1,1,0,0,1
102,M,0,0,1,0,1,0,1,1,1,0
103,F,0,0,0,0,0,0,0,0,0,0
104,F,1,1,1,1,1,1,1,1,1,1
105,M,1,1,1,1,1,0,0,0,0,0

Space-delimited data

101 M 1 0 1 1 0 1 1 0 0 1
102 M 0 0 1 0 1 0 1 1 1 0
103 F 0 0 0 0 0 0 0 0 0 0
104 F 1 1 1 1 1 1 1 1 1 1
105 M 1 1 1 1 1 0 0 0 0 0 
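
In R, each of these delimiters has a matching convenience function (file names hypothetical; header = FALSE because these files contain no variable-name row):

dat <- read.delim("survey.txt", header = FALSE)             # tab-delimited
dat <- read.csv("survey.csv", header = FALSE)               # comma-delimited
dat <- read.table("survey.txt", header = FALSE, sep = " ")  # space-delimited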

5.6.2 Free Format

Free format data have the variables delimited by a space or comma, but rather than one record per observation (one row per unit of measurement), one observation’s data immediately follows the next. This type can be visually challenging for a human to parse.

Free format data: 101,M,1,0,1,1,0,1,1,0,0,1,102,M,0,0,1,0,1,0,1,1,1,0,103,F,0,0,0,0,0,0,0,0,0,0,104,F,1,1,1,1,1,1,1,1,1,1,105,M,1,1,1,1,1,0,0,0,0,0
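
In R, free format data can be read with scan(), which returns one long vector you then reshape; a sketch using the record above (file name hypothetical):

vals <- scan("survey.txt", what = "character", sep = ",")     # one long character vector
dat  <- as.data.frame(matrix(vals, ncol = 12, byrow = TRUE))  # 12 values per observation
names(dat) <- c("id", "gender", paste0("q", 1:10))
# every column comes in as character; convert as needed, e.g. dat$id <- as.numeric(dat$id)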