8 Merging, Transposing, and Concatenating Data
This chapter will go over how to merge, concatenate, and transpose data files. At the end, we will briefly touch on the process of a ‘do loop’, and when you might want to make use of that functionality.
A brief differentiation of two methods at the beginning: merging and concatenating. When we merge datasets, we generally have all (or most) of the same subjects, we just want to add extra columns. As an example, in Data_A, we may have subject ID along with demographic variables. In Data_B, we may have subject ID along with performance on non-cognitive items. We want both demographic variables and non-cognitive items to be in the same dataset, so we will merge the two datasets, matching on the ID variable.
Concatenation, on the other hand, is when you have two datasets with the same (or very similar) variables and you want to add subjects in one to the other. An example here might be you have height, weight, and hours of exercise for your subject pool performed in October. You also have the height, weight, and hours of exercise for your subject pool performed in April. You want to concatenate the two datasets to get all your subjects in the same dataset.
8.1 Review of Renaming and Sorting Variables
Before we merge datasets, we need to make sure that the merging variable is named the same in all datasets, that it is the same type and length in each file, and that each data file is sorted on the matching variable in ascending order.
8.1.1 Renaming Variables
To rename variables, you can use the RENAME VARIABLES
command:
RENAME VARIABLES stud_id = ID.
EXECUTE.
As a reminder, start with the name we have (stud_id) and end with the name we want (ID). Capitalization will matter, so if you have an id
variable and an ID
variable, you’ll want them both in the same case.
8.1.2 Sorting Variables
To sort a variable in ascending order, we will use SORT CASES BY
:
SORT CASES BY id (A).
Here, the (A)
is indicating ascending order. Also notice how an EXECUTE
is not required for this command.
8.1.3 Formatting Variables
The last thing to check is to make sure that each of your matching variables is the same length. So, if you have 3 character spaces for ID in dataset A but 4 in dataset B, that will be a problem. What you can so is open both, format them to be the same just in case, then save them back out.
To format them, you will use the FORMATS
command:
FORMATS id (F4.0).
Here, we are saying that the id variable should take up four spaces, and have nothing after the decimal. Additionally, since we specify F, we are saying it is a number.
8.2 Merging and Concatenating Data
8.2.1 Merging Data
Once our matching variables all match, we can now merge! We will be using the MATCH FILES
command, with a series of sub-commands:
MATCH FILES
/FILE = data_A
/FILE = data_B
/BY id.
EXECUTE.
Here, we are merging the SPSS files ‘data_A’ and ‘data_B’ (NOTE: these are already opened, and named using DATASET NAME
. Since ‘data_B’ comes second, we are adding it to ‘data_A’. /BY
is telling SPSS on what variable we want to match (id). A good second step to this is to save out the merged file. Remember: until you save it out, nothing is happening to the original datasets!
To save out, I would first name the merged file within SPSS, and then save it to my desired location:
DATASET NAME dataAB
SAVE OUTFILE = ‘path\to\directory\here.sav’.
EXECUTE.
We could also do the above without first opening the datasets, and reading them straight from a file:
MATCH FILES
/FILE = ‘path\to\data_A.sav’
/FILE = ‘path\to\data_B.sav’
/BY id.
EXECUTE.
8.2.2 Concatenating Data
When we concatenate data, we are no longer matching on cases (subjects), but on variables. As a result, we need to do more checking on our variables to make sure they have the same name and are the same length. We will want to check all pertinent variables, and if they need renaming or reformatting, use a RENAME VARIABLES
or FORMATS
command as described above.
Once we have all our variables in order, we can add one file to the other (concatenate):
ADD FILES
/FILE = ‘path\to\fileA.sav’
/FILE = ‘path\to\fileB.sav’.
EXECUTE.
Above, we are adding ‘fileB.sav’ to ‘fileA.sav’ - stacking one on top of the other. All variables in both files will be included, and the ones that are the same will be matched. As before, we’ll save this out as a new file, so we aren’t over-writing any of our original files. And, so we won’t have to do this again if we come back to it.
SAVE OUTFILE = ‘path\to\directory\here.sav’.
EXECUTE.
We can also concatenate from already open and named SPSS datasets, as with merging.
8.2.3 More than two Files
In the examples above, we were only merging or concatenating two data files at one time. If we have more than two data files, we can merge or concatenate them all at once (provided we’ve done our due diligence with variables!):
ADD FILES
/FILE = ‘path\to\fileA.sav’
/FILE = ‘path\to\fileB.sav’
/FILE = ‘path\to\fileC.sav’
/FILE = ‘path\to\fileD.sav’
/FILE = ‘path\to\fileE.sav’
/FILE = ‘path\to\fileF.sav’.
EXECUTE.
or:
MATCH FILES
/FILE = ‘path\to\data_A.sav’
/FILE = ‘path\to\data_B.sav’
/FILE = ‘path\to\data_C.sav’
/FILE = ‘path\to\data_D.sav’
/FILE = ‘path\to\data_E.sav’
/FILE = ‘path\to\data_F.sav’
/BY id.
EXECUTE.
8.2.4 Dropping or Keeping Variables
Sometimes when we are merging data, we want to drop variables, or only keep certain variables. We can always do that for the individual datasets, or we can do it as part of the merge or concatenation step.
As an example, here I am dropping the ‘Name’ variable from ‘data_A’:
MATCH FILES
/FILE = ‘path\to\data_A.sav’
/DROP = Name /FILE = ‘path\to\data_B.sav’
/BY id.
EXECUTE.
If I also wanted to drop ‘BMI’ from data_B, I would do this:
MATCH FILES
/FILE = ‘path\to\data_A.sav’
/DROP = Name /FILE = ‘path\to\data_B.sav’
/DROP = BMI /BY id.
EXECUTE.
8.2.5 Identifing Data of Origin
Sometimes when merging or concatenating, you might want to keep track of which dataset the cases came from. It may be that different files represent different study administrations, and tracking that is of importance. We can add an indicator variable that will tell us from which file each case came from by using the /IN
sub-command:
Concatenating:
ADD FILES
/FILE = ‘path\to\data_A.sav’
/IN = session1
/FILE = ‘path\to\data_B.sav’
/IN = session2.
EXECUTE.
Merging:
MATCH FILES
/FILE = ‘path\to\data_A.sav’
/IN = grpA
/FILE = ‘path\to\data_B.sav’
/IN = grpB
/BY id.
EXECUTE.
In both of these examples, new variables are being created that have a value of “1” if the case came from the dataset directly above the /IN
sub-command, and a “0” if it did not.
8.3 One-to-many Merge
There may be times when you have nested data, and information on both the cluster-level variables (e.g. hospitals) and the unit-level variables (e.g. patients). You may wish to merge your datasets so that the hospital-level data is added to the patient-level data (using our example). To do this, we will do a one-to-many merge, where the patient-level file would be the “many” and the hospital-level file would be the “one”.
This time, we will merge based on a variable in both files - perhaps ‘room’, ‘floor’, or ‘department’. We will still need to make sure it is named the same and in the same format in both datasets. The initial command is the same (MATCH FILES
), but then we will use a /FILE
sub-command for the “many” dataset and a /TABLE
sub-command for the “one” dataset:
MATCH FILES
/FILE = ‘path\to\patients.sav’
/TABLE = ‘path\to\hospital.sav’
/BY dept.
EXECUTE.
8.4 Data Restructuring
When data are gathered, they are either gathered in ‘long’ or ‘wide’ format. Data in ‘long’ format have multiple rows for a single observation. Think of a study where you gather participant weight for three consecutive months. Each time you weigh the participant, you have another row. This would be long format, and might look something like this:
id | Weight |
---|---|
1 | 248 |
1 | 241 |
1 | 243 |
2 | 177 |
2 | 178 |
2 | 177 |
Data that are in wide format, on the other hand, have only one row per observation, and multiple columns representing the different variables. The data above in wide format would look like so:
id | Weight.1 | Weight.2 | Weight.3 |
---|---|---|---|
1 | 248 | 241 | 243 |
2 | 177 | 178 | 177 |
8.4.1 Long-to-Wide
Going from long format to wide format can also be called going from cases to variables. What you first need consider is what variable do you want to collapse on? Using our example above, we are collapsing the id column, going from 3 entries down to 1. Once we have figured out which variable we are collapsing, we need to sort on that variable:
SORT CASES BY id.
EXECUTE.
Then, we can perform the transformation:
CASESTOVARS
/ID = id.
EXECUTE.
In the syntax above, /ID
is asking what is being collapsed; it can be confusing when your answer is also id! Notice also how SPSS automatically generates variable names: Weight.1, Weight.2, and Weight.3. If you don’t like the names, you can always go in and rename them.
You may also already have some additional information about each of the entries - maybe each weight was gathered in a certain month, or if you are giving a test, scores belong to different test forms. That data may look like this:
id | Weight | Month |
---|---|---|
1 | 248 | Jan |
1 | 241 | Apr |
1 | 243 | July |
2 | 177 | Jan |
2 | 178 | Apr |
2 | 177 | July |
We can use this additional information to populate the names of the new variables by using it as an index variable. To do this, we again consider what variable we are collapsing on (here, id) and then sort by it. We also need to sort by our index variable as well:
SORT CASES BY id Month.
EXECUTE.
Then, we will use CASESTOVARS
to transform our data from long to wide, but we will add an additional sub-command indicating we should name the new variables using the index variable values.
CASESTOVARS
/ID = id
/INDEX = Month.
EXECUTE.
Resulting in:
id | Jan | Apr | July |
---|---|---|---|
1 | 248 | 241 | 243 |
2 | 177 | 178 | 177 |
Another instance may be if we have a variable that has the same value for a given individual across observations (e.g. gender or SES). Look at the example below:
id | gender | Weight | Month |
---|---|---|---|
1 | M | 248 | Jan |
1 | M | 241 | Apr |
1 | M | 243 | July |
2 | F | 177 | Jan |
2 | F | 178 | Apr |
2 | F | 177 | July |
If we were to transform this from long to wide, gender
will not get restructured into multiple variables in the new datafile:
id | gender | Jan | Apr | July |
---|---|---|---|---|
1 | M | 248 | 241 | 243 |
2 | F | 177 | 178 | 177 |
It is also possible to have two variables that have values that differ for a given individual across records:
id | gender | Weight | Month | Exercise |
---|---|---|---|---|
1 | M | 248 | Jan | 20 |
1 | M | 241 | Apr | 18 |
1 | M | 243 | July | 25 |
2 | F | 177 | Jan | 30 |
2 | F | 178 | Apr | 33 |
2 | F | 177 | July | 31 |
In this table, there is the addition of Exercise
- representing average minutes of exercise in the time period. When we transpose this, both weight and exercise will expand out:
SORT CASES BY id Month.
EXECUTE.
CASESTOVARS
/ID = id
/INDEX = Month.
EXECUTE.
Resulting in:
id | gender | Weight.Jan | Weight.Apr | Weight.July | Exercise.Jan | Exercise.Apr | Exercise.July |
---|---|---|---|---|---|---|---|
1 | M | 248 | 241 | 243 | 20 | 18 | 25 |
2 | F | 177 | 178 | 177 | 30 | 33 | 31 |
8.4.2 Wide-to-Long
Wide-to-long is the exact opposite of long-to-wide - here we are going from variables to cases. When going in this direction, we will need to define the new variable we are making, and let SPSS know what variables from the wide format will make up the new variable. Going in reverse from what we did above, we will start simple, with only the weight columns and id:
id | Weight.1 | Weight.2 | Weight.3 |
---|---|---|---|
1 | 248 | 241 | 243 |
2 | 177 | 178 | 177 |
This time, we do not have to first sort on id (or any other variable), and we can go straight into transformation:
VARSTOCASES
/MAKE Weight from Weight.1 Weight.2 Weight.3.
EXECTUTE.
In the syntax above, Weight
is the new variable name, and it will be composed of information from Weight.1
, Weight.2
, and Weight.3
, giving us the table below.
id | Weight |
---|---|
1 | 248 |
1 | 241 |
1 | 243 |
2 | 177 |
2 | 178 |
2 | 177 |
Now, let’s say we had information about when the subjects were weighed:
id | Jan | Apr | July |
---|---|---|---|
1 | 248 | 241 | 243 |
2 | 177 | 178 | 177 |
Recall that Jan
, Apr
, and July
are indicating information about when the weight was recorded. We now need to parse apart the information: we want to make a Weight
column, but also want to have a Month
column. We will do that with both a /MAKE
sub-command and an /INDEX
sub-command:
VARSTOCASES
/MAKE Weight from Jan Apr July
/INDEX = Month(Weight)
/NULL = KEEP.
EXECTUTE.
I also included /NULL = KEEP
to tell SPSS that I’d like records included even if there’s a missing value on the specified variable.
After running the syntax above, our table will look like so:
id | Weight | Month |
---|---|---|
1 | 248 | Jan |
1 | 241 | Apr |
1 | 243 | July |
2 | 177 | Jan |
2 | 178 | Apr |
2 | 177 | July |
8.5 Transpose All Data
We can also completely flip our entire dataset (rows become columns and columns become rows). As an example, lets say we have four raters evaluating nursing students on their patient assessment skills.
Rater | stud_1 | stud_2 | stud_3 | stud_4 |
---|---|---|---|---|
Rater1 | 21 | 15 | 12 | 45 |
Rater2 | 34 | 38 | 38 | 32 |
Rater3 | 58 | 42 | 32 | 52 |
Rater4 | 58 | 20 | 14 | 68 |
Rather than having raters as rows, we would really prefer to have our students as rows and our raters as columns. To do this, we will run the following syntax:
FLIP VARIABLES = stud_1 stud_2 stud_3 stud_4
/NEWNAMES = Rater.
/NEWNAMES
is specifying the variable values that should be used as the new variable names. After running the syntax, our table now looks like this:
case_lbl | Rater1 | Rater2 | Rater3 | Rater4 |
---|---|---|---|---|
stud_1 | 21 | 34 | 58 | 58 |
stud_2 | 15 | 38 | 42 | 20 |
stud_3 | 12 | 38 | 32 | 14 |
stud_4 | 45 | 32 | 54 | 68 |