8  Merging, Transposing, and Concatenating Data

This chapter will go over how to merge, concatenate, and transpose data files. At the end, we will briefly touch on the process of a ‘do loop’, and when you might want to make use of that functionality.

A brief differentiation of two methods at the beginning: merging and concatenating. When we merge datasets, we generally have all (or most) of the same subjects, we just want to add extra columns. As an example, in Data_A, we may have subject ID along with demographic variables. In Data_B, we may have subject ID along with performance on non-cognitive items. We want both demographic variables and non-cognitive items to be in the same dataset, so we will merge the two datasets, matching on the ID variable.

Concatenation, on the other hand, is when you have two datasets with the same (or very similar) variables and you want to add subjects in one to the other. An example here might be you have height, weight, and hours of exercise for your subject pool performed in October. You also have the height, weight, and hours of exercise for your subject pool performed in April. You want to concatenate the two datasets to get all your subjects in the same dataset.

8.1 Review of Renaming and Sorting Variables

Before we merge datasets, we need to make sure that the merging variable is named the same in all datasets, that it is the same type and length in each file, and that each data file is sorted on the matching variable in ascending order.

8.1.1 Renaming Variables

To rename variables, you can use the RENAME VARIABLES command:

RENAME VARIABLES stud_id = ID.
EXECUTE.

As a reminder, start with the name we have (stud_id) and end with the name we want (ID). Capitalization will matter, so if you have an id variable and an ID variable, you’ll want them both in the same case.

8.1.2 Sorting Variables

To sort a variable in ascending order, we will use SORT CASES BY:

SORT CASES BY id (A).

Here, the (A) is indicating ascending order. Also notice how an EXECUTE is not required for this command.

8.1.3 Formatting Variables

The last thing to check is to make sure that each of your matching variables is the same length. So, if you have 3 character spaces for ID in dataset A but 4 in dataset B, that will be a problem. What you can so is open both, format them to be the same just in case, then save them back out.

To format them, you will use the FORMATS command:

FORMATS id (F4.0).

Here, we are saying that the id variable should take up four spaces, and have nothing after the decimal. Additionally, since we specify F, we are saying it is a number.

8.2 Merging and Concatenating Data

8.2.1 Merging Data

Once our matching variables all match, we can now merge! We will be using the MATCH FILES command, with a series of sub-commands:

MATCH FILES
 /FILE = data_A
 /FILE = data_B
 /BY id.
EXECUTE.

Here, we are merging the SPSS files ‘data_A’ and ‘data_B’ (NOTE: these are already opened, and named using DATASET NAME. Since ‘data_B’ comes second, we are adding it to ‘data_A’. /BY is telling SPSS on what variable we want to match (id). A good second step to this is to save out the merged file. Remember: until you save it out, nothing is happening to the original datasets!

To save out, I would first name the merged file within SPSS, and then save it to my desired location:

DATASET NAME dataAB
SAVE OUTFILE = ‘path\to\directory\here.sav’.
EXECUTE.

We could also do the above without first opening the datasets, and reading them straight from a file:

MATCH FILES
 /FILE = ‘path\to\data_A.sav’
 /FILE = ‘path\to\data_B.sav’
 /BY id.
EXECUTE.

8.2.2 Concatenating Data

When we concatenate data, we are no longer matching on cases (subjects), but on variables. As a result, we need to do more checking on our variables to make sure they have the same name and are the same length. We will want to check all pertinent variables, and if they need renaming or reformatting, use a RENAME VARIABLES or FORMATS command as described above.

Once we have all our variables in order, we can add one file to the other (concatenate):

ADD FILES
 /FILE = ‘path\to\fileA.sav’
 /FILE = ‘path\to\fileB.sav’.
EXECUTE.

Above, we are adding ‘fileB.sav’ to ‘fileA.sav’ - stacking one on top of the other. All variables in both files will be included, and the ones that are the same will be matched. As before, we’ll save this out as a new file, so we aren’t over-writing any of our original files. And, so we won’t have to do this again if we come back to it.

SAVE OUTFILE = ‘path\to\directory\here.sav’.
EXECUTE.

We can also concatenate from already open and named SPSS datasets, as with merging.

8.2.3 More than two Files

In the examples above, we were only merging or concatenating two data files at one time. If we have more than two data files, we can merge or concatenate them all at once (provided we’ve done our due diligence with variables!):

ADD FILES
 /FILE = ‘path\to\fileA.sav’
 /FILE = ‘path\to\fileB.sav’
 /FILE = ‘path\to\fileC.sav’
 /FILE = ‘path\to\fileD.sav’
 /FILE = ‘path\to\fileE.sav’
 /FILE = ‘path\to\fileF.sav’.
EXECUTE.

or:

MATCH FILES
 /FILE = ‘path\to\data_A.sav’
 /FILE = ‘path\to\data_B.sav’
 /FILE = ‘path\to\data_C.sav’
 /FILE = ‘path\to\data_D.sav’
 /FILE = ‘path\to\data_E.sav’
 /FILE = ‘path\to\data_F.sav’
 /BY id.
EXECUTE.

8.2.4 Dropping or Keeping Variables

Sometimes when we are merging data, we want to drop variables, or only keep certain variables. We can always do that for the individual datasets, or we can do it as part of the merge or concatenation step.

As an example, here I am dropping the ‘Name’ variable from ‘data_A’:

MATCH FILES
 /FILE = ‘path\to\data_A.sav’
 /DROP = Name  /FILE = ‘path\to\data_B.sav’
 /BY id.
EXECUTE.

If I also wanted to drop ‘BMI’ from data_B, I would do this:

MATCH FILES
 /FILE = ‘path\to\data_A.sav’
 /DROP = Name  /FILE = ‘path\to\data_B.sav’
 /DROP = BMI  /BY id.
EXECUTE.

8.2.5 Identifing Data of Origin

Sometimes when merging or concatenating, you might want to keep track of which dataset the cases came from. It may be that different files represent different study administrations, and tracking that is of importance. We can add an indicator variable that will tell us from which file each case came from by using the /IN sub-command:

Concatenating:

ADD FILES
 /FILE = ‘path\to\data_A.sav’
 /IN = session1
 /FILE = ‘path\to\data_B.sav’
 /IN = session2.
EXECUTE.

Merging:

MATCH FILES
 /FILE = ‘path\to\data_A.sav’
 /IN = grpA
 /FILE = ‘path\to\data_B.sav’
 /IN = grpB
 /BY id.
EXECUTE.

In both of these examples, new variables are being created that have a value of “1” if the case came from the dataset directly above the /IN sub-command, and a “0” if it did not.

8.3 One-to-many Merge

There may be times when you have nested data, and information on both the cluster-level variables (e.g. hospitals) and the unit-level variables (e.g. patients). You may wish to merge your datasets so that the hospital-level data is added to the patient-level data (using our example). To do this, we will do a one-to-many merge, where the patient-level file would be the “many” and the hospital-level file would be the “one”.

This time, we will merge based on a variable in both files - perhaps ‘room’, ‘floor’, or ‘department’. We will still need to make sure it is named the same and in the same format in both datasets. The initial command is the same (MATCH FILES), but then we will use a /FILE sub-command for the “many” dataset and a /TABLE sub-command for the “one” dataset:

MATCH FILES
 /FILE = ‘path\to\patients.sav’
 /TABLE = ‘path\to\hospital.sav’
 /BY dept.
EXECUTE.

8.4 Data Restructuring

When data are gathered, they are either gathered in ‘long’ or ‘wide’ format. Data in ‘long’ format have multiple rows for a single observation. Think of a study where you gather participant weight for three consecutive months. Each time you weigh the participant, you have another row. This would be long format, and might look something like this:

id Weight
1 248
1 241
1 243
2 177
2 178
2 177

Data that are in wide format, on the other hand, have only one row per observation, and multiple columns representing the different variables. The data above in wide format would look like so:

id Weight.1 Weight.2 Weight.3
1 248 241 243
2 177 178 177

8.4.1 Long-to-Wide

Going from long format to wide format can also be called going from cases to variables. What you first need consider is what variable do you want to collapse on? Using our example above, we are collapsing the id column, going from 3 entries down to 1. Once we have figured out which variable we are collapsing, we need to sort on that variable:

SORT CASES BY id.
EXECUTE.

Then, we can perform the transformation:

CASESTOVARS
 /ID = id.
EXECUTE.

In the syntax above, /ID is asking what is being collapsed; it can be confusing when your answer is also id! Notice also how SPSS automatically generates variable names: Weight.1, Weight.2, and Weight.3. If you don’t like the names, you can always go in and rename them.

You may also already have some additional information about each of the entries - maybe each weight was gathered in a certain month, or if you are giving a test, scores belong to different test forms. That data may look like this:

id Weight Month
1 248 Jan
1 241 Apr
1 243 July
2 177 Jan
2 178 Apr
2 177 July

We can use this additional information to populate the names of the new variables by using it as an index variable. To do this, we again consider what variable we are collapsing on (here, id) and then sort by it. We also need to sort by our index variable as well:

SORT CASES BY id Month.
EXECUTE.

Then, we will use CASESTOVARS to transform our data from long to wide, but we will add an additional sub-command indicating we should name the new variables using the index variable values.

CASESTOVARS
 /ID = id
 /INDEX = Month.
EXECUTE.

Resulting in:

id Jan Apr July
1 248 241 243
2 177 178 177

Another instance may be if we have a variable that has the same value for a given individual across observations (e.g. gender or SES). Look at the example below:

id gender Weight Month
1 M 248 Jan
1 M 241 Apr
1 M 243 July
2 F 177 Jan
2 F 178 Apr
2 F 177 July

If we were to transform this from long to wide, gender will not get restructured into multiple variables in the new datafile:

id gender Jan Apr July
1 M 248 241 243
2 F 177 178 177

It is also possible to have two variables that have values that differ for a given individual across records:

id gender Weight Month Exercise
1 M 248 Jan 20
1 M 241 Apr 18
1 M 243 July 25
2 F 177 Jan 30
2 F 178 Apr 33
2 F 177 July 31

In this table, there is the addition of Exercise - representing average minutes of exercise in the time period. When we transpose this, both weight and exercise will expand out:

SORT CASES BY id Month.
EXECUTE.

CASESTOVARS
 /ID = id
 /INDEX = Month.
EXECUTE.

Resulting in:

id gender Weight.Jan Weight.Apr Weight.July Exercise.Jan Exercise.Apr Exercise.July
1 M 248 241 243 20 18 25
2 F 177 178 177 30 33 31

8.4.2 Wide-to-Long

Wide-to-long is the exact opposite of long-to-wide - here we are going from variables to cases. When going in this direction, we will need to define the new variable we are making, and let SPSS know what variables from the wide format will make up the new variable. Going in reverse from what we did above, we will start simple, with only the weight columns and id:

id Weight.1 Weight.2 Weight.3
1 248 241 243
2 177 178 177

This time, we do not have to first sort on id (or any other variable), and we can go straight into transformation:

VARSTOCASES
 /MAKE Weight from Weight.1 Weight.2 Weight.3.
EXECTUTE.

In the syntax above, Weight is the new variable name, and it will be composed of information from Weight.1, Weight.2, and Weight.3, giving us the table below.

id Weight
1 248
1 241
1 243
2 177
2 178
2 177

Now, let’s say we had information about when the subjects were weighed:

id Jan Apr July
1 248 241 243
2 177 178 177

Recall that Jan, Apr, and July are indicating information about when the weight was recorded. We now need to parse apart the information: we want to make a Weight column, but also want to have a Month column. We will do that with both a /MAKE sub-command and an /INDEX sub-command:

VARSTOCASES
 /MAKE Weight from Jan Apr July
 /INDEX = Month(Weight)
 /NULL = KEEP.
EXECTUTE.

I also included /NULL = KEEP to tell SPSS that I’d like records included even if there’s a missing value on the specified variable.

After running the syntax above, our table will look like so:

id Weight Month
1 248 Jan
1 241 Apr
1 243 July
2 177 Jan
2 178 Apr
2 177 July

8.5 Transpose All Data

We can also completely flip our entire dataset (rows become columns and columns become rows). As an example, lets say we have four raters evaluating nursing students on their patient assessment skills.

Rater stud_1 stud_2 stud_3 stud_4
Rater1 21 15 12 45
Rater2 34 38 38 32
Rater3 58 42 32 52
Rater4 58 20 14 68

Rather than having raters as rows, we would really prefer to have our students as rows and our raters as columns. To do this, we will run the following syntax:

FLIP VARIABLES = stud_1 stud_2 stud_3 stud_4
 /NEWNAMES = Rater.

/NEWNAMES is specifying the variable values that should be used as the new variable names. After running the syntax, our table now looks like this:

case_lbl Rater1 Rater2 Rater3 Rater4
stud_1 21 34 58 58
stud_2 15 38 42 20
stud_3 12 38 32 14
stud_4 45 32 54 68