LOCF Processing

[This site is not connected with the SAS Institute]

(Last updated: 26 May 2003 - test pack added as base for user validation)

Introduction

"LOCF" stands for "Last Observation Carried Forward", right? Well actually it does and it doesn't. It doesn't mean last SAS dataset observation carried forward. It means last non-missing value carried forward. It is the values of individual measures that are the "observations" in this case. And if you have multiple variables containing these values then they will be carried forward independently. If you are a SAS programmer but not a statistician it is very easy to fall into the trap of thinking the "observation" being carried forward is an observation in a SAS dataset. It isn't. And if you code on that basis then you will be making a mistake. It is the non-missing values that are being carried forward and they are being carried forward independently at that. This makes things an awful lot more complicated.

I have seen code to do this processing using arrays. In one sense this is a good idea since it is the most efficient way of handling it. However, the solution using this method never struck me as being elegant. I feel the array method springs from an incorrect assumption of what can be achieved using merges.

LOCF Merging

For some reason, people think a merge as in the following can not be done.

Values to carry forward (values dataset)
 date     value
01jan02     1
01mar02     2
01may02     3

Dates to merge with (dates dataset)
 date
01jan02
01feb02
01mar02
01apr02
01may02
01jun02

Resulting dataset (final dataset)
 date     value
01jan02     1
01feb02     1
01mar02     2
01apr02     2
01may02     3
01jun02     3

Well, it can't be done using only the variables shown above. But what we can do is add a sequence number to the "dates" dataset and increment it when we get a match on date with the "values" dataset like this:

data dates;
  retain _seq 0;
  merge dates values(in=_val);
  by date;
  if _val then _seq=_seq+1;
run;

And now we have a sequence number that retains its value corresponding to each observation in the values dataset. So if we simply add the same sequence number to the "values" dataset like this:

data values;
  set values;
  _seq=_n_;
run;

.... we can merge like this:

data final;
  merge dates values;
  by _seq;
  drop _seq;
run;

And this gives us the type of merge we need for LOCF merging. If you need further convincing, then here is some code you can copy and paste and run under sas.

data values;
  date='01jan2002'd;value=1;output;
  date='01mar2002'd;value=2;output;
  date='01may2002'd;value=3;output;
  format date date7.;
run;

data dates;
  date='01jan2002'd;output;
  date='01feb2002'd;output;
  date='01mar2002'd;output;
  date='01apr2002'd;output;
  date='01may2002'd;output;
  date='01jun2002'd;output;
  format date date7.;
run;

data dates;
  retain _seq 0;
  merge dates values(in=_val keep=date);
  by date;
  if _val then _seq=_seq+1;
run;

data values;
  set values;
  _seq=_n_;
run;

data final;
  merge dates values;
  by _seq;
  drop _seq;
run;

options nocenter;
proc print data=final;
run;

And if you go to the trouble to run it under sas then this is what you get for a print of the "final" dataset:

Obs       date    value

 1     01JAN02      1
 2     01FEB02      1
 3     01MAR02      2
 4     01APR02      2
 5     01MAY02      3
 6     01JUN02      3

The above is what we want, I am sure you will agree. So at least, in principle, it is possible to merge values in with a time sequence using a sequence.variable we have created. The above is very simple test data, of course, but we can use the same principle in more complex cases. In reality we will have an owner variable such as a subject-id that changes. There may be subcategories like a lab parameter. We would probably use visit numbers instead of dates. But the same principle of adding a sequence number to both datasets and merging using that sequence number could still apply. And it is this principle I have used in my macro solution for this sort of processing.

More Complex LOCF Merging

Let's drop the use of dates and assume we have visit numbers that ascend with time. Also that we have multiple subjects and that we are dealing with lab data with multiple measures. Suppose we have datasets like the following:

(values dataset)
subj  param  visit  value
 AA    RBC     2      22
 AA    RBC     4      44
 AA    WBC     3      33
 BB    ALB     3      55

(visits dataset)
subj  param  visit
 AA    RBC     2
 AA    RBC     3
 AA    RBC     4
 AA    RBC     5
 AA    WBC     2
 AA    WBC     3
 AA    WBC     4
 AA    WBC     5
 BB    ALB     2
 BB    ALB     3
 BB    ALB     4
 BB    ALB     5

Now before we decide on how to merge these datasets, take a look at the "visits" dataset. Visits 2-5 are the "on-treatment" visits right up to the planned end point which is visit 5. It is for these visits we want to carry forward our last observation. Visit 1 is not shown because it is likely a baseline value and because it was not a measure taken while on and as a result of treatment we would probably not want to carry forward these baseline values. So Visits 2-5 are the "eligible" visit values for carrying forward. Now look at the values dataset. Again these only have the "eligible" visits and their values and these will be the values we will be carrying forward where required. Note that the subj/param pairings in the "visits" dataset match those in the values dataset. It makes sense to only have subj/param pairings to match values otherwise you would have nothing to carry forward. But this is an assumption. You should understand the assumptions being made. Sometimes the assumptions will not be approppriate for a study you are working on and you might have to change the logic.

Now consider how we are going to approach the merge of the two datasets. Look at the data for subject AA. We would not want to carry forward values of RBC into the following WBC so obviously we will be merging using matching on the subj/param pairing. We will have to reset our sequence count at the start of each subj/param grouping and increment from there. And when we add the sequence number to the "values" dataset we will have to do the same so that when we merge using subj/param/_seq we will get the correct matching.

First read through the following code until you understand what it is doing. The first two data steps generate the observations as shown above. The next two data steps add the sequence number and the final data step merges the data before printing it. Then copy and paste the code, run it under sas and convince yourself that it is producing the correct output:

data values;
  subj='AA';
  param='RBC';
  visit=2;value=22;output;
  visit=4;value=44;output;
  param='WBC';
  visit=3;value=33;output;
  subj='BB';
  param='ALB';
  visit=3;value=55;output;
run;

data visits;
  subj='AA';
  do param='RBC', 'WBC';
    do visit=2 to 5;
      output;
    end;
  end;
  subj='BB';
  param='ALB';
  do visit=2 to 5;
    output;
  end;
run;

data visits;
  retain _seq 0;
  merge visits values(in=_val keep=subj param visit);
  by subj param visit;
  if first.param then _seq=0;
  if _val then _seq=_seq+1;
run;

data values;
  retain _seq 0;
  set values;
  by subj param;
  if first.param then _seq=0;
  _seq=_seq+1;
run;

data final;
  merge values visits;
  by subj param _seq;
  drop _seq;
run;

options nocenter;
proc print data=final;
run;

The Macro Solution - %locf

The macro I wrote to do this processing uses the method of merging shown above. It will operate on whatever data you feed into the macro. You should only feed it data that is eligible for carrying forward (for example, you would not feed in baseline visit data if baseline values were not eligible for carrying forward). It is your responsibility to prepare the data beforehand if needed. Non-missing for numeric values means not equal to missing value. Non-missing for a character variable means not a blank. If "0" signifies missing for some of your numeric variables then the dataset you feed in to the %locf macro will have to be an amended one with the 0's set to missing values. The same applies to the character variables. You have to do all the data preparation. The %locf macro does none. And if you want only one or two timepoints in the output dataset then you have to make that selection. The %locf macro will not do it for you. This way I can keep the macro as simple as possible and that way it should stay maintainable.

You should regard this macro as a solution rather than the solution. You need to understand what it is doing and what assumptions are being made. It may or may not match with your study requirements. You might end up writing a new macro that fits your site standards better, based on this macro. I can only go so far with it and keep it as generic and as simple as possible. If you want to develop it further then feel free. To view the latest version, click here. To view its test pack click here testpack.

Go back to the home page.

E-mail the macro and web site author.