RPAnalyzer 1.30

Latest version for AMEX RIB users in this folder...

RPAnalyzer

After I joined American Express, I found that Ranks and Plots was perhaps the most widely used procedure across the modelling teams. The normal method for doing ranks and plots (text-mode graphics embedded in the .lst file, or DSSTools) works fine, but I saw some scope for improvement. It seemed, at least theoretically, possible to use the raw bucket data to draw a better plot, perhaps with the regression line and a little interactivity, and this should greatly improve the efficiency of the process. And thus was born the idea of RPAnalyzer.

Objective

Simply put, the objective was to tear off the graphics part from the SAS ranks and plots... and use a much more interactive interface for providing graphical output instead.

The first prototype was ready after two busy days and nights on the first weekend itself... but it was only a distant cousin of what RPAnalyzer is today. My primary focus after that was ease of use (I was determined to make it fun to use), without sacrificing power. Below I give the goals and some of the features that I plan to introduce... some of which, as you can see, have already made it into the program.

Goals

Fanciful Wishlist

Walkthrough

Step 1: Preparing the input

To use this program, you first need to create the .lst file from your modeling dataset. The format of the .lst file is simple: it contains the data divided into 180 buckets by rank of the independent variable and, for each bucket, the statistics that the normal ranks and plots code generally reports (the means of the dependent and independent variables, the number of observations, and the minimum and maximum of the independent variable).

To create the dataset, you should use the following macro...

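%macro genrpdata(dsn,depvar,indepvar) ;

/* Widest line size and longest page size, so printed rows are not wrapped or split across pages */
options ls=max ps=max nocenter;

/* The titles record which variables the listing that follows belongs to */
title1 "Dependent variable: &depvar";
title2 "Independent variable: &indepvar";

/* Split the data into 180 buckets by rank of the independent variable */
*proc rank data=&dsn (keep=&depvar &indepvar wgt) groups=180 out=_rank_ ;
proc rank data=&dsn (keep=&depvar &indepvar) groups=180 out=_rank_ ;
var &indepvar ;
ranks RankGrp ;
run;

/* Summarise each bucket; the missing option keeps observations with a missing
   independent variable as a bucket of their own.
   Statistics are matched to the var list in order, so min_dep and max_dep
   hold the bucket minimum and maximum of the independent variable. */
proc means data=_rank_ noprint missing;
var &depvar &indepvar ;
format &indepvar best15.14;
class RankGrp; ways 1;
*weight wgt;
output out=_temp_(drop=minbad maxbad _TYPE_ _FREQ_)
mean=mean_dep mean_indep n=nobs
min=minbad min_dep max=maxbad max_dep;

/* Print the bucket table to the .lst file */
proc print data=_temp_;
run;

%mend;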

This macro takes three parameters: the name of the dataset, the dependent variable, and the independent variable.

Then create a SAS program that calls the macro once for each independent variable. The code should look something like this (it is only an example and should be customized for every project):

libname in '/world/records';

%include 'genrpdata_macro';

%genrpdata(in.transactions,fraud,age);
%genrpdata(in.transactions,fraud,income);
%genrpdata(in.transactions,fraud,amount);

Running this will generate the .lst file readable by the program!
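For readers who want to see what each row of the listing holds, here is a rough Python sketch of the same bucketing. It is not part of RPAnalyzer or of the macro. The function names `rank_groups` and `bucket_summary` are made up for this illustration, the grouping rule floor(rank * groups / (n + 1)) with tied values sharing their mean rank is an assumption about how PROC RANK assigns groups, and missing values are ignored for brevity.

```python
import math
from collections import defaultdict


def rank_groups(values, groups=180):
    """Assign each value a group number from 0 to groups - 1.

    Assumed rule: floor(rank * groups / (n + 1)), where tied values
    share the mean of their 1-based ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return [math.floor(r * groups / (n + 1)) for r in ranks]


def bucket_summary(x, y, groups=180):
    """Bucket by rank of x and summarise each bucket.

    The keys mirror the column names the macro prints; note that
    min_dep and max_dep are the minimum and maximum of x.
    """
    buckets = defaultdict(list)
    for g, xi, yi in zip(rank_groups(x, groups), x, y):
        buckets[g].append((xi, yi))
    rows = []
    for g in sorted(buckets):
        xs = [p[0] for p in buckets[g]]
        ys = [p[1] for p in buckets[g]]
        rows.append({
            "RankGrp": g,
            "mean_dep": sum(ys) / len(ys),
            "mean_indep": sum(xs) / len(xs),
            "nobs": len(ys),
            "min_dep": min(xs),
            "max_dep": max(xs),
        })
    return rows
```

Calling `bucket_summary(age, fraud)` on two equal-length lists would give one dictionary per bucket, in the same spirit as the rows printed by proc print.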

Step 2: Using the program

You should have the .lst file (generated in Step 1) ready on the local hard disk (C: drive).

When you start the program (i.e. run RPAnalyzer.exe) it will ask for a file to load. In the drop box labeled 'Files of type' please select 'SAS generated lst file'. Now you can browse to the folder where the file is and load it into the program.

You may not want all the features while you are working, but for now, turn all the features on in the Display menu (a feature is turned on if you see the check mark beside it on the menu).

Now let's play around with the plot a bit...

  1. The top left drop box is the list of independent variables. You can also scroll through the variables with your mouse wheel, or in case you have a laptop and don't have a mouse wheel, Ctrl+PgUp and Ctrl+PgDn keys will be useful in quick scrolling. When the dropbox is selected (Alt+P will select it), typing in a variable name will take you to that variable.
  2. If you are using 0-1 data to model a probability, you might want to plot the log-odds instead of the actual y values. You should then turn on "Use Logodds" in the Display menu. This also acts as a cue for the program to provide some extra information pertaining to 0-1 valued data only.
  3. The size of each point is relative to the number of observations in its bucket. When you move your mouse over a point, the region spanning the minimum to maximum range of that point is immediately highlighted. The information on the currently selected point also appears beside it.
  4. When you hold the mouse over a point for some time, a subplot might appear if you have checked the "Enable Bucket Expansion" on the Display menu. This should be particularly useful on the first (left most) and last (right most) buckets for fixing the minimum/maximum.
  5. The "Abs" transformation, coupled with the Centre, will put a V-shape transformation on the variable with the wedge located at the centre.
  6. The "Box Cox" transformation will induce a power of Lambda transformation if Lambda is not 0, and will induce a log transformation if Lambda is 0. This transformation also depends on the centre, which serves as a shift of location here.
  7. If you have missing values in the data, and have "Missing Value" turned on in Display, the mean of the dependent variable over the observations where the independent variable is missing will show as a red line. The estimated missing value is denoted by a red point on this line. There are several ways to fix the missing value: you could enter a value in the "Missing" editbox manually, or you could drag the red missing point with your mouse on screen. You could also press the "Reg." button, which will fix the missing value based on the bivariate regression (honouring the current transformations). This should be the preferred way to fix missing values unless there are other reasons based on the business meaning of the variable.
  8. Clicking "View Text" in the Variables menu will show the ranks as text instead of the plot. Clicking it again will take you back to the plots. In the text mode, you can see the normal SAS-like bucketing, and some other statistics. Most importantly, if you have chosen to use a transformation, you can see the exact formula it induces, clarifying what exactly the transformation is doing.
  9. One of the most useful features (in my humble opinion) is the Ranks menu. From here you can change the number of buckets on the fly. The program does this simply by dynamically combining the 180 buckets which are input to it, using a logic consistent with SAS ranking.
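To make tips 5 and 6 concrete, here is a minimal Python sketch of the two transformations. It is an illustration only, not the program's code: the function names are invented, the centre is assumed to enter as `x - centre`, the Box Cox case is assumed to be the plain power described above (no rescaling), and the sketch simply refuses non-positive shifted values. The exact formula the program applies is the one shown in "View Text".

```python
import math


def abs_transform(x, centre=0.0):
    """V-shaped transformation with its wedge at the centre."""
    return abs(x - centre)


def box_cox(x, lam, centre=0.0):
    """Power transformation after shifting by the centre.

    Assumption: returns (x - centre) ** lam when lam is not 0,
    and log(x - centre) when lam is 0.
    """
    shifted = x - centre
    if shifted <= 0:
        # Logs and fractional powers need a positive argument.
        raise ValueError("x - centre must be positive")
    if lam == 0:
        return math.log(shifted)
    return shifted ** lam
```

For example, with a centre of 5 the values 3 and 7 both map to 2 under `abs_transform`, which is what produces the V shape.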

There are a few more features in the program... but I suppose the above tips will get you started!
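One possible reading of the "Reg." button in tip 7 can be sketched in Python as follows. This is an assumption, not the program's documented method: the sketch fits a straight line to the bucket means, weighted by bucket size, and then picks the x value at which that line reaches the mean of the dependent variable among the missing observations (the red line). The function names are hypothetical, and transformations are left out.

```python
def weighted_fit(xs, ys, ws):
    """Weighted least-squares line y = intercept + slope * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def missing_from_regression(xs, ys, ws, y_missing):
    """x value where the fitted line equals y_missing.

    xs, ys: bucket means of the independent and dependent variable
    ws:     bucket sizes (nobs)
    y_missing: mean of the dependent variable where x is missing
    """
    intercept, slope = weighted_fit(xs, ys, ws)
    if slope == 0:
        raise ValueError("a flat regression line cannot be inverted")
    return (y_missing - intercept) / slope
```

With "Use Logodds" on, the same idea would presumably be applied to the log-odds rather than the raw means.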

Step 3: Getting the output

You can get output in three different formats (CSV, a SAS snippet, or an HTML report) using the export items in the File menu, which are described below.

Explanation of the Menus

File Menu

Load Lst File opens a dialog box from which you can load a .lst file which has been output from SAS using the macro provided with this program.

Open lets you open a file which was saved from this program itself.

Save lets you save your current work.

Save As lets you save your current work in the file you specify.

Export .csv exports the current fixing information into Comma Separated Values (CSV) format, which can be read in through Excel.

Export .sas snippet generates SAS code which can be included inside the data step to carry out the fixing of the variables.

Export HTML report generates an HTML report to depict the current fixing information.

Exit exits the program.

Display Menu

Regression Line turns on or off the display of the regression line.

70% Confidence Region turns on or off the display of 70% confidence region around the regression line.

Missing Value turns on or off the display of the adjustable missing value point and also the missing value line.

Regression Line turns on or off the display of the adjustable min/max lines.

Enable Bucket Expansion turns on or off the bucket expansion feature. When this is on, you can hold the mouse on a point for some time to display a subplot of points constituting the point.

Color shows a submenu to set the color intensity, also has an option to turn the display into grayscale.

Show Lorenz Curve overlays the display of the bivariate Lorenz curve on the plot.

Use Logodds should be turned on for a 0-1 valued dependent variable, i.e. one on which logistic regression is to be performed, and should be turned off for OLS.
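As a small illustration of what plotting log-odds means, the Python sketch below converts a bucket's mean of a 0-1 dependent variable (a proportion p) into log(p / (1 - p)). The function name is made up, and returning `None` when p is exactly 0 or 1 is a choice of this sketch; how the program treats such buckets is not stated here.

```python
import math


def log_odds(p):
    """Log-odds of a proportion p, or None where it is undefined."""
    if p <= 0.0 or p >= 1.0:
        return None
    return math.log(p / (1.0 - p))
```

A bucket where half the observations are 1 sits at a log-odds of 0; buckets with higher rates sit above it and lower rates below it.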

Variables Menu

View Text toggles display of plot, or text showing the ranks in a SAS-like format.

Update and Add appends data on new variables, and updates data on variables which were already read before. It reads in a .lst file generated using the macro provided with the program.

Ranks Menu

Emulate n ranks is a very powerful feature which lets you instantaneously change the number of ranks plotted or displayed, on the fly.
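The idea of combining fine buckets into coarser ones can be sketched in Python as below. This is not the program's actual logic, which is not shown here; the function name and the mapping `g * n // fine` are assumptions. One thing that can be said with certainty: if the fine groups were assigned as floor(rank * 180 / (N + 1)), then for any n that divides 180 this mapping gives exactly floor(rank * n / (N + 1)), i.e. the group that ranking directly into n buckets would give. For other values of n it is only an approximation.

```python
def merge_buckets(rows, n, fine=180):
    """Combine fine buckets into n coarser buckets.

    rows: dictionaries with RankGrp (0 to fine - 1), nobs, mean_dep,
          mean_indep, min_dep and max_dep, as printed by the macro.
    Means are recombined as nobs-weighted averages.
    """
    merged = {}
    for r in rows:
        g = r["RankGrp"] * n // fine  # coarse group for this fine bucket
        m = merged.get(g)
        if m is None:
            m = merged[g] = {"nobs": 0, "sum_dep": 0.0, "sum_indep": 0.0,
                             "min_dep": r["min_dep"], "max_dep": r["max_dep"]}
        m["nobs"] += r["nobs"]
        m["sum_dep"] += r["mean_dep"] * r["nobs"]
        m["sum_indep"] += r["mean_indep"] * r["nobs"]
        m["min_dep"] = min(m["min_dep"], r["min_dep"])
        m["max_dep"] = max(m["max_dep"], r["max_dep"])
    out = []
    for g in sorted(merged):
        m = merged[g]
        out.append({
            "RankGrp": g,
            "nobs": m["nobs"],
            "mean_dep": m["sum_dep"] / m["nobs"],
            "mean_indep": m["sum_indep"] / m["nobs"],
            "min_dep": m["min_dep"],
            "max_dep": m["max_dep"],
        })
    return out
```

Because only bucket counts, means and ranges are needed, the merge never has to go back to the raw data, which is what makes changing the number of ranks instantaneous.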

That's it...

The About box (in the Help menu) contains some statements and thanks to the people whose constant encouragement and support have driven this project throughout! Here on this webpage I wish to thank them once more, along with all the other people at American Express whose cooperation was invaluable to the project.

I hope this program is useful to you!