After I joined American Express, I found that Ranks and Plots was perhaps the most widely used procedure across the modelling teams. The normal method for doing ranks and plots (text-mode graphics embedded in the .lst file, or DSSTools) works fine, but I saw some scope for improvement. I found it was, at least theoretically, possible to use the raw data on the buckets to show a better plot, maybe with the regression line as well and a little bit of interactivity, and that this should improve the efficiency of the process greatly. And thus was born the idea of RPAnalyzer.
Simply put, the objective was to separate the graphics part from the SAS ranks and plots... and to provide the graphical output through a much more interactive interface instead.
The first prototype was ready after two busy days and nights on the very first weekend... but it was only a distant cousin of what RPAnalyzer is today. My primary focus after that was ease of use (I was determined to make it fun to use), without sacrificing power. Below I give the goals and some of the features that I plan to introduce... some of which, as you can see, have already made it into the program.
To use this program, you first need to create the .lst file from your modeling dataset. The format of the .lst file is simple: it contains the data divided into 180 buckets by rank of the independent variable and, for each bucket, the statistics on the dependent and independent variables that the normal ranks and plots code usually produces.
To create the .lst file, you should use the following macro...
%macro genrpdata(dsn,depvar,indepvar) ;
  options ls=max ps=max nocenter;
  title1 "Dependent variable: &depvar";
  title2 "Independent variable: &indepvar";

  *proc rank data=&dsn (keep=&depvar &indepvar wgt) groups=180 out=_rank_ ;
  proc rank data=&dsn (keep=&depvar &indepvar) groups=180 out=_rank_ ;
    var &indepvar ;
    ranks RankGrp ;
  run;

  proc means data=_rank_ noprint missing;
    var &depvar &indepvar ;
    format &indepvar best15.14;
    class RankGrp;
    ways 1;
    *weight wgt;
    output out=_temp_(drop=minbad maxbad _TYPE_ _FREQ_)
      mean=mean_dep mean_indep
      n=nobs
      min=minbad min_dep
      max=maxbad max_dep;

  proc print data=_temp_;
  run;
%mend;
This macro takes three parameters as input: the name of the dataset, the dependent variable, and the independent variable.
Then create a SAS program that calls the macro once for each independent variable. The code should look something like this (it is only an example and should be customized for every project):
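To make clearer what the macro hands to the program, here is a rough Python sketch of the same bucketing idea. It is not part of RPAnalyzer; the function names `rank_groups` and `bucket_stats` are made up for illustration. It assumes the documented PROC RANK behaviour for `GROUPS=` (group = floor(rank * groups / (n + 1)), with tied values sharing their average rank), treats `None` as a missing value, and assumes the dependent variable has no missing values.

```python
import math
from collections import defaultdict

def rank_groups(values, groups=180):
    """Assign each non-missing value a group number 0..groups-1.

    Uses floor(rank * groups / (n + 1)); tied values share their
    average rank. Missing values (None) get group None.
    """
    present = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n = len(present)
    result = [None] * len(values)
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and present[end + 1][0] == present[pos][0]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2  # average of the tied 1-based ranks
        group = math.floor(mean_rank * groups / (n + 1))
        for _, i in present[pos:end + 1]:
            result[i] = group
        pos = end + 1
    return result

def bucket_stats(dep, indep, groups=180):
    """Per-bucket statistics; the missing-value bucket (None) comes first."""
    buckets = defaultdict(list)
    for g, d, x in zip(rank_groups(indep, groups), dep, indep):
        buckets[g].append((d, x))
    ordered = sorted(buckets, key=lambda g: (g is not None, -1 if g is None else g))
    stats = []
    for g in ordered:
        rows = buckets[g]
        xs = [x for _, x in rows if x is not None]
        stats.append({
            "RankGrp": g,
            "nobs": len(rows),
            "mean_dep": sum(d for d, _ in rows) / len(rows),
            "mean_indep": sum(xs) / len(xs) if xs else None,
            "min_indep": min(xs) if xs else None,
            "max_indep": max(xs) if xs else None,
        })
    return stats
```

Each dictionary returned by `bucket_stats` corresponds roughly to one printed row of the .lst file: a bucket number, the number of observations in it, and the bucket means and range.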
libname in '/world/records';
%include 'genrpdata_macro';

%genrpdata(in.transactions,fraud,age);
%genrpdata(in.transactions,fraud,income);
%genrpdata(in.transactions,fraud,amount);
Running this will generate the .lst file readable by the program!
You should have the .lst file (generated in step 1) ready on your local hard disk (C: drive).
When you start the program (i.e. run RPAnalyzer.exe) it will ask for a file to load. In the drop-down box labeled 'Files of type', please select 'SAS generated lst file'. Now you can browse to the folder where the file is and load it into the program.
You may not want all the features while you are working, but for now, turn all the features on in the Display menu (a feature is turned on if you see the check mark beside it on the menu).
Now let's play around with the plot a bit...
There are a few more features in the program... but I suppose the above tips will get you started!
File Menu
Load Lst File opens a dialog box from which you can load a .lst file that was output from SAS using the macro provided with this program.
Open lets you open a file that was saved from this program itself.
Save lets you save your current work.
Save As lets you save your current work in the file you specify.
Export .csv exports the current fixing information in Comma Separated Values (CSV) format, which can be read into Excel.
Export .sas snippet generates SAS code that can be included inside a data step to carry out the fixing of the variables.
Export HTML report generates an HTML report of the current fixing information.
Exit exits the program.
Display Menu
Regression Line turns on or off the display of the regression line.
70% Confidence Region turns on or off the display of the 70% confidence region around the regression line.
Missing Value turns on or off the display of the adjustable missing value point and also the missing value line.
Min/Max Lines turns on or off the display of the adjustable min/max lines.
Enable Bucket Expansion turns on or off the bucket expansion feature. When this is on, you can hold the mouse over a point for some time to display a subplot of the points that make up that point.
Color shows a submenu to set the color intensity, and also has an option to turn the display into grayscale.
Show Lorenz Curve overlays the bivariate Lorenz curve on the plot.
Use Logodds should be turned on for a 0-1 valued dependent variable, i.e. one on which logistic regression is to be performed, and should be turned off for OLS.
Variables Menu
View Text toggles between the plot and text showing the ranks in a SAS-like format.
Update and Add appends data on new variables and updates data on variables that were already read before. It reads in a .lst file generated using the macro provided with the program.
Ranks Menu
Emulate n ranks is a very powerful feature which lets you change the number of ranks plotted or displayed, on the fly.
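One way to picture the Emulate n ranks and Use Logodds options is the Python sketch below. It is only an illustration under assumptions, not the program's actual code: it assumes that fewer ranks can be emulated by merging consecutive fine buckets and weighting each bucket mean by its number of observations, and that the log-odds view plots log(p / (1 - p)) of the bucket mean p. The function names `emulate_ranks` and `log_odds` and the clipping constant `eps` are hypothetical.

```python
import math

def emulate_ranks(buckets, n):
    """Merge consecutive fine buckets into at most n coarser ones.

    buckets: list of (nobs, mean_dep, mean_indep) tuples ordered by rank.
    Returns tuples of the same shape, with means weighted by nobs.
    """
    k = len(buckets)
    merged = []
    for j in range(n):
        chunk = buckets[j * k // n:(j + 1) * k // n]
        total = sum(b[0] for b in chunk)
        if total == 0:
            continue  # nothing to merge for this coarse bucket
        mean_dep = sum(b[0] * b[1] for b in chunk) / total
        mean_indep = sum(b[0] * b[2] for b in chunk) / total
        merged.append((total, mean_dep, mean_indep))
    return merged

def log_odds(p, eps=1e-6):
    """Log-odds of a bucket mean p, clipped away from 0 and 1 (assumed safeguard)."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))
```

Because 180 is divisible by 10, 12, 15, 18, 20, 30 and so on, many common choices of n merge whole buckets exactly. Note that merging a fixed number of buckets is only an approximation of re-ranking the raw data when the fine buckets hold unequal numbers of observations.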
The About box (in the Help menu) contains some statements and thanks to the people whose constant encouragement and support have driven this project throughout! Here on this webpage I wish to thank them once more, and also all the other people in American Express whose cooperation was invaluable for the project.
I hope this program is useful to you!