How to calculate ROC curves

I will make a short tutorial about how to generate ROC curves and other statistics after running rDock molecular docking (for other programs such as Vina or Glide, just a little modification on the way dataforR_uq.txt file is interpreted will make it work, see below).

I assume all of you are familiar with what ROC curves are, what are they for and how they are made.
Just in case, a very brief summary would be:

ROC curves are graphic representations of the relation existing between the sensibility and the specificity of a test. It is generated by plotting the fraction of true positives out of the total actual positives versus the fraction of false positives out of the total actual negatives.
In our case, we will use it for checking whether a docking program is able to select active ligands with respect to inactive ligands (decoys) and whether it is able to select these active ligands in the top % of a ranked database.
R Library ROCR is mandatory (try with command install.packages("ROCR") in R before downloading from source).

The example selected for this tutorial is a system from the DUD benchmark set, "hivpr" or "hiv protease".
These are the files you will need (all can be downloaded in this Dropbox shared folder):

List of active ligands (ligands.txt)
List of inactive ligands (decoys.txt).
Output file with the docked poses of each ligand with the corresponding docking scores (hivpr_all_results.sd.gz).
R script with all the R commands in this tutorial (ROC_curves.R).

Before getting into R, the resulted docked poses have to be filtered out for only having the best pose for each ligand (the smallest score - or highest in negative value). To do so run:

NOTE: sdsort and sdreport are really useful tools for managing sd formatted compound collections. They are very user-friendly and free to download. They are provided along with rDock software in rDock website.

gunzip hivpr_all_results.sd.gz
sdsort -n -s -fSCORE hivpr_all_results.sd | sdfilter -f'$_COUNT == 1' > hivpr_1poseperlig.sd
#sdsort with -n and -s flags will sort internally each ligand by increasing score and sdfilter will get only the first entry of each ligand.

sdreport -t hivpr_1poseperlig.sd | awk '{print $2,$3,$4,$5,$6,$7}' > dataforR_uq.txt
#sdreport will print all the scores of the output in a tabular format and, with command awk, we will format the results

This dataforR_uq.txt (also in Dropbox folder) file must contain one entry per ligand with the docked scores (what R will use to rank and plot the ROC curves).
Then, run the following commands in R for plotting the ROC curves:

#load ROCR

library(ROCR);

#load ligands and decoys

lig <- unique(read.table("ligands.txt")[,1]);

dec <- unique(read.table("decoys.txt")[,1]);

#load data file from docking

uniqRes <- read.table("dataforR_uq.txt",header=T);

#change colnames

colnames(uniqRes)[1]="LigandName";

#add column with ligand/decoy info

uniqRes$IsActive <- as.numeric(uniqRes$LigandName %in% lig)

#define ROC parameters

#here INTER is selected to compare between ligands using rDock SCORE.INTER
#this could be changed for also running with other programs
predINTERuq <- prediction(uniqRes$INTER*-1, uniqRes$IsActive)

perfINTERuq <- performance(predINTERuq, 'tpr','fpr')

#plot in jpg format with a grey line with theoretical random results

jpeg("hivpr_Rinter_ROC.jpg")

plot(perfINTERuq,main="hivpr - ROC Curves",col="blue")

abline(0,1,col="grey")

dev.off()

Which will give us the following plot:

Afterwards, other useful statistics such as AUC or Enrichment factors can also be calculated:

#AUC (area under the curve)

auc_rdock <- performance(predINTERuq, "auc")

auc.area_rdock <- slot(auc_rdock, "y.values")[[1]]
cat("AUC: \n")
cat(auc.area_rdock)
cat("\n\n")

AUC:

0.7700965

#Enrichment Factors

EF_rdock <- perfINTERuq@y.values[[1]]/perfINTERuq@x.values[[1]]

EF_rdock_1 <- EF_rdock[which(perfINTERuq@x.values[[1]] > 0.01)[1]]

EF_rdock_20 <- EF_rdock[which(perfINTERuq@x.values[[1]] > 0.2)[1]]
cat("Enrichment Factor top1%:\n")
cat(EF_rdock_1)
cat("\n\n")

Enrichment Factor top1%:

11.11817

cat("Enrichment Factor top20%:\n")
cat(EF_rdock_20)
cat("\n\n")

Enrichment Factor top20%:

3.200686

Moreover, a good analysis of these curves is to re-plot them in semilogarithmic scale (x axis in logarithmic scale). This way, one can focus on the early enrichment of the database and have a more detailed view of the selected actives in the top % of all the ligands.

jpeg("hivpr_semilog_ROC.jpg")

rdockforsemilog=perfINTERuq@x.values[[1]]

rdockforsemilog[rdockforsemilog < 0.0005]=0.0005

plot(rdockforsemilog,perfINTERuq@y.values[[1]],type="l",xlab="False Positive Rate", ylab="True Positive Rate",xaxt="n", log="x", col="blue",main="hivpr - Semilog ROC Curves")

axis(1, c(0,0.001,0.01,0.1,1))

x<-seq(0,1,0.001)

points(x,x,col="gray",type="l")

dev.off()

Obtaining the following semi-logarithmic ROC curves:

User login

Recent blog posts