Preparing the data
The data set consists of measurements for three classes of drugs: protease inhibitors (PIs), nucleoside reverse transcriptase (RT) inhibitors (NRTIs), and nonnucleoside RT inhibitors (NNRTIs). Protease and reverse transcriptase are two enzymes in HIV-1 that are crucial to the function of the virus. This data set seeks associations between mutations in the HIV-1 protease and drug resistance to different PI type drugs, and between mutations in the HIV-1 reverse transcriptase and drug resistance to different NRTI and NNRTI type drugs (The raw data are saved asgene_df).
In order to evaluate our results, we compare with the treatment-selected mutation panels created by (Rhee et al. 2005), which can be viewed as the ground true. These panels give lists of HIV-1 mutations appearing more frequently in patients who have previously been treated with PI, NRTI, or NNRTI type drugs, than in patients with no previous exposure to that drug type. Increased frequency of a mutation among patients treated with a certain drug type implies that the mutation confers resistance to that drug type (The raw data are saved astsm_df).
To simplify the analysis, in this project we will confine our attention to the PI drugs.
drug_class =’PI’ # Possible drug types are’PI’, ‘NRTI’, and’NNRTI’.
Fetching and cleaning the data
First, we download the data and read it into data frames.
base_url = ‘http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006’ gene_url = paste (base_url,’DATA’, paste0 (drug_class,’_DATA.txt’), sep=’/’) tsm_url = paste (base_url, ‘MUTATIONLISTS’, ‘NP_TSM’, drug_class, sep=’/’)
gene_df = read.delim (gene_url, na.string = c (‘NA’, ”), stringsAsFactors = FALSE) tsm_df = read.delim (tsm_url, header = FALSE, stringsAsFactors = FALSE) names (tsm_df) = c (‘Position’, ‘Mutations’)
A small sample of the data is shown below.
head (gene_df, n=6)
IsolateName PseudoName MedlineID APV ATV IDV LPV NFV RTV SQV
1 CA10676 CA622 10839657 2.3 NA 32.7 NA 23.4 51.6 37.
2 CA37880 CA622 15995959 76.0 NA 131.0 200.0 50.0 200.0 156.
3 CA9984 CA624 11897594 2.8 NA 12.0 NA 100.0 41.0 145.
4 CA17003 CA628 15995959 6.5 9.2 2.1 5.3 5.0 36.0 13.
5 CA10670 CA634 10839657 8.3 NA 100.0 NA 161.1 170.2 100.
6 CA42683 CA634 15995959 82.0 75.0 400.0 400.0 91.0 400.0 400.
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P
1 – – – – – – – – – I – – – – – – – – – –
2 – – – – – – – – – F L – – – – A – – I –
3 – – – – – – – – – – – – V – – – – – – R
## 4 – – – – – – – – – I – – – – – – – – – –
## 5 – – – – – – – – – I – – – – – – – – – R
## 6 – – – – – – – – – I – – – – – – – – – R
## P21 P22 P23 P24 P25 P26 P27 P28 P29 P30 P31 P32 P33 P34 P35 P36 P37 P
## 1 – – – – – – – – – – – – – – – – DN –
## 2 – – – – – – – – – – – – F – – – – –
## 3 – – – – – – – – – N – – IL – D I – –
## 4 – – I – – – – – – – – – – – – – – –
## 5 – – F – – – – – – – – – – – D I D –
## 6 – – – – – – – – – – – – F T D I D –
## P39 P40 P41 P42 P43 P44 P45 P46 P47 P48 P49 P50 P51 P52 P53 P54 P55 P
## 1 – – K – – – RK I – – – – – – – – – –
## 2 – – K – – – – I – – – – – – – V – –
## 3 – – – – – – – – – – – – – – – V – –
## 4 – – – – – – – L – – – – – – – – – –
## 5 – – – – – – – I – V – – – – – T – –
## 6 – – – – T – – I – V – V – – – S – –
## P57 P58 P59 P60 P61 P62 P63 P64 P65 P66 P67 P68 P69 P70 P71 P72 P73 P
## 1 – – – – – – P – – – – – – – LV – S –
## 2 – – – – – – P – – – – – – – V – S –
## 3 – – – – – – P – – – – – – – TA – – –
## 4 – E – E EK – P – – – – – – – T T – –
## 5 – – – – – V P – – – – – – – V X – –
## 6 – – – – – V P – – – – – – – V V – –
## P75 P76 P77 P78 P79 P80 P81 P82 P83 P84 P85 P86 P87 P88 P89 P90 P91 P
## 1 – – I – – – – T – V V – – – – M – –
## 2 – – – – – – – T – V – – – – V M S –
## 3 – – – – – – – – – – – – – D – M – –
## 4 – – – – – – – A – V – – – – – M – –
## 5 – – I – – – – A – – – – – – – – – –
## 6 – – I – – – – A – – V – – – – – – –
## P93 P94 P95 P96 P97 P98 P
## 1 L – – – – – –
## 2 L – – – – – –
## 3 – – – – – -.
## 4 – – – – – – –
## 5 L – – – – – –
## 6 L – – – – – –
head (tsm_df, n=6)
Position Mutations
1 10 F R
2 11 I
3 20 I T V
4 23 I
5 24 I
6 30 N
Intsm_df, the variablePositiondenotes the position of the mutations that are associated with drug- resistance, whileMutationsindicating the mutation type.
The gene data table has some rows with error flags or nonstandard mutation codes. For simplicity, we remove all such rows.
# Returns rows for which every column matches the given regular expression.
grepl_rows <- function (pattern, df) {
cell_matches = apply (df, c (1,2), function (x) grepl (pattern, x))
apply (cell_matches, 1, all)
}
pos_start = which ( names (gene_df) == 'P1')
pos_cols = seq.int (pos_start, ncol (gene_df))
valid_rows = grepl_rows ('^(\\.|-|[A-Zid]+)
Preparing the regression matrix
We now construct the design matrix X and matrix of response vectors Y. The features (columns of X ) are given by mutation/position pairs. Define $$ X_{i,j} = 1 if the i th patient has the j th mutation/position pair and 0 otherwise Y_{i,k} = resistance of patient i to drug k. $$ For example, in the sample for PI type drugs, three different mutations (A, C, and D) are observed at position 63 in the protease, and so three columns of X (named P63.A, P63.C, and P63.D) indicate the presence or absence of each mutation at this position. # Flatten a matrix to a vector with names from concatenating row/column names. flatten_matrix <- function (M, sep=’.’) { x <- c (M) names (x) <- c ( outer ( rownames (M), colnames (M), function (…) paste (…, sep=sep))) x }
# Construct preliminary design matrix.
muts = c (LETTERS,'i','d')
X = outer (muts, as.matrix (gene_df[,pos_cols]), Vectorize (grepl))
X = aperm (X, c (2,3,1))
dimnames (X)[[3]] <- muts
X = t ( apply (X, 1, flatten_matrix))
mode (X) <- 'numeric'
# Remove any mutation/position pairs that never appear in the data.
X = X[, colSums (X) != 0]
# Extract response matrix.
Y = gene_df[,4 : (pos_start-1)]
An excerpt of the design matrix is shown below. By construction, every column contains at least one 1, but
the matrix is still quite sparse with the relative frequency of 1s being about 0.025.
library (DT)
datatable ( data.frame (X)[1 : 10, ], options = list (scrollX=T, pageLength = 10))
Show 10 entries Search:
Showing 1 to 10 of 10 entries Previous 1 Next
P4.A P12.A P13.A P16.A P20.A P22.A P28.A P37.A P51.A P54.A P63.A P71.A
1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 1 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 1 0 0 0
9 0 0 0 1 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 0
The response matrix looks like:
head (Y, n=6)
APV ATV IDV LPV NFV RTV SQV
1 2.3 NA 32.7 NA 23.4 51.6 37.
2 76.0 NA 131.0 200.0 50.0 200.0 156.
3 2.8 NA 12.0 NA 100.0 41.0 145.
4 6.5 9.2 2.1 5.3 5.0 36.0 13.
5 8.3 NA 100.0 NA 161.1 170.2 100.
6 82.0 75.0 400.0 400.0 91.0 400.0 400.
There are 7 PI-type drugs: APV, ATV, IDV, LPV, NFV, RTV, and SQV.
Selecting drug-resistance-associated mutations
In this step, you need to build an appropriate linear regression model, and use the method we discussed in lecture to select mutations that may associated with drug-resistance. For 7 PI-type drugs, you need to run a seperate analysis for each drug.
Notice that there are some missing values.
Before building the model, we need to perform some final pre-processing steps. We remove rows with missing values (which vary from drug to drug) and we then further reduce the design matrix by removing predictor columns for mutations that do not appear at least three times in the sample. Finally, for identifiability, we remove any columns that are duplicates (i.e. two mutations that appear only in tandem, and therefore we cannot distinguish between their effects on the response).
selection <- function (X, y, alpha) { # Remove patients with missing measurements. missing = is.na (y) y = y[! missing] X = X[! missing,]
# Remove predictors that appear less than 3 times.
X = X[, colSums (X) >= 3]
# Remove duplicate predictors.
X = X[, colSums ( abs ( cor (X) - 1) < 1e-4) == 1]
# Buid an appropriate linear regression model
# Select the mutations that may associated with drug-resistance
#######################
#######################
##### Your Code #######
#######################
#######################
}
alpha = 0.05 # the nominal FWER
# results = lapply(Y, function(y) selection(X, y, fdr, alpha))
Evaluating the results
In this case, we are fortunate enough to have a ground truth obtained by another experiment (data saved
astsm_df). Using this, we can evaluate the selected results. Note that we only need to compare the position
of the mutations, not the mutation type. This is because it is known that multiple mutations at the same
protease or RT position can often be associated with related drug-resistance outcomes.
# Evaluate the result by comparing it to the ground true
#######################
#######################
##### Your Code #######
#######################
#######################