I recently published a short article in the Journal of Bioinformatics and Research that investigated host switching in polyomaviruses (which you can also download here).
The analysis was pretty straightforward, but I thought it provided a nice example of how Mauve can supplement traditional phylogenetic analysis. Also, most polyomavirus phylogenies in virology journals seem to compare the divergence of individual protein sequences, but I found that analyzing the genomic sequence provides more useful, consistent results.
In the interests of full disclosure, part of the reason I want to plug this paper is that I am on the editorial board for this journal. That said, I do honestly think it is a journal worth considering if you have a paper that isn't a good fit for a more established journal: it is open-access, turn-around time is quick, and it only costs $100 in processing charges for an accepted manuscript.
Wednesday, October 10, 2012
Monday, October 1, 2012
My DREAM Model for Predicting Breast Cancer Survival
This summer, I worked on submitting a few models to the DREAM competition for predicting breast cancer survival.
Although I was originally planning to post about my model after the competition had completely finished, I decided to go ahead and describe my experience now because 1) my model honestly didn't differ radically from the example model and 2) I don't think I have enough time to redo the whole model-building process on the new data before the 10/15 deadline.
To be clear, the performance isn't all that different between the old and new data, but there are technical details that would have to be worked out to submit the models (and I would want to take time to re-examine which clinical variables are best to include in the model). For example, here are the concordance index values for my three models on the training dataset:
[Table: concordance index values for CWexprOnly, CWfullModel, and CWreducedModel; the numeric values were not recoverable.]
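For anyone not familiar with the metric, the concordance index measures how often a model ranks pairs of patients in the correct order (1.0 is perfect, 0.5 is random guessing). Here is a minimal sketch of how it can be computed with the lifelines Python package; the numbers are made up, and this is not the challenge's own scoring code.

from lifelines.utils import concordance_index

# months: observed follow-up times; events: 1 = death observed, 0 = censored;
# risk: a model's predicted risk score (higher = worse prognosis).
# All of these numbers are made up for illustration.
months = [10, 34, 5, 62, 18]
events = [1, 0, 1, 0, 1]
risk = [2.1, 0.3, 1.7, -0.4, 0.9]

# lifelines expects "predicted scores" oriented like survival times
# (larger = longer survival), so risk scores go in with the sign flipped.
print(concordance_index(months, [-r for r in risk], events))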
The old models are supposed to be converted to work on the new data. If that happens, I'll be able to see the performance of these models on the future datasets (the additional METABRIC test dataset plus a new, previously unpublished dataset). That would certainly be cool, but this conversion has not yet happened.
In general, my strategy was to pick the gene expression values that correlated most strongly with survival, and I then averaged the expression of the probes that were either positively or negatively correlated with patient survival (a positive and a negative metagene). On top of this, I further filtered the probes to only include those that vary between high- and low-grade patients. My qualitative observation from working with breast cancer data has been that genes that vary with multiple clinically relevant variables seem to be more reproducible in independent cohorts, so I thought this might help when examining the true, new validation set. However, I gave this criterion much less weight than the survival correlation (I required the probes to have a survival correlation FDR < 1e-8 and a |correlation coefficient| > 0.25, but I only required the probes to also have a differential grade FDR < 0.01).
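As a rough sketch of that filtering-and-averaging step: the code below uses a plain rank correlation with survival time as a stand-in for however the survival association was actually scored (it ignores censoring), and the data frame layout, column names, and function are hypothetical.

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def build_metagenes(expr, surv_months, grade,
                    surv_fdr=1e-8, min_abs_r=0.25, grade_fdr=0.01):
    # expr: probes x patients DataFrame of expression values
    # surv_months: survival time per patient (censoring is ignored in this sketch)
    # grade: tumor grade per patient (e.g. 1-3), as a pandas Series
    # 1) score each probe's association with survival (simple rank correlation here)
    r, p = zip(*(stats.spearmanr(expr.loc[probe], surv_months) for probe in expr.index))
    r = np.asarray(r)
    surv_q = multipletests(p, method="fdr_bh")[1]

    # 2) differential expression between high- and low-grade tumors
    hi = grade == grade.max()
    lo = grade == grade.min()
    grade_p = [stats.mannwhitneyu(expr.loc[probe, hi], expr.loc[probe, lo]).pvalue
               for probe in expr.index]
    grade_q = multipletests(grade_p, method="fdr_bh")[1]

    # 3) keep probes passing both filters, then average them into two metagenes
    keep = (surv_q < surv_fdr) & (np.abs(r) > min_abs_r) & (grade_q < grade_fdr)
    pos_metagene = expr.loc[keep & (r > 0)].mean(axis=0)
    neg_metagene = expr.loc[keep & (r < 0)].mean(axis=0)
    return pos_metagene, neg_metagene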
So, these three models can be described as:
CWexprOnly: Cox regression; positive and negative metagenes only
CWfullModel: Cox regression; tumor size + treatment * lymph node positive + grade + Pam50Subtype + positive metagene + negative metagene
CWreducedModel: Cox regression; tumor size + treatment * lymph node positive + positive metagene
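To make those specifications concrete, here is a minimal sketch of how the three models could be fit with the lifelines package. The column names, the 0/1 coding of treatment and lymph node status, and the dummy-coded PAM50 columns are assumptions for illustration; this is not the actual submission code.

from lifelines import CoxPHFitter

def fit_cox(df, covariates, duration_col="surv_months", event_col="event"):
    # Fit a Cox proportional hazards model on the listed covariate columns.
    cph = CoxPHFitter()
    cph.fit(df[[duration_col, event_col] + covariates],
            duration_col=duration_col, event_col=event_col)
    return cph

def fit_three_models(df):
    # df: one row per patient with hypothetical columns surv_months, event,
    # tumor_size, treatment (0/1), lymph_node_positive (0/1), grade,
    # dummy-coded pam50_* columns, and the pos_metagene / neg_metagene scores.
    df = df.copy()
    # "treatment * lymph node positive" = both main effects plus their product
    df["treat_x_lnpos"] = df["treatment"] * df["lymph_node_positive"]
    pam50_cols = [c for c in df.columns if c.startswith("pam50_")]

    expr_only = fit_cox(df, ["pos_metagene", "neg_metagene"])
    full = fit_cox(df, ["tumor_size", "treatment", "lymph_node_positive",
                        "treat_x_lnpos", "grade",
                        "pos_metagene", "neg_metagene"] + pam50_cols)
    reduced = fit_cox(df, ["tumor_size", "treatment", "lymph_node_positive",
                           "treat_x_lnpos", "pos_metagene"])
    # each fitted model reports its training-set c-index as model.concordance_index_
    return expr_only, full, reduced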
The CWreducedModel was used to see how much of a difference it made to only include the strongest variables (and to what extent the full model may be subject to over-fitting). The CWexprOnly model was used to see how well the gene expression alone could predict survival, without the assistance of any clinical variables.
I included the treatment * lymph node positive term because it defines a variable similar to the strongly correlated "group" variable without making assumptions about which variables were the most important (and, as I would later learn, the "group" variable won't be provided for the new dataset).
Additionally, one observation I made prior to the model-building process was how strongly the collection site correlated with survival (see below). This variable isn't defined by the individual patient, so I assumed it reflects technical variation (or at least something that won't be useful in a truly independent validation dataset). The new data diminishes the impact of this confounding variable, but the correlation is still there.
Correlation with survival:

Collection Site | 0.42
Group | -0.51
Treatment | 0.29
Tumor Size | -0.18
Lymph Node Status | -0.24
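A quick screen like the table above can be approximated by correlating numerically coded clinical variables with survival time. The sketch below uses Spearman correlation with hypothetical column names, and is only a rough stand-in for however the values above were actually computed.

import pandas as pd
from scipy import stats

def survival_correlations(clinical, surv_col="surv_months",
                          variables=("collection_site", "group", "treatment",
                                     "tumor_size", "lymph_node_positive")):
    # clinical: one row per patient; the listed column names are hypothetical.
    # For categorical fields like collection site, an integer coding only gives
    # a rough screen, but it is enough to flag suspicious confounders.
    rows = []
    for var in variables:
        rho, p = stats.spearmanr(clinical[var], clinical[surv_col],
                                 nan_policy="omit")
        rows.append({"variable": var, "spearman_rho": rho, "p_value": p})
    return pd.DataFrame(rows)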
ER, PR, and HER2 status are also important variables. However, PR and HER2 status were missing in the old data, and I didn't record the original ER correlation, so they are among the variables that I don't report in the above table. Likewise, the representation of the tumor size and lymph node status variables changed between the two datasets.
This was a valuable experience for me, and I'm sure the DREAM papers that come out next year will be worth checking out. There are some details about the organization that I think could be improved: avoid changing the data throughout the competition, find a way to limit the "model of models" to avoid cherry-picking over-fitted, non-robust models, and avoid rewarding intermediate predictions on data where users could cheat by using the publicly available test dataset. Nevertheless, I'm sure the process will be streamlined if SAGE assists with the DREAM competition next year, and I think there will be some useful observations about optimal model building from the current competition.