Monday, September 16, 2019

Examples of Visual Critical Assessment for Ancestry Chromosome Painting

[this post is a collection of images to try and make my points from this Twitter discussion more clear]

NOTE: After creating this blog post, I created this Biostars discussion.  I think this is a little shorter and perhaps a better format for discussion.  So, you may want to consider looking at that discussion instead of (or in addition to) this post.  Thank you very much for your interest.

As also mentioned in this other post, my African ancestry (whether that is what most people would consider to be African, or ancestors that migrated out of Africa relatively more recently) should come from my father's side.

While upstream phasing by SHAPEIT can also be a factor, I did some RFMix re-analysis with various data types, including the result below:




Assuming that each row represents a chromosome that I inherited from each parent, the clearest problem is that the African (AFR, red) ancestry should really only be on one copy of chromosome 14 (and I assume the adjacent European EUR segments are not precise, and there should really be one larger segment).
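The same check can be written down as a small script.  This is only a minimal sketch in Python, assuming a hypothetical table of painted segments (chromosome, copy, start, end, ancestry label).  The tuple layout, the function name, and the coordinates are made up for illustration and are not RFMix output.

```python
from collections import defaultdict

# Hypothetical painted segments: (chromosome, copy, start_bp, end_bp, ancestry).
# These values are invented for illustration; they are not real RFMix output.
segments = [
    ("14", 1, 0, 40_000_000, "EUR"),
    ("14", 1, 40_000_000, 55_000_000, "AFR"),
    ("14", 1, 55_000_000, 107_000_000, "EUR"),
    ("14", 2, 0, 60_000_000, "EUR"),
    ("14", 2, 60_000_000, 70_000_000, "AFR"),
    ("14", 2, 70_000_000, 107_000_000, "EUR"),
    ("15", 1, 0, 102_000_000, "EUR"),
    ("15", 2, 0, 102_000_000, "EUR"),
]


def chromosomes_with_ancestry_on_both_copies(segments, ancestry):
    """Return chromosomes where `ancestry` is assigned on more than one copy."""
    copies_seen = defaultdict(set)
    for chrom, copy, _start, _end, label in segments:
        if label == ancestry:
            copies_seen[chrom].add(copy)
    return sorted(chrom for chrom, copies in copies_seen.items() if len(copies) > 1)


flagged = chromosomes_with_ancestry_on_both_copies(segments, "AFR")
print(flagged)
```

If an ancestry is expected from only one parent, a flagged chromosome is a candidate for a phasing problem to look into, rather than proof of one.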

While I am not entirely certain about the underlying method, there is also an example where I think visual inspection can be useful for a result from basepaws.  While my notes are messy (and sometimes incomplete), you can look here for more information (if desired).  I purchased ~15x Whole Genome Sequencing for $1000 (to get raw data), rather than the more typical $95 for low coverage Whole Genome Sequencing (lcWGS) and Amplicon-Seq for health markers.

So, I don't exactly have my own basepaws report, but I think there is a fairly new version of broad ancestry assignments (via chromosome painting) that can be viewed on page 3 of this PDF (for another cat).  In terms of separate images that I can find on-line, this blog post with an earlier report (again, for another cat) has ancestry painting, but it doesn't have the same problem.  Likewise, the chromosome painting plot on this blog post doesn't have the same issue.

So, I will just verbally say that the cat chromosome painting on page 3 of this PDF looks off in that the broad ancestry assignments seem to be the same on both copies of each chromosome.  To be fair, they aren't always the same (which is good - I think the ancestry for most cats should probably be relatively independent for each chromosome copy).
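One way to turn that visual impression into a number is to measure how much of a chromosome has the same label on both copies.  The sketch below is a minimal Python example under stated assumptions: the segment format, the function names, the group labels "A" / "B", and the toy coordinates are all hypothetical, and it says nothing about how basepaws actually produces its plot.

```python
def label_at(copy_segments, position):
    """Return the ancestry label covering `position`, or None if uncovered."""
    for start, end, label in copy_segments:
        if start <= position < end:
            return label
    return None


def fraction_identical(copy1, copy2):
    """Fraction of jointly covered length where both copies share a label.

    Each copy is a list of (start, end, label) tuples with half-open intervals.
    """
    # Every segment boundary from either copy splits the chromosome into
    # intervals on which both labels are constant.
    breakpoints = sorted({edge for seg in copy1 + copy2 for edge in seg[:2]})
    same = total = 0
    for left, right in zip(breakpoints, breakpoints[1:]):
        a = label_at(copy1, left)
        b = label_at(copy2, left)
        if a is None or b is None:
            continue  # skip stretches not painted on both copies
        total += right - left
        if a == b:
            same += right - left
    return same / total if total else float("nan")


# Hypothetical example: copy 2 differs from copy 1 on the last quarter only.
copy1 = [(0, 50, "A"), (50, 100, "B")]
copy2 = [(0, 50, "A"), (50, 75, "B"), (75, 100, "A")]
print(fraction_identical(copy1, copy2))
```

A value close to 1.0 on most chromosomes would match the pattern described above, although what value should count as suspicious is a judgment call.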

However, in terms of giving advice for troubleshooting, I can also show you my attempted RFMix analysis (which clearly has problems, and which I wouldn't recommend returning as a result for anybody else).





Now, the chromosome copies do show more independent ancestry (per chromosome copy), but the results are not reproducible.  However, in that particular context (performing re-analysis of my cat's data), I thought ADMIXTURE and PCA (using public reference samples) did have reasonable results.  So, I think there were other strategies (which I would probably consider "simpler" strategies that I thought did give reasonable results), which is good and important.  While there may be a problem in the assumption of use for some more specific breeds (such as single markers for the Scottish Fold or Sphynx), my point is that I consider the ADMIXTURE and PCA results to be OK for the broader ancestry (meaning I think there is some sort of robust ancestry result that can be provided for cats).

Nevertheless, the overall goal was to be able to visually identify likely problems with inheritance (and/or limitations to precision for ancestry results), and I think the above plots are OK for that.

That said, even this troubleshooting should probably be thought of as "hypothesis generation."  In other words, if your first assumption when seeing results like I have shown above is not "Something looks like it could / likely is wrong," then I think this is helpful in terms of needing to critically assess genomics results.  However, it is also important that you then try to think of ways to identify the more specific problem.  While phasing may be an issue with the human 23andMe results and the more limited number of probes / markers for the public reference samples is my expected problem with the basepaws cat RFMix analysis, you generally gain confidence in a result when you keep trying to find a problem (and you keep finding valid explanations for the results).  So, in some ways, it may be best to call my critique a "hypothesis."  For example, saying I couldn't get reasonable RFMix results for my cat analysis (and I should also admit that there could also be some sort of bug that I haven't been able to find) is not the same as saying some sort of chromosome painting analysis is not possible in the future (as long as the underlying biological assumption is valid).

Change Log:

9/16/2019 - public post date
2/6/2020 - add Biostars discussion link

Saturday, September 14, 2019

Informal Notes: Additional Research on APOE variants for Increased Risk of Late-Onset Alzheimer's Disease

[this set of notes branched out from this blog post; with 2 similar / partially identical paragraphs] 

I found it hard to find cohorts with individuals greater than 80 years of age in the AlzGene database (and 23andMe was reporting that this is the age interval when onset was most likely, and risk was most increased).

To be fair, I also noticed that a recent study by Licher et al. 2019 reported Low-Risk Onset with an average age of 85.5 and High-Risk Onset with an average age of 81.3 (meaning individuals must have been over 80 at some point).  However, they grouped both E3/E4 individuals (like myself) with E4/E4 individuals (who should be at noticeably higher risk) in the "High-Risk" group.  The CDC also provides a short review here.

An earlier paper by Corder et al. 1993 reports "mean age at onset decreased from 84 to 68 years," but there were fewer than a dozen E4/E4 individuals for each gender in that paper (and the maximum risk of 90% seems high compared to other studies).  To be fair, Farrer et al. 1997 also reports an earlier age of onset with more samples (closer to 755 Caucasian E4/E4 individuals from meta-analysis; 14.8% of 5107 cases).  However, with that later study, I would kind of like to see some error bars (and ideally see individual points and/or be able to download an Excel / tab-delimited text / comma-separated text file with APOE genotype, disease status, age of onset, gender, ethnicity, and cohort / batch / family ID).
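As a quick arithmetic check on the Farrer et al. 1997 numbers quoted above (a small Python sketch; the variable names are only for illustration), 14.8% of 5107 cases is close to the 755 individuals mentioned:

```python
total_cases = 5107      # cases quoted above from the meta-analysis
e4e4_fraction = 0.148   # 14.8%, as quoted above

e4e4_cases = total_cases * e4e4_fraction
print(round(e4e4_cases, 1))
```

The small difference from 755 is what would be expected if the percentage was rounded to one decimal place.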

Myers et al. 1996 report an E4/E4 age of onset closer to 80 years (for 55% of E4/E4 individuals); however, the total percent of individuals developing Alzheimer's Disease is noticeably lower for E3/E4 and E3/E3 (27% and 9%, respectively) as well as having a later onset of closer to 85 years than 80 years.  Unfortunately, I don't have access to that later information, to check additional details.  It also doesn't look like 23andMe directly cites these papers (but I think these are what are discussed in review articles), but they report something more similar to Myers et al. 1996 (and Genin et al. 2011, which I do have access to).

Also, to be fair, my 23andMe Report says "Approximately 40-65% of Alzheimer's patients have one or two copies of the APOE ε4 variant. However, many people with the APOE ε4 variant will not develop late-onset Alzheimer's disease" (citing Alzheimer's Association 2016).  This probably matches the 55% APOE E4/E4 onset (and an indication that E3/E4 individuals like myself are at increased risk, but are still more likely not to get the disease).

There are also different genes (such as APP / PSEN1 / PSEN2) which are more associated with early-onset Alzheimer's disease (which I found from this CDC link).

While a slightly different topic, my understanding is that the previous content from the 23andMe forums will be deleted.  So, to copy over some of the main points here, the 23andMe scientific report describes the "General population" risk estimates for individuals over 85 to be 11% for males and 14% for females.  I noticed that the overall risk estimates for individuals over 85 from other sources were numbers like 25%, 50%, and 50%.  Another member of the 23andMe community passed along a link to a PDF of the World Alzheimer Report 2018 from Alzheimer's Disease International; while a bit hard to find that same document, I do see the prevalence values that I mention in this article (which is "3% of people age 65-74, 17% of people age 75-84 and 32% of people age 85 or older have Alzheimer's dementia.").  If you check the reference in this CDC post, then Hebert et al. 2013 indicate previous and projected frequencies of 32% to 37% in Table 1.  I think this leaves some questions unanswered, but I also think the younger age estimates are more similar (and I would guess it is easier to collect data from those younger than 85).  So, I thought the feedback and discussion was helpful.

Change Log:

9/14/2019 - public post date
10/3/2019 - add link for CDC review
8/21/2021 - add overall frequency notes from 23andMe forum

Informal Notes: Pathogenic Variant Risk for BRCA1/2 Mutations

[this set of notes branched out from this blog post]

I think the estimates of 50% risk for breast cancer and 30% risk of ovarian cancer (in this infographic from the CDC) are in line with what I was expecting (although that doesn't capture mutations in BRCA1 being higher risk than BRCA2).  There is also this page with 60-75% risk for BRCA1 carriers and 50-70% risk for BRCA2 carriers (which I learned about from this Twitter reply).

Importantly, because of that Twitter communication, I read the book "Resurrection Lily" by Amy Byer Shainman.  I think this provides a very poignant perspective of what it is like for yourself and/or your friends to go through either breast cancer treatment or prevention strategies (including up to preventive surgeries).  There are also multiple pages for risk estimates at the end of the book (pages 253 and 262-263 in my copy of the book).

Before I learned about that book, I think I would typically refer to the Stanford BRCA Decision Tool (Kurian et al. 2012) that provides the variance of risk with a few options (although I'm not sure about the intervals of screening, I don't know what is the relative effectiveness of hormonal therapies, and these estimates are at the gene level when I would expect some specific variants are higher risk than others).

While it is probably not going to be a problem in the future, there were a few days when the website was down.  So, I have attached some screenshots below (for BRCA1, and then BRCA2):




There was a recommendation for BRCA screening with a "B" grade in individuals with family histories, but a recommendation against BRCA screening with a "D" grade for individuals that didn't already have an increased risk of getting breast cancer (which I think is shown most clearly on the US Preventive Services website).  While the context is different, I think this figure from Biesecker 2019 shows one example of how filtering for prior risk may help (although not specifically for BRCA1/2).

While the topic of treatment strategies is a little different, I did notice some different numbers on an NCI Prevention Tweet (going back to the topic of the range of risk estimates).  For example, that number seemed noticeably different for BRCA2, and the total risk reduction (particularly if you focus on death from breast cancer) also seemed different than the text in that Tweet (compared to what I see in the BRCA Decision Tool).

I also think I had another interesting conversation on Twitter here, some of which was converted into notes as a blog post (in terms of the overall prevalence of cancers where variants in moderate-to-high risk genes could be found).

There are some NCCN guidelines, although I thought those were more clear for criteria for screening versus recommendation for surgery in a BRCA-positive individual (which supports use on a "case-by-case basis").  Nevertheless, there is a lot of information available if you register for an account in this document, where some details for the "Bilateral Total Mastectomy" are on page MS-24.  Nevertheless, if this implies that people who don't qualify for the NCCN screening criteria (but had a test result, for some reason) are less likely to benefit from surgery, then perhaps that is helpful for making decisions.

The CDC has guidelines to define "average," "moderate," and "high" hereditary breast cancer risk.  I think their guidelines for genetic testing are similar to the NCCN and USPSTF (but you might find that website easier to read, and this page mentions that the USPSTF recommendations may affect health care coverage).  They also have an additional link for mammogram screenings that I didn't previously know about (on this page).

While I think there are some specific things that can be improved (for communication with non-scientists and communicating ranges for estimates of risk for specific variants, for example), I believe free, publicly-available resources like BRCAExchange.org should play an important role in helping patients learn more about their results.  ClinVar also provides some information, for variants in general.

For example, I believe these are the 3 BRCA1/BRCA2 variants covered by 23andMe: rs386833395, rs80357906, and rs80359550.

The mutation described by the author of "Resurrection Lily" was referred to as "BRCA1 #5385 insC" as well as "BRCA1 #5382".  The book includes some references to ClinVar, but I don't believe I saw the rsID.  However, I believe those names refer to this variant in BRCA Exchange and this variant in ClinVar (based upon this Hamel et al. 2011 paper, and the synonyms in the database listings).  That makes this the second of the 3 variants tested by 23andMe above.

I am not sure about other available estimates.  However, there are selected genes with high (or highest?) lifetime risk variants for 7 cancer types (including breast and ovarian cancer) on this page from JScreen.

You can listen to some personal perspectives from a patient advocate in this podcast from DNA Today.  One statistic that caught my attention was that 1 in 40 Ashkenazi Jewish women have a mutation in the BRCA1 or BRCA2 genes.  I could also find that statistic on this page from the CDC.  I believe that page is combining pathogenic variants from the BRCA1 and BRCA2 genes when reporting "About 50 out of 100 women with a BRCA gene mutation will get breast cancer by the time they turn 70 years old".

There are also various ways to see the influence of ENIGMA on BRCA1 and BRCA2 variant annotations.  For example, BRCA Exchange includes pathogenic variant annotations from ENIGMA and ClinVar.  This page provides information for ENIGMA annotations in ClinVar, and there is also an ENIGMA BRCA1/BRCA2 expert panel in ClinGen (related to "Variant Pathogenicity" for BRCA1 and BRCA2).

Change Log:

9/14/2019 - public post date
11/12/2019 - add link to BRCA decision tool paper; website not currently working, but I will add some screenshots later (to show substantial difference in risk at the gene level for BRCA1 versus BRCA2)
11/20/2019 - add screenshots for BRCA Decision Tool.
12/2/2019 - add prior risk venn diagram link
7/6/2020 - add "breast cancer" label
8/16/2020 - add additional Penn/Basser link for risk estimates
8/22/2020 - add BRCAExchange link
10/25/2020 - add Resurrection Lily link
2/23/2022 - correct typo
12/1/2023 - add JScreen link (after listening to this podcast)
1/21/2024 - add additional DNA Today link as well as CDC Link for BRCA variants in women with Ashkenazi Jewish ancestry.
1/22/2024 - after receiving helpful feedback from GenomeConnect / ClinGen staff, add links for various types of ENIGMA annotations.

Monday, September 9, 2019

Informal Notes: Collection of References Regarding the Frequency of Pathogenic Mutations in Moderate-to-High Cancer Risk Genes

While I typically prefer to keep Twitter responses less formal than blog posts (which in turn are less formal than pre-prints or peer-reviewed journal articles), I feel kind of bad continuing to add citations to an earlier discussion thread.

So, I thought I would copy the notes here, and provide information for those that are interested (without having all the notices go to the person with the original response):

BRCA Exchange tries to provide some variant information (for BRCA1 / BRCA2)

data.color.com also provides some information about specific variants (but counts are often low).  Color also has this page for Breast Cancer Awareness Month (October).

Ambry provides an Interactive Prevalence Tool (in collaboration with Mayo)

Stanford BRCA Decision Tool provides some statistics to guide discussions about risk prevention options (although that relates more to risk than total frequency; however, this with total numbers is essentially what I want to see)

MSKCC lists total inherited breast cancer rate at 5-10%

There is a link from the CDC website listing the BRCA1+BRCA2 rate for hereditary breast cancer as 3% for breast cancer and 10% for ovarian cancer.  If you sum the frequencies, that is somewhat close to LaDuca et al. 2019.

There is also a different source of information from the CDC, mentioning rates for all hereditary breast and ovarian cancer (which must be greater than the sum of BRCA1+BRCA2) of 5-10% and 10-15%, respectively.

  • Editorial about BRCA mutation frequency and risk


Hu et al. 2018
  • Table 2 shows some percentages per gene


LaDuca et al. 2019
  • Learned about from @kristinclift
  • GenomeWeb review mentions "13.8 percent of ovarian cancer patients carried germline pathogenic variants in at least one of the 32 genes tested"
  • Supplemental Table S7 also shows a higher BRCA1/2 ovarian cancer percent (3-5%, OC) than breast cancer percent (1-2%, BC)
  • These are all high-risk patients
  • Total numbers are fairly different: 66,954 BC, 9,106 OC, 1,087 BC+OC.
  • Supplemental Table S6 has gnomAD control frequencies, but I think variant calls may need to be more specialized for targeted gene panels.
    • In other words, those percentages look low compared to that matched control study, but they do provide those details.
  • I also think they used results from the reports from different Ambry panels, rather than obtaining and re-analyzing raw data independently.  So, I wonder how much of an effect that had.


Unpublished (?) CARRIERS Study Data
  • In a CARRIERS plot (a PDF from a Clinical Cancer Genomics Conference presentation that I don't know if I have permission to post), getting control variants for BRCA1/2 below 0.4% puts BRCA1/2 case mutation rates at under 1.4%.
    • I think that is for breast cancer.
  • Plots for 14 individual studies (for a total of 10,000s of samples, with matched controls) were shown for 21 genes with an overall case rate of ~6% and control rate of ~2.5%, as well as 17 higher-risk genes with an overall case rate of ~5% and control rate of ~2% (which I guess could be up to a 30% “false positive” rate either way).
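One possible way to relate the case and control rates above is to divide the control rate by the case rate.  The Python sketch below does only that, using the approximate numbers quoted in these bullets.  Reading the ratio as the share of case carriers expected from the background carrier rate alone is an assumption made for illustration, and the function name and labels are hypothetical.

```python
# (label, control carrier rate, case carrier rate), as approximate fractions
# taken from the rough numbers quoted above.
rate_pairs = [
    ("BRCA1/2", 0.004, 0.014),
    ("21 genes", 0.025, 0.06),
    ("17 higher-risk genes", 0.02, 0.05),
]


def background_share(control_rate, case_rate):
    """Control rate divided by case rate.

    Interpretation (an assumption): roughly the share of case carriers that
    would be expected from the background carrier rate alone.
    """
    return control_rate / case_rate


shares = {label: background_share(ctrl, case) for label, ctrl, case in rate_pairs}
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

With these rounded inputs, the ratios come out at roughly 29%, 42%, and 40%.  The inputs are approximate readings, so the outputs are only rough as well.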


Variability in Variant Calling
  • May need more specialized variant calling for targeted gene panels - Warden et al. 2014
    • For example, notice the difference in scale for the y-axis in Figure 9, where the relative “novel” variants in controls is a proxy for false positives.
  • You may get better concordance between variants with reprocessing (in this case, using a more typical processing strategy for WGS and Exome Data) - blog post on custom scripts and precisionFDA comparisons.
  • In addition to data.color.com, I think cBioPortal can be a useful resource to look at mutation information across different cancer types and studies
    • cBioPortal has some tutorials that can be viewed here
    • For example, I think I could see results more similar to what I read elsewhere (with BRCA2 being found at higher frequencies) if I removed variants of unknown significance
    • I think 47:45 - 51:55 of this webinar also gives some useful caveats to mutation and copy number calling that might be worth taking into consideration (where a region that is supposed to have a homozygous deletion has a variant with an allele frequency that you would normally consider a match for a heterozygous variant, when a true homozygous deletion call shouldn't have any variants)

Polygenic Risk Scores

These are arguably more for risk assessment rather than frequency (although the risk estimates vary across the population).

Nevertheless, I would also be interested in seeing more results from Myriad myRisk and AmbryScore-Breast Polygenic Risk Scores (PRS).  This is in part because I haven't been highly satisfied with the polygenic scores that I was able to apply to myself, but I should note that clinical features are included in the calculation for these scores (beyond just the genetic information).

For example, I noticed that some more information about Myriad myRisk was provided in these slides, which I believe reference this announcement.  I believe that is a reference to the Hughes et al. 2020 and Gallagher et al. 2020 papers.  However, what caught my eye was this figure, which I don't think is directly from those publications?

I am not sure if I am correctly understanding that both Myriad myRisk and AmbryScore-Breast use "clinical history" information?

As a possibly separate point, I thought Figure 2 of Fahed et al. 2020 was interesting because of the difference in the range of risk estimates by PRS with versus without a BRCA1 or BRCA2 mutation.  The stratification without being a BRCA1/2 carrier was more limited.  However, I would be interested to get a better sense of how PRS stratification compares to variants in or near the BRCA1 or BRCA2 genomic footprint.  I think the latter is more along the lines of the goals of resources like BRCA Exchange.

Those interested in these notes may also be interested in this post with notes on the range of risk estimates.

Change Log:

9/9/2019 - public post date
9/10/2019 - add "informal note" category, along with link to original response
9/13/2019 - add CDC links, as well as some extra information not previously copied over from earlier Twitter discussion
10/10/2019 - add Color link for BRCA1/2 mutation frequencies
11/12/2019 - add link to other blog post
5/8/2020 - add cBioPortal notes
7/6/2020 - add "breast cancer" label
7/7/2020 - add note about interest in hearing about a range of PRS experiences.
8/14/2020 - add additional PRS links
8/16/2020 - minor change
3/9/2021 - add link for Ambry Prevalence Table
4/23/2021 - move link from earlier blog post to this blog post
6/26/2022 - add link to PRS within BRCA1/2 carriers

Saturday, September 7, 2019

Updated Thoughts on The Language of Life

I have an earlier review on "The Language of Life" by Francis Collins (who was previously the lead for the Human Genome Project, and is the current director of the NIH).  However, given that I am presenting this book at the Monrovia Library book club in October, I thought it would be good to have a newer post with some additional thoughts (and I will also create a post with discussion topics for the book club, in October).

First, I should critique my own previous post.  For example, I currently feel more confident about preventing rare diseases than guiding drug treatments.  However, I think there was a sense in the limits to prediction that I was trying to convey before (in terms of "designer babies" where a large number of traits could be predicted and selected), and I still don't support that (or necessarily believe that can / should be accomplished).  I think this is also emphasized in the HFEA guidelines (described on page 55, in my edition) as well as some more updated scientist opinions (such as not using Polygenic Risk Scores for embryo screening).

I still really like that Francis Collins provided a balanced view of genomics (with both potential and limits), and I was glad to see that I could also notice that ~10 years ago.

My earlier post also reminded me of the statistic that "[adverse] drug reactions are the fifth leading cause of death in the United States" (page 233, in my edition), although I admittedly also forgot that in between the time that I first marked that page with a book dart and when I started writing the draft for this post.

Going back to genomics and drug treatments, Francis Collins also mentions "the biggest reason for potentially deadly drug reactions is simple human error [but this isn't the only reason]" (page 233, in my edition).  So, even though I implied that I had less confidence in pharmacogenomics (or at least I think we have to be more careful about the assessments than I used to), I really do think biomedical informatics can help patient care.  In other words, making sure relatively simple actions are consistently understood and carried out appropriately is not trivial (which I also touch on when describing my cystic fibrosis carrier status, even though I believe that is more complicated than some people may expect), and that is something important that we can improve (without even using exceptionally complicated models / techniques).  Nevertheless, to be clear, this area was fairly represented in the book, with an entire section of the 9th chapter called "Obstacles to the Pharmacogenomics Revolution" (pages 247-249, in my edition).

Now, in terms of my updated thoughts:

1) In the introduction, "Dr. James" (who was really Francis Collins) describes interacting with somebody with a BRCA1 mutation in their mother's DNA, saying "[the patient] faced a 50 percent risk of having inherited that misspelling, in which case her lifetime risk of breast cancer would be approximately 80 percent, and that of ovarian cancer about 50 percent" (page XII, in my edition).  However, I think I have more recently gained better appreciation for the value in the range of risk estimates (at the gene or variant level).  For example, there is a Stanford BRCA Decision Tool that provides the variance of risk with a few options (although I'm not sure about the intervals of screening, I don't know what is the relative effectiveness of hormonal therapies, and these estimates are at the gene level when I would expect some specific variants are higher risk than others).  Likewise, there was a recommendation for BRCA screening with a "B" grade in individuals with family histories, but a recommendation against BRCA screening with a "D" grade (which I think is shown most clearly on the US Preventive Services website).  In other words, I believe the significance of the result (and the preventive option chosen) varies depending upon whether the individual has a family history of early-onset breast cancer (ideally, I believe, with a variant that specifically validates between cases and controls in their own family).

In the interests of space, I have saved a collection of informal notes in another blog post.  This includes some things like strategies to define risk from family history from the CDC.
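To make the numbers quoted in point 1 concrete, here is a small Python sketch of how the two probabilities in that quotation combine before a test result is known.  It is a simplification for illustration only: it ignores the baseline population risk that applies whether or not the variant was inherited, and the variable names are not from the book.

```python
p_inherited = 0.50    # chance of having inherited the variant (quoted above)
risk_breast = 0.80    # lifetime breast cancer risk if inherited (quoted above)
risk_ovarian = 0.50   # lifetime ovarian cancer risk if inherited (quoted above)

# Risk attributable to the variant before testing, ignoring the baseline
# population risk that applies whether or not the variant was inherited.
pre_test_breast = p_inherited * risk_breast
pre_test_ovarian = p_inherited * risk_ovarian
print(pre_test_breast, pre_test_ovarian)
```

Replacing the single 80% and 50% values with the low and high ends of a published range would show how much the combined estimate moves with the choice of source.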

2) On page 194 (in my edition), Francis Collins describes "a company called Psynomics is marketing a DNA test for susceptibility to bipolar disorder, arguing that this information could be useful in establishing the diagnosis in an uncertain case.  The test being offered, however, is based upon a variation in a gene called GRK3, and this has not been validated in a large-scale study.  This result could turn out to be utterly useless.  Even worse, this kind of unvalidated test, utilized by individuals or their physicians to make a serious diagnosis in an uncertain situation, might do more harm than good." As with pretty much all of the posts, I will probably update my review of "Blueprint", but I agree with concerns about the over-estimation in the accuracy of tests that can possibly negatively impact the rest of someone's life (as well as my opinion that the reaction to genetic results may be particularly important for mental health).

Similarly, on page 204, a company offering testing of V1aR variants for $99 to test for increased susceptibility to infidelity is also presented with an appropriately critical view that "the actual influence on the behavior of an individual male is quite modest, and should certainly not be used in mate selection or as an excuse for cheating on one's partner."

3) Perhaps it is a bit of a tangent; however, in terms of the rare diseases, the first episode of Diagnosis on Netflix involves a patient story that was resolved with Whole Genome Sequencing (WGS) to determine that her ailments for the past 10 years were due to CPT2 (meaning her symptoms could improve by increasing sugar and decreasing fatty acids in her diet).  I was surprised that she went to Italy for the diagnosis (where she was treated for free after she arrived, but I would have expected treatment costs to usually be above a few thousand dollars to justify the trip; I could get Veritas WGS data for $1000, but I did need to re-analyze it).

I was also surprised that her US doctors were trying to sue her for hundreds of dollars of medical bills (when she was already in debt, and the treatments weren't helping her in the long-run since they didn't reveal the underlying problem).  However, that unfortunately seems like it may not be an isolated incident: for example, I recently heard about this happening to a large number of individuals in the UVA health system.

However, getting back to this book, on page 92 (in my edition), Francis Collins warns that some nutrigenomics companies are running "consumer scams," while there are legitimate rare diseases whose symptoms can be improved with diet (such as PKU).  I was also skeptical about some of my nutrigenomics results, but it sounds like the Netflix show also provides a genuine example where genetics can inform diet (and vastly improve your quality of life).


Also, similar to my 1st post, here are some assorted minor points:

a) Francis Collins (as Dr. James) indicated that someone from Navigenics implied that "most of the remaining genetic risk factors for common disease will have been discovered in the next two or three years; as a scientist working in this field, that seems unlikely to me" (page XXII, in my edition, emphasis added).  I also don't think Navigenics exists anymore - at least the Wikipedia company link does not go to a genetics company website (even though they also mention it was acquired by Thermo Fisher in 2014).

b) As noted in the first post, Francis Collins has blue eyes when 23andMe predicted them to be brown (page XXVIII, in my edition).

c) I am a Bioinformatics Specialist (doing genomics research).  However, I don't think that term was in widespread use when the book was written.  For example, I believe his term "DNA cryptography" (page 13, in my edition) is meant to be synonymous with "Bioinformatics."

d) Reading this book also influenced another blog post, in terms of the discussion of the ACLU Supreme Court case invalidating Myriad's patents on the BRCA1/2 genes and contrasting his own actions for the CFTR gene for cystic fibrosis.

e) On page 187 (in my edition), Francis Collins describes "[one] remarkable gene in the brain is estimated to be able to make 38,000 different proteins."  However, I kind of wish there was a reference to the citation in the primary literature.  For example, I thought most cells tended to have one predominant version of a gene transcript, and I am worried about false positives (or at least rare alternative splicing events) when describing very large numbers of isoforms for genes.

f) On page 317 (in my edition), 23andMe is listed as testing for the Δ508 cystic fibrosis variant.  While I got my first 23andMe test in 2011 (a little after this book was published), I am a carrier for a different cystic fibrosis variant.  So, 23andMe currently covers more than just that one cystic fibrosis variant.

Finally, I specify "in my edition" whenever I reference something from the book.  However, I think the relatively newly purchased paperback was still the 1st edition.  So, I'm not sure how necessary this is.  However, in terms of trying to minimize errors in peer-reviewed publications (and making sure people acknowledge and correct errors), I think the concept that books have editions may be kind of important.

Update (10/19/2019): After I finished re-reading the book (again - to prepare for leading the book club discussion), I thought I should write a little more to make sure that I am not down-playing the pharmacogenomics part too much.  While I do think the introduction is a fair match to my interests / opinions, I do want to make clear that I am sure there are important genomic applications with decent predictive power for guiding drug dosage, drug effectiveness, and/or serious adverse side effects.

So, similar to the separate blog posts containing notes on BRCA1/2 pathogenic risk, high-to-moderate inherited cancer risk frequencies for pathogenic variants, and APOE variant frequencies and Alzheimer's Disease risk, I will try to add a few links about the influence of VKORC1 on Warfarin / Coumadin dosage.  However, I have spent considerably less time looking into that, so this will just be bullet points below (instead of a separate blog post):




On the other hand, I really do have some interest in understanding (and critically assessing) the use of genomics for depression treatment.  I have tried to collect some notes on that within my review of "blueprint", as well as expressing concerns I have about what people might perceive about the predictive power of genomic data and anxiety / depression (based upon my own personal experience as well as some general genomics research experience).

Change Log:

9/7/2019 - public post date
9/8/2019 - revise post from sister's feedback; minor changes
9/9/2019 - add UVA example + NCCN guidelines + additional Twitter / blog link
9/10/2019 - minor changes
9/13/2019 - add links from the CDC
9/14/2019 - move longer set of BRCA1/2 notes to separate post
10/19/2019 - add update with pharmacogenomic notes

Monrovia Library Book Club Discussion Topics for "The Language of Life"

You can see my thoughts in two earlier blog posts (my first ever blog post in 2010, as well as a more recent blog post in 2019).

However, the book club (at 6:30 PM on Tuesday October 22nd) is really about other people's thoughts (although I hope this non-fiction book helped with understanding about genetics/genomics).

So, here are some discussion topics, which I think could be of interest (even if you didn't already have a passion for genomics):

1) In general, what did you find to be the most interesting part of the book?

2) Did you think this was a good introduction to genetics / genomics? If not, I also recommend reading "The Cartoon Guide to Genetics" (which was required for my AP Bio class in High School, along with a more formal textbook). However, please be aware that the cartoons within the book are in black-and-white.  Also, as with just about anything else, the book isn't absolutely perfect: for example, there is a reference to 200,000 genes in the human genome on page 80 (which was believed at one point, but I would now say we feel much more comfortable with 20,000 genes that can be relatively consistently transcribed).

3) There is a section of the 7th chapter about the influence of genetics on Criminality.  For example, there are a few paragraphs about the X-linked MAOA gene.  While I mostly have to trust that the study was fairly presented (and reproducible in subsequent studies), the study showed that decreased expression of MAOA was associated with increased risk of violent behavior and criminal convictions, but only if the individual was abused as a child (page 202).  So, I think this is a good example of a gene-environment interaction, but I don't know how strong / predictive the risk association was.

Likewise, to put things in perspective, Francis Collins also pointed out "approximately half of the US population carries a genetic risk factor that places people at a sixteenfold higher likelihood of imprisonment than the other half.  That happens to be the Y chromosome" (also on page 202).

Would your opinions of someone change if you knew they had a negative genetic predisposition (and you thoroughly understood exactly what has been observed and how much of an effect that has)?  For example, what do you think about giving somebody a lesser or more severe sentence because of their genetics?

4) Also in the introduction, Francis Collins discusses Alzheimer's disease risk, and questions the value of returning results when there is nothing that can be done medically (page xx, as well as illustrated on page 222).

I (Charles Warden) carry one copy of the APOE E4 risk variant (and I know which parent also has that risk variant).

4a) What do you think about a risk assessment for a disease that cannot be prevented or treated?

4b) Does that opinion change if I emphasize the need for you (and your genetic counselor, physician, etc.) to have access to the data to calculate the risk assessments, as well as making sure that you have access to your raw data for re-analysis / evaluation?

If interested, you can see my longer list of informal notes in another blog post.  However, the main message I think I should explain is that it takes some time to get confidence in a risk assessment (and I think there should ideally be some sort of access to the primary data used to come to those conclusions).

While I won't focus on what (from what I understood) were the less representative results here, my impression is that the more robust conclusion was similar to what was reported in my 23andMe report, Genin et al. 2011, and Myers et al. 1996 (which I am using to report the following statistics):


  • ~55% of E4/E4 individuals developed Alzheimer's Disease (with an age of onset ~80 years)
  • ~27% of E4/E3 individuals developed Alzheimer's Disease (with an age of onset ~85 years)
  •  ~9% of E3/E3 individuals developed Alzheimer's Disease (with an age of onset ~85 years)


Likewise, my 23andMe Report says "Approximately 40-65% of Alzheimer's patients have one or two copies of the APOE ε4 variant. However, many people with the APOE ε4 variant will not develop late-onset Alzheimer's disease" (citing Alzheimer's Association 2016).

5) Do you have any direct experiences with genomics results (from 23andMe, AncestryDNA, uBiome, Genes for Good, American Gut, etc)? For example, I have recorded some of my relatively recent experiences in this set of blog posts.

Having 5 questions to guide the discussion may already fill an hour (with a group of 20-30 people).  However, I hope the blog post can help with discussions before the book club (to help me better prepare) as well as after the book club (if anybody doesn't have a chance to express their opinion).

Change Log:

9/7/2019 - public post date
9/8/2019 - revise post from sister's feedback; minor changes
9/9/2019 - trim content
9/10/2019 - fix typo
9/11/2019 - add extra APOE E3/E4 citations (from 23andMe, ClinVar, and accepted middle-author paper; although I think the last of which was also in the pre-print)
9/13/2019 - add CDC links
9/14/2019 - separate blog post for detailed APOE notes
10/1/2019 - minor changes

Book Review for "Blueprint"

Update (10/1/2019): If you would like to read a shorter review that shares my main overall opinions, please check out this.

Otherwise, there are some additional points being made in this blog post, which I think still has some value.

First, I think this book was very helpful in terms of critically assessing plots for Polygenic Risk Scores (PRS):

I had a bit of difficulty finding matching public images that I can post here, and I don't think there is a public interface (kind of like data.color.com; or some of the PGC data and/or the UK Biobank, even though I had an issue with this link more recently) to query the TEDS data (or CAPS data, WTCCC data, etc.).  However, I would be very happy if somebody could show how a non-scientist can reproduce the results from the book.

Nevertheless, Figure 2 in Chapter 2 (page 25 in the paperback edition) shows scatter plots for weight where monozygotic (MZ) twins have a correlation of 0.84 and dizygotic (DZ) twins have a correlation of 0.55.  I think having concrete examples to show the spread of these correlations (which directly relate to heritability values in this book, defined as 2 times the difference between MZ and DZ correlations on page 27 in the paperback edition) is very important.
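To make that definition concrete, here is a minimal sketch of the arithmetic in Python (this "2 times the difference" rule is often called Falconer's formula).  The function name is just something I made up for illustration, and the two correlations are the weight values quoted above from the book:

```python
def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Heritability estimate: 2 times the difference between MZ and DZ correlations."""
    return 2 * (r_mz - r_dz)


# Twin correlations for weight, as quoted from the book
h2_weight = falconer_heritability(r_mz=0.84, r_dz=0.55)
print(f"Estimated heritability for weight: {h2_weight:.2f}")
```

With those two correlations, the estimate works out to 2 x 0.29 = 0.58.  So, even for this fairly heritable trait, a sizable part of the variation is not attributed to genetics under this definition.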

Scatter plots can also be important for interpretation.  In an earlier version of this blog post, I had a plot of ranks from here, but I noticed more recently that the linked image was not appearing in the post.  So, I kept searching, and I found some data from this post.  Using code that I uploaded here, I created the following plots for BMI correlations for MZ versus DZ twins:




You can clearly see that the correlation is higher for twins that are 100% identical (MZ) versus 50% identical (DZ).  However, importantly, the ability to predict a BMI for an individual twin also has limitations.  The first row shows all points, and the second row uses the same data but provides a density distribution to see where a lot of points are close together.

You might also notice that the correlation coefficients are both lower than mentioned in Blueprint: as much as possible, I hope independent validation in multiple large cohorts is helpful, and transparency in data and sample selection/filtering is also important.  However, at the current time, I am not sure what explains the difference in correlation coefficient, even if both datasets come to the same conclusion that the genetic impact is higher for monozygotic twins than dizygotic twins.

In the context of a Polygenic Risk Score, creating a scatterplot between the true value and the predicted value may also be helpful in interpreting and critically assessing the results.

In terms of an introduction, I think Chapter 12 (The DNA fortune teller) is quite good in terms of explaining how PRS are calculated and presented (although I kind of wish it was called something different, like "Introduction to Polygenic Risk Scores," since the main thing that was clear to me was the limitations and that seems strange for something called a "fortune teller").  For example, that chapter says "[the] most predictive polygenic risk score so far is height, which explains 17 per cent of the variance in adult height" (emphasis added, page 139 in paperback edition) as well as showing a scatterplot for actual height versus PRS for height (Figure 5, page 142, paperback edition), and Robert Plomin specifically has an actual height at the 99th percentile but a PRS for height in the 90th percentile (as a sort of "best case scenario" for a PRS).

Similarly, Plomin's PRS for BMI was at the 94th percentile, while his actual BMI was at the 70th percentile (page 146, paperback edition).  He explains this in terms of being at the 99th percentile for height and possibly having to take extra effort to keep off weight (which I think sounds like a plausible combination of factors, in addition to general limits to predictive power).  If most people consider this as a motivating factor (similar to the author), perhaps that is good.

I also really like that Plomin gives examples of PRS percentiles for himself (Figure 11 on page 160, paperback edition): 22% for bipolar disorder, 35% for major depressive disorder, 39% for Alzheimer's disease, 85% for schizophrenia, and 94% for educational attainment.  That said, while I think some of his "self-understanding" discussions about his schizophrenia PRS may be acceptable in a research setting (page 151 and 177 in the paperback edition), it sounds like 85% is not a high enough score to be relevant in a clinical setting (if something like the PRS could increase the chance of somebody being institutionalized with less direct evidence).  This kind of makes sense for a disease with less than 1% prevalence, although I think that does bring into question the value of using common variants (instead of rare variants) in the PRS calculation.

Another useful type of plot is a density plot for extreme values (with the range and overlap of PRS values for each of those populations).  Again, I am trying to show you something in the book without copying it, but I can make some representative examples in R:



For example, I would say the simulated example on the left looks good, but the utility of the example on the right could be questionable (although I think this is encountered more often, especially if you took a randomly selected trait and your own generated PRS).
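If it helps to put a number on "looks good" versus "questionable", here is a rough sketch in Python (not the R code used for the plots).  I am assuming two normally distributed groups with equal spread, and the gaps of 3.0 and 0.5 standard deviations are made-up values for illustration rather than anything estimated from real PRS data:

```python
from statistics import NormalDist


def group_overlap(mean_gap_in_sd: float) -> float:
    """Shared area (0 to 1) under two unit-variance normal density curves.

    mean_gap_in_sd is a hypothetical distance between the group means,
    measured in standard deviations.
    """
    low_group = NormalDist(mu=0.0, sigma=1.0)
    high_group = NormalDist(mu=mean_gap_in_sd, sigma=1.0)
    return low_group.overlap(high_group)


# Made-up gaps: one where the groups are clearly separated, one where they are not
well_separated = group_overlap(3.0)
heavily_overlapping = group_overlap(0.5)

print(f"Overlap with a 3.0 SD gap: {well_separated:.2f}")
print(f"Overlap with a 0.5 SD gap: {heavily_overlapping:.2f}")
```

The closer the overlap is to 1, the harder it is to tell which group an individual belongs to from the PRS alone, even if the difference between the group averages is statistically significant.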

On the other hand, what I thought could be misleading was the decile plot shown in Figure 6 on page 144 in the paperback edition.  Yet again, I'll use an example on-line (from Figure 3 of Calafato et al. 2018, instead of in the book).  However, one of my top Google searches happened to be for psychiatric traits (rather than height as a positive example):



To be fair, this does make the schizophrenia PRS look like it may have some value with a percentile >90% (matching the expectation that not too much should be read into the author's 85% PRS), and it looks like the bipolar PRS is probably of limited utility. Nevertheless, you might have something that is not very predictive (with a lot of variability in the scatter plot) in a decile plot that looks like the 1st 9 deciles for schizophrenia (in that paper).

While I think the example with height was meant to give some sense of an inflection point above the 80th percentile, I think this does not do a good job of capturing the variability that you would see in a scatter plot (so, I think showing scatter plots and density plots should be required for any PRS).  In particular, I believe the crucial point is made by Plomin on pages 143-144 (emphasis added): "The line running through each data point [in the decile plot] is called the standard error...Note that the standard error refers to the average of each group, not the error of estimating an individual's score...It does not mean that the actual height of 95 per cent of individuals in the top decile of polygenic scores will be in this range."
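To illustrate the difference between the standard error of a group average and the spread of individuals, here is a small simulation sketch in Python.  It is not the data from the book: I am assuming standardized, normally distributed values, a hypothetical cohort of 20,000 people, and a PRS that explains 17% of the variance (the figure quoted for height):

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

VARIANCE_EXPLAINED = 0.17  # quoted in the book for the height PRS
N_PEOPLE = 20_000          # hypothetical cohort size (an assumption)

r = math.sqrt(VARIANCE_EXPLAINED)
noise_sd = math.sqrt(1 - VARIANCE_EXPLAINED)

# Simulate a standardized PRS and a standardized trait correlated with it
people = []
for _ in range(N_PEOPLE):
    prs = random.gauss(0, 1)
    trait = r * prs + noise_sd * random.gauss(0, 1)
    people.append((prs, trait))

# Keep the trait values for the top 10% of PRS values
people.sort(key=lambda pair: pair[0])
n_top = N_PEOPLE // 10
top_decile = [trait for _, trait in people[-n_top:]]

top_mean = statistics.mean(top_decile)
top_sd = statistics.stdev(top_decile)          # spread of individuals
top_se = top_sd / math.sqrt(len(top_decile))   # uncertainty of the group average
below_average = sum(t < 0 for t in top_decile) / len(top_decile)

print(f"Top decile mean trait: {top_mean:.2f} (SE = {top_se:.3f})")
print(f"Top decile individual SD: {top_sd:.2f}")
print(f"Fraction of top decile below the population average: {below_average:.2f}")
```

In this kind of simulation, the error bar for the group average is tiny because it shrinks with the square root of the group size, while the spread among individuals in the top decile stays almost as wide as in the whole population.  So, a tidy decile plot and a noisy scatter plot can come from exactly the same data.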

Second, I admittedly started the book with a bit of a negative impression - the prologue mentions predicting depression (and schizophrenia / school achievement) "from the moment of your birth, it is completely reliable and unbiased - and it costs only £100" (page vii, paperback edition).  As somebody who has had to manage anxiety and depression, I know that symptoms are context-dependent and change over time.  So, even without taking limitations to the genomics predictions into consideration, I would say that something like "probability of having at least 1 depressive episode" likely has a genetic component, but whether or not you have depression / anxiety at any particular interval (and whether or not that requires medication) will require additional factors.  In other words, I think there are a lot of exceptions to the assumption "[psychologists] study hundreds of traits, which is their collective label for differences between us that are consistent across time and across situations" (page 3, paperback edition).

I also don't believe I completely agree with the claim "[for] the first time, genetics offers a causal basis for predicting disorders rather than waiting until symptoms appear and trying to use these symptoms, rather than causes, to diagnose disorders" (page 66 in the paperback edition).  For example, I believe I even had a psychiatrist who explicitly said that getting caught up on the names for the diagnosis can sometimes cause problems beyond trying to treat symptoms.  However, I do agree that "[whether] you become anxious or you become depressed is caused by environmental factors" and I believe there is some useful insight in terms of clustering "internalizing problems" and "externalizing problems" (both on page 67 in the paperback edition).

Additionally, I think there is an important point being made about continuous traits and PRS values on page 164 in the paperback edition: "A second way in which polygenic risk scores will transform clinical psychiatry is by moving away from diagnoses and towards dimensions.  One of the big findings in this book is that the abnormal is normal, meaning that, from a genetic perspective, there are no qualitative disorders, only quantitative dimensions".  So, practically speaking, on the trait side, there are certain thresholds for needing to take action (such as not being able to function at work, at least without treatment / adaptations).  Likewise, I believe you still need concrete examples of less severe behavior to watch out for, in order to possibly identify problems early and prevent progression.

I certainly hope that there can be ways to better identify problems at an early stage in order to have the sort of prevention described on page x (of the paperback edition), but I think it is also important to be realistic about predictive power.  In other words, if the true predictive power is lower than you expect, then providing a diagnosis based upon DNA sequence alone (at least using one strategy of interpretation) might contribute to unnecessary stigma for a patient.  Given the prevalence of depression, I think there are a number of things that probably should be done to improve perception of mental illness (so, a false positive would be less of a big deal).  However, I would be more concerned if limits in predictive power were not properly understood in situations where a false positive could negatively impact the rest of a person's life (if we assume a person will have a severe mental health problem, without any evidence from their actual behavior).  So, it is the part about "This means that we can foretell our futures from birth.  For example, in the case of mental illness, we no longer need to wait until people show brain or behavioral signs of the illness and then rely on asking them about their symptoms" (emphasis added, page x of my paperback edition) that I think is either not being communicated precisely (especially in the present tense) or causes me concern.

To be fair, I have general experience with genomics research (and personal experience with mental health problems), but I didn't have any previous psychiatry research experience before reading this book.  So, what may very well be true is that the genetic predictors are better than other risk factors.  For example, Plomin says "[there] are very few large effect sizes in psychology.  One example is that general intelligence accounts for about 25% of variance in educational achievement." (page 31, paperback edition).  However, I think it is also important to keep in mind how the predictive power for these traits / illnesses compares to other associations, and there is still a need to fairly judge each situation independently.

In other words, before reading this book, I thought I mostly remembered schizophrenia having the least significant association.  For example, in Selzam et al. 2019 (with Robert Plomin as the last author), the Polygenic Risk Scores in Figure 1 had significantly higher beta coefficients for height and BMI (or the other cognitive traits) than schizophrenia (SCZ), which was opaque in the lower-right because its own p-value was greater than 0.01 (and the difference is indicated as not being significant).  Likewise, this review concluded "[these] limitations mean that [Polygenic Risk Scores] are not yet clinically useful in psychiatry."  The book itself also describes limited success for psychological disorders in a 2007 study, although studies with even greater sample sizes were emphasized after that (such as from the Psychiatric Genomics Consortium).

However, to be fair, the Figure 1 in the PRS paper linked above does show better success in predicting educational attainment (General Certificate of Secondary Education, GCSE, in that paper).  While I don't discuss it much in this post, this is a topic of discussion in multiple chapters of the book.  As yet another way to compare PRS, the Epilogue of the book (on page 187 of the paperback edition) summarizes: "polygenic scores...can predict 17 per cent of variance in height, 6 percent of variance in weight, 11 per cent of the variance in school achievement, 7 per cent of the variance in intelligence, and 7 per cent of the variance in liability to schizophrenia."  However, I was less impressed with Figure 10 density plots on page 158 of the paperback version of the book (showing density distributions for the top / bottom 10% of educational attainment PRS, as a function of GCSE score percentile), so perhaps the lower values than Height or BMI for the beta coefficients in the paper should also be emphasized (and, while I think the paper seems to be a better match to my expectations, I don't believe this is entirely consistent with the relative percent variance explained in the Epilogue of the book).

Outside of the book, I did stumble across a gene that was specifically named because of its association with schizophrenia (DISC1, although that might have been from the NCBI entry for the mouse gene Disc1).  However, it is probably also helpful to have numbers like "if one sibling is diagnosed as schizophrenic, their siblings have a 9 per cent risk of being schizophrenic, much greater than the rate of 1 per cent across the general population" (emphasis added, page 71, paperback edition).  In that situation, there is considerably increased risk, but predictive power is still low.  Likewise, I have concerns that the reader may over-estimate the predictive power from sentences like "[for] schizophrenia, DNA differences packaged as polygenic risk scores are now the best predictor we have for who will become schizophrenic" (page 126 in the paperback edition), even though that may in fact be true (and the predictive power for other traits / risk factors is just worse).
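As a quick sketch of the arithmetic behind "considerably increased risk, but predictive power is still low" (in Python, using only the 9 per cent and 1 per cent figures quoted from the book; the variable names are my own):

```python
sibling_risk = 0.09      # risk for siblings of a diagnosed individual (quoted from the book)
population_risk = 0.01   # general population rate (quoted from the book)

# How many times higher the risk is for siblings
relative_risk = sibling_risk / population_risk

# Fraction of those siblings who are still not expected to be diagnosed
unaffected_fraction = 1 - sibling_risk

print(f"Relative risk for siblings: {relative_risk:.0f}x")
print(f"Siblings not expected to be diagnosed: {unaffected_fraction:.0%}")
```

So, a roughly 9-fold relative risk can sound dramatic, but guessing "will be diagnosed" for every sibling of a diagnosed individual would still be wrong about 91% of the time.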

There was also at least one paper that indicated "twin studies [can overestimate] heritability," and my comment on that paper references a pre-print where it looks like varying definitions of heritability can be used (for example, the twin-based 2*|corMZ - corDZ|).  I also noticed that the MaTCH entry for cystic fibrosis (under "ICF/ICD10 Subchapter") wasn't especially high; I'm not sure if that is an issue with sample size, but that makes me think this measure may not be absolutely perfect in terms of representing how well we understand the biology of a given disease (or the severity of the rare disease).  I also thought it was strange that the dizygotic twin correlations were higher than the monozygotic twin correlations for cystic fibrosis.  Perhaps I should look more into the associated Polderman et al. 2015 paper.

I also noticed this other blog post about genomics in psychiatry (HT @elo81).  The context here is a little different.  However, the article where I first heard about Myriad's GeneSight had a subtitle that implied some confusion in the ability of this pharmacogenetic test to be used for diagnosis (which was not what the content of the article showed: you may be able to use genetics to guide testing different medications, but I didn't see any evidence that this particular test could diagnose whether you actually have depression).  Additionally, this article says "United HealthCare in August announced that it will cover panels of genetic test for guiding the use of drugs for major depression and other depressive disorders, although the American Psychiatric Association’s research council last year concluded that the evidence for testing in those indications is not conclusive", which I believe is specifically in reference to GeneSight?  Either way, the end of this article has two citations (Zeier et al. 2018 and Zubenko et al. 2018) that discuss the field (the latter of which describes clinical trials for GeneSight).

In general, you can also find some information on ClinicalTrials.gov (for GeneSight).  While some results are more clear than others, I thought this was interesting.  For example, I think one was "Completed" but actually canceled (under "Results Submitted")?  One is recruiting in Canada (this is for the US National Library of Medicine).  Some can be complete yet have "No Results Posted" (as opposed to not having results because the study is still active).  In fact, there was only one with results (NCT01610063, out of the 10 from my search), and it has a link in the original table of search results (to make it stand out more).

There were twice as many lost to follow-up for the Guided (using GeneSight) versus Unguided treatment, but I don't know how often that happens.  The difference for "primary outcome" (and multiple secondary outcomes) was greater for red category results than green/yellow combined category (which matches my expectation that what would give me the most confidence was for me to have a red result and test taking the medication - even though I don't recommend that for most people, particularly if you have no good reason to take such a risk).  I certainly don't want to downplay being able to identify adverse side effects better in 15-20% of patients (if I understand that correctly, that does matter, particularly if you consistently see that between independent cohorts), but I also don't want people to think that the method was precise enough to predict exactly what they should take on the first try.

Third, I am sure I would be guilty of this if I tried to write an entire book (and this is part of why I have "change logs" on my blog posts), but there were some situations where I think there was some room for improvement in the wording.  For example, I think there are some valid points within sections like "Parents Matter, But They Don't Make A Difference" (page 82 in the paperback edition) or "Schools Matter, But They Don't Make A Difference" (page 86 in the paperback edition), but there were understandably some complaints described in the afterword (page 191 in the paperback edition).

For example, on the positive side, the explanation of what is and what can be (at a population measure, as being reported in these studies) that is described several times in the book (including, but not limited to, page 192 in the paperback edition) is useful advice that I have used at least one time when making a point in casual conversation.  However, using statistics as an incentive for change is different than presenting fate as highly deterministic from genetics (and therefore predictive regardless of future action), and I think this is a caveat that may require less emphasis on the predictive power (and therefore hopefully avoid the need for this additional explanation).

In terms of sections like "Life Experiences Matter, But They Don't Make A Difference" (page 89 in the paperback edition), I think Plomin is right to discourage people from being overly worried about small mistakes.  However, figuring out what life experiences are traumatic enough to need extra effort to avoid is important.  It is also my opinion that convergence of traits like personality and mental health over a long enough period of time can perhaps be thought of like needing to go through developmental stages (even as an adult), where certain concepts are easier to understand after you have gone through certain first-hand experiences.  In other words, I believe having some sort of anchor for clear understanding can be important for preventing certain problems; while I agree with tackling these issues as early as possible, I would therefore disagree that certain traits / diseases can be prevented from birth (if experience / communicating / logic are required to understand the underlying problem and modify behavior).

I also think it is extremely important that Plomin acknowledges "Severe genetic problems such as single-gene or chromosomal problems or severe environmental problems such as neglect or abuse can have devastating effects on children's cognitive and emotional development.  But these devastating genetic and environmental events are, fortunately, rare and do not account for much variance in the population" (page 85 in the paperback edition).  However, I think this may also get back to some limitation in the heritability measure for cystic fibrosis (a single-gene disorder that I believe is one of the better examples of something that can realistically be prevented with methods like IVF+PGT).

Likewise, I thought an insightful example was provided in the Afterword on page 197 in the paperback edition: "No prediction is perfect, especially in behavioral sciences.  We often make big decisions on the basis of much weaker correlations.  For example, the correlation between blood alcohol levels and automobile accidents is weak, but that doesn't, and shouldn't deter us from making strict laws about drunk-driving."  I do wonder if perhaps choosing a title other than "blueprint: how DNA makes us who we are" would have helped with the "No prediction is perfect" part, but that is also discussed in the Afterword (on pages 190-191 in the paperback edition).

Fourth, on page 180 in the paperback edition, Plomin mentions "dating websites might extend their data to include polygenic scores...Unlike the hype of dating websites, polygenic-score information could be verifiable through password-protected links to a direct-to-consumer company".  While the limits to predictive power in "percent match" on dating websites might be a good analogy to the limits to "hypothesis generation" for some genomics applications, I certainly wouldn't encourage something like this.  Plus, I'm not quite as certain about the genomics verification being foolproof (which also means that I would at least somewhat disagree with the sentence on page 181 that "You can't fake or train your DNA").  For example, I had a strange experience when I uploaded my 23andMe data into FamilyTreeDNA, and Francis Collins was able to submit his DNA sample under another name to multiple companies (as described in the prologue to The Language of Life, which is a book that I admittedly prefer, and I have a blog post with an updated summary of thoughts as well as a set of discussion questions for a book club).

Similarly, I have some concerns about the suggestion of using polygenic risk scores for job interviews (mentioned on page 181 of the paperback version), even though I certainly acknowledge limitations in fairly assessing somebody during a brief screening / interview process.

To be clear, I don't expect any system to be completely foolproof (kind of like it is possible to pay somebody to take your SATs, but that is rare and we have only heard about wealthy people doing that).  However, I still strongly disagree with using PRS for dating apps or job applications, which I believe is in line with the view that PRS should usually not be used for embryo selection.

Finally, on a closing note, the Afterword has a section called "Public Reaction" (page 199 in the paperback edition).  Plomin describes a highly positive response after describing critiques from scientists and the media: "Far from the nightmare predicted before publication, the public reaction has been positive beyond my wildest dreams."  However, my concern is that individuals with less background in an area may initially have a more positive reaction, but those individuals may have a more negative reaction in the long run (if there were limitations that were not made clear to them, especially if they repeated incorrect or imprecise information to others).  While I realize this can make success hard to define, I think this may be important for many genomics researchers and companies.

Change Log:

9/7/2019 - public post date
9/8/2019 - fix typos; minor changes
9/16/2019 - add early link to book; minor changes
9/20/2019 - add pre-print citation
10/1/2019 - add other review link + minor subsequent changes
10/2/2019 - add links related to GeneSight
10/20/2019 - add ClinicalTrials.gov links
11/14/2019 - add link to swirl lesson with Galton's height data.
7/6/2020  - change tense for "polygenic risk score" label
7/9/2022  - provide alternative BMI scatterplot, and modify content accordingly; minor changes

Also, I moved the following paragraph out of the main text, given that I found it somewhat confusing on re-reading (even if the view of the variation might in fact be of some interest):

Similarly, I thought it was interesting to see Sir Francis Galton's parent-child height data in the "1: Introduction" lesson in the swirl course for "Regression Models" (the scatter is still noticeable for what I believe is one of the most heritable traits, and the data is brought up to describe regression towards the mean).  Also, if you work interactively with the data, I thought it was useful to deviate a bit from the instructions and create a plot using smoothScatter(galton$child ~ galton$parent).  Also, to be clear, you should expect more variability for a parent-child plot than a MZ/DZ twin plot.
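
If you don't have the swirl lesson open, the point about scatter and regression towards the mean can be roughly sketched with simulated numbers.  To be clear, this is Python with made-up data (not R, and not Galton's actual measurements): the population mean, the spread values, and the 0.65 slope are all assumptions that I chose for illustration, and the function names are hypothetical.

```python
import random


def simulate_heights(n, slope, mean=68.0, parent_sd=1.8, noise_sd=2.2, seed=0):
    """Simulate (parent, child) height pairs in inches.

    Every parameter here is an illustrative assumption, not an estimate
    taken from Galton's data.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        parent = rng.gauss(mean, parent_sd)
        # The child is only pulled part of the way from the population mean
        # towards the parent, and independent noise is added on top.
        child = mean + slope * (parent - mean) + rng.gauss(0.0, noise_sd)
        pairs.append((parent, child))
    return pairs


def fit_line(pairs):
    """Return the least-squares slope, intercept, and Pearson correlation."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


pairs = simulate_heights(n=5000, slope=0.65)
slope, intercept, r = fit_line(pairs)
print(f"fitted slope = {slope:.2f}, correlation = {r:.2f}")
```

A fitted slope below 1 is the "regression towards the mean" part (children of unusually tall or short parents are, on average, closer to the mean), and a correlation well below 1 is the "scatter" part.  With the real data, the smoothScatter() plot mentioned above is the way that I would recommend looking at this.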

Also, for reasons of brevity, I thought it might help to move out the following content (and re-number the later sections):

Third, I thought it was a bit odd that the author referenced an error in the afterword without actually correcting it.  Namely, on page 113 in the paperback edition, Plomin says "[if] a SNP is associated with a psychological trait, that means the SNP was expressed."  If the variant changes the protein coding sequence, then expression of that gene is important.  However, I remember Plomin also mentioning that variants for psychological traits are often located in non-coding regions: "most DNA associations with psychological traits involve SNPs in non-coding regions of DNA rather than in classical genes" (page 116 of the paperback edition).  Without getting into whether that is actually causing more false positives for those associations, intergenic or promoter variants do not need to be expressed themselves (in order to affect expression of a causal gene).  This is also mentioned in the Afterword (page 198 in the paperback edition) in the context of epigenetic regulation, but I am a little confused about the counter-argument to epigenetic regulation (except to say some variability, like drug resistance, can be caused through epigenetic mechanisms that won't be captured by Polygenic Risk Scores) and I don't see anything explicitly saying "the SNP doesn't have to be expressed" particularly if a lot of SNPs are in non-coding regions.

Also, it is probably a minor point, but I have some issues with the sentence "[the] rest of this book focuses on SNPs, because they have played a central role in the DNA revolution" (page 113 of the paperback edition) because i) certain classes of SNPs can be called with higher accuracy than indels (insertions and deletions) but I think indels may on average have a greater effect on function in coding regions (kind of like saying "you only searched for your keys under the street light because that is where you could see best") and ii) something about "DNA revolution" strikes me as something generally associated with hype (even if there are contexts where that truly is a fair representation of the advancements in technology and medicine).  This second point is kind of like calling the discovery of the double-helix by Watson and Crick "the most important ever produced in biology" (page 110 in the paperback edition).

I also noticed that the predictive power of the Fabbri et al. 2019 PRS for resistance to depression treatment was not very impressive, but that may relate to epigenetic changes (getting back to the Afterword comment).
 
Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.