For me, the topic of convergence in yDNA first came up early
in 2014. I had just posted a paper and
one of the comments was – “What about convergence?” I said to myself, “What convergence?” I admit I had to look up the topic.
Convergence: A
term used in genetic genealogy to describe the process whereby two different
haplotypes mutate over time to become identical or near identical resulting in
an accidental or coincidental match. - Turner A & Smolenyak M 2004.
My response back to the comment was - “All of the haplotypes
in my paper are unique.” My data did not
exhibit convergence.
Convergence casts a shadow on genetic genealogy
I started to poke around on the topic of convergence within
yDNA STR haplotypes and the immediate impression that I got was that folks were
ready to give up on STRs in favor of SNPs and the sky was falling. Chicken Little was running around in the
genetic genealogy circles. Here is a
small sample:
“Y-STRs are
effectively dead” - Dienekes Pontikos, 2011
“Convergence of Y
chromosome STR haplotypes from different SNP haplogroups compromises accuracy
of haplogroup prediction” – Wang, et al, 2013
Okay, convergence happens, but it’s an illusion.
Let’s take a big step backwards in this story. Did you know that most scientific papers
relating to genetic genealogy use 17 STR markers or fewer? Some use as few as 9 or 10. For any of you who ever took one of the
original 12 STR marker tests, you know that the results were essentially
useless for anything except deep haplogroup association and history.
Over the last couple of years, many researchers have used the AmpFLSTR®
Yfiler® kit to get their 17-marker results.
This kit is approved for forensic cases. Research papers are not forensic cases and
researchers don’t need to limit themselves to 17 markers. Thirty-seven marker yDNA tests have been
available since 2004.
Why does the number of STR markers matter? I’m going to release my inner math geek to
help explain. If we look at marker DYS19,
usually listed first in science papers and third in Family Tree DNA results, it
can have a value within the range of 7 to 22 across all haplogroups. Looking at R1b specifically, DYS19 ranges
from 10 to 17 and statistically at two standard deviations (2 sigma) the range
of values narrows to 13, 14 and 15. From
a probability point of view, each of those three values has a 1 in 3 chance of
being the DYS19 result. Making the odds even better in
our favor, 95% of the time DYS19 for R1b will already be 13, 14 or 15. Because a haplotype
already holds one of those three values, only two others remain, so there is a 1 in 2 chance that
DYS19 could change to a particular other value on its way to converging with another
haplotype.
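As a rough illustration of the 2-sigma narrowing described above, the sketch below computes which allele values fall within two standard deviations of the mean. The allele counts are hypothetical numbers made up for this example, not the data behind the post, and the function name `values_within_two_sigma` is likewise only illustrative.

```python
from math import sqrt

# Hypothetical DYS19 allele counts for an R1b sample (illustrative only,
# not the data set used in the post).
dys19_counts = {10: 1, 11: 1, 12: 5, 13: 150, 14: 700, 15: 130, 16: 10, 17: 3}


def values_within_two_sigma(counts):
    """Return the allele values lying within mean +/- 2 standard deviations."""
    total = sum(counts.values())
    mean = sum(value * n for value, n in counts.items()) / total
    variance = sum(n * (value - mean) ** 2 for value, n in counts.items()) / total
    sigma = sqrt(variance)
    low, high = mean - 2 * sigma, mean + 2 * sigma
    return [value for value in sorted(counts) if low <= value <= high]


likely_values = values_within_two_sigma(dys19_counts)
# A haplotype already holds one of the likely values, so the number of
# values it could change to is one fewer.
alternatives = len(likely_values) - 1
print(likely_values, alternatives)
```

With these made-up counts the likely values come out as 13, 14 and 15, leaving two alternatives for a haplotype that already holds one of them.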
Taking standard deviation into account to determine the
possible number of values for the STR markers and then multiplying each
probability gives the odds that a haplotype could converge.
STR | DYS393 | DYS390 | DYS19 | DYS391 | DYS385a | DYS385b | DYS426 | DYS388 | DYS439 | DYS389i | DYS392 | DYS389ii | Total
# of possible marker values | 2 | 4 | 2 | 2 | 2 | 4 | 1 | 1 | 2 | 2 | 2 | 2 | 4096
There is a 1 in 4096 chance that two 12-marker R1b
haplotypes could converge. This is not
the probability that one marker will change. This is the probability that all 12 markers
will change enough to match another haplotype.
Odds that short make convergence likely, which is why a 12-marker test is
practically useless.
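The 4096 figure is simply the product of the per-marker counts in the table. A minimal check, with the counts copied from the table and the variable names chosen only for this example:

```python
from math import prod

# Number of possible values per marker, copied from the table above.
possible_values = {
    "DYS393": 2, "DYS390": 4, "DYS19": 2, "DYS391": 2,
    "DYS385a": 2, "DYS385b": 4, "DYS426": 1, "DYS388": 1,
    "DYS439": 2, "DYS389i": 2, "DYS392": 2, "DYS389ii": 2,
}

# Multiplying the counts gives the number of distinct 12-marker combinations.
combinations = prod(possible_values.values())
print(f"1 in {combinations}")  # 1 in 4096
```

Markers with only one likely value (DYS426 and DYS388 here) contribute a factor of 1, so they add nothing to the total.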
With a high probability that 12 STR markers will converge, haplotypes
start to blend together. Two different
haplogroups or family lines will appear to be the same. Converging also means that when we calculate
the time to the most recent common ancestor (TMRCA), it will look like less
time has passed. Convergence makes a 12-marker
test result unusable for genealogical matching, haplogroup prediction and TMRCA
calculations. The Chicken Littles are
correct: we have a problem with 12-marker STR results.
What about 17 markers, a quasi-industry standard for science
papers? Taking the same approach with statistics
and probability, a 17-marker yDNA R1b result has a 1 in 2 million chance of
converging with another haplotype. Each
haplogroup has slightly different odds.
There is a 1 in 500,000 chance of a 17-marker R1a haplotype converging. Those odds are better than any lottery. Convergence is still a problem at 17 markers.
When Dienekes Pontikos proclaimed the death of yDNA STRs, he
was commenting on the attempt to get good TMRCA dates from 10-marker
results. I agree: you can’t get valid
TMRCA dates from 10 markers. When Wang
et al. determined that convergence compromises haplogroup prediction, they were
correct: 17-marker haplotypes can converge to make one haplogroup look like
another.
In a quick analysis of 4,300 unique 37-marker R1b haplotypes,
the average genetic distance is 17 steps for 37 markers. That means there are 17 mutations required
for convergence in a 37-marker haplotype.
Nearly half of the markers in the haplotypes would need to change. When we look at the probability of 25-marker
haplotype convergence, the chances are 1 in 84 million. Considering there are about 3.6 billion men
on the planet, one in 84 million is still in the realm of possibility. By the time we get to 37 markers, the odds
are 1 in 49 trillion.
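For readers who want to see what an average genetic distance involves, here is a minimal sketch. It assumes the simple stepwise convention, where each marker contributes the absolute difference between the two values. The post does not say which counting rule was used, and testing companies apply special rules to some multi-copy markers. The three short haplotypes are made up for the example.

```python
from itertools import combinations
from statistics import mean


def genetic_distance(a, b):
    """Sum of absolute per-marker differences (simple stepwise counting)."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same markers")
    return sum(abs(x - y) for x, y in zip(a, b))


def average_pairwise_distance(haplotypes):
    """Mean genetic distance over every pair of haplotypes."""
    return mean(genetic_distance(a, b) for a, b in combinations(haplotypes, 2))


# Three made-up 5-marker haplotypes, for illustration only.
sample = [
    (13, 24, 14, 11, 12),
    (13, 23, 14, 10, 12),
    (12, 24, 15, 11, 13),
]

print(genetic_distance(sample[0], sample[1]))  # 2
print(average_pairwise_distance(sample))
```

The same two functions would work unchanged on 37-marker tuples. Only the input data would differ.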
There is a 1 in 49 trillion
chance that all the necessary mutations will occur in order for two 37-marker
haplotypes to converge. The true odds against convergence are likely
even longer. I’ve only looked at the probable
values for each marker and I haven’t taken into account the STR mutation rates,
that is, the probability that a marker will change over time.
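One way to put the 25-marker and 37-marker odds side by side is to divide the number of men by the odds. The sketch below is a back-of-the-envelope calculation that assumes every man is an independent draw at the quoted odds, which is a simplification. The function name `expected_matches` is made up for this example.

```python
MEN_ON_PLANET = 3.6e9  # figure quoted in the post

# "1 in N" convergence odds quoted in the post, keyed by marker count.
odds_against = {25: 84e6, 37: 49e12}


def expected_matches(men, one_in_n):
    """Expected number of coincidental matches if each man is an independent draw."""
    return men / one_in_n


for markers, one_in_n in odds_against.items():
    expected = expected_matches(MEN_ON_PLANET, one_in_n)
    print(f"{markers} markers: about {expected:.2g} expected coincidental matches")
```

Under that assumption, the 25-marker odds allow for roughly 43 coincidental matches worldwide, while the 37-marker odds leave far fewer than one.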
There is essentially no such thing as convergence when 37 or
more markers are tested and researched.
If you eliminate the possibility of convergence by using 37 STR markers,
then TMRCA calculations immediately become more accurate and haplotypes from
different haplogroups no longer resemble each other. The reports of the death of yDNA STR results have
been greatly exaggerated.
I can’t tell you why researchers are currently stuck on 17
markers. I can tell you that any research
using fewer than 37 markers runs the risk of convergence in its data, which in
turn could lead to the wrong conclusions.
I still consider genetic genealogy to be in its infancy. Every month new research papers are published
and the new concepts introduced are latched onto immediately. It is understandable that papers from over a
decade ago used a dozen STRs and a handful of SNPs; that was the height of
technology at the time. If the latest technology and
best data are not being used in today’s research papers, is that equivalent to
scientific negligence? Or, am I missing
something and this is a case of scientific ignorance on my part?
I agree with your point that convergence is reduced with more STRs tested, but you're only talking here about high-level convergence between, for instance, R1b and other parallel haplogroups. With matches based on genetic distance convergence is always a non-trivial risk even out to 111 markers because the right combination of even a few allele values can make a person show a closer genetic distance to people who don't share his terminal SNP than to people that do. Convergence is reduced by more STRs, but it's not eliminated. That doesn't mean Y-STRs are dead, but verification through SNPs is critical.
Hi Dave,
Over the years of reading papers, talking to folks and better SNP testing, I've noticed a trend discounting the value of STRs. The best way to stop a trend is to identify and write about it. I like a complete approach - STRs can be ambiguous without SNPs and SNPs are not the end all be all. As I mention below, my goal was to point out the problems with using less than 37 markers in research.
Thanks,
Mike
Convergence is a very real problem at 37 markers and in some cases at 67 markers. It seems to be a particular problem within R1b where people seemingly match at 37 markers but SNP testing indicates that they are in different subclades of R1b. I've got a number of project members who have R1b matches in different subclades (e.g., P312 and U106). There was also a recent study by Larmuseau et al. which discussed the issue: http://onlinelibrary.wiley.com/doi/10.1111/ahg.12050/abstract They used 38-marker haplotypes. There are some resources on this ISOGG Wiki page: http://www.isogg.org/wiki/Convergence The prevalence of convergence is not known, largely because the majority of people in the FTDNA database have not had SNP testing done to determine which subclade they belong to.
Hi Debbie,
The goal for my post was to point out the problems with using fewer than 37 markers in research. I was only able to read the abstract for Larmuseau - I was pleased to see that they used 38 markers. With R1b it is difficult to separate convergence from lack of divergence. I have not seen any exact matches at 37 or higher with different SNPs during my research.
Thanks,
Mike
My guess is that the reliance on 17 markers may be based on one of two reasons: 1) the technology limitations at in-house labs or 2) the cost per unit for the tests.
This is something I have been seeking for some time. Thank you so much for posting it as I was becoming somewhat frustrated at the number of nay-sayers who, in spite of being R1b, could not understand that if I was not employed by FTDNA, WHY I was requesting even a small upgrade to Y-67, but preferably Y-111. No matter what I said or what I explained, they insisted that Y-37 was all that was needed, if that!!!
Jim, the reason why the academic studies have only used 17 markers is that they are looking for answers to different questions from ours. They are looking at differences within and between populations rather than trying to answer genealogical questions. At the population level, data from more people gives you more information than high-resolution data from a smaller number of people. However, the academic studies also do SNP testing, though never in as much detail as we might like. In a few years this is no longer likely to be an issue as whole genome sequencing becomes the norm.
I took the Y-DNA 37 test through FTDNA and the result was J1 M267. Judging by the matches, all of them are from the Arab Middle East. Genetic distance 1-4. What intrigued me is that at DYS19 they are all 14 or higher while I am 13, and at Y-GATA-H4 most are 11 and I am 10. Can anyone explain this to me?
Marker DYS393 DYS390 DYS19 ** DYS391 DYS385 DYS426 DYS388 DYS439 DYS389I DYS392 DYS389II ***
Value 12 23 13 11 13-19 11 17 11 13 11 30
PANEL 2 (13-25)
Marker DYS458 DYS459 DYS455 DYS454 DYS447 DYS437 DYS448 DYS449 DYS464
Value 19 8-9 11 11 26 14 20 25 12-14-16-17
PANEL 3 (26-37)
Marker DYS460 Y-GATA-H4 YCAII DYS456 DYS607 DYS576 DYS570 CDY DYS442 DYS438
Value 10 10 22-22 14 14 18 18 32-36 11 10
Can anyone evaluate these values for me?