Showing posts with label SNP. Show all posts

Friday, December 5, 2014

DNA Convergence and Chicken Little

   For me, the topic of convergence in yDNA first came up early in 2014.  I had just posted a paper and one of the comments was – “What about convergence?”  I said to myself, “What convergence?”  I admit I had to look up the topic.

Convergence: A term used in genetic genealogy to describe the process whereby two different haplotypes mutate over time to become identical or near identical resulting in an accidental or coincidental match. - Turner A & Smolenyak M 2004.

My response back to the comment was - “All of the haplotypes in my paper are unique.”  My data did not exhibit convergence. 
Convergence casts a shadow on genetic genealogy
   I started to poke around on the topic of convergence within yDNA STR haplotypes and the immediate impression that I got was that folks were ready to give up on STRs in favor of SNPs and the sky was falling.  Chicken Little was running around in the genetic genealogy circles.  Here is a small sample:

“Y-STRs are effectively dead” - Dienekes Pontikos, 2011

“Convergence of Y chromosome STR haplotypes from different SNP haplogroups compromises accuracy of haplogroup prediction” – Wang, et al, 2013

   Okay, convergence happens, but it’s an illusion.

   Let’s take a big step backwards in this story.  Did you know that most scientific papers relating to genetic genealogy use 17 STR markers or fewer?  Some use as few as 9 or 10.  For any of you who ever took one of the original 12 STR marker tests, you know that the results were essentially useless for anything except deep haplogroup association and history.

   Many researchers in the last couple of years are using the AmpFLSTR® Yfiler® to get their 17 marker results.  This equipment is approved for forensic cases.  Research papers are not forensic cases and researchers don’t need to limit themselves to 17 markers.  Thirty-seven marker yDNA tests have been available since 2004.

   Why does the number of STR markers matter?  I’m going to release my inner math geek to help explain.  If we look at marker DYS19, usually listed first in science papers and third in Family Tree DNA results, it can have a value within the range of 7 to 22 across all haplogroups.  Looking at R1b specifically, DYS19 ranges from 10 to 17 and statistically at two standard deviations (2 sigma) the range of values narrows to 13, 14 and 15.  From a probability point of view, with three likely values there is a 1 in 3 chance that DYS19 will be any one of 13, 14 or 15.  Making the odds even better in our favor, 95% of the time DYS19 for R1b will already be 13, 14 or 15.  That leaves two other likely values, so there is a 1 in 2 chance that DYS19 could change to a particular other value on its way to converging with another haplotype.

   Taking standard deviation into account to determine the possible number of values for the STR markers and then multiplying each probability gives the odds that a haplotype could converge.
STR        # of possible marker values
DYS393     2
DYS390     4
DYS19      2
DYS391     2
DYS385a    2
DYS385b    4
DYS426     1
DYS388     1
DYS439     2
DYS389i    2
DYS392     2
DYS389ii   2

Total      4096

   There is a 1 in 4096 chance that two R1b 12 marker haplotypes could converge.  This is not the probability that one marker will change.  This is the probability that all 12 markers will change enough to match another haplotype.  These are very good odds and the reason why a 12-marker test is practically useless. 
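The arithmetic behind the 1 in 4096 figure can be checked with a few lines of Python.  This is only a sketch of the multiplication described above: the per-marker counts are copied from the table, and the function name convergence_odds is made up for illustration.

```python
from math import prod

# Number of likely values per marker, copied from the table above (R1b, 12 markers).
possible_values = {
    "DYS393": 2, "DYS390": 4, "DYS19": 2, "DYS391": 2,
    "DYS385a": 2, "DYS385b": 4, "DYS426": 1, "DYS388": 1,
    "DYS439": 2, "DYS389i": 2, "DYS392": 2, "DYS389ii": 2,
}

def convergence_odds(values_per_marker):
    """Return N such that the chance of convergence is 1 in N.

    Hypothetical helper: multiplies the number of likely values for each
    marker, treating the markers as independent.
    """
    return prod(values_per_marker.values())

odds = convergence_odds(possible_values)
print(f"1 in {odds}")
```

Adding more markers to the dictionary multiplies the result, which is why the odds grow so quickly as the marker count rises.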

   With a high probability that 12 STR markers will converge, haplotypes start to blend together.  Two different haplogroups or family lines will appear to be the same.  Converging also means that when we calculate the time to the most recent common ancestor (TMRCA), it will look like less time has passed.  Convergence makes a 12-marker test result unusable for genealogical matching, haplogroup prediction and TMRCA calculations.  The Chicken Littles are correct, we have a problem with 12 marker STR results.

   What about 17 markers, a quasi-industry standard for science papers?  Taking the same approach with statistics and probability, a 17-marker yDNA R1b result has a 1 in 2 million chance of converging with another haplotype.  Each haplogroup has slightly different odds.  There is a 1 in 500,000 chance of an R1a 17 marker haplotype converging.  Those odds are better than any lottery.  Convergence is still a problem at 17 markers.

   When Dienekes Pontikos proclaimed the death of yDNA STRs, he was commenting on the attempt to get good TMRCA dates from 10-marker results.  I agree, you can’t get valid TMRCA dates from 10 markers.  When Wang, et al, determined that convergence compromises haplogroup prediction, they were correct: 17-marker haplotypes can converge to make one haplogroup look like another.

   In a quick analysis of 4,300 unique 37-marker R1b haplotypes, the average genetic distance is 17 steps for 37 markers.  That means there are 17 mutations required for convergence in a 37-marker haplotype.  Nearly half of the markers in the haplotypes would need to change.  When we look at the probability of 25-marker haplotype convergence, the chances are 1 in 84 million.  Considering there are about 3.6 billion men on the planet, one in 84 million is still in the realm of possibility.  By the time we get to 37-markers, the odds are 1 in 49 trillion.
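For anyone who wants to see what a "step" means here, the snippet below is a minimal sketch of one common way to count genetic distance: add up the absolute differences in repeat counts, marker by marker.  The two haplotype fragments are made up, the function name is hypothetical, and testing companies may score some multi-copy markers differently, so treat this as an illustration and not the exact method used for the 4,300 haplotypes.

```python
def genetic_distance(hap_a, hap_b):
    """Sum of absolute repeat-count differences over the markers both share.

    Simple stepwise counting; an assumption for illustration only.
    """
    shared = hap_a.keys() & hap_b.keys()
    return sum(abs(hap_a[m] - hap_b[m]) for m in shared)

# Two made-up haplotype fragments, for illustration only.
a = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
b = {"DYS393": 13, "DYS390": 23, "DYS19": 15, "DYS391": 13}

print(genetic_distance(a, b))  # 0 + 1 + 1 + 2 = 4
```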

   There is a 1 in 49 trillion chance that all the necessary mutations will occur in order for two 37-marker haplotypes to converge.  The odds against convergence are likely much higher.  I’ve only looked at the probable values for each marker and I haven’t taken into account the STR mutation rates, the probability that a marker will change over time.

   There is essentially no such thing as convergence when 37 or more markers are tested and researched.  If you eliminate the possibility of convergence by using 37 STR markers, then immediately TMRCA calculations become more accurate and haplotypes from different haplogroups no longer resemble each other.  The reports of the death of yDNA STR results have been greatly exaggerated.


   I can’t tell you why researchers are currently stuck on 17 markers.  I can tell you that any research using fewer than 37 markers runs the risk of convergence in its data, which in turn could lead to the wrong conclusions.  I still consider genetic genealogy to be in its infancy.  Every month new research papers are published and the new concepts introduced are latched onto immediately.  It is understandable that papers from over a decade ago used a dozen STRs and a handful of SNPs; that was the height of technology.  If the latest technology and best data are not being used in today’s research papers, is that equivalent to scientific negligence?  Or, am I missing something and this is a case of scientific ignorance on my part?

Tuesday, December 2, 2014

DNA, SNP, STR, OMG!

(Originally published May 2014 in Going In-Depth)

   Oh my gosh, there are many acronyms in genetic genealogy.  You have to agree that using the acronym DNA is better than writing deoxyribonucleic acid repeatedly.  Although, when we talk about using DNA for genealogy and we only use acronyms, they start to lose their meaning and become just another ‘thing’.  “Hey, I’ve got a SNP.  Do you have a SNP?”  “I dunno, let me check.”  Maybe I’m weird.  I like to understand what all the acronyms mean and how they play a part in the larger picture.

   Let’s start with some DNA basics.  We have DNA in every cell except the red blood cells.  Inside the nucleus of our cells, we have 46 chromosomes or 23 pairs (nuclear DNA).  One set of 23 comes from dad and one set comes from mom.  If we took the tightly coiled DNA from one cell and stretched it out it would be about six feet long.  In that six-foot double helix from one cell, there are over 3 billion base pairs.  If you picture our double helix DNA as a twisted ladder, each rung is a base pair made up of two of the four nucleotides (DNA building blocks).  Each rung is either an adenine-thymine pair or a cytosine-guanine pair.



   When we talk about DNA, we often also talk about mitochondrial DNA.  Mitochondria exist outside of the nucleus as an energy source for the cell and have their own independent DNA.  Mitochondrial DNA has just over 16,000 base pairs in comparison to the 3 billion base pairs in our nuclear DNA.  We inherit our mitochondrial DNA only from our mothers.

   DNA is divided into coding regions (genes that define proteins for such things as eye color) and non-coding regions (sometimes called junk DNA).  The coding region that defines us is less than 2% of our overall DNA and within that, there are fewer than 25,000 genes.  A gene is a sequence of nucleotides averaging about 23,000 base pairs.  One of the largest genes, which encodes for the Caspr2 protein, has over 2.3 million base pairs.


   Within the 3 billion base pairs of our DNA there are variations (normally occurring mutations), where one base pair has been replaced with another base pair.  As an example, it was adenine (A) and now it’s guanine (G).  This is a single nucleotide polymorphism or SNP (pronounced snip).  There are over 15 million SNPs in our DNA.  Once a SNP occurs, it is usually permanent in the population.  The farther back in time that the SNP occurred, the more people will have that particular mutation.  To be considered a SNP, it has to exist in greater than 1% of the population.  They are found in both the coding and non-coding regions of our DNA.  In the coding regions, SNPs are often markers for genes.

   Let’s divide our DNA into four groups.  Group one, the autosomes, are the first 22 pairs of chromosomes.  The next two groups, the sex chromosomes, are one X and one Y if you are male and two Xs if you are female.  That gives us yDNA and xDNA.  The last DNA group is mitochondrial.  All types of DNA have SNPs.  Autosomal SNPs are used for health and ethnicity.  Mitochondrial and Y-DNA SNPs are used to determine world haplogroups.  While there are 1,000s of X SNPs, there doesn’t seem to be much research around them.

   SNPs have no effect on health, but their presence may predict a health risk.  If you had an autosomal test from 23andMe (prior to the FDA ruling), they would have delivered health information with your results.  They were able to report SNPs in the coding region associated with gene combinations responsible for health risks, like cancer or Alzheimer’s or basic information, like eye and hair color.  Even though you cannot get health information from 23andMe currently, you can still use your autosomal results with Promethease from SNPedia.com to research your health risks.
   Combinations of SNPs are analyzed to determine ancestry-informative markers (AIM – another new acronym for you).  AIMs are used to estimate the ethnicity or at least the geographic origins of your ancestors.  When you receive ethnicity results from an autosomal test, it will be based on the AIMs that the test company is using.  They don’t all use the same markers, so results will vary.  There are even 42 SNPs associated with having Neandertal ancestry.
   SNPs are used to organize us into larger branches of the human family tree (haplogroups).  Our maternal family tree is organized into 26 branches (A through Z) using mitochondrial DNA.  Our paternal tree is similarly organized into 20 branches (A through T) using yDNA SNPs.   As an example, take four men (I use men because the scenario works for both mitochondrial DNA and yDNA), Abe, Bob, Chaz and Dave.  Test each of them for three SNPs, X, Y and Z.  You find that they all test positive for SNP Z, Abe and Chaz test positive for X and Bob and Dave test positive for Y.  You can start to see the branches and the beginning of a tree.
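The Abe, Bob, Chaz and Dave example can be sketched in Python to show how shared SNPs turn into branches.  Everything here is hypothetical (the names, the SNP labels X, Y and Z, and the helper group_by_snp); it is a toy grouping, not how the real haplogroup trees are built.

```python
from collections import defaultdict

# Hypothetical test results from the example: the SNPs each man tests positive for.
results = {
    "Abe": {"Z", "X"},
    "Bob": {"Z", "Y"},
    "Chaz": {"Z", "X"},
    "Dave": {"Z", "Y"},
}

def group_by_snp(results):
    """Map each SNP to the set of people who carry it."""
    carriers = defaultdict(set)
    for person, snps in results.items():
        for snp in snps:
            carriers[snp].add(person)
    return dict(carriers)

carriers = group_by_snp(results)
everyone = set(results)

# A SNP carried by everyone sits on the trunk; a SNP carried by a subset marks a branch.
trunk = sorted(s for s, c in carriers.items() if c == everyone)
branches = {s: sorted(c) for s, c in carriers.items() if c < everyone}

print("trunk:", trunk)
for snp in sorted(branches):
    print(f"  branch {snp}: {branches[snp]}")
```

Z ends up on the trunk because all four men share it, while X and Y each split off a pair.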



   The first yDNA and mtDNA trees were built using only a few dozen SNPs.  Today, the paternal and maternal haplogroup trees are much more detailed, based on thousands of SNPs.  Complete SNP testing has been available for mitochondrial DNA for a number of years.  Starting last year, complete SNP testing is available for yDNA from companies like FamilyTreeDNA with their Big Y test.  Previously yDNA SNP tests were designed to look for specific SNPs.  With advances in technology, they can now look for all the SNPs across over 12 million yDNA base pairs.

   Just to add another acronym to the pile, there are also STRs or short tandem repeats (aka microsatellites).  STRs are short sequences of base pairs that repeat.  These repeats are found in autosomal, y and x DNA.  You may have heard the term CODIS if you watch Crime/Drama shows on television.  CODIS is the FBI’s Combined DNA Index System (more acronyms).  When DNA is collected for CODIS, they typically test for 13 STR markers across the autosomes.  When you have a yDNA STR test done, genetic genealogy companies test for up to 111 markers only on the Y chromosome.  They will also perform a basic SNP test to identify your paternal haplogroup.  SNPs and STRs are different in that SNPs appear to be permanent changes in our DNA and STRs are variable.  STRs are identified by location on the chromosome and by the number of times that the repeat occurs.  The number of repeats per STR can change over time, sometimes increasing, sometimes decreasing in number or increasing then decreasing again (known as a back mutation).  The combined set of STR markers is your haplotype and may be unique to your surname or span multiple surnames.  With the advances in yDNA SNP testing, SNPs will be found that are unique to your surname, which could make STR testing obsolete.
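To make "the number of times that the repeat occurs" concrete, here is a toy Python sketch that counts back-to-back copies of a short motif in a string of bases.  The sequence and the GATA motif are made up for illustration, and the function name is hypothetical; real STR results come from lab instruments and marker-specific definitions, not from a simple text search.

```python
import re

def count_tandem_repeats(sequence, motif):
    """Return the longest run of back-to-back copies of motif in sequence."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)

# Made-up sequence: the motif GATA repeated 11 times between flanking bases.
seq = "CCTG" + "GATA" * 11 + "TTCA"

print(count_tandem_repeats(seq, "GATA"))
```

A marker value such as 11 in a haplotype is this kind of count at one named location on the Y chromosome.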

   We all have DNA: 23 pairs of chromosomes in our cell nuclei, one set from mom and one set from dad.  We also have mitochondrial DNA from our moms.  Less than 2% of our DNA is in the form of genes, which define who we are.  SNPs can be used to identify our “good” and “bad” genes.  SNPs can also help identify our ethnicity and build our paternal and maternal family trees.  STRs can organize us down to the paternal surname level.  When folks start talking DNA, don’t be afraid to question them about, “What kind of DNA?”, “What does that SNP indicate?” or “What type of STR is being tested?”.  We’ll never get away from using acronyms to simplify how we communicate genetic genealogy.  That doesn’t mean we need to let the acronyms simplify the meanings to a point where the science is lost.  Every little bit of knowledge adds to our understanding of ourselves.


© Michael Maglio

Wednesday, September 24, 2014

DNA Mysteries: Iberian R1b-V88 in Africa

   When I first heard about R1b in Africa, my immediate assumption was that the predominantly Celtic haplogroup must have been a recent transplant.  I ran some of the V88 haplotypes against the big databases (FTDNA & ySearch) expecting to see matches to European men within the African colonial timeframe.  It wasn’t that easy.  Common ancestor analysis put the R1b Africans (V88) thousands of years removed from the rest of their European R1b cousins.  Where did they come from?  How did they get there?


   I started with the given that the R1b defining mutations (SNPs) occurred in the Iberian Peninsula.  The jury is still out on this hypothesis.  There have been scientific papers for and against Iberian origins of R1b.  My own work (Iberian Origins of R1b) supports an origin prior to the Neolithic expansion.  Could V88 have made a straight-line migration from Iberia to the Lake Chad region of Africa?  Could V88 have crossed the Straits of Gibraltar, travelled across the Sahara, which 7,000 years ago was a savannah well populated with animals for hunting, and arrived at Lake Mega-Chad?  That was my early premise.  I was wrong.

   The distribution of V88 is much larger than any of the scientific papers would indicate.  While I agree with the work that’s been done correlating the spread of V88 with the spread of Chadic languages (Cruciani et al 2010), the Chadic population is only a subset.  Nobody takes into consideration the V88 populations in Europe and the Middle East.  If they do, it is a sideways glance to say we’re ignoring them because they don’t fit into what we are trying to prove.  If you don’t look at the entire picture, your conclusions will be skewed.

   I wanted the largest selection of V88 Y-DNA records with at least 37 markers tested.  I started with Family Tree DNA projects that had the records SNP tested.  Those haplotypes were run against the ySearch database to identify highly related records with no SNP testing.  The initial gathering of records picked up individuals with SNP M73.  These were removed.  The key differentiator between V88 and M73 was DYS464a&b.  V88 was typically 12,12 and M73 was 15,15.  Thirty-seven or more STR markers are helpful in identifying additional related haplotypes and even more necessary in determining the relationship between records.  Most studies only look at SNPs or a small handful of STR markers.  This is shortsighted.  Imagine a reference population of 100 records all with the same SNP.  Without enough STR markers you can’t tell whether you are looking at one haplotype with minor 1 or 2 step variations or 100 unique haplotypes.  That’s the difference between a founder event starting with as few as one individual or a group with greater diversity and age.

   My final set of 119 records has at least 37 STR markers, V88 SNP testing or is highly related via STR and has the geographic location of the most distant known ancestor.  The records are processed through PHYLIP to generate a phylogenetic tree.  The phylogenetic tree gives a visual depiction of the relationships in the dataset and an approximate number of years back to common ancestors, represented as the nodes between the records.


All of this is very standard genetic genealogy.  I add a twist (Biogeographical Multilateration) by converting the years back to a common ancestor to a distance using Cavalli-Sforza’s migration rate of 1 to 1.2 km per year.  This is enough for me to solve a series of cascading equations giving me the locations of the common ancestors.  Looking back at the phylogenetic tree shows us how all the nodes and locations are connected, essentially the flow of migration.
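The years-to-distance conversion is simple enough to sketch in Python.  The 1 to 1.2 km per year rate is the one quoted above; the function names and the 1,000-year input are made up for illustration, and this covers only the conversion step, not the cascading equations that follow it.

```python
def years_to_distance_km(years, rate_km_per_year):
    """Convert time back to a common ancestor into a migration distance."""
    return years * rate_km_per_year

def distance_range_km(years, low_rate=1.0, high_rate=1.2):
    """Return (low, high) distances using the 1 to 1.2 km/year rate quoted above."""
    return (years_to_distance_km(years, low_rate),
            years_to_distance_km(years, high_rate))

# Hypothetical input: a common ancestor 1,000 years back.
low, high = distance_range_km(1000)
print(f"{low:.0f} to {high:.0f} km")
```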


   The out of Iberia event took place about 7,700 ± 1,600 years ago.  TMRCA calculations have been shown to be very inconsistent.  Some folks use a constant mutation rate and some use rates per marker.  I include a TMRCA to give a relative chronology.  While the majority of R1b is known for its Western Atlantic migrations, V88 took a path along the Mediterranean coast and down the Adriatic.  While none of the V88 records indicated Crete as an ancestral location, it appears multiple times as a common ancestor location.  The data shows Crete as a stepping-stone in the Mediterranean as V88 migrated to the Nile River Valley.  The back to Africa event(s) occurred roughly 5,500 ± 1,000 years ago.


The majority of the Chadic records (Cameroon, Chad and Nigeria) have relatively close genetic connections to individuals in the Middle East (mainly Saudi Arabia).  The Chadic and Middle Eastern records tie back to common ancestors along the upper Nile.  There is a significant lack of information to understand what impact R1b-V88 had on the Nile Valley cultures.  Considering that there was only 1 out of 119 records with an exact Nile River location, I would venture a guess that V88 didn’t integrate well.

   While the V88 back to Africa migration has captured much attention, the data shows a more fascinating event.  There was a V88 re-migration back to Europe from Africa.   The back to Europe event took place about 3,200 ± 1,000 years ago.  Again, Crete played a role as a stepping-stone as V88 entered the Eastern Adriatic region and spread into Central and Eastern Europe.  Someone will probably notice that many of the V88 in Eastern Europe are Jewish and that the date for leaving the Nile region is close to the time of Exodus.  There is nothing in any of the data to indicate that this was the Jewish Exodus from Egypt.  The V88 group in Eastern Europe is closely related and there is phylogenetic evidence to support that this may have been a founder event with a single male or small group of closely related males.  There is no evidence to support that those founders were Jewish when they left Africa.


   By looking at the big picture, including all the data and letting the data illustrate the patterns, we can unravel what appears to be the mysterious appearance of R1b in Central Africa.  Along the way, we can uncover a previously unknown re-migration from Africa to Europe.  Too often haplogroup data is treated as discrete buckets of information living in a vacuum with no interaction to other haplogroups and no internal relationships.  Every DNA record is connected to every other record in a network.  Each haplotype is a vector with location and direction.  The sooner we treat genetic records as a network analysis, the sooner we will solve more DNA mysteries.

Out of Iberia and back to Africa.  Followed by a return to Europe.

Reference:

Maglio, MR (2014)  Y Chromosome Haplogroup R1b-V88: Biogeographical Evidence for an Iberian Origin (Link)


Tuesday, April 29, 2014

Exploring Rollo's Roots: DNA Leads the Way


   It’s been nearly a year since I wrote about William the Conqueror’s DNA.  Based on a study of men with surnames historically associated with William and their corresponding Y-DNA, I concluded that I identified the genetic signature of the first Norman King of England.  Now it’s time to get back to William and more specifically his 3rd great grandfather, Rollo.  To be honest, the 37 marker Y-DNA haplotype that I published is really connected to Richard the Fearless, William’s great grandfather.  Genealogically, the surnames in the study trace back to Richard.  As long as there was no hanky-panky, William the Conqueror has the same Y-DNA as Richard.  What that also means is that Richard has the same Y-DNA as his grandfather, Rollo.

   Based on the work done in my previous paper, the following haplotype is that of William the Conqueror (and Richard the Fearless)-


DYS393    DYS390    DYS19     DYS391    DYS385a   DYS385b   DYS426    DYS388    DYS439    DYS389i   DYS392    DYS389ii
13        24        14        11        11        14        12        12        12        13        13        29

DYS458    DYS459a   DYS459b   DYS455    DYS454    DYS447    DYS437    DYS448    DYS449    DYS464a   DYS464b   DYS464c   DYS464d
17        9         10        11        11        25        15        19        29        15        15        17        17

DYS460    Y-GATA-H4 YCAIIa    YCAIIb    DYS456    DYS607    DYS576    DYS570    CDYa      CDYb      DYS442    DYS438
11        11        19        23        15        15        17        17        36        37        12        12


   There is an assumption, inherent in genetic genealogy, that there weren’t any non-paternal events between the generations that separate Rollo and William and that this haplotype is that of Rollo as well.  One of the goals for this Rollo study is to get more accurate with his haplotype by narrowing the dataset to only those records with 67 markers.  The second goal is to determine Rollo’s haplogroup R SNP.  The best I was able to determine for William was R-P312, which is a fairly high level SNP.  My third goal is to determine Rollo’s origin using my TribeMapper analysis.  Whether Rollo is Danish or Norwegian has been disputed for hundreds of years.

   I picked up where I left off with William.  There were 152 Y-DNA records that made it into the William the Conqueror Modal Haplotype (WCMH).  For each of these records a 67 marker test result and SNP testing result were added to the analysis, where the data was available.  I threw out any record that didn’t have enough data and retained the ones that grouped into a single SNP of R-DF13 (just downstream of R-L21).  Based on these final 25 records, I have identified the 67 marker Rollo Norman Modal Haplotype (RNMH) as follows:

DYS393    DYS390    DYS19     DYS391    DYS385a   DYS385b   DYS426    DYS388    DYS439    DYS389i   DYS392    DYS389ii
13        24        14        11        11        14        12        12        12        13        13        29

DYS458    DYS459a   DYS459b   DYS455    DYS454    DYS447    DYS437    DYS448    DYS449    DYS464a   DYS464b   DYS464c   DYS464d
17        9         10        11        11        25        15        19        29        15        15        17        17

DYS460    Y-GATA-H4 YCAIIa    YCAIIb    DYS456    DYS607    DYS576    DYS570    CDYa      CDYb      DYS442    DYS438
11        11        19        23        15        15        17        17        36        37        12        12

DYS531    DYS578    DYF395S1a DYF395S1b DYS590    DYS537    DYS641    DYS472    DYF406S1  DYS511    DYS425    DYS413a   DYS413b
11        9         15        16        8         10        10        8         10        10        12        23        23

DYS557    DYS594    DYS436    DYS490    DYS534    DYS450    DYS444    DYS481    DYS520    DYS446    DYS617    DYS568
16        10        12        12        16        8         12        22        20        13        12        11

DYS487    DYS572    DYS640    DYS492    DYS565
13        11        11        12        12

Based on this modal haplotype and the associated SNP, a broader collection of genetic cousin records were identified to be used with my new TribeMapper analysis (Biogeographical Multilateration).
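For readers wondering what "modal" means in a modal haplotype, it is the most common value for each marker across a set of records.  The Python below is a minimal sketch with three made-up records and a hypothetical function name; it is not the actual 25-record dataset or the exact procedure used for the RNMH.

```python
from collections import Counter

def modal_haplotype(records):
    """Return the most common value for each marker across all records.

    Ties go to the value seen first; an assumption for illustration only.
    """
    markers = records[0].keys()
    return {m: Counter(r[m] for r in records).most_common(1)[0][0]
            for m in markers}

# Three made-up records covering three markers, for illustration only.
records = [
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    {"DYS393": 13, "DYS390": 23, "DYS19": 14},
    {"DYS393": 12, "DYS390": 24, "DYS19": 14},
]

modal = modal_haplotype(records)
print(modal)
```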




   This map shows the geographic distribution of Rollo’s cousins.  The large number of points along the coast of Normandy is a good sign.  If the majority of points were in Eastern Europe, I would have to revisit my whole hypothesis about William the Conqueror.  It is best not to try to interpret any relationships until we look at them through the lens of a phylogenetic tree.



   The TribeMapper analysis takes into consideration the mapped location, the tree node connections and the time between common ancestors.  The time is converted to distance based on the demic diffusion migration rate.  The distance is plotted to ‘triangulate’ the geographic location of each common ancestor.  This is a process called multilateration.
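The general idea of multilateration can be sketched in Python: given known points and a distance from each one to an unknown point, solve for where the unknown point sits.  This is a generic flat-plane textbook version with three made-up reference points measured in kilometers and a hypothetical function name.  It is not the TribeMapper code, which works on real geography and a whole tree of connected ancestors.

```python
from math import hypot

def trilaterate(p1, p2, p3, d1, d2, d3):
    """Locate a point on a flat plane from its distances to three known points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    # Subtracting the first circle equation from the other two leaves a linear system.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 + y2**2 - x1**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 + y3**2 - x1**2 - y1**2
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("reference points are collinear")
    # Cramer's rule for the 2x2 system.
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)

# Made-up example: three known locations (km) and an ancestor hidden at (3, 4).
known = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
hidden = (3.0, 4.0)
dists = [hypot(hidden[0] - x, hidden[1] - y) for x, y in known]

x, y = trilaterate(*known, *dists)
print(round(x, 6), round(y, 6))
```

In the analysis described above, the distances would come from converting time between common ancestors into kilometers, and the known points would be the mapped ancestor locations.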

   The earliest documented origins for Rollo come from Dudo of Saint-Quentin in 1015 and William of Jumièges in 1060.  Both ‘histories’ were commissioned by the House of Normandy and attribute a Danish origin to Rollo.  Commissioned biographies can border on mythology.   The Norwegian Orkneyinga Saga, from the 13th century, gives Rollo a Norwegian origin. 

   I’ve run the analysis with Rollo’s record as an unknown location.  TribeMapper allows us to back into the location for any unknown point.  What we get is a highly constrained location for Rollo’s ancestor, in the middle of Denmark.  The data then shows that Rollo may have lived within 226 km of that paternal ancestor.  The red circle illustrates the range for Rollo.  This covers the majority of Denmark.  The data also shows that Rollo’s ancestors, going back at least 12 generations were also in Denmark.



   We can give the Norwegians some credit also.  The ancestors of Rollo’s ancestors were Norwegian, with an origin on the west coast of Norway.  Rollo’s ancestors were responsible for multiple branches of migration into Europe.  This includes a back migration into Norway that then went on to invade Scotland.



   This was accomplished with a small sample of 65 records for simplification.  Much larger data sets could determine the genetic flow in a greater geographic and chronologic view.  Additional records within the same SNP grouping could result in a more accurate origin for Rollo.  Records that are genetically upstream from the SNP and STR group used may identify the nomadic migrations prior to the Western Norway settlement.


   I’ve run this simulation multiple times, getting the same results.  I’m comfortable calling Rollo – “The Dane”.

Reference:

Maglio, MR (2014) Biogeographical Origins and Y-chromosome Signature for the House of Normandy  (Link)