Monday, December 8, 2014

Atrocities and Assimilation: Crusader DNA in the Near East

   This paper got its start back in February of this year while I was researching R1b-DF100 for my posting, The Third Brother.  Among the data, primarily Western European haplotypes, was a single Armenian record.  One theory held by the R1b-L11>DF100 group that I was working with was that L11 was a fairly recent arrival from the Near East, 3,000 to 4,000 years ago, and that the Armenian record was part of the evidence.  I ran a phylogenetic test on the Armenian record, the L11 group and some similar Near East records.  The Armenian record fell squarely within a Baltic cluster on the tree with a rough TMRCA of about 1,200 years.  This Armenian was clearly more European than Armenian, at least on the paternal line.  My comment back to the L11 group was that their Armenian was probably the descendant of a Crusader based on the timing and directionality.

   In September, I ran across Pierre Zalloua’s paper - Y-Chromosomal Diversity in Lebanon Is Structured by Recent Historical Events (2008).  He and the other authors had put together a good correlation between Crusader DNA and haplogroup R1b in Lebanon.  The paper also correlated haplogroup J and the Muslim expansion.  The paper received quite a bit of feedback about haplogroup J and little or no mention of haplogroup R1b.  Considering the extent of the Crusaders’ presence in the Near East from 1096 to 1343, if they left DNA behind it would have been spread farther than Lebanon.

   The real question is not whether they left DNA behind.  There is significant literature that details the atrocities; raping and pillaging was standard operating procedure for the Crusaders.  There are also numerous accounts of assimilation.  During the Crusaders’ 247-year occupation, roughly eight generations, they married local women and raised families.  The real question is whether Crusader DNA survived to modern times.

Crusader DNA Distribution
   If Crusader DNA survived, it would be spread from Istanbul to Jerusalem and beyond.  The graphic above shows the potential for DNA distribution during the Crusader occupation (red) and the distribution over the past 918 years (gray).  My research focused on the following Near East countries - Armenia, Georgia, Iran, Iraq, Israel, Jordan, Lebanon, Palestine, Saudi Arabia, Syria and Turkey.

   Here is something I found bizarre.  Zalloua and team published their paper in 2008.  Every researcher looking at Near East R1b should be taking a lesson and validating that their data is not of Crusader origin.  Obviously, Crusader DNA wasn’t restricted to Lebanon.  In 2010, Balaresque et al. and again in 2011, Myres et al. published papers using Near East R1b data (Turkish).  Forty-two percent of the Turkish R1b haplotypes from Balaresque and Myres were identical to Zalloua’s Lebanese R1b data.  This didn’t seem to raise any flags, as Balaresque and Myres used the Turkish data to suggest a Near East origin and Neolithic expansion for R1b.  These folks must not talk to each other.  Two of Zalloua’s team members went on to work with Balaresque and Myres on their papers.  The first thing I would have said was, “Considering what Zalloua found, we need to validate the origins of the Turkish data further back than one or two generations.”

   When presenting an analysis it is always good to show comparison data.  I collected R1b data and haplogroup G and J data from multiple Family Tree DNA projects.  I have a higher comfort factor that G and J are associated with the Neolithic expansion, so they were used as a basis for comparison.  For each 37-marker Near East record obtained, I used the haplotype to query a larger set of related records from ySearch (I call this haplotype aggregation).  A Near East set and a Western European set of data were developed for each haplogroup.  I then compared each individual Near East haplotype against the entire Near East set and the entire Western European set.  You would expect that the Near East haplotypes would be more closely related to their peers in the Near East set.
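
   The comparison step can be sketched in Python.  This is only an illustration, not the exact procedure used for the paper: it assumes a simple stepwise genetic distance (the sum of repeat-count differences, marker by marker) and an affinity score defined as the mean distance to the Western European set minus the mean distance to the Near East set.  The function names and the four-marker haplotypes are invented for the example.

```python
from statistics import mean

def genetic_distance(h1, h2):
    """Stepwise distance: sum of absolute repeat-count differences per marker."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of markers")
    return sum(abs(a - b) for a, b in zip(h1, h2))

def mean_distance(haplotype, reference_set):
    """Average distance from one haplotype to every other record in a set."""
    others = [h for h in reference_set if h is not haplotype]
    return mean(genetic_distance(haplotype, h) for h in others)

def affinity(haplotype, near_east, western_europe):
    """Positive: closer to Near East peers. Negative: closer to Western Europe."""
    return mean_distance(haplotype, western_europe) - mean_distance(haplotype, near_east)

# Hypothetical 4-marker haplotypes (a real comparison would use 37 markers).
near_east = [(13, 24, 14, 11), (13, 23, 14, 10), (12, 24, 15, 11)]
western_europe = [(13, 24, 14, 10), (13, 25, 14, 11), (14, 24, 14, 11)]

for h in near_east:
    print(h, affinity(h, near_east, western_europe))
```

   A score near zero would sit on the neutral line; the sign shows which set a haplotype leans toward under these assumptions.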

   The haplogroup J data tells the best story.  The results cluster along the J1-M267 and J2-M172 lines.  The neutral line (diagonal triangles) represents zero affinity towards the Near East or Western Europe.  Points falling to the right of neutral show an affinity toward the Near East and to the left of neutral, an affinity towards Western Europe.


   J1 haplotypes (diamonds), which are rare in Europe, are closely related to their peers in the Near East.  The J1 data only shows an affinity toward the Near East.  The trend line for J1 indicates a fairly stationary population pattern with no suggestion of migration to Western Europe.  A trend line that doesn’t cross the neutral represents a strong peer affinity and little or no migration between the Near East and Western Europe.  J2 data (squares) shows a tipping point at which the more distantly related records lean toward the Near East and the closely related records lean toward Western Europe.  That transition shows a TMRCA of about 3,900 ± 800 years.  The tipping point indicates a point in time where the Near East J2 haplotypes became more common in Western Europe, illustrating a migration. 


   Haplogroup G shows results very similar to J2.  Haplogroups J2 and G have been associated with the Neolithic spread of agriculture from the Near East to Western Europe.  Both J2 and G present a consistent distribution from distant relationship (high variance) to closer relationship (low variance).  The trend lines for J2 and G represent migration events from the Near East to Western Europe.  The trend line for J1 represents no migration event.  These results are consistent with other published information.

   Haplogroup R1b does not exhibit either a migration or a non-migration pattern.  The haplotypes cluster in a fairly homogeneous group.  There is a slight lean toward Western Europe and essentially no continuum from high variance to low variance.  The more distantly related haplotypes don’t exist in the Near East.  The Near East individuals are just as related to the Western European individuals as they are to their own peers.  The approximate TMRCA for the R1b Near East – Western European group is 1,800 ± 500 years.


   Through atrocities and assimilation, Western European DNA from Crusaders was permanently introduced into the Near East less than 1,000 years ago.  Western European and Near East R1b haplotypes are highly and recently related.  The data indicates that within the last 2,000 years there was a migration from one geography to the other.  There is no documented migration in the past 2,000 years that would account for Western European R1b populations coming from the Near East and replacing indigenous European populations.  The introduction of Western European DNA into the Near East by Crusaders accounts for the west to east genetic flow.

   The sampling practices of research studies are questionable.  The origin of participants is typically only validated for one or two previous generations.  This is equivalent to not knowing the origin of the study participants.  Sampling needs to be undertaken with a genetic genealogy approach and 37 markers or more.  The population genetics approach of fewer than 17 markers, poor origin validation and haplogroup generalization needs to change.

   Previous papers (Balaresque & Myres) that have used Near East R1b data as the basis of their research are suspect.  In light of the introduction of Crusader DNA into the Near East within the past 1,000 years, any theory on a Neolithic origin for haplogroup R1b will have to be re-evaluated.

Reference:

Maglio, MR (2014) Y-Chromosomal Haplogroup R1b Diversity in Near East is Structured by Recent Historical Events


© Michael R. Maglio

Friday, December 5, 2014

DNA Convergence and Chicken Little

   For me, the topic of convergence in yDNA first came up early in 2014.  I had just posted a paper and one of the comments was – “What about convergence?”  I said to myself, “What convergence?”  I admit I had to look up the topic.

Convergence: A term used in genetic genealogy to describe the process whereby two different haplotypes mutate over time to become identical or near identical resulting in an accidental or coincidental match. - Turner A & Smolenyak M 2004.

My response back to the comment was - “All of the haplotypes in my paper are unique.”  My data did not exhibit convergence. 
Convergence casts a shadow on genetic genealogy
   I started to poke around on the topic of convergence within yDNA STR haplotypes and the immediate impression that I got was that folks were ready to give up on STRs in favor of SNPs and the sky was falling.  Chicken Little was running around in the genetic genealogy circles.  Here is a small sample:

“Y-STRs are effectively dead” - Dienekes Pontikos, 2011

“Convergence of Y chromosome STR haplotypes from different SNP haplogroups compromises accuracy of haplogroup prediction” - Wang, et al, 2013

   Okay, convergence happens, but it’s an illusion.

   Let’s take a big step backwards in this story.  Did you know that most scientific papers relating to genetic genealogy use 17 STR markers or fewer?  Some use as few as 9 or 10.  For any of you who ever took one of the original 12 STR marker tests, you know that the results were essentially useless for anything except deep haplogroup association and history.

   Many researchers in the last couple of years are using the AmpFLSTR® Yfiler® to get their 17 marker results.  This equipment is approved for forensic cases.  Research papers are not forensic cases and researchers don’t need to limit themselves to 17 markers.  Thirty-seven marker yDNA tests have been available since 2004.

   Why does the number of STR markers matter?  I’m going to release my inner math geek to help explain.  If we look at marker DYS19, usually listed first in science papers and third in Family Tree DNA results, it can have a value within the range of 7 to 22 across all haplogroups.  Looking at R1b specifically, DYS19 ranges from 10 to 17 and statistically at two standard deviations (2 sigma) the range of values narrows to 13, 14 and 15.  From a probability point of view, there is a 1 in 3 chance that DYS19 will be any particular one of those three values.  Making the odds even better in our favor, 95% of the time DYS19 for R1b will already be 13, 14 or 15.  Since the marker already holds one of the three values, only two others remain, which means there is a 1 in 2 chance that DYS19 could change to the value needed on its way to converging with another haplotype.

   Taking standard deviation into account to determine the possible number of values for the STR markers and then multiplying each probability gives the odds that a haplotype could converge.

STR         # of possible marker values
DYS393      2
DYS390      4
DYS19       2
DYS391      2
DYS385a     2
DYS385b     4
DYS426      1
DYS388      1
DYS439      2
DYS389i     2
DYS392      2
DYS389ii    2

Total       4096

   There is a 1 in 4096 chance that two R1b 12 marker haplotypes could converge.  This is not the probability that one marker will change.  This is the probability that all 12 markers will change enough to match another haplotype.  These are very good odds and the reason why a 12-marker test is practically useless. 
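
   The 4096 figure is simply the product of the per-marker counts in the table above.  A short Python check, using only the numbers from that table (the variable names are mine):

```python
from math import prod

# Number of probable values per marker for R1b, taken from the table above.
possible_values = {
    "DYS393": 2, "DYS390": 4, "DYS19": 2, "DYS391": 2,
    "DYS385a": 2, "DYS385b": 4, "DYS426": 1, "DYS388": 1,
    "DYS439": 2, "DYS389i": 2, "DYS392": 2, "DYS389ii": 2,
}

# Multiply the counts together to get the number of possible 12-marker haplotypes.
combinations = prod(possible_values.values())
print(f"1 in {combinations}")  # 1 in 4096
```

   Markers with only one probable value (DYS426 and DYS388) contribute nothing to the product, which is part of why 12 markers separate haplotypes so poorly.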

   With a high probability that 12 STR markers will converge, haplotypes start to blend together.  Two different haplogroups or family lines will appear to be the same.  Converging also means that when we calculate the time to the most recent common ancestor (TMRCA), it will look like less time has passed.  Convergence makes a 12-marker test result unusable for genealogical matching, haplogroup prediction and TMRCA calculations.  The Chicken Littles are correct, we have a problem with 12 marker STR results.

   What about 17 markers, a quasi-industry standard for science papers?  Taking the same approach with statistics and probability, a 17-marker yDNA R1b result has a 1 in 2 million chance of converging with another haplotype.  Each haplogroup has slightly different odds.  There is a 1 in 500,000 chance of an R1a 17 marker haplotype converging.  Those odds are better than any lottery.  Convergence is still a problem at 17 markers.

   When Dienekes Pontikos proclaimed the death of yDNA STRs, he was commenting on the attempt to get good TMRCA dates from 10-marker results.  I agree, you can’t get valid TMRCA dates from 10 markers.  When Wang, et al, determined that convergence compromises haplogroup prediction, they were correct: 17-marker haplotypes can converge to make one haplogroup look like another.

   In a quick analysis of 4,300 unique 37-marker R1b haplotypes, the average genetic distance is 17 steps for 37 markers.  That means there are 17 mutations required for convergence in a 37-marker haplotype.  Nearly half of the markers in the haplotypes would need to change.  When we look at the probability of 25-marker haplotype convergence, the chances are 1 in 84 million.  Considering there are about 3.6 billion men on the planet, one in 84 million is still in the realm of possibility.  By the time we get to 37 markers, the odds are 1 in 49 trillion.

   There is a 1 in 49 trillion chance that all the necessary mutations will occur in order for two 37-marker haplotypes to converge.  The odds against convergence are likely much longer.  I’ve only looked at the probable values for each marker and I haven’t taken into account the STR mutation rates, the possibility that a marker will change over time.
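
   To put the quoted odds next to the 3.6 billion figure, here is a rough Python sketch.  It makes a naive assumption that is not part of the original analysis: that the odds apply uniformly to every man, so dividing the population by N gives a ballpark count of men who could match a given haplotype by chance.  The function name is hypothetical.

```python
MEN_ON_PLANET = 3.6e9  # figure quoted in the text

# Odds quoted in the text for R1b, written as "1 in N" per marker count.
odds = {12: 4096, 17: 2_000_000, 25: 84_000_000, 37: 49_000_000_000_000}

def expected_chance_matches(one_in_n, population=MEN_ON_PLANET):
    """Naive estimate (an assumption): population size divided by N."""
    return population / one_in_n

for markers, n in odds.items():
    print(f"{markers} markers: about {expected_chance_matches(n):,.4f} men")
```

   Under that assumption the 25-marker figure still leaves a few dozen possible chance matches worldwide, while the 37-marker figure drops far below one.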

   There is essentially no such thing as convergence when 37 or more markers are tested and researched.  If you eliminate the possibility of convergence by using 37 STR markers, then immediately TMRCA calculations become more accurate and haplotypes from different haplogroups no longer resemble each other.  The reports of the death of yDNA STR results have been greatly exaggerated.


   I can’t tell you why researchers are currently stuck on 17 markers.  I can tell you that any research using fewer than 37 markers runs the risk of convergence in its data, which in turn could lead to the wrong conclusions.  I still consider genetic genealogy to be in its infancy.  Every month new research papers are published and the new concepts introduced are latched onto immediately.  It is understandable that papers from over a decade ago used a dozen STRs and a handful of SNPs; that was the height of technology.  If the latest technology and best data are not being used in today’s research papers, is that equivalent to scientific negligence?  Or, am I missing something and this is a case of scientific ignorance on my part?

Tuesday, December 2, 2014

DNA, SNP, STR, OMG!

(Originally published May 2014 in Going In-Depth)

   Oh my gosh, there are many acronyms in genetic genealogy.  You have to agree that using the acronym DNA is better than writing deoxyribonucleic acid repeatedly.  Although, when we talk about using DNA for genealogy and we only use acronyms, they start to lose their meaning and become just another ‘thing’.  “Hey, I’ve got a SNP.  Do you have a SNP?”  “I dunno, let me check.”  Maybe I’m weird.  I like to understand what all the acronyms mean and how they play a part in the larger picture.

   Let’s start with some DNA basics.  We have DNA in every cell except the red blood cells.  Inside the nucleus of our cells, we have 46 chromosomes or 23 pairs (nuclear DNA).  One set of 23 comes from dad and one set comes from mom.  If we took the tightly coiled DNA from one cell and stretched it out it would be about six feet long.  In that six-foot double helix from one cell, there are over 3 billion base pairs.  If you picture our double helix DNA as a twisted ladder, each rung is a base pair made up of two of the four nucleotides (DNA building blocks).  Each rung is either an adenine-thymine pair or a cytosine-guanine pair.
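
   The pairing rule is simple enough to write down in a few lines of Python.  This is a toy sketch; the function name and the four-letter strand are made up for illustration.

```python
# Base pairing: adenine (A) with thymine (T), cytosine (C) with guanine (G).
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complementary_strand(strand):
    """Return the base that sits across each rung, in the same order."""
    return "".join(PAIRS[base] for base in strand.upper())

# Each tuple is one rung of the ladder: (base on one side, base on the other).
rungs = list(zip("ATCG", complementary_strand("ATCG")))
print(rungs)
```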



   When we talk about DNA, we often also talk about mitochondrial DNA.  Mitochondria exist outside of the nucleus as an energy source for the cell and have their own independent DNA.  Mitochondrial DNA has just over 16,000 base pairs in comparison to the 3 billion base pairs in our nuclear DNA.  We inherit our mitochondrial DNA only from our mothers.

   DNA is divided into coding regions (genes that define proteins for such things as eye color) and non-coding regions (sometimes called junk DNA).  The coding region that defines us is less than 2% of our overall DNA and within that, there are fewer than 25,000 genes.  A gene is a sequence of nucleotides averaging about 23,000 base pairs.  One of the largest genes, which encodes the Caspr2 protein, has over 2.3 million base pairs.


   Within the 3 billion base pairs of our DNA there are variations (normally occurring mutations), where one base pair has been replaced with another base pair.  As an example, it was adenine (A) and now it’s guanine (G).  This is a single nucleotide polymorphism or SNP (pronounced snip).  There are over 15 million SNPs in our DNA.  Once a SNP occurs, it is usually permanent in the population.  The farther back in time that the SNP occurred, the more people will have that particular mutation.  To be considered a SNP, it has to exist in greater than 1% of the population.  They are found in both the coding and non-coding regions of our DNA.  In the coding regions, SNPs are often markers for genes.
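
   As a toy illustration of what a single-base change looks like, the Python sketch below compares two aligned sequences and reports where they differ.  The sequences and the function name are invented, and a difference found this way would only count as a SNP if it met the 1% population threshold mentioned above.

```python
def find_snps(reference, sample):
    """Return (position, reference_base, sample_base) for each single-base difference."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and equal in length")
    return [(i, r, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

reference = "GATTACAGGCT"  # hypothetical reference sequence
sample    = "GATTGCAGGCT"  # same stretch with A replaced by G at position 4 (counting from 0)

print(find_snps(reference, sample))  # [(4, 'A', 'G')]
```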

   Let’s divide our DNA into four groups.  Group one, the autosomes, are the first 22 pairs of chromosomes.  The next two groups, the sex chromosomes, are one X and one Y if you are male and two Xs if you are female.  That gives us yDNA and xDNA.  The last DNA group is mitochondrial.  All types of DNA have SNPs.  Autosomal SNPs are used for health and ethnicity.  Mitochondrial and Y-DNA SNPs are used to determine world haplogroups.  While there are 1,000s of X SNPs, there doesn’t seem to be much research around them.

   SNPs have no effect on health, but their presence may predict a health risk.  If you had an autosomal test from 23andMe (prior to the FDA ruling), they would have delivered health information with your results.  They were able to report SNPs in the coding region associated with gene combinations responsible for health risks, like cancer or Alzheimer’s, or basic information, like eye and hair color.  Even though you cannot get health information from 23andMe currently, you can still use your autosomal results with Promethease from SNPedia.com to research your health risks.

   Combinations of SNPs are analyzed to determine ancestry-informative markers (AIM – another new acronym for you).  AIMs are used to estimate the ethnicity or at least the geographic origins of your ancestors.  When you receive ethnicity results from an autosomal test, it will be based on the AIMs that the test company is using.  They don’t all use the same markers, so results will vary.  There are even 42 SNPs associated with having Neandertal ancestry.

   SNPs are used to organize us into larger branches of the human family tree (haplogroups).  Our maternal family tree is organized into 26 branches (A through Z) using mitochondrial DNA.  Our paternal tree is similarly organized into 20 branches (A through T) using yDNA SNPs.  As an example, take four men (I use men because the scenario works for both mitochondrial DNA and yDNA), Abe, Bob, Chaz and Dave.  Test each of them for three SNPs, X, Y and Z.  You find that they all test positive for SNP Z, Abe and Chaz test positive for X and Bob and Dave test positive for Y.  You can start to see the branches and the beginning of a tree.
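
   The Abe, Bob, Chaz and Dave example can be worked through in a few lines of Python.  This is a minimal sketch of the grouping idea only, not how real haplogroup trees are built; the function and variable names are hypothetical.

```python
from collections import defaultdict

# SNP results from the example: every man is positive for Z,
# Abe and Chaz for X, Bob and Dave for Y.
results = {
    "Abe":  {"Z", "X"},
    "Bob":  {"Z", "Y"},
    "Chaz": {"Z", "X"},
    "Dave": {"Z", "Y"},
}

def carriers_by_snp(results):
    """Map each SNP to the sorted list of men who tested positive for it."""
    carriers = defaultdict(list)
    for person, snps in results.items():
        for snp in snps:
            carriers[snp].append(person)
    return {snp: sorted(people) for snp, people in carriers.items()}

carriers = carriers_by_snp(results)
everyone = sorted(results)

# A SNP shared by everyone is the trunk; SNPs carried by a subset are branches.
trunk = sorted(snp for snp, people in carriers.items() if people == everyone)
branches = {snp: people for snp, people in carriers.items() if people != everyone}

print("Trunk:", trunk)
for snp in sorted(branches):
    print(f"  Branch {snp}: {', '.join(branches[snp])}")
```

   Z is the shared trunk, with X (Abe and Chaz) and Y (Bob and Dave) as the two branches.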



   The first yDNA and mtDNA trees were built using only a few dozen SNPs.  Today, the paternal and maternal haplogroup trees are much more detailed, based on thousands of SNPs.  Complete SNP testing has been available for mitochondrial DNA for a number of years.  Starting last year, complete SNP testing is available for yDNA from companies like FamilyTreeDNA with their Big Y test.  Previously yDNA SNP tests were designed to look for specific SNPs.  With advances in technology, they can now look for all the SNPs across over 12 million yDNA base pairs.

   Just to add another acronym to the pile, there are also STRs or short tandem repeats (aka microsatellites).  STRs are short sequences of base pairs that repeat.  These repeats are found in autosomal, y and x DNA.  You may have heard the term CODIS if you watch Crime/Drama shows on television.  CODIS is the FBI’s Combined DNA Index System (more acronyms).  When DNA is collected for CODIS, they typically test for 13 STR markers across the autosomes.  When you have a yDNA STR test done, genetic genealogy companies test for up to 111 markers only on the Y chromosome.  They will also perform a basic SNP test to identify your paternal haplogroup.  SNPs and STRs are different in that SNPs appear to be permanent changes in our DNA and STRs are variable.  STRs are identified by location on the chromosome and by the number of times that the repeat occurs.  The number of repeats per STR can change over time, sometimes increasing, sometimes decreasing in number or increasing then decreasing again (known as a back mutation).  The combined set of STR markers is your haplotype and may be unique to your surname or span multiple surnames.  With the advances in yDNA SNP testing, SNPs will be found that are unique to your surname, which could make STR testing obsolete.
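
   To make "the number of times that the repeat occurs" concrete, here is a small Python sketch that counts the longest run of back-to-back copies of a motif.  The GATA motif and the surrounding sequence are invented for the example and do not represent any particular DYS marker.

```python
import re

def count_repeats(sequence, motif):
    """Return the longest run of back-to-back copies of motif found in sequence."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)

# Hypothetical sequence: flanking bases around five copies of GATA.
sequence = "CCTG" + "GATA" * 5 + "TTCA"
print(count_repeats(sequence, "GATA"))  # 5
```

   A mutation that adds or drops one copy would change this count to 6 or 4, which is the kind of step that STR results record.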

   We all have DNA: 23 pairs of chromosomes in our cell nuclei, half from mom and half from dad.  We also have mitochondrial DNA from our moms.  Less than 2% of our DNA is in the form of genes, which define who we are.  SNPs can be used to identify our “good” and “bad” genes.  SNPs can also help identify our ethnicity and build our paternal and maternal family trees.  STRs can organize us down to the paternal surname level.  When folks start talking DNA, don’t be afraid to question them about, “What kind of DNA?”, “What does that SNP indicate?” or “What type of STR is being tested?”.  We’ll never get away from using acronyms to simplify how we communicate genetic genealogy.  That doesn’t mean we need to let the acronyms simplify the meanings to a point where the science is lost.  Every little bit of knowledge adds to our understanding of ourselves.


© Michael Maglio