Monday, February 2, 2015

Before They Sailed: Mayflower DNA

Please share the details about this first of its kind book that will identify the DNA and trace the genetic ancestry of twenty families that sailed aboard the Mayflower. 

The story behind the story….

Who were the Mayflower passengers before they were pilgrims? Where did they come from? England has a long history of migrations and invasions. Were the Pilgrims’ ancestors Anglo-Saxons, Normans or Vikings? This book will use traditional genealogy and DNA to answer those questions and more.

The DNA of the Mayflower descendants will lead us on a path of discovery that will first allow us to confirm the genetic identity of each Pilgrim and then open a door to the past, before they sailed.

Following the DNA

The DNA falls into three Mayflower categories: descendants, genetic cousins and genetic ancestors. These categories will be defined in more detail in the book. The descendants' DNA gives us the ability to confirm the genetic identity of each Pilgrim and gives us deep ethnicity. The identity allows us to find genetic cousins in England, giving us the location of each Pilgrim's ancestral city and village. When we expand the search for cousins beyond England, we get a view of genetic ancestors, tribes and migrations across Europe.

Mayflower Surnames

The book will identify the DNA and trace the genetic ancestry of the following Mayflower families: Alden, Billington, Bradford, Brewster, Chilton, Cooke, Doty, Eaton, Fuller, Hopkins, Howland, Mullins, Rogers, Samson, Soule, Standish, Tilley, Warren, White and Winslow.

Looking for Descendants

For this project to be a success, I'm looking for both direct paternal line and maternal line descendants of the Pilgrims. If you have a solid genealogy back to the Mayflower, I would like to include your story. It would be great if you already have your y-DNA or mitochondrial DNA tested.


In addition to Mayflower ancestry, the book will illustrate the basics of genetic genealogy. How can we tell if two people are related? Where does the ethnicity come from? How do we know where someone's ancestors lived 500, 1,000 or 2,000 years ago? Why do the Anglo-Saxons, Celts, Normans and Vikings matter in a discussion about Pilgrims?


There will be a tremendous amount of data and information that gets generated during the research phase of this book. All of the info is valuable, but not all of it will make it into the book. The "members-only" Fieldnotes section of the companion website will contain all of the background and detail the progress for those who want to follow along. Even after the book is published, new information will be added as more is learned about the Mayflower Pilgrims and their DNA.

What will be produced?

This full-color book will be produced in a digital format, a paperback edition and a limited hardcover edition. There will be a companion website with behind the scenes details exclusive to members. A presentation for genealogy and history conferences is also planned.

Wednesday, January 28, 2015

Ghosts of DNA Past: Irish Kings

   In 2006, Laoise T. Moore and the folks at Trinity College in Dublin published a paper famous for identifying the modal haplotype of Irish High King Niall of the Nine Hostages.  In their work, they used seventeen Y-DNA STR markers.  While time to most recent common ancestor (TMRCA) calculations have accuracy issues, having only 17 markers gives a common ancestor over 2,000 years ago.   What the Trinity folks really accomplished was the identification of Niall’s paternal ancestor from over 400 years earlier.  The media in 2006 had a field day in their interpretation that most of Ireland is descended from Niall.  “Niall may be the most prolific male in Irish history.”  Also at 17 markers, there is a very high probability of convergence.  Through normal mutations, haplotypes can change over time to appear similar or identical to other haplotypes.  The lower the number of markers, the higher the chance of convergence.  At that time only high level SNPs were tested to determine haplogroup.  Without terminal SNPs it would have been impossible to recognize convergence, if it existed in the samples.

   In my research on the Kings of Ireland, I have used 67 markers to reduce the chance of convergence and to calculate the age of common ancestors on the descendant side of the target rather than the ancestor side.  I will demonstrate traditional median-joining networks and novel “tribal” markers for the identification of four historic Kings of Ireland.  Did Trinity get Niall’s haplotype correct with the limited data they had at the time?

Ghost:  a manifestation of a dead person

Modal haplotype:  a derived haplotype based on the DNA tests of a group of people

   A modal haplotype is a ghost of a person.  When we look at multiple DNA test results and calculate the mode, by definition we are just taking the values that appear most often.  There is no way to determine if the modal haplotype is the actual haplotype of the historic individual we are researching (short of historic samples).  While the modal is not perfect, it will be close enough at 67 markers for us to determine the genetic “ghost”.
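The mode calculation described above can be sketched in a few lines. This is a minimal illustration, assuming each test result is a list of repeat counts in the same marker order; the function name `modal_haplotype` and the four-marker sample values are hypothetical, not data from the projects.

```python
from collections import Counter

def modal_haplotype(haplotypes):
    """Return the most common value at each marker position."""
    if not haplotypes:
        raise ValueError("need at least one haplotype")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("all haplotypes must cover the same markers")
    modal = []
    for values in zip(*haplotypes):
        # most_common(1) returns [(value, count)]; ties go to the value seen first
        modal.append(Counter(values).most_common(1)[0][0])
    return modal

# Hypothetical four-marker results for four men (invented values)
samples = [
    [13, 25, 15, 11],
    [13, 24, 14, 11],
    [12, 25, 14, 11],
    [13, 25, 14, 10],
]
modal = modal_haplotype(samples)
print(modal)
```

In this invented sample the modal does not match any of the four men exactly, which is the sense in which a modal haplotype is a "ghost" rather than a tested person.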

   The septs of Ireland provide us an opportunity to develop genetic genealogy techniques and processes.  Irish surnames are typically patronymic.  The surnames generally take the form of Mac Cárthaigh (McCarthy), meaning son of Cárthaigh or Uí Néill (O’Neill), meaning grandson / descendant of Néill.  Irish septs serve as a collective of related families with shared ancestry and patronymic surnames.  Multiple septs then belong to larger dynasties such as the Eóganachta and the Dál gCais.

   If septs are patrilineal, then Y-DNA haplotypes should be consistent across sept surnames.  Research on the Uí Néill haplotype started with a geographical selection and then a subsequent reduction by sept surnames (Moore et al 2006).  For each target sept, affiliated surnames were identified.  In the case of Uí Néill, the following surnames and associated Y-DNA STR records were accessed from Family Tree DNA projects: O’Neill, Gallagher, Doherty and O’Donnell.  The selection includes 600 records and 5 common European haplogroups.

   Median-joining networks have been in use for over a decade for the visualization of genetic relationships.  The use of them at 67 STR markers has been rare, but it should be the norm.  This first image has the central cluster of a median joining network based on 25 STR markers from the Uí Néill group.  It is just a single cluster with no differentiation.

Figure 1 - Using only 25 STR markers, the Uí Néill network collapses to a single cluster.

When we look at the same group using 67 markers, we get four distinct clusters, each with their own SNP.  The cluster at the far right is predominantly R-L159 and the cluster at the lower right has R-P311/R-L151 nodes.  The cluster at the left contains all of the Uí Néill dynastic surnames, has the majority of nodes and is SNP R-M222, which is consistent with earlier studies.

Figure 2 - View of the Uí Néill network torso showing four distinct clusters.  Three groups on the right are O’Neill only.

As a double check to make sure that I wasn’t seeing some other phenomenon, I analyzed three random Irish surnames: Duffy, Kelly and McCormick.  The random sample produced over ten unique clusters with no surname overlap.  This comparison shows that septs are patrilineal and that Y-DNA haplotypes are consistent across sept surnames. 

Figure 3 - Median-joining network of yDNA sampled from three random Irish surnames: Duffy, Kelly and McCormick.  

Re-evaluating the Uí Néill data also shows that Trinity was correct in their identification of a 17-marker Uí Néill haplotype.  New data and new techniques allow us to produce a 67-marker haplotype.

Figure 4 - Sixty-seven STR Uí Néill Modal Haplotype (Niall of the Nine Hostages).

   A different technique that I’d like to illustrate involves the fact that not all STR markers are created equal.  This method takes advantage of “slow” mutating STR markers.  Each marker has its own mutation rate.  By selecting the 15 “slowest” markers with an average mutation rate of 0.00024, a virtual tribal haplotype is created that would be stable within the last 2,000 years (90% probability of 80 generations).  This is an order of magnitude lower than the average rate of 0.0029 used as a constant in typical TMRCA calculations.  The “tribal” markers isolated are DYS426, DYS388, DYS392, DYS455, DYS454, DYS578, DYS590, DYS641, DYS472, DYS594, DYS436, DYS490, DYS450 and DYS640.

   To manipulate the “tribal” haplotype of 15 microsatellites faster, the resulting values are concatenated into a string, e.g. 12121411119168108101212811.  The “tribal” haplotypes are summarized per surname and plotted to illustrate majority and affinity.

Figure 5 - Uí Néill dynastic haplotypes converted into 15 marker “tribal” haplotypes and summarized.

   The Uí Néill dataset resolved into 37 unique “tribal” haplotypes.  Figure 5 shows that haplotype 12121411119168108101212811 is the most dominant across the Uí Néill surnames.  As with the median-joining network analysis, this “tribal” haplotype is consistent with SNP R-M222. 
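The concatenate-and-summarize step can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a dictionary per test result), the function names, and every marker value are hypothetical, and only three of the slow markers named above are used to keep the example short.

```python
from collections import Counter, defaultdict

# Three of the slow markers named above, kept short for illustration.
TRIBAL_MARKERS = ["DYS426", "DYS388", "DYS392"]

def tribal_key(record, markers=TRIBAL_MARKERS):
    """Concatenate the repeat counts of the chosen markers into one string."""
    return "".join(str(record[m]) for m in markers)

def summarize_by_surname(records, markers=TRIBAL_MARKERS):
    """Count how often each tribal string occurs under each surname."""
    summary = defaultdict(Counter)
    for record in records:
        summary[record["surname"]][tribal_key(record, markers)] += 1
    return summary

# Hypothetical records (invented values, not project data).
records = [
    {"surname": "O'Neill", "DYS426": 12, "DYS388": 12, "DYS392": 14},
    {"surname": "O'Neill", "DYS426": 12, "DYS388": 12, "DYS392": 14},
    {"surname": "O'Neill", "DYS426": 12, "DYS388": 12, "DYS392": 13},
    {"surname": "Doherty", "DYS426": 12, "DYS388": 12, "DYS392": 14},
]

summary = summarize_by_surname(records)

# Combine the per-surname counts to find the dominant tribal string overall.
overall = Counter()
for counts in summary.values():
    overall.update(counts)
dominant = overall.most_common(1)[0][0]
print(dominant)
```

One design caveat: joining values with no delimiter can be ambiguous when a marker moves between one and two digits (9 versus 10), so a separator such as "-" may be safer if the strings are ever split back into markers.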

   I repeated these two techniques for the Uí Briúin sept using the following surnames and associated Y-DNA records: O’Brien, Hogan, Kennedy and McMahon.  The selection includes 615 records.  The Mac Cárthaigh dataset has the following surnames: McCarthy, Callaghan, Donovan and Sullivan.  The selection includes 319 records.  The Ua Conchobhair data has the following surnames: O’Connor, McManus, Reilly and Rourke.  The selection includes 352 records.

For more details, see my paper (Maglio 2015), cited at the end of this post.

Figure 6 - Sixty-seven STR Uí Briúin Modal Haplotype (Brian Boru).

Figure 7 - Sixty-seven STR Mac Cárthaigh Modal Haplotype (McCarthy Eoganachta Kings).

Figure 8 - Sixty-seven STR Ua Conchobhair Modal Haplotype (Last High King Roderick O'Connor).

   Here are a couple of interesting insights from my research.  Niall Noígíallach was High King of Ireland around 378 CE and founder of the Uí Néill dynasty.  Historically, his half-brother Brión was one of the founders of the Connachta dynasty and an ancestor of the last High King of Ireland, Ruaidrí Ua Conchobair.  If their genealogies are correct, the evidence is in their descendants’ DNA.  The data shows that Uí Néill and Ua Conchobair share the same SNP, R-M222.  The Uí Néill and Ua Conchobair modals are a 6-step match at 67 markers.  There is a 99% probability of a relationship not further than 1,260 years ago.  The results make a strong case for the validity of this historic genealogy.

   Brian Boru, High King of Ireland in 1002 CE, belonged to the Dál gCais dynasty and Tadhg Mac Cárthaigh, the first King of Desmond, belonged to the Eóganachta dynasty.  Ancient genealogies have the Eóganachta and Dál gCais dynasties descended from Ailill Aulom, the son-in-law of legendary king Conn of the Hundred Battles.  The Mac Cárthaighs and Uí Briúins do not share the same SNP (R-L226 vs. R-CTS4466), but by descent they would share a common R-DF13 ancestor.  The Mac Cárthaigh and Uí Briúin modals are an 11-step match at 67 markers.  There is a 99% probability of a relationship not further than 1,920 years ago.  This puts a Mac Cárthaigh-Uí Briúin common ancestor as a contemporary of the legendary Conn.

   New and improved genetic genealogy techniques are invaluable for the identification of historic individuals and the reconstruction of distant family trees at the macro level.


Maglio, MR (2015) Identifying Y-Chromosome Dynastic Haplotypes: The High Kings of Ireland Revisited (Link)

Monday, December 8, 2014

Atrocities and Assimilation: Crusader DNA in the Near East

   This paper got its start back in February of this year while I was researching R1b-DF100 for my posting, The Third Brother.  Among the data, primarily Western European haplotypes, was a single Armenian record.  The R1b-L11>DF100 group that I was working with had as one of their theories that L11 was a fairly recent, 3,000 to 4,000 years, arrival from the Near East and that the Armenian record was part of that evidence.  I looked at the Armenian record, ran a phylogenetic test on it, the L11 group and some similar Near East records.  The Armenian record fell squarely within a Baltic cluster on the tree with a rough TMRCA of about 1,200 years.  This Armenian was clearly more European than Armenian, at least on the paternal line.  My comment back to the L11 group was that their Armenian was probably the descendant of a Crusader based on the timing and directionality.

   In September, I ran across Pierre Zalloua’s paper - Y-Chromosomal Diversity in Lebanon Is Structured by Recent Historical Events (2008).  He and the other authors had put together a good correlation between Crusader DNA and haplogroup R1b in Lebanon.  The paper also correlated haplogroup J and the Muslim expansion.  The paper received quite a bit of feedback about haplogroup J and little or no mention about haplogroup R1b.  Considering the extent of the Crusaders’ presence in the Near East from 1096 to 1343, if they left DNA behind it would have been spread farther than Lebanon. 

   The real question is not – if they left DNA behind.  There is significant literature that details the atrocities; raping and pillaging was standard operating procedure for the Crusaders.  There are also numerous accounts of assimilation.  During the Crusaders’ 247-year occupation and roughly eight generations, they married local women and raised families.  The real question is: did Crusader DNA survive to modern times? 

Crusader DNA Distribution
   If Crusader DNA survived, it would be spread from Istanbul to Jerusalem and beyond.  The graphic above shows the potential for DNA distribution during the Crusader occupation (red) and the distribution over the past 918 years (gray).  My research focused on the following Near East countries - Armenia, Georgia, Iran, Iraq, Israel, Jordan, Lebanon, Palestine, Saudi Arabia, Syria and Turkey.

   Here is something I found bizarre.  Zalloua and team published their paper in 2008.  Every researcher looking at Near East R1b should be taking a lesson and validating that their data is not of Crusader origin.  Obviously, Crusader DNA wasn’t restricted to Lebanon.  In 2010, Balaresque, et al and again in 2011, Myres, et al, published papers using Near East R1b data (Turkish).  Forty-two percent of the Turkish R1b haplotypes from Balaresque and Myres were identical to Zalloua’s Lebanese R1b data.  This didn’t seem to raise any flags as Balaresque and Myres used the Turkish data to suggest a Near East origin and Neolithic expansion for R1b.  These folks must not talk to each other.  Two of Zalloua’s team members went on to work with Balaresque and Myres on their papers.  The first thing I would have said was – “Considering what Zalloua found, we need to validate the origins of the Turkish data further back than one or two generations”.

   When presenting an analysis it is always good to show comparison data.  I collected R1b data and haplogroup G and J data from multiple Family Tree DNA projects.   I have a higher comfort factor that G and J are associated with the Neolithic expansion, so they were used as a basis for comparison.  For each 37-marker Near East record obtained, I used the haplotype to query a larger set of related records from ySearch (I call this haplotype aggregation).  A Near East set and a Western European set of data were developed for each haplogroup.  I then compared each individual Near East haplotype against the entire Near East set and the entire Western Europe set.  You would expect that the Near East haplotypes would be more closely related to their peers in the Near East set.
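The shape of that comparison can be sketched in code. The exact metric used in the paper is not described here, so this is only an illustration under assumptions: genetic distance is taken as the sum of absolute differences in repeat counts, "affinity" is the difference between the two mean distances, and all function names and values are hypothetical.

```python
def genetic_distance(a, b):
    """Step-wise distance: total difference in repeat counts across markers."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mean_distance(haplotype, reference_set):
    """Average distance from one haplotype to a set, excluding itself."""
    others = [h for h in reference_set if h is not haplotype]
    if not others:
        raise ValueError("reference set has no other haplotypes")
    return sum(genetic_distance(haplotype, h) for h in others) / len(others)

def affinity(haplotype, near_east, western_europe):
    """Positive: closer to Near East peers. Negative: closer to Western Europe."""
    return mean_distance(haplotype, western_europe) - mean_distance(haplotype, near_east)

# Hypothetical three-marker haplotypes (invented values).
near_east = [[14, 12, 10], [14, 13, 10], [15, 12, 10]]
western_europe = [[12, 11, 13], [12, 11, 12]]

score = affinity(near_east[0], near_east, western_europe)
print(score)
```

A haplotype with a score near zero is as close to one set as to the other, which corresponds to the idea of a neutral line used in the results below.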

   The haplogroup J data tells the best story.  The results cluster down J1-M267 and J2-M172 lines.  The neutral line (diagonal triangles) represents zero affinity towards the Near East or Western Europe.  Points falling to the right of neutral show an affinity toward the Near East and to the left of neutral, an affinity towards Western Europe.

   J1 haplotypes (diamonds), which are rare in Europe, are closely related to their peers in the Near East.  The J1 data only shows an affinity toward the Near East.  The trend line for J1 indicates a fairly stationary population pattern with no suggestion of migration to Western Europe.  A trend line that doesn’t cross the neutral represents a strong peer affinity and little or no migration between the Near East and Western Europe.  J2 data (squares) shows a tipping point at which the more distantly related records lean toward the Near East and the closely related records lean toward Western Europe.  That transition shows a TMRCA of about 3,900 ± 800 years.  The tipping point indicates a point in time where the Near East J2 haplotypes became more common in Western Europe, illustrating a migration. 

   Haplogroup G shows very similar results as J2. Haplogroups J2 and G have been associated with the Neolithic spread of agriculture from the Near East to Western Europe.  Both J2 and G present a consistent distribution from distant relationship (high variance) to closer relationship (low variance).  The trend lines for J2 and G represent migration events from the Near East to Western Europe.  The trend line for J1 represents no migration event.  These results are consistent with other published information.

   Haplogroup R1b does not exhibit either a migration or a non-migration pattern.  The haplotypes cluster in a fairly homogenous group.  There is a slight lean toward Western Europe and essentially no continuum from high variance to low variance.  The more distantly related haplotypes don’t exist in the Near East.  The Near East individuals are just as related to the Western European individuals as they are to their own peers.  The approximate TMRCA for the R1b Near East – Western European group is 1,800 ± 500 years.

   Through atrocities and assimilation, Western European DNA from Crusaders was permanently introduced into the Near East less than 1,000 years ago.  Western European and Near East R1b haplotypes are highly and recently related.  The data indicates that within the last 2,000 years there was a migration from one geography to the other.  There is no documented migration in the past 2,000 years that would account for Western European R1b populations coming from the Near East and replacing indigenous European populations.  The introduction of Western European DNA into the Near East by Crusaders accounts for the west to east genetic flow.

   The sampling practices of research studies are questionable.  The origin of participants is typically only validated for one or two previous generations.  This is equivalent to not knowing the origin for study participants.  Sampling needs to be undertaken with a genetic genealogy approach and 37 markers or greater.  The population genetics approach of fewer than 17 markers, poor origin validation and haplogroup generalization needs to change.

   Previous papers (Balaresque & Myres) that have used Near East R1b data as the basis of their research are suspect.  In light of the introduction of Crusader DNA into the Near East within the past 1,000 years, any theory on a Neolithic origin for haplogroup R1b will have to be re-evaluated.


Maglio, MR (2014) Y-Chromosomal Haplogroup R1b Diversity in Near East is Structured by Recent Historical Events (Link)

© Michael R. Maglio

Friday, December 5, 2014

DNA Convergence and Chicken Little

   For me, the topic of convergence in yDNA first came up early in 2014.  I had just posted a paper and one of the comments was – “What about convergence?”  I said to myself, “What convergence?”  I admit I had to look up the topic.

Convergence: A term used in genetic genealogy to describe the process whereby two different haplotypes mutate over time to become identical or near identical resulting in an accidental or coincidental match. - Turner A & Smolenyak M 2004.

My response back to the comment was - “All of the haplotypes in my paper are unique.”  My data did not exhibit convergence. 
Convergence casts a shadow on genetic genealogy
   I started to poke around on the topic of convergence within yDNA STR haplotypes and the immediate impression that I got was that folks were ready to give up on STRs in favor of SNPs and the sky was falling.  Chicken Little was running around in the genetic genealogy circles.  Here is a small sample:

“Y-STRs are effectively dead” - Dienekes Pontikos, 2011

“Convergence of Y chromosome STR haplotypes from different SNP haplogroups compromises accuracy of haplogroup prediction” – Wang, et al, 2013

   Okay, convergence happens, but it’s an illusion.

   Let’s take a big step backwards in this story.  Did you know that most scientific papers relating to genetic genealogy use 17 STR markers or fewer?  Some use as few as 9 or 10.  For any of you who ever took one of the original 12 STR marker tests, you know that the results were essentially useless for anything except deep haplogroup association and history.

   Many researchers in the last couple of years are using the AmpFLSTR® Yfiler® to get their 17 marker results.  This equipment is approved for forensic cases.  Research papers are not forensic cases and researchers don’t need to limit themselves to 17 markers.  Thirty-seven marker yDNA tests have been available since 2004.

   Why does the number of STR markers matter?  I’m going to release my inner math geek to help explain.  If we look at marker DYS19, usually listed first in science papers and third in Family Tree DNA results, it can have a value within the range of 7 to 22 across all haplogroups.  Looking at R1b specifically, DYS19 ranges from 10 to 17 and statistically at two standard deviations (2 sigma) the range of values narrows to 13, 14 and 15.  From a probability point of view, there is a 1 in 3 chance that DYS19 will be any particular one of 13, 14 or 15.  Making the odds even better in our favor, 95% of the time DYS19 for R1b will already be 13, 14 or 15.  Since the marker already holds one of those three values, only two others remain, so there is a 1 in 2 chance that DYS19 could change to the specific value needed on its way to converging with another haplotype.

   Taking standard deviation into account to determine the possible number of values for the STR markers and then multiplying each probability gives the odds that a haplotype could converge.

(Table not reproduced: # of possible marker values for each STR marker.)


   There is a 1 in 4096 chance that two R1b 12 marker haplotypes could converge.  This is not the probability that one marker will change.  This is the probability that all 12 markers will change enough to match another haplotype.  These are very good odds and the reason why a 12-marker test is practically useless. 
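The multiplication behind that figure can be sketched as below. The per-marker counts from the original table are not available here, so this is an assumption: the 1 in 4096 result is consistent with two alternative values for each of 12 markers (2 multiplied by itself 12 times). The function name `convergence_odds` is hypothetical.

```python
from math import prod

def convergence_odds(possible_values):
    """Multiply the number of candidate values per marker to get '1 in N' odds."""
    if any(n < 1 for n in possible_values):
        raise ValueError("each marker needs at least one possible value")
    return prod(possible_values)

# Assumption: two alternative values for each of the 12 markers.
twelve_markers = [2] * 12
odds = convergence_odds(twelve_markers)
print(f"1 in {odds}")
```

Because the result is a product, every added marker multiplies the odds rather than adding to them, which is why longer haplotypes become so much harder to converge.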

   With a high probability that 12 STR markers will converge, haplotypes start to blend together.  Two different haplogroups or family lines will appear to be the same.  Converging also means that when we calculate the time to the most recent common ancestor (TMRCA), it will look like less time has passed.  Convergence makes a 12-marker test result unusable for genealogical matching, haplogroup prediction and TMRCA calculations.  The Chicken Littles are correct, we have a problem with 12 marker STR results.

   What about 17 markers, a quasi-industry standard for science papers?  Taking the same approach with statistics and probability, a 17-marker yDNA R1b result has a 1 in 2 million chance of converging with another haplotype.  Each haplogroup has slightly different odds.  There is a 1 in 500,000 chance of an R1a 17 marker haplotype converging.  Those odds are better than any lottery.  Convergence is still a problem at 17 markers.

   When Dienekes Pontikos proclaimed the death of yDNA STRs, he was commenting on the attempt to get good TMRCA dates from 10-marker results.  I agree, you can’t get valid TMRCA dates from 10-markers.  When Wang, et al, determined that convergence compromises haplogroup prediction, they were correct, 17 marker haplotypes can converge to make one haplogroup look like another.

   In a quick analysis of 4,300 unique 37-marker R1b haplotypes, the average genetic distance is 17 steps for 37 markers.  That means there are 17 mutations required for convergence in a 37-marker haplotype.  Nearly half of the markers in the haplotypes would need to change.  When we look at the probability of 25-marker haplotype convergence, the chances are 1 in 84 million.  Considering there are about 3.6 billion men on the planet, one in 84 million is still in the realm of possibility.  By the time we get to 37-markers, the odds are 1 in 49 trillion.

   There is a 1 in 49 trillion chance that all the necessary mutations will occur in order for two 37-marker haplotypes to converge.  The odds against convergence are likely even longer.  I’ve only looked at the probable values for each marker and I haven’t taken into account the STR mutation rates, the possibility that a marker will change over time. 

   There is essentially no such thing as convergence when 37 or more markers are tested and researched.  If you eliminate the possibility of convergence by using 37 STR markers, then immediately TMRCA calculations become more accurate and haplotypes from different haplogroups no longer resemble each other.  The reports of the death of yDNA STR results have been greatly exaggerated.

   I can’t tell you why researchers are currently stuck on 17 markers.  I can tell you that any research using fewer than 37 markers runs the risk of convergence in its data, which in turn could lead to the wrong conclusions.  I still consider genetic genealogy to be in its infancy.  Every month new research papers are published and the new concepts introduced are latched onto immediately.  It is understandable that papers from over a decade ago used a dozen STRs and a handful of SNPs; that was the height of technology.  If the latest technology and best data are not being used in today’s research papers, is that equivalent to scientific negligence?  Or, am I missing something and this is a case of scientific ignorance on my part?

Tuesday, December 2, 2014


(Originally published May 2014 in Going In-Depth)

   Oh my gosh, there are many acronyms in genetic genealogy.  You have to agree that using the acronym DNA is better than writing deoxyribonucleic acid repeatedly.  Although, when we talk about using DNA for genealogy and we only use acronyms, they start to lose their meaning and become just another ‘thing’.  “Hey, I’ve got a SNP.  Do you have a SNP?”  “I dunno, let me check.”  Maybe I’m weird.  I like to understand what all the acronyms mean and how they play a part in the larger picture.

   Let’s start with some DNA basics.  We have DNA in every cell except the red blood cells.  Inside the nucleus of our cells, we have 46 chromosomes or 23 pairs (nuclear DNA).  One set of 23 comes from dad and one set comes from mom.  If we took the tightly coiled DNA from one cell and stretched it out it would be about six feet long.  In that six-foot double helix from one cell, there are over 3 billion base pairs.  If you picture our double helix DNA as a twisted ladder, each rung is a base pair made from two of the four nucleotides (DNA building blocks).  Each rung is either an adenine-thymine pair or a cytosine-guanine pair.

   When we talk about DNA, we often also talk about mitochondrial DNA.  Mitochondria exist outside of the nucleus as an energy source for the cell and have their own independent DNA.  Mitochondrial DNA has just over 16,000 base pairs in comparison to the 3 billion base pairs in our nuclear DNA.  We inherit our mitochondrial DNA only from our mothers.

   DNA is divided into coding regions (genes that define proteins for such things as eye color) and non-coding regions (sometimes called junk DNA).  The coding region that defines us is less than 2% of our overall DNA and within that, there are less than 25,000 genes.  A gene is a sequence of nucleotides averaging about 23,000 base pairs.  One of the largest genes, which encodes for the Caspr2 protein, has over 2.3 million base pairs.

   Within the 3 billion base pairs of our DNA there are variations (normally occurring mutations), where one base pair has been replaced with another base pair.  As an example, it was adenine (A) and now it’s guanine (G).  This is a single nucleotide polymorphism or SNP (pronounced snip).  There are over 15 million SNPs in our DNA.  Once a SNP occurs, it is usually permanent in the population.  The farther back in time that the SNP occurred, the more people will have that particular mutation.  To be considered a SNP, it has to exist in greater than 1% of the population.  They are found in both the coding and non-coding regions of our DNA.  In the coding regions, SNPs are often markers for genes.

   Let’s divide our DNA into four groups.  Group one, the autosomes, are the first 22 pairs of chromosomes.  The next two groups, the sex chromosomes, are one X and one Y if you are male and two Xs if you are female.  That gives us yDNA and xDNA.  The last DNA group is mitochondrial.  All types of DNA have SNPs.  Autosomal SNPs are used for health and ethnicity.  Mitochondrial and Y-DNA SNPs are used to determine world haplogroups.  While there are 1,000s of X SNPs, there doesn’t seem to be much research around them.

   SNPs have no effect on health, but their presence may predict a health risk.  If you had an autosomal test from 23andMe (prior to the FDA ruling), they would have delivered health information with your results.  They were able to report SNPs in the coding region associated with gene combinations responsible for health risks, like cancer or Alzheimer’s or basic information, like eye and hair color.  Even though you cannot get health information from 23andMe currently, you can still use your autosomal results with Promethease to research your health risks.

   Combinations of SNPs are analyzed to determine ancestry-informative markers (AIM – another new acronym for you).  AIMs are used to estimate the ethnicity or at least the geographic origins of your ancestors.  When you receive ethnicity results from an autosomal test, it will be based on the AIMs that the test company is using.  They don’t all use the same markers, so results will vary.  There are even 42 SNPs associated with having Neandertal ancestry.

   SNPs are used to organize us into larger branches of the human family tree (haplogroups).  Our maternal family tree is organized into 26 branches (A through Z) using mitochondrial DNA.  Our paternal tree is similarly organized into 20 branches (A through T) using yDNA SNPs.   As an example, take four men (I use men because the scenario works for both mitochondrial DNA and yDNA), Abe, Bob, Chaz and Dave.  Test each of them for three SNPs, X, Y and Z.  You find that they all test positive for SNP Z, Abe and Chaz test positive for X and Bob and Dave test positive for Y.  You can start to see the branches and the beginning of a tree.
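The Abe, Bob, Chaz and Dave example can be written out as a small sketch. The function name `group_by_snp` is hypothetical; the test results are exactly those given in the example above.

```python
from collections import defaultdict

# Positive SNP results for each man, as in the example above.
results = {
    "Abe": {"Z", "X"},
    "Bob": {"Z", "Y"},
    "Chaz": {"Z", "X"},
    "Dave": {"Z", "Y"},
}

def group_by_snp(results):
    """Map each SNP to the set of people who tested positive for it."""
    branches = defaultdict(set)
    for person, snps in results.items():
        for snp in snps:
            branches[snp].add(person)
    return dict(branches)

branches = group_by_snp(results)
for snp in sorted(branches, key=lambda s: -len(branches[s])):
    print(snp, sorted(branches[snp]))
```

The SNP shared by everyone (Z) sits at the trunk, and the SNPs shared by subsets (X and Y) form the two branches beneath it.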

   The first yDNA and mtDNA trees were built using only a few dozen SNPs.  Today, the paternal and maternal haplogroup trees are much more detailed, based on thousands of SNPs.  Complete SNP testing has been available for mitochondrial DNA for a number of years.  Starting last year, complete SNP testing is available for yDNA from companies like FamilyTreeDNA with their Big Y test.  Previously yDNA SNP tests were designed to look for specific SNPs.  With advances in technology, they can now look for all the SNPs across over 12 million yDNA base pairs.

   Just to add another acronym to the pile, there are also STRs or short tandem repeats (aka microsatellites).  STRs are short sequences of base pairs that repeat.  These repeats are found in autosomal, y and x DNA.  You may have heard the term CODIS if you watch Crime/Drama shows on television.  CODIS is the FBI’s Combined DNA Index System (more acronyms).  When DNA is collected for CODIS, they typically test for 13 STR markers across the autosomes.  When you have a yDNA STR test done, genetic genealogy companies test for up to 111 markers only on the Y chromosome.  They will also perform a basic SNP test to identify your paternal haplogroup.  SNPs and STRs are different in that SNPs appear to be permanent changes in our DNA and STRs are variable.  STRs are identified by location on the chromosome and by the number of times that the repeat occurs.  The number of repeats per STR can change over time, sometimes increasing, sometimes decreasing in number or increasing then decreasing again (known as a back mutation).  The combined set of STR markers is your haplotype and may be unique to your surname or span multiple surnames.  With the advances in yDNA SNP testing, SNPs will be found that are unique to your surname, which could make STR testing obsolete.
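To make "the number of times that the repeat occurs" concrete, here is a minimal sketch that counts the longest run of back-to-back copies of a motif in a sequence. The sequence and the GATA motif are invented for illustration, the function name `max_tandem_repeats` is hypothetical, and real STR calling from lab data is considerably more involved.

```python
import re

def max_tandem_repeats(sequence, motif):
    """Return the longest run of consecutive copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)

# Hypothetical sequence: four copies of GATA flanked by other bases.
sequence = "TT" + "GATA" * 4 + "CC"
repeats = max_tandem_repeats(sequence, "GATA")
print(repeats)
```

The repeat count is the value reported for a marker, so a change from four copies to five at the same location is the kind of mutation that separates one haplotype from another.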

   We all have DNA: 23 pairs of chromosomes in our cell nuclei, one of each pair from mom and one from dad.  We also have mitochondrial DNA from our moms.  Less than 2% of our DNA is in the form of genes, which define who we are.  SNPs can be used to identify our “good” and “bad” genes.  SNPs can also help identify our ethnicity and build our paternal and maternal family trees.  STRs can organize us down to the paternal surname level.  When folks start talking DNA, don’t be afraid to ask them, “What kind of DNA?”, “What does that SNP indicate?” or “What type of STR is being tested?”  We’ll never get away from using acronyms to simplify how we communicate genetic genealogy.  That doesn’t mean we need to let the acronyms simplify the meanings to a point where the science is lost.  Every little bit of knowledge adds to our understanding of ourselves.

© Michael Maglio

Wednesday, September 24, 2014

DNA Mysteries: Iberian R1b-V88 in Africa

   When I first heard about R1b in Africa, my immediate assumption was that the predominantly Celtic haplogroup must have been a recent transplant.  I ran some of the V88 haplotypes against the big databases (FTDNA & ySearch) expecting to see matches to European men within the African colonial timeframe.  It wasn’t that easy.  Common ancestor analysis put the R1b Africans (V88) thousands of years removed from the rest of their European R1b cousins.  Where did they come from?  How did they get there?

   I started with the given that the R1b defining mutations (SNPs) occurred in the Iberian Peninsula.  The jury is still out on this hypothesis.  There have been scientific papers for and against Iberian origins of R1b.  My own work (Iberian Origins of R1b) supports an origin prior to the Neolithic expansion.  Could V88 have made a straight-line migration from Iberia to the Lake Chad region of Africa?  Could V88 have crossed the Straits of Gibraltar, travelled across the Sahara, which 7,000 years ago was a savannah well populated with animals for hunting, and arrived at Lake Mega-Chad?  That was my early premise.  I was wrong.

   The distribution of V88 is much larger than any of the scientific papers would indicate.  While I agree with the work that’s been done correlating the spread of V88 with the spread of Chadic languages (Cruciani et al 2010), the Chadic population is only a subset.  Nobody takes into consideration the V88 populations in Europe and the Middle East.  If they do, it is a sideways glance to say we’re ignoring them because they don’t fit into what we are trying to prove.  If you don’t look at the entire picture, your conclusions will be skewed.

   I wanted the largest selection of V88 Y-DNA records with at least 37 markers tested.  I started with Family Tree DNA projects that had the records SNP tested.  Those haplotypes were run against the ySearch database to identify highly related records with no SNP testing.  The initial gathering of records picked up individuals with SNP M73.  These were removed.  The key differentiator between V88 and M73 was DYS464a&b.  V88 was typically 12,12 and M73 was 15,15.  Thirty-seven or more STR markers are helpful in identifying additional related haplotypes and even more necessary in determining the relationship between records.  Most studies only look at SNPs or a small handful of STR markers.  This is shortsighted.  Imagine a reference population of 100 records all with the same SNP.  Without enough STR markers you can’t tell whether you are looking at one haplotype with minor 1 or 2 step variations or 100 unique haplotypes.  That’s the difference between a founder event starting with as few as one individual and a group with greater diversity and age.
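The DYS464 screen described above can be sketched in a few lines.  The records below are invented, and the record layout and the function name `likely_group` are assumptions for illustration only; the 12,12 and 15,15 values are the typical ones quoted above.

```python
from collections import Counter

def likely_group(record):
    """Guess V88 vs M73 from DYS464a/b using the typical values quoted above."""
    pair = (record["DYS464a"], record["DYS464b"])
    if pair == (12, 12):
        return "V88"
    if pair == (15, 15):
        return "M73"
    return "unclear"  # anything else needs SNP testing or more markers

# Invented records for illustration only.
records = [
    {"id": "r1", "DYS464a": 12, "DYS464b": 12},
    {"id": "r2", "DYS464a": 15, "DYS464b": 15},
    {"id": "r3", "DYS464a": 12, "DYS464b": 12},
    {"id": "r4", "DYS464a": 12, "DYS464b": 14},
]

counts = Counter(likely_group(r) for r in records)
v88_ids = [r["id"] for r in records if likely_group(r) == "V88"]
print(dict(counts))
print(v88_ids)
```

A screen like this only flags candidates.  Records that land in the "unclear" bucket are exactly where the additional STR markers earn their keep.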

   Each record in my final set of 119 has at least 37 STR markers, is either V88 SNP tested or highly related via STR, and has the geographic location of the most distant known ancestor.  The records are processed through PHYLIP to generate a phylogenetic tree.  The phylogenetic tree gives a visual depiction of the relationships in the dataset and an approximate number of years back to common ancestors, represented as the nodes between the records.

All of this is very standard genetic genealogy.  I add a twist (Biogeographical Multilateration) by converting the years back to a common ancestor to a distance using Cavalli-Sforza’s migration rate of 1 to 1.2 km per year.  This is enough for me to solve a series of cascading equations giving me the locations of the common ancestors.  Looking back at the phylogenetic tree shows us how all the nodes and locations are connected, essentially the flow of migration.
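The years-to-distance step is simple arithmetic, sketched here with the 1 to 1.2 km per year range quoted above.  The 5,000-year input is a made-up figure, and this covers only the conversion, not the cascading equations that place the ancestors on a map.

```python
MIGRATION_KM_PER_YEAR = (1.0, 1.2)  # low and high rates quoted above

def years_to_distance_km(years, rates=MIGRATION_KM_PER_YEAR):
    """Convert years back to a common ancestor into a (low, high) distance range in km."""
    low_rate, high_rate = rates
    return years * low_rate, years * high_rate

# Hypothetical common ancestor 5,000 years back.
low, high = years_to_distance_km(5000)
print(f"{low:,.0f} to {high:,.0f} km")
```

Each pair of records then contributes a distance range from its common ancestor, much like a radius drawn around a known point.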

   The out of Iberia event took place about 7,700 ± 1,600 years ago.  TMRCA calculations have been shown to be very inconsistent.  Some folks use a constant mutation rate and some use rates per marker.  I include a TMRCA to give a relative chronology.  While the majority of R1b is known for its Western Atlantic migrations, V88 took a path along the Mediterranean coast and down the Adriatic.  While none of the V88 records indicated Crete as an ancestral location, it appears multiple times as a common ancestor location.  The data shows Crete as a stepping-stone in the Mediterranean as V88 migrated to the Nile River Valley.  The back to Africa event(s) occurred roughly 5,500 ± 1,000 years ago.

The majority of the Chadic records (Cameroon, Chad and Nigeria) have relatively close genetic connections to individuals in the Middle East (mainly Saudi Arabia).  The Chadic and Middle Eastern records tie back to common ancestors along the upper Nile.  There is a significant lack of information to understand what impact R1b-V88 had on the Nile Valley cultures.  Considering that there was only 1 out of 119 records with an exact Nile River location, I would venture a guess that V88 didn’t integrate well.

   While the V88 back to Africa migration has captured much attention, the data shows a more fascinating event.  There was a V88 re-migration back to Europe from Africa.   The back to Europe event took place about 3,200 ± 1,000 years ago.  Again, Crete played a role as a stepping-stone as V88 entered the Eastern Adriatic region and spread into Central and Eastern Europe.  Someone will probably notice that many of the V88 in Eastern Europe are Jewish and that the date for leaving the Nile region is close to the time of Exodus.  There is nothing in any of the data to indicate that this was the Jewish Exodus from Egypt.  The V88 group in Eastern Europe is closely related and there is phylogenetic evidence to support that this may have been a founder event with a single male or small group of closely related males.  There is no evidence to support that those founders were Jewish when they left Africa.

   By looking at the big picture, including all the data and letting the data illustrate the patterns, we can unravel what appears to be the mysterious appearance of R1b in Central Africa.  Along the way, we can uncover a previously unknown re-migration from Africa to Europe.  Too often haplogroup data is treated as discrete buckets of information living in a vacuum with no interaction with other haplogroups and no internal relationships.  Every DNA record is connected to every other record in a network.  Each haplotype is a vector with location and direction.  The sooner we treat genetic records as a network, the sooner we will solve more DNA mysteries.

Out of Iberia and back to Africa.  Followed by a return to Europe.


Maglio, MR (2014)  Y Chromosome Haplogroup R1b-V88: Biogeographical Evidence for an Iberian Origin (Link)

Tuesday, August 12, 2014

Iberian R1b Y-DNA: First Movers in Europe

   The origins of haplogroup R1b, most commonly thought of as Celtic, remain disputed, with opinion split between Iberia prior to the end of the last ice age and various West Asian locations after the ice age.  A new view on the R1b homeland comes out every year.  With all we know about DNA, shouldn’t we be coming to a consensus?  Typically, I refer to R1b as Celtic to help an audience make the connection between lettered haplogroups and culture or ethnicity.  I also add the caveat that Celtic is a misleading label.  R1b is a supergroup of cultures including Iberian, Gallic, Celtic, Germanic and Scandinavian.  To attribute empires or nationalities to R1b would be foolish, as R1b is tens of thousands of years older than any known empire.

   Perhaps I’m naïve.  I like simple, logical answers.  The earliest publications on R1b described their ancestor, R1, entering Europe from central Asia during a warm period about 30,000 – 40,000 years ago.  The last ice age forced R1 to split and take refuge south in Iberia and the Balkans.  Time and separation gave us the mutations R1b in Iberia and R1a in the Balkans.  That split is roughly what we see today in those regions.  That’s clean and simple.  The real world is much more complex.  R1b and R1a were not alone in Europe.  Their interactions with the other major European haplogroups (E, G, I, J and N) have to be taken into consideration.  We can’t analyze R1b as if it were in a vacuum.

   Let’s take y-DNA haplogroups out of the picture for a moment.  We know that modern humans survived and flourished in the Iberian refuge during the end of the last ice age, based on mitochondrial DNA studies.  [Could someone please run some y-DNA tests on those samples?]  The tribes in western Europe, whoever they were, had a 1,000 to 2,500 year head start over the tribes in central and eastern Europe on repopulating the continent.  The ice sheets melted and retreated earlier on the west coast than in the rest of Europe.  This gave the inhabitants of the Iberian refuge an advantage – a “first-mover” advantage gained by being the first to move north.  These first-movers gained a land-monopoly.  A tribe with a first-mover advantage and a head start of over 1,000 years should have been hard to displace from western Europe.  In other anthropological situations, those original inhabitants are forced into niche locations by invading populations, but very rarely are displaced completely.  What we see on the west coast of Europe is a very strong R1b presence and no niche haplogroups of a significant age.  From this point of view, either R1b is the original Iberian inhabitant or R1b completely decimated another earlier haplogroup that had a 1,000 year geographical head start.  I like simple.  R1b was in Iberia first.

   Let’s throw some data at the problem.  The R1b haplogroup population is enormous.  The majority fall into SNPs R-P312 (Celto-Iberian) and R-U106 (Celto-Germanic).  There is so much information there that it tends to be noise.  If you want to get to the root of R1b (R-M343), you need to work with the branches that are closest to the root - R-L278*, R-V88, R-M73*, R-YSC0000072/PF6426 and R-L23.

• • R1b   M343
• • • R1b1   L278
• • • • R1b1a   P297
• • • • • R1b1a1   M73
• • • • • R1b1a2   M269
• • • • • • R1b1a2a   L23
• • • • R1b1c   V88
[• • • • • • • • • R1b1a2a1a1   U106 - too far downstream]
[• • • • • • • • • R1b1a2a1a2   P312 - too far downstream]

   I collected 250 records that matched these SNPs or were genetically close by STR haplotype.  These records were mapped based on user-reported most distant ancestor location.

   This is not a connect-the-dots exercise.  Just because two or more records appear geographically close doesn’t mean that they are genetically close.  These 250 records have to be treated like a network.  If this were Facebook, these folks would be randomly associated through family, business, school or neighbor connections.  These are y-DNA records.  There is a relationship between every pair.  Each pair has a different common ancestor, with a different number of generations to get back to that ancestor.  Here is an example of what that relationship looks like across multiple pairs.  The number represents years back to a common ancestor (TMRCA).
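To give a feel for what "a relationship between every pair" means in practice, here is a small sketch that compares every pair in a toy set of records.  The haplotypes are invented, and I use a simple step count as the pairwise measure; turning those steps into years (TMRCA) needs a mutation-rate model that is not shown here.

```python
from itertools import combinations

def genetic_distance(a, b):
    """Sum of absolute repeat-count differences across aligned STR markers."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Invented haplotypes: each tuple holds repeat counts for the same four markers.
haplotypes = {
    "rec_A": (13, 24, 14, 11),
    "rec_B": (13, 25, 14, 10),
    "rec_C": (12, 22, 15, 10),
}

# Every unordered pair of records gets its own distance.
pairwise = {
    (p, q): genetic_distance(haplotypes[p], haplotypes[q])
    for p, q in combinations(sorted(haplotypes), 2)
}
for (p, q), steps in pairwise.items():
    print(f"{p} - {q}: {steps} steps")

def pair_count(n):
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2

print(pair_count(250))  # pairs among 250 records
```

Three records give three pairs, but 250 records give 31,125, which is why the set has to be handled as a network and not by eye.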

When all of the interrelations are taken into consideration, the group of records can be displayed as a relationship tree of who is older or younger and who is more closely related to whom (phylogenetic tree).

   Now we have who, where, when and how the records are connected.  At this point it does become a connect the dots exercise.  I’ve used a biogeographical analysis to connect very specific sets of dots based on the calculated interrelation of the entire group.

   The R1b genetic family tree has a trunk and many branches.   The trunk of the R1b data is firmly rooted in Iberia.  The main core of the tree stretches along the western Atlantic coast of Europe and branches across Europe and even back into Asia.  The results that I found support the work of the earliest pioneers in the field and conflict with the latest publications.

   Every analysis has its limitations.  The work that I’ve done looks back at the R1b family about 8,000 years.  The scarcity of data only allowed me to predict the origin of R-L278, which is currently one branch below the main root of R-M343.  I can’t yet tell where R1b was between the time R1 split into R1b and R1a and the start of that 8,000-year window.

   In my analysis, I have included R-V88.  They are a curious group of R1b found in Africa and the Middle East.  I will be treating R-V88 in a separate write-up to do justice to a very interesting back migration story.  The R-V88 article can be found here.


Maglio, MR (2014)  Biogeographical Evidence for the Iberian Origins of R1b-L278 via Haplotype Aggregation (Link)