Phase III Analysis: 37 STRs from YSearch 23 August 2005
John McEwan
(First posted 3rd Sept 2005; updated 11th, 15th, 18th, 19th Sept, 4th Feb, 11th March 2006)
Objective
The objective of this analysis is to:
· Fully investigate R1b haplogroup STR clusters present in the Ysearch database
· Provide sufficient data to allow corrected geographical density plots derived from surnames
· Allow others to quickly identify distantly related haplotypes and the associated subcluster for their haplotype
Disclaimer
These analyses simply provide me with working
hypotheses. I make no claim for their accuracy or the robustness of their
topologies, nor the use others make of them. Specifically, R1b has been broken
down into many sub-clusters, but most of these are entirely speculative with
the exception of R1bSTR19 (Irish), R1bSTR22 (Frisian/Germanic), and R1bSTR47
(Scots), where independent analysis supports the broad classification. As a
broad rule of thumb, if the stem of a cluster is shorter than 0.1 then some doubt
exists, and if it is shorter than 0.04 considerable doubt exists.
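The rule of thumb above can be written down as a small check. This is only a sketch: the function name and the wording of the third outcome are my own assumptions, and the thresholds are the 0.1 and 0.04 stem lengths quoted above.

```python
def stem_doubt(stem_length):
    """Apply the rule of thumb to the stem (branch) length of a cluster."""
    if stem_length < 0.04:
        return "considerable doubt"
    if stem_length < 0.1:
        return "some doubt"
    # The text gives no label for longer stems; this wording is an assumption.
    return "no doubt flagged by this rule"


# Illustrative stem lengths, not values taken from the phylograms.
for stem in (0.02, 0.07, 0.15):
    print(stem, stem_doubt(stem))
```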
Data and analysis
The haplotypes in Ysearch were downloaded on the 23rd August 2005. In total there were 15,358
individuals. The data were then edited in Excel to remove: records without a full set of 37 FTDNA
marker results; obvious haplotypes that did not represent individuals (e.g. with “modal” or
“researcher” in the name); and one record, ZZHNZ, that appears to be an obvious input error.
Locus 389ii was recoded by subtracting the value of 389i from it, so that it is in a more
appropriate form for analysis. The resulting data set consisted of 3996 records. This data set
was then manually coded by given country of origin and region of origin. Those records from the
Americas were classified as “other” for region of origin. Names were concatenated with their
Ysearch listed haplogroup and their unique ID to ease future comparison
(e.g. Courtney_R1b_E3HBX). In some cases names were slightly edited.
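For those who want to repeat the 389ii recoding and the label construction outside Excel, a minimal sketch follows. The function names are hypothetical and the marker values (13 and 29) are illustrative only; the label format follows the Courtney_R1b_E3HBX example above.

```python
def recode_389ii(dys389i, dys389ii):
    """Return 389ii with the 389i repeat count subtracted from it."""
    return dys389ii - dys389i


def make_label(name, haplogroup, unique_id):
    """Concatenate name, Ysearch listed haplogroup and unique ID."""
    return f"{name}_{haplogroup}_{unique_id}"


# Illustrative record: the marker values are assumptions, not Ysearch data.
record = {"name": "Courtney", "haplogroup": "R1b", "id": "E3HBX",
          "389i": 13, "389ii": 29}

record["389ii"] = recode_389ii(record["389i"], record["389ii"])
label = make_label(record["name"], record["haplogroup"], record["id"])
print(label, record["389ii"])
```

Recoding in this way stops a single change in 389i from being counted a second time in 389ii.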
The multivariate distance matrix Nei’s Na was calculated using Populations and saved in Phylip format. The first line of this file was then
edited from <tab>n<tab>n<CR>
to n<CR>, and the file was read into Quicktree and clustered using Nei’s
neighbor-joining algorithm. Those interested in more details are referred to
the websites and scientific papers associated with these programs. An extremely
useful reference for those wishing to delve deeply is Takezaki and Nei 1996 Genetics 144:389-399.
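The first-line edit described above is easy to script. The sketch below assumes the whole matrix file has been read into a string; the function name and the tiny three-record matrix are made up for illustration, and the layout of the matrix rows is an assumption.

```python
def fix_phylip_header(text):
    """Rewrite a first line of the form <tab>n<tab>n as just n."""
    lines = text.splitlines()
    fields = lines[0].split()      # "\t3\t3" -> ["3", "3"]
    lines[0] = fields[0]           # keep a single record count
    return "\n".join(lines) + "\n"


# Hypothetical distance matrix as written out, with the doubled count.
raw = "\t3\t3\nA\t0.0\t0.1\t0.2\nB\t0.1\t0.0\t0.3\nC\t0.2\t0.3\t0.0\n"
fixed = fix_phylip_header(raw)
print(fixed)
```

Only the header line is touched; the distance rows are passed through unchanged.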
The resultant Newick format file was then:
1. Processed through a script called Unroll to extract unique IDs and their tree order. This allows the
source Excel file to be sorted in cluster order.
2. Edited manually in PFE, using the match braces function to select or delete sections of the tree file,
and saved as 6 separate files. Phylograms were created from the resultant files using TreeView and
saved as EMF files, which were then edited in PowerPoint before printing as PDF files using PrimoPDF.
The edits included identification of likely boundaries between haplogroups and in some cases subclades.
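The Unroll script itself is not reproduced here. As a rough idea of what step 1 involves, the sketch below pulls leaf labels out of a Newick string in the order they appear and maps each unique ID to its tree order. It assumes labels have the Name_Haplogroup_ID form with no spaces or quotes; the function names and the two "Example" records are hypothetical.

```python
import re


def leaf_labels(newick):
    """Return leaf labels in the order they appear in a Newick string."""
    # A leaf label starts straight after "(" or "," and runs up to the
    # next ":", ",", ")" or ";". Branch lengths follow ":" and are skipped.
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


def tree_order_by_id(newick):
    """Map each unique ID (the text after the last "_") to its tree order."""
    order = {}
    for position, label in enumerate(leaf_labels(newick), start=1):
        unique_id = label.rsplit("_", 1)[-1]
        order[unique_id] = position
    return order


sample = ("((Courtney_R1b_E3HBX:0.01,\nExampleA_R1b_AAAAA:0.02):0.05,"
          "ExampleB_I_BBBBB:0.10);")
print(tree_order_by_id(sample))
```

The resulting ID-to-order table can be pasted next to the source spreadsheet and used as a sort key.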
The links to the phylograms are shown below.
They are PDF files, so you will need a PDF viewer. Note that you can search within
the PDF viewer for a surname or unique ID within the phylogram.
Results to date
Phylograms
The phylograms show clear and obvious structure and in general the
analysis has accurately resolved high level haplogroups as per the YCC 2005
tree. As expected some of the deepest clades with few members are less well
resolved due to saturation of marker changes, high variability within the
haplogroup, and the distance measure used. Some user-provided haplogroup
assignments also appear to be incorrect. Several of these have been tested via
Whit Athey’s haplotype predictor and found to be wildly inconsistent with its predictions. It would
be interesting to investigate whether these individuals have been SNP tested.
PDF files of phylograms of 3996 Ysearch 37 STR haplotypes by Haplogroup
R1b group 1: tree order 1-659 (R1bSTR1-15, updated 15th Sept)
R1b group 2: tree order 660-1466 (R1bSTR16-27, includes Irish and Frisian clusters, updated 15th Sept)
E,J,Q,N,R1a (less I, G): tree order 1447-1717 & 2614-2889 (updated 18th Sept)
I,G: tree order 1718-2613 (updated 18th Sept)
R1b group 3: tree order 2890-3142 (R1bSTR28-35, updated 15th Sept)
R1b group 4: tree order 3143-3996 (R1bSTR36-49, includes Scots cluster, updated 15th Sept)
Viewing and searching the phylograms
· They are PDF files so you need a PDF viewer, which most browsers have by default
· Typically you will have to magnify the images to 800% or greater to make out the individual names and identifiers. Use the PDF viewer magnify button, normally located at the top of your browser screen
· Using the search button in the PDF viewer allows you to search for names and unique IDs. You have to search each PDF separately
· Typically, those individuals closest to you and with the shortest separating branches have closely related Y chromosome haplotypes.
· You can use the text select tool to copy a selection of individual identifiers. If you paste these into Excel you will obtain a list of identifiers which you can edit and use to extract raw haplotypes from Ysearch for your own detailed analysis in applications such as the Y-DNA comparison utility http://www.mymcgee.com/tools/yutility.html
· The shape and structure of the branches provide an indication of the evolution of the subcluster.
· Detailed queries about these results are best sent to the DNA Genealogy list server.
Cluster Nomenclature
· Where possible, nomenclature from the YCC SNP based haplogroups has been used if a reasonable match can be found.
· Please note that individual haplotypes cannot be placed unambiguously into SNP based haplogroups: there is some level of error, and this varies depending on the distinctness and age of the clusters and on the haplotype concerned. Ken Nordtvedt and Bonnie Schrack helped annotate the I and J haplogroups respectively, and their help is gratefully appreciated. Whit Athey’s predictor was used to independently identify and find the edges of various haplogroups. R1a has not yet been clustered, although distinct clusters are observed.
· If a cluster has a “common name” that is widely used in genealogy circles then this has also been provided. Please note this common name is often geographic, but it only means that haplotypes belonging to the cluster are more frequent in that area, not that the cluster is diagnostic of, or solely located in, that region.
· Where there is an apparent cluster, but little or no work has been undertaken defining the cluster, I have given it a placeholder identification until it can be further defined. This is a two-part name: a YCC haplogroup description followed by an STR cluster number within that haplogroup, e.g. I1aSTR10.
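If you need to sort or group records by these placeholder names, they can be split back into their two parts. The sketch below is one way to do it; the pattern and the function name are my own assumptions about the naming form, checked only against names such as I1aSTR10 and R1bSTR19.

```python
import re

# Haplogroup description, the literal text "STR", then the cluster number.
PLACEHOLDER = re.compile(r"^(?P<haplogroup>[A-Z][A-Za-z0-9]*?)STR(?P<number>\d+)$")


def split_placeholder(name):
    """Return (haplogroup, cluster number), or None if the name does not fit."""
    match = PLACEHOLDER.match(name)
    if match is None:
        return None
    return match.group("haplogroup"), int(match.group("number"))


for name in ("I1aSTR10", "R1bSTR19", "R1b"):
    print(name, split_placeholder(name))
```

Converting the cluster number to an integer means R1bSTR2 sorts before R1bSTR19, which a plain text sort would not do.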
Comparison with Whit Athey’s haplotype predictor
Whit Athey has batch
calculated haplogroup predictions for all 3996 individuals using their 37
markers. When these results were compared to those obtained by clustering, a
very sharp and clear delineation between groups was obtained. This approach
also identified several groups, both known and unknown, that were outside the
haplotype predictor criteria. When these were excluded the concordance between
the two methods was extremely high. The method also identified a small number
of individuals whose user-supplied haplotype or haplogroup may be incorrect.
Further details are expanded in the link below.
Comparison with haplotype predictor methodology
Validation against SNP results
SNP genotypes have the useful property that for short time periods
(<100,000 years) they can essentially be considered unique event
polymorphisms. This provides a “black and white” test of whether two
individuals share a common ancestor after that mutation occurred. They are also
used to classify individuals into haplogroups. Unfortunately, most of the
haplogroup designations in Ysearch are not SNP based and there is no formal
source recording those SNP tests that have been done. If they were available, they
would provide an absolute independent check of the accuracy of the clustering
algorithm.
Modal locus values for haplogroup and sub-haplogroup clusters defined by the 3rd phase analysis
Tables of modal values, genetic distances and TMRCA estimates for all
the haplogroups and clusters are available from the link below. Please read the
notes before using the values, especially regarding the 389ii values.
Tables of modal values from phase 3 analysis 1.2Mb last updated 19th Sept
2005 (you may wish to save it locally after downloading it)
Raw data tabulation of user-provided haplogroup and region of origin
A pivot table of the raw data by user-provided haplogroup and region of
origin is provided below. The user-provided origins were grouped into
geographical regions. Please note that 40% are not ascribed to a haplogroup and
52% do not have an ancestral region of origin, so only 29% of the data can be
assigned a value.
Tabulation of user-supplied haplogroup by region of origin.
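The 29% figure follows from the two percentages quoted above if, as I assume here, missing haplogroups and missing regions occur independently of each other. A short check:

```python
missing_haplogroup = 0.40   # not ascribed to a haplogroup
missing_region = 0.52       # no ancestral region of origin

with_haplogroup = 1 - missing_haplogroup   # 60% have a haplogroup
with_region = 1 - missing_region           # 48% have a region

# Assumes the two kinds of missing entry are independent of each other.
usable = with_haplogroup * with_region
print(f"{usable:.1%} of records have both values")
```

The product is 28.8%, which rounds to the 29% quoted.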
Tabulations of estimated haplogroup and STR cluster by region of origin
These tabulations improve on those above because all individuals
are ascribed to a haplogroup or cluster based on STR homology, so 48% of the
data can be assigned a value. A variety of tabulations are provided, including
those corrected for sampling biases across regions and as a proportion of the
haplogroup. However, extremely poor sampling of some regions provides
insufficient numbers for some estimates. No error estimates are provided, nor
correction for multiple closely related samples, but ad hoc tests for
significance of the observed patterns can be calculated if required.
Estimating TMRCA and mutation rates for the phase 3 Y chromosome STR clusters via ASD estimates
The TMRCA and marker variability within each subcluster were investigated via
ASD estimates. Note that these estimates differ from those in the tables of modal values,
as they are calibrated differently and a different method is used.
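For readers unfamiliar with ASD (average squared distance), the sketch below shows one common form of the calculation: squared differences in repeat count are taken between each haplotype and the cluster modal haplotype, averaged over haplotypes and loci, and divided by a per-locus mutation rate to give generations. This is not necessarily the exact procedure or calibration used for the estimates here; the function names, the toy haplotypes and the mutation rate of 0.002 are all illustrative assumptions.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most common repeat count at each locus."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*haplotypes)]


def asd_to_modal(haplotypes):
    """Average squared difference from the modal, over haplotypes and loci."""
    modal = modal_haplotype(haplotypes)
    total = sum((allele - m) ** 2
                for haplotype in haplotypes
                for allele, m in zip(haplotype, modal))
    return total / (len(haplotypes) * len(modal))


def tmrca_generations(haplotypes, mutation_rate):
    """ASD divided by the per-locus, per-generation mutation rate."""
    return asd_to_modal(haplotypes) / mutation_rate


# Toy cluster of four haplotypes at three loci (not real data).
cluster = [[13, 24, 14], [13, 24, 14], [13, 25, 14], [12, 24, 14]]
print(modal_haplotype(cluster))
print(asd_to_modal(cluster))
print(tmrca_generations(cluster, mutation_rate=0.002))  # assumed rate
```

Using the modal as a stand-in for the ancestral haplotype is itself an approximation, and the result scales directly with whatever mutation rate is chosen.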
Still to be completed
Creation of summary phylograms showing the clusters and bootstrapping
the analysis to provide an indication of the robustness of the branch structure
joining the clusters. Also the geographic density of sub-clusters (corrected
for sampling bias) based on density derived from surname frequencies.
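As an indication of what the bootstrapping step would involve, the sketch below builds one bootstrap replicate by resampling loci with replacement. Each replicate would then be put through the same distance and neighbor-joining steps, and the proportion of replicates that recover a given branch taken as its support. The function name and the toy data are assumptions, and the tree-building step is not shown.

```python
import random


def bootstrap_loci(haplotypes, rng):
    """Resample loci (columns) with replacement, same columns for every record."""
    n_loci = len(haplotypes[0])
    picks = [rng.randrange(n_loci) for _ in range(n_loci)]
    return [[haplotype[i] for i in picks] for haplotype in haplotypes]


# Toy data: three haplotypes at four loci (not real data).
data = [[13, 24, 14, 11], [13, 25, 14, 11], [12, 24, 15, 10]]
rng = random.Random(2005)          # fixed seed so the example is repeatable
replicate = bootstrap_loci(data, rng)
print(replicate)
```

Resampling whole loci keeps each individual's haplotype internally consistent, which resampling single values would not.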
Conclusions
Using an appropriate software combination, it is possible to cluster
extremely large numbers of Y chromosome haplotypes to investigate their
substructure. It is hoped this will allow detailed investigation of the
geographic origin and subsequent spread of members of these subclusters.