Phase III Analysis: 37 STRs from YSearch 23 August 2005

John McEwan

(First posted 3rd Sept 2005; updated 11th, 15th, 18th, 19th Sept, 4th Feb, 11th March 2006)

Objective

The objective of this analysis is to:

· Fully investigate R1b haplogroup STR clusters present in the Ysearch database

· Provide sufficient data to allow corrected geographical density plots derived from surnames

· Allow others to quickly identify distantly related haplotypes and associated subcluster for their haplotype

Disclaimer

These analyses simply provide me with working hypotheses. I make no claim for their accuracy or the robustness of their topologies, nor for the use others make of them. Specifically, R1b has been broken down into many sub-clusters, but most of these are entirely speculative, with the exception of R1bSTR19 (Irish), R1bSTR22 (Frisian/Germanic), and R1bSTR47 (Scots), where independent analysis supports the broad classification. As a broad rule of thumb, if the stem of a cluster is shorter than 0.1 then some doubt exists, and if it is shorter than 0.04 considerable doubt exists.

Data and analysis

The haplotypes in Ysearch were downloaded on 23rd August 2005. In total there were 15,358 individuals. The data were then edited in Excel to remove: records without a full set of 37 FTDNA marker results; obvious haplotypes that did not represent individuals (e.g. “modal” or “researcher” in the name); and one record, ZZHNZ, that appears to be an obvious input error. Locus 389ii was recoded by subtracting the value of 389i from it so that it is in a more appropriate form for analysis. The resulting data set consisted of 3996 records. This data set was then manually coded by given country of origin and region of origin. Records from the Americas were classified as “other” for region of origin.
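The recoding and the completeness filter described above were done by hand in Excel. As a minimal sketch of the same two steps, assuming each record is held as a dictionary of marker values (the key names `DYS389i` and `DYS389ii` and both helper names are illustrative, not taken from the original spreadsheet):

```python
def recode_389ii(haplotype):
    """Return a copy of the record with 389ii replaced by (389ii - 389i)."""
    recoded = dict(haplotype)  # copy, so the source record is left untouched
    recoded["DYS389ii"] = haplotype["DYS389ii"] - haplotype["DYS389i"]
    return recoded


def has_all_markers(haplotype, markers):
    """True only if every required marker has a value in the record."""
    return all(haplotype.get(marker) is not None for marker in markers)


# Hypothetical record, for illustration only
example = {"DYS393": 13, "DYS389i": 13, "DYS389ii": 29}
if has_all_markers(example, ["DYS393", "DYS389i", "DYS389ii"]):
    print(recode_389ii(example)["DYS389ii"])  # 29 - 13 = 16
```

The subtraction matters because the reported 389ii value includes the 389i repeats, so without it a single change at 389i would be counted at both loci.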

Names were concatenated with their Ysearch listed haplogroup and their unique ID to ease future comparison (e.g. Courtney_R1b_E3HBX). In some cases names were slightly edited.

The multivariate distance matrix (Nei’s Da) was calculated using Populations and saved in Phylip format. The first line of this file was then edited from <tab>n<tab>n<CR> to n<CR>, and the file was read into Quicktree and clustered using the neighbor-joining algorithm of Saitou and Nei. Those interested in more details are referred to the websites and scientific papers associated with these programs. An extremely useful reference for those wishing to delve deeply is Takezaki and Nei 1996 Genetics 144:389-399.
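The header edit was made by hand. A small sketch of how it could be automated is given here, assuming the whole distance file fits in memory as a string and that a plain newline is acceptable as the line ending (the function name is hypothetical):

```python
def fix_phylip_header(text):
    """Reduce the first line of a Phylip distance file to the bare taxon count."""
    first, sep, rest = text.partition("\n")
    fields = first.split()  # drops the tabs and any trailing carriage return
    if not fields or not fields[0].isdigit():
        raise ValueError("first line does not start with a taxon count")
    return fields[0] + sep + rest


# Tiny made-up matrix, for illustration only
sample = "\t3\t3\r\nA 0.0 0.1 0.2\nB 0.1 0.0 0.3\nC 0.2 0.3 0.0\n"
print(fix_phylip_header(sample).splitlines()[0])
```

Only the first line is changed; the matrix rows are passed through exactly as Populations wrote them.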

The resultant Newick format file was then:

1. processed through a script called Unroll to extract unique IDs and their tree order. This allows the source Excel file to be sorted in cluster order.

2. edited to select or delete sections of the tree file manually in PFE using the match braces function and saved as 6 separate files. Phylograms were created from the resultant files using TreeView and saved as emf files, which were then edited in PowerPoint before printing as PDF files using PrimoPDF. The edits included identification of likely boundaries between haplogroups and in some cases subclades.
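The Unroll script itself is not reproduced here. The following is only a sketch of what step 1 involves, assuming leaf labels follow the Name_Haplogroup_ID pattern described earlier, contain no spaces or quoted characters, and carry the unique ID after the last underscore (the function name and the second and third labels are hypothetical):

```python
import re

# A leaf label is the text that directly follows "(" or "," in a Newick string.
# Labels after ")" (internal nodes) and numbers after ":" (branch lengths) are skipped.
LEAF = re.compile(r"[(,]\s*([^(),:;\s]+)")


def tree_order(newick):
    """Return (tree order, unique ID, full label) for each leaf, left to right."""
    rows = []
    for order, label in enumerate(LEAF.findall(newick), start=1):
        unique_id = label.rsplit("_", 1)[-1]
        rows.append((order, unique_id, label))
    return rows


tree = "((Courtney_R1b_E3HBX:0.01,Smith_R1b_ABCDE:0.02):0.03,Jones_I1a_XYZ12:0.10);"
for row in tree_order(tree):
    print(row)
```

Joining the resulting tree order to the spreadsheet on the unique ID is what allows the source data to be sorted in cluster order.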

The links to the phylograms are shown below. They are PDF files, so you will need a PDF viewer. Note that you can search within the PDF viewer for a surname or unique ID in the phylogram.

Results to date

Phylograms

The phylograms show clear and obvious structure, and in general the analysis has accurately resolved high-level haplogroups as per the YCC 2005 tree. As expected, some of the deepest clades with few members are less well resolved due to saturation of marker changes, high variability within the haplogroup, and the distance measure used. Some user-provided haplogroup assignments also appear to be incorrect. Several of these have been tested via Whit Athey’s haplotype predictor and found to be wildly inconsistent. It would be interesting to investigate whether these individuals have been SNP tested.

PDF files of phylograms of 3996 Ysearch 37 STR haplotypes by Haplogroup

R1b group 1: tree order 1-659 (R1bSTR1-15, updated 15th Sept)

R1b group 2: tree order 660-1466 (R1bSTR16-27, includes Irish and Frisian clusters, updated 15th Sept)

E,J,Q,N,R1a (less I, G): tree order 1447-1717 & 2614-2889 (updated 18th Sept)

I,G: tree order 1718-2613 (updated 18th Sept)

R1b group 3: tree order 2890-3142 (R1bSTR28-35, updated 15th Sept)

R1b group 4: tree order 3143-3996 (R1bSTR36-49, includes Scots cluster, updated 15th Sept)

Viewing and searching the phylograms

· They are PDF files so you need a PDF viewer, which most browsers have by default

· Typically you will have to magnify the images up to 800% or greater to make out the individual names and identifiers. Use the PDF viewer magnify button normally located at the top of your browser screen

· Using the search button in the PDF viewer allows you to search for names and unique IDs. You have to search each PDF separately

· Typically, those individuals closest to you and with the shortest separating branches have closely related Y chromosome haplotypes.

· You can use the text select tool to copy a selection of individual identifiers. If you paste these in Excel you will obtain a list of identifiers which you can edit and use to extract raw haplotypes from Ysearch for your own detailed analysis in applications such as Y-DNA comparison utility http://www.mymcgee.com/tools/yutility.html

· The shape and structure of the branches provides an indication of the evolution of the subcluster.

· Detailed queries about these results are best sent to the DNA Genealogy list server.

Cluster Nomenclature

· Where possible, nomenclature from the YCC SNP based haplogroups has been used if a reasonable match can be found.

· Please note that individual haplotypes cannot be placed unambiguously into SNP-based haplogroups. There is some level of error, and this varies depending on the distinctness and age of the clusters and on the haplotype concerned. Ken Nordtvedt and Bonnie Schrack helped annotate the I and J haplogroups respectively, and their help is gratefully appreciated. Whit Athey’s predictor was used to independently identify and find the edges of various haplogroups. R1a has not yet been clustered, although distinct clusters are observed.

· If a cluster has a “common name” that is widely used in genealogy circles then this has also been provided. Please note this common name is often geographic, but it only means that haplotypes belonging to the cluster are more frequent in that area, not that the cluster is diagnostic of, or solely located in, that region.

· Where there is an apparent cluster, but little or no work has been undertaken defining it, I have given it a placeholder identification until it can be further defined. This is a two-part name: a YCC haplogroup description followed by an STR cluster number within that haplogroup, e.g. I1aSTR10.

Comparison with Whit Athey’s haplotype predictor

Whit Athey has batch calculated haplogroup predictions for all 3996 individuals using their 37 markers. When these results were compared to those obtained by clustering, a very sharp and clear delineation between groups was obtained. This approach also identified several groups, both known and unknown, that were outside the haplotype predictor criteria. When these were excluded, the concordance between the two methods was extremely high. The method also identified a small number of individuals whose user-supplied haplotype or haplogroup may be incorrect. Further details are given in the link below.

Comparison with haplotype predictor methodology

Validation against SNP results

SNP genotypes have the useful property that over short time periods (<100,000 years) they can essentially be considered unique event polymorphisms. This provides a “black and white” test of whether two individuals share a common ancestor after that mutation occurred. They are also used to classify individuals into haplogroups. Unfortunately, most of the haplogroup designations in Ysearch are not SNP based, and there is no formal source recording those SNP tests that have been done. If they were available, they would provide an absolute independent check of the accuracy of the clustering algorithm.

Modal locus values for haplogroup and sub-haplogroup clusters defined by the 3rd phase analysis

Tables of modal values, genetic distances and TMRCA estimates for all the haplogroups and clusters are available from the link below. Please read the notes before using the values, especially regarding the 389ii values.

Tables of modal values from phase 3 analysis, 1.2Mb, last updated 19th Sept 2005 (you may wish to save it locally after downloading it)
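For readers who want to recompute a modal haplotype for their own selection of records, a minimal sketch follows. It assumes each haplotype is a list of integer repeat counts in the same locus order; the tie-breaking rule (take the smaller value) is an arbitrary choice made here and is not necessarily the one used for the published tables.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most common value at each locus; ties are resolved to the smaller value."""
    modal = []
    for locus_values in zip(*haplotypes):  # iterate locus by locus
        counts = Counter(locus_values)
        top = max(counts.values())
        modal.append(min(v for v, c in counts.items() if c == top))
    return modal


# Three made-up three-marker haplotypes, for illustration only
cluster = [[13, 24, 14], [13, 25, 14], [12, 24, 15]]
print(modal_haplotype(cluster))
```

If 389ii has been recoded as 389ii minus 389i, the modal value returned for that locus is in the recoded form as well.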

Raw data tabulation of user provided haplogroup and region of origin

A pivot table of the raw data by user-provided haplogroup and region of origin is provided below. The user-provided origins were grouped into geographical regions. Please note that 40% of records are not ascribed to a haplogroup and 52% do not have an ancestral region of origin, so only 29% of the data can be assigned a value.

Tabulation of user supplied haplogroup by region of origin.
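The 29% figure is consistent with simple arithmetic if the two kinds of missing information are assumed to be independent of each other. That independence is an assumption made here for illustration; the published figure may equally have been counted directly from the data.

```python
def usable_fraction(missing_haplogroup, missing_region):
    """Fraction of records with both fields present, assuming independent gaps."""
    return (1 - missing_haplogroup) * (1 - missing_region)


# User-provided data: 40% lack a haplogroup and 52% lack a region
print(f"{usable_fraction(0.40, 0.52):.0%}")  # 0.60 * 0.48 = 0.288, prints 29%

# If every record is given a haplogroup or cluster, only the region gap remains
print(f"{usable_fraction(0.00, 0.52):.0%}")  # prints 48%
```

The second line matches the 48% quoted for the tabulations based on estimated haplogroups and STR clusters.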

Tabulations of estimated haplogroup and STR cluster by region of origin

These tabulations improve on those above because all individuals are ascribed to a haplogroup or cluster based on STR homology, so 48% of the data can be assigned a value. A variety of tabulations are provided, including those corrected for sampling biases across regions and those expressed as a proportion of the haplogroup. However, extremely poor sampling of some regions provides insufficient numbers for some estimates. No error estimates are provided, nor any correction for multiple closely related samples, but ad hoc tests for significance of the observed patterns can be calculated if required.

Tabulations of estimated haplogroup and STR cluster by region of origin, both in percentages of the region, percentage of a haplogroup within a region and as raw data.
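The two kinds of percentage can be sketched as follows, assuming the data are available as (region, cluster) pairs. The function name and the example records are hypothetical, and the exact correction applied in the published tables may differ from this simple version.

```python
from collections import Counter


def percent_tables(records):
    """Return two dicts keyed by (region, cluster).

    pct_of_region:  share of a region's samples that fall in the cluster
    pct_of_cluster: share of a cluster's samples that come from the region
    """
    records = list(records)
    pair_counts = Counter(records)
    region_totals = Counter(region for region, _ in records)
    cluster_totals = Counter(cluster for _, cluster in records)
    pct_of_region = {k: 100 * n / region_totals[k[0]] for k, n in pair_counts.items()}
    pct_of_cluster = {k: 100 * n / cluster_totals[k[1]] for k, n in pair_counts.items()}
    return pct_of_region, pct_of_cluster


# Made-up records, for illustration only
data = [("Ireland", "R1bSTR19")] * 3 + [("Ireland", "R1bSTR22"), ("Germany", "R1bSTR22")]
by_region, by_cluster = percent_tables(data)
print(by_region[("Ireland", "R1bSTR19")], by_cluster[("Ireland", "R1bSTR22")])
```

Expressing counts as a percentage of each region is one simple way to stop heavily sampled regions from dominating the picture, although it does nothing for regions with very few samples.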

Estimating TMRCA and mutation rates for the phase 3 Y chromosome STR clusters via ASD estimates

The TMRCA and marker variability within each subcluster were investigated via ASD estimates. Note that these estimates differ from the TMRCA estimates given with the modal value tables because they are calibrated differently and a different method is used.

Summary is here

Full table is here
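As background to the ASD approach, the sketch below shows the basic calculation under a simple single-step mutation model: the average squared difference between each haplotype and an assumed ancestral haplotype, divided by a per-locus mutation rate, gives an estimate in generations. The modal haplotype standing in for the ancestor, the single average mutation rate of 0.002 and the 25-year generation are all illustrative assumptions; they are not the calibration used for the linked tables.

```python
def asd_tmrca(haplotypes, ancestral, mutation_rate, years_per_generation=25):
    """Return (ASD, generations, years) for a set of haplotypes.

    ASD is the squared difference from the ancestral value, averaged over
    all haplotypes and all loci.
    """
    n_loci = len(ancestral)
    total = 0.0
    for haplotype in haplotypes:
        total += sum((a - b) ** 2 for a, b in zip(haplotype, ancestral))
    asd = total / (len(haplotypes) * n_loci)
    generations = asd / mutation_rate
    return asd, generations, generations * years_per_generation


# Four made-up two-marker haplotypes and an assumed ancestral haplotype
cluster = [[13, 24], [14, 24], [13, 25], [13, 24]]
asd, generations, years = asd_tmrca(cluster, ancestral=[13, 24], mutation_rate=0.002)
print(asd, round(generations), round(years))
```

Because the result scales directly with the chosen mutation rate and generation length, estimates made with different calibrations are not directly comparable.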

Still to be completed

Creation of summary phylograms showing the clusters, and bootstrapping of the analysis to provide an indication of the robustness of the branch structure joining the clusters. Also, estimation of the geographic density of sub-clusters (corrected for sampling bias) based on density derived from surname frequencies.

Conclusions

Using an appropriate software combination, it is possible to cluster extremely large numbers of Y chromosome haplotypes to investigate their substructure. It is hoped this will allow detailed investigation of the geographic origin and subsequent spread of members of these subclusters.