ABO allele frequencies: Why use the EM algorithm?

In textbooks and lecture notes and slides posted online, determining allele frequencies using blood type information (ABO), under the assumption of Hardy-Weinberg equilibrium, is accomplished using the EM algorithm. It seems to me, though, that this problem can be solved using basic algebra: If $a$, $b$, $o$ are the frequencies of the $A$, $B$, and $O$ alleles, respectively, and $p_A$, $p_B$, and $p_O$ are the proportions of the population in question displaying blood types $A$, $B$, and $O$, then

$$ p_A = a^2 + 2ao,\quad p_B = b^2 + 2bo,\quad \text{and}\quad p_O = o^2. $$

Inverting the third equation gives $o$ in terms of $p_O$:

$$ o = \sqrt{p_O}. $$

Plugging that into the first equation and rearranging, we get the quadratic equation

$$ a^2 + 2\sqrt{p_O}\,a - p_A = 0. $$

Solving using the quadratic formula and throwing away the negative solution, we get

$$ a = \sqrt{p_O + p_A} - \sqrt{p_O}. $$

Similarly, the second equation gives

$$ b = \sqrt{p_O + p_B} - \sqrt{p_O}. $$

Thus, we've solved for the allele frequencies $a$, $b$, and $o$ with only basic algebra.

In the standard textbook case in which $p_A = 186/521$, $p_B = 38/521$, and $p_O = 284/521$, this gives

$$ a \approx 0.21,\quad b \approx 0.05,\quad \text{and}\quad o \approx 0.74, $$

which is close to what you get after a few iterations of the EM algorithm.
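For comparison, here is a minimal Python sketch of both approaches on the counts quoted above. The function names are made up for illustration, and the AB count of 13 is an assumption: it is simply what is left of the 521 individuals after subtracting the A, B, and O counts. The EM loop is the usual gene-counting scheme (split the A and B phenotypes into expected genotype counts, then count alleles), not code from any particular textbook.

```python
import math

def abo_closed_form(n_a, n_b, n_ab, n_o):
    """Closed-form estimates from the question; note that n_ab is never used."""
    n = n_a + n_b + n_ab + n_o
    p_a, p_b, p_o = n_a / n, n_b / n, n_o / n
    o = math.sqrt(p_o)
    a = math.sqrt(p_o + p_a) - o
    b = math.sqrt(p_o + p_b) - o
    return a, b, o

def abo_em(n_a, n_b, n_ab, n_o, iters=50):
    """Gene-counting EM under Hardy-Weinberg equilibrium."""
    n = n_a + n_b + n_ab + n_o
    a = b = o = 1.0 / 3.0  # arbitrary starting point
    for _ in range(iters):
        # E-step: split phenotypes A and B into expected genotype counts.
        n_aa = n_a * (a * a) / (a * a + 2 * a * o)
        n_ao = n_a - n_aa
        n_bb = n_b * (b * b) / (b * b + 2 * b * o)
        n_bo = n_b - n_bb
        # M-step: count alleles among the 2n chromosomes.
        a = (2 * n_aa + n_ao + n_ab) / (2 * n)
        b = (2 * n_bb + n_bo + n_ab) / (2 * n)
        o = (2 * n_o + n_ao + n_bo) / (2 * n)
    return a, b, o

if __name__ == "__main__":
    counts = (186, 38, 13, 284)  # A, B, AB (inferred remainder), O
    print("closed form:", abo_closed_form(*counts))
    print("EM:         ", abo_em(*counts))
```

On these counts the two sets of estimates agree to about two decimal places. One visible difference is that the closed form ignores the AB individuals, so its three estimates need not sum exactly to one, whereas the EM updates sum to one by construction.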

Question (finally): If the above calculation is correct (if not, please let me know!), what is a "simplest" non-synthetic (real data, from the literature) example of an allele frequency computation from phenotype data that actually requires a sophisticated technique like the EM algorithm?

An EM algorithm based on an internal list for estimating haplotype distributions of rare variants from pooled genotype data

Pooling is a cost effective way to collect data for genetic association studies, particularly for rare genetic variants. It is of interest to estimate the haplotype frequencies, which contain more information than single locus statistics. By viewing the pooled genotype data as incomplete data, the expectation-maximization (EM) algorithm is the natural algorithm to use, but it is computationally intensive. A recent proposal to reduce the computational burden is to make use of database information to form a list of frequently occurring haplotypes, and to restrict the haplotypes to come from this list only in implementing the EM algorithm. There is, however, the danger of using an incorrect list, and there may not be enough database information to form a list externally in some applications.


We investigate the possibility of creating an internal list from the data at hand. One way to form such a list is to collapse the observed total minor allele frequencies to “zero” or “at least one”, which is shown to have the desirable effect of amplifying the haplotype frequencies. To improve coverage, we propose ways to add and remove haplotypes from the list, and a benchmarking method to determine the frequency threshold for removing haplotypes. Simulation results show that the EM estimates based on a suitably augmented and trimmed collapsed data list (ATCDL) perform satisfactorily. In two scenarios involving 25 and 32 loci respectively, the EM-ATCDL estimates outperform the EM estimates based on other lists as well as the collapsed data maximum likelihood estimates.


The proposed augmented and trimmed CD list is a useful list on which to base the EM algorithm when estimating the haplotype distributions of rare variants. It can handle more markers and larger pool sizes than existing methods, and the resulting EM-ATCDL estimates are more efficient than the EM estimates based on other lists.


Crop breeding began with the initial domestication of wild ancestors to generate phenotypes suited for human use [1]. Dozens of domestication-related genes have been characterized and contribute to our understanding of the genetic basis of crop domestication [2, 3]. DNA sequences targeted during domestication events exhibit significantly reduced nucleotide diversity due to artificial selection, and beneficial variants are now mostly fixed in landraces and modern germplasm [4]. Crop improvement involves the selection of additional sets of genes, and useful variants at these sequences also accumulated over time in improved germplasm [5, 6]. Various breeding goals and adaptation to diverse environments have caused a widely differing distribution of alleles across populations with more subtle effects on phenotypic morphology, as compared to variation in domestication-related alleles [7]. While this rich pool of potential genetic variants may further improve future crop yield, their minor effects on desirable traits complicate their identification and isolation in the analysis of small populations.

Approximately three percent of all maize genes were subjected to selection during domestication and improvement [8, 9]. The incorporation of useful alleles of these genes in breeding schemes is generally implemented via crossing between individuals to allow the accumulation and fine-tuning of phenotypic alterations that result from DNA recombination and the reshuffling of causal variants [10]. Thus, most artificial selection has essentially worked to reshape gene networks, rather than on single genes [11]. Although we generally define domestication and improvement as distinct phenomena, the genes and variants influencing each have been — and are still being — cooperatively selected and adopted to achieve trait improvement. Breeding success would benefit from a better understanding of this process, which remains elusive to date.

The improvement of quantitative traits such as flowering time, plant stature, grain yield, environmental adaptation, and biotic and abiotic stress resistance depends on the selection of biological interactions between multiple genes (polygenic interactions) [12]. To achieve the desired goal for a target trait, breeders develop populations by crossing a panel of breeding materials to generate novel combinations of favorable alleles and diversified polygenic interactions that can be selected for optimal traits [13, 14]. Most genes encoding polygenic traits contribute subtle effects to the overall quantitative trait expression, as the reshaped gene-interaction networks involve dozens to hundreds of genes. For this reason, genotype-to-phenotype (G2P) prediction or genomic selection (GS) models using whole-genome variations have been effective solutions to predict hybrid performance for plant breeding [15,16,17,18,19].

Maize (Zea mays) was one of the earliest crops to benefit from the power of heterosis by breeding filial one (F1) hybrids exhibiting superior vigor for plant growth and grain yield. The mystery of heterosis has been explored for over a century, but the underlying mechanism remains insufficiently understood [20]. One hypothesis for heterosis is the “dominance” model, which proposes that the hybrid vigor of F1s results from dominance complementation of many recessive, slightly deleterious alleles at different loci in the parental genomes [21, 22]. This hypothesis was further supported by Yang et al., who identified deleterious mutations genome-wide and showed that dominance complementation of deleterious alleles contributes to the formation of heterosis [23]. The second hypothesis is overdominance, in which heterozygosity at an individual locus produces a phenotype superior to that of either homozygous state [24]. Several genes support overdominance in crops [13, 25, 26]. The development of molecular marker and next-generation sequencing (NGS) technologies has allowed large genomic-scale mapping studies in all major crops. These analyses, based on segregating populations very often derived from only two parents, have empowered the dissection of the genetic architecture of heterosis, mostly focusing on grain yield [13, 27,28,29,30,31,32].

The occurrence and strength of heterosis vary greatly depending on the germplasm origins of the parental lines; thus, the genetic diversity of a population created with only two parents will never be sufficient to identify all heterotic quantitative trait loci (QTLs). Additional limitations of bi-parental populations further restrict the effective detection of epistatic QTLs. For example, each segregating F2 population must be very large to ensure sufficient statistical power. Furthermore, any two interacting loci involved in heterosis must be segregating in the F2 population, or epistasis will not be detected. These two limitations are the major reasons why many previous studies of heterosis underestimated the role that epistasis plays [28, 33].

Here, we present a new genetic design that overcomes these limitations and analyzes multiple linked F1 populations. It was created by crossing inbred lines, developed as a synthetic population, with inbred lines typical of diverse heterotic groups from around the world. This design may allow the comprehensive dissection of heterotic QTLs and associated effects. Identification of heterosis-determining genes may refine our understanding of the mechanism behind heterosis formation. This new mechanistic knowledge may in turn accelerate the process of creating and fixing new heterotic patterns between different pools of germplasm, reducing genetic vulnerability and ultimately enhancing yield in maize improvement.

We constructed 30 F1 populations by crossing 1428 previously reported inbred lines from the CUBIC (Complete-diallel plus Unbalanced Breeding-derived Inter-Cross) synthetic population as a maternal pool [34] with 30 paternal testers from diverse heterotic groups. We performed genome-wide association studies (GWAS) on all populations to identify heterosis and trait-associated genes involved in maize improvement. Interrogation of the 42,840 F1 combinations uncovered the critical roles played by polygenic interactions and provided the framework to propose a theoretical model of the gene-regulation networks at work during floral transition. Based on this model, we used targeted genes and their associated effective variants to demonstrate the successful implementation of molecular design breeding (MDB) to facilitate selection of optimal genotypic combinations to fine-tune desired phenotypes. With the integration of G2P, GWAS and MDB on an actual breeding population, our work presents an exemplary solution to apply big data-driven decision-making strategies to target breeding for crop improvement (Fig. 1a).

Genetic design of the thirty F1 hybrid populations. a Flowchart illustrating the integration of genotype to phenotype (G2P), genome-wide association studies (GWAS), and molecular design breeding (MDB) to achieve big data-assisted targeted breeding. b Schematic illustration of the North Carolina-II design used to generate the 42,840 F1 combinations by crossing 1428 maternal lines with 30 paternal testers. The training set is composed of 8652 hybrids with field-measured phenotypes (dark gray background). The remaining 34,188 hybrids constitute the candidate set whose phenotypes were predicted with the G2P model. c Principal component analysis (PCA) diagram of the thirty F1 hybrid populations showing strong population stratification. d Effect of phenotypic stratification across the thirty F1 hybrid populations, exemplified here with days to tasseling (DTT), before (left) and after (right) normalizing absolute trait values to z-scores within each F1 population. e Strong heterosis performance of the Zheng58 and Jing724 F1 hybrids compared to their parental inbred lines for the three traits under study.


Using simulations we explore different study designs that could occur for real data and evaluate the performance of NGSadmix by comparing the estimated admixture proportions to true admixture proportions as well as estimates of admixture proportions based on genotypes called from the simulated sequencing data. We then apply the method on low-depth sequencing data from the 1000 Genomes Project, while comparing the estimates to admixture proportions inferred from HapMap 3 genotype data as well as genotypes called from the sequencing data.


We considered two different population scenarios for our simulations: three closely related ancestral populations based on HGDP allele frequencies and three more distant populations based on HapMap 3 allele frequencies. For each of these sets of allele frequencies, we simulated four different study designs, denoted scenarios A, B, C, and D. We simulated sequencing data conditional on the admixture proportions and allele frequencies in the ancestral populations. Further details of the simulations can be found in Materials and Methods.

Scenario A: Variable depth:

Inspired by the observed average depth distribution in the 1000 Genomes data (see Figure S1), we simulated 100,000 SNPs for 50 samples with varying average depth based on HGDP frequencies, as described in Materials and Methods. The true admixture proportions are shown in the top panel of Figure 1 followed by the individual average sequencing depths. The other panels in Figure 1 show the estimated admixture coefficients using NGSadmix, the estimated admixture coefficients based on maximum-likelihood genotypes (ML), and the maximum posterior genotypes (HW). On filtered HW genotypes (HW filtered) we could not obtain convergence, and the estimates are shown in Figure S6. NGSadmix performs better than the analysis based on called genotypes no matter how the genotypes were called. The HW genotypes called by applying a prior based on the allele frequencies seem to perform better than calling genotypes based on the highest genotype likelihood (ML genotypes).

To quantify the performance beyond visual inspection we calculated the root mean square deviation (RMSD) of the estimated admixture proportions from the true admixture proportions (see leftmost group of bars in Figure 2). Using NGSadmix gives a RMSD of 0.16 while the best-performing method based on called genotypes has an RMSD of 1.18. Similarly, the largest deviation between estimated proportions and true proportions is shown in Figure S2.
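As a rough illustration of the summary statistic, the sketch below computes an RMSD between an estimated and a true matrix of admixture proportions, with one row per individual and one column per ancestral population. The function name and the choice to average over every matrix entry are assumptions; the exact averaging used for the figures is not spelled out in this excerpt. The columns are also assumed to be matched to the same populations already, since admixture software may return them in a different order.

```python
import math

def rmsd(q_est, q_true):
    """Root mean square deviation over all entries of two equally shaped
    admixture-proportion matrices (lists of per-individual rows)."""
    squared = [
        (est - true) ** 2
        for est_row, true_row in zip(q_est, q_true)
        for est, true in zip(est_row, true_row)
    ]
    return math.sqrt(sum(squared) / len(squared))

if __name__ == "__main__":
    # Hypothetical two-individual, two-population example.
    truth = [[1.0, 0.0], [0.5, 0.5]]
    estimate = [[0.9, 0.1], [0.6, 0.4]]
    print(rmsd(estimate, truth))
```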

Root mean square deviation (RMSD) from the true admixture proportions for the different estimated admixture proportions. Left, scenario A, B, and C simulations using HGDP frequencies and then scenario A, B, and C simulations using HapMap frequencies. Right, the 1000 Genomes five-population scenario sample assuming K = 3 ancestral populations and K = 4 ancestral populations and then the 1000 Genomes two-population scenario sample assuming K = 2 ancestral populations. Note that RMSD is shown on log-scale.

The same scenario was also simulated based on allele frequencies from three distinct populations from HapMap 3. All approaches perform better since the populations are easier to distinguish (see Figure S5). Interestingly, HW genotypes do worse than the ML genotypes for the more distant populations (see Figure 2). NGSadmix still outperforms the methods based on called genotypes.

Scenario B: Low depth:

When simulating a scenario with a low and equal depth of 2× for all individuals, calling genotypes does not show the large bias clearly visible in the variable-depth scenario (see Figure S3 and Figure S4). NGSadmix still performs better with an RMSD of 0.11 while the best-performing method based on called genotypes has an RMSD of 0.18 in the HGDP-based simulations (HGDP Low in Figure 2). Genotype callers based on multiple samples such as using allele frequency priors generally generate better genotype calls than individual genotype callers (Nielsen et al. 2011). However, for this simulation the called ML genotypes based solely on the individual genotype likelihoods give a slightly better result. This is true both for the closely related populations and for the distantly related populations.

Scenario C: High and low depth:

This scenario seeks to mimic a design where a reference panel is sequenced at high depth or genotyped using SNP chips while some individuals are sequenced at very low depth (see Figure S7 and Figure S8). NGSadmix gives approximately correct admixture proportions when simulating both distant and closely related populations. However, when calling genotypes the estimated admixture proportions for the closely related populations resemble the difference in sequencing depth more than the actual ancestry. These problems are clearly reflected in the RMSD and in the maximum difference between the true and estimated admixture proportions (High/Low in Figure 2 and Figure S2). For the HapMap populations the ML genotypes give good results compared to the HW and filtered genotypes.

Scenario D: Small admixture proportions:

In practice the contribution of a single ancestral population to an individual’s ancestry can be much lower than the proportions studied above. In this scenario we simulated individuals with sequencing depth between 0.5× and 6× and with a wide range of admixture proportions, some as low as 5% (see top panel in Figure S9). NGSadmix identifies even low levels of ancestral contribution quite accurately for distantly as well as closely related populations (see Figure S9 and Figure S10). Figure S9 and Figure S10 illustrate how sequencing depth can severely bias the admixture proportions estimated from called genotypes. To further describe the precision of the estimated admixture proportions, we simulated 100 realizations of this scenario. For each simulation we also estimated the admixture proportions directly from the true simulated genotypes. Figure S11 and Figure S12 show that even though some accuracy is lost, NGSadmix does not perform much worse than estimates from the true simulated genotypes. From Figure S13 we see that NGSadmix generally outperforms estimating admixture proportions from called genotypes regardless of the admixture proportions. Figure S14 and Figure S15 show that the distribution of the coefficients estimated from called genotypes strongly depends on sequencing depth with small differences between admixture proportions.

1000 Genomes sequencing data

We tested the performance of our method on the 1000 Genomes low-coverage sequencing data. We selected 100 individuals from five distinct HapMap 3 populations that overlap with the 1000 Genomes low-coverage sequencing data. To evaluate the performance, we first used ADMIXTURE (Alexander et al. 2009) on publicly available HapMap 3 genotypes. Results assuming three ancestral populations are shown in Figure 3 and those for four populations are in Figure S17. For the same sites we generated genotype likelihoods based on the sequencing data. The admixture proportions estimated by NGSadmix are shown in the second panels of Figure 3 and Figure S17. The estimates based on sequencing data are almost indistinguishable from those based on HapMap 3 genotypes with the maximum observed difference in admixture proportion of 1.5% (labeled K = 3 in Figure S2). Analysis based on called genotypes also captures most of the admixture signal but for HW genotypes several of the nonadmixed individuals show a large amount of admixture (see Figure S16). Estimates based on called genotypes in general have a higher RMSD and a higher maximum deviance (labeled K = 3 in Figure 2 and Figure S2). However, estimates based on haplotype imputed genotypes performed only slightly worse than NGSadmix. Similar results are observed when assuming a higher number of populations (Figure S17).

Estimated admixture proportions from both HapMap 3 genotype data (top) and NGSadmix results (bottom) for low-depth sequencing data from the 1000 Genomes.

We also tested NGSadmix on Japanese and Han Chinese since these populations are more closely related and are much harder to distinguish. The inferred population structure might not be due to two distinct homogeneous ancestral populations but might instead be the result of a more complex population history. All of the methods perform worse in this data set. As observed in the simulated data, the estimated admixture proportions correlate with the sequencing depth for ML and HW genotypes, while the admixture proportions estimated by NGSadmix and haplotype imputed genotypes correspond fairly well to those inferred using HapMap 3 genotypes (see Figure S18).

So far only sites that are known to be polymorphic have been analyzed. For some populations a large set of known SNPs might not be available. Therefore, we inferred SNPs from the sequencing data and based the analysis on those SNPs. We chose to call SNPs from 50 10-Mb randomly selected regions. This resulted in >2 million inferred SNPs. The results based on this analysis can be seen in Figure S19. Note that for the admixed individuals we cannot expect the exact same results from two different sets of SNPs since alleles from one population will be present in long tracts along the genome. However, we still see a very high correlation with the analysis from the SNP chip data. We estimated genotype likelihoods using both Equation 1 and SAMtools (H. Li et al. 2009). Although they gave a very different number of polymorphic sites, they gave similar admixture proportion estimates.

Convergence and computational speed

Convergence may sometimes be a problem due to the large number of parameters that are estimated simultaneously. Therefore, each scenario was run multiple times with different random starting points. Using NGSadmix on genotype likelihoods, all runs typically converge to the same maximum. When calling genotypes from the sequencing data, convergence was in some scenarios more of a problem, and here we had to run ADMIXTURE many times to determine the maximum (see Table S1).

When applying NGSadmix on 100 individuals from the 1000 Genomes project with almost 3 million SNPs, the method took on average almost 5 hr, using 10 central processing unit (CPU) threads. When running ADMIXTURE on SNP chip genotypes from the same data, the run time was on average 0.5 hr, also using 10 CPU threads. For the 100 realizations of scenario D with 340 individuals the average time for NGSadmix for the HapMap frequencies was 12.2 min (SD = 1.9), using 6 CPUs.

Blood Groups Distribution and Gene Diversity of the ABO and Rh (D) Loci in the Mexican Population

Objective. To determine the frequency and distribution of ABO and Rh (D) antigens and, additionally, to investigate gene diversity and the structure of Mexican populations. Materials and Methods. Blood groups were tested in 271,164 subjects from 2014 to 2016. The ABO blood group was determined by agglutination using anti-A and anti-B antibodies, and the Rh factor using anti-D antibody. Results. The overall distribution of ABO and Rh (D) groups in the population studied was as follows: O: 61.82%, A: 27.44%, B: 8.93%, and AB: 1.81%. For the Rh group, 95.58% of people were Rh (D), and 4.42% were Rh (d). Different distributions of blood groups were found across regions. Additionally, genetic analysis revealed that the […] allele showed an increasing trend from the north to the center, while the […] allele tended to increase from the center to the north. Also, we found more gene diversity at both loci in the north compared with the center, suggesting population structure in Mexico. Conclusion. This work could help health institutions to identify where they can obtain blood products necessary for medical interventions. Moreover, this information contributes to the knowledge of the genetic structure of the Mexican populations, which could have significant implications in different fields of biomedicine.

1. Introduction

More than a century has passed since the discovery of the ABO system by Karl Landsteiner in 1901. This knowledge has contributed to the understanding of some basic mechanisms of heredity, and today it still holds great conceptual and clinical interest [1]. Blood antigens have also been related to predisposing individuals to some diseases such as cancer, diabetes, infectious diseases, and heart illnesses [2–4] or to protecting individuals against some diseases such as malaria and diabetes [5, 6]. Moreover, blood antigens have been used to evaluate the ethnic diversity of human populations [7], for which they have been widely studied in population genetics [8, 9].

The ABO and Rh blood groups are the most relevant antigens because their incompatibility produces hemolysis [10] and, in the case of the Rh group, hemolytic disease of the newborn [11]. Furthermore, blood antigens play an important role in the success of transfusions and organ transplants [12]. Compatibility of ABO groups between donors and recipients is desirable to avoid immune responses against the allograft and to reduce the use of immunosuppressive therapies.

The main challenge is to understand how to promote tolerant immune responses against allograft tissues. Different factors such as age, viral serology, and gender have been studied to identify their relationship with allograft rejection. In contrast, the role of ethnicity has been poorly studied [13–17]. For this reason, molecular biology has become important for identifying genetic variants present in several ethnic groups that could play an important role in the success of allograft transplants between donors and recipients [18].

There are few works on population genetics in Mexico [19–21]. The first studies were performed by Lisker and colleagues in indigenous and mestizo populations by studying several blood antigens [9, 22]. However, few populations were studied, and there is currently a lack of information about the distribution of blood groups in the country. This information is essential to help health institutions manage their blood banks effectively and thereby facilitate transplant medicine practices.

Here we report the distribution of ABO and D antigens in 17 states of the country. Additionally, we used the ABO and D loci as a genetic trait to investigate gene structure of Mexican populations. The above will provide information which would support national programs for blood and organ transplant in Mexico as well as increase the knowledge of Mexican genetics.

2. Material and Methods

2.1. Study Design

A cross-sectional study was conducted in patients who visited the clinics of Salud Digna para Todos in 17 states of Mexico from 2014 to 2016. Participants were selected by nonprobabilistic sampling among those with information on the blood group test. A clinical history was obtained from each person before screening to collect their demographic data. A total of 271,164 subjects of both genders, aged 0 to 90 years, were selected. Ethical approval was obtained from the Ethical and Research Committee of the Faculty of Medicine, Autonomous University of Sinaloa.

2.2. Sample Collection and Blood Groups Determination

From each patient, we collected approximately 5 ml of peripheral venous blood with the BD Vacutainer® Blood Collection Tube with EDTA. Tubes were inverted 8 to 10 times to mix the blood well with the anticoagulant. Blood samples were centrifuged at 1000 to 1500 rpm for 10 min. Erythrocytes were separated for the determination of blood type. The ABO blood group was determined from each sample by agglutination using anti-A and anti-B antibodies (Immucor Inc., Norcross, GA, USA). The Rh factor was determined by agglutination using anti-D antibody (Immucor Inc., Norcross, GA, USA). All assays were performed with the Galileo Echo™ Blood Analyzer (Immucor Inc., Norcross, GA, USA) according to manufacturers’ recommendations.

2.3. Allelic Frequency and Gene Diversity Analysis

Allele frequencies were estimated from the phenotypic data according to Bernstein’s method (1925) [23]. The expected frequencies were calculated from the Rh and ABO phenotypes under the assumption of Hardy-Weinberg equilibrium, with the Expectation-Maximization (EM) algorithm [24]. Gene diversity was analyzed according to Nei [25]. The Nei genetic distances [26] were calculated based on the gene frequency data of the ABO and D loci, and a dendrogram was constructed using the neighbor-joining (NJ) clustering procedure with the POPTREEW software [27]. The gene frequencies were used for the Principal Components Analysis (PCA).
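A minimal sketch of how such estimates can be computed from phenotype proportions is given below. It uses the classical square-root estimators usually attributed to Bernstein and a gene diversity of one minus the sum of squared allele frequencies. Several things here are assumptions and not the study's actual procedure: the function names are hypothetical, the final rescaling to a sum of one is a simple normalization standing in for Bernstein's own correction, and the small-sample factor that some definitions of gene diversity include is omitted.

```python
import math

def bernstein_abo(p_a, p_b, p_ab, p_o):
    """Square-root estimators for the A, B, and O allele frequencies.

    Inputs are phenotype proportions. p_ab is not used by these estimators;
    it is accepted only so that all four phenotypes can be passed in.
    """
    a = 1.0 - math.sqrt(p_b + p_o)  # A allele
    b = 1.0 - math.sqrt(p_a + p_o)  # B allele
    o = math.sqrt(p_o)              # O allele
    total = a + b + o               # usually close to, but not exactly, 1
    return a / total, b / total, o / total  # assumed simple rescaling

def gene_diversity(freqs):
    """Expected heterozygosity: 1 - sum of squared allele frequencies."""
    return 1.0 - sum(f * f for f in freqs)

if __name__ == "__main__":
    # National phenotype percentages reported in the abstract.
    freqs = bernstein_abo(0.2744, 0.0893, 0.0181, 0.6182)
    print("allele frequencies (A, B, O):", freqs)
    print("gene diversity:", gene_diversity(freqs))
```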

2.4. Statistical Analysis

Demographic and phenotypic data were analyzed with descriptive statistics. Proportions of blood groups are shown as percentages with 95% CI. The chi-squared test was performed to compare differences between groups and categories. P values less than 0.05 were regarded as statistically significant. The Wilson score method without continuity correction was used to calculate 95% CI. Data were analyzed with the Minitab V17 software (Minitab Inc.).
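For readers unfamiliar with it, the Wilson score interval without continuity correction can be sketched in a few lines of Python. The function name is hypothetical, and the count in the usage example is an approximation reconstructed from the reported 61.82% of 271,164 subjects, not a figure taken from the paper's tables.

```python
import math

def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval (no continuity correction) for a proportion.

    z defaults to the standard normal quantile for a two-sided 95% interval.
    """
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n
                               + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width

if __name__ == "__main__":
    # Approximate number of group O subjects (assumed: 61.82% of 271,164).
    low, high = wilson_interval(167634, 271164)
    print(f"95% CI for group O: {low:.4f} to {high:.4f}")
```

Unlike the simple normal-approximation interval, this one stays inside the range from 0 to 1 even for very rare categories such as AB Rh (d).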

3. Results

3.1. Blood Groups Distribution by Age and Gender

The distribution of blood types in the 271,164 people studied revealed that O was the most frequent (61.82%), followed by A at 27.44% and B at 8.93%, and finally the AB group was the least frequent at 1.81%. Moreover, the Rh (D) group was found in 95.58% of the people studied, and 4.42% were identified with the Rh (d) group (Figure 1).

The pooled ABO and Rh groups had the following distribution: the O Rh (D) type was the most frequent (59.26%), followed by A Rh (D) (26.08%), B Rh (D) (8.53%), O Rh (d) (2.56%), AB Rh (D) (1.71%), and A Rh (d) (1.35%). In contrast, B Rh (d) (0.40%) and AB Rh (d) (0.10%) were the least frequent, each found in less than 1% of the people studied (Table 1).

The distribution of pooled blood antigens by age and gender was analyzed. Frequencies were similar across ages ranging from 0 to 90 years (Table 1). Interestingly, slight differences between genders were observed for some blood types.

A Rh (D) and AB Rh (D) blood types were more common in males (26.88% and 1.84%, resp.) than in females (25.74% and 1.66%, resp.). Meanwhile, O Rh (D) was more prevalent in women (59.78%) than in men (58.04%) (Table 1).

3.2. Geographic Distribution of Blood Groups

Blood groups were studied in 17 states, which belong to 6 regions of Mexico. The blood groups O and Rh (D) were the most frequent in all states analyzed. However, their frequencies change across the country (Figure 2). The blood type O Rh (D) was more frequent in Puebla (73.15%), Estado de Mexico (69.32%), and San Luis Potosi (66.18%) compared with Sinaloa (52.73%), Jalisco (54.86%), and Sonora (54.97%). Moreover, the A Rh (D) blood type was more frequent in Sinaloa (30.52%), Nayarit (28.60%), and Sonora (28.29%), whereas it was less prevalent in Puebla (18.34%), Estado de Mexico (20.48%), and Veracruz (21.34%) (Figure 2 and Table 2).

Distribution of ABO and Rhesus blood groups in different states of Mexico. Pie charts summarize per-state average proportions of ABO pooled to Rh (D) group and bars show the proportions of ABO combined with Rh (d) blood group. National average of blood group: 4.42% for Rh (d) (gray color) and 95.58% for Rh (D), involving 61.82% for O (red color), 27.44% for A (green color), 8.93% for B (blue color), and 1.81% for AB (orange color) groups. BC = Baja California; SON = Sonora; SIN = Sinaloa; DGO = Durango; NAY = Nayarit; COAH = Coahuila; JAL = Jalisco; MICH = Michoacan; NL = Nuevo Leon; GTO = Guanajuato; AGS = Aguascalientes; QRO = Queretaro; SLP = San Luis Potosi; VER = Veracruz; MEX = Estado de Mexico; PUE = Puebla; CDMX = Ciudad de Mexico.

B Rh (D) was more frequent in Durango (10.86%), Aguascalientes (9.90%), and Nuevo Leon (9.88%) in contrast, this group was less prevalent in Puebla (5.97%), Estado de Mexico (7.00%), and San Luis Potosi (7.12%). O Rh (d) was more frequent in Sinaloa (3.73%), Sonora (3.61%), and Durango (3.09%) than Puebla (1.21%), San Luis Potosi (1.29%), and Estado de Mexico (1.36%). For AB Rh (D) we observed that it was more frequent in Durango (2.40%), Jalisco (2.24%), and Michoacan (2.22%) and less frequent in Puebla (0.85%), San Luis Potosi (0.97%), and Estado de Mexico (1.02%).

The A Rh (d) blood type was more frequent in Sinaloa (2.41%), Sonora (2.19%), and Jalisco (1.77%) than in Puebla (0.39%), San Luis Potosi (0.32%), and Veracruz (0.56%). The B Rh (d) blood type was more frequent in Nayarit (0.63%), Sinaloa (0.62%), and Sonora (0.55%) and less prevalent in Puebla (0.08%), San Luis Potosi (0.16%), and Estado de Mexico (0.18%). Finally, the AB Rh (d) group was more frequent in Michoacan (0.24%), Sonora (0.13%), and Sinaloa (0.12%), was absent in Nuevo Leon (0.00%) and San Luis Potosi (0.00%), and was less frequent in Puebla (0.02%) and Estado de Mexico (0.02%) (Table 2).

3.3. Genetic Analysis of the ABO and D Loci
3.3.1. Heterozygosity and Hardy-Weinberg Equilibrium at the ABO and D Loci

We analyzed the heterozygosity of the ABO and D loci in the sample studied (Table 3). For the ABO locus, the highest heterozygosity was found in Sinaloa, while the lowest values were found in Puebla, Estado de Mexico, and San Luis Potosi. Similarly, for the D locus, the highest heterozygosities were observed in Sinaloa, Sonora, and Durango, while the lowest were observed in Puebla, San Luis Potosi, and Estado de Mexico (Table 3).
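As a rough illustration of how a per-locus heterozygosity can be obtained from allele frequencies, the sketch below computes the expected heterozygosity $1 - \sum_i p_i^2$. The function name, the optional small-sample correction, and the example frequencies are assumptions made for illustration; they are not the values of Table 3, and the source does not state which estimator was used.

```python
def expected_heterozygosity(freqs, n_individuals=None):
    """Expected heterozygosity 1 - sum(p_i^2) for a single locus.

    freqs: allele frequencies at the locus (must sum to 1).
    n_individuals: if given, apply the small-sample factor 2n / (2n - 1)
    (an assumption about which estimator is wanted).
    """
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    h = 1.0 - sum(p * p for p in freqs)
    if n_individuals is not None:
        h *= 2 * n_individuals / (2 * n_individuals - 1)
    return h


# Hypothetical frequencies, for illustration only.
abo_freqs = [0.20, 0.06, 0.74]  # A, B, O alleles
rh_freqs = [0.80, 0.20]         # D, d alleles
print(round(expected_heterozygosity(abo_freqs), 4))
print(round(expected_heterozygosity(rh_freqs), 4))
```

With samples as large as those in this study, the small-sample factor is very close to 1, so the corrected and uncorrected values are nearly identical.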

Based on these observations, the populations were tested for Hardy-Weinberg equilibrium (HWE). For the ABO locus, significant deviations were observed in Jalisco (χ² = 6.03) and Ciudad de Mexico (χ² = 5.42). In contrast, the D locus was in HWE in all populations analyzed (Table 3).
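A goodness-of-fit test of this kind can be sketched as follows, assuming the usual ABO model in which the expected phenotype proportions are $p^2 + 2pr$ (A), $q^2 + 2qr$ (B), $2pq$ (AB), and $r^2$ (O). With four phenotype classes and two independently estimated allele frequencies, the statistic has one degree of freedom, so the 5% critical value is 3.841. The function name and the counts below are hypothetical; in practice the allele frequencies would be estimated from the same counts rather than supplied by hand.

```python
CRITICAL_VALUE_DF1_ALPHA_05 = 3.841  # chi-square, 1 degree of freedom, alpha = 0.05


def abo_hwe_chi_square(counts, p, q, r):
    """Chi-square statistic of observed ABO phenotype counts against
    Hardy-Weinberg expectations for allele frequencies p (A), q (B), r (O).

    counts: dict with keys "A", "B", "AB", "O".
    """
    n = sum(counts.values())
    expected = {
        "A": n * (p * p + 2 * p * r),
        "B": n * (q * q + 2 * q * r),
        "AB": n * (2 * p * q),
        "O": n * (r * r),
    }
    return sum((counts[k] - expected[k]) ** 2 / expected[k] for k in expected)


# Hypothetical counts for 1000 individuals, for illustration only.
in_equilibrium = {"A": 320, "B": 150, "AB": 40, "O": 490}
excess_ab = {"A": 300, "B": 150, "AB": 60, "O": 490}

for label, counts in [("in equilibrium", in_equilibrium), ("excess AB", excess_ab)]:
    chi2 = abo_hwe_chi_square(counts, p=0.2, q=0.1, r=0.7)
    print(label, round(chi2, 2), chi2 > CRITICAL_VALUE_DF1_ALPHA_05)
```

The reported statistics for Jalisco and Ciudad de Mexico both exceed 3.841, which is consistent with their being flagged as significant deviations under a one-degree-of-freedom test.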

3.3.2. Allelic Distribution

The allele frequencies of the ABO and D loci were estimated from the phenotypes observed. The A allele (I^A) was more frequent in Sinaloa, Jalisco, and Sonora, while it was less frequent in Puebla, Estado de Mexico, and Veracruz (Table 3). The B allele (I^B) was more frequent in Durango, Aguascalientes, and Jalisco, while it was less frequent in Puebla, Estado de Mexico, and San Luis Potosi (Table 3). The O allele was more frequent in Puebla, Estado de Mexico, and San Luis Potosi, while it was less frequent in Sinaloa, Jalisco, and Sonora (Table 3).

For the Rhesus group, the D allele was more frequent in Puebla, San Luis Potosi, and Estado de Mexico and less frequent in Sinaloa, Sonora, and Jalisco. Conversely, the d allele was more frequent in Sinaloa, Sonora, and Jalisco, while Puebla, San Luis Potosi, and Estado de Mexico were the states in which this allele was less frequent (Table 3).
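Because the A and B phenotypes each pool a homozygous and a heterozygous genotype, allele frequencies cannot be read off by direct counting. One common approach is gene counting, an instance of the EM algorithm (see [23, 24]). The sketch below is a minimal version of that idea and is not necessarily the procedure the authors ran; the function name is hypothetical, and the example counts are simply the national phenotype percentages reported in this article (O 61.82%, A 27.44%, B 8.93%, AB 1.81%) scaled to 10,000 individuals.

```python
def abo_em(n_a, n_b, n_ab, n_o, tol=1e-10, max_iter=1000):
    """Gene-counting (EM) estimates of the A, B, and O allele frequencies
    from ABO phenotype counts, assuming Hardy-Weinberg proportions."""
    n = n_a + n_b + n_ab + n_o
    p = q = r = 1.0 / 3.0  # arbitrary starting point
    for _ in range(max_iter):
        # E-step: split A and B phenotypes into expected genotype counts.
        n_aa = n_a * p * p / (p * p + 2 * p * r) if p > 0 else 0.0
        n_ao = n_a - n_aa
        n_bb = n_b * q * q / (q * q + 2 * q * r) if q > 0 else 0.0
        n_bo = n_b - n_bb
        # M-step: count alleles among the 2n gene copies.
        p_new = (2 * n_aa + n_ao + n_ab) / (2 * n)
        q_new = (2 * n_bb + n_bo + n_ab) / (2 * n)
        r_new = (n_ao + n_bo + 2 * n_o) / (2 * n)
        change = max(abs(p_new - p), abs(q_new - q), abs(r_new - r))
        p, q, r = p_new, q_new, r_new
        if change < tol:
            break
    return p, q, r


# National phenotype percentages from this article, scaled to 10,000 individuals.
p, q, r = abo_em(n_a=2744, n_b=893, n_ab=181, n_o=6182)
print(round(p, 3), round(q, 3), round(r, 3))
```

Unlike square-root estimates based on the A, B, and O proportions alone, this iteration also uses the AB class and returns frequencies that sum to 1 by construction.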

The ABO and D loci were not distributed homogeneously among states. To understand the variation observed, we used Principal Component Analysis (PCA) based on the allele frequencies of the ABO and D loci (Table 3). PC1 and PC2 explain 97.2% of the total variation in the distribution of the ABO and Rh blood groups. PC1 differentiates populations with high frequencies of one set of alleles, while PC2 separates those with high proportions of another; accordingly, four groups could be defined (Figure 3).

The first group includes the states of Coahuila, Queretaro, and Veracruz, which have moderate frequencies of two alleles (first quadrant). The second comprises Durango, Aguascalientes, Nuevo Leon, and Guanajuato, which have higher proportions of the I^B and I^D alleles and moderate frequencies of a third allele (second quadrant).

Both groups include states with higher frequencies of the B Rh (D) and B Rh (d) blood types; in the second group, there are states with moderate proportions of the AB blood type. No geographic clustering was evident in these groups (Figure 3).

Interestingly, geographical clustering was observed in the third and fourth groups. The third group includes the states of Puebla, San Luis Potosi, Estado de Mexico, and Ciudad de Mexico (third quadrant), which have higher frequencies of the O and D alleles and lower frequencies of the remaining alleles. These states are located in the east, north-center, and south-center of the country (Table 3). The fourth group has higher frequencies of the A and d alleles and includes the states of Sinaloa, Sonora, Baja California, Michoacan, Jalisco, and Nayarit, which belong to the northwest and west of Mexico (Table 3).

We used the neighbor-joining (NJ) clustering procedure based on Nei’s genetic distances (DA) to analyze the relationships between the populations studied. Two main clusters were identified. The first includes the states of Puebla, Estado de Mexico, San Luis Potosi, Ciudad de Mexico, Veracruz, Queretaro, and Coahuila (which have higher frequencies of the O allele; Table 3). The second includes Sinaloa, Sonora, Jalisco, Michoacan, Nayarit, Baja California, and Durango (which have higher frequencies of the A allele and, in the case of Durango, the B allele). The states of Aguascalientes, Guanajuato, and Nuevo Leon were also included in this cluster, since they have higher frequencies of the B allele and are more closely related to Durango than to the other states of this cluster (Figure 4).
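For readers unfamiliar with the DA distance, it is commonly written as $D_A = 1 - \frac{1}{L}\sum_{l=1}^{L}\sum_{i}\sqrt{x_{il}\,y_{il}}$, where $x_{il}$ and $y_{il}$ are the frequencies of allele $i$ at locus $l$ in the two populations and $L$ is the number of loci. The sketch below computes only this pairwise distance, not the NJ tree; the function name and the example frequencies are hypothetical and are not taken from Table 3.

```python
from math import sqrt


def nei_da(pop_x, pop_y):
    """Nei's D_A distance between two populations.

    pop_x, pop_y: lists of loci; each locus is a list of allele
    frequencies given in the same allele order for both populations.
    """
    if len(pop_x) != len(pop_y):
        raise ValueError("both populations need the same number of loci")
    shared = 0.0
    for freqs_x, freqs_y in zip(pop_x, pop_y):
        if len(freqs_x) != len(freqs_y):
            raise ValueError("allele lists must have matching lengths")
        shared += sum(sqrt(x * y) for x, y in zip(freqs_x, freqs_y))
    return 1.0 - shared / len(pop_x)


# Hypothetical populations: ABO locus (A, B, O) and D locus (D, d).
pop_1 = [[0.20, 0.06, 0.74], [0.80, 0.20]]
pop_2 = [[0.12, 0.04, 0.84], [0.88, 0.12]]
print(round(nei_da(pop_1, pop_2), 4))
```

A matrix of such pairwise distances over all states would then be the input to a neighbor-joining routine.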

3.3.3. Population Structure and Gene Diversity

The results of the PCA and the NJ clustering of the ABO and D allele frequencies provide evidence of gene diversity among Mexican populations. To formally measure genetic differentiation, we grouped all populations studied into 4 main regions and performed the analysis for each locus and for the pooled loci. The groups are north (Baja California, Sonora, Sinaloa, Nuevo Leon, Durango, and Coahuila), west (Nayarit, Jalisco, and Michoacan), east (Puebla and Veracruz), and center (San Luis Potosi, Aguascalientes, Guanajuato, Queretaro, Estado de Mexico, and Ciudad de Mexico).

The overall gene diversity was higher at the ABO locus than at the D locus; similarly, the gene diversity within populations was higher at the ABO locus than at the D locus. However, gene differentiation was higher at the D locus than at the ABO locus (Table 4).

Table 4 notes: the measures reported include genetic diversity within populations and genetic diversity among populations.

The regional analysis shows that, for the ABO locus, the highest gene diversity and variability within populations were found in the west, while the lowest were observed in the east. For the D locus, the highest gene diversity and variability within populations were observed in the north, and the lowest were found in the east (Table 4).

The highest genetic differentiation was found in the north for the ABO locus and in the west for the D locus. Surprisingly, a negative value of the genetic differentiation parameter was found in the east, suggesting no differentiation at either locus in this region, which is consistent with the low heterozygosities observed (Table 4).
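The quantities discussed in this section can be illustrated with Nei's gene diversity decomposition [25]: the total diversity $H_T$ splits into a within-population part $H_S$ and an among-population part $D_{ST} = H_T - H_S$, and the differentiation coefficient is $G_{ST} = D_{ST}/H_T$. The sketch below is a plain, equally weighted version without sample-size corrections; the function name and example frequencies are hypothetical, and the estimator actually used for Table 4 may differ.

```python
def nei_gene_diversity(subpops):
    """Nei's gene diversity decomposition for one locus.

    subpops: list of allele-frequency lists, one per subpopulation,
    all in the same allele order. Subpopulations are weighted equally.
    """
    k = len(subpops)
    n_alleles = len(subpops[0])
    # Within-population diversity: mean expected heterozygosity.
    h_s = sum(1.0 - sum(p * p for p in freqs) for freqs in subpops) / k
    # Total diversity: heterozygosity of the averaged allele frequencies.
    mean_freqs = [sum(freqs[i] for freqs in subpops) / k for i in range(n_alleles)]
    h_t = 1.0 - sum(p * p for p in mean_freqs)
    d_st = h_t - h_s
    g_st = d_st / h_t if h_t > 0 else 0.0
    return {"H_T": h_t, "H_S": h_s, "D_ST": d_st, "G_ST": g_st}


# Two hypothetical subpopulations at a biallelic locus (for example D, d).
result = nei_gene_diversity([[0.9, 0.1], [0.7, 0.3]])
print({name: round(value, 4) for name, value in result.items()})
```

In this uncorrected form $H_T$ is never smaller than $H_S$, so $G_{ST}$ cannot be negative; negative differentiation values such as the one reported for the east can arise only from estimators that subtract a sampling-bias term.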

4. Discussion

The study of blood groups is fundamental in clinical practice because of its inherent relevance to transfusion medicine and organ transplantation [12]. In Mexico, the rate of blood donations increased from 15.66 per 1000 individuals in 2014 to 17.33 per 1000 individuals in 2015 [28]. This is due to the improvement of the blood donor programs established in the country; however, blood banks still find it challenging to obtain enough blood units, especially for the less frequent blood types.

Therefore, it is necessary to implement effective programs among health institutions to obtain specific blood types and products according to their geographic distribution. However, the information about the proportions of the ABO and Rh (D) blood groups in Mexico is insufficient to meet this need. Here we report the distribution of ABO and Rh (D) blood groups in several areas of the country.

To our knowledge, this is the first multicenter study of the ABO and Rh (D) blood groups in Mexico to analyze their overall distribution in both genders, across a wide age range, and in different states of the country. A total of 271,164 individuals from 17 states of Mexico were studied between 2014 and 2016. We found that the distribution of ABO groups was O (61.82%), A (27.44%), B (8.93%), and AB (1.81%). Our observations were similar to previous reports in which the O group was the most frequent, followed by the A, B, and AB groups [29–35].

The frequencies of the ABO antigens in Mexican populations differ from those observed in other Latin American countries such as Argentina, Bolivia, Brazil, and the Dominican Republic [36]. Interestingly, the Rh (D) antigen was more frequent in Mexico (95.58%) than in other Latin American countries [36]. The frequency observed was similar to those found in indigenous populations [37–39], reflecting the complex admixture processes that gave rise to Mexican mestizo populations [9].

The frequencies of blood groups were similar across ages; however, slight differences between genders were observed in the A Rh (D), AB Rh (D), and O Rh (D) blood types. This could be explained by the sampling method used, which would result in the overrepresentation of females in the sample.

Previous studies have been conducted in Mexico to determine the local distribution of the ABO and Rh (D) blood groups; a few of those works were performed in indigenous people [37, 40–42] and the majority in mestizos [29–35]. For this study, samples were obtained from metropolitan cities, most of which are composed of mestizo individuals, and variability in the proportions of blood antigens was found in different areas of the country. The frequencies observed in Coahuila, Nuevo Leon, Jalisco, and Ciudad de Mexico were similar to those previously reported [30–34]; however, for Durango, Puebla, and Guanajuato, the previously reported proportions of blood antigens differed from our results [29, 31, 34, 35]. Moreover, the allele frequencies for both loci in previous works were different from those reported here. Additionally, the populations studied in those reports were not in Hardy-Weinberg equilibrium (HWE) at both loci [30–32, 34, 35], except in Puebla [29] and Coahuila [31].

The samples analyzed in this work were in HWE for the ABO locus, except those from Jalisco and Ciudad de Mexico. These deviations could result from nonrandom sampling or from internal migration (which occurs in these states because of their socioeconomic development), since the sample size is large and other disturbing events (e.g., inbreeding and mutations) have not been reported in these populations. Interestingly, we found that the Rh (D) locus was in HWE; however, more studies are needed to corroborate our observations.

This is important because, if populations are in HWE, the observed frequencies of blood groups will be similar in each generation. This information will help health institutions obtain enough blood units, since the sites where a specific blood type is most frequent are known, with the confidence that these frequencies will remain relatively constant. It will therefore not be necessary to reinvestigate the distribution of blood groups in these populations in the near future.

Additionally, a geographical cline was identified for the ABO and D loci, with particular alleles reaching remarkably high frequencies in the north and in the center of the country. More studies are needed to explain the possible causes underlying these clinal distributions. Different factors, such as migration, nonrandom mating, and infectious diseases, among others, could impose evolutionary constraints on this genetic trait [4, 43, 44]. It is possible that both loci are under some selection pressure that has resulted in their current distribution in Mexico; however, this remains unexplored.

In this report, we showed regional differences in the distribution of blood groups. We suspected that these differences could result from differentiation between regions; accordingly, we studied the genetic structure of the population by using the ABO and D loci as genetic markers. Differentiation in Mexican populations was found among the regions analyzed. In addition, higher heterozygosity and gene diversity were observed in the north and west, whereas low heterozygosity and gene diversity were found in the east and south-center.

Despite the wide distribution of the ABO and D/d alleles, the estimators of interpopulation comparison also provide evidence of genetic differentiation between populations. It is interesting to note that in the east there was no genetic differentiation at either locus, as shown by the negative value of the genetic differentiation estimator [45]. This could be explained by the lowest heterozygosity being found in Puebla, where the highest frequencies of the O and D alleles were observed.

It would be interesting to investigate the reason for the reduction in heterozygosity of both loci in Puebla. Additionally, it is necessary to sample other populations of the east to corroborate our observations and extend this study to other regions of Mexico to know the countrywide distribution of the ABO and Rh (D) blood groups.

There are few works about gene diversity in Mexico. Our results, using the ABO and D loci as genetic markers, are consistent with studies in which the genetic structure of indigenous and mestizo populations was explored with SNPs as genetic markers [19, 20]. As in our study, these works reported that populations in the north have higher heterozygosities than those located in the center and the south of the country [20]. Additionally, they found genetic stratification in indigenous communities [19, 20].

Interestingly, this Native American population substructure is recapitulated in the genomes of Mexican mestizos [19], which is consistent with our observations of genetic differentiation in Mexican populations across several regions of the country. It is important to take into account that Mexicans are a recently established mestizo population, resulting from the admixture of European, African, and, predominantly, Amerindian ancestries [19, 20], and that the O allele is nearly fixed in Amerindians [37, 39, 46]. This could explain the high frequencies of the O allele in Mexico, especially in Puebla, where Amerindian ancestry is more prevalent [29, 47], supporting our observations of low heterozygosity and suggesting low admixture in this population.

Currently, there are 68 indigenous groups in Mexico [48], which have their own cultural and economic systems that differ significantly from those of mestizo populations; these people represent about 6.4% of the entire population [49]. Ruben Lisker carried out the first works on Mexican genetics in indigenous populations in the 1960s [9, 22], in which he sought to determine the degree of admixture as well as the main ancestral components present in these populations. Recently, some studies have been carried out at the molecular level with the aim of understanding the underlying relationships between indigenous and mestizo populations [19–21], and of reconstructing the history of Amerindian populations in the continent [50] and their development throughout the country [21]. Additionally, these works have explored the possible effects of genetic content in the clinical context [19].

Our work contributes to the knowledge of gene diversity in Mexico by showing regional and geographic differentiation within the country. We also studied some populations that had not been previously analyzed, thus adding to the information available on population genetics in Mexico.

Here we show that populations of the western part of the country (including the northwest) are closely related genetically; similarly, populations of the south-center are more closely related to those of the east. Interestingly, eastern populations are genetically distant from western ones. It would be interesting to analyze whether gene diversity has any influence on clinical traits.

Previous work showed the impact of genetic variation on the accuracy of lung function assessment [19]: healthy people with genetic variants common in the east of Mexico had different results on the lung function test than people from the west [19]. This suggests that the same criteria for diagnosing lung disease could not be applied to both populations, because this would result in misdiagnosis [19]. Additionally, other works have related genetic ancestry in Mexico to susceptibility to breast cancer [51] and diabetes [52]. Together, these works show the effects of gene diversity on diagnostic tools and on the risk of developing some diseases, which will have to be taken into account in the future to improve accuracy in biomedicine. Therefore, it is crucial to develop genomic medicine in order to have a positive impact on Mexico’s public health.

In transplant medicine, several works have studied the effects of genetic variants of a wide range of proteins including Human Leukocyte Antigens (HLA) in the risk of rejection in allograft transplants [15–17, 53, 54]. For example, in Mexico, some works have found a positive association between specific HLA haplotypes and acute kidney rejection [15, 17]. Interestingly, those immunogenic variants are widely distributed among indigenous and mestizo people [47, 55].

Therefore, gene diversity could plausibly play an important role in transplant medicine; in that case, genetically related populations could have lower organ-rejection rates than those separated by greater genetic distances. Knowledge of gene diversity could thus help to select suitable donors and to estimate the success of organ transplants, as well as the effectiveness of the immunosuppressive therapies used to prevent acute rejection; nevertheless, this remains unexplored.

This work has some limitations, including the sampling method and the indirect determination of the ABO and D allele frequencies; however, the large sample size and the uniformity of the blood group test support the results obtained and provide a unique opportunity to estimate the distribution of blood groups in Mexico. Likewise, we expect this study to help in the establishment of regional and national programs for blood transfusions and organ transplants according to the distribution of blood antigens. Additionally, our results on gene diversity in 17 states of Mexico will expand the anthropological knowledge of the country, which will help in understanding the establishment of the current Mexican population and its relationship with different ethnic groups around the country.

5. Conclusions

This work will provide useful information for health institutions in the establishment of regional and national programs that speed up the tissue transplants and blood transfusions needed in clinical practice. Likewise, it will contribute to the study of Mexican genetics by showing its differentiation across the country, which could have important implications in different fields of biomedicine, such as transplant medicine and immunology, as well as in the treatment and diagnosis of several pathologies present in the country. Additionally, this work is expected to generate deep interest among ethnologists and anthropologists studying population genetics in Mexico, as well as among physicians interested in the application of molecular genetics in diagnosis and clinical practice.

Conflicts of Interest

The authors declare no conflicts of interest.


Acknowledgments

The authors thank the staff of Salud Digna clinics for their support during this work. This research was supported by an internal budget of Salud Digna para Todos.


References

  1. R. Owen, “Karl Landsteiner and the first human marker locus,” Genetics, vol. 155, no. 3, pp. 995–998, 2000.
  2. S. K. Vasan, K. Rostgaard, A. Majeed et al., “ABO Blood group and risk of thromboembolic and arterial disease: a study of 1.5 million blood donors,” Circulation, vol. 133, no. 15, pp. 1449–1457, 2016.
  3. S. K. Rummel and R. E. Ellsworth, “The role of the histoblood ABO group in cancer,” Future Science OA, vol. 2, no. 2, 2016.
  4. D. J. Anstee, “The relationship between blood groups and disease,” Blood, vol. 115, no. 23, pp. 4635–4643, 2010.
  5. A. E. Fry, M. J. Griffiths, S. Auburn et al., “Common variation in the ABO glycosyltransferase is associated with susceptibility to severe Plasmodium falciparum malaria,” Human Molecular Genetics, vol. 17, no. 4, pp. 567–576, 2008.
  6. C. Zhang, Y. Li, L. Wang et al., “Blood group AB is protective factor for gestational diabetes mellitus: A prospective population-based study in Tianjin, China,” Diabetes/Metabolism Research and Reviews, vol. 31, no. 6, pp. 627–637, 2015.
  7. L. L. Cavalli-Sforza and M. W. Feldman, “The application of molecular genetic approaches to the study of human evolution,” Nature Genetics, vol. 33, supplement 3, pp. 266–275, 2003.
  8. J. Bangham, “Blood groups and human groups: Collecting and calibrating genetic data after World War Two,” Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, vol. 47, pp. 74–86, 2014.
  9. E. Suárez-Díaz, “Indigenous populations in Mexico: Medical anthropology in the work of Ruben Lisker in the 1960s,” Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, vol. 47, pp. 108–117, 2014.
  10. D. P. Simmons and W. J. Savage, “Hemolysis from ABO incompatibility,” Hematology/Oncology Clinics of North America, vol. 29, no. 3, pp. 429–443, 2015.
  11. S. J. Urbaniak and M. A. Greiss, “RhD haemolytic disease of the fetus and the newborn,” Blood Reviews, vol. 14, no. 1, Article ID 90123, pp. 44–61, 2000.
  12. J. D. Roback, B. J. Grossman, T. Harris, and C. D. Hillyer, American Association of Blood Banks: Technical Manual, American Association of Blood Banks, 2011.
  13. M. S. Reddy, J. Varghese, J. Venkataraman, and M. Rela, “Matching donor to recipient in liver transplantation: Relevance in clinical practice,” World Journal of Hepatology, vol. 5, no. 11, pp. 603–611, 2013.
  14. S. Feng, N. P. Goodrich, J. L. Bragg-Gresham et al., “Characteristics associated with liver graft failure: the concept of a donor risk index,” American Journal of Transplantation, vol. 6, no. 4, pp. 783–790, 2006.
  15. A. Torres-Machorro, N. Camorlinga-Tagle, C. Rodríguez-Ortiz et al., “Role of major histocompatibility complex and ethnicity in acute renal graft rejection,” Transplantation Proceedings, vol. 42, no. 6, pp. 2372–2375, 2010.
  16. M. P. Hernandez-Fuentes, C. Franklin, I. Rebollo-Mesa et al., “Long- and short-term outcomes in renal allografts with deceased donors: a large recipient and donor genome-wide association study,” American Journal of Transplantation, 2018.
  17. M. C. Riquelme-McLoughlin, J. Granados, V. Acuña-Alonzo et al., “Extended major histocompatibility complex haplotypes, ancestry and acute kidney transplant rejection in Mexicans,” Revista de Investigación Clínica, vol. 63, no. 4, pp. 370–375, 2011.
  18. B. Almoguera, A. Shaked, and B. J. Keating, “Transplantation genetics: Current status and prospects,” American Journal of Transplantation, vol. 14, no. 4, pp. 764–778, 2014.
  19. A. Moreno-Estrada, C. R. Gignoux, J. C. Fernández-López et al., “The genetics of Mexico recapitulates Native American substructure and affects biomedical traits,” Science, vol. 344, no. 6189, pp. 1280–1285, 2014.
  20. I. Silva-Zolezzi, A. Hidalgo-Miranda, J. Estrada-Gil et al., “Analysis of genomic diversity in Mexican Mestizo populations to develop genomic medicine in Mexico,” Proceedings of the National Academy of Sciences of the United States of America, vol. 106, no. 21, pp. 8611–8616, 2009.
  21. S. Romero-Hidalgo, A. Ochoa-Leyva, A. Garcíarrubio et al., “Demographic history and biologically relevant genetic variation of Native Mexicans inferred from whole-genome sequencing,” Nature Communications, vol. 8, no. 1, 2017.
  22. A. Barahona, “Medical genetics and the first studies of the genetics of populations in Mexico,” Genetics, vol. 204, no. 1, pp. 11–19, 2016.
  23. J. M. Nam and J. J. Gart, “Bernstein's and gene-counting methods in generalized ABO-like systems,” Annals of Human Genetics, vol. 39, no. 3, pp. 361–373, 1976.
  24. A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 39, no. 1, pp. 1–38, 1977.
  25. M. Nei, “Analysis of gene diversity in subdivided populations,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 12, pp. 3321–3323, 1973.
  26. M. Nei, “Estimation of average heterozygosity and genetic distance from a small number of individuals,” Genetics, vol. 89, no. 3, pp. 583–590, 1978.
  27. N. Takezaki, M. Nei, and K. Tamura, “POPTREEW: Web version of POPTREE for constructing population trees from allele frequency data and computing some other quantities,” Molecular Biology and Evolution, vol. 31, no. 6, pp. 1622–1624, 2014.
  28. PAHO, Supply of Blood for Transfusion in Latin American and Caribbean Countries 2014 and 2015, Pan American Health Organization, Washington, DC, USA, 2017.
  29. R. Lisker, R. Pérez‐Briceño, J. Granados, and V. Babinsky, “Gene frequencies and admixture estimates in the state of Puebla, Mexico,” American Journal of Physical Anthropology, vol. 76, no. 3, pp. 331–335, 1988.
  30. C. Zavala, M. A. Velazquez-Ferrari, C. Navarrete, J. Rosales-Corona, and R. Lisker, “Estimation of the number of females at risk of isoimmunization to the Rho (D) antigen in a sample of the population attended at the Instituto Mexicano del Seguro Social,” Archivos de Investigacion Medica, vol. 14, no. 2, pp. 199–206, 1983.
  31. V. B. Rubén Lisker, E. Ramírez, R. Pérez-Briceño, and J. Granados, “Gene frequencies and admixture estimates in four Mexican urban centers,” Human Biology, vol. 62, no. 6, pp. 791–801, 1990.
  32. R. Lisker, R. Perez‐Briceño, J. Granados et al., “Gene frequencies and admixture estimates in a Mexico City population,” American Journal of Physical Anthropology, vol. 71, no. 2, pp. 203–207, 1986.
  33. R. Garza‐Chapa, “Genetic distances for ABO and Rh(D) blood groups in the state of Nuevo Leon, Mexico,” Social Biology, vol. 30, no. 1, pp. 37–41, 1983.
  34. G. C. Raúl, T. C. Rebeca, M. C. F. Ricardo, and H. L. G. Carlos, “Los grupos sanguíneos ABO y Rh (D) en poblaciones de la Región Lagunera México (Cálculo de la frecuencia de incompatibilidad simple y doble en matrimonios y maternofetal),” Salud Pública de México, vol. 26, no. 2, pp. 130–137, 1984.
  35. C. Zavala and F. Salamanca, “Mothers at risk of alloimmunization to the Rh (D) antigen and availability of gamma-globulin at the Mexican Institute of Social Security,” Archives of Medical Research, vol. 27, no. 3, pp. 373–376, 1996.
  36. Blood Type Frequencies by Country including the Rh Factor,
  37. A. S. Wiener, J. P. Zepeda, E. B. Sonn, and H. R. Polivka, “Individual blood differences in mexican indians, with special reference to the Rh blood types and Hr factor,” The Journal of Experimental Medicine, vol. 81, no. 6, pp. 559–571, 1945.
  38. K. Landsteiner, A. S. Wiener, and M. G. Albin, “Distribution of the Rh factor in American Indians,” The Journal of Experimental Medicine, no. 1, pp. 73–78, 1942.
  39. S. Sandoval, C. Henckel, and L. Givovich, “The blood groups, subgroups, and Rh factor of the Mapuche Indians of the Province of Cautin, Chile,” Blood, vol. 1, no. 6, pp. 555–559, 1946.
  40. R. Lisker and V. Babinsky, “Admixture estimates in nine Mexican Indian groups and five East Coast localities,” Revista de Investigación Clínica, vol. 38, no. 2, pp. 145–149, 1986.
  41. P. Iturbe-Chiñas, J. Jimenez-Hernandez, D. Peralta-Ortega, and J. Toribio-Jiménez, “Frecuencias de grupos sanguíneos ABO, Rh y grado de mestizaje en la Región Montaña, Guerrero, México,” Revista Médica del Hospital General de México, vol. 76, no. 4, pp. 217–233, 2013.
  42. A. Sánchez-Boiso, R. I. Peñaloza-Espinosa, E. Castro-Sierra et al., “Genetic structure of three Native Mexican communities based on mtDNA haplogroups, and ABO and Rh blood group systems,” Revista de Investigación Clínica, vol. 63, no. 6, pp. 614–620, 2011.
  43. J. Flegr, “Heterozygote advantage probably maintains rhesus factor blood group polymorphism: ecological regression study,” PLoS ONE, vol. 11, no. 1, Article ID e0147955, pp. 1–12, 2016.
  44. C. M. Cserti and W. H. Dzik, “The ABO blood group system and Plasmodium falciparum malaria,” Blood, vol. 110, no. 7, pp. 2250–2258, 2007.
  45. B. S. Weir and C. C. Cockerham, “Estimating F-statistics for the analysis of population structure,” Evolution, vol. 38, no. 6, pp. 1358–1370, 1984.
  46. E. Llop, H. Henríquez, M. Moraga, M. Castro, and F. Rothhammer, “Brief communication: Molecular characterization of O alleles at the ABO locus in Chilean Aymara and Huilliche Indians,” American Journal of Physical Anthropology, vol. 131, no. 4, pp. 535–538, 2006.
  47. R. Barquera, J. Zúñiga, R. Hernández-Díaz et al., “HLA class I and class II haplotypes in admixed families from several regions of Mexico,” Molecular Immunology, vol. 45, no. 4, pp. 1171–1178, 2008.
  48. Sistema de Información Cultural, “Pueblos indígenas,”
  49. CONAPO, “Población indígena,”
  50. K. Adhikari, J. C. Chacón-Duque, J. Mendoza-Revilla, M. Fuentes-Guajardo, and A. Ruiz-Linares, “The genetic diversity of the Americas,” Annual Review of Genomics and Human Genetics, vol. 18, no. 1, pp. 277–296, 2017.
  51. L. Fejerman, E. M. John, S. Huntsman et al., “Genetic ancestry and risk of breast cancer among U.S. Latinas,” Cancer Research, vol. 68, no. 23, pp. 9723–9728, 2008.
  52. J. C. Lara-Riegos, M. G. Ortiz-López, B. I. Peña-Espinoza et al., “Diabetes susceptibility in Mayas: Evidence for the involvement of polymorphisms in HHEX, HNF4α, KCNJ11, PPARγ, CDKN2A/2B, SLC30A8, CDC123/CAMK1D, TCF7L2, ABCA1 and SLC16A11 genes,” Gene, vol. 565, no. 1, pp. 68–75, 2015.
  53. M. C. Dessing, J. Kers, J. Damman, G. J. Navis, S. Florquin, and J. C. Leemans, “Donor and recipient genetic variants in NLRP3 associate with early acute rejection following kidney transplantation,” Scientific Reports, vol. 6, Article ID 36315, 2016.
  54. Y.-Z. Ge, R. Wu, T.-Z. Lu et al., “Combined effects of TGFB1 +869 T/C and +915 G/C polymorphisms on acute rejection risk in solid organ transplant recipients: A systematic review and meta-analysis,” PLoS ONE, vol. 9, no. 4, Article ID e93938, 2014.
  55. A. Arnaiz-Villena, J. Moscoso, J. Granados et al., “HLA genes in Mayos population from northeast Mexico,” Current Genomics, vol. 8, no. 7, pp. 466–475, 2007.


Copyright © 2018 Adrián Canizalez-Román et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


3.1 Behaviour of marker-based estimators in simulated pedigrees

Figure 2 illustrates the relationship between marker-based estimates and pedigree-based kinship predictions for simulated data. The top row shows results for one of the 10 pedigrees generated with 20 founders and monogamous mating. The bottom row shows results for one of the 10 pedigrees generated with 750 founders and random mating. The density of points is represented as hexagons of varying darkness: the darker, the more numerous.

A cursory inspection of Figure 2 shows that all marker-based estimates tend to underestimate the pedigree kinships when the number of founders is small (top row of Figure 2), and the three marker-based estimates have different properties.

r β (left column of Figure 2) is an estimate of β and closely follows the pedigree values r p = θ, with a constant downward discrepancy for the pedigree with 20 founders (top left panel) and almost perfectly for the pedigree with 750 founders (bottom left). The downward discrepancy for r β is due to the constraint that the average of all r β is 0; this can easily be corrected (by imposing the same constraint on r p and replacing it with (r p − r̄ p)/(1 − r̄ p), where r̄ p is the mean kinship of all individuals in the pedigree), as shown in Weir and Goudet (2017). There is some scatter around the most common pedigree values of kinship (left column), corresponding to unrelated individuals, half-sibs and full-sibs, and this is more pronounced in the larger pedigree (bottom left panel).
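The correction described above can be sketched in a few lines. This is only an illustration, assuming the rescaling takes the form (r p − mean)/(1 − mean) with the mean taken over distinct pairs of individuals; the function name `center_kinship` and the toy matrix are hypothetical and not taken from the study.

```python
def center_kinship(kinship):
    """Rescale pairwise kinships so that their mean over distinct pairs is 0.

    Assumption: the correction is (k - mean) / (1 - mean), with the mean
    taken over all pairs i < j (self-kinships on the diagonal are excluded).
    """
    n = len(kinship)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean_k = sum(kinship[i][j] for i, j in pairs) / len(pairs)
    centered = [
        [(kinship[i][j] - mean_k) / (1.0 - mean_k) for j in range(n)]
        for i in range(n)
    ]
    return centered, mean_k


# Toy pedigree kinships: individuals 0 and 1 are full sibs, 2 is unrelated.
toy = [
    [0.50, 0.25, 0.00],
    [0.25, 0.50, 0.00],
    [0.00, 0.00, 0.50],
]
centered, mean_k = center_kinship(toy)
print(mean_k)
print(centered[0][1], centered[0][2])
```

After rescaling, unrelated pairs take a negative value, which is the same convention the marker-based r β follows when its average is constrained to 0.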

r u estimates γ and underestimates the pedigree kinship θ (middle column of Figure 2), and this is more pronounced for large values of pedigree kinship, and for pedigrees with few founders (top) than those with many (bottom). For the full-sib category (r p = 0.25), r u shows extreme scatter, and a similar effect, although less pronounced, is seen for half-sibs (r p = 0.125). For the large pedigree, the relation between r u and r p for low kinship values is very tight (bottom middle panel), more so than for r β .

The relation between r w , another estimate of γ, and r p (right column of Figure 2) is noisy for the small pedigree (top right), with common pedigree kinship classes very spread out. r w also tends to underestimate r p , the more so the larger the pedigree kinship. For the large pedigree (bottom right), the relation is much tighter; the regression slope is close to 1 and the scatter is less than for r β . The extreme scatter for r p = 0.25 seen in r u for both the small and large pedigrees (middle column) is also seen in r w for small pedigrees (top right) but disappears when the pedigree is large (bottom right).

Figure 3 shows the correlations of r β , r w and r u with r p (in blue, red and black, respectively) as a function of variation in kinship. Pairs of individuals from pedigrees with few founders will have more chance of being related than pairs from larger pedigrees. The pedigrees with few founders will thus have more variation in kinship and will be located to the right-hand side of the graph, while pedigrees with many founders will be to the left. For a given number of founders, monogamous pedigrees (filled circles) will show more variation in kinship than random-mating pedigrees (+ symbols) and will thus be located to their right.

The correlation between r β and r p shows a very different pattern from that of the other two marker-based estimates: It increases as the variance in r p increases, while the correlations of r w and r u decrease as the standard deviation in r p increases above ≈ 0.02. The relation for the correlation between marker- and pedigree-based kinship is very tight for r β , with all the points falling on the same trajectory. This correlation is around 0.6 when the standard deviation in pedigree kinship is ≈ 0.009, and asymptotes at one as the standard deviation of pedigree kinship increases.

The pattern for r w and r u is almost the reverse. Correlation increases at first, when there is very little variance in pedigree kinship; reaches a maximum of around 0.93 for r w and 0.96 for r u when the standard deviation in pedigree kinship is between 0.02 and 0.03; and decreases linearly thereafter, although with greater scatter, as the standard deviation in r p increases above 0.03.

From this, it would seem that when the standard deviation in pedigree kinship is larger than ≈ 0.02–0.04, r β is a better estimator of pedigree kinship, while when the standard deviation in pedigree kinship is less than 0.03, r u is preferable (r u outperforms r w over the whole range). However, this is partly misleading, as we saw in Figure 2 that r β is actually very close to r p over its whole range; r u , while giving precise estimates for low values of r p , is not very good at estimating the kinships of full- and half-sibs.

The pattern observed for the correlation between marker and pedigree kinship for r w and r u is puzzling. The pattern observed with r β makes intuitive sense: If there is no variation in pedigree kinship, then it cannot be correlated with anything (correlation with a constant is 0 by definition). The larger the variation in pedigree kinship, the more there is to explain; thus, marker-based estimates of kinship should be more correlated with pedigree values. We will return to this point later.
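The intuition given for r β can be checked with a toy simulation. The sketch below is not the simulation used in the study: it assumes, purely for illustration, that true kinships are normally distributed and that an unbiased estimator adds independent noise of fixed size. The names `simulate_correlation`, `sd_true` and `sd_noise` are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_correlation(sd_true, sd_noise=0.02, n_pairs=5000, seed=1):
    """Correlation between 'true' kinships and noisy unbiased estimates."""
    rng = random.Random(seed)
    true_k = [rng.gauss(0.0, sd_true) for _ in range(n_pairs)]
    estimate = [k + rng.gauss(0.0, sd_noise) for k in true_k]
    return pearson(true_k, estimate)


# With fixed estimation noise, the correlation rises with the spread of
# the true values, as argued for r β in the text.
for sd in (0.005, 0.02, 0.08):
    print(sd, round(simulate_correlation(sd), 3))
```

Under these assumptions the correlation tends to sd_true / sqrt(sd_true² + sd_noise²), so it approaches 1 as the variation in true kinship grows. The sketch says nothing about why r w and r u behave differently.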

3.2 Number of loci

In Figure 3, we fixed the number of (variable) SNPs to 42k. This is a fairly large number, though by no means exceptional nowadays. In Figure 4, we look at the effect of the number of markers, decreasing it to 20k, 10k, 5k, 1k, 500 and finally 100.

As the number of markers decreases, we see both quantitative and qualitative changes. The top row of Figure 4 (20k, 10k and 5k SNPs) shows slightly noisier versions of Figure 3, but otherwise no qualitative differences: For high variation in pedigree kinship, r β has a higher correlation with r p than r w or r u , while for low variation, r w and r u are more correlated with r p . Note that r w in particular reaches a plateau and starts decreasing when the variation in r p gets very low (for 10k and 5k SNPs).

The bottom row of Figure 4 (1k, 500 and 100 SNPs) looks different. First, the correlations between marker-based estimates and pedigree-based predictions are much lower, below 0.8 for 1k, 0.6 for 500 and 0.4 for 100 SNPs. Second, for the three marker-based kinship estimators, correlation with pedigree-based kinship increases as variation in pedigree kinship increases. In all three panels of the bottom row, the correlation between r β and r p is less than that for r w , which itself is less than that for r u .

3.3 Founders from two populations

In the following, we will focus on r β . We see in Figure 5 (left panel) that when the founders consist of individuals from two differentiated populations, the relation between estimates of kinship from markers and predicted values from pedigrees is noisy (r w and r u show a similar pattern; data not shown). In particular, for pedigree kinships of 0, 0.125 and 0.25, we see a large range of marker-based estimates, larger than when the founders come from a single population (Figure 2). This is because the pedigree predictions r p assume all founders to be equally unrelated, whereas in reality, they are not: Pairs of founders coming from the same population are more related than pairs of founders coming from different populations.

Figure 5. Marker kinship as a function of pedigree kinship when the founders come from two populations. Left panel: Founders are assumed unrelated (r p ). Right panel: Founders' kinships have been estimated from markers.

If genetic information is available for the founders, we can account for the heterogeneity of kinship among them by using their marker-based estimates of kinship as a seed to the algorithm calculating pedigree kinship. This is represented in Figure 5 (right panel). By seeding the pedigree kinship matrix with marker-based estimates of kinship r β for the founders, the scatter around the pedigree value is much reduced.
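The seeding idea can be sketched with the usual recursive kinship rules. This is a minimal illustration, not the authors' code: it assumes every individual has either both parents known or none (a founder), that parents are listed before their offspring, and that seed values are supplied as a dictionary keyed by founder pairs. The function name `pedigree_kinship` and the seed value 0.1 are hypothetical.

```python
from functools import lru_cache


def pedigree_kinship(pedigree, founder_kinship=None):
    """Pairwise kinship from a pedigree, optionally seeded for founders.

    pedigree: list of (id, sire, dam); founders have sire = dam = None.
    founder_kinship: optional {(id_a, id_b): value} for founder pairs;
    unseeded founder pairs default to 0 and founder self-kinship to 0.5.
    """
    seed = dict(founder_kinship or {})
    index = {ind: pos for pos, (ind, _, _) in enumerate(pedigree)}
    parents = {ind: (sire, dam) for ind, sire, dam in pedigree}
    is_founder = {ind: sire is None and dam is None
                  for ind, (sire, dam) in parents.items()}

    @lru_cache(maxsize=None)
    def kin(a, b):
        if is_founder[a] and is_founder[b]:
            if a == b:
                return seed.get((a, a), 0.5)
            return seed.get((a, b), seed.get((b, a), 0.0))
        if a == b:
            sire, dam = parents[a]
            return 0.5 * (1.0 + kin(sire, dam))
        # Expand the non-founder listed later: it cannot be an ancestor of b.
        if is_founder[a] or (not is_founder[b] and index[b] > index[a]):
            a, b = b, a
        sire, dam = parents[a]
        return 0.5 * (kin(sire, b) + kin(dam, b))

    ids = [ind for ind, _, _ in pedigree]
    return {(a, b): kin(a, b) for a in ids for b in ids}


pedigree = [("F1", None, None), ("F2", None, None),
            ("C1", "F1", "F2"), ("C2", "F1", "F2")]
unseeded = pedigree_kinship(pedigree)
seeded = pedigree_kinship(pedigree, {("F1", "F2"): 0.1})
print(unseeded[("C1", "C2")], seeded[("C1", "C2")])
```

With unrelated founders the full sibs C1 and C2 get the textbook kinship of 0.25; once the founders are given a non-zero kinship, every descendant pair inherits the extra relatedness.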

We can use the same principle when all founders come from one population. The scatter seen in Figure 2 is due to the founders being considered as identically unrelated. If we use the genotypes of the founders to estimate their kinship rather than assuming it to be 0, we obtain Figure 6. The correlation between r β and r p is much increased in all situations, and for the smaller number of typed SNPs (bottom row), r β is the most correlated with r p .

3.4 Real data applications

3.4.1 Great tit data set

The great tit pedigree is shallow, covering at most three generations. Pedigree-based predictions of kinship have few categories, and some of these have very few observations.

Figure 7 displays violin plots of marker-based estimates of kinship as a function of the pedigree-based predictions. The three marker-based kinship estimators show very similar behaviour in this shallow pedigree. The modes of their distributions are aligned with the corresponding pedigree values (horizontal solid lines). Noteworthy is the fairly high proportion of predicted half-sibs from the pedigree (r p = 0.125) who are identified as full-sibs with marker-based estimates of kinship (r m = 0.25, middle panel of the bottom row). A similar pattern is seen for first cousins and half-sibs. The three marker-based estimators of kinship for the unrelated individuals show a unimodal distribution with all modes at r = 0, but long tails extending to 0.4. This is probably due to founders being related. For instance, individuals 17 and 557, both males and founders, have a pedigree-based kinship assumed to be 0, but their estimated . Descendants of these individuals will have their pedigree-based kinships underestimated.

3.4.2 Soay sheep data set

Next, we look at the sheep data set. Figure 8 presents the results. The top row shows the relation between pedigree-based predictions of kinship and marker-based estimates, r β , r u and r w from left to right. The correlations are 0.65, 0.73 and 0.71, respectively. While in all three panels we see a tendency for marker-based estimates to increase with the pedigree-based predictions, there is much scatter. In particular, for all three marker-based estimators, some pairs of individuals assumed to be unrelated with pedigree-based predictions have fairly high marker-based estimates, and some individuals with pedigree-based predictions of 0.25 (full-sibs or parent–offspring) have marker-based estimates around 0.

The second row of Figure 8 shows the relation between pedigree-based predictions and marker-based estimates from genotypes simulated along the pedigree. The relation is much tighter, particularly for r β and r w , while r u shows similar scatter to previously, particularly for pedigree kinship class 0.25.

The last row of Figure 8 compares the marker-based estimates of kinship based on simulations and observed data. For r β and r w , most of the points fall close to the one-to-one line, and points outside this envelope are easy to identify (for instance, the points for which and , or those for which while , bottom left), providing the opportunity to correct the pedigree. It would be much more difficult to use r u for such a correction.

3.4.3 Pig data set

This data set is the most complete: All individuals (bar founders) have both parents identified. Close to 55% of the 6,473 individuals in the pedigree have been genotyped. The first row of Figure 9 shows the relation between pedigree- and marker-based kinship for the three marker-based estimators. The relation is not as tight as that seen in Figure 2, and the correlation between marker- and pedigree-based kinships is 0.55, 0.55 and 0.56 for r β , r u and r w , respectively, for a standard deviation of pedigree-based kinship estimates of 0.02. There is therefore little separating the three marker-based estimators of kinship in terms of correlation, and r β shows the least scatter and bias of the three estimators (compare the top left panel (b) of Figure 9 with the top middle (u) and right (w) panels).

The marker-based estimates of kinship (particularly r β ) from simulated genetic data (middle row of Figure 9) match the pedigree-based kinships extremely well. The correlation between marker-based and pedigree-based predictions is 0.96, 0.85 and 0.87 for r β , r u and r w , respectively (the results are almost identical if using only 42k SNPs; filtering on minor allele frequency (MAF) larger than 0.01 reduces the correlation for r β and r w and marginally increases it (0.85 to 0.87) for r u ). The last row of Figure 9 shows the relation between simulated marker-based estimates of kinship and observed marker-based estimates of kinship. The key point to take from this bottom row is best seen from the leftmost panel comparing the two r β : Among the simulated marker-based estimates of kinship close to 0, we observed two high-density spots of estimates from real data (the two dark spots), an indication that the founders may not come from a homogeneous stock.

Seeding the pedigree-based estimator of kinship with the marker-based estimation of kinship for the 81 genotyped founders does not significantly improve the relation between marker- and pedigree-based values (data not shown).

Variations on a Common STRUCTURE: New Algorithms for a Valuable Model

In this commentary, John Novembre discusses tools for the study of population structure, including the novel, fast algorithm fastSTRUCTURE presented by Raj et al. in the June issue of GENETICS.

Over the last 14 years, a highly influential tool for the study of population structure has been the admixture model of Pritchard, Stephens, and Donnelly (the “PSD” model, Pritchard et al. 2000) and its associated inference software STRUCTURE. In its basic form, this model does not explicitly consider the effects of mutation, drift, selection, or linkage. Nor is it a dynamical model, as it does not explicitly have any temporal component. Viewed simply, it is just the Hardy–Weinberg model with two wrinkles: (1) subpopulations that differ in allele frequency, and (2) individuals whose genetic ancestry can be admixed, i.e., an individual can inherit alleles from more than one of the multiple subpopulations according to a probability vector of “mixture proportions.”
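The two "wrinkles" translate directly into a generative recipe for biallelic genotypes. The following is a minimal sketch of that recipe, not code from STRUCTURE: for each of the two allele copies an individual carries at a locus, a source subpopulation is drawn from the individual's mixture proportions, and then an allele is drawn from that subpopulation's frequency. The function name and argument layout are hypothetical.

```python
import random


def simulate_psd_genotypes(q, freqs, seed=0):
    """Simulate diploid biallelic genotypes under a basic admixture model.

    q[i][k]     : mixture proportion of individual i from subpopulation k
    freqs[k][l] : frequency of the counted allele in subpopulation k, locus l
    Returns g[i][l], the number of counted alleles (0, 1 or 2).
    """
    rng = random.Random(seed)
    n_loci = len(freqs[0])
    genotypes = []
    for q_i in q:
        row = []
        for l in range(n_loci):
            count = 0
            for _ in range(2):  # two allele copies per diploid individual
                # Latent "allelic source": which subpopulation this copy is from.
                k = rng.choices(range(len(q_i)), weights=q_i)[0]
                if rng.random() < freqs[k][l]:
                    count += 1
            row.append(count)
        genotypes.append(row)
    return genotypes


# Individual 0 is unadmixed, individual 1 is a 50/50 mixture.
example = simulate_psd_genotypes(
    q=[[1.0, 0.0], [0.5, 0.5]],
    freqs=[[1.0, 0.0], [0.0, 1.0]],
)
print(example)
```

The source label `k` is exactly the quantity that is unobserved in real data, which is what makes inference under the model hard.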

This simple model of population structure has proven incredibly useful. The original paper (also published in GENETICS) has garnered over 10,000 citations, and if one considers the impact of subsequent papers that elaborate on this approach directly (Falush et al. 2003, 2007; Hubisz et al. 2009) or use related approaches (e.g., Dawson and Belkhir 2001; Anderson and Thompson 2002; Corander et al. 2003; Wilson and Rannala 2003; Huelsenbeck and Andolfatto 2007), the sum impact is truly remarkable. Interestingly, the same underlying model has played an independent and important role for text classification and mining, where it is known as the latent Dirichlet allocation (LDA) model (see Blei et al. 2003, which alone has over 8,000 citations). For biologists, the broad impact of the PSD model stems from the surprisingly large number of questions in evolutionary biology and ecology in which a simple assessment of population structure proves to be an insightful exercise. As an example, consider how useful it is to identify genetic subpopulations when managing a species or studying its historical biogeography. Also, consider how useful it is in a disease mapping study to be able to identify hybrid individuals, to infer the source of tissues of uncertain origins, or to assess case/control population stratification. The PSD model and its close relatives have been important in all of these applications.

While the PSD model is relatively simple, performing inference with it poses substantial computational challenges. To understand the crux of the problem, remember the original PSD paper sought to infer the vector of ancestry proportions for each individual (a vector qi for individual i) assuming individuals are completely unlabeled with respect to origins. That is, any understanding of the subpopulation allele frequencies and the mixture proportions will need to be discerned from the genotype data alone. If one knew the population source of each allele in each individual, then the problem would be straightforward, but these are “missing data” or latent variables. The PSD paper approached the problem from a Bayesian perspective—it aimed to sample from the posterior on the qi's by integrating out the uncertainty in the unobserved subpopulation allele frequencies and allelic source variables. This integration is in a highly dimensional space (with dimensions proportional to the product of sample size, the number of loci, and the number of alleles per locus) and thus it is computationally impractical to carry out the integration exactly. Thankfully, a Gibbs sampling approach allows approximation of the integral, and this is used in the algorithm that underlies the resulting software STRUCTURE.

The last 10 years have seen great strides in the scale of our observations of genetic variation, and this has been a blessing for our learning but a curse for computation. STRUCTURE worked well for analyzing complete datasets until the advent of large SNP genotyping arrays with hundreds of thousands of SNPs in the mid-2000s (reviewed in Novembre and Ramachandran 2011). At that scale of data, the Gibbs sampler is simply too slow to be practically applied, and many researchers turned to using alternative approaches such as applying principal component analysis (PCA) to genotype data (Price et al. 2006). There are theoretical reasons why using PCA (and other forms of factor analysis) can provide insight into admixture proportions (Patterson et al. 2006; McVean 2009; Engelhardt and Stephens 2010), but the PSD model still is appealing as a probabilistic model-based approach for inference with admixed samples. For example, it explicitly considers how alleles within a genotype are a binomial sample from underlying subpopulation allele frequencies, rather than implicitly treating them as continuous variables (as PCA does), and thus can compute measures of uncertainty in an appropriate way.

To address the computational challenges that arise from applying the PSD model to SNP data, two groups recognized that the likelihood function underlying the model is amenable to efficient optimization techniques such that one can obtain maximum likelihood estimates of the ancestry proportions and allele frequencies. Tang et al. (2005) developed an EM algorithm distributed in their FRAPPE software, and another team (that I worked with) leveraged tools from convex optimization theory to develop the ADMIXTURE software (Alexander et al. 2009).
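To make the maximum likelihood route concrete, here is a minimal EM sketch for the binomial admixture likelihood with biallelic markers. It is not the FRAPPE or ADMIXTURE implementation; the function names, the random initialisation, the toy data and the clipping constant `eps` are all assumptions made for illustration.

```python
import math
import random


def log_likelihood(g, q, f):
    """Binomial log-likelihood of genotypes g (0/1/2), up to a constant."""
    total = 0.0
    for i, row in enumerate(g):
        for l, g_il in enumerate(row):
            p = sum(q[i][k] * f[k][l] for k in range(len(f)))
            total += g_il * math.log(p) + (2 - g_il) * math.log(1.0 - p)
    return total


def em_step(g, q, f, eps=1e-6):
    """One EM update of mixture proportions q and allele frequencies f."""
    n, n_loci, n_pop = len(g), len(g[0]), len(f)
    num = [[0.0] * n_loci for _ in range(n_pop)]  # expected counted alleles from k
    den = [[0.0] * n_loci for _ in range(n_pop)]  # expected allele copies from k
    new_q = [[0.0] * n_pop for _ in range(n)]
    for i in range(n):
        for l in range(n_loci):
            p1 = sum(q[i][k] * f[k][l] for k in range(n_pop))
            for k in range(n_pop):
                # E-step: expected copies of each allele type drawn from k.
                c1 = g[i][l] * q[i][k] * f[k][l] / p1
                c0 = (2 - g[i][l]) * q[i][k] * (1.0 - f[k][l]) / (1.0 - p1)
                num[k][l] += c1
                den[k][l] += c1 + c0
                new_q[i][k] += (c1 + c0) / (2.0 * n_loci)
    # M-step for f, clipped away from 0 and 1 as a numerical guard.
    new_f = [[min(max(num[k][l] / den[k][l], eps), 1.0 - eps)
              if den[k][l] > 0 else f[k][l]
              for l in range(n_loci)] for k in range(n_pop)]
    return new_q, new_f


def fit_admixture_em(g, n_pop, n_iter=50, seed=0):
    """Run EM from a random start; returns q, f and the likelihood trace."""
    rng = random.Random(seed)
    q = []
    for _ in range(len(g)):
        w = [rng.uniform(0.5, 1.5) for _ in range(n_pop)]
        q.append([x / sum(w) for x in w])
    f = [[rng.uniform(0.2, 0.8) for _ in range(len(g[0]))] for _ in range(n_pop)]
    history = [log_likelihood(g, q, f)]
    for _ in range(n_iter):
        q, f = em_step(g, q, f)
        history.append(log_likelihood(g, q, f))
    return q, f, history


# Toy data: individuals alternate between two strongly differentiated groups.
rng = random.Random(42)
group_freq = [0.9, 0.1]
genotypes = [[sum(rng.random() < group_freq[i % 2] for _ in range(2))
              for _ in range(10)] for i in range(20)]
q_hat, f_hat, history = fit_admixture_em(genotypes, n_pop=2)
print(round(history[0], 2), round(history[-1], 2))
```

Each pass costs time proportional to individuals × loci × subpopulations, and no sampling is involved, which is the practical appeal over a Gibbs sampler. Plain EM can still converge slowly, so this sketch should not be read as a performance claim.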

One partial drawback of the likelihood approaches is the inability to use Bayesian priors that favor the most “sensible” parameter estimates. For example, biological intuition suggests one should favor solutions in which each individual’s ancestry is drawn from one or at most a few populations rather than many. Similarly, one might favor solutions where the allele frequencies in all the subpopulations are similar to one another (e.g., for populations that are weakly differentiated, such as FST < 0.1). A strict optimization of the likelihood in the PSD model does not produce such solutions unless the dataset is large; maximum likelihood may suffer from symptoms of overfitting (e.g., erroneously inferring small proportions of ancestry from many populations to improve model fit) or from poorly estimated allele frequencies. Stated generally, in many high-dimensional inference problems, maximum likelihood solutions can benefit from regularization/penalization steps that are akin to imposing priors. For example, as an improvement to ADMIXTURE, Alexander and Lange (2011) introduced a penalized likelihood function that mimics the way a Dirichlet prior can create sparseness in the admixture coefficients and found it reduced biases substantially.

In the June issue of Genetics, Raj et al. (2014) present novel, fast algorithms that allow for elaborate Bayesian inference with the PSD model. The key innovation is that they attack the problem in a variational Bayes framework (for an introduction see Jordan et al. 1999). Variational Bayes avoids the difficult integration steps, which are typically computed using time-costly Gibbs samplers or Markov chain Monte Carlo techniques, by approximating the posterior in a strategic way. Importantly, an approximate posterior distribution (the “variational distribution”) is constructed, which is mathematically simple to work with (e.g., designed such that many terms factorize). It can be shown that by maximizing the variational distribution function with respect to the model parameters, one is maximizing a lower bound of the marginal likelihood, and thus finding parameters that fit the data well. The end result is that the challenging integrals of a standard Bayesian approach are replaced by functions that are easily computable, and only need to be optimized. In turn, the vast grab bag of tricks from numerical optimization (Nocedal and Wright 2006) can be used and parameter estimation can proceed quickly. In the text mining literature, variational methods have been used with success on the LDA model (Blei et al. 2003). Raj et al. (2014) report running times for their new variational algorithm, fastSTRUCTURE, that compete with ADMIXTURE (with small problems being solved roughly 10 times faster than STRUCTURE). Further, the run times are linear in the number of individuals, markers, and populations, so the approach will scale well to larger datasets. This speed comes at the expense of working with an approximation to the posterior, but in practice the resulting parameters are similar to those obtained in the full Bayesian inference.
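The lower-bound statement can be written out as a standard identity. The notation here is an assumption introduced for illustration and is not taken from Raj et al.: $G$ denotes the genotype data, $Z$ the latent allelic sources, $\Theta$ the mixture proportions and allele frequencies, and $v$ the variational distribution.

$$ \log p(G) = \mathcal{L}(v) + \mathrm{KL}\big(v(Z,\Theta)\,\|\,p(Z,\Theta \mid G)\big), \qquad \mathcal{L}(v) = \mathbb{E}_{v}\big[\log p(G,Z,\Theta) - \log v(Z,\Theta)\big]. $$

Because the Kullback–Leibler term is non-negative, $\log p(G) \ge \mathcal{L}(v)$ for any choice of $v$. Raising $\mathcal{L}(v)$ therefore tightens a lower bound on the log marginal likelihood and simultaneously pulls $v$ toward the true posterior, without evaluating the intractable integral.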

A well-known, vexing problem for those using the PSD model has been how to appropriately choose the number of subpopulations (K) for the analysis and/or how to infer it directly from the data. As the number of parameters changes with K, this is a type of model selection problem, and several different approaches have been taken to attack it (e.g., Pritchard et al. 2000; Evanno et al. 2005; Alexander and Lange 2011). Raj et al. (2014) find that a cross-validation approach deployed in the ADMIXTURE software does not work as well for choosing K with fastSTRUCTURE, and so they develop two metrics ($K^{*}_{\xi}$ and $K^{*}_{\varnothing^{C}}$) that can help establish a likely range for K when using the variational approach. While not perfect, these metrics allow a reasonable inference of K when a dataset is large and structure is strong, but in more weakly structured populations, the inference of K will continue to be problematic.

Raj et al. (2014) also found that a new logistic prior for allele frequencies (that replaces the standard F model used in earlier versions of STRUCTURE) is beneficial when teasing apart subtle structure in data. The elaboration of such priors could allow more detailed modeling of population history to be layered into the PSD model. One possibility would be to consider elaborate hierarchical priors, such as the tree-based prior on population frequencies developed by Pickrell and Pritchard (2012). Raj et al. (2014) found in their preliminary analyses that such priors did not improve model fit in their applications, but further exploration in this arena could prove fruitful.

As Raj et al. (2014) note, the PSD model is a coarse model of more complex populations. For this reason, interpreting the results of inference under this model demands substantial care and critical thought to avoid pitfalls (e.g., Anderson and Dunham 2008). The results from these simple models need to be viewed with an awareness of the complex evolutionary processes potentially shaping genetic variation in any dataset. Ideally, as our field matures, increasingly explicit and robust models of complex population history will be brought to bear on inference from genomic-scale data. It is humbling that even relatively simple models require much careful work and attention to computational detail, but new frameworks for inference, such as the variational Bayes used here, and new tools from numerical optimization, give hope for exciting progress.


Table 1. Overview of Patients Included in the Final Analysis.

The milestones of the study in the context of the peak outbreaks in Italy and Spain are shown in Figure 1. Data on the age, sex, maximum respiratory support at any point during hospitalization, and relevant coexisting conditions (type 2 diabetes, hypertension, and coronary heart disease) in the patients who were included in the final analysis are shown in Table 1 and in Table S2 in Supplementary Appendix 1. Because we used the same genotyping platform (GSA) to obtain both data sets, we were able to perform a uniform quality control of the merged Italian and Spanish SNP data sets, thus reducing technical confounders to a minimum. A quantile–quantile (Q-Q) plot of the two meta-analyses (the main analysis and the analysis corrected for age and sex) showed significant associations in the tail of the distribution with minimal genomic inflation (λGC = 1.015 for the main analysis and λGC = 1.006 for the analysis corrected for age and sex) (Fig. S2 in Supplementary Appendix 1). We also carried out separate association analyses for the Italian and Spanish data sets (see the Supplementary Methods section and Fig. S3).
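For readers unfamiliar with λGC, the usual definition can be sketched as follows. This is not the study's pipeline: it assumes each SNP contributes a single two-sided p-value from a 1-degree-of-freedom test, and the function name `genomic_inflation` is hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Genomic inflation factor from two-sided p-values in (0, 1].

    Each p-value is converted back to a 1-df chi-square statistic; the
    median of these is divided by the median expected under the null.
    """
    nd = NormalDist()
    # For a 1-df chi-square, stat = z**2 with z the normal quantile at p/2.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    expected_median = nd.inv_cdf(0.25) ** 2  # about 0.455
    return median(chi2) / expected_median


# Evenly spaced p-values mimic a well-calibrated null distribution.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
print(round(genomic_inflation(null_p), 3))

# Systematically smaller p-values inflate the statistic.
inflated_p = [p ** 2 for p in null_p]
print(genomic_inflation(inflated_p) > 1.0)
```

Values close to 1, such as those reported above, suggest that the bulk of the test statistics follow the null distribution and that residual stratification or technical artefacts are limited.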




Our study proposes for the first time that some HLA polymorphisms in the Italian population may be associated with the regional differences in Covid-19 incidence and mortality, likely by activating a better and more powerful antiviral response, with central-southern regions being most protected from the epidemic. Such evidence, obtained at the general population level, needs to be confirmed in retrospective case–control studies on large cohorts of Covid-19 patients from all the Italian regions in order to define HLA polymorphisms as a factor involved in disease susceptibility. Moreover, since bioinformatic predictions of HLA–viral peptide binding affinity alone are of limited functional significance, it is fundamental to establish, through proper in vitro and in vivo studies, whether such HLA genetic loci are effectively associated with the induction of protective T- and B-cell-mediated antiviral immunity. Research efforts aimed at exploring genetic associations with the immune response in Covid-19 could be particularly useful at both the epidemiological and clinical level: to identify the patients most at risk of developing severe complications, who should hence have priority access to vaccination when it becomes available, and to evaluate the differential efficacy of vaccination in subjects with different HLA genetic backgrounds. HLA typing, which can be easily done through cost-efficient methodologies alongside Covid-19 testing, should hence be envisaged and encouraged at the clinical level and by policy makers through the creation of a national network that collects DNA samples from patients from all regions.
