Why Pan_troglodytes-2.1.3 Assembly renamed as a Pre in ENSEMBL?

Why Pan_troglodytes-2.1.3 Assembly renamed as a Pre in ENSEMBL?

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I began working with panTro3 assembly of the UCSC. This assembly corresponds to CGSC Build 2.1.3. See here:

Now, I have to cross annotations with ENSEMBL (Indeed, I need to use Biomart tool: But It seems that between Release 64 and 65 they passed directly from Chimp 2.1 to Chimp 2.1.4. I researched some information and found that Chimp 2.1.3 renamed as a Pre! See here: and here

Is it a way to solve this problem? How may I use Chimp 2.1.3 assembly in Biomart with GrCh37 as reference genome (I was using an archived versions of the Ensembl:, but seems not possible to cross between different releases of the ENSEMBL.

MAKER Tutorial for WGS Assembly and Annotation Winter School 2018

MAKER is an easy-to-use genome annotation pipeline designed to be usable by small research groups with little bioinformatics experience. However, MAKER is also designed to be scalable and is thus appropriate for projects of any size including use by large sequencing centers. MAKER can be used for de novo annotation of newly sequenced genomes, for updating existing annotations to reflect new evidence, or just to combine annotations, evidence, and quality control statistics for use with other GMOD programs like GBrowse, JBrowse, Chado, and Apollo.

MAKER has been used in many genome annotation projects (these are just a few):

  • Pinus taeda - Loblolly Pine - PubMed
  • Pinus lambertiana - Sugar Pine - PubMed
  • Fusarium circinatum - Pine Pitch Canker - ResearchGate
  • Latimeria menadoensis - African Coelacanth - PubMed
  • Atta cephalotes - Leaf-cutter Ant - PubMed
  • Linepithema humile - Argentine Ant - PubMed
  • Pogonomyrmex barbatus - Red Harvester Ant - PubMed
  • Solenopsis invicta - Fire Ant - PubMed
  • Pythium ultimum oomycete - PubMed
  • Petromyzon marinus - Sea Lamprey annotation and re-annotation - PubMed
  • Zea mays - maize re-annotation - PubMed

There are many more projects that use MAKER around the world.


The vertebrate lysozyme gene family has traditionally been considered to be composed of three genes: lysozyme c, lactalbumin, and calcium-binding lysozyme [1–4]. Lysozyme c, chicken-type (or conventional) lysozyme, is a bacteriolytic enzyme that is secreted into many body fluids of mammals (e.g., blood, tears, and milk) and is found at a high concentration in the eggs of many bird species [1, 2, 5]. Lysozyme c is widespread in nature its protein and gene sequences have been characterized from numerous diverse vertebrate and non-vertebrate species [3, 5, 6]. Lactalbumin is related to lysozyme, with around 40% amino acid identity and nearly identical three-dimensional structure, but lacks its bacteriolytic activity [1, 2, 4, 7]. Lactalbumin is expressed in lactating mammary glands, where it binds a calcium ion and modifies the activity of β-galactosyltransferase-1, such that the complex catalyzes the synthesis of lactose [2, 4, 7]. Lactalbumin has recently been shown to have a second activity in the gut, where it loses the calcium ion and binds a fatty acid this new form of lactalbumin appears to promote apoptosis of tumor cells, and thus has been renamed HAMLET (human lactalbumin made lethal to tumors) [8]. Lactalbumin appears to be found only in mammals, and is widely distributed in this group. Calcium-binding lysozyme has bacteriolytic activity like lysozyme c, but also shares with lactalbumin the ability to bind a calcium ion. Calcium-binding lysozymes appear to be relatively rare they have been found in the milk of only a few mammalian species (e.g., horse, dog, cat, seal, and echidna), as well as in the eggs (e.g., pigeon) and stomachs (e.g., hoatzin) of some bird species [3, 9]. Indeed, calcium-binding lysozyme genes have not been reported for the human or rodent genomes.

Previous phylogenetic analyses of lysozyme c, lactalbumin, and calcium-binding lysozyme sequences had suggested that the earliest divergences within this gene family occurred between lysozyme c and the ancestor of the genes for lactalbumin and calcium-binding lysozyme, and that this initial gene duplication may have preceded the divergence of the lineages leading to fish and mammals [10, 11]. The separation of the lactalbumin and calcium-binding lysozyme genes was proposed to be more recent, with some studies [9, 12] suggesting a divergence on the early mammalian lineage, which would be consistent with the restriction of the lactalbumin gene to mammals. In contrast, another study [11] suggested that the duplication generating the lactalbumin and calcium-binding lysozyme genes predated the bird-mammal divergence. Moreover, the orthology of the mammalian and avian calcium-binding lysozymes has even been questioned [3, 11]. Thus, the origin of these mammalian lysozyme-like genes remains an open question.

Recently, cDNAs for several additional lysozyme-like sequences have been identified from human testis cDNA libraries [13–15]. These cDNAs were found to be encoded by genes that are now annotated by Ensembl [16] as LYZL (lysozyme-like): LYZL2, LYZL4, LYZL6 and LYZL3 (Synonym SPACA3 SPACA, Sperm acrosome associated [15]. SPACA3 is also known as SPRSA [14] and SLLP1 [13]). The predicted protein sequences of some of these lysozyme-like sequences have amino acid substitutions at sites important for the catalytic activity of lysozyme, suggesting that these proteins would not be able to hydrolyze the glycosidic bonds of bacterial peptidoglycan [13, 15]. Since these four new lysozyme-like genes (LYZL2, LYZL4, LYZL6, and SPACA3) are expressed predominantly in the testes, it has been suggested that they might have a role in reproduction [13–15, 17]. Such a role has been shown for Lyzl4 and Spaca3 in mice [18, 19].

The identification of these LYZL genes in the human genome suggests that the mammalian lysozyme-like gene family is larger than previously appreciated, and raises the possibility that the lysozyme-like proteins encoded by these genes may have novel biological functions. Here we have used extensive similarity searches of the human and other vertebrate genomes. We thereby identified three additional intact lysozyme-like genes in the human genome these have been annotated in the databases, but not reported in the literature. We have also identified multiple lysozyme-like genes in the genomes of diverse vertebrates. Using a combination of phylogenetic and genomic neighborhood (or synteny) analyses, wherein the relationships of the genes that flank the lysozyme-like genes in diverse species were examined, we demonstrate that orthologs of the human lysozyme-like genes are found in the genomes of diverse mammalian species. Our analyses suggest that there were at least six, and perhaps as many as nine, diverse types (or subfamilies) of lysozyme-like genes in the genome of the common ancestor of all extant mammals, and that these diverse genes have been maintained on most mammalian lineages. This suggests that their protein products probably have essential biological functions that are yet to be identified.


Analysis of Upstream Regions

The promoter structure of the vtRNA genes and candidates was investigated using the meme suite ( Bailey et al. 2006). The meme program ( Bailey and Elkan 1994) implements an expectation–maximization algorithm for discovering significantly overrepresented approximate sequence patterns in a set of nonaligned input sequences. The mast program is then used to detect occurrences on the meme-derived patterns in novel sequences.

For our analysis, we used 500 nt of 5′ flanking sequence, the vtRNA gene itself, and 50 nt of 3′ flanking sequence. This sequence interval is chosen to cover the known promoter elements (distal sequence element [DSE] and proximal sequence element [PSE]) upstream and the terminator region downstream of the VTRNA gene. Patterns were learned from DNA surrounding the experimentally known vtRNAs and their syntenically conserved homologs in Mammalia (set A). Independently, a meme alignment was also performed at the syntenically conserved locus in teleosts (set B) and on all vtRNA candidates from amphioxus, tunicates, lamprey (Petromyzon marinus), shark (Callorhinchus milii), Latimeria menadoensis, frog (X. tropicalis), lizard (Anolis carolinensis), and chicken (Gallus gallus set C). For eutheria, several motif lengths and different models were explored with largely consistent results with respect to the similarities between candidates from the syntenically conserved vtRNA loci. Defaults were used for all other parameters. Full input sets and associated parameters are documented in the Supplementary Material online.

In order to compare the meme-motifs with known features of polymerase III promoters, we extracted the corresponding sequence motifs from the literature on vtRNAs and other polymerase III transcripts ( Geiduschek and Tocchini-Valentini 1988 Kickhoefer et al. 1993 Kickhoefer et al. 2003 Vilalta et al. 1994 van Zon et al. 2001 Englert et al. 2004).

Motifs identified by meme then served as an input for mast searches ( Bailey and Gribskov 1998) against vt RNA homologs with unknown genomic location, in particular shotgun traces and contigs of low-coverage genomes.

Human vtRNA Expression

Cell Culture

MCF-7, HEK-293, PC3, Du-145, and HeLa cells (ATCC) were grown in Dulbecco's modified Eagle's medium/high glucose with 10% FCS (Biochrom), 100 units/ml of penicillin, and 100 μg/ml of streptomycin (PAA). LNCaP cells (ATCC) were grown in RPMI1640 supplemented with 10% FCS (Biochrom), 100 units/ml of penicillin and 100 μg/ml of streptomycin (PAA), and 10 mM N-2-hydroxyethylpiperazine-N¢-2-ethanesulfonic acid (Biochrom). RWPE-1 cells (ATTC) were grown in keratinocyte serum-free medium (Gibco-BRL) supplemented with 5 ng/ml human recombinant EGF (Gibco-BRL) and 0.05 mg/ml bovine pituitary extract (Gibco-BRL). All cells were cultured at 37 °C in a humidified atmosphere of 5% CO2 in air.

Real-Time Quantitative Polymerase Chain Reaction

Total RNA was extracted from the different fractions by using TRIzol reagent according to the manufacturer's instructions (Invitrogen, Carlsbad, CA). Sequences of primers that were used to carry out the quantitative real-time polymerase chain reaction (qRT-PCR) are listed in the Supplementary Material online. In all, 5 μl total RNA of each fraction was reverse transcribed using random hexamer primers and the High Capacity Reverse Transcription kit (Applied Biosystems). The cDNA was diluted 1:12.5 and served as the template for qRT-PCR analysis using the TaqMan 9700 System (Applied Biosystems) with FAST SYBR green mastermix (Applied Biosystems). Sequences of primers are listed in the Supplementary Material online. All amplicons were confirmed by sequencing. For each vtRNA assay, a standard curve was calculated to check the polymerase chain reaction (PCR) efficiency. Expression of the vtRNA in the different cell lines was normalized to genomic DNA.

VtRNAs in Short Read Sequencing Libraries

The data sets analyzed here for vtRNA fragments were produced in the context of other projects and have been (HeLa data [ Friedländer et al. 2008]) or will be published in that context. In brief, total RNA was isolated from the frozen prefrontal cortex tissue using the TRIzol (Invitrogen) protocol with no modifications. Low molecular weight RNA was isolated, ligated to the adapters, amplified, and sequenced following the Small RNA Preparation Protocol (Illumina) with no modifications.

BGI cortex, rep1, and rep2 libraries represent three technical replicates of the same three pooled samples from prefrontal cortex of humans, chimpanzees, and rhesus macaques. In each case, prior to low molecular weight RNA isolation, total RNA from 20 male human individuals aged between 14 and 58 years, 5 chimpanzees aged between 7 and 44 years, and 5 rhesus macaques aged between 4 and 10 years was combined in equal amounts. Replication was carried out by independent processing of the mixed sample of 20 individuals starting from the low molecular weight RNA isolation step. Cerebellum library: total RNA from five male human individuals aged between 20 and 56 years, five chimpanzees aged between 7 and 44 years, and five rhesus macaques aged between 4 and 10 years was combined in equal amounts. Aging library: 14 sequencing lanes of sample containing RNA from the prefrontal cortex of 12 humans aged from 0 to 98 years were analyzed.

Short Read Mapping

The mapping of large libraries containing hundreds of thousands of short, inaccurate sequences to large mammalian genomes cannot be performed reliably and efficiently by commonly used heuristics such as Blat ( Kent 2002) or Blast ( Altschul et al. 1997). This is due to limitations in both computational resources and accuracy. We therefore used segemehl, a new mapping tool based on enhanced suffix arrays ( Abouelhoda et al. 2004), which was developed by Hoffmann et al. (Forthcoming). It uses an alternative heuristics based on the matching statistics ( Chang and Lawler 1990) to incorporate not only mismatches but also insertions and deletions.

We also mapped all deep sequencing libraries directly against our four candidate sequences using the Soap program ( Li et al. 2008) allowing up to one mismatch position and a seed size of 8.

Expression of Teleost vtRNAs

Genomic DNA and total RNA were isolated from fish liver tissue using DNAzol and TRIzol reagents (Invitrogen), respectively, following the manufacturer's protocols. Concentrations of DNA and RNA samples were determined by A260 measurement using the Nanodrop ND-1000 Nanodrop (Nanodrop Technologies). Putative teleost fish vtRNAs were PCR amplified from genomic DNA (0.5 μg/50 μl reaction) with Taq DNA polymerase (New England Biolabs) and gene-specific primers at 1 μM final concentration. Each primer (listed in the Appendix) was designed to anneal specifically to the respective teleost vtRNA genes.

The PCR was carried out with 1 cycle at 95 °C for 2 min, followed by 35 cycles of 94 °C for 20 s, 58 °C for 20 s, and 72 °C for 15 s, and finished with a final elongation at 72 °C for 2 min. The PCR products were gel purified and cloned into pZero vector (Invitrogen) for sequencing confirmation of the specific vtRNA genes amplified.

Reverse Transcriptase–Polymerase Chain Reaction

The expression of individual vtRNAs was verified by reverse transcriptase–polymerase chain reaction (RT-PCR). From 2 μg of total RNA, medaka (Oryzias latipes) or zebrafish (D. rerio), cDNA libraries were prepared using Thermoscript reverse transcriptase (Invitrogen) and a random hexamer primer following the manufacturer's instruction. Gene-specific primers were used to PCR amplify the putative vtRNA sequences from the cDNA libraries under conditions similar to the PCR condition for genomic DNA samples, except 3–5 additional cycles. The RT-PCR products were cloned into pZero vector and sequenced. The Mock RT reactions with reverse transcriptase enzyme omitted served as the negative control.

Northern Blot Analysis

The northern blotting was carried out as previously described ( Xie et al. 2008) with minor modifications. Briefly, 20 μg of total RNA and in vitro transcribed vtRNA (0.1 and 1 ng) were resolved on a 6% polyacrylamide/8 M urea denaturing gel and electrotransferred to a Hybond-XL membrane (Amersham Biosciences) at 0.5 A for 2 h. The riboprobes for northern blotting analysis and the size markers, medaka (MEDAKA1_s3838_742) and zebrafish (ZFISH7_14_454804) vtRNAs, were prepared by T7 in vitro transcription using PCR DNA as template ( Xie et al. 2008). PCR primer sequences are listed in the Supplementary Material online. The riboprobes were labeled internally with [α- 32 P] UTP in a T7 transcription reaction using a Maxiscript kit (Ambion). After transfer, the membrane was hybridized with riboprobes (1 × 10 6 cpm/ml) at 65 °C overnight in Ultrahyb buffer (Ambion) and washed twice at 65 °C in 1x standard saline citrate (SSC)/0.2% sodium dodecyl sulfate (SDS) for 10 min and twice in 0.2x SSC/0.1% SDS for 20 min. The blot was analyzed and quantitated using a phosphorimager, Bio-Rad FX Pro.

Data description

Participating teams ( Table 1) had four months in which to assemble genome sequences from a variety of NGS sequence data ( Table 2 and Additional file 1 ) that was made available via the Assemblathon website [ 29]. Each team was allowed to submit one competitive entry for each of the three species (bird, fish, and snake). Additionally, teams were allowed to submit a number of ‘evaluation’ assemblies for each species. These would be analyzed in the same way as competitive entries, but would not be eligible to be declared as ‘winning’ entries. Results from the small number of evaluation entries (3, 4 and 0 for bird, fish, and snake respectively) are mostly excluded from the Analyses sections below, but are referenced in the Discussion.

Assemblies were generated using a wide variety of software ( Table 1), with greatly varying hardware and time requirements. Details of specific version numbers, software availability, and usage instructions are available for most entries ( Additional file 2: Tables S2 and S3 ), as are comprehensive assembly instructions ( Additional file 3 ).

Assemblies were excluded from detailed analysis if their total size was less than 25% of the expected genome size for the species in question. Entries from the CoBig 2 and PRICE teams did not meet this criterion their results are included in Additional file 4 , but are not featured in this paper (however see Discussion for information regarding the genic content of the PRICE assembly). Most teams submitted a single file of scaffold sequences, to be split into contigs for contig-based analyses. However, a small number of teams (ABL, CSHL, CTD, and PRICE) submitted one or more entries that consisted only of contig sequences that had not undergone scaffolding.

The submitted assemblies for Assemblathon 2 are available from the Assemblathon website [ 29] and also from GigaDB [ 30]. All input reads have been deposited in sequence read archives under the accessions ERP002324 (bird), SRA026860 (fish), and ERP002294 (snake) see Additional file 5 for a detailed list of all associated sequence accessions. Details of the bird sequence data, as well as gene annotations, have also been described separately (manuscript in preparation, and data in GigaDB [ 31]). The assembled Fosmid sequences for bird and snake that were used to help validate assemblies are also available in GigaDB [ 32].

Further, source code for scripts used in the analysis are available from a Github repository [ 33]. Results for all of the different assembly statistics are available as a spreadsheet ( Additional file 4 ) or as a CSV text file ( Additional file 6 ). For details on additional files see ‘Availability of supporting data’ section.


DNA methylation is an important epigenetic mechanism involving the covalent binding of a methyl group to the 5 th carbon position of cytosine in CpG dinucleotides in vertebrates 1 . This mechanism is generally considered as a repressive epigenetic mark that inhibits gene expression 2 . DNA methylation plays a critical role in several biological processes such as embryonic development and gametogenesis 1,3,4 . The DNA methylation process is mediated by DNA methyltransferases (Dnmts): maintenance methyltransferase Dnmt1 5 and de novo methyltransferase Dnmt3 6 . The erasure of methylation marks can be achieved either passively through the inhibition of Dnmt1 during DNA replication and cell division 7 , or actively through the action of ten-eleven translocation (Tet) dioxygenase family mediated iterative oxidation of 5-methylcytosine (5mC) and thymine DNA glycosylase (Tdg)-dependent base excision repair (BER) 8,9,10 .

The DNA methylation/demethylation machinery has been well described in mammals. While dnmt1 is generally identified as single-copy gene during evolution 11 , multiple dnmt3 genes were found in vertebrates with different gains and losses among tetrapod lineages 12 . The mammalian dnmt3 family consists of four members: dnmt3a, dnmt3b, dnmt3c and dnmt3l 13,14 . dnmt3l serves as a catalytically inactive cofactor for de novo methylation, and is found to exist only in eutherian mammals and in some marsupials 15 , whilst dnmt3c, previously annotated as a pseudogene, was recently identified in rodent genomes 14 . In contrast, the discovery of active demethylation-related genes occurred fairly late, with three tet paralogs (tet1, tet2 and tet3) and a single tdg gene found in mammalian genomes 9,10,16 . The identification of well-conserved DNA methylation genes in vertebrates, including teleosts, suggests that these regulatory pathways may be conserved across vertebrates 17,18 . However, due to the additional round of whole genome duplication (WGD) event that occurred before the radiation of the teleost lineage [TGD, teleost-specific genome duplication, 320 Mya (million years ago)], an increase in the number of copies of these genes has been found in teleost species 19 . For instance, the de novo methyltransferase dnmt3 was shown to be more divergent in teleosts compared with mammals: despite the absence of dnmt3l in zebrafish (Danio rerio) genome, up to 6 dnmt genes were identified as orthologous to mammalian dnmt3a and dnmt3b genes 20,21,22 . Similarly, 4 to 9 different dnmt3 paralogs were reported to exist in the genome of other teleost species 23,24,25,26,27,28 .

It is generally accepted that a WGD event can provide additional genetic material for selection, and are thus associated with phenotypic diversity and evolutionary innovations 29 . Alternatively, duplicated genes can also originate from small scale duplications (SSD) which can produce different kinds of adaptations compared to WGD 30 . Following WGD or SSD events, duplicated genes can either be lost or retained with three distinct outcomes: conservation of the ancestral gene functions, sub-functionalisation, or neo-functionalisation 31 . Through these processes, the fixation of extra copies of DNA methylation genes in teleost genomes may contribute to the diversification and plastic adaptation of teleosts. However, to characterise these adaptation, it is essential to first understand the evolutionary origin of these additional copies.

The increasing availability of sequenced genomes of teleost species facilitates to establish a comprehensive comparative study of DNA methylation genes among different taxa. However, few studies have been done to clarify the evolutionary history of dnmt, tet and tdg genes in vertebrates 22,23,24,32 . Hence, the evolutionary history and the orthologous relationship of DNA methylation genes remain incomplete and unclear, especially when considering genomes with higher complexity, i.e. salmonid species, which experienced a fourth round of WGD (Salmonid specific genome duplication, SaGD, 100 Mya).

The present study aimed to refine the current knowledge concerning the evolutionary history of dnmt genes in vertebrates, and to update the existing story with all DNA methylation genes (dnmt, tet and tdg) for extended taxa within the chordate phylum. To conduct the present study, we selected representative species with sequenced genomes of different taxa from a WGD point of view. Rainbow trout (Oncorhynchus mykiss), a salmonid fish, was included as a model species, which is supposed to have the maximum copies of DNA methylation genes among chordates due to SaGD. To explore if WGD events lead to sub- or neo-functionalisation of DNA methylation genes, estimations of the expression patterns of DNA methylation genes were done during gametogenesis and early development in trout.


Collection of sequences

The major procedures are schematically outlined in Suppl. Fig. 18. The collecting of sequences was done by several approaches. First, previously collected sequences [23, 24, 29] were checked against the present and updated versions of the genomes to include potential revisions of the gene sequences. This was done by searching GenBank (nucleotide collection, whole genome contigs, transcriptome shotgun assemblies, and specific genome assemblies), using nucleotide BLAST [66, 67], and Ensembl, using the corresponding BLAST/BLAT option built into Ensembl, in both cases obtaining pairwise alignments between our old sequences and the present sequences in the databases. Second, annotated and named sequences were found by using “connexin”, “gap junction protein”, “gja”, “gjb”, “gjc”, “gjd”, “gje” and the relevant species names as search terms in GenBank and Ensembl. If there was a lack of certain expected sequences in a certain species, the genome assemblies for the species in question were searched using the corresponding (assumed) orthologs from other species. When needed, multiple alignments by MUSCLE [68] were done, e.g., to settle the probable borders between introns and exons and to determine the percentages of identities between different sequences (e.g., in Suppl. Figs. 13A and 13B). By the combination of approaches described above, we found several connexin sequences presently not predicted in the databases, and they were included in our analyses (marked by NP as described below under Naming terminology).

If the experimentally confirmed or predicted sequences were available in GenBank, their accession numbers were also collected (to ensure the unique naming of the sequences). Depending on species and gene in question, we have used the NCBI Reference Sequences whenever possible. Otherwise, gene/RNA names or numbers were collected from Ensembl. All sequences, with GenBank accession numbers or Ensembl gene numbers if relevant, are provided in Supplement Figs. 1–12.

Among teleosts, we have collected sequences from zebrafish (Danio rerio, abbreviated Dr), stickleback (Gasterosteus aculeatus, Ga) [69], Japanese pufferfish (Takifugu rubripes, often called Fugu rubripes, called Fugu in the text, and abbreviated Fr) [50, 70], green spotted pufferfish (Tetraodon nigroviridis, Tn) [71], Atlantic herring (Clupea harengus, Ch) [17, 62], Atlantic cod (Gadus morhua, Gm) [48, 72] and European, American or Japanese eel (Anguilla anguilla, Aa Anguilla rostrata, Ar or Anguilla japonica, Aj). For eel, we have chosen to refer to an improved Anguilla japonica assembly [73, 74] because it has by far the longest scaffolds, aided by other genome shotgun assemblies of A. japonica [75], A. anguilla [44] and A. rostrata [46], as well as transcriptome shotgun assemblies from A. anguilla [76,77,78] and A. japonica [79].

As a comparison for the fish sequences, and to follow the Zebrafish Nomenclature Conventions [52], we collected sequences from humans (Homo sapiens, Hs Suppl. Fig. 1), mouse (Mus musculus, Mm Suppl. Fig. 2), and opossum (Monodelphis domestica, Md Suppl. Fig. 3), and supplemented them with certain single sequences from platypus (Ornithorhynchus anatinus, Oa), koala (Phascolarctos cinereus), Tasmanian devil (Sarcophilus harrisii, Sh), wallaby (Notamacropus eugenii), large flying fox (Pteropus vampyrus, Pv), black flying fox (Pteropus alecto, Pa), Egyptian rousette (Rousettus aegyptiacus, Ra), aardvark (Orycteropus afer afer, Afer), manatee (Trichechus manatus, Tm), African elephant (Loxodonta africana, La) and armadillo (Dasypus novemcinctus, Dn) (Suppl. Figs. 4 and 12). All sequences are given in the Supplemental Information, where also the relevant database can be inferred according to the name/identity we have given the sequence.

Suggested deviations from the predicted sequences are indicated in the Supplemental Information. If the predicted sequences did not contain potential start and stop codons, we analyzed the genomes to extend the sequences to those codons, following the pattern established by connexins orthologs in other species. If the predicted sequences contained introns, we investigated whether moving the exon-intron borders improved the similarity between sequences and the established sequence patterns, even by including the whole intron as a part of the exon. In a few cases, we also suggested other types of modifications, following the patterns established for these sequences in other species. Furthermore, any unpredicted sequences (i.e., those not predicted in Ensembl or GenBank) we found during the present searches, were included.

Several pseudogenes exist in the gap junction gene family, also in humans [29]. With a single exception, obvious pseudogenes are not included in the analyses shown. The one exception is a novel human pseudogene (GenBank NG_026166 claimed as GJA4 pseudogene) that we did not detect in our previous analyses [23, 24, 29]. Additionally, orthologs to NG_026166 were extracted from the genomes of several mammalian species (Suppl. Fig. 12).

Naming terminology

To distinguish between human genes and other species, it is generally recommended that abbreviations for human gene names are spelled in upper case letters, while using lower case letters for other species. For the purpose of the present paper, this would be inconvenient as we often are referring to the gene groups, and we are here using upper case when referring to human genes and mammalian orthologous gene groups, while teleost genes in general are indicated by lower case letters. We also use upper case letters when we are referring to a whole orthologous group (i.e., mammalian plus teleost genes). There are some exceptions to the upper/lower case spelling, because when we refer to specific single genes, we use (as far as possible) the gene names given in GenBank or Ensembl.

To ensure uniqueness of every name used in the present work, we added the GenBank accession number or an abbreviated form of the Ensembl gene number to the names for which predictions were available in the present databases. Specific gene names were generally abbreviated as indicated by the database, or the abbreviations can be inferred from the database name. E.g., for XM_003965660, the full name (“definition”) is “Takifugu rubripes gap junction protein, alpha 9, 59 kDa (gja9), mRNA”. In this case, the name is given with both the Greek and size nomenclature, and the name is abbreviated in lower case in parentheses. Thus, we have here used the gene name Fr-gja9-cx59-XM_003965660. For XM_021466745, the full name is “Danio rerio connexin 55.5 (cx55.5), transcript variant X1, mRNA”. We here abbreviated the name to Dr-cx55.5-XM_021466745. For XM_011619942, the full name is “Takifugu rubripes gap junction alpha-10 protein-like (LOC1010664818), mRNA”, and it was abbreviated Fr-gja10like-XM_011619942. Where several transcript variants are experimentally shown or predicted, we only used transcript variant X1.

If the gene was predicted in the Ensembl database, but no name was available, we used a relevant gene name to indicate the correct group of sequences. For example, the Tetraodon gjb2/6-like sequence ENSTNIG00000010340 (with the corresponding transcript prediction ENSTNIT00000013438) had no name or description. We abbreviated the gene Tn-NN-cx30.3-G10340 (where NN = No Name). This is an example of a gene for which our transcript prediction differed from the database, as indicated in the Supplemental Information.

If the gene was not predicted in a species, but found in our Blast searches, it was suitably named but with the prefix NP (Not Predicted). One example is Tn-NP-cx30.3. Thus, Tetraodon has a total of four genes in the Cx30.3 group, two that have been predicted and are named in Ensembl, one that has been predicted but not named, and one that has not been predicted by the database (but by us).

To be able to follow certain very closely related groups of sequences in an easy manner, previously un-named (or unpredicted) sequences in the cx30.3 and gjd2 groups were named with the postfixes *1/*2/*3 for the purposes of the present manuscript.

Phylogenetic analyses

The phylogenetic analyses were performed in MEGA7 [80] or MEGA-X [81] using the conserved domains essentially as described in Cruciani and Mikalsen [24] because of the distant evolutionary relationship between mammals and fish. Here, we extended the previously defined conserved domains by 15 nucleotides in 3′-direction for the first conserved domain (i.e., into the sequence corresponding the intracellular loop), and by 15 nucleotides in both 5′- and 3′-direction for the second conserved domain. All sequences and the limits of the sequences used in the phylogenetic analyses are presented in the Suppl. Fig. 1–12, where previously defined conserved sequences [24] are marked in yellow, and the 15 nucleotide extensions are marked in grey.

The main questions for the phylogenetic analyses were related and also partly overlapping, and were as follows: (i) The connection between the naming of the teleost sequences (naming taken from the main databases GenBank and Ensembl) and their position in a specific orthologous group, i.e., do teleost orthologs have the same name? (ii) The (orthologous) relationships between the teleost sequences and the corresponding mammalian sequences. Is there a (reasonably) stable structure in the connexin gene family across the teleosts, i.e., do teleost connexins distribute into orthologous groups in a manner more or less similar to the mammalian sequences? (iii) The ohnologies among the teleost sequences. Note that our present questions do not concern the relatedness within the whole tree (i.e., the complete evolutionary history of the connexin gene family). The present knowledge of evolutionary history of the connexin gene family is graphically summarized in Fig. 4 in ref. [24]. The needed translation between the nomenclature systems is found in Table 3. This translation also includes the recently suggested “alphabetical” nomenclature in mammals [26].

Model selection was run in MEGA X using amino acid models. Settings were automatic tree building using Neighbor-Joining model and partial deletion using a site coverage cutoff of 95%. Minimal differences were found between the models estimated with similar Bayesian Information Criterion, but in general simpler models were preferred (Jones-Taylor-Thornton, Le-Gascuel and Dayhoff substitution matrices). We therefore ran the phylogenetic analyses with different construction methods (Maximum Likelihood, Maximum Parsimony and the two distance methods Neighbor-Joining and Minimum Evolution) using different substitution models as indicated in Suppl. Table 2. Several construction methods were used as they have different strengths and weaknesses with regard to the degree of relatedness of the sequences, the differences in evolutionary rates in different branches, how highly divergent sequences are behaving, etc. Settings for each particular analysis are available in Suppl. Table 2. Each method was used at both amino acid and nucleotide levels (the latter using only positions 1 and 2 in the codon), and in many cases with both bootstrap and interior branch statistics. In total, 21 statistical analyses were performed, and they are summarized in Suppl. Table 1, with the corresponding parameter settings in Suppl. Table 2. All these methods are included in the MEGA phylogenetic software. If all, or most, of the statistical comparisons supported a specific dichotomous relationship, we deemed the results more robust.

Materials and methods

Strains, media and plant infections

M. oryzae was routinely incubated in a controlled temperature room at 25°C with a 12h light/dark cycle. Media composition of complete medium (CM), minimal medium (MM), minimal medium without carbon (MM-C) or nitrogen (MM-N), and DNA extraction and hybridisation were all as previously described[76]. Growth tests were carried out with 7 mm plugs of mycelium from Guy11 and the M1422 mutant strains as initial inoculum. The wild-type Neurospora crassa strain and isogenic deletion mutant NCU05996 were obtained from the Fungal Genetics Stock Centre (FGSC, Kansas City, Missouri, USA). Vogel’s minimal medium was used for cultivation of N. crassa strains at 25°C with a 12h light/ dark cycle and for stock-keeping at 4°C ( Growth tests were carried out on Vogel plates with 5 mm plugs of mycelium from N. crassa wild-type (wt) and NcTPC1 KO strains. Plates were incubated at 25°C for 2 days. M. oryzae leaf and root infection assays were carried out, as previously described [30,77].

Conidiation, onion/leaf sheath penetration assays, cytorrhysis assay and glycogen/Nile red staining

Conidia were harvested using 2 ml of sterile water per plate after fungal cultures were incubated at 25°C for a period of 10 days on CM. Calculations were then carried out to determine the number of conidia generated per cm 2 of mycelium using a Neubauer counting chamber. Values are the mean ± SD from >300 conidia of each strain, which were measured using the ImageJ software [78]. Photographs were taken using the Zeiss Axioskop 2 microscope camera using differential interference contrast (DIC) microscopy or epifluorescence. Conidia were stained with 5μl calcofluor white (CFW) solution (Fluka) and incubated at 25°C for 30 minutes. Cell number per conidium was determined by visualizing septa with CFW. Appressorium-mediated penetration of onion epidermal strips was assessed using a procedure based on Chida and Sisler[79]. A conidial suspension at a concentration of 1 x 10 5 conidia mL -1 was prepared and dropped onto the adaxial surface of epidermal layers taken from onion. The strips were incubated in a moist chamber at 25°C and penetration events scored 24h to 48h later by recording images with an Olympus IX81 inverted microscope system. Leaf sheath assays were carried out as previously described [10]. Glycogen staining solution contained 60 mg of KI and 10 mg of I2 per milliliter of distilled water. Glycogen deposits are visible immediately. For cytorrhysis assays, 10 5 spores were allowed to form appressoria for 18h on coverslips prior the addition of external glycerol (1M or 3M). After 10 minutes in glycerol

500 appressoria were analyzed in each biological replica experiment was carried out by triplicate. To visualize lipid droplets, conidia were allowed to germinate in water on coverslips. After 0h, 2h, 9h and 12h water was removed and conidia directly stained with Nile red (Nile Red Oxazone (9-diethylamino-5Hbenzo[alpha]phenoxazine-5-one Sigma). Nile red was used to 2.5 mg/ml diluted in 50mM Tris/Maleate, pH 7.5 and polyvinylpyrrolidone (PVP) (2–3% w/v). Lipid droplets begin to fluoresce within seconds. Samples were visualized under a confocal laser scanning microscope using a 561 nm excitation wave length and a long pass emission filter (592–700 nm). All images were taken using the same parameters.

Generation of mutant strains by gene replacement

Gene deletion constructs were generated using multisite gateway technology (Invitrogen) as previously described[77,80]. TPC1, CON6, GH18, PEBP2 and NOXD coding sequences were replaced by the hygromycin resistance cassette and PEBP1 by the sulfonylurea resistance cassette in the gene replacement constructs. Primers for constructing entry plasmids are described in S4 Table. Fungal transformants generated by Agrobacterium-mediated transformation [81] were selected growing in DCM solid media supplied with 5-fluoro-2’-deoxyuridine (50μM) and 200μg/ml Hygromycin or 150μg/ml Chlorymuronethyl in the case of Δpebp1. DCM is 1.7 g yeast N-base without amino acids, 1.0 g NH4NO3, 2.0 g of L-asparagine and 10 g of D-glucose. Knockout strains were confirmed by PCR or Southern blotting using radioactive probes ( 32 P primers listed in S4 Table). Sequence data and gene numbers used in this study were taken from EnsemblFungi (Magnaporthe oryzae MG8

Generation and cellular localisation of fluorescently tagged proteins

To determine the localisation of Tpc1, live-cell imaging was performed using a M. oryzae Guy11 strain containing two constructs, histone H1 tagged with red fluorescent protein (H1:RFP tdTomato) to visualize nuclei [82], and TPC1:GFP. For the construction of a functional TPC1:GFP gene fusion, primers were designed in order to amplify the TPC1 (MGG_01285) promoter region and ORF from genomic DNA of M. oryzae Guy11 (S4 Table). The TPC1_GFP_F forward primer was designed approximately 1.3 kb upstream from the TPC1 start codon to include a substantial component of the promoter sequence. The TPC1_GFP_R reverse primer spanned the stop codon and contained a complementary region to the GFP sequence. GFP primers were designed to amplify the 1.4 kb sGFP:TrpC construct cloned in pGEMT. Both fragments were joined together by fusion nested PCR. The amplicons were cloned into pGEMT-easy digested with EcoRI. The 4.3 kb TPC1:GFP fragment was gel purified and cloned into pCB1532 that had previously been digested with EcoRI. The pCB1532 vector contains the 2.8 kb ILV1 gene, which encodes the acetolactate synthase-encoding allele bestowing resistance to sulfonylurea[83]. The resulting plasmid pCB1532-TPC1:GFP was used to transform protoplasts of M1422 mutant. For all rounds of PCR amplification, Phusion High-Fidelity DNA polymerase (Finnzymes, Thermo Fischer Scientific Inc.) was used, following the manufacturers’ guidelines for PCR conditions.

The GFP:MoATG8[34] and the FIM:GFP constructs were used to transform protoplasts of M1422 mutant. Protoplast generation and transformation were carried out as previously described[76]. The GFP:MoATG8 and the FIM:GFP protein fusion vectors were generated using the native M. oryzae MoATG8 gene (MGG_01062) and the native M. oryzae fimbrin-encoding gene (MGG_04478), respectively. Both fragments were cloned into pCB1532 vector that contains the 2.8 kb ILV1 gene, which encodes the acetolactate synthase allele conferring sulfonylurea resistance. Transformants showing identical growth and colony morphology to the background strain were selected for further examination using epifluorescence or confocal microscopy. At least three different transformants of each were independently analysed.

The TPC1:GFP gene fusion was cloned into pCB1532 vector (SUR R ) and used to transform protoplasts of Guy11 expressing Histone H1 fused with red fluorescent protein (H1:RFP)[33], and also introduced into isogenic Δpmk1, Δatg1 and Δatg8 mutants. Transformants were selected for further examination using confocal microscopy and verified as containing a single copy of the gene fusion construct by Southern blot hybridisation. At least three different transformants of each were used in all experiments.

RNA isolation and global gene expression profile using microarrays

Using a modified protocol of LiCl method[77], RNA was extracted from 8-day old fungal mycelia grown on cellophane placed on top of CM agar plates (S2E Fig). Two to three additional washes with phenol:chloroform were implemented to avoid RNA degradation from cellophane samples. RNA quality control was carried out with Agilent RNA 6000 Nano kit (ref. 5067–1504). Four biological replicates were independently hybridized for each transcriptomic comparison. Each of these replicates derived from three technical repetitions. Slides were Agilent Magnaporthe II Oligo Microarrays 4x44K (ref. 015060). Background correction and normalization of expression data were performed as previously described[77]. Hybridizations and statistical analysis were conducted by the Genomics Facility at the National Biotechnology Centre (Madrid, Spain). The GO term analysis was carried out with gProfiler[84]. Enriched motifs were not found when using the promoter regions of the 185 up-regulated genes. Microarray data are available in the ArrayExpress database (EMBL_EBI) under accession number E-MTAB-4127.

Yeast-two hybrid screen

In-Fusion Cloning based on in vitro homologous recombination was performed to generate vectors including NoxD and Tpc1 into the pGADT7 prey vector, and Nox1, Nox2 NoxR, Pmk1 and Mst12 into the pGBKT7 bait vector. Genes were amplified from M. oryzae cDNA derived from mycelia grown on liquid CM using primers with a 15bp overhang and restriction site complementary to the target vector (S4 Table). For NoxD, a 435bp fragment was amplified, for Nox1, a 1662bp fragment was amplified, for Nox2, a 1749bp fragment was amplified, and for NoxR, a 1578bp fragment was amplified. Respective fragments were cloned into pGBKT7 and pGADT7 plasmids linearized by digestion with EcoRI and SmaI. Yeast two-hybrid assays using pGADT7 or pGBKT7 (Clontech) based constructs were performed according to the manufacturer’s instructions (MATCHMAKER Gold Yeast Two-Hybrid System).

Imaging of fluorescent fusion proteins

For the construction of NoxD:GFP, primers were designed to amplify the ORF including 2kb upstream of the start codon, GFP and TrpC terminator with 15bp overhangs complementary to adjacent fragments (S4 Table). Fragments were ligated into pCB1532[83], which carries the sulphonyl urea resistance cassette and had been digested with BamHI and HindIII and this construct transformed into of the wild-type strain Guy11 using protoplasts[6]. The NoxD:mRFP construct was generated using multi-site gateway technology (Life Technologies) with the entry mCherry-withSTOP and destination SULPH-R3R4 vectors[77], and PCR fragments amplified from M. oryzae genomic DNA using Phusion DNA polymerase (NEB) and primers detailed in S4 Table. Appressorium development assays were performed on hydrophobic borosilicate glass coverslips (Fisher Scientific), as described previously[6]. For epifluorescence microscopy, conidia were incubated on coverslips and observed at each time point using an IX-81 inverted microscope (Olympus) and a UPlanSApo X100/1.40 oil objective. All microscopic images were analyzed using MetaMorph (Molecular Devices). Confocal imaging was performed with a Leica SP8 microscope.

QPCR and ROS detection

To confirm microarray results, the relative abundance of gene transcripts were analysed by qPCR (S4 Table). One μg of total RNA from 8-day old fungal mycelia grown on cellophane placed on CM agar was reverse transcribed using PrimeScript RT reagent Kit (Takara). The average threshold cycle (Ct) was normalized against actin transcript and relative quantification of gene expression was calculated by the 2 ΔΔCt method[85]. Primer efficiency was tested using dilutions of cDNA samples. qPCR reactions were carried out with 1 μl of reverse transcribed products and fast-start DNA master SYBR green I kit (Roche Diagnostics) in a final reaction of 20 μl using the following program: one cycle of 95°C for 4 min and 40 cycles of 94°C for 30 s and 60°C for 30 s. The Ct (threshold cycle) provided a measure for the starting copy numbers of the target genes. Three technical repetitions from three independent biological experiments were used for each gene. For ROS detection in M. oryzae fungal structures, NBT staining[65] and quantification method of pixel intensities in hyphal tips[86] were carried out as previously described.

Chromatin immunoprecipitation (ChIP) and quantitative PCR (qPCR) analysis

Two strains, the Δtpc1 mutant expressing TPC1:GFP and M. oryzae wild-type Guy11 strain as negative control were used for this experiment. Mycelia were grown in liquid CM at 25°C for 48 h in a shaker (120 rpm), and collected using two layers of Miracloth. Harvested mycelia were washed extensively with sterile water. To crosslink DNA and proteins, one gram of each washed mycelium was treated with 1% formaldehyde in 20 mM HEPES pH 7.4 buffer for 20 min with continuous shaking at 100 rpm. Then, 0.125 M glycine was added and incubated at room temperature for an additional 10 min to stop crosslinking. Mycelia were harvested with Miracloth, rinsed with water removing excessive water by squeezing and immediately frozen in liquid nitrogen, grinded into a fine powder and stored at -80°C until used. ChIP was conducted according to published procedures with some modifications [87]. 600 mg of each mycelium powder was used for chromatin extraction and sonication. The powder was added into 10 ml of Extraction buffer 1 (0.4 M sucrose, 10 mM Tris-HCl pH 8, 10 mM MgCl2, 5 mM β-mercaptoethanol/β-ME and Protease Inhibitors Complete-PIC/Roche) and mixed by vortexing. The solution was filtered through a double layer of Miracloth and centrifuged at 5000 g for 10 min at 4°C. The pellet was resuspended in 1 ml of Extraction buffer 2 (0.25 M sucrose, 10 mM Tris-HCl pH 8, 10 mM MgCl2, 1% Triton X-100, 5 mM β-ME and PIC) and centrifuged at 5000 g for 10 min at 4°C. The pellet was resuspended in 300 μl of Extraction buffer 3 (1.7 M sucrose, 10 mM Tris-HCl pH 8, 0.15% Triton X-100, 2 mM MgCl2, 5 mM β-ME and PIC) and, carefully layered on the top of additional 600 μl of extraction buffer 3. Then, samples were centrifuged at 16000 g for 60 min at 4°C. The chromatin pellet was resuspended in 300 μl of Nuclei Lysis Buffer (50 mM Tris-HCl ph 8, 10 mM EDTA, 1% SDS and PIC) and sonicated for 25 min at 4°C, operating a pattern of 30 sec ON and 30 sec OFF, at high power level in the Bioruptor Plus (Diagenode, Liege, Belgium) to obtain DNA fragments ranging from 500 to 1,000 bp. The chromatin solution was centrifuged at maximum speed for 5 min at 4°C to pellet cell debris. The supernatant was kept as chromatin solution and a small aliquot (10%) was stored as input DNA control. For each immunoprecipitation, 15 μl of Dynabeads Protein A magnetic beads (ref. 10001D, Life Technologies) was washed twice with 500 μl ChIP dilution buffer (1.1% Triton X-100, 1.2 mM EDTA, 16.7 mM Tris-HCl pH 8, 167 mM NaCl and PIC). Then, anti-GFP antibody (ref. A6455, Life Technologies) was added and incubated with gentle rotation for 1h at 4°C in 50 μl ChIP dilution buffer. Prepared anti-GFP coated beads were washed twice with 500 μl ChIP dilution buffer and resuspended in 100 μl of ChIP dilution buffer. For each immunoprecipitation, the latter and 100 μl of chromatin solution were gathered together and diluted up to 1 ml of ChIP dilution buffer. All immunoprecipitations were incubated overnight at 4°C with gentle rotation, then washed with a serie of wash buffers (2 washes with Low Salt Wash Buffer: 150 mM NaCl, 0.1% SDS, 1% Triton X-100, 2 mM EDTA, 20 mM Tris-HCl pH 8 one wash with High Salt Wash Buffer: 500 mM NaCl, 0.1% SDS, 1% Triton X-100, 2 mM EDTA, 20 mM Tris-HCl pH 8 one wash with LiCl Wash Buffer: 0.25 M LiCl, 1% NP-40, 1% sodium deoxycholate, 1 mM EDTA, 10 mM Tris-HCl pH 8, 2 washes with TE Buffer: 10 mM Tris-HCl pH 8, 1 mM EDTA). Immunoprecipitated DNAs and Input DNA control were reverse-crosslinked at 95°C for 10 min with 200 μl of 10% chelex 100 resin to remove any trace of metals. DNA samples were treated with proteinase K that was inactivated afterwards. After centrifugation, supernatants of DNA samples were stored at -20°C until used. Immunoprecipitated chromatin was diluted 10 times for qPCR analysis (primers listed in S4 Table). This was performed using a Roche LightCycler 480 machine. qPCR reactions were carried out using either 2 μl of input DNA or 2 μl of immunoprecipitated chromatin in a final reaction of 12 μl with the following program: one cycle of 95°C for 5 min and 58 cycles of 94°C for 10 s, 60°C for 10 s and 72°C for 10 s. The Ct (threshold cycle) provided a measure for the starting copy numbers of DNA. Three technical repetitions from 4 independent biological experiments were used. Ct values were used to calculate ratios evaluating the fold difference between experimental samples (GFP-tagged or untagged wild-type strains) and normalized the input. We normalized with “Fold Enrichment Method” using the untagged strain. The Wilcoxon Mann Whitney test was applied to analyze the difference between two independent groups. Statgraphics software was used to make pairwise comparisons between GFP-tagged strain and untagged wild-type strain.

Protein purification and EMSA

M. oryzae MST12 and TPC1 cDNAs derived from mycelial RNA were cloned by PCR using a high fidelity Q5 DNA polymerase (NEB), primers (S4 Table) and the restriction enzymes BamHI-NotI and EcoRI- NotI for MST12 and TPC1 respectively, into a modified pET28 vector (5,667bp Novagen). MST12- and TPC1-containing plasmids were transformed in E. coli Rosetta DE3 (Novagen) and colonies grown in LB medium containing chloramphenicol (34 μg/L) and kanamycin (50 μg/L) until reaching OD600nm = 0.8. Protein expression was induced 4 hours at 28°C with 1 mM IPTG (Sigma-Aldrich). Centrifuged cell pellets (30 min at 7000g) were resuspended in lysis buffer (20 mM sodium phosphate pH 8, 300 mM NaCl and one tablet of PIC/50 ml, 1 mM PMSF and 50 μg/ml Dnase I), lysed by sonication and pelleted at 4°C and high speed (20 min at 20,000g). Recombinant proteins were purified from clear lysate by metal affinity chromatography (HisTrap HP 1 ml, #17-5247-01 GE Healthcare) in denaturing conditions using 6 M Urea and eluted with 250 mM imidazole containing buffer. Samples were desalted on PD10 column (#17085101 GE Healthcare) to remove urea and imidazole using buffer (20 mM sodium phosphate pH 8, 10% glycerol and PIC). Protein samples purity was evaluated by SDS-PAGE.

EMSA probes were generated as follows. Amplified by PCR fragments using primers listed in S4 Table were prepared using modified Biotin 3’end DNA labeling procedure (#89818 Thermo-Scientific). Briefly, each

500pb purified PCR products was KpnI-digested, purified and labelled (5 pmol of each probe) with Biotin-11-UTP and Terminal Deoxinucleotidyl Transferase at 37°C for 1 hour. Biotinylated probes were purified by Chloroform:IAA (24:1) extraction and stored at -20°C until use. EMSA reactions (20 μl) contained 10 mM Tris HCl pH 7.5, 50 mM KCl, 16 mM DTT, 1 mM ZnCl2, 1 mM MgCl2, 1% Glycerol, 50 ng/μl Poly dI-dC (#20148E Thermo-Scientific), 10 μg BSA, Protease inhibitor complete (Roche), and 80 fmol of biotinylated probe. Before probe addition proteins (0–12 μM) were incubated in binding buffer for 10 min, then probe was added and incubated during 30 min at room temperature before loading. The EMSA gel (0.2% agarose, 5% polyacrylamide, 1% glycerol in TBE 0.5x) was run for 2h 100V in TBE 0.5x and then transferred to a Hybond-XL nylon membrane (#RPN203S GE Healthcare) at 400 mA for 1 hour. The membrane was UV crosslinked at 120mJ/cm 2 . Detection was performed with stabilized streptavidin-horseradish peroxidase conjugate (#21134 Thermo-Scientific) and enhanced chemiluminescent substrates (#32106 Thermo-Scientific) following LightShift Chemiluminescent EMSA procedure (#20148 Thermo-Scientific).

Phylogenetic analysis of Tpc1

First, 141 M. oryzae protein sequences containing a fungal Zn(II)2Cys6 binuclear cluster domain (PF00172) were identified from the Magnaporthe sequence database at the Broad Institute ( and the Fungal Transcription Factor Database ( HMMsearch from HMMER3[88] was used to screen the genome assembly of M. oryzae proteins with the fungal Zn2Cys6 profile hidden Markov model pHMM zn_clus_ls.hmm (PF00172.13) from Pfam database[89] ( Subsequently, gene numbers were updated using the MG8 genome version of EnsemblFungi database ( Out of these 141 sequences, only 113 had a full length zinc cluster domain, and extra six closest sequences were included to build S5 Fig. Additional Zn(II)2Cys6 proteins found in Lu et colleagues[28] were included in S2 Table. Basic Local Alignment Search Tool (BLAST) was used to find orthologous proteins of TPC1/MGG_01285 ( Protein sequences were pre-aligned using HMMalign and the pHMM zn_clus_ls.hmm (S4 Fig) from Pfam. The Zn(II)2Cys6 binuclear cluster domain region was extensively manually aligned in BioEdit ( Unambiguous aligned positions were used for the subsequent phylogenetic analyses. The maximum likelihood (ML) analyses were performed with the program PhyML version 3.0.1[90]. All trees were visualised using the program Figtree (

Accession numbers

M. oryzae sequence data from this article can be found in the GenBank/EMBL-EBI (EnsemblFungi) databases under the following accession numbers: TPC1 (MGG_01285), PMK1 (MGG_09565), MST12 (MGG_12958), ATG1 (MGG_06393), ATG8 (MGG_01062), CON6 (MGG_02246), GH18 MGG_04732, NOXD (MGG_09956), PEBP1 (MGG_06800), PEBP2 (MGG_14045), NOXR (MGG_05280), NOX1 (MGG_00750), NOX2 (MGG_06559), FIMBRIN (MGG_04478) GELSOLIN (MGG_10059), ACTIN (MGG_03982), YDIU (MGG_03159) and SEP5 (MGG_03087).

PRD-class Clusters: Remains of a PRD-class Mega-cluster?

Mazza et al. (2010) identified the HRO cluster of PRD-class genes in Cnidaria and protostomes, including insects and molluscs. This cluster is composed of the genes Homeobrain (Hbn), Rax/Rx, and Orthopedia (Otp). At least part of the cluster is even more ancient than the cnidarian-bilaterian ancestor as Hbn and Otp are also clustered in the placozoan T. adhaerens (Mazza et al., 2010). Also, elements of the HRO cluster are now known to be more widespread in protostomes than initially described. For example, more recent whole genome sequencing projects like that of the myriapod S. maritima have revealed that this arthropod has also retained the HRO cluster (Chipman et al., 2014).

Intriguingly, this HRO cluster exhibits temporal collinearity in the cnidarian Nematostella vectensis (Mazza et al., 2010). That is, the order of the genes along the chromosome corresponds to the order in which they are activated during development. Temporal collinearity has also been hypothesized to be the main underlying reason for the maintenance of intact, ordered Hox and ParaHox clusters (Ferrier and Holland, 2002 Ferrier and Minguillón, 2003 Monteiro and Ferrier, 2006). Thus, there is the potential that deeper mechanistic understanding of temporal collinearity can be obtained by comparisons across all three homeobox clusters: Hox, ParaHox, and HRO.

Clustering of PRD-class genes is not confined to the HRO cluster. The clustering of Goosecoid (Gsc) and Otx was noted in amphioxus (Putnam et al., 2008 Takatori et al., 2008) and the hemichordate genome sequences analyzed recently, reveal that in one species (Ptychodera flava) Gsc also clusters with Otx, but in another species (Saccoglossus kowalevskii) Gsc instead clusters with Otp, Rx, Hbn, and Islet (Isl) (all of which are PRD-class genes except Isl, which is LIM-class Simakov et al., 2015). Two things are noteworthy here. First, it will be important to independently check the Saccoglossus gene arrangement, particularly the location of Gsc. Second, the gene nomenclature risks causing confusion and in extended Figure 4 of Simakov et al. (2015), the authors have depicted the cluster containing an Arx gene, when in fact the gene should be named Hbn or Arx-like on the basis of its sequence. Arx is a distinct family from Hbn/Arx-like, as seen in the cnidarian Nematostella vectensis (Ryan et al., 2006 Table 1).

Table 1. Homeobox families present in the protostome�uterostome ancestor (PDA).

Looking deeper in animal evolution, Schierwater et al. (2008) noted two instances of PRD-class clustering in T. adhaerens: PaxB with Pitx and Ebx/Arx-like with Otp (this second cluster also containing the LIM-class gene Isl). The Ebx/Arx-like gene of Schierwater et al. (2008) is equivalent to the Hbn gene of Mazza et al. (2010). This then, in combination with the new hemichordate data, establishes the clustering of Otp with both Hbn/Arx-like and Isl as an ancient cluster that has been conserved from before the start of the Cambrian, over 541 million years ago. Furthermore, in combination with the data on the HRO PRD-class cluster of cnidarians and selected bilaterians, it is possible to deduce an ancestral extended PRD-LIM class cluster including Hbn, Rx, Otp, Gsc, Otx, and Isl (Figure 3). By comparison to the large ancestral array hypothesized for the ANTP-class (see above), we perhaps should now also view the PRD-class as having evolved via a Mega-cluster array as well (which in turn was also a sub-component of the Giga-cluster outlined above).

Figure 3. Composition of the PRD/LIM-class Mega-cluster. Specific instances of gene clustering are listed against specific taxa, which when considered together allow the deduction of the PRD/LIM-class Mega-cluster. These animals include non-bilaterians (T. adhaerens and N. vectensis), protostomes, and deuterostomes (hemichordates and amphioxus). Most members of the array are PRD-class genes (black boxes), but there is also a single member of the LIM-class (white box). The Pitx and Pax (PaxB) clustering is found in T. adhaerens, but is not reported for another animal as yet, hence the question mark to denote the ambiguity as to whether these PRD-class genes can be included in the PRD/LIM-class Mega-cluster. The HRO cluster is the PRD-class cluster originally described by Mazza et al. (2010). The figure only shows established instances of clustering arrangements described in the literature (see text for details). Lack of a gene alongside a taxon does not necessarily represent absence of the gene from the genome of that species, except in the case of Rax/Rx for T. adhaerens, which was not found in the placozoan genome by Mazza et al. (2010) (denoted by “X”).

Shaping an Improved Roadmap toward Precision Medicine

Because changes in mitochondrial function can alter global alternative splicing events, it is not surprising that both phenomenon are linked to the induction of cellular heterogeneity and human disease pathology progression (Raj and van Oudenaarden, 2008 Hanahan and Weinberg, 2011 Pagliarini et al., 2015). Despite a growing appreciation for the role that alternative splicing plays in promoting phenotypic variability, the identification of genetic polymorphisms linked to disease, which may or may not alter gene-splicing events, remains a core focus of pharmacogenomics. Mutations in single genes have now been identified for over 4000 human diseases, of which 5–15% are the result of SNPs resulting in nonsense mutations (Mort et al., 2008). However, SNPs occur in coding regions of genes with a frequency of approximately 1.5%, and of those, only one-third are expected to result in nonsynonymous mutations and a much smaller number alter pre-mRNA processing, leading to a frequency in change of phenotype of only ∼0.5%. When one considers that humans express almost 50,000 SNPs across the 57 human P450 genes, fewer than 250 may be capable of significantly altering a patient’s metabolic phenotype, an insight we contend may help streamline PGx approaches to personalized medicine (Zhou et al., 2009 Zanger and Schwab, 2013). For example, although CYP2D6 genotyping is no longer recommended in the clinical setting for tamoxifen treatments (Abraham et al., 2010 Wu, 2011 Lum et al., 2013), the FDA still considers CYP2D6 clinically actionable from a PGx perspective for codeine and other drugs (Crews et al., 2014). Unfortunately, over 74 allelic variants of CYP2D6 have already been described, which greatly complicates genetic testing and the clinician’s ability to select the appropriate pharmacotherapy and dose (Zhou, 2009). Fortunately, organizations such as the Clinical Pharmacogenetics Implementation Consortium (CPIC) and the Pharmacogenomics Knowledgebase (PharmGKB) are working to standardize methodologies for PGx data analysis and clinical guidance for specific gene-drug pairings (Caudle et al., 2016). However, CPIC concedes that some rare P450 variants may not be included in genetic tests, and that patients may be assigned “wild-type” genetics by default (Crews et al., 2014). When paired with additional uncertainties regarding P450-specific genotype/phenotype associations and the untold numbers of unclassified SNPs, it becomes clear why accurate genetic screening remains challenging, and only one aspect of the therapeutic decision-making process.

Ultimately, the falling cost and broader availability of pyrosequencing technologies support our call for improved RNA sequence strategies for personalized medicine, as accurate identification of the patient’s functional genome is a crucial component of precision medicine. Although transcriptomic analysis offers superior guidance in the design of personalized therapeutic options, its broad implementation will require technical improvements to sample collection and processing that are also problematic for genomic testing. In this regard, complementary metabolomics approaches directed at variant-specific metabolism may provide more feasible, short-term improvements to PGx screening and precision-based approaches to medicine (Beger et al., 2016). Innovative, gene-directed therapeutic technologies such as splice-altering antisense oligonucleotides and CRISPR/Cas9 genome-editing systems may also become feasible tools for manipulating a patient’s transcriptome to optimize therapeutic outcomes. Key examples of splice-switching technology already being investigated to treat human disease are listed in Supplemental Table 2. In this regard, our group has participated in the development of eteplirsen, the new antisense oligonucleotide drug that received accelerated approval from the FDA for the treatment of Duchenne muscular dystrophy (DMD Syed, 2016 Niks and Aartsma-Rus, 2017). Eteplirsen’s development evolved from early studies of exon skipping in the murine dystrophin model (Fall et al., 2006 Fletcher et al., 2007 Adams et al., 2007 Mitrpant et al., 2009), a canine model of DMD (McClorey et al., 2006b), and in human muscle explants (McClorey et al., 2006a). Our group has also employed exon-skipping oligomers to refine the immune response–mediated gene expression of CD45 protein-tyrosine phosphatase in a murine anthrax model (Panchal et al., 2009 Mourich and Iversen, 2009), of interleukin-10 in an Ebola virus lethal-challenge mouse model (Panchal et al., 2014) and of CTLA-4 in a murine model of autoimmune diabetes (Mourich et al., 2014). We hypothesize that similar splice-altering technology may be useful in redirecting the function of drug metabolizing P450s like CYP3A4 (Arora and Iversen, 2001), whose metabolism of drugs like tamoxifen is linked to genotoxicity (Mahadevan et al., 2006). As our appreciation for transcriptome expansion and the mechanisms of alternative P450 gene-splicing evolve, new therapeutic gene-editing options will probably emerge that could scarcely be predicted using genetic testing alone.

In summary, the human cytochrome P450 family transcriptome contains over 965 different variant forms (Table 2), many with common structural features sensitive to alternative splicing events that expand P450 protein diversity. The transcription and processing of P450 gene transcripts is complex and coordinately regulated within the nucleus by multiple factors, including NR signaling via environmental sensors like the peroxisome proliferator–activated receptors (PPARs) PPARγ and PPARα, which interact with the PGC-1α transcriptional coactivator to regulate oxidative metabolism and mitochondrial biogenesis (Wu et al., 1999 Monsalve et al., 2000). Multiple steroids, including products of CYPs 1A, 1B, 2A, 2B, 2C, 2D, 3A, 7A, 17A, 19A, 24A1, and 51A metabolism, bind NRs and lead to interactions with NR coregulators through LxxLL or FxxLF motifs that modulate the assembly of the spliceosome complex and pre-mRNA splicing (Auboeuf et al., 2005). The androgen receptor (AR), which binds multiple metabolites of CYPs 2A, 2C, 2D, 3A, 17A, 19A, and 21A, can also directly interact with nucleolar splicing factors (e.g., U5 small nucleolar ribonuclear protein), indicating a receptor-mediated role in transcription that is coupled to pre-mRNA splicing mechanisms (Zhao et al., 2002). Vitamin D receptor activation mediated by metabolites of CYPs 2R, 2J, 3A, 11A, 27A, 24A, and 27B can also alter P450 gene expression and splicing through NR-mediated crosstalk (e.g., PPARs) transduced via interactions with the retinoid X receptor (Matsuda and Kitagishi, 2013), recruitment of the NCoA62/SKIP coactivator complex (Zhang et al., 2003), and discrete interactions with the heterogeneous nuclear ribonucleoprotein C-splicing factor (Zhou et al., 2015). Traditional VDR signaling in the nucleus is further refined by several nontraditional NR functions operating near the plasma membrane that alter gene expression via modulation of key membrane-based paracrine signaling pathways, mediated by agents like Wnt and epidermal growth factor (Larriba et al., 2014). Vitamin A metabolites (or retinoids) of CYPs 1A, 2B, 2C, and 26, signaling through the retinoic acid receptor and retinoid X receptor, are also known to guide the recruitment of SC35 coactivators to regulate the alternate splicing of protein kinase delta (PKCδ) among other pre-mRNAs (Apostolatos et al., 2010). Collectively, these data reveal a novel P450-based mechanism for adaptive transcriptome remodeling, whereby xenobiotics and endogenous substrates, monitored by one of several tissue- or disease-specific “P450 clouds,” are metabolized in a coordinated fashion that harmonize NR signaling cascades with alternative gene expression and splicing events that promote adaptive responses to cell stress or stimuli (Fig. 4).

Endoxenobiotic crosstalk among cytochrome P450 and nuclear receptor genes coordinate alternative splicing and resemble a primitive immune system. Human tissues are subject to exposure from over 400 FDA-approved drugs, >10,000 xenobiotics, and untold numbers of endogenous substrates and their metabolites (x > 100,000). Cytochrome P450 genes participate in phase I detoxification of many of these compounds, including model substrates benzo[a]pyrene (via CYP3A4) and calcifediol (via CYP24A1). P450 genes are classically induced to silence endoxenobiotic signaling through cognate nuclear receptors, which modulate global gene expression and splicing events by “coloring” or modulating the composition of coregulatory factors that comprise both the transcription complex and the spliceosome complex, which ultimately alter the nature of both ribosome assembly and gene expression (Auboeuf et al., 2005). Model substrates are subject to metabolism by a finite population of P450s in a given tissue, however, and because each gene is sensitive to an infinite number of environmentally sensitive, alternative splicing events, each individual may express a unique, tissue-specific “P450 gene cloud” comprising both wild-type (WT) and splice-altered variant forms (e.g., SV1, SV2, etc.). P450 splice variants can: 1) display reduced ability to metabolize model substrates, 2) function as dominant negatives to sequester compounds from metabolism or potentiate basal NR-mediated signaling, or 3) function as a conformationally distinct protein with alternative metabolic function or cellular role. When coupled with existing paradigms of alternate P450 trafficking and membrane-associated cooperativity, an integrated network of crosstalk among 57 P450 and 48 NR genes begins to emerge, as novel P450 metabolites may engage NR signaling pathways in unique ways that reprogram gene splicing and expression to promote cellular homeostasis in the face of endocrine disruption. NR signaling cascades can alter both the transcriptome and epigenome of an individual, providing an elegant feedback mechanisms for adaptation to cellular stress created by unique personal history and disease status.

In conclusion, the human metabolome adapts to substrate burden through the induction of gene transcription, which helps to maintain homeostasis in a well documented pathway guided by NR binding and signaling events. In this respect, the metabolic response to xenobiotics (via P450 induction) is adaptive in a manner reminiscent of the immune response to viral antigen there is a recognition phase of the chemical by the P450 active site, an activation phase when the chemical (or P450 metabolite) interacts with the NR, and an effector phase in which the coordinated transcription and splicing of P450 transcripts occurs to feedback-modulate NR signaling. The analogy to the immune response is appropriate here in that specific transcript variants are produced in response to a specific chemical stimulus. The ability to tightly control cellular homeostasis via NR-mediated gene expression and alternative splicing implies a high order of sophistication operating in what might be considered a primitive, chemical immune response. Molecules that engage this primitive P450-based immune system transduce transcriptome-wide biologic responses capable of reshaping both the phenotype and “epigenotype” of the cell (via the regulation of both coding and noncoding RNA), allowing for reversible environmental adaptation, as well as imprinting, which in rare cases may persist transgenerationally (Hochberg et al., 2011). Improved knowledge of both the adaptive and maladaptive epigenome remodeling processes induced by xenobiotics may ultimately help reconcile interindividual variability in efficacy and toxicity that plague many FDA-approved drugs. These insights will provide new guidance for developing “individualized” therapeutic strategies more sensitive to a patient’s adaptive transcriptome or functional genome, which as exemplified by P450 superfamily of genes is the ultimate expression of the heritable epigenotype and appears to remain environmentally responsive throughout all phases of the human life cycle.