Does a fully-resolved phylogenetic tree have to be dichotomous?

I read that wherever there is polytomy there is an unresolved pattern of divergence. I don't understand why this is so.

When divergence takes place, is it necessary that there will be division into only 2 paths? Can't there be more than 2 paths with all the paths being equally related to each other (that is, no 2 paths being related more to each other than to a third path)?

In other words, does a fully-resolved phylogenetic tree have to be dichotomous?


In theory, yes: every tree has to be dichotomous. You can understand a trichotomy in a tree as the sum of two dichotomies that happened so close together in time that you cannot tell which came first.

Given a certain population, assume that some individuals colonize a new environment, become reproductively isolated, and form a new species. This is the typical speciation process. In this case, the two species share a common ancestor. If you go back in time, all the individuals of the new species descend from a single individual. The same goes for the original species. If you go back further, these two individuals share an ancestor too. This is the conceptual meaning of the dichotomy, although it is normally impossible to determine exactly which individual was the common ancestor.

Now imagine that the original population formed not one but two new species, and that the second split happened at about the same time as the one described above. It is conceptually possible that a set of three siblings were the origin of three different species (which would be a true trichotomy), but this is not only very unlikely, it is also virtually impossible to prove. Since two closely spaced dichotomies are far more probable than a trichotomy, it is assumed that every tree has to be dichotomous and that a trichotomy is in fact due to a lack of resolution.


6.1: Phylogenetic Trees

  • Contributed by Boundless
  • General Microbiology at Boundless

Scientists use a tool called a phylogenetic tree, a type of diagram, to show the evolutionary pathways and connections among organisms. Scientists consider phylogenetic trees to be a hypothesis of the evolutionary past since one cannot go back to confirm the proposed relationships. In other words, a “tree of life”, as it is sometimes called, can be constructed to illustrate when different organisms evolved and to show the relationships among different organisms.

Unlike a taxonomic classification diagram, a phylogenetic tree can be read like a map of evolutionary history. Many phylogenetic trees have a single lineage at the base representing a common ancestor. Scientists call such trees ‘rooted,’ which means there is a single ancestral lineage (typically drawn from the bottom or left) to which all organisms represented in the diagram relate. Notice in the rooted phylogenetic tree that the three domains (Bacteria, Archaea, and Eukarya) diverge from a single point and branch off. The small branch that plants and animals (including humans) occupy in this diagram shows how recent and minuscule these groups are compared with other organisms. Unrooted trees don’t show a common ancestor but do show relationships among species.

Figure 1: Phylogenetic trees: Both of these phylogenetic trees show the relationship of the three domains of life (Bacteria, Archaea, and Eukarya), but the (a) rooted tree attempts to identify when various species diverged from a common ancestor, while the (b) unrooted tree does not.

In a rooted tree, the branching indicates evolutionary relationships. The point where a split occurs, called a branch point, represents where a single lineage evolved into a distinct new one. A lineage that evolved early from the root and remains unbranched is called a basal taxon. When two lineages stem from the same branch point, they are called sister taxa. A branch with more than two lineages is called a polytomy and serves to illustrate where scientists have not definitively determined all of the relationships. It is important to note that although sister taxa and the lineages of a polytomy do share an ancestor, it does not mean that the groups of organisms split or evolved from each other. Organisms in two taxa may have split apart at a specific branch point, but neither taxon gave rise to the other.

Figure 1: Rooted phylogenetic trees: The root of a phylogenetic tree indicates that an ancestral lineage gave rise to all organisms on the tree. A branch point indicates where two lineages diverged. A lineage that evolved early and remains unbranched is a basal taxon. When two lineages stem from the same branch point, they are sister taxa. A branch with more than two lineages is a polytomy.

Rooted phylogenetic trees can serve as a pathway to understanding evolutionary history. The pathway can be traced from the origin of life to any individual species by navigating through the evolutionary branches between the two points. Also, by starting with a single species and tracing back towards the “trunk” of the tree, one can discover that species’ ancestors, as well as where lineages share a common ancestry. In addition, the tree can be used to study entire groups of organisms.

Another point to mention on phylogenetic tree structure is that rotation at branch points does not change the information. For example, if a branch point was rotated and the taxon order changed, this would not alter the information because the evolution of each taxon from the branch point was independent of the other.
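The rotation rule can be illustrated with a small sketch. It assumes a tree is written as nested Python tuples, which is a hypothetical representation chosen for illustration and not something the text prescribes. Sorting the children at every branch point yields the same canonical form for every rotation of the same tree.

```python
def canonical(tree):
    """Return a rotation-independent form of a nested-tuple tree.

    A leaf is a string; a branch point is a tuple of subtrees.
    """
    if isinstance(tree, str):
        return tree
    # Sort the children by their printed form so drawing order is ignored.
    return tuple(sorted((canonical(child) for child in tree), key=repr))


original = ("Bacteria", ("Archaea", "Eukarya"))
rotated = (("Eukarya", "Archaea"), "Bacteria")    # same tree, branches rotated
different = ("Archaea", ("Bacteria", "Eukarya"))  # a different grouping

same_information = canonical(original) == canonical(rotated)
changed_information = canonical(original) != canonical(different)
```

Only a change in which lineages are grouped together alters the canonical form; swapping the order in which they are drawn does not.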

Many disciplines within the study of biology contribute to understanding how past and present life evolved over time; together, these disciplines contribute to building, updating, and maintaining the “tree of life.” Information is used to organize and classify organisms based on evolutionary relationships in a scientific field called systematics. Data may be collected from fossils, from studying the structure of body parts or molecules used by an organism, and by DNA analysis. By combining data from many sources, scientists can put together the phylogeny of an organism. Since phylogenetic trees are hypotheses, they will continue to change as new types of life are discovered and new information is learned.


What is Dichotomous Key

A dichotomous key is a diagram which allows the identification of a group of organisms by dividing them into two categories repeatedly. These categories are called ‘leads’ while a set of leads at a given point is called a ‘couplet’. When an organism no longer shares its selected characteristics with any other organism, it has been identified. However, the above-selected characteristics have to be immutable, or in other words, they do not change. As examples, coloration, size, and behavioral features may vary among individuals as well as throughout their lifetime. However, physical structures such as the number of limbs and biochemical processes such as reproduction methods are better characteristics.

Figure 1: Dichotomous Tree of Extant Shark Orders

Furthermore, there are two types of dichotomous key, distinguished by how they are represented: a diagrammatic key drawn as a branching flowchart, and a descriptive key written as a numbered series of paired statements. However, a dichotomous key does not provide trustworthy information about the phylogenetic relationships of the organisms it covers.
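As a minimal sketch of how a key is followed, the couplets can be stored as nested tuples and walked until a single taxon remains. The characters and taxa below are invented for illustration and do not come from any published key.

```python
# Each couplet is (character, branch_if_true, branch_if_false).
# A plain string means the organism has been identified.
KEY = (
    "has_backbone",
    ("has_hair", "mammal", ("has_feathers", "bird", "other vertebrate")),
    "invertebrate",
)


def identify(key, traits):
    """Follow one lead per couplet until an identification is reached."""
    while not isinstance(key, str):
        character, if_true, if_false = key
        key = if_true if traits[character] else if_false
    return key


sparrow = {"has_backbone": True, "has_hair": False, "has_feathers": True}
result = identify(KEY, sparrow)
```

Each step discards one of two leads, which is why stable characters such as body structures make better couplets than coloration or size.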


Organizing Life on Earth

In scientific terms, the evolutionary history and relationship of an organism or group of organisms is called its phylogeny. A phylogeny describes the relationships of an organism, such as from which organisms it is thought to have evolved, to which species it is most closely related, and so forth. Phylogenetic relationships provide information on shared ancestry but not necessarily on how organisms are similar or different.

Phylogenetic Trees

Scientists use a tool called a phylogenetic tree to show the evolutionary pathways and connections among organisms. A phylogenetic tree is a diagram used to reflect evolutionary relationships among organisms or groups of organisms. Scientists consider phylogenetic trees to be a hypothesis of the evolutionary past since one cannot go back to confirm the proposed relationships. In other words, a “tree of life” can be constructed to illustrate when different organisms evolved and to show the relationships among different organisms ([link]).

Unlike a taxonomic classification diagram, a phylogenetic tree can be read like a map of evolutionary history. Many phylogenetic trees have a single lineage at the base representing a common ancestor. Scientists call such trees rooted, which means there is a single ancestral lineage (typically drawn from the bottom or left) to which all organisms represented in the diagram relate. Notice in the rooted phylogenetic tree that the three domains (Bacteria, Archaea, and Eukarya) diverge from a single point and branch off. The small branch that plants and animals (including humans) occupy in this diagram shows how recent and minuscule these groups are compared with other organisms. Unrooted trees don’t show a common ancestor but do show relationships among species.

In a rooted tree, the branching indicates evolutionary relationships ([link]). The point where a split occurs, called a branch point, represents where a single lineage evolved into a distinct new one. A lineage that evolved early from the root and remains unbranched is called a basal taxon. When two lineages stem from the same branch point, they are called sister taxa. A branch with more than two lineages is called a polytomy and serves to illustrate where scientists have not definitively determined all of the relationships. It is important to note that although sister taxa and the lineages of a polytomy do share an ancestor, it does not mean that the groups of organisms split or evolved from each other. Organisms in two taxa may have split apart at a specific branch point, but neither taxon gave rise to the other.

The diagrams above can serve as a pathway to understanding evolutionary history. The pathway can be traced from the origin of life to any individual species by navigating through the evolutionary branches between the two points. Also, by starting with a single species and tracing back towards the "trunk" of the tree, one can discover that species' ancestors, as well as where lineages share a common ancestry. In addition, the tree can be used to study entire groups of organisms.

Another point to mention on phylogenetic tree structure is that rotation at branch points does not change the information. For example, if a branch point was rotated and the taxon order changed, this would not alter the information because the evolution of each taxon from the branch point was independent of the other.

Many disciplines within the study of biology contribute to understanding how past and present life evolved over time; together, these disciplines contribute to building, updating, and maintaining the “tree of life.” Information is used to organize and classify organisms based on evolutionary relationships in a scientific field called systematics. Data may be collected from fossils, from studying the structure of body parts or molecules used by an organism, and by DNA analysis. By combining data from many sources, scientists can put together the phylogeny of an organism. Since phylogenetic trees are hypotheses, they will continue to change as new types of life are discovered and new information is learned.

Limitations of Phylogenetic Trees

It may be easy to assume that more closely related organisms look more alike, and while this is often the case, it is not always true. If two closely related lineages evolved under significantly varied surroundings or after the evolution of a major new adaptation, it is possible for the two groups to appear more different than other groups that are not as closely related. For example, the phylogenetic tree in [link] shows that lizards and rabbits both have amniotic eggs, whereas frogs do not; yet lizards and frogs appear more similar than lizards and rabbits.

Another aspect of phylogenetic trees is that, unless otherwise indicated, the branches do not account for length of time, only the evolutionary order. In other words, a long branch does not typically mean more time passed, nor does a short branch mean less time passed, unless specified on the diagram. For example, in [link], the tree does not indicate how much time passed between the evolution of amniotic eggs and hair. What the tree does show is the order in which things took place. Again using [link], the tree shows that the oldest trait is the vertebral column, followed by hinged jaws, and so forth. Remember that any phylogenetic tree is a part of the greater whole, and like a real tree, it does not grow in only one direction after a new branch develops. So, for the organisms in [link], just because a vertebral column evolved does not mean that invertebrate evolution ceased; it only means that a new branch formed. Also, groups that are not closely related, but evolve under similar conditions, may appear more phenotypically similar to each other than to a close relative.

The Levels of Classification

Taxonomy (which literally means “arrangement law”) is the science of classifying organisms to construct internationally shared classification systems with each organism placed into more and more inclusive groupings. Think about how a grocery store is organized. One large space is divided into departments, such as produce, dairy, and meats. Then each department further divides into aisles, then each aisle into categories and brands, and then finally a single product. This organization from larger to smaller, more specific categories is called a hierarchical system.

The taxonomic classification system (also called the Linnaean system after its inventor, Carl Linnaeus, a Swedish botanist, zoologist, and physician) uses a hierarchical model. Moving from the point of origin, the groups become more specific, until one branch ends as a single species. For example, after the common beginning of all life, scientists divide organisms into three large categories called a domain: Bacteria, Archaea, and Eukarya. Within each domain is a second category called a kingdom. After kingdoms, the subsequent categories of increasing specificity are: phylum, class, order, family, genus, and species ([link]).

The kingdom Animalia stems from the Eukarya domain. For the common dog, the classification levels would be as shown in [link]. Therefore, the full name of an organism technically has eight terms. For the dog, it is: Eukarya, Animalia, Chordata, Mammalia, Carnivora, Canidae, Canis, and lupus. Notice that each name is capitalized except for species, and the genus and species names are italicized. Scientists generally refer to an organism only by its genus and species, which is its two-word scientific name, in what is called binomial nomenclature. Therefore, the scientific name of the dog is Canis lupus. The name at each level is also called a taxon. In other words, dogs are in the order Carnivora. Carnivora is the name of the taxon at the order level; Canidae is the taxon at the family level, and so forth. Organisms also have a common name that people typically use, in this case, dog. Note that the dog is additionally a subspecies: the “familiaris” in Canis lupus familiaris. Subspecies are members of the same species that are capable of mating and reproducing viable offspring, but they are considered separate subspecies due to geographic or behavioral isolation or other factors.

[link] shows how the levels move toward specificity with other organisms. Notice how the dog shares a domain with the widest diversity of organisms, including plants and butterflies. At each sublevel, the organisms become more similar because they are more closely related. Historically, scientists classified organisms using characteristics, but as DNA technology developed, more precise phylogenies have been determined.

At what levels are cats and dogs considered to be part of the same group?

Recent genetic analysis and other advancements have found that some earlier phylogenetic classifications do not align with the evolutionary past; therefore, changes and updates must be made as new discoveries occur. Recall that phylogenetic trees are hypotheses and are modified as data becomes available. In addition, classification historically has focused on grouping organisms mainly by shared characteristics and does not necessarily illustrate how the various groups relate to each other from an evolutionary perspective. For example, despite the fact that a hippopotamus resembles a pig more than a whale, the hippopotamus may be the closest living relative of the whale.

Section Summary

Scientists continually gain new information that helps understand the evolutionary history of life on Earth. Each group of organisms went through its own evolutionary journey, called its phylogeny. Each organism shares relatedness with others, and based on morphologic and genetic evidence, scientists attempt to map the evolutionary pathways of all life on Earth. Historically, organisms were organized into a taxonomic classification system. However, today many scientists build phylogenetic trees to illustrate evolutionary relationships.

Art Connections

[link] At what levels are cats and dogs considered to be part of the same group?

[link] Cats and dogs are part of the same group at five levels: both are in the domain Eukarya, the kingdom Animalia, the phylum Chordata, the class Mammalia, and the order Carnivora.

Review Questions

What is used to determine phylogeny?

What do scientists in the field of systematics accomplish?

  1. discover new fossil sites
  2. organize and classify organisms
  3. name new species
  4. communicate among field biologists

Which statement about the taxonomic classification system is correct?

  1. There are more domains than kingdoms.
  2. Kingdoms are the top category of classification.
  3. Classes are divisions of orders.
  4. Subspecies are the most specific category of classification.

On a phylogenetic tree, which term refers to lineages that diverged from the same place?

Free Response

How does a phylogenetic tree relate to the passing of time?

The phylogenetic tree shows the order in which evolutionary events took place and in what order certain characteristics and organisms evolved in relation to others. It does not relate to time.

Some organisms that appear very closely related on a phylogenetic tree may not actually be closely related. Why is this?

In most cases, organisms that appear closely related actually are; however, there are cases where organisms evolved through convergence and appear closely related but are not.

List the different levels of the taxonomic classification system.

domain, kingdom, phylum, class, order, family, genus, species


Phylogenetic Tools for Comparative Biology

An R phylogenetics user just emailed me about a post I wrote (archived here) describing how to resolve multifurcations before calculating contrasts for PIC regression. I just wanted to take a couple of moments to elaborate on this issue.

The contrasts algorithm (Felsenstein 1985) requires a fully dichotomous (i.e. fully bifurcating) phylogenetic tree with branch lengths. By computing the differences between tip & node states, the contrasts algorithm creates a transformed dataset which (under a Brownian model of evolution) is free from phylogenetic dependence.

However, if your phylogeny is not fully dichotomous, it can be rendered dichotomous by arbitrarily resolving all polytomies through the addition of one or more internal branches of length 0.0 after which contrasts can be calculated as normal. Since every trifurcation can be resolved in at least three ways (and higher order multifurcations in even more different ways), we may have many options in how we choose to resolve our non-dichotomous tree. For instance, if our tree is fully bifurcating except at two nodes which are each trifurcations, we have 3 × 3 = 9 different ways of rendering our tree dichotomous via the addition of internal zero length branches.
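To make the counting concrete, here is a small sketch. It assumes a rooted tree and uses the standard result that k labelled subtrees can be joined into a rooted binary tree in (2k - 3)!! ways; the function names are mine, not part of ape or phytools.

```python
from math import prod


def resolutions(k):
    """Number of rooted binary resolutions of a node with k descendants."""
    count = 1
    # (2k - 3)!! = 3 * 5 * ... * (2k - 3); the product is empty for k = 2.
    for odd in range(3, 2 * k - 2, 2):
        count *= odd
    return count


def total_resolutions(polytomy_sizes):
    """Polytomies are resolved independently, so the counts multiply."""
    return prod(resolutions(k) for k in polytomy_sizes)


two_trifurcations = total_resolutions([3, 3])
```

A trifurcation gives 3 resolutions and a four-way multifurcation gives 15, so the number of candidate dichotomous trees grows very quickly with the number and size of the polytomies.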

Luckily, it turns out not to matter how we resolve the tree: statistical analysis of the contrasts transformed data is unaffected. To prove this to ourselves, let's look at two different examples:

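> require(phytools)
> # first let's create a random tree with polytomies
> tree <- rtree(n = 100)
> # pick nodes to collapse
> nodes <- sort(sample(2:tree$Nnode + length(tree$tip.label), 30))
> # match the nodes to tree$edge & set to zero
> tree$edge.length[match(nodes, tree$edge[, 2])] <- 0
> tree <- di2multi(tree)
> plotTree(tree, ftype = "off")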

Close inspection of this tree reveals that it has many multifurcations, and this is corroborated by the fact that the tree can be seen to have 30 fewer nodes (just as we'd hoped) than the number expected for a fully dichotomous 100 taxon tree:

The next thing we'll do is simulate under an arbitrary linear model, here y = 1.5 + 0.75x + ε, but we will do so in such a way that the data are phylogenetically autocorrelated:

OK, now let's resolve the tree randomly, compute contrasts, and fit a bivariate regression model to the contrasts:

> lm(pic.y ~ pic.x - 1) # fits the regression model without an intercept

To show that we get the same result from contrasts regression regardless of the random resolution that we use, let's try a different random resolution and fit the regression model again:

To assure ourselves that this is not just magical good luck at having obtained the same resolution in each case we can compute:

We have to avoid the following tempting shortcut:

Call:
lm(formula = pic(y, multi2di(tree)) ~ pic(x, multi2di(tree)) - 1)

Coefficients:
pic(x, multi2di(tree))
              -0.01451

or variants thereof, because in this case different random resolutions are used for x & y, which is not permitted. However, now knowing that the specific choice of resolution doesn't matter (so long as it is the same for dependent & independent variables), we can also see that the following will work:

Call:
lm(formula = pic(y, multi2di(tree, random = FALSE)) ~ pic(x,
    multi2di(tree, random = FALSE)) - 1)

Coefficients:
pic(x, multi2di(tree, random = FALSE))
                                0.7761

In this case, with multi2di(..., random=FALSE), polytomies are not resolved randomly but in the order in which they appear in the tree (which appears to be right-to-left in the input Newick string).
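For readers who want to check the invariance by hand, here is a self-contained sketch outside R. It implements Felsenstein's contrasts directly on nested tuples (a hypothetical tree representation, not ape's) and compares two zero-length resolutions of one trifurcation. The tip values and branch lengths are made up for illustration.

```python
import math


def contrasts(node, values):
    """Felsenstein's contrasts on a binary tree.

    A tip is (name, branch_length); an internal node is
    ((left, right), branch_length).
    Returns (state, adjusted_length, list_of_contrasts).
    """
    label, length = node
    if isinstance(label, str):
        return values[label], length, []
    left, right = label
    x1, v1, c1 = contrasts(left, values)
    x2, v2, c2 = contrasts(right, values)
    standardized = (x1 - x2) / math.sqrt(v1 + v2)
    state = (x1 * v2 + x2 * v1) / (v1 + v2)  # weighted ancestral value
    extra = v1 * v2 / (v1 + v2)              # lengthens the stem branch
    return state, length + extra, c1 + c2 + [standardized]


def slope_through_origin(tree, x, y):
    """Regress the y contrasts on the x contrasts without an intercept."""
    cx = contrasts(tree, x)[2]
    cy = contrasts(tree, y)[2]
    return sum(a * b for a, b in zip(cx, cy)) / sum(a * a for a in cx)


a, b, c = ("a", 1.0), ("b", 2.0), ("c", 3.0)
# Two ways of resolving the trifurcation (a, b, c) with a zero-length branch.
resolution_1 = ((((a, b), 0.0), c), 0.0)
resolution_2 = ((((a, c), 0.0), b), 0.0)

x = {"a": 1.0, "b": 2.0, "c": 4.0}
y = {"a": 2.0, "b": 2.5, "c": 5.5}
slope_1 = slope_through_origin(resolution_1, x, y)
slope_2 = slope_through_origin(resolution_2, x, y)
```

The individual contrasts differ between the two resolutions, but their sums of squares and cross-products do not, so the fitted slope is the same. As in the R examples, the same resolution must be used for x and y within a single fit.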


ABSTRACT. Nowadays, there are many phylogeny reconstruction methods, each with advantages and disadvantages. We explored the advantages of each method, putting together the common parts of trees constructed by several methods, by means of a consensus computation. A number of phylogenetic consensus methods are already known. Unfortunately, there is also a taboo concerning consensus methods, because most biologists see them mainly as comparators and not as phylogenetic tree constructors. We challenged this taboo by defining a consensus method that builds a fully resolved phylogenetic tree based on the most common parts of fully resolved trees in a given collection. We also generated results showing that this consensus is in a way a kind of “median” of the input trees; as such, it can be closer to the correct tree in many situations.

Key words: Phylogeny, Phylogenetic consensus, Splits

There are many cases in which, having a collection of similar elements, one desires to compare these elements or combine the information of the whole collection into a single, consensus element. Probably, the most widely known consensus in bioinformatics is the consensus between sequences. Sequence consensi may be used both for comparison, for instance, to point out probable SNPs in a collection of very similar sequences, and for the reconstruction of a larger or more accurate sequence, in the case of EST clustering or genome assembling.

In the field of phylogenetics, a consensus is mostly used to summarize similarities between trees. As a result, we have many consensus methods designed to point out the common parts of phylogenetic trees. Examples include strict and semi-strict consensus (Kitching, 1998), Nelson-Page consensus (Bryant, 2003), majority rule consensus (Margush and McMorris, 1981) and Adams consensus (Bryant, 2003). Only a few methods, such as the asymmetric median tree (AMT) proposed by Phillips and Warnow (1996), are designed to be used as a final step in tree reconstruction. The main difference between AMT and the other consensus methods is that the former is designed to produce a tree that is a substitute for the collection of trees used to build it whereas the trees of the other consensus methods are just an image, or a summary, of the whole collection.

One problem with the AMT is that it relies on an optimization problem that is NP-hard in general. We define a new kind of consensus tree, which we call the most probable tree. Its definition relies on a maximization problem that can be solved efficiently by dynamic programming, yielding a polynomial time algorithm for its computation.

In addition, we tested the properties of our new method with several experiments. The first batch of experiments, designed to determine to what extent our tree can be said to be “in between” the trees used to construct it, used artificially generated data. A final experiment, which shows a particular case where our tree came closer to the “original” or “true” tree than most reconstruction methods, is based on the ribosomal RNA sequences used as a standard reference of the phylogeny of living organisms.

In the present study, we only consider unrooted phylogenetic trees. Therefore, every time that the term phylogenetic tree, or simply tree, appears in this text, it refers to an unrooted phylogenetic tree. A phylogenetic tree is basically a set of nodes connected by branches, as in Figure 1. The nodes connected to a certain node are its neighbors. If a node has only one neighbor, it is called a leaf; otherwise it is an internal node. There are no nodes with exactly two neighbors in the trees, since they are unrooted. Nodes with more than three neighbors are called polytomies. In our work, both the trees in the given collection and the consensus trees have no polytomy, and therefore they are called fully resolved.


We denote the set of leaves of a phylogenetic tree by the letter L. Any non-empty subset of L is called a cluster defined on L. When comparing two different clusters, if their sizes differ, we say that the cluster with fewer elements is smaller than the other. For instance, if we have two different clusters R and S, with R having 2 elements and S having 3, then R is smaller than S and we can write R < S. If the two clusters have the same number of elements, we have the situation shown in Figure 2. In this case, we use an enumeration of the elements of L as a tiebreaker as follows:

1. Consider the elements that are in exactly one of the clusters only.

2. Choose the element numbered with the smallest number.

3. The cluster that has the chosen element is the smallest one.


Quitzau and Meidanis (2005) proved that, when using a fixed enumeration, if one has three different clusters A, B and C, with A < B and B < C, then A < C. This is an important result because it allows us to sort a collection of clusters and find a cluster very quickly by binary search.
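The ordering can be sketched as a comparison function. Representing clusters as frozensets of integers, with each integer being the element's position in the fixed enumeration, is an assumption made for this illustration.

```python
from functools import cmp_to_key


def compare_clusters(r, s):
    """Negative if r < s, positive if r > s, zero if they are equal."""
    if len(r) != len(s):
        return len(r) - len(s)      # the cluster with fewer elements is smaller
    only_one = r ^ s                # elements in exactly one of the clusters
    if not only_one:
        return 0
    # The cluster holding the lowest-numbered such element is the smaller one.
    return -1 if min(only_one) in r else 1


clusters = [
    frozenset({2, 3, 4}),
    frozenset({1, 5}),
    frozenset({1, 3, 4}),
    frozenset({4}),
]
ordered = sorted(clusters, key=cmp_to_key(compare_clusters))
```

Because the relation is transitive, an ordinary sort is valid, and a sorted list of clusters can then be searched by bisection.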

Two clusters A and B such that A ∩ B = ∅ or A ∪ B = A or A ∪ B = B are called compatible; otherwise they are incompatible. Clusters can be combined to form trees only when they are compatible.
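In code, the compatibility condition reduces to "disjoint or nested". This is a minimal sketch using Python sets; the function name is mine.

```python
def compatible(a, b):
    """True when the clusters are disjoint or one contains the other."""
    a, b = frozenset(a), frozenset(b)
    # The union equals a exactly when b is a subset of a, and vice versa.
    return not (a & b) or b <= a or a <= b


disjoint = compatible({1, 2}, {3})
nested = compatible({1, 2}, {1, 2, 3})
overlapping = compatible({1, 2}, {2, 3})
```

Here `disjoint` and `nested` are true, while `overlapping` is false because the two clusters share element 2 without either containing the other.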

It is easy to see that the removal of any branch in a phylogenetic tree transforms it into a pair of rooted phylogenetic trees with smaller sets of leaves, as Figure 3 exemplifies. The sets of leaves of the new trees are clearly clusters of L and have the property of being disjoint; that is, they have no common element. At the same time, their union is the set of leaves of the original phylogenetic tree. A pair of clusters with these two properties is called a split of the set of leaves. Note that, since a split is a pair of different clusters, it is always possible to compare them and define which one is the small cluster and which is the large cluster of a split. In addition, two splits are compatible if their small clusters are compatible; otherwise they are incompatible (Quitzau and Meidanis, 2005).


Every single branch in a phylogenetic tree corresponds to a different split on its set of leaves. The set of all splits of a phylogenetic tree corresponds to its split system. We denote the split system of a phylogenetic tree T by S[T]. Finally, two different splits found in the split system of the same phylogenetic tree are always compatible.
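The correspondence between branches and splits can be sketched as follows. The sketch assumes the unrooted tree is given as an adjacency dictionary; that representation and the node names are invented for illustration. Each branch is cut in turn, and the leaves on either side form the split.

```python
def split_system(adjacency):
    """Return the set of splits of an unrooted tree {node: [neighbours]}."""
    leaves = {n for n, nb in adjacency.items() if len(nb) == 1}

    def leaves_behind(node, parent):
        # Leaves reachable from `node` without crossing back to `parent`.
        if node in leaves:
            return {node}
        found = set()
        for nb in adjacency[node]:
            if nb != parent:
                found |= leaves_behind(nb, node)
        return found

    splits = set()
    for u in adjacency:
        for v in adjacency[u]:
            side = frozenset(leaves_behind(v, u))
            # A split is an unordered pair of complementary clusters.
            splits.add(frozenset({side, frozenset(leaves - side)}))
    return splits


# Fully resolved tree on four leaves: a and b join at x, c and d join at y.
quartet = {
    "a": ["x"], "b": ["x"], "c": ["y"], "d": ["y"],
    "x": ["a", "b", "y"], "y": ["c", "d", "x"],
}
splits = split_system(quartet)
```

The quartet has five branches and therefore five splits: one trivial split per leaf, plus the split separating {a, b} from {c, d}.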

It is possible to measure the distance between two trees by counting the minimum number of changes that must be made in the split system of the first tree in order to obtain the split system of the second one. There are only two possible changes that can be made in a split system: the addition and the removal of a split. The minimum number of changes is called split distance (Waterman, 1995; Quitzau and Meidanis, 2005) and may be calculated using the formula:


$$d(T_1, T_2) = |S[T_1]| + |S[T_2]| - 2\,|S[T_1] \cap S[T_2]|,$$

where T_1 and T_2 are the first and the second trees, respectively, and |S[T]| denotes the size, or the number of elements, of the split system S[T]. Figure 4 shows an example of two trees and one of the minimal sequences of operations that transforms the first tree into the second.
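A short sketch of the split distance follows. It counts the splits that must be removed from the first system plus those that must be added to reach the second, which is the two sizes minus twice the number of shared splits. The helper names and the four-leaf example are assumptions made for illustration.

```python
def split(side, leaves):
    """Build a split as an unordered pair of complementary clusters."""
    side = frozenset(side)
    return frozenset({side, frozenset(leaves) - side})


def split_distance(s1, s2):
    """Removals plus additions needed to turn split system s1 into s2."""
    return len(s1) + len(s2) - 2 * len(s1 & s2)


leaves = "abcd"
trivial = {split(x, leaves) for x in leaves}   # one split per leaf branch
tree_1 = trivial | {split("ab", leaves)}       # groups a with b
tree_2 = trivial | {split("ac", leaves)}       # groups a with c
distance = split_distance(tree_1, tree_2)
```

Here the distance is 2: one removal (the split grouping a with b) and one addition (the split grouping a with c).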


Sets of small clusters

A set of clusters that are all subsets of the same set X is an n-tree if and only if it has the four properties below:

1. It contains the cluster X.

2. It contains a cluster {i} for all elements i ∈ X.

3. It does not contain the empty set, since it is not a cluster.

4. All its pairs of clusters are compatible.
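The four properties translate directly into a check. This is a sketch with clusters as Python sets and a function name of my own choosing; the extra subset test reflects the requirement that every cluster be a subset of X.

```python
def is_n_tree(clusters, x):
    """Check the n-tree properties for a family of clusters on the set x."""
    clusters = {frozenset(c) for c in clusters}
    x = frozenset(x)

    def compatible(a, b):
        return not (a & b) or a <= b or b <= a

    return (
        x in clusters                                      # property 1
        and all(frozenset({i}) in clusters for i in x)     # property 2
        and frozenset() not in clusters                    # property 3
        and all(c <= x for c in clusters)                  # subsets of x
        and all(compatible(a, b)                           # property 4
                for a in clusters for b in clusters)
    )


base = [{1, 2, 3}, {1}, {2}, {3}, {1, 2}]
valid = is_n_tree(base, {1, 2, 3})
clash = is_n_tree(base + [{2, 3}], {1, 2, 3})   # {1, 2} and {2, 3} overlap
```

The first family is an n-tree; adding {2, 3} breaks property 4, because it overlaps {1, 2} without either cluster containing the other.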

If T is a phylogenetic tree, then the set of all small clusters of S[T], denoted by F[T], is a set of disjoint n-trees. In particular, if T is fully resolved, then F[T] has exactly three disjoint maximal n-trees.

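Let 𝒯 be a collection of phylogenetic trees, S be a split, and S_s the small cluster of S. Then the weight of S with respect to 𝒯, p(S, 𝒯), is the relative frequency of S in 𝒯, given by the formula

$$p(S, \mathcal{T}) = \frac{|\{T \in \mathcal{T} : S \in S[T]\}|}{|\mathcal{T}|},$$

which corresponds to the probability of finding S in 𝒯. The weight can also be defined for clusters, as follows:

$$p(C, \mathcal{T}) = \frac{|\{T \in \mathcal{T} : C \in F[T]\}|}{|\mathcal{T}|}.$$

In fact, recalling that S_s denotes the small cluster of S, we have

$$p(S, \mathcal{T}) = p(S_s, \mathcal{T}).$$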


The most probable tree

Let the weight p(T, 𝒯) of a phylogenetic tree T with respect to a collection of phylogenetic trees 𝒯 be defined as follows:

$$p(T, \mathcal{T}) = \prod_{S \in S[T]} p(S, \mathcal{T}).$$

Note that p(T, 𝒯) should correspond to the probability of finding T in 𝒯 if the choice of the splits of S[T] could be made independently. Therefore, we call T a most probable tree of a collection 𝒯 if the following conditions are satisfied:

1. T is a fully resolved phylogenetic tree.

2. There is no fully resolved phylogenetic tree T* such that p(T, 𝒯) < p(T*, 𝒯).
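On four leaves the definition can be checked by brute force, since only three fully resolved unrooted trees exist. The sketch below assumes that the tree weight is the product of the relative frequencies of its splits, as suggested by the remark about independent choices. All names and the example collection are invented for illustration.

```python
from math import prod

LEAVES = "abcd"


def split(side):
    """Build a split of LEAVES as a pair of complementary clusters."""
    side = frozenset(side)
    return frozenset({side, frozenset(LEAVES) - side})


def quartet(pair):
    """Split system of the fully resolved tree that groups `pair` together."""
    return frozenset({split(x) for x in LEAVES} | {split(pair)})


def split_weight(s, collection):
    """Relative frequency of split s among the trees of the collection."""
    return sum(s in tree for tree in collection) / len(collection)


def tree_weight(tree, collection):
    """Assumed weight: the product of the weights of the tree's splits."""
    return prod(split_weight(s, collection) for s in tree)


collection = [quartet("ab"), quartet("ab"), quartet("ac")]
candidates = {name: quartet(name) for name in ("ab", "ac", "ad")}
weights = {name: tree_weight(t, collection) for name, t in candidates.items()}
best = max(weights, key=weights.get)
```

Trivial splits occur in every tree and contribute a factor of 1, so on this toy collection the ranking is decided entirely by the internal split. Enumeration of this kind is only feasible for tiny leaf sets, which is the motivation for the dynamic programming approach.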

The algorithm presented in this section takes advantage of the correspondence between split systems and disjoint sets of n-trees and finds trios of n-trees instead of compatible split systems. To begin with, we rewrite the tree weight function as a function of the small clusters of a phylogenetic tree and the collection of trees 𝒯. The weight of an n-tree ψ can be defined as

$$p(\psi, \mathcal{T}) = \prod_{C \in \psi} p(C, \mathcal{T}).$$

Therefore, the weight function used in the algorithm is:

$$p(T, \mathcal{T}) = \prod_{\psi} p(\psi, \mathcal{T}),$$

with the product taken over the maximal n-trees ψ of F[T], knowing that the maximal n-trees of F[T] are disjoint (Quitzau and Meidanis, 2005).

We now make some considerations about trivial n-trees and the combination of n-trees. If a cluster C has only a single element, we may trivially associate an n-tree with it, since the set ψ_C = {C} satisfies all the n-tree conditions. Note that, in this case, the trivial n-tree is maximal, therefore fully resolved. Furthermore, if A and B are two disjoint clusters, ψ_A is an n-tree on A and ψ_B is an n-tree on B, then the set ψ_C = {A ∪ B} ∪ ψ_A ∪ ψ_B is an n-tree on C = A ∪ B. In this case, if ψ_A and ψ_B are fully resolved, then ψ_C is also a fully resolved n-tree.

We call a cluster C solved when there is at least one fully resolved n-tree ψ_C associated with it and p(ψ_C, T) is maximal.

The algorithm uses the dynamic programming paradigm, using solutions to smaller problem instances to build solutions for larger instances. During an algorithm run, the cluster solutions and the phylogenetic trees with maximal weights are stored in a data structure constructed with the building blocks shown in Figure 5. The most basic type is the “Cluster”, which records the main information of a cluster. Its four fields include elements, which stores the cluster elements; p, which stores the cluster relative frequency; and bsl, which stores a list of pairs of clusters whose best solutions, together with the represented cluster, form an n-tree of maximal weight. The “Tree” type stores a trio of disjoint clusters that cover the set of leaves. The “Solution” type stores a pair of disjoint clusters. The field bsl of the “Cluster” type is actually a linked list of “Solutions”.


Two important sub-routines are the one that stores the best solutions found for a cluster, called Cluster, and the one that stores the best phylogenetic trees, called Forest. Both are described below:



These two sub-routines are almost identical. In fact, the only differences between them are the formula of the calculated value and the type of the elements stored in the linked lists. In the case of Cluster, the calculated value is the weight of an n-tree, which is the product of the weights of two smaller n-trees and the relative frequency of a cluster. The element stored in this case is a “Solution”. Concerning the Forest sub-routine, the calculated value is a phylogenetic tree weight and the stored type is a “Tree”.

Both sub-routines consist of two analogous comparisons leading to one of three different actions. At line 1, the newly calculated weight is tested, and if it is greater than the stored weight, lines 2 and 3 discard the old list and initialize a new one, with the newly calculated weight. Otherwise, the sub-routines test whether the new weight is equal to the stored one. If it is, the new “Solution”/“Tree” is added to the best solution list (bsl); otherwise it is discarded. It is not difficult to see that the running time of each sub-routine is bounded by a constant.
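The three actions described above (replace, append, discard) can be sketched as a single helper shared by both sub-routines. This is an illustrative reconstruction from the prose, not the original pseudo-code; the names `Best` and `record` are hypothetical, and exact fractions are used so that the equality test is reliable.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class Best:
    """Stored weight plus the list of items (bsl) that achieve it."""
    weight: Fraction = Fraction(0)
    bsl: list = field(default_factory=list)


def record(best, weight, item):
    """Keep `item` only if its weight matches or beats the stored weight."""
    if weight > best.weight:
        # Strictly better: discard the old list and start a new one.
        best.weight = weight
        best.bsl = [item]
    elif weight == best.weight:
        # Tie: the item joins the current best solutions.
        best.bsl.append(item)
    # Otherwise the item is discarded.


best = Best()
for w, item in [(Fraction(1, 3), "x"), (Fraction(2, 3), "y"),
                (Fraction(2, 3), "z"), (Fraction(1, 2), "w")]:
    record(best, w, item)
```

After the loop, `best.bsl` holds "y" and "z", the two items tied at weight 2/3. Each call performs at most two comparisons and one list operation, which matches the constant-time bound stated above.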

The core algorithm is presented in pseudo-code below. The procedure Small just extracts the small clusters of T and calculates their relative frequencies, returning a sorted array of clusters.


The core of the algorithm is quite simple. It starts by creating an array of clusters at line 1. This array, called Small, contains all small clusters found in the collection T of fully resolved phylogenetic trees and is sorted in increasing order. Lines 2 and 3 create an empty list of most probable trees, and initialize the weight of the best tree found, respectively. After that, the algorithm finds at least one maximal n-tree for each cluster in a bootstrap fashion. In other words, it trivially solves the singleton clusters and then starts to build solutions for larger clusters using the already-solved clusters. This is achieved by processing all the clusters A ∈ Small in increasing order and then analyzing all pairs formed by A and the other clusters B such that B < A. Only pairs formed by disjoint clusters are further analyzed. These pairs fall in exactly one of three cases:

A ∪ B ∈ Small: In this case, for all n-trees ψ_A such that p(ψ_A, T) is maximal and for all n-trees ψ_B such that p(ψ_B, T) is maximal, {A ∪ B} ∪ ψ_A ∪ ψ_B may be an n-tree on A ∪ B with maximal weight. This case is treated in lines 9 and 10.

L \ (A ∪ B) ∈ Small: In this case, A, B and L \ (A ∪ B) are three disjoint sets that cover the set of leaves. Therefore, the n-trees associated with these clusters together correspond to a fully resolved phylogenetic tree, and the weight of this tree is compared to the best weight stored so far, in lines 13 and 14.

Useless case: In this case, A È B is not part of a split found in the collection T of phylogenetic trees. Such a pair must be discarded.

When all possible pairs have been analyzed, the clusters in Small are organized in a structure, such as the one shown in Figure 6. Quitzau and Meidanis (2005) proved that, after the execution of the algorithm, the returned structure represents all and only the most probable trees for the collection T given as input.
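The procedure described above can be sketched in Python as follows. This is a simplified reading of the prose, not the published pseudo-code, and it rests on several assumptions: each input tree is given as the set of its small clusters (singletons included), clusters are ordered by size with ties broken alphabetically, and a trio of clusters is scored only when its third cluster precedes both others in that order, so that each trio is evaluated once with all three clusters already solved. All names are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def most_probable_trees(collection, leaves):
    """Return (best weight, list of cluster trios) for a collection of trees."""
    leaves = frozenset(leaves)
    counts = Counter(c for tree in collection for c in tree)
    p = {c: Fraction(n, len(collection)) for c, n in counts.items()}

    # Assumed ordering of Small: by size, ties broken alphabetically.
    small = sorted(p, key=lambda c: (len(c), sorted(c)))
    rank = {c: i for i, c in enumerate(small)}

    # Singletons are solved trivially; larger clusters start unsolved.
    weight = {c: (p[c] if len(c) == 1 else Fraction(0)) for c in small}
    bsl = {c: [] for c in small}
    best_weight, forest = Fraction(0), []

    for i, a in enumerate(small):
        for b in small[:i]:                      # all B < A
            if not a.isdisjoint(b):
                continue
            union = a | b
            rest = leaves - union
            if union in p:                       # first case: A ∪ B in Small
                w = p[union] * weight[a] * weight[b]
                if w > weight[union]:
                    weight[union], bsl[union] = w, [(a, b)]
                elif w == weight[union]:
                    bsl[union].append((a, b))
            elif rest in p and rank[rest] < rank[b]:   # second case: a trio
                w = weight[a] * weight[b] * weight[rest]
                if w > best_weight:
                    best_weight, forest = w, [(a, b, rest)]
                elif w == best_weight:
                    forest.append((a, b, rest))
            # Otherwise the pair is useless and is discarded.
    return best_weight, forest


# Hypothetical collection on five leaves: two copies of one tree, one of another.
singles = {frozenset(x) for x in "abcde"}
t1 = singles | {frozenset("ab"), frozenset("de")}
t2 = singles | {frozenset("ab"), frozenset("cd")}
best_weight, forest = most_probable_trees([t1, t1, t2], "abcde")
```

For this input the best weight is 2/3 and the only trio returned is {d, e}, {a, b}, {c}, which corresponds to the tree that occurs twice in the collection. An odd number of leaves is used so that no split has two sides of equal size.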


The algorithm running time

Every fully resolved phylogenetic tree has 2l - 3 edges, where l is the number of leaves. If collection T has t fully resolved phylogenetic trees, then there are 2lt - 3t small clusters to be inserted into the array Small. We implemented the algorithm using an array of bits to represent the clusters. Therefore, the comparison between two clusters takes O(l) time. To create the Small array, we need to find the position of 2lt - 3t clusters in the array by binary search, which takes O(l²t lg s - 3lt lg s), where s = |Small|. In addition, s clusters must be inserted in the array, each one at the cost of O(s). Summing up, the total time spent in the preparation of the array Small is


However, since the number of distinct clusters in a collection is 2l - 3, when all trees are equal, and l + lt - 3t, when all the trees are totally different, the time spent with the preparation of Small is


To find the cluster solutions, s² pairs of clusters are analyzed and, in the worst case, the analysis of a cluster requires a binary search in Small, which can be made in O(l lg s). Therefore, the total time used to find the solutions for all clusters is


which is an upper bound for the running time of the algorithm and may be written as


We tested the efficiency of the most probable tree as a reconstruction method using four artificial sequence sets taken from the repository maintained by Gascuel (http://www.lirmm.fr/~w3ifa/MAAS/US-MAAS.html). We chose four artificial data sets simulating different hypotheses of the evolutionary mechanism:

K2P - This sequence set was created using Kimura's two-parameter evolutionary model, together with a gamma distribution of transition/transversion rates across sites. The tree topology used was a phylogenetic tree with edge lengths that are not necessarily consistent with the molecular clock hypothesis.

K2Pm - The conditions under which these sequences were created are similar to the first sequence set. The only difference is that the edge lengths are consistent with the molecular clock hypothesis.

COV - The evolutionary model used for the creation of artificially evolved sequences in this set was the covarion model (Nei, 1975). As in the K2P data set, the tree topology used does not take the molecular clock hypothesis into account.

COVm - For the creation of this data set, the covarion model was used over a topology compatible with the molecular clock hypothesis.

A fifth data set was created with sequences of small ribosomal subunit RNAs taken from the Ribosomal Database Project (Cole et al., 2003). This data set is called REAL and the reference tree used in this case was the tree published by the Ribosomal Database Project.

Creation of the collections of trees

We chose eight publicly available programs to create our collections of phylogenetic trees: fastMe (Desper and Gascuel, 2002); Mega (version 3 for Windows) (Kumar et al., 2004); the programs dnacomp, dnaml, dnamlk, dnapars, and neighbor from the PHYLIP package (version 3.6) (Felsenstein, 1989); and weighbor (Bruno et al., 2000). These programs allowed us to work with seven phylogenetic reconstruction methods: DNA compatibility, maximum likelihood, maximum parsimony, minimum evolution, neighbor joining, UPGMA, and a weighted version of neighbor joining. In cases where it was necessary to estimate the distance between sequences, three evolutionary models were used: Jukes and Cantor's model, Kimura's two-parameter model (Graur and Li, 1999; Quitzau and Meidanis, 2005), and the Tamura-Nei model (Tamura and Nei, 1993). We call a constructor a combination of a program, a tree reconstruction method and, when necessary, the evolutionary model used to estimate distances between sequences, and we used the 18 constructors shown in Table 1 to build collections of trees. All programs were used with the default parameters. In the case of the methods DNA compatibility and maximum parsimony, whenever more than one tree was produced, only the first tree in the output file was considered for further analysis. The constructor PML was unable to produce an output for the data set REAL in less than 24 h. It was therefore stopped, and its result was not considered.


After creating the trees with the constructors, we used them to create a most probable tree for each collection. In the end, each data set included a reference tree, a set of reconstructed trees and a most probable tree. The distances between each pair of trees were calculated and used for further analysis.

To perform the analysis based on average distances, we used only distances between the reconstructed trees and distances between most probable and reconstructed trees (Table 2). The code CONS corresponds to our most probable tree.


We can see that the minimum average distance always belongs to the consensus tree. This means that the most probable tree is centralized with respect to the collection of input trees. Since we used the split distance to make the comparisons, the distance is closely related to the number of splits that the trees have in common. As a result, we note that the definition of the most probable tree actually reached its objective, which was to create trees that have the most common splits found in a set.

Distances to the reference tree

We calculated the split distance between each phylogenetic tree, including the consensus tree, and the reference tree (Table 3). As we can see, for the vast majority of the data sets, the consensus tree is closer to the reference tree than 60% of the trees in the collection used to build it. The only case where the consensus tree had a bad performance was for the data set K2Pm. Even so, for the data set containing real sequences, the most probable tree is the tree that is closest to the reference.


We suggest a very simple strategy for a better approximation of reconstructed trees. The main idea was to determine if phylogenetic consensus makes any sense in the context of phylogenetic reconstruction. For verification, we chose reference trees as “true trees” and used sets of DNA sequences to try to reconstruct these trees. In spite of the fact that we knew the parameters used to create the data sets, we tried to simulate a real case of phylogenetic reconstruction. Therefore, because no one really knows which model rules natural evolution, we decided to treat every data set as if we had no idea of the way the sequences evolved. In many cases, this decision led us to create trees that were quite bad, such as the trees built for the K2Pm and COVm models. However, our purpose was not to determine whether a reconstruction method based on a certain model is able to reconstruct a tree based on sequences created using the same model; such methods are supposed to do so successfully. The scenario we tried to analyze was the situation where different, trustworthy methods generate different trees using real data.

In this scenario, we have a tree that is assumed to be correct and unknown. For this reason we chose artificially created data to represent DNA extracted from nature. Since we cannot be sure about which evolutionary model best describes actual DNA evolution, we created four data sets using two different models (covarions and rate across sites) in two opposite variations (molecular clock rules versus molecular clock does not rule). In addition, we also used a real data set, based on a well-studied phylogeny: the phylogeny of ribosomal small subunit RNAs.

With the models, the sequences, and the trees in hand, another problem appeared: how to measure distance among trees? We decided to use the split distance to compare the trees, since this definition of distance is directly related to the number of groups (clusters) that the reconstructed tree has in common with the reference tree. It is unfair to compare partially resolved trees by split distance, because every cluster in a fully resolved phylogenetic tree that is not in the reference is counted at least twice, since one has at least to remove a split from the reconstructed tree and put the right one in its place. This can be seen in Figure 4. In this figure, if we consider that tree number 2 is the reference tree, then both tree 1 and the tree in the middle of the figure make the same mistakes: they were not able to group the clusters {B, C, D} and {F, G} properly. However, the tree in the middle has distance 2, while tree 1 has distance 4. It could get worse. Note that the tree that has only seven leaves and one internal node also has distance 4 to tree 2, despite the fact that such a tree represents no information at all.
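The effect described in this paragraph can be reproduced with a small sketch. It assumes that the split distance is the number of splits present in exactly one of the two trees, which is consistent with the distances quoted above. The seven-leaf trees below are hypothetical stand-ins, not the trees of Figure 4, and only internal splits are listed, since the trivial splits are shared by all trees and cancel out.

```python
def split_distance(splits1, splits2):
    """Assumed split distance: size of the symmetric difference."""
    return len(set(splits1) ^ set(splits2))


def cl(names):
    """Small cluster written as a string of one-letter leaf names."""
    return frozenset(names)


# Internal splits (as small clusters) of hypothetical trees on leaves A to G.
reference = {cl("BC"), cl("BCD"), cl("FG"), cl("EFG")}
wrong = {cl("BC"), cl("ABC"), cl("EF"), cl("EFG")}    # fully resolved, two wrong groups
partial = {cl("BC"), cl("EFG")}                       # leaves the two groups unresolved
star = set()                                          # one internal node, no information
```

Under these assumptions, `split_distance(wrong, reference)` is 4, `split_distance(partial, reference)` is 2, and `split_distance(star, reference)` is 4: the fully resolved tree with two wrong groups is penalized as heavily as the star tree.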

To overcome this obstacle, we presented a consensus method called most probable tree, which is able to produce fully resolved consensus trees for a collection of fully resolved phylogenetic trees. This consensus method is based on an optimization criterion. In spite of this, we presented a fast (polynomial) algorithm able to find all most probable trees for a given collection of fully resolved phylogenetic trees.

The new consensus method was created with the aim of approximating the trees in a collection to the true tree. This characteristic of the most probable tree was checked by two simple tests repeated over five very different data sets. The behavior of the consensus tree was the same in all the data sets: the consensus tree was centralized with respect to the collection used to build it and it was, in most cases, one of the trees closest to the reference.

The use of default parameters when constructing the trees used in the tests may appear to be a weak point of the tests, since better chosen parameters for the reconstruction methods could have given rise to better trees. Actually, the lack of quality of the rebuilt trees is in fact the strongest point of the tests. Since the position of the consensus tree in the space of phylogenetic trees depends on the position of the trees in the collection used to create it, better trees can only improve the quality of the consensus. What the results show is that the most probable tree is centralized with respect to the input trees. If those are distributed around the reference tree, the consensus will be closer to the reference, regardless of the quality of the trees used to build it.

Of course, five data sets may not be enough to prove that consensus techniques are the best phylogenetic reconstructors. We must keep in mind that consensus methods are not reconstruction methods at all, since they are not able to rebuild a tree from raw data. But phylogenetic consensus methods can be used to improve the quality of a collection of trees built with the same set of species.

Bruno WJ, Socci ND and Halpern AL (2000). Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 17: 189-197.

Bryant D (2003). A classification of consensus methods for phylogenetics. In: Bioconsensus (Janowitz M, Lapointe FJ, McMorris F, Roberts FS, eds.). DIMACS-AMS Series, Providence, RI, USA, Vol. 61, pp. 163-184.

Cole JR, Chai B, Marsh TL, Farris RJ, et al. (2003). The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res. 31: 442-443.

Desper R and Gascuel O (2002). Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comput. Biol. 9: 687-705.

Felsenstein J (1989). PHYLIP - phylogeny inference package (version 3.2). Cladistics 5: 164-166.

Graur D and Li WH (1999). Fundamentals of molecular evolution. 2nd edn. Sinauer Associates Inc., Sunderland, MA, USA.

Kitching IJ, Forey PL, Humphries CJ and Williams DM (1998). Cladistics: The theory and practice of parsimony analysis. In: The systematics association publication. 2nd edn. Oxford Publication Press, New York, NY, USA.

Kumar S, Tamura K and Nei M (2004). MEGA3: Integrated software for molecular evolutionary genetics analysis and sequence alignment. Briefings Bioinformatics 5: 150-163.

Margush T and McMorris FR (1981). Consensus n-trees. Bull. Math. Biol. 43: 239-244.

Nei M (1975). Molecular population genetics and evolution. In: Frontiers of biology (Neuberger A and Tatum EL, eds.). Vol. 40. North-Holland Publishing Co., Amsterdam, Netherlands.

Phillips C and Warnow TJ (1996). The asymmetric median tree - a new model for building consensus tree. Discrete Applied Math. 71: 311-335.

Quitzau JAA and Meidanis J (2005). A fully resolved consensus between fully resolved phylogenetic trees. Technical Report IC-05-027, Unicamp, Campinas, SP, Brazil.

Tamura K and Nei M (1993). Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10: 512-526.

Waterman MS (1995). Introduction to computational biology. Chapman and Hall Ltd., London, England.



A phylogenetic tree can be read like a map of evolutionary history. Many phylogenetic trees have a single lineage at the base representing a common ancestor. Scientists call such trees rooted, which means there is a single ancestral lineage (typically drawn from the bottom or left) to which all organisms represented in the diagram relate. Notice in the rooted phylogenetic tree that the three domains (Bacteria, Archaea, and Eukarya) diverge from a single point and branch off. The small branch that plants and animals (including humans) occupy in this diagram shows how recent and minuscule these groups are compared with other organisms. Unrooted trees don’t show a common ancestor but do show relationships among species.

Figure 1. Both of these phylogenetic trees show the relationship of the three domains of life (Bacteria, Archaea, and Eukarya), but the (a) rooted tree attempts to identify when various species diverged from a common ancestor while the (b) unrooted tree does not. (credit a: modification of work by Eric Gaba)

In a rooted tree, the branching indicates evolutionary relationships (Figure 2). The point where a split occurs, called a branch point, represents where a single lineage evolved into a distinct new one. A lineage that evolved early from the root and remains unbranched is called a basal taxon. When two lineages stem from the same branch point, they are called sister taxa. A branch with more than two lineages is called a polytomy and serves to illustrate where scientists have not definitively determined all of the relationships. It is important to note that although sister taxa and polytomies do share an ancestor, it does not mean that the groups of organisms split or evolved from each other. Organisms in two taxa may have split apart at a specific branch point, but neither taxon gave rise to the other.

Figure 2. The root of a phylogenetic tree indicates that an ancestral lineage gave rise to all organisms on the tree. A branch point indicates where two lineages diverged. A lineage that evolved early and remains unbranched is a basal taxon. When two lineages stem from the same branch point, they are sister taxa. A branch with more than two lineages is a polytomy.

The diagrams above can serve as a pathway to understanding evolutionary history. The pathway can be traced from the origin of life to any individual species by navigating through the evolutionary branches between the two points. Also, by starting with a single species and tracing back towards the “trunk” of the tree, one can discover that species’ ancestors, as well as where lineages share a common ancestry. In addition, the tree can be used to study entire groups of organisms.

Another point to mention on phylogenetic tree structure is that rotation at branch points does not change the information. For example, if a branch point was rotated and the taxon order changed, this would not alter the information because the evolution of each taxon from the branch point was independent of the other.
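A small sketch can make this concrete. It assumes a rooted tree written as nested tuples of leaf names (a hypothetical representation chosen for illustration) and collects the set of leaf groups under every branch point. Rotating a branch point reorders the tuples but leaves that set unchanged, whereas a genuinely different grouping changes it.

```python
def clusters(tree):
    """Return (leaf set, set of all leaf groups) for a nested-tuple tree."""
    if isinstance(tree, str):                 # a single leaf
        leaf = frozenset([tree])
        return leaf, {leaf}
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_groups = clusters(child)
        leaves |= child_leaves
        found |= child_groups
    found.add(leaves)                         # the group under this branch point
    return leaves, found


original = (("A", "B"), "C")
rotated = ("C", ("B", "A"))       # same tree, branch points rotated
different = (("A", "C"), "B")     # a different grouping of taxa
```

Here `clusters(original)[1] == clusters(rotated)[1]` holds, while `clusters(different)[1]` contains the group {A, C} instead of {A, B}.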

Many disciplines within the study of biology contribute to understanding how past and present life evolved over time; these disciplines together contribute to building, updating, and maintaining the “tree of life.” Information is used to organize and classify organisms based on evolutionary relationships in a scientific field called systematics. Data may be collected from fossils, from studying the structure of body parts or molecules used by an organism, and by DNA analysis. By combining data from many sources, scientists can put together the phylogeny of an organism. Since phylogenetic trees are hypotheses, they will continue to change as new types of life are discovered and new information is learned.




Conclusion

Polyphy, a standalone tool for phylogeny reconstruction and polytomy identification, is robust and easy to use for studying evolution. It can construct a consensus tree without requiring multiple sequence alignment and can also identify polytomies without requiring manual intervention. It is a fully automated tool. The phylogeny reconstruction implements a jackknife recursive sampling procedure, then recursively trains the BLR classification model for polytomy identification. It allows users to infer a multifurcating phylogenetic tree for any set of microbial genomes of interest to study their evolutionary relationships. Although the tool is built for reconstructing a multifurcating phylogeny from whole genome sequences, users can supply a pre-generated bifurcating tree from any phylogenetic tool as the starting point for classification. We believe that this is a timely and necessary development as more and more microbes are sequenced daily (especially from the metagenomic analysis of microbial communities) without reliable taxonomy established.


Discussion

The distressing point from the comparison of different methods of phylogenetic inference is that all of them are quite sensitive to the artifacts introduced by outgroups. The differences among trees obtained with different methods are minor, and appear to be largely related just to the idiosyncrasies of the algorithms. The good news, however, is that published phylogenetic trees appear to be under a remarkable, unexpected constraint. The constraining factor is most likely the fact that practicing taxonomists appear to be generally (and rather intuitively) aware of these artifacts, so they tend to choose carefully the array of outgroup taxa. A corollary of this is that there is no hope for any brute-force meta-analysis performed without consideration of what phylogenetic trees really mean and how they are obtained. A second consequence of this finding is that there is a wide open field for designing formal ways to discriminate the noise due to outgroups from the phylogenetic signal within the taxon of interest. The methods presented here and the following discussion may provide some guide.

Not all phylogenetic trees are equally valid; in fact, there are huge differences in their robustness or support. This variable extent and reliability of phylogenetic hypotheses translates into artifacts in tree shape. For example, poor quality data introduce noise that results in increased imbalance [26]–[28]. Likewise, tree size does have an impact, because real large trees tend to approach a predictable, moderate level of imbalance [4]. These problems can be circumvented in part because tree shape and fit to the data appear to be unrelated [29], and there is at least one measure of imbalance that is independent of tree size [3]. Without being aware of these problems and how to treat them, one may gather a bewildering array of grossly dissimilar trees. Thus, having no control over what different trees mean surely will reduce any possibility of finding common rules.

The three meta-analyses [3]–[5] were based on TreeBASE (http://www.treebase.org), a searchable, archival repository of data and scientific references [30], which can be explored by statistical packages designed to perform large-scale analyses of tree shape [15]. Only binary trees were included in [3], while polytomies were resolved under a random model in [4]. In order to ensure “testing the universality of the results derived across scales”, thousands of cladograms and a few dozen “intraspecific phylogenies” were compiled in [5]. This sampling was totally uncritical, aimed at amassing a bulk of different trees. Moreover, it was partially manual, although simply taking numerous trees with no selection criterion from the literature or from a repository database should yield virtually identical results. Basically, the problem is that it is unclear whether adding numerous hypotheses with an unknown degree of uncertainty may yield a credible global answer.

Resolving phylogenetic trees into perfectly dichotomous branching patterns is a general goal in phylogenetics [31]. However, as with any approach that imposes structure on the data, bifurcations are an imposition of method, not necessarily a reality [32]–[35]. All tree-building methods will force a binary tree on the data, but it has seldom been tested at what point of the analysis the conclusions might stretch beyond the assumptions, and thus at what level of detail it would be warranted to stop [21]. One such limitation involves short interior branches (i.e., fast evolutionary radiations), which may be even more prone to error in reality than predicted by theoretical studies [36]. Actually, it may not be really necessary to resolve a multifurcation “bush” (i.e., non-binary splits, or polytomies) in rapidly branching parts of a tree, because the temporal information encoded in that unresolved topology may be more relevant than the detailed sequence of bifurcations [31]. Another overstretching of methods occurs because, above the species level, multifurcations that surely exist in evolution will always tend to be split. A justification may be that it is easier to work on a strictly binary set of nodes, although it is already possible to deal with polytomies in trees [11]. Ideally, the assumptions of systematists should be in agreement with those underlying tree-building algorithms [37]. However, even if there is a real dichotomous structure in the data, unresolved nodes will often occur mostly at or near the terminal branches, because the data analyzed are usually gathered with the goal of resolving mostly the intermediate taxonomic levels considered, and thus may not allow discriminating among very similar terminal taxa. Thus, the best resolution is generally in the middle of published trees. One must bear in mind that awfully unresolved trees are seldom published. Also, it is in the central area that the researcher's interest was in the first place.
This explains departures from expected values in the left part of Fig. 3 . It is also a good reason to prefer analyses in the tree space defined by A and C, given that it includes polytomous trees.

The artifactual nature of binary trees is most relevant at or below the species level. Species may be incompletely isolated due to recent or incomplete speciation, the pattern of speciation may not be a simple cladogenetic event but may be instead paraphyletic, hybridization may cause reticulate evolution, and sorting of ancestral polymorphisms may render gene trees incongruent with species trees [17], [38]–[40]. Toward the contemporary tips of a phylogenetic tree, resolution is subject to the delimitation of species, a complex and often arbitrary issue that is not part of the phylogenetic inference process; eventually, recognizing the distinctiveness of individual taxa becomes problematic, because recent and incipient speciation may be difficult to identify [17], [41]. Even more problematic is portraying intraspecific variation as a branching tree. Within a species there is gene flow, so gene trees can only rarely be translated directly into a history of population subdivision. It would be more meaningful to ask in the first place if there is an inherent hierarchical structure in data [34]. Actually, the clustering of subpopulations and the comparison of trees for different genes are by no means simple tasks, and dichotomous branching ordinations are just a small part of the methods available [42]. However, provided one is aware of their meaning, they can be powerful tools in combination with other approaches to deal with intraspecific data [13], [43]. It is obvious that trees of intraspecific variation are actually simplified sketches, and thus have a radically different nature than interspecific trees. Thus, the mixing of intraspecific and interspecific trees in [5] has no justification, and their claims of uniform branching pattern above and below the species level are simply an artifact of applying similar binary-tree-building methods to different biological questions.
At any rate, the high prevalence of multifurcations that exists among intraspecific trees reflects the inadequacy of tree-building methods for reticulate data, and their finding of lower-than-expected values of C at short branch lengths is solely an artifact.

The selection of trees is also a source of noise. In fact, different tree-building methods produce significantly different arrays of trees [3], [44]. This precaution was not taken into account by [4], [5], who mixed trees obtained from various kinds of tree-building algorithms: some distance-based (neighbor-joining), some based on parsimony, and still others on maximum likelihood. The differences between these methods can be shown to be rather of “degrees of freedom” [21], [45], yet they are based on different assumptions and often yield different outcomes for the same data matrix (as shown in Fig. 4). Moreover, real-world deviations from theoretical simple models of evolution may easily produce artifactual phylogenetic reconstructions under the commonly used models of sequence evolution, and it is still unclear how to capture the historical signal with a minimum of parameters to be estimated from the data [46]–[48]. Also, trees may differ if calculated with a naïve one-step process, or are derived from an approach that seeks to compare trees and find an average final model [20], [21]; even in simple 3-taxon cases, the outcome may differ strikingly, with substantial evolutionary implications [49]. Thus it remains unclear why trees obtained with different methods from a variety of taxa should be mixed up with no control.

The value of a null model lies not in its mathematical elegance, but in its relevance to the question posed. On average, a totally balanced tree is also expected from Yule's equal-rates Markov model [3], [50], [51], but this kind of tree would be most unusual for any large set of real taxa. In the case of phylogenetic trees, null models based on random, increasing, balanced diversification [5], [6] were only a reasonable early start. More elaborate stochastic models exhibit an enhanced approach to real trees [3], [4], but it is unclear whether there is any reason to prefer any such model beyond a rough fit to the data and the rejection of the overly simplistic Yule model. Clearly, more realistic models are needed that place randomness right where relevant variables impact the model's behavior [16], [17], [52]–[54]. From this viewpoint, it should come as no surprise that all three meta-analyses [3]–[5] found that the average imbalance of phylogenetic trees inferred from real data falls neatly in between extreme possibilities (i.e., the symmetric and pectinate trees in [5]; the random and uniform models in [3]; and random and pectinate trees in [4]).

The departure of real trees from random models can be caused by at least two major natural factors, and two artifacts. The first natural factor is simply that extinction does occur, so not all lineages can continue to divide at the specified rate. As lineages go extinct along a tree, its imbalance will almost inevitably increase. This is a consequence of extinction being the outcome of complex dynamics, so it is not reasonable to expect that it should remain stable across the tree. The second natural deviating factor is that diversification rates will surely vary across the different branches of the tree over time, because they are a complex function of a plethora of intrinsic and environmental factors operating on living organisms. Several methods have been devised to estimate absolute rates of speciation and extinction, showing that large variation in those parameters is the rule [55]–[63]. Indeed, balanced random processes are too slow to account for most patterns of observed diversity, yet diversification is subject to complex environmental constraints [17], [53]. A reflection of such complexity is likely to result in autocorrelation of diversification rate along lineages [8]. Thus, real phylogenies should be expected to range throughout all possible topologies, with no reasonable way of a priori delimiting tree space.

Aside from real-world issues, the two major artifacts that increase imbalance are related to the taxa included in the analysis. On one hand, all known taxa from a given group are rarely included, so some choice has to be made. Often this may be imposed by the availability of samples. However, it may be difficult to know whether species have been removed from the analysis deliberately and selectively [26]. And including selected species from high-rank taxa may cause problems of two sorts. Actually, real trees are quite imbalanced, and more so if the taxa are above the species level [39]. In addition, such large branches will inevitably result in underestimation of real change, and thus of long branch lengths. This is the pervasive node-density artifact, whose impact on tree shape is still unclear [64]. At any rate, non-random taxon sampling will cause errors in estimates of speciation and extinction rates, more so than just incomplete taxon sampling [65], [66]. Indeed, the inclusion of evolutionarily isolated species may affect synthetic measures of phylogenetic trees [67].

On the other hand, outgroups (used to place the root of the tree) are a definite source of imbalance. At the highest taxonomic levels considered, C has higher-than-expected values, indicating that long branches tend to be more pectinate. But this is due to the inclusion of selected taxa from progressively more distantly related lineages. This is routinely done in order to provide various outgroups. This is justified because, based on sampling theory, the denser the sampling of outgroup taxa, the more stable the internal topology will be and the stronger the test for the monophyly of the ingroup [68], [69]. Since outgroup taxa clearly contribute significantly to an excess of imbalance [3], [21], [70], there is a motive for removing outgroups from tree analysis [3], [4]. Unfortunately, the outgroup taxa are often not displayed in the published trees, and it is frequent that more outgroups are included than those explicitly identified as such. Actually, outgroups often involve more than just the first low-diversity branch, or the usual basal one or two single-species branches. In some instances, such as in tree A (Fig. 2), a priori outgroups turn out not to be the branches closest to the root, making any automated identification and deletion of outgroups highly suspect. This problem is exacerbated if the basal taxa turn out to be paraphyletic [39], because they will appear as pectinate long branches. The two trees analysed in detail (Fig. 3) show several basal branches that belong to outgroups that are revealed to be paraphyletic. Actually, higher taxa that have traditionally been considered as basal to other higher-order taxa often turn out to be paraphyletic when subject to cladistic evaluation: the Acari, Limidae, Propeamussiidae and Aequipectinini are likely candidates to join the club of outmoded, unnatural groups such as the Protobranchia, Reptilia, and Pongidae. Without a proper identification of outgroup taxa, coupled to a taxonomic assessment of any basal paraphyletic taxa, it is very hard to control for the pervasive artifact of imbalance increasing at the highest taxonomic levels of published trees. Therefore, the reported findings of imbalance increasing at large tree sizes stem from this control being insufficient in [4] and just missing in [5], and thus appear to be totally caused by the outgroup and basal paraphyly artifacts.

Various tree-shape statistics have been devised, whose merits vary widely. Most of these methods extract a single summary index from the distribution of nodes, so it is not too surprising that the majority of such measures of tree shape are sensitive to the level, or depth in the phylogeny, at which imbalance is concentrated [3], [71] and to the presence of polytomies [36]. As summarized in Fig. 1, C suffers from these same shortcomings. Focusing instead on the dispersion of node traits in a bidimensional plot aims at capturing more of the tree's features [72], although interpretation of such analysis is also difficult [3], [10]. Likewise, estimates of the alpha model fail to adjust extreme tree shapes and often yield a zero value [3], thus being also hard to interpret. As shown above, the relationship between A and C can be used to locate and explain imbalance in the different regions of a given tree, even if there are polytomies. The drawbacks of this method are that it does not have optimal resolution because different trees yield identical values, and all trees are constrained within a small sector of geometric space, so even quite distinct trees will yield similar plots. Indeed, it is clear that the two phylogenies in Fig. 2 have quite different shapes, yet are translated into overall similar plots in Fig. 3. It is also relevant to notice that these two parameters can be used to design meaningful measures (such as log outgroup imbalance) of the impact of outgroups (and possibly other artifacts) in tree space. Thus, the uniform relationship between branch size A and cumulative branch size C is due to a narrow design of methods, not a quality of results.
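As an illustration of how branch size A and cumulative branch size C might be obtained, here is a minimal sketch. It assumes, since the definitions are not spelled out in this passage, that A at a node counts all nodes in the subtree rooted there (the node itself included) and that C sums A over that same subtree. The function name and the nested-tuple tree encoding are hypothetical.

```python
def branch_sizes(tree):
    """Return a list of (A, C) pairs, one per node, in post-order (root last).

    A tree is a nested tuple: a tip is any non-tuple label, and an internal
    node is a tuple of its children (more than two children = polytomy).
    Assumed definitions: A = nodes in the subtree, C = sum of A over it.
    """
    pairs = []

    def visit(node):
        if not isinstance(node, tuple):      # a tip: A = 1, C = 1
            pairs.append((1, 1))
            return 1, 1
        a, c = 1, 0                          # count the node itself in A
        for child in node:
            child_a, child_c = visit(child)
            a += child_a
            c += child_c
        c += a                               # add this node's own A to C
        pairs.append((a, c))
        return a, c

    visit(tree)
    return pairs

if __name__ == "__main__":
    balanced = (("a", "b"), ("c", "d"))
    pectinate = ((("a", "b"), "c"), "d")
    polytomy = ("a", "b", "c", "d")
    for name, tree in [("balanced", balanced),
                       ("pectinate", pectinate),
                       ("polytomy", polytomy)]:
        print(name, branch_sizes(tree)[-1])  # (A, C) at the root
```

Under these assumed definitions, the four-tip balanced and pectinate trees share A = 7 at the root but differ in C (17 versus 19), while the unresolved four-tip polytomy gives A = 5 and C = 9. The narrow gap between 17 and 19 hints at why quite different shapes can end up close together in an A versus C plot.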

A third avenue is to compare trees strictly in terms of what they are –high-dimensional parameters amenable to geometrical depictions in ultrametric space [73]. Actually, ultrametrics have been successfully applied to a variety of questions where data have a hierarchical structure [34], [74], [75]. This perspective allows the exploration of geometric space [14], [20], [76], without relying on simulations, and leading to the application of statistical methods [21]. It is thus possible to develop a measure of resolution for different tree-shape statistics, and thus select those statistics that have similar values only for similar trees [14]. The analysis shown in Fig. 1 is a step in this direction, pointing at further developments in generalized tree shape distribution.

However, there is a critical caveat to any analysis of the shape of phylogenetic trees. Our perspective being inevitably from the present, extant diversity always appears to come out of a burst from a distant single stem [17]. Virtually all real trees will have a rather “conical” shape, due to the fact that the recent splits considered are many more than old surviving lineages. Including extinct taxa should help in correcting this retrospective illusion, but the incompleteness of the fossil record will always play against such correction. But this leads to a second obstacle, which is related but more difficult to tackle: what exactly are fossil taxa that are basal to later diversification? In an orthodox cladistic framework, such an extinct species will always be treated as the sister group of all later branches, provided the traits of later taxa can be unequivocally identified in their earliest stages. Now, this methodological shortcut may not always provide an accurate description of reality, our placing of those early stems, or “species germinalis”, being strongly dependent on later evolution that is only apparent from our contemporary point of view [77]. Clearly there is a challenge to develop methods for correcting our “convex from the present” view of phylogenetic trees prior to analysis of their actual shape and information content.

In spite of grand declarations, the Darwinian goal of classifying organisms in terms of their relationships of common descent has powered evolutionary research and is at the root of the field of phylogenetics. There is really nothing like universal scaling in phylogenetic trees, and no good reason why it should exist. We are dealing with attempts to understand history [22], thus a phylogenetic tree is only a diagram of a complex irreversible process. In this sense, the linking of TreeBASE to databases providing information on the taxa actually included in each analysis [78] is a valuable addition that should help in assessing the significance and merits of each tree before including it in any meta-analysis. Beyond failures based on unreasonable assumptions and oversimplistic paradigms, the wealth of information encoded in phylogenetic trees is there to be deciphered. However, this will not happen with any uncontrolled meta-analysis, but only through an integration of population genetics, ecology, paleontology, and graph theory. Artifacts pave the way, and they can only be overcome with an understanding of the structure and biological meaning of phylogenetic trees.

Exploring the geometry of unlabeled trees with constant internodal distances represents only an initial approach. It is critical to notice that taking tree topologies alone, explicitly disregarding any time scale, has the implicit problem of obviating extinction. Actually, time on a phylogeny does matter, at least because individual branch lengths actually are estimates of different processes depending on where they are located within the tree. Towards the terminal taxa, individual branch lengths estimate the inverse of the speciation rate, but at the basal regions they rather estimate the inverse of the diversification rate, which is the difference between the speciation and extinction rates [79]–[81]. It may even be possible to distinguish decreasing speciation from increasing extinction in early evolutionary radiations [63]. This is relevant to methods such as the lineage-through-time approach [82], [83], which ignores extinct lineages and is thus sensitive to the effects of poor sampling of taxonomic diversity, as well as to its intrinsic inability to distinguish reduced extinction and enhanced speciation [17]. Although the variability of branch lengths in real trees can be used to test hypotheses about evolutionary rates [65], [84], precise estimation of these rates requires large phylogenetic trees [85], and it is still unclear how to assess in general the impact of disappearing lineages on the shape of phylogenetic trees. Although it is epistemologically impossible to read directly the empty space left by vanished taxa, the contribution of missing branches to the observed patterns remains as a signature to be deciphered. Eventually, it is the biological phenomenon of extinction that imposes an ultrametric structure on phylogenetic trees, because the unavoidable disappearance of interfertile individuals and intermediate taxa throughout life's history sets apart the surviving lineages and promotes the growth of biodiversity.
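The relation between branch lengths and rates stated above can be written compactly. The symbols are notation assumed here for illustration and do not come from the source: \lambda is the speciation rate, \mu the extinction rate, r the diversification rate, and t a typical branch length.

```latex
r = \lambda - \mu, \qquad
t_{\text{terminal}} \approx \frac{1}{\lambda}, \qquad
t_{\text{basal}} \approx \frac{1}{r} = \frac{1}{\lambda - \mu}
```

With purely hypothetical values of \lambda = 0.2 and \mu = 0.15 events per lineage per million years, terminal branches would suggest a time scale of about 1/0.2 = 5 million years, whereas basal branches would suggest about 1/0.05 = 20 million years. This shows why branch lengths from different parts of a tree should not be read as estimates of the same quantity.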


Part 3: Making Caminalcule Cladograms

Caminalcules are imaginary organisms created by the evolutionary biologist Joseph H. Camin, after whom they are named. The 29 living species and 48 fossils were published by Robert R. Sokal (1983a) in the journal Systematic Zoology. They were used to conduct detailed research on evolutionary classification.

Lab Question

Warmup

  1. Draw the simplest possible evolutionary tree that contains these 4 living critters and these 3 fossil critters.
  2. Include at least one trait that distinguishes each living critter.
  3. Use each fossil as an ancestor.
  4. Remember—cladograms are hypotheses that must be tested. Your hypothesis might be different from someone else’s!

The Caminalcules

Using the provided sheet of paper with illustrated Caminalcules, cut out each individual using a pair of scissors. Next, working in groups of 2 or 3, sort the individuals into subgroups based on shared morphological features that distinguish each subgroup from the others (known as shared derived features). Once you have sorted the species into four or five major groups, paste the animals onto a piece of paper and draw the branches, labeling each branch with the shared derived feature that defines it. The instructor will illustrate an example of a completed cladogram.

Remember! Cladograms are hypotheses that must be tested! Your hypothesis might be different from someone else’s. Be prepared to defend your decision (with evidence) to someone whose hypothesis is different.