Introduction

TreeBeST, which stands for (gene) Tree Building guided by Species Tree, is a versatile program that builds, manipulates and displays phylogenetic trees. It is particularly designed for building gene trees with a known species tree and is highly efficient and accurate.

TreeBeST was previously known as NJTREE. It has been widely used in the TreeFam database, Ensembl Compara and the OPTIC database of Chris Ponting's group.

Installing TreeBeST

The latest version of TreeBeST can be downloaded at the SourceForge download page, or retrieved from the treesoft subversion server:

svn co https://treesoft.svn.sourceforge.net/svnroot/treesoft/trunk/treebest

The command-line tool, treebest, can be compiled by typing `make' in the source code directory. Compiling the GUI version, fltreebest, requires FLTK, an open source cross-platform GUI library. Binaries for i686-linux, universal-macosx and Windows are also available at the download page.

Tree Building and Orthology Inference

An Example

TreeBeST has many functions, but most users will first want to know how to reconstruct a gene tree and infer orthologs. For a quick start, you can try treebest on the example file ex1.nucl.mfa that comes with all TreeBeST packages:

treebest best ex1.nucl.mfa > ex1.nucl.nhx

and then see the resultant tree with fltreebest:

fltreebest ex1.nucl.nhx

A window displaying the tree will open.

Orthologs and within-species paralogs can be inferred with:

treebest nj -t dm -vc ex1.nucl.nhx ex1.nucl.mfa > ex1.nucl.out

The lines between "@begin full_ortholog" and "@end full_ortholog" give the results.

Preparing the Species Tree

Without a species tree, treebest performs no better than a standard PhyML run, and possibly worse. Providing the correct species tree is HIGHLY recommended. The default species tree stored in treebest is the one used by TreeFam. The following command prints this default species tree to stdout:

treebest spec

In this tree, species names are encoded in the Swiss-Prot way, where each species name is at most 5 characters long. If your tree is different from the default tree or you want to use taxon IDs to represent the species names, you should specify your own tree with the command-line option `-f'.

The input species tree should be in the New Hampshire format and can be multifurcated. Each internal node MUST have a taxon name. Here is a simple example:

(HUMAN*,(RAT*,MOUSE*)Murinae,(CANFA*-dog,PIG-comment)Laurasiatheria)Eutheria

In this example, HUMAN, RAT, MOUSE, CANFA and PIG are species names. Murinae, Laurasiatheria and Eutheria are taxon names. A star `*' indicates that the species is completely sequenced and any gene losses should be counted. A hyphen `-' marks the start of a comment, which will not be parsed. Neither the star nor the hyphen is parsed as part of the species name.
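The labelling rules above can be illustrated with a short Python sketch. It is not TreeBeST's own parser: the function names `parse_label` and `split_labels` are hypothetical, and the code assumes a tree without branch lengths or whitespace and species names that contain no hyphen.

```python
import re

def parse_label(label):
    """Split a leaf label into (species, fully_sequenced, comment)."""
    # Everything after the first hyphen is a comment
    # (assumption: species names themselves contain no hyphen).
    name, _, comment = label.partition("-")
    # A trailing star marks a completely sequenced species.
    return name.rstrip("*"), name.endswith("*"), comment

def split_labels(newick):
    """Return (leaf_labels, internal_taxon_names) from a bare Newick string."""
    leaves, internals = [], []
    for delim, label in re.findall(r"([(),])([^(),;]+)", newick):
        # A label that directly follows ')' names an internal node.
        (internals if delim == ")" else leaves).append(label)
    return leaves, internals

tree = "(HUMAN*,(RAT*,MOUSE*)Murinae,(CANFA*-dog,PIG-comment)Laurasiatheria)Eutheria"
leaves, internals = split_labels(tree)
species = [parse_label(leaf) for leaf in leaves]
```

Here `species` holds one tuple per leaf, for example `("CANFA", True, "dog")`, and `internals` lists the taxon names that every internal node must carry.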

The default species tree currently used by treebest can be printed with the `treebest spec' command shown above.

Preparing Multialignment

TreeBeST takes an aligned multi-sequence FASTA file as the input. Species are recognized from the sequence names, with the underscore `_' as the separator. This is also the Swiss-Prot rule. Here is an example adapted from ex1.nucl.mfa:

>11_MOUSE
------------------------------------------------------------
---------------------ATGGCGGCGGCCGCTCTATCCCGGACGCTGTTGCCAGAG
GCCCGGCGGCGCCTGTGGGGATTTACACGAAGGCTTCCG---------------------
---------------------------CTTCGCCGC------GCCGCTGCTCAGCCGTTG
TACTTTGGAGGGGAC---------CGACTAAGA---------------------------
---------------------------AGCACACAGGCTGCCCCACAGGTTGTGCTGAAT
---GTCCCCGAGACACAAGTGACATGTTTGGAAAATGGACTCAGAGTAGCTTCTGAA---
AACTCTGGGCTCTCAACGTGCACAGTTGGGCTGTGGATCGATGCGGGAAGTCGCTATGAG
AATGAGAAGAACAACGGCACCGCCCACTTCCTGGAGCACATGGCCTTCAAGGCAAGGACT
AAAAAGAGGTCCCAGTTAGACCTTGAACTTGAGATTGAGAATATGGGCGCTCATCTTAAC
>12_HUMAN
------------------------------------------------------------
---------------------ATGGCGGCTGCGGCGGCTCGAGTGGTGTTGTCATCCGCG
GCGCGGCGGCGGCTCTGGGGTTTCAGCGAGAGTCTTCTA---------------------
---------------------------ATCCGAGGC------GCTGCGGGACGGTCATTA
TATTTTGGAGAGAAC---------AGATTAAGA---------------------------
---------------------------AGTACACAGGCTGCTACCCAAGTTGTTCTGAAT
---GTTCCTGAAACAAGAGTAACATGTTTAGAAAGTGGACTCAGAGTAGCTTCGGAA---
GACTCTGGGCTCTCAACATGCACAGTTGGACTCTGGATTGATGCTGGAAGTAGATACGAA
AATGAGAAGAACAATGGAACAGCACACTTTCTGGAGCATATGGCTTTCAAGGGC---ACC
AAGAAGAGATCCCAGTTAGATCTGGAACTTGAGATTGAAAATATGGGTGCTCATCTCAAT

In this example, `11_MOUSE' is a mouse sequence and `12_HUMAN' is a human sequence.
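As a rough illustration of this naming rule, the Python sketch below reads FASTA text and derives a species tag for each sequence. The helper names `read_fasta` and `species_of` are hypothetical, the toy sequences are invented, and taking the part after the LAST underscore is an assumption; the text above only states that the underscore is the separator.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]   # name is the first word of the header
            records[name] = []
        elif line and name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def species_of(seq_name):
    """Return the species tag, assumed to follow the last underscore."""
    prefix, sep, species = seq_name.rpartition("_")
    if not sep:
        raise ValueError(f"no '_' separator in sequence name: {seq_name!r}")
    return species

# Toy alignment (invented, much shorter than ex1.nucl.mfa).
example = ">11_MOUSE\n---ATGGCG\nGCCCGG---\n>12_HUMAN\n---ATGGCG\nGCGCGG---\n"
seqs = read_fasta(example)
by_species = {name: species_of(name) for name in seqs}
```

A check like this is a cheap way to confirm, before building a tree, that every sequence name maps to a species present in the species tree.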

To use `treebest best', the input alignment MUST be a protein-guided codon alignment, which can be generated by replacing, in the protein alignment, each amino acid with the three nucleotides of the corresponding codon. TreeBeST also provides a tool, `backtrans', to facilitate this step.
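The replacement rule described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the implementation of `backtrans': the function name `backtranslate` and the toy sequences are hypothetical, the coding sequence is assumed to carry no trailing stop codon, and the code does not check that each codon really encodes the aligned residue.

```python
def backtranslate(aligned_protein, cds):
    """Thread an unaligned coding sequence onto an aligned protein sequence."""
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) != 3 * residues:
        raise ValueError("CDS length must be three times the residue count")
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    # Each gap becomes a three-column gap; each residue takes the next codon.
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)

# Toy data (invented): a protein alignment and the matching coding sequences.
protein_alignment = {"11_MOUSE": "MA-A", "12_HUMAN": "MAAA"}
coding_seqs = {"11_MOUSE": "ATGGCGGCC", "12_HUMAN": "ATGGCGGCTGCG"}
codon_alignment = {
    name: backtranslate(protein_alignment[name], coding_seqs[name])
    for name in protein_alignment
}
```

Because gaps are expanded to whole codons, every row of the result has a length that is a multiple of three and the reading frame is preserved across the alignment.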

Invoking the `best' Command

The `best' command builds the best gene tree. When you have prepared your species tree and alignment in the correct formats, you can simply invoke the `best' command with:

treebest best -f in.spectree.nh -o out.tree.nhx in.align.mfa

The resulting tree out.tree.nhx will be bootstrapped 100 times, reconciled with the species tree and rooted by minimizing the number of duplications and losses. Duplications and losses are also stored in the NHX format.

Note that treebest first determines the topology of the resulting tree with a complex procedure, and then performs 100 rounds of resampling with an improved neighbour-joining algorithm. Branch lengths are finally estimated with the standard ML method under the HKY model.
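For readers unfamiliar with resampling, the sketch below shows the generic bootstrap step on an alignment: columns are drawn with replacement to form each pseudo-replicate. It is a minimal illustration under stated assumptions, not treebest's code; the name `bootstrap_alignment` and the toy alignment are hypothetical, and whether treebest resamples single columns or whole codons is not stated here (this sketch resamples single columns).

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate: columns drawn with replacement."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    # The same column indices are applied to every sequence.
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

# Toy alignment (invented).
alignment = {"11_MOUSE": "ATGGCG---GCC", "12_HUMAN": "ATGGCGGCTGCG"}
rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
```

A tree would then be built from each replicate, and the support for a branch is the fraction of replicate trees that contain it.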

Inferring Orthologs

You can use the `ortho' command to infer the orthologs and within-species paralogs from a known tree. However, this command does not give you the bootstrap values for ortholog pairs. To get the bootstrap values, you need to use the `nj' command as follows:

treebest nj -vf in.spec.nh -t dm -c in.tree.nhx -o out.txt in.align.mfa

The lines between `@begin full_ortholog' and `@end full_ortholog' in out.txt show the orthologs.
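Pulling those lines out of the output can be scripted. The Python sketch below relies only on the two marker lines named above; the function name `ortholog_lines` is hypothetical and the sample content is a placeholder, since the layout of the records between the markers is not described here.

```python
def ortholog_lines(lines):
    """Return the lines strictly between the full_ortholog markers."""
    inside, found = False, []
    for line in lines:
        text = line.rstrip("\n")
        if text.startswith("@begin full_ortholog"):
            inside = True
        elif text.startswith("@end full_ortholog"):
            inside = False
        elif inside:
            found.append(text)
    return found

# Placeholder content: the real record layout is not described here.
sample = ["header\n", "@begin full_ortholog\n", "record 1\n", "record 2\n",
          "@end full_ortholog\n", "trailer\n"]
records = ortholog_lines(sample)
```

With a real run, the same function can be fed an open file handle, for example `with open("out.txt") as handle: records = ortholog_lines(handle)`.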

Innovations in TreeBeST

Several methods were developed to incorporate species phylogenies in building gene trees. These methods greatly improve the accuracy of gene trees without adding much computational overhead. The first algorithm infers gene duplications and losses from an unrooted gene tree and roots the tree at the same time. A multifurcated species tree can be used. The average time complexity is Θ(N log N) and the worst-case time complexity is O(N^2), faster than a naive implementation.
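To make the notion of inferring duplications concrete, here is a minimal Python sketch of the textbook last-common-ancestor (LCA) mapping on a ROOTED, binary gene tree. It is explicitly not the TreeBeST algorithm, which works on unrooted trees and also counts losses: all names (`PARENT`, `ancestors`, `lca`, `reconcile`) and the toy gene tree are hypothetical, the LCA search is the naive one, and at a multifurcated species node this simple rule can over-call duplications.

```python
# Species tree from the example in this document, as a child -> parent map.
PARENT = {
    "Eutheria": None,
    "HUMAN": "Eutheria", "Murinae": "Eutheria", "Laurasiatheria": "Eutheria",
    "RAT": "Murinae", "MOUSE": "Murinae",
    "CANFA": "Laurasiatheria", "PIG": "Laurasiatheria",
}

def ancestors(node):
    """Return the path from a species-tree node up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lca(a, b):
    """Last common ancestor of two species-tree nodes (naive search)."""
    seen = set(ancestors(a))
    return next(node for node in ancestors(b) if node in seen)

def reconcile(gene_tree, events):
    """Map a rooted binary gene tree onto the species tree, recording events."""
    if isinstance(gene_tree, str):           # leaf such as "11_MOUSE"
        return gene_tree.rpartition("_")[2]  # species tag after the underscore
    left, right = (reconcile(child, events) for child in gene_tree)
    here = lca(left, right)
    # A node mapped to the same taxon as one of its children is a duplication.
    kind = "duplication" if here in (left, right) else "speciation"
    events.append((here, kind))
    return here

# Toy gene tree (invented): two human genes fall in different subtrees.
gene_tree = (("1_HUMAN", "2_MOUSE"), ("3_HUMAN", "4_RAT"))
events = []
root_taxon = reconcile(gene_tree, events)
```

In this toy tree both inner nodes are speciations at Eutheria, while the root is a duplication, so `1_HUMAN` and `2_MOUSE` would be read as orthologs and `1_HUMAN` and `3_HUMAN` as paralogs.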

The second algorithm merges several input trees into one tree by minimizing the number of duplications and losses. It may produce a gene tree better than any of the input trees. Five trees are used in reconstructing the topology of a gene tree: a neighbour-joining (NJ) synonymous distance (dS) tree, an NJ non-synonymous distance (dN) tree, an NJ p-distance tree, a maximum-likelihood (ML) tree under the WAG model and an ML tree under the HKY model. The tree obtained by merging the first two trees is used in resampling.

The third algorithm calculates the probability of a gene tree in the context of species evolution and multiplies this by the probability of sequence evolution. A PhyML-style search is then applied to find the maximum-likelihood tree.

None of these methods has been published. The talk I gave at the Newton Institute shows the basic theory behind them. My thesis is the most complete description of the first and the second methods. The thesis also explains other techniques used in TreeBeST, known as NJTREE at that time.

TreeBeST Manual