Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis

    Ari Löytynoja* and Nick Goldman

    Science 20 June 2008:
    Vol. 320. no. 5883, pp. 1632 - 1635
    DOI: 10.1126/science.1158395


    Genetic sequence alignment is the basis of many evolutionary and comparative studies, and errors in alignments lead to errors in the interpretation of evolutionary information in genomes. Traditional multiple sequence alignment methods disregard the phylogenetic implications of gap patterns that they create and infer systematically biased alignments with excess deletions and substitutions, too few insertions, and implausible insertion-deletion–event histories. We present a method that prevents these systematic errors by recognizing insertions and deletions as distinct evolutionary events. We show theoretically and practically that this improves the quality of sequence alignments and downstream analyses over a wide range of realistic alignment problems. These results suggest that insertions and sequence turnover are more common than is currently thought and challenge the conventional picture of sequence evolution and mechanisms of functional and structural changes.

    Do orthologous gene phylogenies really support tree-thinking?

    E Bapteste, E Susko, J Leigh, D MacLeod, RL Charlebois and WF Doolittle



    Since Darwin's Origin of Species, reconstructing the Tree of Life has been a goal of evolutionists, and tree-thinking has become a major concept of evolutionary biology. Practically, building the Tree of Life has proven to be tedious. Too few morphological characters are useful for conducting conclusive phylogenetic analyses at the highest taxonomic level. Consequently, molecular sequences (genes, proteins, and genomes) likely constitute the only useful characters for constructing a phylogeny of all life. For this reason, tree-makers expect a lot from gene comparisons. The simultaneous study of the largest number of molecular markers possible is sometimes considered to be one of the best solutions in reconstructing the genealogy of organisms. This conclusion is a direct consequence of tree-thinking: if gene inheritance conforms to a tree-like model of evolution, sampling more of these molecules will provide enough phylogenetic signal to build the Tree of Life. The selection of congruent markers is thus a fundamental step in simultaneous analysis of many genes.


    Heat map analyses were used to investigate the congruence of orthologues in four datasets (archaeal, bacterial, eukaryotic and alpha-proteobacterial). We conclude that we simply cannot determine if a large portion of the genes have a common history. In addition, none of these datasets can be considered free of lateral gene transfer.


    Our phylogenetic analyses do not support tree-thinking. These results have important conceptual and practical implications. We argue that representations other than a tree should be investigated in this case because a non-critical concatenation of markers could be highly misleading.


    Tree thinking cannot taken for granted: challenges for teaching phylogenetics

    Theory Biosci. 2008 March; 127(1): 45–51.

    Published online 2008 February 5. doi: 10.1007/s12064-008-0022-3.
    PMCID: PMC2254468

    Hanno Sandvik


    Evolution: What determines the rate of sequence evolution?

    J.F.Y. Brookfield
    High rates of amino-acid sequence evolution have
    sometimes been considered to be diagnostic for genes
    undergoing adaptive change. However, two recent
    studies have shown that rapid evolution of amino-acid
    sequence can also be congruent with neutrality.
    Address: Institute of Genetics, University of Nottingham, Queens
    Medical Centre, Nottingham NG7 2UH, UK.
    Current Biology 2000, 10:R410–R411

    In silico simulation of biological network dynamics


    Lukasz Salwinski & David Eisenberg

    Realistic simulation of biological networks requires stochastic
    simulation approaches because of the small numbers of
    molecules per cell. The high computational cost of stochastic
    simulation on conventional microprocessor-based computers
    arises from the intrinsic disparity between the sequential steps
    executed by a microprocessor program and the highly parallel
    nature of information flow within biochemical networks.
    This disparity is reduced with the Field Programmable Gate
    Array (FPGA)-based approach presented here. The parallel
    architecture of FPGAs, which can simulate the basic reaction
    steps of biological networks, attains simulation rates at least
    an order of magnitude greater than currently available

    Genome assembly forensics: finding the elusive mis-assembly

    Adam M Phillippy, Michael C Schatz and Mihai Pop
    Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA

    Genome Biology 2008, 9:R55 doi:10.1186/gb-2008-9-3-r55

    GC-Content Evolution in Mammalian Genomes: The Biased Gene Conversion Hypothesis

    N. Galtier, G. Piganeau, D. Mouchiroud, and L. Duret

    Genetics, Vol. 159, 907-911, October 2001


    The Human Transcriptome Map Reveals Extremes in Gene Density, Intron Length, GC Content, and Repeat Pattern for Domains of Highly and Weakly Expressed Genes

    Rogier Versteeg, Barbera D.C. van Schaik, Marinus F. van Batenburg, et al.

    Genome Res. 2003 13: 1998-2004; originally published online Aug 12, 2003;

    Access the most recent version at doi:10.1101/gr.1649303

    The chromosomal gene expression profiles established by the Human Transcriptome Map (HTM) revealed a
    clustering of highly expressed genes in about 30 domains, called ridges. To physically characterize ridges, we
    constructed a new HTM based on the draft human genome sequence (HTMseq). Expression of 25,003 genes can be
    analyzed online in a multitude of tissues. Ridges are found to be very
    gene-dense domains with a high GC content, a high SINE repeat density, and a low LINE repeat density. Genes in
    ridges have significantly shorter introns than genes outside of ridges. The HTMseq also identifies a significant
    clustering of weakly expressed genes in domains with fully opposite characteristics (antiridges). Both types of
    domains are open to tissue-specific expression regulation, but the maximal expression levels in ridges are
    considerably higher than in antiridges. Ridges are therefore an integral part of a higher order structure in the
    genome related to transcriptional regulation.

    Evolution of Molecular shape in bacterial globin-related proteins

    Marsh, L
    J.Molec.Evol. 62, 585-587 (2006)

     The globin family of proteins has a characteristic structural pattern of helix interactions that nonetheless exhibits some variation. A simplified model for globin structural evolution was developed in which protein shape evolved by random change of contacts between helices. A conserved globin domain of 15 bacterial proteins representing four structural families was studied. Using a parsimony approach ancestral structural states could be reconstructed. The distribution of number of contact changes per site for a fixed topology tree fit a gamma distribution. Homoplasy was high, with multiple changes per site and no support for an invariant class of residue-residue contacts. Contacts changed more slowly than sequence. A phylogenetic reconstruction using a distance measure based on the proportion of shared contacts was generally consistent with a sequence-based phylogeny but not highly resolved. Contact pattern convergence between members of different globin family proteins could not be detected. Simulation studies indicated the convergence test was sensitive enough to have detected convergence involving only 10% of the contacts, suggesting a limit on the extent of selection for a specific contact pattern. Contact site methods may provide additional approaches to study the relationship between protein structure and sequence evolution.
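    As a rough illustration of what a distance "based on the proportion of shared contacts" might look like, the Python sketch below computes one minus the fraction of contacts that two proteins share. The abstract does not give the formula, so the Jaccard-style denominator (all contacts seen in either protein), the function names, and the toy helix-contact data are all assumptions made for illustration, not the author's method.

```python
from itertools import combinations


def contact_distance(a, b):
    """Return 1 minus the proportion of shared contacts.

    Assumption: the proportion is taken over the union of both contact
    sets (Jaccard-style); the abstract does not specify the denominator.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty contact sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(contacts):
    """Build a symmetric pairwise distance table keyed by (name, name)."""
    names = sorted(contacts)
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        d = contact_distance(contacts[x], contacts[y])
        dist[(x, y)] = dist[(y, x)] = d
    return dist


# Hypothetical toy data: each contact is a pair of helix labels.
toy = {
    "protein1": {("A", "H"), ("B", "E"), ("G", "H")},
    "protein2": {("A", "H"), ("B", "E"), ("B", "G")},
    "protein3": {("A", "H"), ("F", "H")},
}
matrix = distance_matrix(toy)
```

    A table like `matrix` could then be passed to any distance-based tree-building method; which method the paper used is not stated in the abstract.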


    How to Build Transcriptional Network Models of Mammalian Pattern Formation

    Chrissa Kioussi, Michael K. Gross



    Genetic regulatory networks of sequence specific transcription factors underlie pattern formation in multicellular organisms. Deciphering and representing the mammalian networks is a central problem in development, neurobiology, and regenerative medicine. Transcriptional networks specify intermingled embryonic cell populations during pattern formation in the vertebrate neural tube. Each embryonic population gives rise to a distinct type of adult neuron. The homeodomain transcription factor Lbx1 is expressed in five such populations and loss of Lbx1 leads to distinct respecifications in each of the five populations.

    Methodology/Principal Findings

    We have purified normal and respecified pools of these five populations from embryos bearing one or two copies of the null Lbx1GFP allele, respectively. Microarrays were used to show that expression levels of 8% of all transcription factor genes were altered in the respecified pool. These transcription factor genes constitute 20–30% of the active nodes of the transcriptional network that governs neural tube patterning. Half of the 141 regulated nodes were located in the top 150 clusters of ultraconserved non-coding regions. Generally, Lbx1 repressed genes that have expression patterns outside of the Lbx1-expressing domain and activated genes that have expression patterns inside the Lbx1-expressing domain.


    Constraining epistasis analysis of Lbx1 to only those cells that normally express Lbx1 allowed unprecedented sensitivity in identifying Lbx1 network interactions and allowed the interactions to be assigned to a specific set of cell populations. We call this method ANCEA, or active node constrained epistasis analysis, and think that it will be generally useful in discovering and assigning network interactions to specific populations. We discuss how ANCEA, coupled with population partitioning analysis, can greatly facilitate the systematic dissection of transcriptional networks that underlie mammalian patterning.

    Dispensability of mammalian DNA

    Cory McLean and Gill Bejerano
    Genome Res. 2008 18: 1743-1751

    In the lab, the cis-regulatory network seems to exhibit great functional redundancy. Many experiments testing
    enhancer activity of neighboring cis-regulatory elements show largely overlapping expression domains. Of recent
    interest, mice in which cis-regulatory ultraconserved elements were knocked out showed no obvious phenotype,
    further suggesting functional redundancy. Here, we present a global evolutionary analysis of mammalian conserved
    nonexonic elements (CNEs), and find strong evidence to the contrary. Given a set of CNEs conserved between
    several mammals, we characterize functional dispensability as the propensity for the ancestral element to be lost in
    mammalian species internal to the spanned species tree. We show that ultraconserved-like elements are over 300-fold
    less likely than neutral DNA to have been lost during rodent evolution. In fact, many thousands of noncoding loci
    under purifying selection display near uniform indispensability during mammalian evolution, largely irrespective of
    nucleotide conservation level. These findings suggest that many genomic noncoding elements possess functions that
    contribute noticeably to organism fitness in naturally evolving populations.

    Kernytsky A et al. Using genetic algorithms to s...[PMID: 18798568]
