Was this page helpful?

Enumerating RNA Structures

    Figure 1
    Figure 1: a-c, a single stem, two nested stems, and a pair of pseudoknotted stems shown as stem-loop diagrams. d-f, the same structures shown as dome plots

    Introduction

    RNA is a biolpolymer comprising four bases (A,C,G,U) linked into a linear chain by a sugar-phosphate backbone.  The chemical structures and importance of RNA as an information transfer molecule are described in all introductory texts on molecular biology. The biochemical function of RNA is dependent on its three-dimensional structure, which is created when the bases along the RNA chain interact to form base-pairs (A:U, C:G, G:C, G:U, U:A, U:G). Extensive analysis has been done of the number of possible secondary structures,i.e., structures that are comprised of nested and serial stems such as in fig 1a,b, but contain no pseudoknotted structures such as in fig 1c [ref:stadler].  Methods for predicting minimum free energy secondary structures have been widely applied [ref:zuker, hofacker].

    More recently, considerable attention has focused on predicting pseudoknotted structures (fig 1c) [refs:pknots].  Structures containing pseudoknots are biologically very important, but cannot be completely enumerated or predicted by the methods used for secondary structures.

    RNA structures are conventionally shown as stem-loop diagrams (fig 1a-c).  Here  the thick line indicates the RNA chain and the thin lines represent base-pairs between the bases along the chain (A:U, C:G, G:U).  The structures also may be shown as a dome plot (fig 1d-f) where the thick line again indicates the linear connectivity of the RNA chain and the thin lines the base-pairs.  For simplicity, the multiple base-pairs that make up a stem can be compressed to a single arc.  Many RNA structures with different base-pairing patterns can be represented by such a compressed dome plot - the compressed plot can then be considered to represent an RNA topology, i.e., a pattern of nesting and pseudoknots between the stems that comprise the structure.  This single-arc representation is fundamental to counting the number of topologies with specific numbers of stems.

    [giegerich work]

    Counting Connected Topologies

    Theorem: A complete set of topologies can be enumerated  for a given number of stems, n, by adding an additional stem to each topology with n-1 stems such that, in the dome plot, the left end of the new arc is to the left of all of the existing stems and the right end of the new arc lies in each possible interval created by the endpoints of the arcs in the previous structures.  Fig 2 shows an example of this procedure for n=2, a partial example is shown for n=3 in fig3a-b.

    Figure 2
    Figure 2. Extending a structure from n=1 to n=2 by addition of a new stem (red) at the leftmost position.
    Figure 3
    Figure 3. a,b. Extending a structure from n=2 to n=3 by addition of a new stem (red) at the leftmost position. Only the extension of the lower structure in figure 2 is shown. c. Addition of a new stem (red) at an internal location (c.1) can always be converted to a structure where the leftmost stem is considered to be the new stem (c.2)

    Proof:  While it might seem that added arcs could have their left end in any interval in the n-1 dome plot, it is clear that an arc that is added at other than the leftmost position can be relabeled so that the leftmost arc is considered to be the new arc, and the remainder is one of the previously enumerated n-1 structures.  An example of this is shown in fig 3c where an new arc is added at an internal position (fig 3c.1, red arc).  By relabeling the leftmost arc as the "new" arc, one can see that this structure corresponds to structure b.3.  As all the n-1 connected structures have previously been enumerated, this will always be the case.

    From the preceding, it is clear that there is always exactly one position to place the left end of the new arc, and 2n positions to place the right end.  The possible number of connected topologies, C(n), with n stems is therefore the product of the positive integers from 1 to 2n-2.

    where n is the number of stems.  As explained below, this count omits certain structures and hence can be considered to be a lower bound on the number of structures.  The rational used for counting the number of connected structures is extended in the next section to include all nested and overlapping (pseudoknotted) structures.

    Counting Unconnected Topologies

    Figure 4
    Figure 4.

    The count of connected topologies is an undercount because as you add each successive arc to the diagram you can enclose structures that were previously unconnected (fig 4).  In order to find all of the possible unconnected subtopologies that can be linked by an enclosing loop, you need to know the possible partitions (unconnected groups) that n-1 stems can be divided into.  This is a well known problem in combinatorics - the number of unordered partitions of a natural number n.  A partition is a nonincreasing sequence of positive integers which sum to n. There is no nice analytical form form for this but the code is fairly simple

    my $n = 7;
    
    my @x;
    push @x, $n;
    
    my $h = 0;      # index of last element > 0 (non-one digit)
    my $count = 1;
    print "p($count): @x\n";
    
    while ( $x[0] > 1 ) {
        if ( $x[$h] == 2 ) {
            # special case, the first non-one digit is two
            # just change it to one and add an additional one
            push @x, 1;
            $x[$h] = 1;
            $h--;
        } else {
            # general case, first non-one digit > 2
            my $r = $x[$h] - 1;        # new value for first non-one digit
            $x[$h] = $r;
            my $t = @x - $h;           # sum of ones in partition, i.e. the
                                       # total to repartition this time
            $#x = $h;                  # trim array of digits
            
            while ( $t >= $r ) {
                # add as many as possible of the next permitted value
                push @x, $r;
                $t -= $r;
             }
            $h = $#x;
                   
            if ( $t > 0 ) { 
                # add remainder if > 0, and increment position of 
                # last non-one digit if > 1
                push @x, $t; 
                if ( $t > 1 ) { $h = $#x; }
            }
        }
        $count++;
        print "p($count): @x\n";
    }

    Overall Count

    The total count of topologies is thus the number of connected topologies, C(n), plus the number that can be generated by all possible partitions of n-1 topologies           

    where p(n) is the set of partitions of an integer n, and i represents each integer in a particular partition π.  This is because for each connected topology with n stems, there are an additional set of structures enclosed by a single stem, of all possible kinds of relatively disjoint topologies whose numbers of stems sum to n-1.

    Table 1. Number of RNA Topologies with Stems from 1 to 8
    n     Partition     Subtotals     S(n)                        n     Partition     Subtotals     S(n)    
    1 (1) 1 1   7 (7) 61,476 62,216
                (51) 456  
    2 (2) 2 2     (42) 104  
                (411) 57  
    3 (3) 8 9     (33) 81  
      (11) 1       (321) 18  
                (3111) 9  
    4 (4) 54 57     (222) 8  
      (21) 2       (2211) 4  
      (111) 1       (21111) 2  
                (111111) 1  
    5 (5) 456 472          
      (31) 9     8 (8) 871,024 878,360
      (22) 4       (61) 5,123  
      (211) 2       (52) 944  
      (1111) 1       (511) 472  
                (43) 513  
    6 (6) 4560 5123     (421) 104  
      (41) 57       (4111) 57  
      (32) 18       (331) 81  
      (311) 9       (3211) 18  
      (221) 4       (31111) 9  
      (2111) 2       (2221) 8  
      (11111) 1       (22111) 4  
                (211111) 2  
                (1111111) 1  

     

    Note: Since at this point, i have not discussed the issue of 5'->3' symmetry, i should probably use the number of ordered partitions, instead of the number of unordered partitions.

    The numbers of stems calculated in this fashion differ those given by Li et al. [2008] which are based on a looser upper bound and post hoc filtering of the enumerated stems using a graph isomorphism check.  It should be noted that the graph isomorphism check used by Li et al. treats certain structures that are distinguishable due to their 5' to 3' orientation as redundant; in this work these structures would be considered distinguishable.  This partially accounts for the slightly higher values seen here.

    Table 2. Comparison with Li et al., 2008
     n      Li 2008
    Upper bound
        
     Li 2008
    Unique
        
     This work      Lower bound
    This work
        
     Unique
    5' -> 3'
    1 1 1 1 1 1
    2 2 2 2 2 2
    3 10 8 9 8 8
    4 78 49 57 48 55
    5 945   472 384
    6 746 4,207 5,123 3,840
    7 114,834 56,227 62,216 46,080

    Count of Distinct RNA Graphs

    Figure 5
    Figure 5.a. Stem-loop diagrams of similar structures in which the 5' -> 3' order of the stems differs. b. dome plots of the same structures. c. XIOS graph; both structures in a and b have the same graph.  In the XIOS graph, each vertex corresponds to a stem in the topology, an arrow indicates stems with a nested (or included) relationship, and a pseudoknot (overlapping) relationship is indicated by a solid line.  Serial edges are not shown.

    Several groups have developed methods that represent RNA structures as graphs [ref:RAGs, XIOS].  One of the primary purposes of such approaches is to identify similar or conserved RNA structures.  These representations are abstractions that only capture some of the biochemical characteristics of RNAs.  In particular, certain structures that may be distinguishable in a biological sense, may not be distinguishable as graphs (fig 5).  For instance in the XIOS representation, topologies that differ only in their 5' -> 3' order are not distinguishable.  This lowers the number of possible graphs considerably.

    We have not been able to determine a constructive method for enumerating XIOS RNA graphs that generates only biologically possible graphs while not constructing topologies that have internal 5' -> 3' symmetry of the type that makes the XIOS graphs of the topologies identical.  These redundant graphs can be detected using a test for graph isomprphism - as the graphs in question are relatively small

    Was this page helpful?
    Tag page (Edit tags)
    • No tags
    Viewing 2 of 2 comments: view all
    wow, this is very professional!
    Posted 09:30, 19 Aug 2011
    Just like a written-up paper!
    Posted 14:06, 19 Aug 2011
    Viewing 2 of 2 comments: view all
    You must login to post a comment.