Sequence 190 Annotation

    Found two fatty acid desaturase genes.

    Confirmed erg28 gene.

    Annotators: Weil, Gribskov

    DNA: Sequence 190, repeat masked sequence

    Predicted proteins: FGENESH, GeneMark

    Supplemental Information

    seq190cpg.png

    Sequence 190 is 400,000 bases long and contains 2121 unknown bases (0.53%), presumably all Ns. As expected, the content of CG and its complement GC is low: 0.67 and 0.87 times the expected frequencies, respectively. Surprisingly, AC content is also low, at 0.84 times the expected frequency. AA and TT content is unusually high, at 1.25 and 1.29 times the expected frequencies, respectively. I speculate that the high AA/TT content may be due to the presence of repeated sequences, but the reason for the low AC content is unknown. For further details, see supplemental table S1.
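    The observed/expected ratios quoted above can be approximated with a short script. This is a minimal sketch, not the method actually used for table S1: it assumes the expected frequency of a dinucleotide XY is simply freq(X) * freq(Y) on the given strand, and it skips any pair containing N. The function names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"

def dinucleotide_ratios(seq):
    """Return observed/expected ratios for all 16 dinucleotides.

    Assumption: expected frequency of XY is freq(X) * freq(Y), computed
    from the known (non-N) bases only. Pairs containing N are ignored.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    total_mono = sum(mono.values())
    total_di = sum(di.values())
    if total_mono == 0 or total_di == 0:
        raise ValueError("sequence has no usable dinucleotides")
    ratios = {}
    for x in BASES:
        for y in BASES:
            expected = (mono[x] / total_mono) * (mono[y] / total_mono)
            observed = di[x + y] / total_di
            ratios[x + y] = observed / expected if expected > 0 else float("nan")
    return ratios

def unknown_fraction(seq):
    """Fraction of the sequence made up of N characters."""
    return seq.upper().count("N") / len(seq)
```

    Called on the unmasked Sequence 190, `dinucleotide_ratios(seq)["CG"]` would be the value to compare with the 0.67 reported above. Small differences are possible if table S1 used a different definition of the expected frequency.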

    The locations of the undefined regions in the sequence are shown in table S2. Every region of Ns is precisely 100 bases long. This is highly suspect: the run length is probably a placeholder, so the missing sequence at these positions may be substantially longer or shorter. Treating each region between runs of Ns as a separate segment, 12.5% of the sequence is found in segments less than 10 kb long.

    Repeat_Analysis.png
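    The N-run and segment statistics above can be reproduced along these lines. This is a sketch under stated assumptions: coordinates are 0-based and half-open, and the fraction is taken relative to the full sequence length including Ns (the original calculation may have used a different denominator). All function names are hypothetical.

```python
import re

def n_runs(seq):
    """Return (start, end) pairs, 0-based half-open, for each run of Ns."""
    return [m.span() for m in re.finditer(r"N+", seq.upper())]

def segment_lengths(seq):
    """Lengths of the known-sequence segments lying between runs of Ns."""
    return [len(s) for s in re.split(r"N+", seq.upper()) if s]

def fraction_in_short_segments(seq, cutoff=10_000):
    """Fraction of the whole sequence found in segments shorter than cutoff.

    Assumption: the denominator is the full sequence length, Ns included.
    """
    short = sum(n for n in segment_lengths(seq) if n < cutoff)
    return short / len(seq)
```

    Checking `{end - start for start, end in n_runs(seq)}` is a quick way to confirm whether every run has the same length.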

    Sequence 190 contains several regions (190 - 220 kb, 305 - 315 kb, 370 - 380 kb, 390 - 400 kb) with conspicuously different GC content (Fig. 1, middle). Approximately 75% of sequence 190 is made up of retrotransposons and other repeats (Table S4). Figure 2 shows that some of the GC-rich regions correspond to long sections of repeats identified by RepeatMasker. The non-transposon regions (blue, left line) are concentrated in several relatively small regions around 20 kb, and between 60 - 105 kb, 170 - 190 kb, and 280 - 340 kb. An additional region from 140 - 150 kb is largely made up of transposon sequences mixed with short regions of sequence that cannot be identified as being of transposon origin. This region seems unlikely to be a coding region.
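    A windowed GC profile like the one described above can be sketched as follows. The window and step sizes are assumptions (the settings used for Fig. 1 are not given here), and Ns are excluded from the denominator. The function name is hypothetical.

```python
def gc_windows(seq, window=10_000, step=10_000):
    """Return a list of (start, gc_fraction) for successive windows.

    start is 0-based. Ns are excluded from the denominator; a window with
    no known bases gets NaN. Trailing bases that do not fill a whole
    window are dropped.
    """
    seq = seq.upper()
    results = []
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        known = sum(chunk.count(b) for b in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / known if known else float("nan")))
    return results
```

    Windows whose GC fraction departs strongly from the sequence-wide value would be the candidates to compare against the RepeatMasker annotation.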

    Predicted Proteins

    Some initial thoughts on the predicted proteins: there are at least two good genes (fatty acid desaturases). Orange indicates a definite gene, light orange or white entries need more checking, and blue entries seem uninteresting. Searches of the unmasked sequence against cDNAs have so far found only cDNAs for the fatty acid desaturases.

    Predicted Proteins
    ID  Begin (TSS)  End (PolA)  Strand  N Exons  Program (F=FGENESH, G=GeneMark)  BLAST
    190a (19901) 19787/19787 16677 (16467)/17770 -/- 8/6 F/G gypsy retrotransposon
    190b 16456 17432 + 3 G no match
    190c (33459) 34269/34281 34703 (34946)/34703 +/+ 2/2 F/G strong but short match to sorghum hypothetical. Transposon?
    190d 81056 82660 + 3 G no match
    190e (84851) 82982/82918 81435 (80169)/83322 -/- 4/2 F/G no match
    190f 85243 89377 + 2 G no match
    190g (89944)/89377 88223 (87657)/88223 -/- 1/1 F/G probable fatty acid desaturase. Strong end to end match with rice and sorghum. Nearly full-length cDNA EE036554.2 matches 89153 - 89934.

    190h (90767) 91169/91169 92317 (92428)/92317 +/+ 1/1 F/G probable fatty acid desaturase. Strong end to end match with rice and sorghum. Nearly full-length cDNA EE182703 matches 90924 - 91809.

    190i 100853 100674 - 2 G weak match to short sequence from Gibberella zeae.  Could be a contaminant
    190j 100907 101865 + 2 G no match
    190k (102262) 110346 110636 (111539) + 1 F no match
    190l 110346 110636 + 2 G no match
    190m 126199 125610 - 2 G very weak (e~6) match to Plumbago indica chloroplast matK.
    190n (146510) 146921 147088 (147545) + 1 F no match
    190o 147477 146651 - 3 G no match
    190p (181589) 181731/181731 182837 (183545)/182355 +/+ 3/3 F/G F(1-118): 1e-27 match to sorghum and rice hypothetical. G(1-105): match to sorghum and rice, possibly erg28-like. erg28 is involved in ergosterol biosynthesis. blastx of 175 - 185 kb finds a full-length erg28-like protein (e-35), so this gene is probably real.

    190q 222157 221929 - 2 G no match
    190r 271738 273183 + 2 G no match
    190s (291450) 291830/292367 292545 (292610)/293561 +/+ 1/3 F/G F(1-57): almost perfect match to the middle of a rice putative polyA binding protein. G: no match. No blastx hits in the region 280 - 310 kb.

    190t 300265 299625 - 3 G weak matches to zinc finger and dihydropyrimidine amidohydrolase
    190u (300964) 300960 297947 (297736) - 3 F F(13-62) matches rice putative dihydropyrimidine amidohydrolase

     

    190v (318934) 321297/321726 324688 (324740)/326184 +/+ 4/7 F/G F(246-323) matches rice hypothetical protein, 6e-08. There appear to be pentatricopeptide repeats here. blastx shows what look like two clear exons in the range 322.5 - 324.5 kb.
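    Where FGENESH (F) and GeneMark (G) predictions are listed under one ID, or under adjacent IDs such as 190k and 190l, the amount of agreement can be checked by coordinate overlap. The sketch below assumes the table coordinates are 1-based and inclusive, and it uses only the unparenthesised begin and end values. The helper names are hypothetical.

```python
def span(begin, end):
    """Normalise a (begin, end) pair to (low, high), whatever the strand."""
    return (min(begin, end), max(begin, end))

def overlap_bases(a, b):
    """Number of bases shared by two spans, assuming inclusive coordinates."""
    (a_lo, a_hi), (b_lo, b_hi) = span(*a), span(*b)
    return max(0, min(a_hi, b_hi) - max(a_lo, b_lo) + 1)

# 190a: F begin/end vs G begin/end, taken from the table above
print(overlap_bases((19787, 16677), (19787, 17770)))
# 190k (F) vs 190l (G)
print(overlap_bases((110346, 110636), (110346, 110636)))
```

    Overlap in bases says nothing about whether the exon structures agree; comparing exon counts (8 versus 6 for 190a) remains a separate check.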
     

    Files (5)

    File  Description  Size  Date  Attached by
    fgenesh_predicted_proteins.fa.txt  FGENESH predicted proteins  11.08 kB  16:23, 20 Sep 2010  gribskov
    genmark_masked_proteins.txt  GeneMark predicted proteins  3.99 kB  16:23, 20 Sep 2010  gribskov
    Repeat_Analysis.png  No description  63.54 kB  13:57, 14 Sep 2010  gribskov
    RM2sequpload_1284417569.masked.txt  No description  398.46 kB  15:55, 20 Sep 2010  gribskov
    seq190cpg.png  No description  8.74 kB  18:07, 13 Sep 2010  gribskov