
     

    Annotators: Minming Li & Jin Sun

    DNA sequence: Seq 206(Original Sequence), Repeat Masked Sequence

    Predicted proteins: FGENESH Genmark 

     

    Supplemental Information 

    Analysis and Explanations:

    Sequence 206 is 400,000 bases long and contains 3874 unknown bases (0.97%), presumably all Ns. The dinucleotide CG and its complement GC are underrepresented, at 0.69 and 0.87 times their expected frequencies respectively. GT and TA are also low, at 0.83 and 0.84 times the expected frequency. AA and TT are unusually high, at 1.29 and 1.26 times the expected frequencies respectively. I speculate that the high AA and TT content may be due to the presence of repeated sequences, but the reason for the high AC content is unknown. For further details, see supplemental table S1.
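
Ratios of this kind can be recomputed with a short script. The Python sketch below is only an illustration, not the tool behind table S1: it assumes that the expected frequency of a dinucleotide is the product of its two single-base frequencies, and the function name `dinucleotide_ratios` is hypothetical. Pairs that contain an N are skipped.

```python
from collections import Counter


def dinucleotide_ratios(seq):
    """Return {dinucleotide: observed/expected} for a DNA sequence.

    Assumption: the expected frequency of a pair is the product of the
    frequencies of its two bases. Pairs containing N (or any symbol
    other than A, C, G, T) are ignored.
    """
    seq = seq.upper()
    bases = Counter(b for b in seq if b in "ACGT")
    total_bases = sum(bases.values())
    pairs = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in "ACGT" and seq[i + 1] in "ACGT"
    )
    total_pairs = sum(pairs.values())
    ratios = {}
    for pair, count in pairs.items():
        expected = (bases[pair[0]] / total_bases) * (bases[pair[1]] / total_bases)
        ratios[pair] = (count / total_pairs) / expected
    return ratios
```

A ratio below 1 means the pair is rarer than the base composition alone would predict, and a ratio above 1 means it is more common.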

     

    Our team used a Perl program to find the location and length of the undefined regions in the sequence, and arranged the program's output into table S2. "Position" in table S2 is the position of the first N of each run. "Segment Length" is the length of the defined region (not including Ns) between the last N of the previous run and the first N of the current run. There are 39 regions of Ns in the whole sequence. Every region is exactly 100 bases long, with one exception: at position 31903, the region of Ns is only 35 bases long. The reason for this exception is unknown.
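
The Perl program itself is not reproduced on this page. As a rough equivalent, the Python sketch below reports the same three quantities for each run of Ns. The function name `find_n_runs` and the use of 1-based positions are assumptions made for illustration.

```python
import re


def find_n_runs(seq):
    """Return a list of (position, run_length, segment_length) tuples.

    position        1-based index of the first N in the run (assumed convention)
    run_length      number of consecutive Ns in the run
    segment_length  defined bases between the end of the previous run
                    (or the start of the sequence) and this run
    """
    runs = []
    prev_end = 0  # 0-based index just past the previous run of Ns
    for match in re.finditer(r"N+", seq.upper()):
        position = match.start() + 1
        run_length = match.end() - match.start()
        segment_length = match.start() - prev_end
        runs.append((position, run_length, segment_length))
        prev_end = match.end()
    return runs
```

Applied to the full sequence, `len(find_n_runs(seq))` gives the number of N regions, and a run that differs from the others can be picked out by its `run_length`.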

     

    Sequence 206 contains several regions (15-30 kb, 63-78 kb, 82-105 kb, 160-170 kb, 370-390 kb) with conspicuously different GC content (Fig. 1, middle). For further details, see supplemental table S3.

     

    Figure 1 (image: cpgplot.1.png)
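
The file name suggests that Figure 1 came from cpgplot. As an independent cross-check of regional GC content, a simple windowed calculation can be sketched in Python. This is not cpgplot's algorithm; the function name `gc_by_window`, the default window size of 1000 bases, and the use of non-overlapping windows are all assumptions made for illustration.

```python
def gc_by_window(seq, window=1000):
    """Return [(start, end, gc_fraction)] for consecutive non-overlapping windows.

    start and end are 1-based and inclusive. Ns are excluded from the
    denominator, and a window that contains no A, C, G or T yields None.
    """
    seq = seq.upper()
    result = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        defined = gc + chunk.count("A") + chunk.count("T")
        fraction = gc / defined if defined else None
        result.append((start + 1, start + len(chunk), fraction))
    return result
```

Windows whose fraction departs clearly from the sequence-wide value can then be compared against the regions listed above.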

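    Approximately 79.52% of sequence 206 is made up of retrotransposons and other repeats (Table S4). The breakdown is: Retroelements 73.00% + Transposons 6.29% + Simple Repeats 0.06% + Low Complexity 0.16%. These rounded values add up to 79.51%; the 0.01% difference from the reported total is presumably a rounding effect.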

     

    During the analysis we ran into a problem. We used the masked sequence as input to the GeneMark program, but no matter how many times we tried, we could not get correct output. We even repeated the previous steps to make sure there were no mistakes in generating the masked sequence. We suspected that the problem might be related to the unusual 35-base region of Ns shown in table S2, but this is not certain. Dr. Gribskov suggested that we cut the sequence into two or three parts and run GeneMark on each part, and this worked. The output is pasted in the supplemental information section. Because the sequence was cut in half for input, the start and end positions of genes predicted in the second half must be adjusted before comparing the results with the FGENESH output.
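
The split-and-adjust workaround can be sketched in Python as follows. The helper names `split_sequence` and `to_full_coordinates` are hypothetical, and the sketch assumes 1-based gene coordinates and equal-sized chunks; the exact cut positions used for the GeneMark runs are not recorded on this page.

```python
def split_sequence(seq, parts=2):
    """Split seq into at most `parts` consecutive chunks of equal size.

    Returns a list of (offset, chunk) pairs, where offset is the number
    of bases that precede the chunk in the full sequence.
    """
    size = max(1, -(-len(seq) // parts))  # ceiling division, at least 1
    return [(start, seq[start:start + size]) for start in range(0, len(seq), size)]


def to_full_coordinates(begin, end, offset):
    """Shift chunk-relative 1-based coordinates back onto the full sequence."""
    return begin + offset, end + offset
```

If a 400,000-base sequence is cut exactly in half, the second chunk has an offset of 200,000, so a hypothetical prediction at 1000..2000 in that chunk corresponds to 201000..202000 in the full sequence.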

     

    After finishing these steps, we could analyze our sequence.

    1). Using the outputs from the FGENESH and GeneMark programs, we compared the predictions and arranged them into a text file called "Comparison of  FGENESH and GenMark results for Protein-BLAST", and then ran Protein BLAST to identify the possible proteins.

    2). Results: please see the table below. For detailed analysis, follow the links on some entries. (Purple means a definite gene, blue means further checking is needed, white means no match.)

    Predicted Proteins

    Where both programs predict a gene, paired values are given as FGENESH / GeneMark.

    | ID | Begin (TSS) | End (PolA) | Length | Strand | N Exons | Program (F=FGENESH, G=GeneMark) | BLAST |
    |---|---|---|---|---|---|---|---|
    | 206-1 | 50387 / 51951 | 49302 / 49638 | 182aa / 265aa | - / - | 3 / 4 | F / G | Histidine kinase [Zea mays]; also matches Oryza sativa Japonica Group (7e-64) |
    | 206-2 | 52339 | 52005 | 81aa | - | 2 | G | No match |
    | 206-3 | 54455 | 54017 | 67aa | - | 2 | G | No match |
    | 206-4 | 51764 / 56885 | 57416 / 57416 | 735aa / 141aa | + / + | 12 / 2 | F / G | Probably contains the PLN02913 (dihydrofolate synthase) domain |
    | 206-5 | 108256 | 101066 | 218aa | - | 4 | G | No match |
    | 206-6 | 107993 / 108281 | 108748 / 113925 | 214aa / 201aa | + / + | 3 / 3 | F / G | Similar to the 206-4 result; also contains the PLN02913 (dihydrofolate synthase) domain |
    | 206-7 | 150541 | 150773 | 24aa | + | | G | No match |
    | 206-8 | 179015 / 179015 | 178653 / 178653 | 120aa / 120aa | - / - | 1 / 1 | F / G | Hypothetical protein SORBIDRAFT_01g006230 [Sorghum bicolor] |
    | 206-9 | 179830 | 182842 | 132aa | + | 5 | F | No match |
    | 206-10 | 213251 | 212793 | 91aa | - | 3 | F | Hypothetical splicing factor, arginine/serine-rich 12 [Zea mays]; BLAST also confirmed a strong match with Os12g0553900 [Oryza sativa Japonica Group] DNA |
    | 206-11 | 214564 | 214448 | 38aa | - | 1 | G | Similar results to 206-10 |
    | 206-12 | 214695 | 214856 | 53aa | + | 1 | F | No match |
    | 206-13 | 289309 / 289822 | 285982 / 285982 | 489aa / 408aa | - / - | 7 / 6 | F / G | Matches F-box family protein; also matches ubiquitin-protein ligase |
    | 206-14 | 337665 / 337665 | 338435 / 338556 | 256aa / 267aa | + / + | 1 / 2 | F / G | No match [for FGENESH, only one similarity was found (ApbE family lipoprotein), with an E-value of 7.4; for GeneMark, no similarities were found in the library] |
    | 206-15 | 361821 | 362584 | 171aa | + | 4 | G | No match |
    | 206-16 | 399962 / 399953 | 396506 / 399547 | 265aa / 98aa | - / - | 5 / 2 | F / G | Matches nodulin family protein of Arabidopsis thaliana; possibly a gene containing the Oxalate/Formate Antiporter domain |
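
The pairing of FGENESH and GeneMark predictions described in step 1 could be partly automated with an overlap check. The Python sketch below is one possible criterion, not the procedure actually used for this page: it treats two predictions as the same candidate gene when they lie on the same strand and share at least one base. The function names `span` and `overlaps` are hypothetical.

```python
def span(begin, end):
    """Return (low, high) regardless of strand orientation."""
    return (begin, end) if begin <= end else (end, begin)


def overlaps(pred_a, pred_b):
    """True if two predictions share a strand and at least one base.

    Each prediction is a (begin, end, strand) tuple with coordinates
    on the full sequence.
    """
    if pred_a[2] != pred_b[2]:
        return False
    a_low, a_high = span(pred_a[0], pred_a[1])
    b_low, b_high = span(pred_b[0], pred_b[1])
    return a_low <= b_high and b_low <= a_high
```

For example, with the coordinates listed for 206-1, `overlaps((50387, 49302, "-"), (51951, 49638, "-"))` is true, so the two predictions would be grouped under one ID. A stricter rule, such as a minimum overlap length, may be preferable in practice.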
