Smo5 project

    1. Annotators

    Pasajee Kongsila

    Nimnara Yookongkaew

    2. Sequence repeats 

        The dot plot shows two main regions of tandem repeats, at around 4K-8K and 60K-73K.

        Etandem shows SSRs at 14-14.5K (ca, gct) and 54-56K (ta), and a minisatellite at 22-25K.
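
    As an illustration of what an SSR such as the (ca) or (gct) repeats above looks like in sequence terms, here is a minimal sketch that scans a sequence for perfect tandem repeats of a 2-3 base unit. It is not the etandem algorithm; the function name `find_ssrs`, its parameters and the toy sequence are assumptions made for this example only.

```python
import re

def find_ssrs(seq, min_unit=2, max_unit=3, min_repeats=5):
    """Find perfect tandem repeats of a short unit (hypothetical helper).

    Returns a list of (start, end, unit, copies) with 1-based inclusive
    coordinates. Only exact repeats are reported, which is a much simpler
    criterion than the scoring used by etandem.
    """
    seq = seq.lower()
    # A unit of min_unit..max_unit bases followed by at least
    # (min_repeats - 1) further copies of the same unit.
    pattern = re.compile(
        r"([acgt]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs such as "aaaaaaaaaa"
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start() + 1, m.end(), unit, copies))
    return hits

# Toy sequence (not taken from scaffold_14): a (ca)6 and a (gct)5 repeat.
toy = "ttg" + "ca" * 6 + "ggatc" + "gct" * 5 + "a"
for start, end, unit, copies in find_ssrs(toy):
    print(start, end, unit, copies)
```

    On a real scaffold the thresholds would need tuning, and imperfect repeats (which etandem can report) would be missed by this exact-match approach.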

        2.1 Dot plot

        [Figure: dottup.1.png]

        2.2 Etandem

        2.3 Repeat Masking

    3. Basic statistics


    4. CpG islands

    The regions highlighted in red are CpG islands that occur within the predicted genes.

    CPGPLOT islands of unusual CG composition
    scaffold_14 from 1 to 100001

         Observed/Expected ratio > 0.60
         Percent C + Percent G > 50.00
         Length > 200
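
    The three thresholds above can be made concrete with a small calculation. The sketch below computes %C + %G and the observed/expected CpG ratio for one candidate region, taking expected = (number of C x number of G) / length. It is a simplified single-region check, not the cpgplot algorithm, which evaluates sliding windows; the function names are assumptions for this example.

```python
def cpg_stats(seq):
    """Return (percent C+G, observed/expected CpG ratio) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # observed CpG dinucleotides
    gc_percent = 100.0 * (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return gc_percent, obs_exp

def passes_thresholds(seq, min_len=200, min_gc=50.0, min_obs_exp=0.60):
    """Apply the three criteria listed above to a candidate region.

    The length test uses >= because the reported islands include one of
    exactly 200 bp (58450..58649).
    """
    gc_percent, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_percent > min_gc and obs_exp > min_obs_exp

# Toy inputs, not taken from scaffold_14.
print(cpg_stats("CG" * 150))
print(passes_thresholds("CG" * 150), passes_thresholds("AT" * 150))
```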

    Length 228 (284..511) Length 278 (2080..2357) Length 511 (2959..3469) Length 264 (4268..4531)

    Length 268 (5908..6175) Length 416 (6199..6614) Length 333 (7173..7505) Length 222 (7886..8107)

    Length 460 (8143..8602)
    Length 226 (9305..9530) Length 597 (9705..10301) Length 373 (10316..10688)

    Length 362 (12222..12583) Length 277 (12622..12898) Length 290 (12907..13196) Length 905 (13222..14126)

    Length 622 (14639..15260) Length 510 (15960..16469) Length 507 (16530..17036) Length 303 (17129..17431)

    Length 275 (17478..17752) Length 371 (17793..18163)
    Length 391 (22261..22651) Length 278 (28889..29166)

    Length 317 (29766..30082) Length 669 (31623..32291) Length 547 (32403..32949) Length 459 (33024..33482)

    Length 953 (33946..34898)
    Length 652 (35817..36468) Length 256 (36884..37139) Length 738 (37302..38039)

    Length 946 (38271..39216) Length 459 (39265..39723) Length 959 (39816..40774) Length 618 (42590..43207)

    Length 627 (43591..44217) Length 305 (44263..44567) Length 655 (44576..45230) Length 217 (47089..47305)

    Length 419 (48830..49248) Length 253 (58091..58343) Length 200 (58450..58649) Length 251 (59053..59303)

    Length 407 (59488..59894)
    Length 526 (67626..68151) Length 283 (69575..69857) Length 1723 (69865..71587)

    Length 271 (71957..72227) Length 285 (73763..74047) Length 262 (74474..74735) Length 2596 (74894..77489)

    Length 348 (78809..79156) Length 225 (79805..80029) Length 758 (82205..82962) Length 399 (83383..83781)

    Length 1120 (83882..85001) Length 836 (89449..90284) Length 1268 (90681..91948) Length 303 (92378..92680)

    Length 350 (94193..94542) Length 578 (99300..99877)
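
    To check which islands fall within predicted genes, the island coordinates can be compared with the gene coordinates by a simple interval-overlap test. The sketch below uses the first four islands from the list above and the FGENESH coordinates of genes 1 and 2 from the gene prediction table in section 5; the function names are assumptions for this example.

```python
def overlaps(a, b):
    """True if two inclusive (begin, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]

def islands_in_genes(islands, genes):
    """Map each island to the names of the genes it overlaps (if any)."""
    result = {}
    for island in islands:
        hits = [name for name, span in genes.items() if overlaps(island, span)]
        if hits:
            result[island] = hits
    return result

# First four CpG islands reported by cpgplot.
islands = [(284, 511), (2080, 2357), (2959, 3469), (4268, 4531)]
# FGENESH predictions for genes 1 and 2.
genes = {"gene 1": (280, 669), "gene 2": (3948, 4690)}

for island, names in islands_in_genes(islands, genes).items():
    print(island, names)
```

    With only FGENESH genes 1 and 2 as input, the islands at 2080..2357 and 2959..3469 are not reported, since they lie between those two genes.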

      

    5. Gene prediction

        Tools

                - FGENESH (monocot)

                - GENEMARK (rice, maize)

                - GENSCAN (maize)


        Sequence analysis

        - First 50,000 bases and translation frame for gene 1-14 

        - Second 50,000 bases


    | FGENESH (# begin end exons strand) | Genemark (# begin end exons strand) | Genscan (# begin end exons strand) | Gene predicted | Comments | Alignment | tblastn / BLASTp |
    |---|---|---|---|---|---|---|
    | 1 280 669 1 - | 1 200 592 1 + | 1 280 669 1 - | XM_001774108 Physcomitrella patens subsp. patens early light-induced protein 9 (ELIP9) mRNA, complete cds | Both programs predicted the same gene, but FGENESH placed it on the minus strand, while Genemark predicted a shorter gene on the plus strand. | Gene1 Alignment | Gene1 BLASTp |
    | | A 1701 3128 4 + | | Best tblastn hit: no significant similarity | Only Genemark predicted this region, and it has no database match, so it might be a pseudogene. | Alignment | tblastn |
    | 2 3948 4690 2 - | 2 3948 4690 2 - | 2 3948 4690 2 - | XM_001774108 Physcomitrella patens subsp. patens early light-induced protein 9 (ELIP9) mRNA, complete cds | InterProScan predicted an early light-induced protein domain and a chlorophyll a/b binding domain. | Gene2 Alignment | Gene2 BLASTp |
    | 3 5869 6603 2 - | 3 5014 6603 3 - | 3 5869 6603 2 - | XM_001774108 Physcomitrella patens subsp. patens early light-induced protein 9 (ELIP9) mRNA, complete cds | Same domain prediction as for gene 2. | Gene3 Alignment | Gene3 BLASTp |
    | 4 7818 8569 2 - | 4 7506 8569 3 - | 4 7381 8569 3 - | XM_001774108 Physcomitrella patens subsp. patens early light-induced protein 9 (ELIP9) mRNA, complete cds | Same domain prediction as for gene 2. (Genes 1-4 are the repeat regions shown in the dot plot.) | Gene4 Alignment | Gene4 BLASTp |
    | 5 9337 14094 5 + | 5.1 9409 11579 4 + | 5 9409 15362 5 + | Transducin family protein | InterProScan predicted a WD40 repeat domain. | Gene5 Alignment | Gene5 BLASTp |
    | | 5.2 12476 14094 2 + | | XM_001756373 Physcomitrella patens subsp. patens predicted protein (PHYPADRAFT_24385) mRNA, partial cds | FGENESH gene 5 covers the two genes predicted by Genemark (5.1 and 5.2). | Alignment | tblastn |
    | 6 14409 16995 5 - | 6 14409 16995 6 - | 6 15967 16995 4 - | 3-hydroxyacyl-CoA dehydrogenase | Both programs predicted the same gene with the same length but different exons; Genemark predicted more exons than FGENESH. | Gene6 Alignment | Gene6 BLASTp |
    | 7 17202 18382 5 + | 7 17202 18382 5 + | 7 17202 18293 4 + | TatC-like protein | Sec-independent periplasmic protein translocase TatC domain | Gene7 Alignment | Gene7 BLASTp |
    | | 8.1 19414 21489 2 - | | AC158184 Selaginella moellendorffii clone JGIASXY-5E21, complete sequence | Only Genemark predicted this gene, which is the same as the next predicted gene (8.2). | Alignment | tblastn |
    | 8 22143 22436 1 - | 8.2 21592 22436 3 - | 8 21788 22436 2 - | AC158184 Selaginella moellendorffii clone JGIASXY-5E21, complete sequence | BLASTp found no matching gene, so it might be a pseudogene. | Gene8 Alignment | Gene8 BLASTp |
    | | B 23480 27219 3 + | | Best tblastn hit: no significant similarity found | Only Genemark predicted this gene, and it has no match in the NCBI database, so it might be a pseudogene. | Alignment | tblastn |
    | | | 9.1 28075 31191 4 - | | | Alignment | tblastn |
    | 9 27984 32821 16 - | 9 27260 32821 19 - | 9.2 31784 32828 2 - | ATP-dependent metalloprotease | AAA ATPase domain | Gene9 Alignment | Gene9 BLASTp |
    | 10 33215 35059 2 + | 10 33140 35059 2 + | 10 33140 35059 2 - | Glycoside hydrolase family 17 | Both programs predicted the same gene with the same two exons and the same stop codon, but with different start codons. | Gene10 Alignment | Gene10 BLASTp |
    | C 35757 37065 6 - | c.1 35757 36196 2 - | C 35757 37065 3 - | AP009673 Lotus japonicus genomic DNA, clone: LjT02J24, TM0401, complete sequence. E-value = 5.1 | FGENESH predicted this gene, which covers the region of the two genes predicted by Genemark (c.1 and c.2). However, its database match has a high E-value, so it might be a pseudogene. | GeneC Alignment | GeneC BLASTp |
    | | c.2 36525 37001 1 + | | AP007255 Magnetospirillum magneticum AMB-1 DNA, complete genome. E-value = 3.7 | Same comment as for c.1. | Alignment | tblastn |
    | 11 37784 39157 2 + | 11 37784 39157 2 + | 11 37784 39157 2 + | Serine-threonine protein kinase | Both programs predicted exactly the same gene. | Gene11 Alignment | Gene11 BLASTp |
    | | D 39285 39719 1 + | D 39282 39705 2 + | Ferredoxin | Only Genemark predicted this gene. It has a low E-value but just one exon, so it might be a pseudogene. | GeneD Alignment | GeneD BLASTp |
    | 12 39791 43119 8 - | 12.1 39791 42449 7 - | 12 39791 43119 4 - | EU262743 Selaginella moellendorffii putative gibberellin receptor mRNA, complete cds | Esterase_lipase superfamily | Gene12 Alignment | Gene12 BLASTp |
    | | 12.2 42617 43119 3 - | | EF081679 Picea sitchensis clone WS02813_H13 unknown mRNA | The stop codon of the FGENESH-predicted gene is the same as that of Genemark gene 12.2. | Alignment | tblastn |
    | 13 43595 45184 1 - | 13 43595 45184 1 - | 13 43595 48434 7 - | Glyoxal oxidase N-terminal | Both programs predicted exactly the same gene. | Gene13 Alignment | Gene13 BLASTp |
    | | E 45423 45782 2 + | | AC157217 Pan troglodytes BAC clone CH251-305B5 from chromosome unknown, complete sequence. E-value = 6.7 | The best match has a high E-value, so this might be a pseudogene. | Alignment | tblastn |
    | F 46663 48636 7 - | F 46663 48936 7 - | | Polygalacturonase A | Glycohydrolase 28 | GeneF Alignment | GeneF BLASTp |
    | 14 48850 49940 4 + | 14 48850 49940 4 + | | Bifunctional phosphoribosyl-ATP pyrophosphohydrolase | Both programs predicted exactly the same gene. | Gene14 Alignment | Gene14 BLASTp |

    Gene prediction program comparison

    | FGENESH (# begin end exons strand) | Genemark (# begin end exons strand) | Genscan (# begin end exons strand) | Comment (FGENESH) | Comment (GENEMARK) | Comment (GENSCAN) |
    |---|---|---|---|---|---|
    | 1 280 669 1 - | 1 200 592 1 + | 1 280 669 1 - | * | Different | * |
    | | A 1701 3128 4 + | | | Different | |
    | 2 3948 4690 2 - | 2 3948 4690 2 - | 2 3948 4690 2 - | * | * | * |
    | 3 5869 6603 2 - | 3 5014 6603 3 - | 3 5869 6603 2 - | * | * with longer C terminus | * |
    | 4 7818 8569 2 - | 4 7506 8569 3 - | 4 7381 8569 3 - | * | * with longer C terminus | * with longer C terminus |
    | 5 9337 14094 5 + | 5.1 9409 11579 4 + | 5 9409 15362 5 + | * | Split gene 5 into 2 genes | * but shorter than FGENESH |
    | | 5.2 12476 14094 2 + | | | Split gene 5 into 2 genes | |
    | 6 14409 16995 5 - | 6 14409 16995 6 - | 6 15967 16995 4 - | * but shorter than Genemark | * | * but shorter than other two programs |
    | 7 17202 18382 5 + | 7 17202 18382 5 + | 7 17202 18293 4 + | * a few differences from Genemark | * a few differences from FGENESH | * much shorter than other two programs |
    | | 8.1 19414 21489 2 - | | | Different | |
    | 8 22143 22436 1 - | 8.2 21592 22436 3 - | 8 21788 22436 2 - | * | * | * |
    | | B 23480 27219 3 + | | | Different | |
    | | | 9.1 28075 31191 4 - | | | Different |
    | 9 27984 32821 16 - | 9 27260 32821 19 - | 9.2 31784 32828 2 - | * a few differences from Genemark | * a few differences from FGENESH | * much shorter than other two programs |
    | 10 33215 35059 2 + | 10 33140 35059 2 + | 10 33140 35059 2 - | * with shorter N terminus | * | * |
    | C 35757 37065 6 - | c.1 35757 36196 2 - | C 35757 37065 3 - | * | * with shorter N terminus | * with shorter N terminus |
    | | c.2 36525 37001 1 + | | | Different | |
    | 11 37784 39157 2 + | 11 37784 39157 2 + | 11 37784 39157 2 + | * | * | * |
    | | D 39285 39719 1 + | D 39282 39705 2 + | | * with longer C terminus | * |
    | 12 39791 43119 8 - | 12.1 39791 42449 7 - | 12 39791 43119 4 - | * | Split gene 12 into 2 genes | * but shorter than FGENESH |
    | | 12.2 42617 43119 3 - | | | Split gene 12 into 2 genes | |
    | 13 43595 45184 1 - | 13 43595 45184 1 - | 13 43595 48434 7 - | * | * | * with much longer N terminus |
    | | E 45423 45782 2 + | | | Different | |
    | F 46663 48636 7 - | F 46663 48936 7 - | | * | * | |
    | 14 48850 49940 4 + | 14 48850 49940 4 + | | * | * | |

      

    * means the predictions share some similar regions
    Yellow marks the longest gene prediction
    Green marks a prediction that differs from the other programs
    Blue marks a prediction that is identical in at least two programs
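
    The categories used in this comparison (identical, partly similar, different) can be expressed as a simple rule on the predicted coordinates. The sketch below is one possible reading of those categories, not the procedure the annotators followed: two predictions are "identical" when begin, end, exon count and strand all agree, "overlapping" when they share bases on the same strand, and "different" otherwise. The function names and category labels are assumptions; the coordinates are FGENESH and Genemark values copied from the table.

```python
def same_strand_overlap(a, b):
    """True if two (begin, end, exons, strand) predictions share bases on one strand."""
    return a[3] == b[3] and a[0] <= b[1] and b[0] <= a[1]

def classify(a, b):
    """Compare two predictions; None means the program predicted nothing here."""
    if a is None or b is None:
        return "missing"
    if a == b:
        return "identical"
    if same_strand_overlap(a, b):
        return "overlapping"
    return "different"

# (begin, end, exons, strand) for a few rows of the comparison table.
predictions = {
    "1": {"FGENESH": (280, 669, 1, "-"), "Genemark": (200, 592, 1, "+")},
    "2": {"FGENESH": (3948, 4690, 2, "-"), "Genemark": (3948, 4690, 2, "-")},
    "6": {"FGENESH": (14409, 16995, 5, "-"), "Genemark": (14409, 16995, 6, "-")},
    "A": {"FGENESH": None, "Genemark": (1701, 3128, 4, "+")},
}

for gene, p in predictions.items():
    print(gene, classify(p["FGENESH"], p["Genemark"]))
```

    Under this rule gene 6 counts as overlapping rather than identical, because the two programs agree on the boundaries but not on the number of exons.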

    Summary of the gene prediction program comparison

    | Conclusion: comparison of the three gene prediction programs | Program | Comment |
    |---|---|---|
    | Most genes predicted | Genemark | Some predicted genes do not match any protein sequence in NCBI |
    | Longest genes predicted | Genemark / FGENESH | |
    | Genes predicted by all three programs | | 2 genes that are exactly alike (genes 2 and 11) |
    | | | 9 genes that are almost alike or share some similar regions (genes 3, 4, 6, 7, 8, 9, 10, C, 13) |
    | Genes predicted by two programs | FGENESH / GENSCAN | 1 gene (gene 1) |
    | | Genemark / GENSCAN | 1 gene (gene D) |
    | | FGENESH / Genemark | 2 genes (genes F and 14) |
    In the first 50,000 bases, FGENESH and Genemark performed best: FGENESH sometimes predicted genes that cover the regions predicted by both Genemark and GENSCAN, and Genemark predicted some genes that the other two programs did not. However, most of the genes predicted only by Genemark do not match anything in the protein database, so they might be pseudogenes or wrong predictions. The GENSCAN predictions differ considerably from those of the other two programs (its genes are shorter), because GENSCAN used maize as the model for prediction while the other programs used a monocot model.