Welcome to Group 4!

    Group members

    - Helena Avila, first year PhD student in Ecological Science and Engineering program, Department of Agronomy

    - Yi Cui, second year PhD student in Biological Engineering, focusing on epigenetics, optical biophysics

    - Hady Wahby, second year PhD student in Biological Sciences, research using experimental and computational approaches to better understand proteins

    Project Contents

    1. Tools

    2. Overall characteristics

      2.1. Content overview

      2.2. Overall alignment (Blastn)

      2.3. Repeat masking

    3. Gene structure prediction

      3.1. Augustus

      3.2. CpG islands finding

      3.3. Gene prediction using FGENESH and GeneMark

           3.3.1. Using raw sequence

           3.3.2. Using masked sequence

           3.3.3. Comparison

      3.4. Summary

    4. Gene annotation

    5. Non-coding RNA annotation

    6. Final presentation

    7. References

     

    1. Tools

    Pipeline: Ensembl, Gramene

    1. Ab initio gene prediction
        GlimmerHMM, FGENESH, GeneMark.hmm, Augustus
    2. Homology search
        genBlastA, WU-BLASTP
    3. Homology-based gene prediction
        Exonerate, GeneWise
    4. EST-based gene structure prediction
        PASA
    5. Gene prediction combiner
        EVidenceModeler
    6. Functional annotation
        InterProScan

     

    2. Overall characteristics

    2.1. Content overview

    Word size	1
    # Word	Obs Count	Obs Frequency	Exp Frequency	Obs/Exp Frequency
    A	40058		0.2670533	0.2500000	1.0682133
    C	36315		0.2421000	0.2500000	0.9684000
    G	34857		0.2323800	0.2500000	0.9295200
    T	37970		0.2531333	0.2500000	1.0125333
    Other	800		0.0053333	0.0000000	N.A
    
    Word size	2
    # Word	Obs Count	Obs Frequency	Exp Frequency	Obs/Exp Frequency
    AA	12033		0.0802205	0.0625000	1.2835286
    AC	8597		0.0573137	0.0625000	0.9170194
    AG	9471		0.0631404	0.0625000	1.0102467
    AT	9955		0.0663671	0.0625000	1.0618737
    CA	10259		0.0683938	0.0625000	1.0943006
    CC	9604		0.0640271	0.0625000	1.0244335
    CG	7231		0.0482070	0.0625000	0.7713118
    CT	9218		0.0614537	0.0625000	0.9832599
    GA	9703		0.0646871	0.0625000	1.0349936
    GC	8637		0.0575804	0.0625000	0.9212861
    GG	8909		0.0593937	0.0625000	0.9502997
    GT	7608		0.0507203	0.0625000	0.8115254
    TA	8062		0.0537470	0.0625000	0.8599524
    TC	9474		0.0631604	0.0625000	1.0105667
    TG	9242		0.0616137	0.0625000	0.9858199
    TT	11188		0.0745872	0.0625000	1.1933946

    The "Other" category in the word counts most likely corresponds to 'N' bases (unknown or masked positions). The sequence contains 8 runs of Ns, each 100 bases long, which accounts for the 800 "Other" bases. The overall GC content of this genome segment is 47.45%, which is comparable to the GC content of the whole maize genome (47%) [1]. This suggests that the region may have a roughly average gene density.
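    The word statistics and the GC figure can be reproduced with a few lines of Python. This is a minimal sketch, not the tool that produced the tables: the function name word_stats is hypothetical, and the expected frequency is assumed to be uniform (0.25 to the power of the word size), as in the "Exp Frequency" column.

```python
from collections import Counter

def word_stats(seq, k):
    """Count overlapping words of length k and compare with a uniform expectation.

    Returns {word: (count, observed_frequency, observed / expected)} for words
    made only of A, C, G, T. Words containing other symbols (e.g. N) still
    count toward the total, as the "Other" row does in the tables above.
    """
    seq = seq.upper()
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    expected = 0.25 ** k  # assumption: every word is equally likely
    return {
        word: (n, n / total, (n / total) / expected)
        for word, n in counts.items()
        if set(word) <= set("ACGT")
    }

# Single-base counts reported in section 2.1
counts = {"A": 40058, "C": 36315, "G": 34857, "T": 37970, "Other": 800}
length = sum(counts.values())
gc_percent = 100 * (counts["G"] + counts["C"]) / length
print(length, round(gc_percent, 2))  # 150000 47.45
```

    Note that the 47.45% figure uses the full length, including the 800 N positions, as the denominator.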

    2.2. Overall alignment (BLASTn)

    To identify the origin of the partial genome sequence and to get an idea of where coding regions may be located, we ran BLASTn. Searching with both the raw and the masked sequence returned the greatest number of matches from the Zea mays database (shown below). Most of the repetitive elements could also be identified through this database (see 3.1). (for more BLAST results)

    Blastn-masked.bmp

    Alignments Table

     

    2.3. Repeat masking

    RepeatMasker was used to mask repetitive sequence elements so that coding regions could be identified more easily. The maize genome is known for its large transposon and repeat content [2] (see also: http://www.maizesequence.org/Zea_mays). About 93 kbp of the maize region was masked as repetitive, of which around 90% are retroelements and DNA transposons.
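    To put the masked amount in proportion: with the 150,000-base sequence length from section 2.1, about 93 kbp corresponds to roughly 62% of the region. The sketch below shows one way to count newly masked bases by comparing the raw and masked sequences. It assumes masking with N (RepeatMasker's default hard-masking character) and sequences already loaded as strings; the function name is hypothetical.

```python
def repeat_masked_bases(raw_seq, masked_seq):
    """Count positions that are N in the masked sequence but not in the raw one.

    Positions that were already N in the raw sequence are not counted,
    so pre-existing gaps do not inflate the repeat estimate.
    """
    if len(raw_seq) != len(masked_seq):
        raise ValueError("raw and masked sequences must have the same length")
    return sum(
        1
        for raw_base, masked_base in zip(raw_seq.upper(), masked_seq.upper())
        if masked_base == "N" and raw_base != "N"
    )

# Toy example: one pre-existing N and two newly masked bases
print(repeat_masked_bases("ACGTNACGT", "ACNNNACGT"))  # 2

# Approximate masked fraction from the figures quoted in this report
print(round(93_000 / 150_000, 2))  # 0.62
```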

    Repeatmasker_Seq4.png

    For more details: Overall Report

    Masked Sequence

    Annotation

    Alignment

     

    3. Gene structure prediction

    3.1. Augustus

    Augustus.png 

    3.2. CpG islands finding

    EMBOSS OUTPUT

    cpgplot.1.png

     

    3.3. Gene prediction using FGENESH and GeneMark

    3.3.1. Using raw sequence

    • FGENESH

    Number of predicted genes 22: in +chain 15, in -chain 7.

    Number of predicted exons 97: in +chain 67, in -chain 30.

    • GeneMark

    Number of predicted genes 26: in +chain 15, in -chain 11.

    Number of predicted exons 135: in +chain 86, in -chain 49.
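    The per-strand totals can be cross-checked against the per-gene listing. The sketch below tallies the GeneMark raw-sequence genes, using the strand and exon count of each gene as listed in the GeneMark columns of the table in section 3.3.4. The function name tally is hypothetical.

```python
from collections import defaultdict

# (strand, number of exons) for GeneMark genes 1..26 on the raw sequence,
# copied from the GeneMark columns of the table in section 3.3.4
GENEMARK_RAW = [
    ("-", 10), ("+", 2), ("+", 8), ("-", 2), ("+", 5), ("+", 7), ("-", 8),
    ("+", 10), ("+", 2), ("-", 6), ("-", 2), ("-", 3), ("+", 7), ("+", 4),
    ("-", 3), ("-", 4), ("+", 13), ("+", 6), ("+", 2), ("-", 6), ("+", 3),
    ("-", 2), ("+", 10), ("+", 4), ("-", 3), ("+", 3),
]

def tally(genes):
    """Return {strand: (number of genes, number of exons)}."""
    totals = defaultdict(lambda: [0, 0])
    for strand, n_exons in genes:
        totals[strand][0] += 1
        totals[strand][1] += n_exons
    return {strand: tuple(pair) for strand, pair in totals.items()}

print(tally(GENEMARK_RAW))  # {'-': (11, 49), '+': (15, 86)}
```

    The result agrees with the GeneMark summary: 26 genes (15 on the + strand, 11 on the - strand) and 135 exons (86 and 49).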

    • Comparison

    GeneMarkVsFGENESH_raw.jpg

    3.3.2. Using masked sequence

    • FGENESH: Gene number decreases from 22 to 5 compared with the prediction using the unmasked sequence.

    Number of predicted genes 5: in +chain 4, in -chain 1.

    Number of predicted exons 31: in +chain 25, in -chain 6.

    • GeneMark: Gene number decreases from 26 to 9 compared with the prediction using the unmasked sequence.

    Number of predicted genes 9: in +chain 5, in -chain 4.

    Number of predicted exons 42: in +chain 27, in -chain 15.

    • Comparison

     GeneMarkVsFGENESH_masked.jpg

    For more details: GeneMark-comparison.xlsx

    3.3.3. Prediction comparison with raw and masked sequence 

    GeneMark results were used to compare the gene predictions made from the raw and the masked sequences. Most exons predicted from the masked sequence have exact matches in the raw-sequence prediction, meaning that masking did not substantially rearrange the predicted gene structures. However, the second gene in the masked-sequence prediction overlaps three gene regions of the raw-sequence prediction. Excessive masking may be responsible for this outcome.

    Raw sequence                                         | Masked sequence
    Gene  Exon  Strand  Type      Begin   End     Length | Gene  Exon  Strand  Type      Begin   End     Length
    1     10    -       Internal  451     538     88     | 1     2     -       Terminal  474     538     65
    1     9     -       Internal  727     1596    870    | 1     1     -       Initial   727     991     265
    1     8     -       Internal  2038    2103    66     |
    1     7     -       Internal  2298    2381    84     |
    1     6     -       Internal  3466    3535    70     |
    1     5     -       Internal  4127    4167    41     |
    1     4     -       Internal  4285    4329    45     |
    1     3     -       Internal  4436    4597    162    | 2     1     +       Initial   4455    4529    75
    1     2     -       Internal  4690    4765    76     |
    1     1     -       Initial   4855    4929    75     |

    2     1     +       Initial   5009    5206    198    | 2     2     +       Internal  4997    5206    210
    2     2     +       Terminal  6116    6163    48     |

    3     1     +       Initial   6215    6571    357    |
    3     2     +       Internal  7973    8122    150    | 2     3     +       Internal  7973    8122    150
    3     3     +       Internal  8217    8351    135    | 2     4     +       Internal  8217    8351    135
    3     4     +       Internal  8444    8509    66     | 2     5     +       Internal  8444    8509    66
    3     5     +       Internal  8683    8748    66     | 2     6     +       Internal  8683    8748    66
    3     6     +       Internal  8896    8985    90     | 2     7     +       Internal  8896    8985    90
    3     7     +       Internal  9934    10015   82     | 2     8     +       Internal  9934    10015   82
    3     8     +       Terminal  11089   11288   200    | 2     9     +       Terminal  11089   11288   200

    4     2     -       Terminal  15354   15479   126    | 3     2     -       Terminal  15354   15479   126
    4     1     -       Initial   15574   15669   96     | 3     1     -       Initial   15574   15669   96

    5     1     +       Initial   15754   15906   153    | 4     1     +       Initial   15754   15906   153
    5     2     +       Internal  16006   16225   220    | 4     2     +       Internal  16006   16225   220
    5     3     +       Internal  16321   16440   120    | 4     3     +       Internal  16321   16440   120
    5     4     +       Internal  16530   16674   145    | 4     4     +       Internal  16530   16674   145
    5     5     +       Terminal  18281   18725   445    | 4     5     +       Terminal  18281   18725   445

    6     1     +       Initial   22626   22905   280    | 5     1     +       Initial   22626   22905   280
    6     2     +       Internal  23005   23066   62     | 5     2     +       Internal  23005   23066   62
    6     3     +       Internal  23155   23298   144    | 5     3     +       Internal  23155   23298   144
    6     4     +       Internal  23515   23568   54     | 5     4     +       Internal  23515   23568   54
    6     5     +       Internal  23709   23840   132    | 5     5     +       Internal  23709   23840   132
    6     6     +       Internal  25505   25613   109    | 5     6     +       Internal  25505   25613   109
    6     7     +       Terminal  25699   25964   266    | 5     7     +       Terminal  25699   25964   266

    7     8     -       Terminal  29854   29970   117    | 6     9     -       Terminal  29854   29970   117
    7     7     -       Internal  30045   30188   144    | 6     8     -       Internal  30045   30188   144
    7     6     -       Internal  30657   30727   71     | 6     7     -       Internal  30657   30727   71
    7     5     -       Internal  30825   30924   100    | 6     6     -       Internal  30825   30924   100
    7     4     -       Internal  32127   32194   68     | 6     5     -       Internal  32127   32194   68
    7     3     -       Internal  32338   32377   40     | 6     4     -       Internal  32338   32377   40
    7     2     -       Internal  32469   32558   90     | 6     3     -       Internal  32469   32558   90
    7     1     -       Initial   33147   33353   207    | 6     2     -       Internal  33147   33204   58
                                                         | 6     1     -       Initial   35611   35618   8

    14    2     +       Internal  70473   70569   97     | 7     1     +       Initial   70476   70569   94
    14    3     +       Internal  70899   71186   288    | 7     2     +       Internal  70899   71186   288
    14    4     +       Terminal  71278   71381   104    | 7     3     +       Terminal  71278   71381   104

    21    1     +       Initial   120882  120936  55     | 8     1     +       Initial   120882  120936  55
    21    2     +       Internal  121843  121913  71     | 8     2     +       Internal  121843  121913  71
    21    3     +       Terminal  122030  122137  108    | 8     3     +       Terminal  122030  122137  108

    25    3     -       Terminal  145753  145848  96     | 9     2     -       Terminal  145753  145848  96
    25    2     -       Internal  145957  146310  354    | 9     1     -       Internal  145957  146310  354
    25    1     -       Initial   146628  146687  60     |
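    The exon-by-exon comparison in the table above can be automated. The following is a minimal sketch, not the procedure actually used for this page: it labels each masked-sequence exon as an exact match, a same-strand overlap, or unmatched relative to the raw-sequence exons. The function name and the labels are hypothetical; the coordinates are a few rows copied from the table.

```python
def classify_exon(exon, reference_exons):
    """Classify one exon (begin, end, strand) against a list of reference exons."""
    begin, end, strand = exon
    if exon in reference_exons:
        return "exact"
    for ref_begin, ref_end, ref_strand in reference_exons:
        # Closed intervals overlap when each one starts before the other ends
        if ref_strand == strand and ref_begin <= end and begin <= ref_end:
            return "overlap"
    return "unmatched"

# A few rows from the table above (raw-sequence genes 1 and 4)
raw_exons = [
    (451, 538, "-"), (727, 1596, "-"), (4436, 4597, "-"),
    (15354, 15479, "-"), (15574, 15669, "-"),
]
# Masked-sequence genes 1, 2 (first exon) and 3
masked_exons = [
    (474, 538, "-"), (727, 991, "-"), (4455, 4529, "+"),
    (15354, 15479, "-"), (15574, 15669, "-"),
]

labels = [classify_exon(exon, raw_exons) for exon in masked_exons]
print(labels)  # ['overlap', 'overlap', 'unmatched', 'exact', 'exact']
```

    The "unmatched" exon (4455..4529 on the + strand) lies inside a raw-sequence exon predicted on the - strand, which illustrates how the second masked-sequence gene cuts across the raw-sequence gene regions.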

     

    For more details: Prediction comparison between raw and masked sequence

    3.3.4. CpG Islands within Predicted Gene Models

    For each gene model predicted by GeneMark and FGENESH, a CpG island can be found in or near the upstream region. This provides additional evidence that the predicted loci are real genes.

     

    GeneMark                                      | Upstream CpG islands nearby | FGENESH
    Gene  Strand  No. exons  Predicted gene range |                             | Gene  Strand  No. exons  Predicted gene range
    1     -       10         451       4929       | 4726..5120(395)             |
    2     +       2          5009      6163       | 4726..5120(395)             | 1     +       1          5009      5257
    3     +       8          6215      11288      | 6286..6581(296)             | 2     +       10         6215      11288
                                                  | 11902..12246(345)           | 3     +       3          13954     15109
    4     -       2          15354     15669      | 15319..16244(926)           |
    5     +       5          15754     18725      | 15319..16244(926)           | 4     +       6          16066     19594
    6     +       7          22626     25964      | 22497..23069(573)           | 5     +       10         22626     25964
    7     -       8          29854     33353      | 34099..34467(369)           | 6     -       6          30033     33353
    8     +       10         38639     50348      | 36861..37695(835)           | 7     +       3          38639     44365
                                                  | 46411..46615(205)           | 8     +       3          47097     50348
    9     +       2          50412     51082      | 46708..50914(4207)          |
    10    -       6          51304     54659      | 55114..56093(980)           | 9     -       5          51222     54659
    11    -       2          55305     55641      | 56164..57366(1203)          |
    12    -       3          56886     61345      | 61171..61838(668)           |
    13    +       7          61514     63105      | 60700..60991(292)           | 10    +       4          61514     63105
    14    +       4          63118     71381      | 62875..63241(367)           | 11    +       1          70820     71194
    15    -       3          79830     80584      | 80982..81206(225)           | 12    -       2          79894     80584
    16    -       4          82976     86268      | 88438..89582(1145)          | 13    -       3          82976     87348
    17    +       13         88403     94685      | 85055..85260(206)           | 14    +       2          88555     90387
                                                  | 90667..91514(848)           | 15    +       1          92074     92571
    18    +       6          94705     98592      | 94145..94947(803)           | 16    +       7          95032     100309
    19    +       2          98656     100309     | 97642..98796(1155)          |
    20    -       6          103950    109551     | 110163..110670(508)         | 17    -       8          103322    109551
    21    +       3          120882    122137     | 120175..120527(353)         |
    22    -       2          128114    128354     | 131824..132315(492)         | 18    -       1          127489    128463
    23    +       10         128544    136397     | 126018..126486(469)         | 19    +       11         128870    137712
    24    +       4          138998    142322     | 137923..138271(349)         | 20    +       4          138998    142541
                                                  | 144019..144232(214)         | 21    -       5          143031    144025
    25    -       3          145753    146687     | 147264..147667(404)         |
    26    +       3          148038    149864     | 147688..147920(233)         | 22    +       1          148613    149864

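    Note: pay attention to the strand. For genes on the forward (+) strand the upstream region lies before the start of the predicted gene range; for genes on the reverse (-) strand it lies beyond the end of the range, at higher coordinates.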
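    The strand rule in the note can be made explicit in code. The sketch below computes where a CpG island sits relative to the 5' end of a predicted gene. It is only an illustration under the assumption that the 5' end is the range start for + strand genes and the range end for - strand genes; the function name is hypothetical, and the coordinates are taken from the GeneMark columns of the table above.

```python
def upstream_distance(gene, island):
    """Distance in bp from a CpG island to the 5' end of a gene.

    gene   = (begin, end, strand), island = (begin, end)
    Returns 0 if the island covers the 5' end, a positive number if the
    island lies entirely upstream, and a negative number if it lies
    entirely downstream of the 5' end (inside or past the gene start).
    """
    begin, end, strand = gene
    island_begin, island_end = island
    if strand == "+":
        five_prime = begin
        if island_end < five_prime:
            return five_prime - island_end
        if island_begin > five_prime:
            return -(island_begin - five_prime)
    else:
        five_prime = end
        if island_begin > five_prime:
            return island_begin - five_prime
        if island_end < five_prime:
            return -(five_prime - island_end)
    return 0

print(upstream_distance((29854, 33353, "-"), (34099, 34467)))  # 746
print(upstream_distance((22626, 25964, "+"), (22497, 23069)))  # 0
print(upstream_distance((6215, 11288, "+"), (6286, 6581)))     # -71
```

    For GeneMark gene 7 (- strand) the island starts 746 bp upstream of the 5' end, for gene 6 the island spans the 5' end, and for gene 3 the island begins 71 bp inside the predicted range.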

    4. Gene annotation

    4.1. Raw sequence

    Using BLASTp, the predicted proteins (GeneMark and FGENESH) were searched against the reference proteins and UniProtKB/Swiss-Prot databases. The E-value, the alignment score, and the presence of putative conserved domains were taken into account when identifying matches of interest.
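    The E-value and score screening could be scripted as follows. This is a sketch only: it assumes the hits are available in BLAST+ 12-column tabular format (-outfmt 6), which is not stated on this page, and it uses the 1e-05 significance cutoff quoted in section 4.2. The identifiers in the example are made up, and conserved-domain evidence is not covered.

```python
import csv

# Column order of BLAST+ tabular output (-outfmt 6)
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def best_significant_hits(lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} keeping the top-scoring hit below the cutoff."""
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(FIELDS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue >= max_evalue:
            continue
        query = hit["qseqid"]
        if query not in best or bitscore > best[query][2]:
            best[query] = (hit["sseqid"], evalue, bitscore)
    return best

# Hypothetical hits, for illustration only
example = [
    "pred_A\tsubj_1\t85.0\t200\t30\t0\t1\t200\t1\t200\t3e-18\t95.0",
    "pred_A\tsubj_2\t60.0\t150\t60\t0\t1\t150\t1\t150\t2e-07\t55.0",
    "pred_B\tsubj_3\t40.0\t80\t48\t0\t1\t80\t1\t80\t0.006\t30.0",
]
print(best_significant_hits(example))
```

    In this example only pred_A is retained, with subj_1 as its best hit; pred_B is dropped because its only hit has an E-value above the cutoff.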

    • Protein prediction GeneMark

    For more details: GeneMark

    BLASTp identified 11 genes out of the 26 predicted proteins. The first four predicted genes were also found using the masked sequence. Five predicted proteins, however, did not produce hits with E-values < 1. In six predicted proteins transposable elements or virus-related sequences were found.

    • Protein prediction FGENESH

    For more details: FGENESH

    Eleven possible genes were identified, including predicted proteins 4 and 7, which overlapped with conjectured sequences in the GeneMark analysis. These predictions were also maintained in the masked-sequence analysis. Three proteins did not produce hits with E-values < 1. In eight predicted proteins, transposable elements or virus-related proteins were found.

     

    4.2. Masked sequence

    FGENESH requires "TSS" and "PolA" markers for an intact gene prediction, which likely explains why GeneMark predicts three more genes (as displayed below). However, these additional genes (after base 120,000 of the sequence) are supported by matches from Augustus and BLASTn in the region.

    For the genes on which the two programs agree, we see considerable sequence overlap, especially for the predicted matches to coiled-coil domain-containing protein 94 and LOC100273691 uncharacterized protein. Throughout the comparison of the two gene prediction programs there are exon discrepancies. Each predicted protein needs further in-depth inspection.

    Legend: matched gene; matched exon; bp discrepancy.

    BLASTp hits are listed under each gene row (significant E-value < 1e-05).

    GeneMark                                          | FGENESH
    Gene/exon  Matched gene  Strand  Begin   End      | Gene/exon  Strand  Begin   End
    1                        -       474     991      |
          GeneMark BLASTp: Tafazzin [Medicago truncatula], e-val=7e-04; uncharacterized LOC100217271 [Zea mays], e-val=0.006
    1_1                      -       474     538      |
    1_2                      -       727     991      |

    2          Gene 1        +       4455    11288    | 1          +       5009    11288
          GeneMark BLASTp: coiled-coil domain-containing protein 94 [Zea mays], e-val=3e-18
          FGENESH BLASTp: Same
    2_1                      +       4455    4529     |
    2_2                      +       4997    5206     | 1_1        +       5009    5206
    2_3                      +       7973    8122     | 1_2        +       7973    8122
    2_4                      +       8217    8351     | 1_3        +       8217    8351
    2_5                      +       8444    8509     | 1_4        +       8444    8509
    2_6                      +       8683    8748     | 1_5        +       8683    8748
    2_7                      +       8896    8985     | 1_6        +       8896    8985
                                                      | 1_7        +       9425    9454
                                                      | 1_8        +       9508    9591
    2_8                              9934    10015    | 1_9        +       9934    10014
    2_9                              11089   11288    | 1_10       +       11091   11288

    3                        -       15354   15669    |
          GeneMark BLASTp: uncharacterized protein LOC100501629 [Zea mays], e-val=0.054; mitochondrial substrate carrier family protein ucpB-like [Vitis vinifera], e-val=0.004
    3_1                      -       15354   15479    |
    3_2                      -       15574   15669    |

    4          Gene 2        +       15754   18725    | 2          +       16066   19594
          GeneMark BLASTp: sigma factor sigB regulation protein rsbQ [Zea mays], e-val=1e-102
          FGENESH BLASTp: sigma factor sigB regulation protein rsbQ [Zea mays], e-value=5e-81; mov34/MPN/PAD-1 family protein [Zea mays], e-value=5e-43
    4_1                      +       15754   15906    |
    4_2                      +       16006   16225    | 2_1        +       16066   16224
    4_3                      +       16321   16440    |
    4_4                      +       16530   16674    | 2_2        +       16532   16669
                                                      | 2_3        +       17001   17075
    4_5                      +       18281   18725    | 2_4        +       18282   18629

    5          Gene 3        +       22626   25964    | 3          +       22626   25964
          GeneMark BLASTp: uncharacterized protein LOC100273691 [Zea mays], e-val=5e-147; protein midA, mitochondrial-like [Brachypodium distachyon], e-val=1e-103
          FGENESH BLASTp: Same
    5_1                      +       22626   22905    | 3_1        +       22626   22904
    5_2                      +       23005   23066    | 3_2        +       23007   23066
    5_3                      +       23155   23298    | 3_3        +       23155   23298
    5_4                      +       23515   23568    | 3_4        +       23515   23625
    5_5                      +       23709   23840    | 3_5        +       23709   23903
                                                      | 3_6        +       24827   25015
    5_6                      +       25505   25613    | 3_7        +       25505   25612
    5_7                      +       25699   25964    | 3_8        +       25701   25964

    6          Gene 4        -       29854   35618    | 4                  30033   34733
          GeneMark BLASTp: carboxy-lyase [Zea mays], e-val=6e-131
          FGENESH BLASTp: Same
    6_1                      -       29854   29970    |
    6_2                      -       30045   30188    | 4_1                30033   30188
    6_3                      -       30657   30727    | 4_2                30657   30725
    6_4                      -       30825   30924    | 4_3                30826   30924
    6_5                      -       32127   32194    | 4_4                31834   31863
    6_6                      -       32338   32377    |
    6_7                      -       32469   32558    | 4_5                32463   32558
    6_8                      -       33147   33204    | 4_6                34581   34733
    6_9                      -       35611   35618    |

    7                        +       70476   71381    | 5          +       70820   71194
          GeneMark BLASTp: RIO kinase 1 [Arabidopsis thaliana], e-val=3e-04; High cDNA matches, probably a non-coding RNA region
          FGENESH BLASTp: N/A
    7_1                      +       70476   70569    |
    7_2                      +       70899   71186    | 5_1        +       70820   71194
    7_3                      +       71278   71381    |

    8                        +       120882  122137   |
          GeneMark BLASTp: F-box domain containing protein [Zea mays], e-val=7e-04
    8_1                      +       120882  120936   |
    8_2                      +       121843  121913   |
    8_3                      +       122030  122137   |

    9                        -       145753  146310   |
          GeneMark BLASTp: catalytic/ protein phosphatase type 2C [Zea mays], e-val=3e-32; But only covers half of the real protein
    9_1                      -       145753  145848   |
    9_2                      -       145957  146310   |
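    The exon discrepancies between the two programs can be quantified by pairing overlapping exons and taking the difference at each boundary. The sketch below does this for GeneMark gene 5 and FGENESH gene 3 from the table above. It is an illustration only; the function names are hypothetical and both exon lists are assumed to be on the same strand.

```python
def overlaps(a, b):
    """True if closed intervals a = (begin, end) and b = (begin, end) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]

def compare_exons(genemark, fgenesh):
    """Pair overlapping exons and report boundary shifts (FGENESH minus GeneMark)."""
    paired, used = [], set()
    for gm in genemark:
        for fg in fgenesh:
            if overlaps(gm, fg):
                paired.append((gm, fg, fg[0] - gm[0], fg[1] - gm[1]))
                used.add(fg)
    gm_only = [gm for gm in genemark if not any(overlaps(gm, fg) for fg in fgenesh)]
    fg_only = [fg for fg in fgenesh if fg not in used]
    return paired, gm_only, fg_only

# Exon coordinates from the table above (masked sequence, + strand)
genemark_gene5 = [(22626, 22905), (23005, 23066), (23155, 23298), (23515, 23568),
                  (23709, 23840), (25505, 25613), (25699, 25964)]
fgenesh_gene3 = [(22626, 22904), (23007, 23066), (23155, 23298), (23515, 23625),
                 (23709, 23903), (24827, 25015), (25505, 25612), (25701, 25964)]

paired, gm_only, fg_only = compare_exons(genemark_gene5, fgenesh_gene3)
shifts = [(d_begin, d_end) for _, _, d_begin, d_end in paired]
print(shifts)   # [(0, -1), (2, 0), (0, 0), (0, 57), (0, 63), (0, -1), (2, 0)]
print(fg_only)  # [(24827, 25015)]
```

    Most boundaries differ by only 1 or 2 bp, but two FGENESH exons extend 57 and 63 bp further at the 3' boundary, and FGENESH predicts one extra exon (24827..25015) that GeneMark does not.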

               

     

     

    6. Final presentation

    Presentation slides: Genomic annotation project_Group 4.pdf

    7. References

    [1] Meyers, B.C., Tingey, S.V. and Morgante, M. 2001. Abundance, Distribution, and Transcriptional Activity of Repetitive Elements in the Maize Genome. Genome Res. 11: 1660-1676.

    [2] Liang, C., Mao, L., Ware, D. and Stein, L. 2009. Evidence-based gene predictions in plant genomes. Genome Res. 19: 1912-1923.
