Welcome to Group 3

    Members:

    Anne Brown (brown637@purdue.edu)

    • 2nd year PhD student working on soybeans with Karen Hudson
    • My focus is on flowering time and maturity

    Jieqing Ping (ping@purdue.edu)

    • 3rd year PhD student working on soybean in Dr. Jianxin Ma's lab in the Agronomy Department
    • Research focus: Deciphering Soybean Stem Growth Habit toward Molecular Design Breeding

    Joshua Fitzgerald (fitzger0@purdue.edu)

    • 3rd year PhD student studying wheat breeding under Dr. Herb Ohm in the Agronomy Dept
    • Focus: Reconstitution of Two Fusarium Head Blight Genetic Factors into Homoeologous Chromosomes

    2. Basic Characteristics

           2.1 BLAST Search

    BLAST using the Nucleotide collection (nr/nt)

    blast.png

         Preliminary results for the 150 kb sequence, queried against NCBI's Nucleotide collection, show regions at 1-10 kb, 45-60 kb, 80-90 kb, 110-130 kb and 140-142 kb that are highly similar to sequences from Zea mays. Repeating the same query with Zea mays excluded from the search returned fewer hits, which corroborates our assumption that the unknown sequence is closely related to Zea mays sequences.

     

    Blastn results (excluding Zea mays)

    blast search without Zea mays.png
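
    A region summary such as "1-10 kb, 45-60 kb, ..." can be obtained from a list of BLAST hits by merging overlapping query intervals. The following is a minimal Python sketch of that step, assuming the hits are already available as (query start, query end) pairs. The helper name `merge_intervals` and the coordinates are hypothetical placeholders, not our actual BLAST output.

```python
def merge_intervals(hits, max_gap=0):
    """Merge overlapping (start, end) query intervals into covered regions.

    Intervals separated by at most max_gap bases are joined as well.
    """
    merged = []
    # Normalize each hit so start <= end, then walk them in sorted order.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hits):
        if merged and start <= merged[-1][1] + max_gap:
            # Overlaps (or nearly touches) the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


# Placeholder query coordinates (illustrative only, not real BLAST output).
example_hits = [(1200, 4800), (4500, 9900), (45000, 52000), (51000, 60000)]
regions = merge_intervals(example_hits)
for start, end in regions:
    print(f"{start / 1000:.1f}-{end / 1000:.1f} kb")
```

    Raising `max_gap` joins nearby hits into one region, which is one way to get a coarse summary from many short alignments.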


     

           2.2 Dot Plot Analysis

    dotplot wordsize10.png
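
    For readers unfamiliar with word-based dot plots, the idea behind the figure can be sketched in a few lines of Python: a dot is placed at (i, j) whenever the word starting at position i in one sequence is identical to the word starting at position j in the other. This is only an illustration on a toy sequence, assuming exact word matching with a word size of 10. It is not the program used to draw the figure, and `word_matches` is a hypothetical helper name.

```python
from collections import defaultdict


def word_matches(seq_a, seq_b, word_size=10):
    """Return (i, j) positions where seq_a and seq_b share an identical word."""
    # Index every word of seq_b by its start position.
    index = defaultdict(list)
    for j in range(len(seq_b) - word_size + 1):
        index[seq_b[j:j + word_size]].append(j)
    matches = []
    for i in range(len(seq_a) - word_size + 1):
        for j in index.get(seq_a[i:i + word_size], ()):
            matches.append((i, j))
    return matches


# Toy sequence with one repeated 12-base block (illustrative, not our data).
toy = "ACGTACGGTCAA" + "TTTGGG" + "ACGTACGGTCAA"
dots = word_matches(toy, toy, word_size=10)
# In a self comparison the main diagonal is trivial; repeats show up off it.
off_diagonal = [(i, j) for i, j in dots if i != j]
print(off_diagonal)
```

    In a self comparison, off-diagonal runs of dots mark repeated segments, which is how repeats appear in the plot.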

     

            2.3 CpG Island Search

          Because CpG islands are associated with genes to some degree, we searched for them with cpgplot on the EMBOSS server (http://pro.genomics.purdue.edu/emboss). In the figure, several regions (32K-42K, 48K-53K, 69K-75K, 133K-138K) show higher GC content, which may provide useful supporting evidence for predicted genes.

    EMBOSS Output

    cpgplot.1.png
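
    The quantities plotted for a CpG island search can be approximated with a sliding window. The sketch below is a simplified Python illustration, not the cpgplot implementation. It assumes a 100-base window, a GC threshold of 50% and an observed/expected CpG threshold of 0.6, values commonly quoted as cpgplot defaults but treated here as assumptions. The function names are hypothetical.

```python
def cpg_windows(seq, window=100, step=1):
    """Yield (start, percent_gc, obs_exp) for each window along seq."""
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        c = chunk.count("C")
        g = chunk.count("G")
        cpg = chunk.count("CG")
        percent_gc = 100.0 * (c + g) / window
        # Observed CpG count over the count expected from C and G frequencies.
        obs_exp = (cpg * window) / (c * g) if c and g else 0.0
        yield start, percent_gc, obs_exp


def island_like(seq, window=100, min_gc=50.0, min_oe=0.6):
    """Return the start positions of windows that pass both thresholds."""
    return [s for s, gc, oe in cpg_windows(seq, window)
            if gc > min_gc and oe > min_oe]


# Toy input: a CpG-rich stretch followed by an AT-rich stretch.
toy = "CG" * 60 + "AT" * 60
hits = island_like(toy, window=100)
print(hits[0], hits[-1])
```

    A real island call would also require a minimum run length of passing windows, which this sketch leaves out.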

     

            2.4 Compseq

              Compseq was used to show the base and dinucleotide composition of our sequence. The results indicate a GC content of about 46.7% and show that the CG dinucleotide is markedly under-represented relative to the other dinucleotides (observed/expected 0.70). The AA and TT dinucleotides are over-represented (observed/expected 1.25 and 1.27), which is likely to reflect gene-poor regions.

      # Output from 'compseq'
      #
      # The Expected frequencies are calculated on the (false) assumption that every
      # word has equal frequency.
      #
      # The input sequences are:
      #	sequence_3
      
      # Word	Obs Count	Obs Frequency	Exp Frequency	Obs/Exp Frequency
      #
      A	38964		0.2597600	0.2500000	1.0390400
      C	34582		0.2305467	0.2500000	0.9221867
      G	35405		0.2360333	0.2500000	0.9441333
      T	39349		0.2623267	0.2500000	1.0493067
      Other	1700		0.0113333	0.0000000	10000000000.0000000
      
      #
      # Word	Obs Count	Obs Frequency	Exp Frequency	Obs/Exp Frequency
      #
      AA	11700		0.0780005	0.0625000	1.2480083
      AC	8082		0.0538804	0.0625000	0.8620857
      AG	9451		0.0630071	0.0625000	1.0081134
      AT	9729		0.0648604	0.0625000	1.0377669
      CA	9758		0.0650538	0.0625000	1.0408603
      CC	8580		0.0572004	0.0625000	0.9152061
      CG	6549		0.0436603	0.0625000	0.6985647
      CT	9692		0.0646138	0.0625000	1.0338202
      GA	9645		0.0643004	0.0625000	1.0288069
      GC	8632		0.0575471	0.0625000	0.9207528
      GG	9078		0.0605204	0.0625000	0.9683265
      GT	8044		0.0536270	0.0625000	0.8580324
      TA	7857		0.0523803	0.0625000	0.8380856
      TC	9287		0.0619137	0.0625000	0.9906199
      TG	10320		0.0688005	0.0625000	1.1008073
      TT	11878		0.0791872	0.0625000	1.2669951
      Other	1717		0.0114467	0.0000000	10000000000.0000000
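
    As the compseq header notes, its expected frequencies assume that every word is equally likely. A complementary check is to normalize each dinucleotide by the observed base composition instead. The Python sketch below does this with the counts and frequencies copied from the table above. The variable and function names are our own, and only four dinucleotides are included for brevity.

```python
# Counts and frequencies copied from the compseq output above.
mono = {"A": 38964, "C": 34582, "G": 35405, "T": 39349, "Other": 1700}
di_freq = {"CG": 0.0436603, "GC": 0.0575471, "AA": 0.0780005, "TT": 0.0791872}

total = sum(mono.values())
freq = {base: count / total for base, count in mono.items()}

# GC content as a fraction of all positions, ambiguous bases included.
gc_content = freq["C"] + freq["G"]


def obs_exp(word):
    """Observed dinucleotide frequency over the product of its base frequencies."""
    return di_freq[word] / (freq[word[0]] * freq[word[1]])


print(f"GC content: {gc_content:.1%}")
for word in di_freq:
    print(word, round(obs_exp(word), 2))
```

    Under this normalization CG is still depleted relative to GC, so the CG deficit is not simply a consequence of the overall base composition.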

    3. Repeat Sequence analysis

        To reduce redundancy and to facilitate gene identification in the DNA sequence, RepeatMasker was run with maize as the query species. About 58.8% of the sequence is composed of transposable elements (TEs), most of which are long terminal repeat (LTR) retrotransposons. Compared with the nearly 85% TE content of the maize genome, this lower fraction is consistent with the sequence coming from near the end of a chromosome, where gene density is higher. In the maize genome, Gypsy-like retrotransposons are nearly twice as abundant as Copia-like retrotransposons. In our sequence, however, Copia-like retrotransposons (34.7%) are more abundant than Gypsy-like retrotransposons (19.8%), which is consistent with the expectation that Copia-like elements are enriched and Gypsy-like elements are relatively depleted in euchromatic regions. (1.RepeatMasker Summary 2.RepeatMasker Sequence 3.Detailed Results )
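
    Per-family percentages like those quoted above can be recomputed from the RepeatMasker .out annotation table. The following Python sketch assumes the usual whitespace-separated .out layout, with query begin and end in columns 6 and 7 and the repeat class/family in column 11. The two rows in the example are made up to show the layout and are not taken from seq3.fa.out. The sequence length of 150000 is taken from the compseq total, and `repeat_fractions` is a hypothetical helper name.

```python
from collections import Counter


def repeat_fractions(out_text, seq_length):
    """Sum annotated bases per repeat class/family from RepeatMasker .out text.

    Overlapping annotations are counted twice, so treat totals as approximate.
    """
    bases = Counter()
    for line in out_text.splitlines():
        fields = line.split()
        # Data rows start with an integer alignment score; skip header lines.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        bases[fields[10]] += end - begin + 1
    return {family: count / seq_length for family, count in bases.items()}


# Two made-up rows in the .out column layout (not taken from seq3.fa.out).
example = """\
 1234 10.0 1.0 0.5 sequence_3 101 1100 (148900) + ELEM_A LTR/Copia 1 1000 (0) 1
 2345 12.0 0.0 0.0 sequence_3 2001 2500 (147500) C ELEM_B LTR/Gypsy (0) 500 1 2
"""
fractions = repeat_fractions(example, seq_length=150000)
print(fractions)
```

    The summary table written by RepeatMasker (the .tbl file) remains the authoritative source for the reported percentages; this sketch is only a cross-check.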

    4. Gene Predictions

              4.1 FGENESH Results (1.FGENESH results using unmasked sequence 2.FGENESH results using repeat-masked sequence)

        FGENESH was run at the website (http://linux1.softberry.com/berry.ph...subgroup=gfind) on both the unmasked sequence and the repeat-masked sequence, with monocot plants selected as the reference. The predicted genes were then compared using NCBI Standard Protein BLAST (blastp) and Translated BLAST (blastx). Most of the genes predicted in the unmasked sequence turned out to be retrotransposons. Prediction on the unmasked sequence is time consuming, but some potential candidate genes may be missed if only the repeat-masked sequence is used. The choice between unmasked and masked sequence for gene prediction therefore depends on the available time and the purpose of the research.

     

    FGENESH Comparison.JPG

     

            4.2 GeneMark-ES Results (1.GeneMark results using unmasked sequence  2.GeneMark results using repeat-masked sequence)

        GeneMark.hmm was run at the website (http://exon.gatech.edu/eukhmm.cgi) against three different genomes: two plant species (Arabidopsis thaliana and Zea mays) and one bird (Gallus gallus, chicken).

          The genes predicted against the maize genome were compared with those predicted against Arabidopsis thaliana and Gallus gallus. This table helps us judge which predicted genes are likely to be real and which are not. Table

          In addition, the genes predicted from the unmasked sequence were compared with those predicted from the repeat-masked sequence.

     

    GeneMark Comparison.JPG

     

         A preliminary investigation into the homology and function of the predicted genes was carried out through BLAST queries of the translated nucleotide (blastx) and protein (blastp) sequences produced by the gene prediction programs above. The queries from the FGENESH and GeneMark results indicate that several of the predicted genes could be true genes, based on their homology to genes in many closely related species, while many of the genes predicted from the unmasked sequence are retrotransposons. Genes predicted by both software packages were aligned based on the similarity of their initial and final exon locations, for both the masked and the unmasked sequence. (Predicted Genes Summary from BLAST Queries)

    Results from the Unmasked sequence

    Unmasked FGENESH vs GeneMark.JPG

     

    Results from the masked sequence

    gene comparison.png

     

           To find the real genes that researchers may be interested in, we carried out a deeper analysis of the 10 genes from FGENESH and the 13 genes from GeneMark predicted on the repeat-masked sequence. By comparing the start position of the first exon and the end position of the last exon, we found that some genes (10-5 & 13-7; 10-6 & 13-8; 10-7 & 13-9; 10-9 & 13-11) were predicted identically by both programs, while the others share most of their exons, except for two genes (13-3, 13-13) predicted only by GeneMark. These two are believed to be incorrect predictions and can be ruled out based on the low coverage and poor E-values of their blastp results. Because FGENESH and GeneMark give nearly the same predictions for our sequence, we use the FGENESH gene IDs. Finally, the most likely true genes (10-1 and 10-6, highlighted in yellow in the table) were selected based on homology to known maize proteins in the protein database, significant (low) E-values, similarity and coverage, and, most importantly, strong EST and RNAseq support. Genes 10-2 and Gene10-1.1 (in red) matched retrotransposons or components of these elements and were immediately discarded as gene candidates. The other genes, highlighted in green, were predicted by both programs and have significant E-values for their protein matches, mostly to uncharacterized or hypothetical proteins, but their EST support has low coverage and similarity. Experimental approaches such as real-time PCR analysis could be used to test whether these genes are actually expressed in maize and so provide more convincing evidence.
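
    The comparison by first-exon start and last-exon end described above can be sketched in Python as follows. This is a minimal illustration, assuming each prediction is stored as a list of (start, end) exon coordinates keyed by gene ID. The gene IDs, coordinates and function names are hypothetical placeholders, not our FGENESH or GeneMark results.

```python
def gene_span(exons):
    """Return (start of first exon, end of last exon) for a list of exons."""
    return min(s for s, _ in exons), max(e for _, e in exons)


def match_predictions(genes_a, genes_b):
    """Pair gene IDs whose overall spans are identical in both predictions."""
    spans_b = {gene_span(exons): gene_id for gene_id, exons in genes_b.items()}
    pairs = []
    for gene_id, exons in genes_a.items():
        partner = spans_b.get(gene_span(exons))
        if partner is not None:
            pairs.append((gene_id, partner))
    return pairs


# Placeholder exon coordinates (hypothetical, not our predicted genes).
fgenesh = {"F1": [(100, 250), (400, 900)], "F2": [(5000, 5600)]}
genemark = {"G1": [(100, 300), (420, 900)], "G2": [(7000, 7400)]}
pairs = match_predictions(fgenesh, genemark)
print(pairs)
```

    Matching on the overall span tolerates differences in internal exon boundaries, which is why genes that share only most of their exons can still be paired.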

           In addition, the amino acid sequences predicted from both the masked (results shown below) and unmasked sequences were queried against the protein database at NCBI to give a preliminary indication of function and homology to known protein sequences. Our predicted gene10-1 matched serine/threonine phosphatase 6 regulatory subunit 3-like in Brachypodium distachyon and SIT4 phosphatase activity in Arabidopsis. Gene10-6 showed high homology to histone 2B binding protein in barley, wheat, maize, and Arabidopsis.

    BLAST Hits Masked Sequence.JPG

    5. Analysis of the predicted genes

     

          Blast2Go was used to retrieve the GO terms for the two genes shown to have EST support. Detailed information about these two genes can be found on the separate pages Gene10-1 and Gene10-6.

    Blast2Go results on the two confirmed genes

    functionb2g_stat_20121203_1802.png

    cell5gene2b2g_stat_20121203_1749.png

    gene2rocess6b2g_stat_20121203_1749.png

    So I ran our sequence using http://exon.biology.gatech.edu/eukhmm.cgi. This website lets us predict where the exons are in the sequence and which genes they belong to. I attached the results here as well.
    Posted 14:29, 1 Oct 2012