Further Analysis of Predicted Genes

    I conducted further analysis on the genes that were colored light orange and dark orange on the main annotation page.  First, the predicted protein (from both FGENESH and Genemark, if applicable) was entered into InterProScan.  For the light orange genes, the goal was to see whether the genes could be real.  For the dark orange genes, the goal was to confirm what was found in the blastp search.  For genes predicted by both FGENESH and Genemark, I also wanted to determine which prediction was more accurate.  Second, I ran blastx for the light orange genes in case their poor performance with blastp was due to inaccurate splice predictions.  Third, I ran blastn against the EST database (excluding mouse and human) to validate the splice sites and further confirm the gene.  These three steps make up the general approach I took; the additional tests I ran on certain genes are documented below.

    Gene 200a

    No hits were returned with InterProScan.

    Blastx with 0-3000 of sequence 200 returned a 7e-13 hit with Far1-related sequence 5 in Arabidopsis and a 6e-7 hit with far-red elongated hypocotyl 3, the top hit from the original blastp search.  Both proteins are transcription factors belonging to the FAR1 family.  The lower E values suggest that the original protein predicted by FGENESH might not have been optimal.  Blasting the gene against the cDNA database gave a strong match to maize (4e-43), but the alignment covered only a portion of gene 200a, as shown below.

    Screen shot 2010-12-10 at 12.30.17 AM.png

    The 200a protein, at 124 aa long, is substantially shorter than Far1-related sequence 5 and far-red elongated hypocotyl 3, which are 839 and 788 aa long respectively.  Given that gene 200a runs from 1081 to 27 on sequence 200, ending almost at the very start of the segment, I think that this gene may have been truncated when the original chromosome segment was divided into 400kb segments for this project.  In support of this hypothesis, gene 200a is the only gene predicted by FGENESH that does not have a poly A tail.  This part of the chromosome should be re-examined using the original undivided DNA sequence.  Another possibility is that this is a pseudogene.

    Gene 200b

    I think genes 200b, 200f and 200g could be paralogs because they all return similar results for blastp and protein domain searches.

    InterProScan returned 3.5e-75 hits to phosphoenolpyruvate dikinase-related (PTHR22931) and chloroplast alpha-glucan water dikinase (PTHR22931:SF2).  This corroborates the original blast search.

    Blasting the predicted gene against the EST database returned strong (E value = 0), almost end-to-end alignments with switchgrass and sorghum (shown below), so this gene prediction and the predicted splice sites are most likely correct.

    Screen shot 2010-12-10 at 12.17.22 AM.png

     

    Running ClustalW with gene 200b and the FGENESH prediction of gene 200g did seem to give significant alignment.  See the attached alignment file.

    Gene 200f

    InterProScan returned 6.6e-10 matches to phosphoenolpyruvate dikinase-related (PTHR22931) and chloroplast alpha-glucan water dikinase (PTHR22931:SF2). 

    I ran blastx with 38-47kb of sequence 200 and got hits to phosphoglucan again, but the E value was about the same as before, at e-15.  Furthermore, only a small piece aligned, as shown below:

    Screen shot 2010-12-10 at 12.56.16 AM.png

    Running the nucleotide sequence of gene 200f with blastn against the cDNA database gave an e-30 match to maize, but I'm not sure how much can be read into this because the alignment is short.  It is interesting that all the alignments were to the end of gene 200f, because these ESTs were generated from the poly A tail end.  However, the top two ESTs in the diagram are both 560 nt long, far longer than the portion that aligns, so those ESTs probably do not correspond to gene 200f.  Furthermore, I suspect that gene 200f overlaps with the FGENESH prediction of gene 200g.

    Screen shot 2010-12-10 at 1.40.20 AM.png

     

    Gene 200f seemed to have significant alignments with both gene 200b (result) and the FGENESH prediction of gene 200g (result), but probably more so with 200b.  Alignment of all three simultaneously also showed conservation among the three sequences, but gene 200g has a lot of DNA in between the areas of alignment (see file here).  Since Genemark does not seem to provide mRNA sequences, I had to manually remove the introns from that segment of DNA.  The processed mRNA sequence of gene 200f can be found here.
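
    The manual intron removal described above can be sketched in a few lines of Python.  This is only an illustration, assuming the exon coordinates are 1-based, inclusive and on the forward strand; the function name splice_exons and the toy sequence and coordinates are hypothetical and are not the actual gene 200f data.

```python
def splice_exons(genomic, exons):
    """Concatenate exon segments, dropping the introns between them.

    genomic: DNA string for the region containing the gene.
    exons:   list of (start, end) pairs, assumed 1-based and inclusive.
    """
    pieces = []
    for start, end in sorted(exons):
        if not 1 <= start <= end <= len(genomic):
            raise ValueError(f"exon {start}-{end} is outside the sequence")
        pieces.append(genomic[start - 1:end])  # convert to a 0-based slice
    return "".join(pieces)


# Toy example (hypothetical data): two exons separated by a GT...AG intron.
toy = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
mrna = splice_exons(toy, [(1, 6), (19, 24)])
print(mrna)  # ATGGCCGATTAA
```

    A gene on the reverse strand would additionally need the result reverse-complemented, which this sketch does not do.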

    Gene 200g

    InterProScan returned pyruvate phosphate dikinase, PEP/pyruvate-binding domain (PF01326) for both Genemark and FGENESH predictions.  The E value for the FGENESH prediction was 3.8e-62.  The Genemark prediction had two segments matching up with E values of 7e-18 and 1.9e-17.  This difference in E value corresponds to the difference in E value found in the initial blastp search.  The FGENESH prediction is probably more accurate than the Genemark prediction.
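
    As a rough illustration of the comparison above, the snippet below takes the best (smallest) E value for each prediction and reports how many orders of magnitude separate them.  The E values are the ones quoted in this section; the variable names are my own, and ranking the two predictions by E value alone is a simplification.

```python
import math

# Domain E values reported by InterProScan for gene 200g (from the text above).
evalues = {
    "FGENESH": [3.8e-62],
    "Genemark": [7e-18, 1.9e-17],
}

# A smaller E value means the match is less likely to arise by chance,
# so take each prediction's best (minimum) E value.
best = {name: min(vals) for name, vals in evalues.items()}
winner = min(best, key=best.get)

# Difference in orders of magnitude between the two best hits.
# Note: log10 fails for an E value of exactly 0, which would need
# special handling.
gap = math.log10(best["Genemark"]) - math.log10(best["FGENESH"])

print(winner)         # FGENESH
print(round(gap, 1))  # orders of magnitude between the best hits
```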

    Blastn of the FGENESH prediction against cDNA suggested an intron was missed by the gene prediction.  Note the split match to the same cDNA accession from ~1 to ~750.

    Screen shot 2010-12-10 at 12.49.02 PM.png

    Inspecting the gap within that 1-750 segment indeed revealed a potential intron, as there was a GT at the 5' end and an AG at the 3' end:

    GGTTCAGAACGACATCAACCCTCGTGTGCTCCACTTCCTCCTCCTAGTGTACTATCGTCC
    GCCCCTGCCTCCGCCACCTGCGTTGGCAACCCTGGCACATGGGTCCCCTTCCATCTGCCC
    TCCCGATTATGGCTCCACTACTGCTAATCTTTCCGATCACGAGACGACTGGTCCTCCTCC
    CCACGACATTGTTGCTGCTCATTCTACGCTCGCCGCTGGCCTCGCGGTCGCTAATGAGCG
    AGTCGTGAACCTCACTTGGGAGCAAGAAGGCTTTCTGGGTGCCTTGTTCGGTGTGGTATC
    TGGTGCCACACCGGACACAACAGT
    

    Below is a comparison of the blastx results for the FGENESH gene prediction before (left) and after (right) removal of the intron (the sequence after intron removal can be found here).

    Screen shot 2010-12-10 at 4.07.26 PM.png    Screen shot 2010-12-10 at 4.07.19 PM.png

    The top hit both before and after intron removal was phosphoglucan, water dikinase, with an E value of 0, but query coverage increased from 81% to 96%.
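
    For readers unfamiliar with the term, query coverage is the percentage of query positions that fall inside at least one alignment.  The sketch below shows one way to compute it by merging overlapping alignment intervals; the function name and the example intervals are hypothetical and do not come from the blastx runs above.

```python
def query_coverage(hsps, query_len):
    """Percent of query positions covered by at least one alignment.

    hsps: list of (start, end) query coordinates, 1-based inclusive.
    Overlapping or adjacent alignments are merged so that no position
    is counted twice.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(hsps):
        if cur_end is None or start > cur_end + 1:
            # Disjoint from the running interval: close it, start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / query_len


# Hypothetical 200-position query.
gapped = query_coverage([(1, 50), (101, 150)], 200)      # unaligned stretch in the middle
overlapping = query_coverage([(1, 50), (41, 120)], 200)  # overlap counted once
print(gapped, overlapping)  # 50.0 60.0
```

    Removing an unaligned intron shortens the query without shortening the aligned portion, which is one reason coverage can rise after intron removal.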

    Gene 200m

    There were no InterPro hits for either the Genemark prediction or the FGENESH prediction.

    I ran blastx with 277-286kb to see if I could get something with a better E value.  Unfortunately, the best hit had an E value of only 0.10.


    I also ran both the Genemark and FGENESH predictions in Phyre (Protein structure prediction on the web: a case study using the Phyre server. Kelley LA & Sternberg MJE Nature Protocols. 4, 363 - 371 (2009)) to try to match the gene predictions to something based on protein folding.  The best hit for the FGENESH prediction was a telomere binding protein subunit, but it had a poor E value of 4.9.  Likewise, the best hit for the Genemark prediction was also a telomere binding subunit, and its E value was also high, at 0.98.

    On the other hand, blasting the FGENESH prediction against a cDNA database returned e-65 and e-56 hits with maize.  The alignment diagram also suggests that the gene prediction might contain an intron.

    Screen shot 2010-12-10 at 2.01.03 AM.png

    Thus, I ran NetPlantGene with the FGENESH gene prediction, but it did not reveal any splice locations.

    The gap shown in the diagram corresponds to the following sequence:

    GCTGCGTGCCATGGGCGAGAGAGCACCGAGCAGAGGGAGCTGGGCGCGCCATGGGAGGAA
    AGTTGGAGCGCCGGCAGAGGGCAGAAGGAG
    

    There is an AG at the 3' end, and although the first GT is 5 nt in from the 5' end, this could still be an intron.  I removed the potential intron from the FGENESH gene prediction and ran blastx again, but did not get any significant matches, as shown below:

    Screen shot 2010-12-10 at 2.39.28 AM.png
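
    The GT/AG check applied to the gap sequence can be reproduced with a short script.  This is a minimal sketch that only looks for the canonical GT and AG dinucleotides near the ends of a candidate intron; it is not a splice-site predictor, and the function name splice_signal_offsets is my own.

```python
def splice_signal_offsets(candidate):
    """Locate the canonical GT...AG intron signals in a candidate intron.

    Returns (gt_offset, ag_offset):
      gt_offset: number of bases before the first GT (0 = starts with GT)
      ag_offset: number of bases after the last AG (0 = ends with AG)
    Either value is None if the dinucleotide is absent.
    """
    seq = candidate.upper()
    gt = seq.find("GT")
    ag = seq.rfind("AG")
    gt_offset = gt if gt != -1 else None
    ag_offset = len(seq) - (ag + 2) if ag != -1 else None
    return gt_offset, ag_offset


# The gap sequence from the gene 200m cDNA alignment, copied from above.
gap_200m = (
    "GCTGCGTGCCATGGGCGAGAGAGCACCGAGCAGAGGGAGCTGGGCGCGCCATGGGAGGAA"
    "AGTTGGAGCGCCGGCAGAGGGCAGAAGGAG"
)
print(splice_signal_offsets(gap_200m))  # (5, 0)
```

    Because GT and AG occur frequently by chance, a small offset here is only weak evidence for an intron and needs support from the cDNA alignment.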

     

    To account for the possibility of inaccurate splice site prediction by the gene modelling programs, I ran 277-286kb of sequence 200 again, but this time against the EST database.  This search yielded very strong hits (five of the hits had an E value of 0, while about 30-40 had E values in the e-100 to e-179 range), as shown below.  Keeping in mind that the predicted gene 200m starts at about position 4323 of this blast search, we can see that the predicted gene may have started in the wrong place and missed some exons in the 1-3000 area of the blast diagram below.

    Screen shot 2010-12-10 at 11.17.42 AM.png
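
    When comparing blast diagrams for different query windows, it helps to convert positions between the query and sequence 200.  The sketch below does this conversion, assuming 1-based coordinates and assuming that the 277-286kb window begins exactly at position 277000; both are assumptions on my part, and the function names are hypothetical.

```python
def to_sequence_coord(query_pos, window_start):
    """Convert a 1-based position within a query window to a
    1-based position on the full sequence."""
    return window_start + query_pos - 1


def to_query_coord(seq_pos, window_start):
    """Inverse of to_sequence_coord."""
    return seq_pos - window_start + 1


# Assumption: the 277-286kb query begins exactly at position 277000.
gene_start = to_sequence_coord(4323, 277000)
print(gene_start)  # 281322

# The same position as it would appear in a query starting at 271000.
print(to_query_coord(gene_start, 271000))  # 10323
```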

     

    Based on the observation that the three left-most 'exons' in the above blast search were very close to the 5' end of my query sequence, I expanded my query sequence to 271000-285000 of sequence 200 and blasted against the cDNA database again.  Indeed, this seemed to reveal at least two more 'exons'.

    Screen shot 2010-12-10 at 11.38.41 AM.png

     

    Increasing the range to 261-285k revealed even more exons.

    Screen shot 2010-12-10 at 11.44.12 AM.png

    Same with 248000-285000.

    Screen shot 2010-12-10 at 12.00.27 PM.png


    Unfortunately, running blastx with 248-285k of sequence 200 did not return any strong matches.  Furthermore, all the hits were to transposons.

    Screen shot 2010-12-10 at 12.06.55 PM.png

    Gene 200o

    The FGENESH prediction had a strong hit (2.1e-166) with the HECTc domain.  Similarly, the Genemark prediction had a strong hit with the same HECTc domain (2.6e-67).  Both predictions also had hits with DUF913, a domain of unknown function from E3 ubiquitin ligase (F: 9.6e-60, G: 3.6e-56).  However, only the FGENESH prediction had a hit (4.9e-33) with DUF908, also from E3 ubiquitin ligase.  These results suggest that gene 200o is indeed an E3 ubiquitin-protein ligase.

    Blasting the FGENESH gene against cDNA returned only hits with E values of 0, most of them with identities of >90%, further supporting that this gene prediction is correct.

    Screen shot 2010-12-10 at 2.09.27 PM.png

    Conclusion Table

    Gene     Conclusion
    200a     pseudogene of the FAR1 family
    200b*    a water dikinase
    200f*    pseudogene of water dikinase, or an erroneous prediction due to overlap with gene 200g
    200g*    phosphoglucan, water dikinase; there seems to be an intron within the FGENESH prediction
    200m     not completely sure, but probably a pseudogene or false positive
    200o     E3 ubiquitin-protein ligase

     

    *Genes 200b, 200f and 200g may be paralogs.


    Files 17

    File; Description; Size; Date; Attached by

    200g 200b 200f alignment
    No description
    10.62 kB; 14:53, 10 Dec 2010; lau3

    mRNA200gFGENESHintronRemoved.rtf
    No description
    1997 bytes; 17:03, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 1.40.20 AM.png
    Gene 200f cDNA blastn
    6.19 kB; 02:40, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 11.17.42 AM.png
    blastn of 277-286kb of seq 200 (gene 200m)
    10.94 kB; 12:20, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 11.38.41 AM.png
    271000 to 285000 of sequence 200 (gene 200m)
    10.92 kB; 12:39, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 11.44.12 AM.png
    261000 to 285000 of sequence 200 (gene 200m)
    11.94 kB; 12:44, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 12.00.27 PM.png
    248-285k of seq 200 (gene 200m)
    9.58 kB; 13:01, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 12.06.55 PM.png
    248-285k blastx (gene 200m)
    12.51 kB; 13:07, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 12.17.22 AM.png
    Blastn of gene 200b with cDNA
    6.05 kB; 01:18, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 12.30.17 AM.png
    Blastn of gene 200a with cDNA
    4.87 kB; 01:30, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 12.49.02 PM.png
    blastn of gene 200g with cDNA
    8.79 kB; 13:49, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 12.56.16 AM.png
    Blastx of 38-47kb of sequence 200 (gene 200f)
    6.1 kB; 01:57, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 2.01.03 AM.png
    blastn of FGENESH 200m against ESTs
    55.37 kB; 03:06, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 2.09.27 PM.png
    Blastx of FGENESH 200g after removal of intron
    11.08 kB; 15:09, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 2.39.28 AM.png
    blastx of 200m after removal of intron
    5.1 kB; 03:39, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 4.07.19 PM.png
    blastx of FGENESH 200g after removal of intron
    8.23 kB; 17:11, 10 Dec 2010; lau3

    Screen shot 2010-12-10 at 4.07.26 PM.png
    blastx of FGENESH 200g before removal of intron
    8.16 kB; 17:11, 10 Dec 2010; lau3