Stopped working at 4:20pm Dec 10.

    Annotator: Kin Lau

    Sequence annotated: Sequence 200

    Supplementary data: Sequence 200 Supplemental Data

     

    Figure: cpgplot.seq200.png (CpG plot of Sequence 200)

    Sequence 200 is a 400 kb segment of chromosome 10 of the maize genome.  Approximately 82% of this region consists of transposons or retroelements.  The observed-to-expected frequencies of the AA and TT dinucleotides are high (1.32 and 1.27, respectively), perhaps because of repeated sequences, which would be consistent with the high proportion of transposons and retroelements.
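
    As an illustration of how observed-to-expected dinucleotide ratios like those above can be computed, here is a minimal Python sketch.  It assumes the expected count of a dinucleotide XY is count(X) * count(Y) / N, as if bases were independent; the tool and exact formula used for the figures in this report are not stated, so the numbers need not match.  The function name is hypothetical.

```python
from collections import Counter


def dinucleotide_obs_exp(seq, dinucs=("AA", "TT", "CG", "GC")):
    """Return observed/expected ratios for the requested dinucleotides.

    Assumption: expected count of XY = count(X) * count(Y) / N, where N is
    the number of unambiguous (A, C, G, T) bases in the sequence.
    """
    seq = seq.upper()
    base_counts = Counter(b for b in seq if b in "ACGT")
    n = sum(base_counts.values())
    # Count overlapping dinucleotides, skipping pairs that contain N
    # or any other ambiguity code.
    pair_counts = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in "ACGT" and seq[i + 1] in "ACGT"
    )
    ratios = {}
    for d in dinucs:
        expected = base_counts[d[0]] * base_counts[d[1]] / n if n else 0.0
        ratios[d] = pair_counts[d] / expected if expected else float("nan")
    return ratios


if __name__ == "__main__":
    # Toy sequence only; not taken from Sequence 200.
    print(dinucleotide_obs_exp("ACGTTTAACGCG"))
```

    A ratio above 1 means the dinucleotide occurs more often than the base composition alone would predict.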

    Exceptionally high CG or GC content suggests a high density of genes.  However, the CG and GC content of this region was moderate, with observed-to-expected frequencies of 0.67 and 0.88, respectively.  Looking at the CpG diagram and referring to the numerical data in S2, I found that the largest potential CpG islands were at approximately 63-72kb, 240-250kb, 329-337kb and 360-370kb.  The 329-337kb region corresponds to the 5' end of gene 200o.  However, the other three regions did not correspond to any of the genes predicted by FGENESH or GeneMark (see table below).  Thus, these regions may contain genes that were missed by the gene prediction tools.  Further examination revealed that the 63-72kb and 329-337kb regions were completely masked with Ns by RepeatMasker.  The other two regions were also mostly masked, except for 1.3kb in the 240-250kb segment and 423 nt in the 360-370kb segment.  I decided to run the four segments in their unmasked state through blastx to search for genes that might have been missed.  These searches are discussed after the table.
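
    The masking check described above amounts to counting how many bases in each region are not N.  A minimal Python sketch follows; it assumes the RepeatMasker output has already been loaded as one string with repeats replaced by N, and that region boundaries are given in kb.  The function name is hypothetical and the toy sequence is made up for illustration.

```python
def unmasked_bases(masked_seq, start_kb, end_kb):
    """Count non-N bases in the region [start_kb, end_kb), coordinates in kb."""
    region = masked_seq[start_kb * 1000:end_kb * 1000].upper()
    return len(region) - region.count("N")


if __name__ == "__main__":
    # Toy example: 1 kb masked, 1 kb unmasked, 1 kb masked.
    toy = "N" * 1000 + "ACGT" * 250 + "N" * 1000
    for start, end in [(0, 1), (1, 2), (0, 3)]:
        print(f"{start}-{end}kb: {unmasked_bases(toy, start, end)} unmasked nt")
```

    A result of 0 corresponds to a region that is completely masked.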

     

    For the following table, FGENESH and GeneMark were run on a RepeatMasker-processed sequence.  Blast searches were done with the predicted amino acid sequences against the SwissProt database using blastp.  The table is color coded according to how likely it is that the prediction is really a gene.  Blue means that the prediction is most likely inaccurate.  Light orange means that the prediction needs to be examined further, and dark orange means that the prediction very likely corresponds to a real gene.  Note that the dark orange and light orange genes were analyzed further on this page, where you can also find a table listing those genes, along with my interpretation of whether they are functional genes, which proteins they code for, and which genes they are related to.

     

    ID | Begin | End | Strand | Number of exons | Predictor (F = FGENESH, G = GeneMark) | BLAST result
    200a | 1081 | 27 | - | 2 | F | Weak match (E value 0.76) to a transcription factor (Far-red elongated hypocotyl 3) involved in light responses in Arabidopsis.
    200b | 24269 | 26408 | + | 6 | F | Very strong match to water dikinase in rice (e-88) and Arabidopsis (e-59).  Both aligned almost end to end.
    200c | 28886 | 30066 | + | 2 | G | Very weak match (E value 0.63) to histone acetyltransferase in mouse (so probably just a random hit).
    200d | 37729 | 38837 | + | 3 | G | No hits with an E value below 2.5.
    200e | 41528 | 42229 | + | 3 | G | No hits with an E value below 1.1.
    200e | 41528 | 41923 | + | 1 | F | Very weak match (E value 0.72) to ATP-dependent transporter ycf16 in Cyanophora paradoxa.
    200f | 42374 | 43426 | + | 4 | G | Weak matches (both e-5) to phosphoglucan, water dikinase in Arabidopsis and rice (amino acids 80-102 align in both cases).
    200g | 46226 | 52938 | + | 9 | G | Almost end-to-end alignment with phosphoglucan, water dikinase in rice (e-177) and Arabidopsis (e-125).  Domain hit (e-30) with pyruvate phosphate dikinase, PEP/pyruvate binding domain (amino acids 190-477).
    200g | 42567 | 51377 | + | 13 | F | Very strong match to phosphoglucan, water dikinase in rice (E value 0.0) and Arabidopsis (e-160).  Domain hit (e-51) with pyruvate phosphate dikinase, PEP/pyruvate binding domain (amino acids 340-666).
    200h | 86608 | 87575 | + | 3 | G | Weak match (E value 1.0) to adenine phosphoribosyltransferase in Herpetosiphon aurantiacus (a bacterium).
    200i | 119785 | 120107 | + | 2 | G | No hits with an E value below 5.5.
    200j | 160164 | 160408 | + | 2 | G | No hits.
    200k | 182790 | 182620 | - | 2 | G | Weak match (E value 0.034) to ATP synthase in Pseudomonas aeruginosa.
    200l | 231599 | 231071 | - | 3 | G | Weak match (E value 0.51) to an archaeal sequence.
    200m | 281323 | 282255 | + | 3 | G | Weak matches (all e-4) to coiled-coil domain-containing protein 102A in humans and cattle, and to zinc finger CCCH domain-containing protein 13 in humans.  The aligned amino acids are 62-150, 62-165 and 64-165.
    200m | 281323 | 282255 | + | 2 | F | Same hits and E values as the GeneMark prediction.  The aligned amino acids are 43-131, 43-153 and 53-147.
    200n | 283469 | 288179 | + | 2 | G | One hit, with an E value of 3.6.
    200o | 354585 | 337907 | - | 18 | G | Strong match (E value 0.0) to E3 ubiquitin-protein ligase UPL1 and UPL2 in Arabidopsis.  Strong alignment from amino acid 90 to 2669, but there seem to be insertions (or unremoved introns) in gene 200o between amino acids 1600 and 2200 compared with the Arabidopsis gene.  Domain hits with the HECTc (cd00078) domain (e-81) and a ubiquitin-protein ligase domain of unknown function, DUF913 (e-42).
    200o | 359024 | 343050 | - | 17 | F | Almost end-to-end alignment (E value 0.0 for both) with E3 ubiquitin-protein ligase UPL1 and UPL2 in Arabidopsis, starting at amino acid 66 and ending at 3633.  No insertions like those in the GeneMark prediction.  Domain hits with HECTc (cd00078) (e-122), DUF913 (e-46) and DUF908 (e-25).
    200p | 357184 | 359230 | + | 4 | G | Weak hits to proteins from humans and a bacterial species.

     

    Running blastx with the four major CpG islands identified

    CpG islands are associated with the 5' ends of genes.  For each of the four CpG islands, I ran blastx three times: first on the island itself, then on the 10kb flanking each end of the island.  Unfortunately, I did not find anything interesting.  More details about what I saw for each island are given below.
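
    The three queries per island can be laid out as coordinate windows before extracting the sequences for blastx.  The Python sketch below is only an illustration: the function name is hypothetical, coordinates are 0-based with exclusive ends, the island boundaries are the approximate kb values given in this report, and the sequence length of 400,000 nt is the rounded figure from the introduction.  "5' flank" and "3' flank" are relative to the plus strand.

```python
def blastx_query_windows(island_start, island_end, seq_len, flank=10_000):
    """Return (label, start, end) for an island and its two flanking windows.

    Flanks are clipped so they never run off either end of the sequence.
    """
    return [
        ("island", island_start, island_end),
        ("5' flank", max(0, island_start - flank), island_start),
        ("3' flank", island_end, min(seq_len, island_end + flank)),
    ]


if __name__ == "__main__":
    # Approximate island boundaries from the text, converted from kb to nt.
    islands_kb = [(63, 72), (240, 250), (329, 337), (360, 370)]
    seq_len = 400_000  # assumption: rounded length of Sequence 200
    for start_kb, end_kb in islands_kb:
        for label, start, end in blastx_query_windows(
            start_kb * 1000, end_kb * 1000, seq_len
        ):
            print(f"{start_kb}-{end_kb}kb {label}: {start}-{end}")
```

    Each window can then be sliced from the unmasked sequence and submitted as a separate blastx query.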

     

    63-72kb

    As expected, the actual island returned only transposons.  The 5' 10kb also returned transposons and retroelements.  The 3' end did not return anything with an E value lower than 0.008.

     

    240-250kb

    The actual island did not return anything with an E value lower than 0.038.  The 5' and 3' ends returned only transposons and retroelements.

     

    329-337kb

    The actual island returned only transposons and retroelements.  Likewise, the 5' end also returned only transposons and retroelements. 

    The 3' end had an e-100 match to E3 ubiquitin ligase TOM1-like from Neurospora crassa, a fungus.  However, the alignment was at approximately nucleotides 343000-345500 in sequence 200, which overlaps gene 200o in the table.

     

    360-370kb

    The actual island returned transposons.  The 5' end returned a ubiquitin ligase match, but that alignment also overlaps gene 200o.  The 3' end did not return anything with an E value lower than 0.013.

     


    Files (5)

    File | Description | Size | Date | Attached by
    200f aligned with 200b | No description | 2.71 kB | 14:48, 10 Dec 2010 | lau3
    200g aligned with 200b | No description | 8.36 kB | 14:48, 10 Dec 2010 | lau3
    200g aligned with 200f | No description | 7.96 kB | 14:48, 10 Dec 2010 | lau3
    cpgplot.seq200.png | No description | 8.7 kB | 13:07, 18 Sep 2010 | lau3
    maskedseq.txt | sequence returned by RepeatMasker | 398.44 kB | 02:11, 8 Dec 2010 | lau3