Was this page helpful?

In silico normalization

    Table of contents
    1. 1. Synopsis
    2. 2. Sample job
    3. 3. Sample Output
    4. 4. Source Code
    5. 5. Notes

    Part of the Trinity pipeline

    Synopsis

    Large RNA-Seq data sets, such as those exceeding 300M pairs, are best suited for in silico normalization prior to running Trinity, in order to reduce memory requirements and greatly improve upon runtimes.  One way of judging whether the normalization is "safe" is to check whether any kmers are lost.  In the examples below, no kmers are lost.

    Sample job

    #!/bin/sh -l
    
    #PBS -N Trinity_norm3
    #PBS -q mgribsko-d
    #PBS -l nodes=1:ppn=24
    #PBS -l walltime=720:00:00
    
    module load trinity
    
    cd $PBS_O_WORKDIR
    pwd
    
    cat normalize.job
    
    date +"%d %B %Y %H:%M:%S"
    echo " "
    
    normalize_by_kmer_coverage.pl \
    --seqType fq --JM 190G --max_cov 300 \
    --left ../t7/at_wt_hen_med.left.fq \
    --right ../t7/at_wt_hen_med.right.fq \
    --pairs_together --PE_reads_unordered \
    --PARALLEL_STATS --JELLY_CPU 24 \
    --out k25.c300.pctSD200 \
    > norm3.log
    
    echo " "
    date +"%d %B %Y %H:%M:%S"
    

    Sample Output

    Full run, 360,118,799 reads (paired+unpaired)
    Rossmann, nodes=1:ppn=24
    walltime:9:47:20
    this run done with much higher cov than recommended (cov=300)

    Converting input files. (both directions in parallel)CMD: /group/bioinfo/apps/apps/trinityrnaseq_r20131110/util/..//trinity-plugins/fastool/fastool --illumina-trinity --to-fasta /group/mgribsko/genomics/arabidopsis/t8/../t7/at_wt_hen_med.left.fq >> left.fa
    CMD: /group/bioinfo/apps/apps/trinityrnaseq_r20131110/util/..//trinity-plugins/fastool/fastool --illumina-trinity --to-fasta /group/mgribsko/genomics/arabidopsis/t8/../t7/at_wt_hen_med.right.fq >> right.fa
    CMD finished (749 seconds)
    CMD finished (1215 seconds)
    Done converting input files.CMD: cat left.fa right.fa > both.fa
    -------------------------------------------
    ----------- Jellyfish  --------------------
    -- (building a k-mer catalog from reads) --
    -------------------------------------------
    
    CMD finished (399 seconds)
    CMD: /group/bioinfo/apps/apps/trinityrnaseq_r20131110/util/..//trinity-plugins/jellyfish/bin/jellyfish count -t 24 -m 25 -s 21634352035  --both-strands  both.fa
    CMD finished (2690 seconds)
    CMD: /group/bioinfo/apps/apps/trinityrnaseq_r20131110/util/..//trinity-plugins/jellyfish/bin/jellyfish histo -o jellyfish.K25.min2.kmers.fa.histo mer_counts_0
    CMD finished (30 seconds)
    CMD: /group/bioinfo/apps/apps/trinityrnaseq_r20131110/util/..//trinity-plugins/jellyfish/bin/jellyfish dump -L 2 mer_counts_0 > jellyfish.K25.min2.kmers.fa
    CMD finished (77 seconds)
    CMD: touch jellyfish.K25.min2.kmers.fa.success
    CMD finished (0 seconds)
    CMD: /group/bioinfo/apps/apps/trinityrnaseq_r20131110/util/..//Inchworm/bin/fastaToKmerCoverageStats --reads left.fa --kmers jellyfish.K25.min2.kmers.fa --kmer_size 25   --DS  > left.fa.K25.stats
    CMD: /group/bioinfo/apps/apps/trinityrnaseq_r20131110/util/..//Inchworm/bin/fastaToKmerCoverageStats --reads right.fa --kmers jellyfish.K25.min2.kmers.fa --kmer_size 25   --DS  > right.fa.K25.stats
    -reading Kmer occurences...
    -reading Kmer occurences...
    
     done parsing 146105003 Kmers, 146105003 added, taking 298 seconds.
    
     done parsing 146105003 Kmers, 146105003 added, taking 305 seconds.
    CMD finished (14298 seconds)
    CMD finished (22398 seconds)
    -not trusting read ordering, sorting each stats file.
    CMD: sort -k5,5 -T . -S 190G left.fa.K25.stats > left.fa.K25.stats.sort
    CMD: sort -k5,5 -T . -S 190G right.fa.K25.stats > right.fa.K25.stats.sort
    CMD finished (488 seconds)
    CMD finished (718 seconds)
    CMD: /group/bioinfo/apps/apps/trinityrnaseq_r20131110/util/..//util/nbkc_merge_left_right_stats.pl --left left.fa.K25.stats.sort --right right.fa.K25.stats.sort  --sorted  > pairs.K25.stats
    CMD finished (2292 seconds)
    CMD: /group/bioinfo/apps/apps/trinityrnaseq_r20131110/util/..//util/nbkc_normalize.pl pairs.K25.stats 300 200 > pairs.K25.stats.C300.pctSD200.accs
    32920920 / 135209248 = 24.35% reads selected during normalization.
    133133 / 135209248 = 0.10% reads discarded as likely aberrant based on coverage profiles.
    0 / 135209248 = 0.00% reads missing kmer coverage (N chars included?).
    CMD finished (438 seconds)
    Normalization complete. See outputs: /group/mgribsko/genomics/arabidopsis/t8/../t7/at_wt_hen_med.left.fq.normalized_K25_C300_pctSD200.fq, /group/mgribsko/genomics/arabidopsis/t8/../t7/at_wt_hen_med.right.fq.normalized_K25_C300_pctSD200.fq
    

    Rerun with existing kmer files
    360,118,799 reads (paired+unpaired)
    Rossmann, nodes=1:ppn=24
    walltime:2:11:29
    this run done with recommended cov (cov=30)

    Converting input files. (both directions in parallel)left file exists, nothing to doright file exists, nothing to do-------------------------------------------
    ----------- Jellyfish  --------------------
    -- (building a k-mer catalog from reads) --
    -------------------------------------------
    
    -not trusting read ordering, sorting each stats file.
    Done converting input files.CMD: sort -k5,5 -T . -S 190G left.fa.K25.stats > left.fa.K25.stats.sort
    CMD: sort -k5,5 -T . -S 190G right.fa.K25.stats > right.fa.K25.stats.sort
    CMD finished (465 seconds)
    CMD finished (710 seconds)
    CMD: /group/bioinfo/apps/apps/trinityrnaseq_r20131110/util/..//util/nbkc_merge_left_right_stats.pl --left left.fa.K25.stats.sort --right right.fa.K25.stats.sort  --sorted  > pairs.K25.stats
    CMD finished (2280 seconds)
    CMD: /group/bioinfo/apps/apps/trinityrnaseq_r20131110/util/..//util/nbkc_normalize.pl pairs.K25.stats 30 200 > pairs.K25.stats.C30.pctSD200.accs
    5838321 / 135209248 = 4.32% reads selected during normalization.
    133133 / 135209248 = 0.10% reads discarded as likely aberrant based on coverage profiles.
    0 / 135209248 = 0.00% reads missing kmer coverage (N chars included?).
    CMD finished (457 seconds)
    Normalization complete. See outputs: /group/mgribsko/genomics/arabidopsis/t8/../t7/at_wt_hen_med.left.fq.normalized_K25_C30_pctSD200.fq, /group/mgribsko/genomics/arabidopsis/t8/../t7/at_wt_hen_med.right.fq.normalized_K25_C30_pctSD200.fq
    

    Source Code

    The relevant code controlling the selection of reads is in $TRINITY_HOME/util/nbkc_normalize.pl

     while (<$fh>) {
    
            $count_total++;
    
            chomp;
            my $line = $_;
            my ($med_cov, $avg_cov, $stdev, $pct_dev, $core_acc) = split(/\t/);
    
            $core_acc =~ s|/[12]$||;
    
            if ($med_cov < 1) {
    
                $count_no_cov++;   ## this shouldn't happen in modern versions of this process. (bh 10-2013)
                next;
            }
    
            if ($pct_dev > $max_pct_stdev) {
    
                # print STDERR "// excluding $_ as aberrant\n";
    
                $count_aberrant_and_discarded++;
                next;
            }
    
            if (rand(1) <= $max_cov/$med_cov) {
                print "$core_acc\n";
                $count_selected++;
            }
        }
    
    • reads are discarded if there percent standard deviation is greater than the specified threshold
    • if the median coverage ($max_cov) is less than the target ($max_cov) the read is kept
    • if the the median coverage is higher, the read is kept probabalistically where the probability of retaining the read is $max_cov/$med_cov

    Notes

    • Uses only paired reads.  If you have not processed your left and right read files to be exactly the same, be sure to use the --PE_reads_unordered.  If your left reads and reight reads do not match, the error will be
      Error, core accs are not equivalent: [HWI-ST1234:37:C0N56ACXX:4:1101:4556:2136] vs. [HWI-ST1234:40:C0WCUACXX:8:1101:1329:2207] reads
    • You can normalize with different thresholds without rerunning the jellyfish and kmer analysis steps
    Was this page helpful?
    Tag page (Edit tags)
    • No tags
    You must login to post a comment.