Next-Gen Sequencing

    Genomics Users Group:  

    • Meets alternate Tuesdays, WSLR 116, 1:00 pm (e.g., 25 Mar 2014)
      contact gribskov 27 March 2014



    Data Cleaning

    Data cleaning is generally considered a necessary first step in any data mining or large-scale data analysis project.  Sequence libraries may contain contaminants or uninteresting reads, residual vector or adapter sequence, and duplicate or low-quality reads.  Removing these will generally improve the speed of subsequent analyses, and often the quality of the result as well.  You must carefully consider which kinds of sequences you wish to filter and the thresholds you want to use.  The default settings supplied with software may not be suitable for your analysis.

    Removing contaminants and uninteresting sequences

    Sequence libraries are often contaminated with RNA or DNA from other organisms present in the sample.  These may include bacteria or fungi, viruses that infect either the organism of interest or associated bacteria, intracellular parasites or symbionts, laboratory vectors, insect or human cells (physical contamination), or even residual samples from earlier runs.  In addition, Illumina sequencing typically has phi-X174 added as an internal control.  These control sequences should be removed by the Illumina CASAVA software, but some usually remain.

    • DeconSeq [Schmieder and Edwards, 2011]
      DeconSeq matches read sequences to user-defined libraries using BWA. Typically we use DeconSeq to detect viral, bacterial, mitochondrial, chloroplast, and rRNA sequences.  The rRNA sequences, in particular, are an issue because they are abundant and often cause very slow assembly in the Butterfly stage of Trinity.
      DeconSeq removes reads that match a library sequence with at least a specified coverage (% of bases matched) and identity (% of identical bases in the matching segment).
      Using Deconseq
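
    The coverage and identity thresholds described above can be sketched as a simple filter. This is only an illustration of the decision rule, not DeconSeq's own code: the `Hit` record, the function names, and the threshold values (90% coverage, 94% identity) are assumptions chosen for the example, and the alignment numbers would in practice come from the BWA output.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Hit:
    """Best alignment of one read against a contaminant library (hypothetical record)."""
    read_id: str
    read_length: int      # total bases in the read
    aligned_length: int   # read bases covered by the alignment
    identical_bases: int  # identical bases within the aligned segment


def coverage(hit: Hit) -> float:
    """Percent of the read's bases that are matched by the alignment."""
    return 100.0 * hit.aligned_length / hit.read_length


def identity(hit: Hit) -> float:
    """Percent of identical bases within the matching segment."""
    if hit.aligned_length == 0:
        return 0.0
    return 100.0 * hit.identical_bases / hit.aligned_length


def is_contaminant(hit: Hit, min_coverage: float, min_identity: float) -> bool:
    """A read is flagged only when BOTH thresholds are met."""
    return coverage(hit) >= min_coverage and identity(hit) >= min_identity


def split_reads(
    hits: Iterable[Hit],
    min_coverage: float = 90.0,   # illustrative value, not a recommendation
    min_identity: float = 94.0,   # illustrative value, not a recommendation
) -> Tuple[List[Hit], List[Hit]]:
    """Return (kept, removed) reads according to the two thresholds."""
    kept, removed = [], []
    for hit in hits:
        if is_contaminant(hit, min_coverage, min_identity):
            removed.append(hit)
        else:
            kept.append(hit)
    return kept, removed


if __name__ == "__main__":
    example = [
        Hit("r1", 100, 95, 94),  # high coverage, high identity
        Hit("r2", 100, 50, 50),  # only half of the read matches
        Hit("r3", 100, 95, 80),  # long match, but many mismatches
    ]
    kept, removed = split_reads(example)
    print("kept:", [h.read_id for h in kept])
    print("removed:", [h.read_id for h in removed])
```

    Raising either threshold makes the filter more conservative (fewer reads removed); lowering them removes more reads but risks discarding genuine sequence that resembles a library entry.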

    Removing adapter sequence

    • Adapter_clean.pl
    • Trimmomatic

    Removing duplicates


    • Trinity, transcriptome assembly
    • RSEM, read quantification

    Differentially Expressed Genes

    Note on Trinity (Brian Haas): 

    Our recommended strategy, in the case you want to extract the smaller
    subset of most highly supported transcripts, is to run RSEM to estimate
    expression levels, and then to filter the contigs based on minimum
    expression thresholds.  This is detailed here:
    (filtering at bottom of page)
    • edgeR
    • DESeq
    • EBSeq
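
    The filtering strategy in the note above (estimate expression with RSEM, then keep only contigs above a minimum expression threshold) might be sketched as follows. This is a minimal illustration, not the Trinity filtering script: it assumes a tab-delimited RSEM isoform results table with `transcript_id` and `TPM` columns, and the function name and the threshold of 1.0 are hypothetical choices for the example.

```python
import csv
import io
from typing import Iterable, List


def select_expressed(
    rsem_lines: Iterable[str],
    min_value: float = 1.0,   # illustrative threshold, choose one for your data
    column: str = "TPM",      # assumed column name in the RSEM results table
) -> List[str]:
    """Return IDs of transcripts whose expression is at least min_value."""
    reader = csv.DictReader(rsem_lines, delimiter="\t")
    return [
        row["transcript_id"]
        for row in reader
        if float(row[column]) >= min_value
    ]


if __name__ == "__main__":
    # Tiny made-up table standing in for an RSEM isoform results file.
    table = io.StringIO(
        "transcript_id\tgene_id\tlength\teffective_length\t"
        "expected_count\tTPM\tFPKM\tIsoPct\n"
        "c1_seq1\tc1\t1200\t1050.0\t340.0\t12.5\t10.1\t100.0\n"
        "c2_seq1\tc2\t800\t650.0\t4.0\t0.3\t0.2\t100.0\n"
    )
    print(select_expressed(table, min_value=1.0))
```

    With a real results file, an open file handle can be passed in place of the `io.StringIO` object; the returned IDs can then be used to subset the assembled contigs.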

    User groups



    • Metzker, M. L. (2010). "Sequencing technologies - the next generation." Nat Rev Genet 11(1): 31-46. PDF


    • Wang, Z., Gerstein, M. & Snyder, M. (2009). "RNA-Seq: a revolutionary tool for transcriptomics." Nature Rev Genet 10:57-63. PDF
    • Oshlack, A., M. Robinson, et al. (2010). "From RNA-seq reads to differential expression results." Genome Biology 11(12): 220. PDF


    • Schmieder R, Edwards R. Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PLoS One. 2011 6:e17288. doi: 10.1371/journal.pone.0017288. PubMed PMID: 21408061. PDF


    • Oshlack A, Wakefield MJ: Transcript length bias in RNA-seq data confounds systems biology. Biol Direct 2009, 4:14.
    • (DeSeq) Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106. doi: 10.1186/gb-2010-11-10-r106. Epub 2010 Oct 27. PubMed PMID: 20979621; PubMed Central PMCID: PMC3218662. PDF
    • Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 2010, 11:94.
    • Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010, 11:R25.
    • (EdgeR) Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010 Jan 1;26(1):139-40. doi: 10.1093/bioinformatics/btp616. Epub 2009 Nov 11. PubMed PMID: 19910308; PubMed Central PMCID: PMC2796818.  PDF
    • Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2011 Oct 14. [Epub ahead of print] PubMed PMID: 22003245. doi: 10.1093/biostatistics/kxr031. Article PDF
    • (RSEM) Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011 Aug 4;12:323. doi: 10.1186/1471-2105-12-323. PubMed PMID: 21816040; PubMed Central PMCID: PMC3163565.  PDF
    • Kasper D. Hansen, Rafael A. Irizarry, and Zhijin Wu. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics, prepub, 2012. doi: 10.1093. PDF
    • (EBSeq) Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD, Gould  MN, Stewart RM, Kendziorski C. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013 Apr 15;29(8):1035-43. doi: 10.1093/bioinformatics/btt087. Epub 2013 Feb 21. PubMed PMID: 23428641; PubMed Central PMCID: PMC3624807.  PDF


    • Birol, I. et al. De novo transcriptome assembly with ABySS. Bioinformatics 25, 2872–2877 (2009). 
    • Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009). (SOAPdenovo)
    • Guttman, M. et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010). (Scripture) PDF Software
    • (Trinity) Grabherr, M. G., B. J. Haas, et al. (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome." Nat Biotech 29(7): 644-652. PDF
    • (Trinity protocol) Haas, B., et al. (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protocols 8(8):1494-1512.
    • Oases