Excerpt from the English article:
High-throughput sequencing of genomes (DNA-Seq) and transcriptomes (RNA-Seq) has opened the way to study the genetic and functional information stored within any organism at an unprecedented scale and speed. For example, RNA-Seq allows, in principle, the simultaneous study of transcript structure (such as alternative splicing), allelic information (e.g., SNPs), and expression with high resolution and large dynamic range [1]. These advances greatly facilitate functional genomics research in species for which genetic or financial resources are limited, including many ‘non-model’ organisms that are nevertheless of substantial ecological or evolutionary importance. While many genomic applications have traditionally relied on the availability of a high-quality genome sequence, such sequences have been determined for only a very small portion of known organisms. Furthermore, sequencing and assembling a genome is still a costly endeavor in many cases, due to genome size and repeat content. Conversely, since the transcriptome is only a fraction of the total genomic sequence, RNA-Seq data can provide a rapid and cheaper ‘fast track’, within reach of any lab, to delineating a reference transcriptome for downstream applications such as alignment, phylogenetics or marker construction. Indeed, even within a whole-genome sequencing project, RNA-Seq has become an essential source of evidence for transcribed gene identification and exon structure annotation. Realizing the full potential of RNA-Seq requires computational methods that can assemble a transcriptome even when a genome sequence is not available. There are primarily two ways to convert raw RNA-Seq data to transcript sequences: with the guidance of assembled genomic sequences or via de novo assembly [2, 3]. The genome-guided approach to transcriptome studies has quickly become a standard approach to RNA-Seq analysis for model organisms, and several software packages exist for this purpose [4, 5].
It cannot, however, be applied to organisms without a well-assembled genome, and even if one is present, the results may vary across genome assembly versions. In such cases, a de novo transcriptome assembler is required. However, the process of assembling a transcriptome violates many of the assumptions of assemblers written for genomic DNA. For example, uniform coverage and the ‘one locus – one contig’ paradigm are not valid for RNA: an accurate transcriptome assembler will produce one contig per distinct transcript (isoform) rather than per locus, and different transcripts will have different coverage, reflecting their different expression levels. Several tools are now available for de novo assembly of RNA-Seq data. Trans-ABySS [6], Velvet-Oases [7], and SOAPdenovo-trans (http://soap.genomics.org.cn/SOAPdenovo-Trans.html) are all extensions of earlier genome assemblers. We previously described an alternative and novel method for transcriptome assembly called Trinity [8]. Trinity partitions RNA-Seq data into many independent de Bruijn graphs, ideally one graph per expressed gene, and uses parallel computing to reconstruct transcripts from these graphs, including alternatively spliced isoforms. Trinity can leverage strand-specific Illumina paired-end (PE) libraries, but can also accommodate non-strand-specific and single-end (SE) read data. Trinity reconstructs transcripts accurately with a simple and intuitive interface that requires little to no parameter tuning. Several independent studies have demonstrated that Trinity is highly effective compared to alternative methods (e.g., refs. 9-11 and the DREAM Project’s Alternative Splicing Challenge, http://www.the-dream-project.org/result/alternativesplicing). As an indication of Trinity’s utility, it acquired an avid user base with ~200 citations between its publication in May 2011 and March 2013 (http://scholar.google.com/scholar?oi=bibs&hl=en&cites=14735674943942667509).
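The idea of partitioning reads into independent de Bruijn graphs, described above, can be pictured with a toy sketch. This is not Trinity's implementation. It is a minimal illustration that assumes error-free reads, a single strand, and a small hypothetical k. The function names `build_de_bruijn` and `connected_components` and the example reads are invented for illustration. Nodes are (k-1)-mers, each k-mer contributes one edge, and reads that share no k-mers fall into separate connected components, which stand in for the "one graph per expressed gene" ideal.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def connected_components(graph):
    """Split the graph into weakly connected components (edge direction ignored)."""
    undirected = defaultdict(set)
    for src, targets in graph.items():
        for dst in targets:
            undirected[src].add(dst)
            undirected[dst].add(src)
    undirected = dict(undirected)  # freeze the key set before traversal

    seen, components = set(), []
    for start in undirected:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(undirected[node] - seen)
        components.append(component)
    return components


# Hypothetical reads: the first two overlap each other, as do the last two,
# but the two pairs share no k-mers (think of two unrelated genes).
reads = ["ATGGCGTACG", "GGCGTACGTT", "TTTCCCAAAG", "CCCAAAGGGA"]
graph = build_de_bruijn(reads, k=5)
components = connected_components(graph)
print(len(components))  # 2
```

Because the components share no nodes, each one could be handed to a separate worker, which is the intuition behind reconstructing transcripts from the graphs in parallel.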
Trinity users study a broad range of model and non-model organisms from all kingdoms, and come from small labs and large genome projects alike (e.g., the pea aphid genome annotation v2; Fabrice Legeai, INRA, and Terence Murphy, RefSeq NCBI, personal communications). Trinity also has an active developer community, which has greatly enhanced its performance and utility (see http://trinityrnaseq.sourceforge.net). For example, while the first release was not computationally efficient [11], the Trinity developer community has since improved its efficiency, halving memory requirements and increasing processing speed through increased parallelization and improved algorithms (ref. 12; M. Ott, personal communication). Furthermore, Trinity was converted into a modular platform that seamlessly uses third-party tools, such as Jellyfish [13] for building the initial k-mer catalog. Other third-party tools integrated into Trinity have enhanced the utility of its reconstructed transcriptomes. For example, as described below, Trinity now supports tools (e.g., RSEM [14], edgeR [15] and DESeq [16]) that take its output transcripts and test for differential expression, while accounting for both technical and biological sources of variation [17-19] and correcting for multiple hypothesis testing. Given Trinity’s popularity and substantial enhancements since publication, it is important to provide detailed protocols that leverage its various features. The protocols we present below will maximize Trinity’s utility to users for studies in non-model organisms, and inform the broad developer community on areas for future enhancements.
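The correction for multiple hypothesis testing mentioned above can be illustrated with a short sketch. The text does not name the correction used. The Benjamini-Hochberg procedure shown here is an assumption, chosen because it is a widely used way to control the false discovery rate when thousands of transcripts are tested at once. The function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values can be forced to be non-decreasing in rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four transcripts.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adjusted]
```

With these made-up inputs only the first transcript passes a 5% false discovery rate threshold, even though three raw p-values are below 0.05, which is the practical effect of the correction.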