Human plasma contains >40,000 different coding and non-coding RNAs that are potential biomarkers for human diseases. Here, we used thermostable group II intron reverse transcriptase sequencing (TGIRT-seq) combined with peak calling to simultaneously profile all RNA biotypes in apheresis-prepared human plasma pooled from healthy individuals. Extending previous TGIRT-seq analysis, we found that human plasma contains largely fragmented mRNAs from >19,000 protein-coding genes, abundant full-length, mature tRNAs and other structured small non-coding RNAs, and less abundant tRNA fragments and mature and pre-miRNAs. Many of the mRNA fragments identified by peak calling correspond to annotated protein-binding sites and/or have stable predicted secondary structures that could afford protection from plasma nucleases. Peak calling also identified novel repeat RNAs, miRNA-sized RNAs, and putatively structured intron RNAs of potential biological, evolutionary, and biomarker significance, including a family of full-length excised intron RNAs, subsets of which correspond to mirtron pre-miRNAs or agotrons.
The reverse transcriptases (RTs) encoded by mobile group II introns and other non-LTR retroelements differ from retroviral RTs in being able to template-switch efficiently from the 5′ end of one template to the 3′ end of another with little or no complementarity between the donor and acceptor templates. Here, to establish a complete kinetic framework for the reaction and to identify conditions that more efficiently capture acceptor RNAs or DNAs, we used a thermostable group II intron RT (TGIRT; GsI-IIC RT) that can template switch directly from synthetic RNA template/DNA primer duplexes having either a blunt end or a 3′-DNA overhang end. We found that the rate and amplitude of template switching are optimal from starter duplexes with a single-nucleotide 3′-DNA overhang complementary to the 3′ nucleotide of the acceptor RNA, suggesting a role for nontemplated nucleotide addition of a complementary nucleotide to the 3′ end of cDNAs synthesized from natural templates. Longer 3′-DNA overhangs progressively decreased the template-switching rate, even when complementary to the 3′ end of the acceptor template. The reliance on only a single bp with the 3′ nucleotide of the acceptor together with discrimination against mismatches and the high processivity of group II intron RTs enable synthesis of full-length DNA copies of nucleic acids beginning directly at their 3′ end. We discuss the possible biological functions of the template-switching activity of group II intron- and other non-LTR retroelement-encoded RTs, as well as the optimization of this activity for adapter addition in RNA- and DNA-Seq protocols.
Extracellular vesicles (EVs) encompass a variety of vesicles secreted into the extracellular space. EVs have been implicated in promoting tumor metastasis, but the molecular composition of tumor-derived EV sub-types and the mechanisms by which molecules are sorted into EVs remain mostly unknown. We report the separation of two small EV sub-populations from a metastatic breast cancer cell line, with biochemical features consistent with different sub-cellular origins. These EV sub-types use different mechanisms of miRNA sorting (selective and non-selective), suggesting that sorting occurs via fundamentally distinct processes, possibly dependent on EV origin. Using biochemical and genetic tools, we identified the Lupus La protein as mediating sorting of selectively packaged miRNAs. We found that two motifs embedded in miR-122 are responsible for high-affinity binding to Lupus La and sorting into vesicles formed in a cell-free reaction. Thus, tumor cells can simultaneously deploy multiple EV species using distinct sorting mechanisms that may enable diverse functions in normal and cancer biology.
5’ ends are important for determining the fate of RNA molecules. BCDIN3D is an RNA phospho-methyltransferase that methylates the 5’ monophosphate of specific RNAs. In order to gain new insights into the molecular function of BCDIN3D, we performed an unbiased analysis of its interacting RNAs by Thermostable Group II Intron Reverse Transcriptase coupled to next generation sequencing (TGIRT-seq). Our analyses showed that BCDIN3D interacts with full-length phospho-methylated tRNAHis and miR-4454. Interestingly, we found that miR-4454 is not synthesized from its annotated genomic locus, which is a primer-binding site for an endogenous retrovirus, but rather by Dicer cleavage of mature tRNAHis. Sequence analysis revealed that miR-4454 is identical to the 3’ end of tRNAHis. Moreover, we were able to generate this ‘miRNA’ in vitro through incubation of mature tRNAHis with Dicer. As found previously for several pre-miRNAs, a 5’P-tRNAHis appears to be a better substrate for Dicer cleavage than a phospho-methylated tRNAHis. Moreover, tRNAHis 3’-fragment/‘miR-4454’ levels increase in cells depleted for BCDIN3D. Altogether, our results show that in addition to microRNAs, BCDIN3D regulates tRNAHis 3’-fragment processing without negatively affecting tRNAHis’s canonical function of aminoacylation.
Group II introns, self-splicing retrotransposons, serve as both targets of investigation into their structure, splicing, and retromobility and a source of tools for genome editing and RNA analysis. Here, we describe the first cryo-electron microscopy (cryo-EM) structure determination, at 3.8–4.5 Å, of a group II intron ribozyme complexed with its encoded protein, containing a reverse transcriptase (RT), required for RNA splicing and retromobility. We also describe a method called RIG-seq using a retrotransposon indicator gene for high-throughput integration profiling of group II introns and other retrotransposons. Targetrons, RNA-guided gene targeting agents widely used for bacterial genome engineering, are described next. Finally, we detail thermostable group II intron RTs, which synthesize cDNAs with high accuracy and processivity, for use in various RNA-seq applications and relate their properties to a 3.0-Å crystal structure of the protein poised for reverse transcription. Biological insights from these group II intron revelations are discussed.
Thermostable group II intron reverse transcriptases (TGIRTs) with high fidelity and processivity have been used for a variety of RNA sequencing (RNA-seq) applications, including comprehensive profiling of whole-cell, exosomal, and human plasma RNAs; quantitative tRNA-seq based on the ability of TGIRT enzymes to give full-length reads of tRNAs and other structured small ncRNAs; high-throughput mapping of post-transcriptional modifications; and RNA structure mapping. Here, we improved TGIRT-seq methods for comprehensive transcriptome profiling by rationally designing RNA-seq adapters that minimize adapter dimer formation. Additionally, we developed biochemical and computational methods for remediating 5′- and 3′-end biases, the latter based on a random forest regression model that provides insight into the contribution of different factors to these biases. These improvements, some of which may be applicable to other RNA-seq methods, increase the efficiency of TGIRT-seq library construction and improve coverage of very small RNAs, such as miRNAs. Our findings provide insight into the biochemical basis of 5′- and 3′-end biases in RNA-seq and suggest general approaches for remediating biases and decreasing adapter dimer formation.
Prokaryotic CRISPR-Cas systems provide adaptive immunity by integrating portions of foreign nucleic acids (spacers) into genomic CRISPR arrays. Cas6 proteins then process CRISPR array transcripts into spacer-derived RNAs (CRISPR RNAs; crRNAs) that target Cas nucleases to matching invaders. We find that a Marinomonas mediterranea fusion protein combines three enzymatic domains (Cas6, reverse transcriptase [RT], and Cas1), which function in both crRNA biogenesis and spacer acquisition from RNA and DNA. We report a crystal structure of this divergent Cas6, identify amino acids required for Cas6 activity, show that the Cas6 domain is required for RT activity and RNA spacer acquisition, and demonstrate that CRISPR-repeat binding to Cas6 regulates RT activity. Co-evolution of putative interacting surfaces suggests a specific structural interaction between the Cas6 and RT domains, and phylogenetic analysis reveals repeated, stable association of free-standing Cas6s with CRISPR RTs in multiple microbial lineages, indicating that a functional interaction between these proteins preceded evolution of the fusion.
Comparing the abundance of one RNA molecule to another is crucial for understanding cellular functions, but most sequencing techniques can target only specific subsets of RNA. In this study, we used a new fragmented ribodepleted TGIRT (FRT) sequencing method that uses a thermostable group II intron reverse transcriptase (TGIRT) to generate a portrait of the human transcriptome depicting the quantitative relationship of all classes of nonribosomal RNA longer than 60 nt. Comparison between different sequencing methods indicated that FRT is more accurate in ranking both mRNA and noncoding RNA than viral reverse transcriptase-based sequencing methods, even those that specifically target these species. Measurements of RNA abundance in different cell lines using this method correlate with biochemical estimates, confirming tRNA as the most abundant nonribosomal RNA biotype. However, the single most abundant transcript is 7SL RNA, a component of the signal recognition particle. Structured noncoding RNAs (sncRNAs) associated with the same biological process are expressed at similar levels, with the exception of RNAs with multiple functions like U1 snRNA. In general, sncRNAs forming RNPs are hundreds to thousands of times more abundant than their mRNA counterparts. Surprisingly, only 50 sncRNA genes produce half of the non-rRNA transcripts detected in two different cell lines. Together, the results indicate that the human transcriptome is dominated by a small number of highly expressed sncRNAs specializing in functions related to translation and splicing.
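The statement that a small set of genes accounts for half of the detected transcripts can be checked with a cumulative-abundance calculation. The following is a minimal sketch under stated assumptions, not the authors' analysis: the function name `genes_for_fraction`, the input format (a mapping from gene name to read or transcript count), and the toy counts are all hypothetical.

```python
def genes_for_fraction(counts, fraction=0.5):
    """Return the smallest number of genes whose summed counts reach
    `fraction` of the total, taking genes from most to least abundant.

    `counts` maps gene name -> non-negative abundance (hypothetical input format).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    running = 0
    # Walk down the abundance ranking until the target share is reached.
    for n, value in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += value
        if running >= fraction * total:
            return n
    return len(counts)


# Toy, made-up counts used only to illustrate the call.
toy_counts = {"gene_a": 500, "gene_b": 300, "gene_c": 150, "gene_d": 30, "gene_e": 20}
print(genes_for_fraction(toy_counts, 0.5))
```

Applied to a real count table with rRNA reads excluded, the same call would report how many genes are needed to reach half of the non-rRNA signal.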
The thermostable Geobacillus stearothermophilus GsI-IIC intron is among the few bacterial group II introns found to proliferate to high copy number in its host genome. Here, we developed a bacterial genetic assay for retrohoming and biochemical assays for protein-dependent and self-splicing of GsI-IIC. We found that GsI-IIC, like other group IIC introns, retrohomes into sites having a 5'-exon DNA hairpin, typically from a bacterial transcription terminator, followed by short intron-binding sequences (IBSs) recognized by base pairing of exon-binding sequences (EBSs) in the intron RNA. Intron RNA insertion occurs preferentially but not exclusively into the parental lagging strand at DNA replication forks, using a nascent lagging strand DNA as a primer for reverse transcription. In vivo mobility assays, selections, and mutagenesis indicated that a variety of GC-rich DNA hairpins of 7-19 bp with continuous base pairs or internal elbow regions support efficient intron mobility and identified a critically recognized nucleotide (T-5) between the hairpin and IBS1, a feature not reported previously for group IIC introns. Neither the hairpin nor T-5 is required for intron excision or lariat formation during RNA splicing, but the 5'-exon sequence can affect the efficiency of exon ligation. Structural modeling suggests that the 5'-exon DNA hairpin and T-5 bind to the thumb and DNA-binding domains of GsI-IIC reverse transcriptase. This mode of DNA target site recognition enables the intron to proliferate to high copy number by recognizing numerous transcription terminators and then finding the best match for the EBS/IBS interactions within a short distance downstream.
Alignment-free RNA quantification tools have significantly increased the speed of RNA-seq analysis. However, it is unclear whether these state-of-the-art RNA-seq analysis pipelines can quantify small RNAs as accurately as they do with long RNAs in the context of total RNA quantification.
We comprehensively tested and compared four RNA-seq pipelines for accuracy of gene quantification and fold-change estimation. We used a novel total RNA benchmarking dataset in which small non-coding RNAs are highly represented along with other long RNAs. The four RNA-seq pipelines consisted of two commonly-used alignment-free pipelines and two variants of alignment-based pipelines. We found that all pipelines showed high accuracy for quantifying the expression of long and highly-abundant genes. However, alignment-free pipelines showed systematically poorer performance in quantifying lowly-abundant and small RNAs.
We have shown that alignment-free and traditional alignment-based quantification methods perform similarly for common gene targets, such as protein-coding genes. However, we have identified a potential pitfall in analyzing and quantifying lowly-expressed genes and small RNAs with alignment-free pipelines, especially when these small RNAs contain biological variations.
Bacterial group II intron reverse transcriptases (RTs) function in both intron mobility and RNA splicing and are evolutionary predecessors of retrotransposon, telomerase, and retroviral RTs as well as the spliceosomal protein Prp8 in eukaryotes. Here we determined a crystal structure of a full-length thermostable group II intron RT in complex with an RNA template-DNA primer duplex and incoming deoxynucleotide triphosphate (dNTP) at 3.0-Å resolution. We find that the binding of template-primer and key aspects of the RT active site are surprisingly different from retroviral RTs but remarkably similar to viral RNA-dependent RNA polymerases. The structure reveals a host of features not seen previously in RTs that may contribute to distinctive biochemical properties of group II intron RTs, and it provides a prototype for many related bacterial and eukaryotic non-LTR retroelement RTs. It also reveals how protein structural features used for reverse transcription evolved to promote the splicing of both group II and spliceosomal introns.
Cellular accumulation of repetitive RNA occurs in several dominantly-inherited genetic disorders. Expanded CUG, CCUG or GGGGCC repeats are expressed in myotonic dystrophy type 1 (DM1), myotonic dystrophy type 2 (DM2), or familial amyotrophic lateral sclerosis, respectively. Expanded repeat RNAs (ER-RNAs) exert a toxic gain-of-function and are prime therapeutic targets in these diseases. However, efforts to quantify ER-RNA levels or monitor knockdown are confounded by stable structure and heterogeneity of the ER-RNA tract and background signal from non-expanded repeats. Here, we used a thermostable group II intron reverse transcriptase (TGIRT-III) to convert ER-RNA to cDNA, followed by quantification on slot blots. We found that TGIRT-III was capable of reverse transcription (RTn) on enzymatically synthesized ER-RNAs. By using conditions that limit cDNA synthesis from off-target sequences, we observed hybridization signals on cDNA slot blots from DM1 and DM2 muscle samples but not from healthy controls. In transgenic mouse models of DM1 the cDNA slot blots accurately reflected the differences of ER-RNA expression across different transgenic lines, and showed therapeutic reductions in skeletal and cardiac muscle, accompanied by improvements of the DM1-associated splicing defects. TGIRT-III was also active on CCCCGG- and GGGGCC-repeats, suggesting that ER-RNA analysis is feasible for several repeat expansion disorders.
RNA is secreted from cells enclosed within extracellular vesicles (EVs). Defining the RNA composition of EVs is challenging due to their coisolation with contaminants, lack of knowledge of the mechanisms of RNA sorting into EVs, and limitations of conventional RNA-sequencing methods. Here we present our observations using thermostable group II intron reverse transcriptase sequencing (TGIRT-seq) to characterize the RNA extracted from HEK293T cell EVs isolated by flotation gradient ultracentrifugation and from exosomes containing the tetraspanin CD63 further purified from the gradient fractions by immunoisolation. We found that EV-associated transcripts are dominated by full-length, mature transfer RNAs (tRNAs) and other small noncoding RNAs (ncRNAs) encapsulated within vesicles. A substantial proportion of the reads mapping to protein-coding genes, long ncRNAs, and antisense RNAs were due to DNA contamination on the surface of vesicles. Nevertheless, sequences mapping to spliced mRNAs were identified within HEK293T cell EVs and exosomes, among the most abundant being transcripts containing a 5′ terminal oligopyrimidine (5′ TOP) motif. Our results indicate that the RNA-binding protein YBX1, which is required for the sorting of selected miRNAs into exosomes, plays a role in the sorting of highly abundant small ncRNA species, including tRNAs, Y RNAs, and Vault RNAs. Finally, we obtained evidence for an EV-specific tRNA modification, perhaps indicating a role for posttranscriptional modification in the sorting of some RNA species into EVs. Our results suggest that EVs and exosomes could play a role in the purging and intercellular transfer of excess free RNAs, including full-length tRNAs and other small ncRNAs.
Cas1 integrase is the key enzyme of the clustered regularly interspaced short palindromic repeat (CRISPR)-Cas adaptation module that mediates acquisition of spacers derived from foreign DNA by CRISPR arrays. In diverse bacteria, the cas1 gene is fused (or adjacent) to a gene encoding a reverse transcriptase (RT) related to group II intron RTs. An RT-Cas1 fusion protein has been recently shown to enable acquisition of CRISPR spacers from RNA. Phylogenetic analysis of the CRISPR-associated RTs demonstrates monophyly of the RT-Cas1 fusion, and coevolution of the RT and Cas1 domains. Nearly all such RTs are present within type III CRISPR-Cas loci, but their phylogeny does not parallel the CRISPR-Cas type classification, indicating that RT-Cas1 is an autonomous functional module that is disseminated by horizontal gene transfer and can function with diverse type III systems. To compare the sequence pools sampled by RT-Cas1-associated and RT-lacking CRISPR-Cas systems, we obtained samples of a commercially grown cyanobacterium, Arthrospira platensis. Sequencing of the CRISPR arrays uncovered a highly diverse population of spacers. Spacer diversity was particularly striking for the RT-Cas1-containing type III-B system, where no saturation was evident even with millions of sequences analyzed. In contrast, analysis of the RT-lacking type III-D system yielded a highly diverse pool but reached a point where fewer novel spacers were recovered as sequencing depth was increased. Matches could be identified for a small fraction of the non-RT-Cas1-associated spacers, and for only a single RT-Cas1-associated spacer. Thus, the principal source(s) of the spacers, particularly the hypervariable spacer repertoire of the RT-associated arrays, remains unknown.
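Whether spacer recovery saturates with sequencing depth is the kind of question a rarefaction curve addresses. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `rarefaction_curve`, the input (a list with one spacer identifier per sequenced read), and the fixed random seed are hypothetical choices made for illustration.

```python
import random


def rarefaction_curve(spacer_reads, depths, seed=0):
    """Count distinct spacers seen in the first `depth` reads of one
    random ordering of `spacer_reads`, for each depth in `depths`.

    Using prefixes of a single shuffle keeps the curve non-decreasing
    when `depths` is given in increasing order.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(spacer_reads)
    rng.shuffle(shuffled)
    curve = []
    for depth in depths:
        if depth < 0 or depth > len(shuffled):
            raise ValueError("depth must be between 0 and the number of reads")
        curve.append((depth, len(set(shuffled[:depth]))))
    return curve


# Toy, made-up reads: three distinct spacers across six reads.
toy_reads = ["s1", "s2", "s1", "s3", "s2", "s1"]
print(rarefaction_curve(toy_reads, [2, 4, 6]))
```

A curve that keeps rising at the largest depth suggests the spacer pool has not been exhausted, whereas a curve that flattens suggests that fewer novel spacers remain to be found.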
High-throughput single-stranded DNA sequencing (ssDNA-seq) of cell-free DNA from plasma and other bodily fluids is a powerful method for non-invasive prenatal testing, and diagnosis of cancers and other diseases. Here, we developed a facile ssDNA-seq method, which exploits a novel template-switching activity of thermostable group II intron reverse transcriptases (TGIRTs) for DNA-seq library construction. This activity enables TGIRT enzymes to initiate DNA synthesis directly at the 3′ end of a DNA strand while simultaneously attaching a DNA-seq adapter without end repair, tailing, or ligation. Initial experiments using this method to sequence E. coli genomic DNA showed that the TGIRT enzyme has surprisingly robust DNA polymerase activity. Further experiments showed that TGIRT-seq of plasma DNA from a healthy individual enables analysis of nucleosome positioning, transcription factor-binding sites, DNA methylation sites, and tissues-of-origin comparably to established methods, but with a simpler workflow that captures precise DNA ends.
Coupling of structure-specific in vivo chemical modification to next-generation sequencing is transforming RNA secondary structure studies in living cells. The dominant strategy for detecting in vivo chemical modifications uses reverse transcriptase truncation products, which introduce biases and necessitate population-average assessments of RNA structure. Here we present dimethyl sulfate (DMS) mutational profiling with sequencing (DMS-MaPseq), which encodes DMS modifications as mismatches using a thermostable group II intron reverse transcriptase. DMS-MaPseq yields a high signal-to-noise ratio, can report multiple structural features per molecule, and allows both genome-wide studies and focused in vivo investigations of even low-abundance RNAs. We apply DMS-MaPseq for the first analysis of RNA structure within an animal tissue and to identify a functional structure involved in noncanonical translation initiation. Additionally, we use DMS-MaPseq to compare the in vivo structure of pre-mRNAs with their mature isoforms. These applications illustrate DMS-MaPseq's capacity to dramatically expand in vivo analysis of RNA structure.
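Because mutational profiling records modifications as mismatches in the cDNA, a per-position mismatch rate is the basic quantity read out from such data. The following is a minimal sketch under strong simplifying assumptions, not the published DMS-MaPseq analysis: reads are taken to be ungapped, already aligned, and the same length as the reference, and the function name `mismatch_rates` and the toy sequences are hypothetical.

```python
def mismatch_rates(reference, reads):
    """Return the fraction of reads that mismatch the reference at each
    position. Bases called as 'N' are skipped as uninformative.

    Assumes every read is ungapped and spans the whole reference
    (a simplification made for this illustration).
    """
    covered = [0] * len(reference)
    mismatched = [0] * len(reference)
    for read in reads:
        if len(read) != len(reference):
            raise ValueError("each read must be the same length as the reference")
        for i, (ref_base, read_base) in enumerate(zip(reference, read)):
            if read_base == "N":
                continue
            covered[i] += 1
            if read_base != ref_base:
                mismatched[i] += 1
    # Positions with no informative coverage get a rate of 0.0.
    return [m / c if c else 0.0 for m, c in zip(mismatched, covered)]


# Toy, made-up example.
print(mismatch_rates("ACGA", ["ACGA", "TCGA", "TCGC", "ANGA"]))
```

In a real analysis the rates from a DMS-treated sample would be compared with those from an untreated control before interpreting them in terms of structure.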
RNA silencing is a conserved eukaryotic gene expression regulatory mechanism mediated by small RNAs. In Caenorhabditis elegans, the accumulation of a distinct class of siRNAs synthesized by an RNA-dependent RNA polymerase (RdRP) requires the PIR-1 phosphatase. However, the function of PIR-1 in RNAi has remained unclear. Since mammals lack an analogous siRNA biogenesis pathway, an RNA silencing role for the mammalian PIR-1 homolog (dual specificity phosphatase 11 [DUSP11]) was unexpected. Here, we show that the RNA triphosphatase activity of DUSP11 promotes the RNA silencing activity of viral microRNAs (miRNAs) derived from RNA polymerase III (RNAP III) transcribed precursors. Our results demonstrate that DUSP11 converts the 5' triphosphate of miRNA precursors to a 5' monophosphate, promoting loading of derivative 5p miRNAs into Argonaute proteins via a Dicer-coupled 5' monophosphate-dependent strand selection mechanism. This mechanistic insight supports a likely shared function for PIR-1 in C. elegans. Furthermore, we show that DUSP11 modulates the 5' end phosphate group and/or steady-state level of several host RNAP III transcripts, including vault RNAs and Alu transcripts. This study shows that steady-state levels of select noncoding RNAs are regulated by DUSP11 and defines a previously unknown portal for small RNA-mediated silencing in mammals, revealing that DUSP11-dependent RNA silencing activities are shared among diverse metazoans.
The mitochondrial tyrosyl-tRNA synthetases (mtTyrRSs) of Pezizomycotina fungi, a subphylum that includes many pathogenic species, are bifunctional proteins that both charge mitochondrial tRNA(Tyr) and act as splicing cofactors for autocatalytic group I introns. Previous studies showed that one of these proteins, Neurospora crassa CYT-18, binds group I introns by using both its N-terminal catalytic and C-terminal anticodon binding domains and that the catalytic domain uses a newly evolved group I intron binding surface that includes an N-terminal extension and two small insertions (insertions 1 and 2) with distinctive features not found in non-splicing mtTyrRSs. To explore how this RNA binding surface diverged to accommodate different group I introns in other Pezizomycotina fungi, we determined x-ray crystal structures of C-terminally truncated Aspergillus nidulans and Coccidioides posadasii mtTyrRSs. Comparisons with previous N. crassa CYT-18 structures and a structural model of the Aspergillus fumigatus mtTyrRS showed that the overall topology of the group I intron binding surface is conserved but with variations in key intron binding regions, particularly the Pezizomycotina-specific insertions. These insertions, which arose by expansion of flexible termini or internal loops, show greater variation in structure and amino acids potentially involved in group I intron binding than do neighboring protein core regions, which also function in intron binding but may be more constrained to preserve mtTyrRS activity. Our results suggest a structural basis for the intron specificity of different Pezizomycotina mtTyrRSs, highlight flexible terminal and loop regions as major sites for enzyme diversification, and identify targets for therapeutic intervention by disrupting an essential RNA-protein interaction in pathogenic fungi.
Next-generation RNA sequencing (RNA-seq) has revolutionized our ability to analyze transcriptomes. Current RNA-seq methods are highly reproducible, but each has biases resulting from different modes of RNA sample preparation, reverse transcription, and adapter addition, leading to variability between methods. Moreover, the transcriptome cannot be profiled comprehensively because highly structured RNAs, such as tRNAs and snoRNAs, are refractory to conventional RNA-seq methods. Recently, we developed a new method for strand-specific RNA-seq using thermostable group II intron reverse transcriptases (TGIRTs). TGIRT enzymes have higher processivity and fidelity than conventional retroviral reverse transcriptases plus a novel template-switching activity that enables RNA-seq adapter addition during cDNA synthesis without using RNA ligase. Here, we obtained TGIRT-seq data sets for well-characterized human RNA reference samples and compared them to previous data sets obtained for these RNAs by the Illumina TruSeq v2 and v3 methods. We find that TGIRT-seq recapitulates the relative abundance of human transcripts and RNA spike-ins in ribo-depleted, fragmented RNA samples comparably to non-strand-specific TruSeq v2 and better than strand-specific TruSeq v3. Moreover, TGIRT-seq is more strand specific than TruSeq v3 and eliminates sampling biases from random hexamer priming, which are inherent to TruSeq. The TGIRT-seq data sets also show more uniform 5' to 3' gene coverage and identify more splice junctions, particularly near the 5' ends of mRNAs, than do the TruSeq data sets. Finally, TGIRT-seq enables the simultaneous profiling of mRNAs and lncRNAs in the same RNA-seq experiment as structured small ncRNAs, including tRNAs, which are essentially absent with TruSeq.
Next-generation RNA-sequencing (RNA-seq) has revolutionized transcriptome profiling, gene expression analysis, and RNA-based diagnostics. Here, we developed a new RNA-seq method that exploits thermostable group II intron reverse transcriptases (TGIRTs) and used it to profile human plasma RNAs. TGIRTs have higher thermostability, processivity, and fidelity than conventional reverse transcriptases, plus a novel template-switching activity that can efficiently attach RNA-seq adapters to target RNA sequences without RNA ligation. The new TGIRT-seq method enabled construction of RNA-seq libraries from <1 ng of plasma RNA in <5 h. TGIRT-seq of RNA in 1-mL plasma samples from a healthy individual revealed RNA fragments mapping to a diverse population of protein-coding genes and long ncRNAs, which are enriched in intron and antisense sequences, as well as nearly all known classes of small ncRNAs, some of which have never before been seen in plasma. Surprisingly, many of the small ncRNA species were present as full-length transcripts, suggesting that they are protected from plasma RNases in ribonucleoprotein (RNP) complexes and/or exosomes. This TGIRT-seq method is readily adaptable for profiling of whole-cell RNAs, exosomal RNAs, and miRNAs, and for related procedures, such as HITS-CLIP and ribosome profiling.