Background

Accurate gene expression analysis at the transcript level is essential to understand all aspects of plant growth and development and their responses to abiotic and biotic stress. The magnitude and dynamics of transcriptional and post-transcriptional re-programming of the transcriptome provide insights into the cellular complexity of responses to external and internal cues. This complexity can now be readily explored using high-throughput RNA sequencing (RNA-seq) technologies and vastly improved analytical methods and software programs. The ability to quantify the expression levels of individual transcripts from Illumina short-read RNA-seq data was revolutionized by the development of rapid and accurate non-alignment programs, Kallisto and Salmon [1, 2]. However, Kallisto and Salmon require a reference transcriptome for accurate transcript quantification and the power of such analyses greatly depends on the quality and comprehensiveness of the reference transcriptome being used.

RNA sequencing using long-read single-molecule sequencing technologies, namely Pacific Biosciences (PacBio) and Oxford Nanopore sequencing, offers improved integrity of transcript structures. Single-molecule sequencing has the advantage of being able to identify transcription start and end (polyadenylation) sites (TSS and TES, respectively), alternative splicing (AS), alternative polyadenylation (APA) and the correct combinations of different TSS, TES, and splice junctions (SJs). However, sequencing errors are common in single-molecule sequencing, and mis-mapping of reads to the genome significantly increases false splice sites and affects open reading frames of transcripts [3]. Previous work on sequence alignment accuracy found that the main source of error for global sequence alignment was the misplacement of gaps, a phenomenon also called “edge wander” [4]. Misplacement of gaps is strongly affected by sequencing errors. Introns can be considered as “gaps” when the single-molecule long reads are mapped to the genome and can generate many false splice junctions [59]. For example, alignment of high error-containing long reads from a particular locus often disagrees with one another (particularly around splice sites) [6] and high error rates result in a high proportion (27%) of misplaced splice junctions [5]. Strategies to overcome the effects of sequencing errors in the long reads include self- or hybrid correction methods. Self-correction utilizes the raw signal and consensus-based calls to reduce errors while hybrid correction exploits Illumina short reads to correct errors in the long reads [1013]. However, current error correction tools tend to trim or split long reads when lacking local short-read support, over-correct (introduce new, false splice junctions) when mapping to the wrong locations and lose isoforms with low expression [5, 7]. In addition, a considerable number of reads representing fragments of mature mRNAs, likely due to incomplete cDNA synthesis or mRNA degradation, compromise the accurate determination of transcription start and end (poly(A)) sites. While these issues are not generally appreciated, they reduce the overall precision of transcript quantification and downstream analysis of differential expression, AS, APA, and TSS and TES usage.

Iso-seq single-molecule sequencing has been applied to a wide selection of crop plants (e.g., maize, wheat, sorghum, coffee, tea, sugarcane, rice, amaranth, and grape), economically important plants for feed or products (e.g., switchgrass, Bermuda grass, perennial ryegrass, pine, rubber, red clover), wild plant species (e.g., wild strawberry), plants of botanical interest (e.g., Pitcher plants—Nepenthes spp.), and medicinal plants (e.g., Zanthoxylum, safflower, Salvia) [1438]. The majority of the above applications of PacBio sequencing investigated transcriptome diversity and complexity and determined transcription start sites, AS events, and APA sites. However, significant issues surrounding the accuracy of SJs, TSS, and TES identification suggest that many of the above transcriptome studies would benefit from improved methods of transcript structure determination. Accurate and well-curated transcripts also play an important role in improving genome annotations and the identification of novel genes and, particularly, long non-coding RNAs.

In this paper, we report the construction of a new, comprehensive Arabidopsis transcriptome, AtRTD3, based on a wide range of Arabidopsis tissues and treatments. AtRTD3 contains over 169 k transcripts, 78% of which are derived from Iso-seq and have accurately defined SJs, TSS, and TES. It improves the precision of analysis of RNA-seq data for differential gene and transcript expression and differential alternative splicing and now allows analysis of differential TSS and TES usage. We used a new pipeline based on TAMA [7] to analyze the Iso-seq data and developed novel methods to address the impact of sequencing errors and incomplete transcripts. We developed (1) a splice junction-centric approach that allows the identification of high-confidence SJs and (2) a probabilistic 5′ and 3′ end determination method that effectively removes transcript fragments and identifies dominant transcript start and end sites. They allow accurate determination of SJs, TSS, and TES directly from the Iso-seq data and remove the requirement for hybrid error correction or parallel experimental approaches for detecting TSS and TES such as CAGE-seq or poly(A)-seq, respectively. The defined sets of high-confidence SJs, TSS, and TES were used to generate an Iso-seq-based transcriptome (AtIso) consisting of transcripts with accurately defined 5′ and 3′ ends and SJs and the combination of AS events with specific TSS and TES. The high-confidence full-length transcripts in AtIso covered ca. two-thirds of genes in Arabidopsis and confirmed many of the short-read assembled transcripts while resolving assembly artifacts present in AtRTD2 [39]. Around one-third of genes had very low or no Iso-seq coverage. Short-read assembly generates highly accurate SJs but little information on 5′ and 3′ ends. Therefore, AtIso was merged with short-read assemblies, such as AtRTD2 [39] and Araport11 [40] to form AtRTD3, giving preference to Iso-seq transcripts to capture high-confidence SJs, TSSs, and TESs and integrating only those transcripts from AtRTD2 and Araport11 with novel SJs or loci. The resulting AtRTD3 transcriptome contains 40,932 genes and 169,503 transcripts with ca. 78% of transcripts having Iso-seq support. The main function of AtRTD3 is to enable accurate differential gene expression and differential alternative splicing analysis of RNA-seq experiments designed to address a wide range of biological questions. To provide accurate quantification of genes and transcripts, the RTD must be as comprehensive as possible, and the constituent transcripts must be as accurate as possible. AtRTD3 represents a significant improvement over existing Arabidopsis transcriptomes as demonstrated by its improved transcript quantification accuracy and transcript expression profiling over AtRTD2 and Araport11 and by the identification of cold-induced differential TSS and TES usage from analysis of time-series data.

Results

Single-molecule Iso-sequencing of diverse Arabidopsis plant samples

PacBio Iso-seq was performed on total RNA extracted from nineteen samples from different Arabidopsis Col-0 organs, developmental stages, abiotic stress conditions, infection with different pathogens and RNA degradation mutants to capture a broad diversity of transcripts (Additional File 1: Table S1). PacBio non-size selected Iso-seq libraries were made for all nineteen samples using a cap enrichment protocol (Teloprime, Lexogen). In addition, Teloprime v2 (Lexogen) libraries were constructed for six of the above RNA samples and Clontech (Takara Bio) libraries for two of the above samples. Each of the 27 libraries was sequenced on a separate SMRT cell on a PacBio Sequel machine with a 10-h (v3) movie time. The PacBio raw reads were processed using the PacBio IsoSeq3 pipeline to generate circular consensus sequences (CCS) and full-length non-chimeric (FLNC) reads without the clustering and polishing steps, and FLNCs were mapped to the reference genome (TAIR10) (Fig. 1). The numbers of reads, FLNCs and mapped FLNCs along with statistics are shown in Additional File 1: Table S2. The 27 libraries generated 13.7 million Iso-seq reads in total. The total number of CCS was 8.7 M with an average of 322 k CCS per library. The total number of FLNCs generated using lima+refine (see “Materials and methods”) was 7.77 M with an average of 288 k per library. About 7.36 M of the FLNCs mapped onto the Arabidopsis genome, generating 142.9 k transcripts and 14.3 k genes on average per library. We then merged the transcripts generated from the 27 libraries using TAMA merge, where unique transcripts including those with only a single-nucleotide difference at 5′ and 3′ UTR were kept (Fig. 1). The merged transcriptome assembly consisted of 33,154 genes and 2,239,270 transcripts.

Fig. 1
figure 1

Workflow of analysis of PacBio Iso-sequencing. A Raw reads are analyzed using the PacBio Iso-seq 3 pipeline to generate FLNCs which are mapped to the genome (blue boxes). B Mapped FLNCs are collapsed and merged using TAMA to generate transcripts (pink boxes). C Transcripts are quality controlled using datasets of high-confidence (HC) splice junctions (SJs) and transcript start and end sites (TSS/TES). Transcripts with unsupported splice junctions where reads contain mismatches within ±10 nt of an SJ are removed. Transcripts with both high-confidence TSS and TES (determined by binomial probability for highly expressed genes and by end support with > 2 reads for low expressed genes) are retained as HC transcripts. The remaining transcripts which have partial or no TSS and/or TES support were removed unless they overlapped with annotated gene loci. These transcripts, from genes with low coverage by Iso-seq, were combined with the HC transcripts to form AtIso (Arabidopsis Iso-seq based transcriptome)

Sequencing mismatches around splice junctions (SJs) distinguish high- and low-confidence SJs

The challenge with Iso-seq-derived transcripts is to accurately define SJs, TSSs, and TESs. As the merged assembly contained tens of thousands of false SJs (see below), transcripts containing these SJs were identified and removed before defining TSS and TES. Based on the hypothesis that sequence errors in the Iso-seq reads around SJs promote “edge wander” [4] resulting in false SJs, we used TAMA Collapse to extract the mapping information of 30-nt up- and downstream of each SJs from the uncorrected reads from the 27 Iso-seq libraries. (Additional File 2: Fig. S1). We compared the resulting Iso-seq SJs to those of AtRTD2. In total, 124,328 SJs were shared between Iso-seq and AtRTD2 transcripts and 110,992 were unique to the Iso-seq transcripts (Fig. 2A). We then extracted the mismatch profiles for the shared SJs and for those unique to Iso-seq transcripts and determined the number and percentage of mismatches in each position in the 30-nt up- and downstream of the SJ (Additional File 1: Tables S3A and S3B). Thus, the SJs in the Iso-seq transcripts were divided into two sets: (1) a high-confidence set of 124,328 SJs (above) that were also present in AtRTD2 transcripts (extensive quality control measures were used to remove false SJs during the construction of AtRTD2 from short reads—Zhang et al. [39]) and (2) a low-confidence set of 110,992 SJs unique to Iso-seq transcripts (above) that includes novel, bona fide junctions as well as incorrect misplaced SJs. To assess the different characteristics between the two sets of SJs, we calculated position weight matrix (PWM) scores for 5′ and 3′ splice site consensus sequences for each intron (Additional File 1: Table S4). The average PWM scores of the high-confidence SJs (5′ splice site 69.91, 3′ splice site 67.75) were significantly higher than the average of the low-confidence set (5′ 62.79; 3′ 62.67) (Fig. 2B). Taking the threshold PWM of 65 as the criteria for a good quality splice site [39], 79.4% of high-confidence SJs had PWM scores at both 5′ and 3′ splice sites of  65 with only 20.6% having at least one PWM score lower than the threshold (5′ 3.50% and 3′ 17.64%). In contrast, 79.17% of the SJs in the low-confidence set have at least one PWM score lower than the threshold at either the 5′ or 3′ splice site (5′ 52.24% and 3′ 59.07%). Thus, the high-confidence SJs have higher splice site consensus sequence quality characteristics than the low-confidence SJs.

Fig. 2
figure 2

Impact of mismatches around splice junctions on the accuracy of their determination. A Splice junctions (SJs) shared by AtRTD2 and Iso-seq (LDE_30; sjt_30) and unique to each. B Position weight matrix (PWM) scores for splice sites unique to Iso-seq transcripts and shared with AtRTD2. PWM scores for 5′ and 3′ splice site sequences from SJs shared between AtRTD2 and Iso-seq transcripts (high confidence), are significantly higher (t-test, p < 2.26e−16) than those unique to Iso-seq (low confidence). C, D Distribution of the number of mismatches in each position 30 nt upstream (C) and 30 nt downstream (D) of SJs unique to Iso-seq (low confidence) and shared with AtRTD2 (high confidence). See Additional File 1: Tables S3A,B). E Filtering of SJs—the graph shows the number of SJs remaining (expressed as a percentage) after the cumulative removal of SJs with mismatches in the first n positions (1, 2, 3, etc.) flanking SJs. See Additional File 1: Tables S5A,B)

To examine the relationship between the presence of sequencing mismatches in reads around the SJs and the quality of the SJs, we selected the Iso-seq read with the smallest number of mismatches in the 60-nt region around each SJ for the analysis. The mismatch rate in each position in the SJs shared with AtRTD2 (high-confidence set) was in the range of 0.008 to 0.08%. In contrast, the mismatch rate in each position in the low-confidence SJs unique to Iso-seq were up to 100-fold greater and ranged from 1.02 to 4.12% (sequence upstream of SJ) and 0.97 to 7.58% (downstream of SJ) in, for the most part, descending order with distance from the splice junction (Additional File 1: Table S3A and S3B). Plotting the distributions of the mismatches at each position upstream (Fig. 2C) and downstream (Fig. 2D) clearly showed a high number of mismatches in the vicinity of SJs unique to Iso-seq (low confidence) while the SJs shared with AtRTD2 (high confidence) had far fewer mismatches with a more uniform distribution (Fig. 2C, D; Table S3A and B).

The effect of having sequencing mismatches in Iso-seq reads in the region of a SJ is illustrated by the number of SJs that would remain (recall) if SJs with a mismatch in any of the positions were removed. For example, removing those SJs with a mismatch in positions 1–10 on either side of a SJ would remove only 711 SJs from the shared SJs (high confidence) leaving 99.43% of SJs (Fig. 2E; Additional File 1: Table S5A) but 29,606 SJs of the SJs unique to Iso-seq (low confidence), leaving 73.32% of the SJs (Fig. 2E; Additional File 1: Table S5B). Thus, sequencing mismatches in the vicinity of SJs are strongly associated with new, false SJs which carry over into transcripts. Filtering the SJs by removing those with mismatches around the SJs has a significant impact on the low-confidence SJs but a very limited effect on the high-confidence SJs. Thus, examining mismatches around the SJs is an effective strategy to distinguish high- and low-quality SJs and identify false SJs.

Splice junction-centric analysis for accurate splice junction determination

To apply the above observations to overcome the problem of false splice junctions being generated due to mismatches in Iso-seq reads in the vicinity of SJs, we developed a method to identify and retain high-confidence SJs. The original TAMA collapse [7] removes reads with defined mismatches around the SJs. There are two issues with this approach: (1) when an Iso-seq read with multiple SJs is removed due to erroneous mapping of one or more SJs, other correct SJs supported by that read will be discarded at the same time; and (2) as Iso-seq sequencing errors are distributed randomly, some reads with errors around SJs could still be correct and be rescued by other reads that mapped perfectly to the region. We therefore modified the approach to keep all high-confidence SJs irrespective of whether low-quality SJs were present in the rest of the read. In so doing, we constructed a high-confidence set of SJs where each SJ has support from at least one Iso-seq read with zero mismatches in positions ±10 nt from SJs. Using this set of SJs, reads with correctly mapped SJs but mismatches around SJs are still retained, contributing to identification of SJs in the final merged transcript assemblies.

In the transcript set from the 27 libraries, there were 235,320 non-redundant SJs. We first removed SJs with non-canonical motifs leaving 175,827 SJs. Then, we selected the SJs that had support from at least one read with zero mismatches to the genome in the 10 nt region on each side of the SJ. This reduced the number of false SJs caused by the combined effects of mis-mapping of the introns and sequencing errors around SJs, leaving 162,888 SJs. Thus, 71,726 (64.62%) SJs unique to Iso-seq (30.5% of all SJs in Iso-seq) were removed due to lack of experimental evidence for a high-confidence SJ. For comparison, only 706 SJs that are shared between Iso-seq and AtRTD2 (0.46% of all SJs in AtRTD2) were removed using the same filtering parameters. Thus, the SJ-centric approach makes the best use of local information around the SJs of long reads to define the set of high-confidence SJs.

A stratified and probabilistic approach to determine the TSS and TES sites

The combination of Teloprime 5′ capture followed by Iso-seq sequencing from poly(A) tails should, in principle, produce full-length mRNA sequences containing authentic 5′-end/TSS and 3′-end/TES. However, a number of factors affect accurate TSS and TES identification: (1) mRNAs undergo degradation (in vivo or during RNA manipulation) generating truncated transcript fragments. Teloprime 5′ capture is not 100% efficient such that Iso-seq reads from 5′-degraded transcripts are still generated. Similarly, 3′ end degradation and off-priming, where the PCR oligo-dT primer amplifies from poly(A) sequences within the transcript instead of the poly(A) tail, generate 3′ truncated transcript fragments. Thus, reads from transcripts with different degrees of degradation generate multiple false TSSs/TESs; (2) TSS/TES are usually stochastic and not limited to a single-nucleotide location but rather are distributed around a dominant site [41]; and (3) the number of Iso-seq reads varies greatly across a large dynamic range. Consequently, highly expressed genes may contain thousands of individual transcripts including substantial numbers of degradation products. In contrast, for genes with low levels of expression and a limited number of reads or no read coverage, it is difficult to apply statistical inference to determine whether read start/end points are TSSs/TESs. The challenges in accurately identifying TSS/TES for genes with high- and low-read abundance are therefore very different. For highly sequenced genes, the major task is to reduce false TSS/TES from transcript fragments and identify dominant sites. For genes with few reads, the task is to get sufficient experimental evidence to support TSS/TES identification. We have therefore developed and applied two different approaches to end determination depending on the read/transcript abundance. We assumed that for highly sequenced genes, authentic TSS/TES sites would tend to be sequenced more often while the ends from degraded mRNA products would occur randomly. We, therefore, used the binomial function to estimate the probability of having a certain number of Iso-seq read ends at any position at random and used these probabilities to identify positions with non-random (i.e., enriched) ends that represent authentic start or end sites (Additional File 2: Fig. S2A and B). For genes with few reads, we compared start and end sites of different reads to identify similar ends as support for potential TSS/TES (see below; Additional File 2: Fig. S2C).

Identification of significant read start and end genomic locations

For TSS determination, only the Teloprime captured reads were used as the Clontech libraries are more likely to contain truncated fragments with missing TSSs [42]. The exact genomic coordinate of the start of each read (read start genome location-RSGL) was identified giving a total of a total of 616,593 RSGLs. By applying the binomial probability method (Additional File 2: Fig. S2A), 61,014 significant RSGLs enriched for start locations were detected from 17,098 genes. These 17,098 genes tended to be highly sequenced with read numbers ranging from 7 to 48,110 and a median of 110 reads. They accounted for 550,022 of the total RSGLs (89.3%) from which the 61,014 significant non-random RSGLs were identified, an approximately 10-fold reduction of the average RSGL number per gene from 32.17 to 3.57. Thus, the binomial probability method reduces the overall number of RSGLs into a smaller number of high-confidence RSGLs. For the remaining 15,858 genes, no significant RSGLs were detected. These genes had relatively few reads with a median of 2 reads per gene and 80% of genes having fewer than 7 reads. For these genes, we compared the start positions of reads from each gene and required at least two Iso-seq reads with 5′ ends within a sliding window of 11 nt (5 nt on each side) to call a supported RSGL (Additional File 2: Fig. S2C). By this method, the 66,571 remaining RSGLs (from 15,858 low-abundance genes) generated 25,930 supported RSGLs from 7028 genes. Thus, we have defined 61,014 and 25,930 TSSs from with high and low numbers of read genes, respectively.

Before enrichment, a total of 723,903 read end genomic locations (REGLs) were identified. We removed 11,703 reads where 3′ ends were immediately followed by poly(A) sequences in the genome sequence and were likely to be a result of off-priming, leaving a total of 712,200 REGLs. We then applied the binomial distribution method to detect non-random REGLs, as described above for RSGLs. For highly sequenced genes, 84,043 significant REGLs (Additional File 2: Fig. S2B) from 16,728 genes were identified with read abundance per gene varying from 7 to 49,917 and a median of 128. These highly sequenced genes contained 669,642 (94.02%) of the total REGLs and showed very variable end sites. The binomial distribution probability method reduced the average number of REGLs per gene from 40.03 to 5.02. The remaining 13,440 genes had fewer reads with a median of 1 read, and 80% of these genes had fewer than 5 reads. At least two Iso-seq long reads with similar 3′-ends within a sliding window of 11 nt (5 nt on each side) were required to call a supported REGL (Additional File 2: Fig. S2C). On this basis, from the 42,558 REGLs from the 13,440 genes with few reads, 21,664 supported REGLs from 5824 genes were identified. Thus, we have defined 84,043 significant REGLs and 21,664 supported REGLs from genes with high and low numbers of reads, respectively. Finally, 8830 and 7616 genes did not have significant or supported RSGLs or REGLs, respectively, and represented genes with one read or with very few reads with varying start or end locations differing by more than 5 nt.

Validation of significant TSSs and TESs

A transcription start site dataset for Arabidopsis genes at nucleotide resolution was generated previously using paired-end analysis of TSSs (PEAT) [41]. In their study, using a pooled Col-0 root sample, 79,706 mapped and annotated PEAT tag clusters (groups of similar TSSs) were identified, and quality filtering generated 9326 strong tag clusters from protein-coding genes which had groups of TSS locations supported by at least 100 reads. The information for each tag cluster included the start, end, strand and the mode, which is the location within the cluster where the greatest number of 5′ ends was mapped [41].

We compared the significant 61,014 RSGLs from the highly sequenced genes here with the PEAT tag clusters and found that 50.8% were located within 8445 of the 9326 strong tag clusters (90.5%). Thus, the significant RSGLs from the highly sequenced genes showed substantial concentration and overlap with the published set of strong tag clusters. We also compared the significant RSGLs with the mode (genomic locations with the highest number of reads within that tag cluster) of strong tag clusters and found that 6563 (70.4%) strong tag cluster modes co-located with the significant RSGLs with no more than a 1-nt difference (Additional File 2: Fig. S3A). Significant RSGLs were also identified in another 9010 genes not detected in Morton et al. [41]. This is likely due to the much wider range of tissues and treatments used here and differences in gene coverage between the Iso-seq and PEAT analyses. We have also compared our RSGLs to a recent study [43] that carried out genome-wide TSS mapping using 5′ CAP sequencing. In this study, 96,232 TSS tag clusters were detected in 21,359 genes in wild type plants and mutant lines of the FACT (FAcilitates Chromatin Transcription) complex. We found that 55,737 (91.35%) of our significant RSGLs located within the TSS tag clusters of Nielsen et al. [43], covering 16,353 genes. The correspondence of our data with both the above studies shows the high accuracy of the RSGLs detected using our novel method of transcript 5′ end determination.

Arabidopsis polyadenylation sites have been previously identified through direct RNA sequencing of seedling RNA, which found 49,916 cleavage and polyadenylation site (CS) peaks supported by > 9 raw reads from 14,311 genes [44]. We compared the 84,043 significant REGLs with the CS peaks and found that (1) 45,931 (92%) CS peaks from 13,443 genes co-located with significant REGLs within a 50-nt window and 24,927 (49.93%) CS peaks co-located with significant REGLs at the same genomic location (< 1 nt difference) (Additional File 2: Fig. S3B,C). The significant REGLs identified an additional 12,305 TES sites in 5531 genes, including 3663 genes for which no CS peaks were reported [44]. The increased diversity of TES identified from our Iso-seq data are again likely due to the wider range of tissues and treatments used for RNA sequencing.

Significant RSGLs/REGLs show enrichment in motifs related to TSSs and TESs

To further validate the TSSs in the significant 61,014 RSGLs, we looked for common transcription motifs (e.g., TATA box, lnitiator, and Y-patch) in the region of the TSSs (+ 500 to −500 bp) and the Kozak translation start site motif downstream of the TSS, and compared these to the raw 79,706 TSS tag cluster peaks from Morton et al. [41]. The TATA box is a T/A-rich motif ca. 25-35 bp upstream of highly expressed genes that determine expression levels [41, 45, 46] (Additional File 1: Table S6). A sharp peak was observed upstream of the TSSs for both RSGLs and TSS peak clusters consistent with the expected position for a TATA box (Fig. 3A). Thus, there is a good corroboration of our computational derived TSS and the experimentally defined TSS. Despite fewer significant RSGL sites being investigated, we found the number of TSSs with upstream TATA motifs in the RSGLs almost doubled that seen with the PEAT tag cluster peaks (from 3603 sites to 6976 sites) (Fig. 3A). A proportion test shows that the TATA box motif was significantly enriched in RSGLs compared to the TSS cluster peaks in Morton et al. (p < 2.2e−16). The Initiator (Inr) element is pyrimidine-rich, overlaps the TSS site and is important for transcriptional activation [46] while the Y-patch pyrimidine-rich motif, found upstream of TSSs, is unique to plants and found in more than 50% of annotated rice genes [47] (Additional File 1: Table S6). Enrichment of both motifs around the TSS was observed again, with 2236 and 11,477 instances, respectively, in the significant RSGL set and 1208 and 6067 instances, respectively, in the Morton et al. data (both proportion tests p < 2.2e−16) (Fig. 3B, C). Finally, the Kozak consensus translation start sequence is downstream of the TSS and contains the translation start AUG codon [48]. Significant enrichment of Kozak sequences was seen downstream of the TSSs for the significant RSGLs with 316 instances over 116 instances in the Morton data (proportion tests p < 2.2e−16) (Fig. 3D).

Fig. 3
figure 3

Enrichment of sequence motifs associated with TSS and TES sites. A–D TSS sites: A TATA box, B Initiator (Inr), C Y-patch, D Kozak translation start site consensus motif, E,F TES sites: E CFlm binding site and F PAS. Lines indicate number of motifs found in relation to start and end sites from Iso-seq (blue), Morton et al. [41] A–D, and Sherstnev et al. [44] E,F (red); random control (gray)

To further validate the REGLs, we searched the genomic sequences around the TESs (−500 to + 500 bp) of the 84,043 significant REGLs and 49,916 CS peaks from Sherstnev et al. [44] for conserved cleavage and polyadenylation sequence motifs. The polyadenylation signal (PAS) motif (possessing a canonical AATAAA when the PAS is relatively strong) is required for 3′ end polyadenylation while the CFlm motif is the binding site of cleavage factor Im, an essential 3′ processing factor [49, 50] (Additional File 1: Table S6). The number of matching sequences and their positions showed significant enrichment of CFIm sequences (upstream of the PAS) and the poly(A) signal motif at the TES for significant REGLs over CS peaks (Fig. 3E, F, respectively) with 3627 and 1565 instances, respectively, in the CS peaks and 6663 and 3994 instances, respectively, in the significant REGLs (proportion tests p = 5.725e−06 and p < 2.2e−16, respectively) (Fig. 3E, F).

Generation of high-level transcripts for AtIso

To achieve accurate transcript isoforms from the Iso-seq data, we have adopted a strategy that requires evidential support for all the SJs, TSSs, and TESs. We generated datasets of high-confidence SJs, RSGLs, and REGLs which were then used to filter the 2,239,270 transcripts from all the libraries. Given the stochastic nature of TSSs and TESs, we applied a 100-nt window around each significant and supported RSGL and REGL (50 nt on each side) to define high-confidence TSS and TES regions (Additional File 2; Figure S4). This generated 1,674,795 transcripts after sequentially removing transcripts containing poorly supported SJs (117,361 transcripts) or poorly supported TSS and TES (447,114 transcripts) (Fig. 1C). The above filtering criteria also addressed the common issue of excessive numbers of single-exon gene models generated from Iso-seq experiments and many other genome-wide annotation projects [7, 51], which could be the result of genomic DNA contamination. In our data, we also observed that 161,578 (46.6%) out of 346,455 single-exon transcripts were removed due to the TSS/TES filtering. These removed transcripts are probably fragments with missing 5′ or 3′ sites or false positive gene models. As a result, filtering using high-confidence TSS and TES regions also reduced the number of the mono-exonic genes (containing mono-exonic transcripts) from 13,619 to 4477, a reduction of 67% on the number of putative mono-exonic genes. The percentage of mono-exonic genes decreased from 41.3 to 20.9% of the total number of genes after TSS/TES filtering.

Finally, to increase the gene coverage using existing annotations and make the maximum use of the Iso-seq long reads, we retained a further 2483 genes (7398 transcripts) where the reads overlapped Araport transcripts on the same strand with at least 50% overlap. The combined set was merged allowing 50 nt variations at the 5′ and 3′ ends, and the final AtIso dataset contained 24,344 genes with 132,190 high-level transcripts (Additional File 2: Fig. S4).

To investigate the contribution to gene and transcript diversity from each of the different libraries to AtIso, we used the transcript merge information provided by TAMA (_trans_report.txt) to identify which transcripts and genes were merged from the individual libraries (Additional File 1: Table S7; Additional File 2: Fig. S5). The nuclear RNA sample contributed the least number of genes and transcripts to AtIso despite high number of CCS reads generated from this library. The silique library contributed the second lowest number of genes and transcripts which is likely due to the flow cell having the lowest loading efficiency (21%) and generating the lowest number of CCS reads of all the libraries (Additional File 1: Table S2). The contribution to gene and transcript numbers from the rest of the libraries is more consistent ranging from 7.5 to 14 k genes and 10 to 25 k transcripts. Of the 24,344 genes and 132,190 transcripts in AtIso, only 257 genes and 99 transcripts were shared by all 27 libraries while 3939 genes (16.1%) and 81,310 transcript (61.5%) were unique to a single library. Thus, the libraries from the wide range of organ types and conditions are highly complementary and aided the capture transcriptome diversity.

Finally, we performed a saturation analysis which counted the number of genes and transcripts as each library was added. The increase in the number of new genes in AtIso began to plateau after 8 samples had been added eventually reaching 24,344 genes (Additional File 1: Table S8; Additional File 2: Fig. S6A). Interestingly, the nuclear RNA sample added ca. 1.5 k unique genes despite having the lowest number of genes and transcripts identified. This may reflect capture of transcripts which may function and remain in the nucleus (e.g., some lncRNAs). For samples with relatively limited amounts of sample (e.g., flower and root) which were sequenced more than once, each library continued to add unique genes. In contrast, the number of unique transcripts continued to increase with each library adding a few thousand isoforms (Additional File 1: Table S8; Additional File 2: Fig. S6B). The linear growing trend of unique transcripts shows that saturation has not yet been reached with the existing Iso-seq data.

Construction and characterization of the AtRTD3 transcriptome

AtIso contained transcripts from 57% of the genes in Araport11. Splice junction and transcript identity were compared among AtIso, AtRTD2, and Araport11 [39, 40] (Additional File 3). There was high similarity in SJs but very low overlap of transcripts due to poor 5′ and 3′ end determination and different combination of SJs in AtRTD2 and Araport11 compared to the Iso-seq transcripts (Additional File 3). To generate a new, comprehensive transcriptome for Arabidopsis that covered all genes and incorporated the Iso-seq transcripts, long- and short-read assemblies were combined using the following criteria: (1) AtIso had the most accurate transcript data and was used as the backbone for integrating AtRTD2 and Araport11. To maximize the use of Iso-seq transcripts, we kept all AtIso transcripts. (2) As the TSS, TES, and the combination of SJs are less accurate in transcripts assembled from short reads, (a) only transcripts from AtRTD2 and Araport that contained novel SJs or (b) covered novel genomic loci were incorporated from the short-read assemblies. Using these criteria, the three assemblies were merged with TAMA merge, generating the final transcriptome, which we named AtRTD3. AtRTD3 contained 40,932 genes with 169,503 transcripts with a total of 183,568 SJs. AtIso contributed 132,166 (77.97%) transcripts from 25,248 (61.68%) genes, AtRTD2 contributed 24,831 (14.65%) transcripts from 13,683 genes [39], and Araport11 contributed 12,506 (7.38%) transcripts from 11,750 genes [40]. In AtRTD3, the average number of isoforms per gene was 4.4 and nearly 80% of transcripts had Iso-seq support (SJs, TSS, and TES).

We used SQANTI3, the latest version of SQANTI [52] to assess the quality of the long-read transcripts in AtIso and AtRTD3 in comparison with other reference transcriptomes (Araport11 [40] and AtRTD2 [39]). SQANTI catalogs long-read transcript as Full Splice Match (FSM) when the transcript matches a reference at all SJs, Incomplete Splice Matches (ISM), if the transcript misses SJs at either 5′ and 3′, Novel In Catalogue (NIC), when the long-read transcript includes a novel combination of existing donor or acceptor sites, and Novel Not In Catalogue (NNC), when the long-read transcripts contain at least one novel donor or acceptor site. Other categories are Genic, Intergenic, Fusion, and Antisense [52]. When compared to the Araport reference, ca. 35% of AtRTD3 transcripts (ca. 59 k) were FSM and ca. 4% were ISM (Additional File 2: Fig. S7A). Fifty-five percent of AtRTD3 transcripts were novel, either NIC or NNC (Additional File 2: Fig. S7A). These results reflect AtRTD3 having a much higher number of transcripts (169.5 k) than Araport11 (48 k) and consisting mainly of novel isoforms. The number of FSM transcripts in AtRTD3 reflects transcripts with an exact match of SJs, although they might be different defined TSS and TES in the Iso-seq transcripts. The AtRTD2 transcriptome is based on short-read assembly and has many more isoforms than Araport11. Consequently, when the AtRTD3 transcriptome—where ca. 80% of transcripts are derived from Iso-seq—is assessed versus the AtRTD2 annotation (Additional File 2: Fig. S7B), a higher number of FSM and lower number of NNC is found than when assessed against the Araport11 annotation (Additional File 2: Fig. S7A). This indicates that AtRTD3 is more similar to the AtRTD2 than to the Araport11 annotation. TSS and TES were defined using the Iso-seq reads in AtIso. AtRTD3 contained all of the transcripts from AtIso with the addition to transcripts from Araport11 and AtRTD2 to provide full coverage of genes in Arabidopsis. SQANTI3 assessed the quality of TSS in AtIso and AtRTD3 by comparing their positions to PEAT-defined TSS from Morton et al. (2014) [41] which covered around 9 k protein-coding genes. The % of transcripts with PEAT support for these genes was very similar for ISM, NIC, and NNC transcripts (Additional file 2: Fig. S7C, D). However, 60% of FSM transcripts from AtIso had PEAT support which decreased to 45% for AtRTD3 FSM transcripts. The reduction in TSS quality in AtRTD3 reflects the inclusion of isoforms from Araport11 and AtRTD2 where TSS are of lower quality.

Genes and transcripts in AtRTD3 were characterized using TranSuite, a program which identifies mono- and multi-exonic genes and generates accurate translations of transcripts and transcript characteristics [53]. The output includes translations of all transcripts in the RTD and multiple transcript features (Additional File 1: Table S9). These results are summarized in Fig. 4 and Additional File 1: Table S10A and S10B. Almost three quarters (73.5%) of the genes coded for proteins and ca. 26.5% were non-protein-coding genes (Fig. 4A; Additional file 1: Table S10A). Of all genes, 66.5% were multi-exonic and 50% had more than one transcript isoform. Of the genes that produced a single transcript, two-thirds were single-exon genes and one-third were multi-exonic (Fig. 4C; (Additional File 1: Table S8B). For protein-coding genes, 62.9% were multi-exonic with more than one isoform. The 10,827 non-protein-coding genes generated 14,880 transcripts (Fig. 4E); the majority were single-exon genes but 1728 genes were multi-exonic (spliced) with a single transcript and over 5 k genes had more than one isoform (Additional File 1: Table S10A). We also identified 3796 chimeric (read-through) transcripts covering usually two Araport genes with an overlap > 30%.

Fig. 4
figure 4

Gene and transcript characteristics of AtRTD3. A Protein-coding and non-protein-coding genes. B Mono-exonic and multi-exonic genes. C Mono- and multi-exonic genes with single/multiple transcript isoforms for all genes and D for protein-coding genes. E Distribution of transcripts from protein-coding genes (protein-coding and unproductive isoforms) and from non-protein-coding genes. F Protein-coding transcripts with little or no impact on coding sequence (NAGNAG/AS in UTR) and protein-coding variants. G Distribution of transcripts with NAGNAG, AS in 5′ UTR, and AS in 3′ UTR: H distribution of NMD features among unproductive transcripts from protein-coding genes. DSSJ—downstream splice junction; OUORF—overlapping upstream open reading frame

At the transcript level, AtRTD3 contained more than double the number of transcripts compared to AtRTD2 with greatly increased numbers of protein-coding and unproductive transcripts from protein-coding genes: 154,619 (91.2%) AtRTD3 transcripts came from protein-coding genes (Fig. 4E). Of these, ca. 86 k are expected to code for proteins while ca. 68.5 k are probably unproductive (Fig. 4E; Additional File 1: Table S10B). Alternatively spliced transcripts that coded for proteins were divided into those where AS events had little or no effect on the coding sequence (NAGNAG/AS UTR) (30.3%) and those that encoded protein variants (69.7%) (Fig. 4F; Additional File 1: Table S10B). NAGNAG/AS events generate transcripts that code for protein variants differing by only one amino acid and transcripts of genes where AS events occur only in the 5′ and/or 3′ UTRs and hence code for identical proteins. The NAGNAG/AS UTR transcripts were further broken down according to whether AS events were in the 5′ and/or 3′UTR or were NAGNAG (Fig. 4G; Additional File 1: Table S10B). The most frequent AS events were in the 5′ UTR (52.4%) followed by those in the 3′ UTR (21.2%) or NAGNAG events (15.4%) (Fig. 4G). NAGNAG AS events were present in 7% of protein-coding transcripts and 3.5% of all transcripts. Finally, the unproductive transcripts from protein-coding genes were classified by their nonsense-mediated decay (NMD) target features: presence of a premature termination codon (PTC), downstream splice junctions, long 3′ UTR, or overlapping upstream ORF where an upstream ORF overlaps the authentic translation start site [54] (Fig. 4H; Additional File 1: Table S10B). Over 70% of the unproductive transcripts contained the classical combination of NMD target features of a PTC with downstream splice junctions and long 3′UTRs, 8.7% had a PTC with either one of these signals and 6.4% of transcripts contained an overlapping uORF (Fig. 4H; Additional File 1: Table S10B).

Iso-seq increased the number of transcript isoforms for many genes reflecting both discovery of novel AS events and defined TSS/TES variation compared to Araport and AtRTD2 (Additional File 2: Fig. S8). Different TSS in Iso-seq transcripts were observed in genes where alternative TSS had been previously characterized [55], for example, AT1G09140 (SERINE-ARGININE PROTEIN 30) and AT1G22630 (SSUH2-LIKE PROTEIN) (Additional File 2: Fig. S9A and B). Defined Iso-seq TESs in AtRTD3 confirmed the well-established intronic alternative polyadenylation sites in FCA and FPA (not shown) and those in ATHB13 (AT1G69780) and ANKYRIN REPEAT-CONTAINING PROTEIN 2 (AT4G35450) [56] (Additional File 2: Fig. S10A and B). The Iso-seq data also identified novel splice sites and alternative TSS/TES in known and novel lncRNAs. For example, AS transcripts of the antisense lncRNA, FLORE [57] were confirmed (Additional File 2: Fig. S11). AtRTD3 contained 1541 novel genes compared to Araport (Additional file 1: Table S11). All were identified by Iso-seq, and their transcripts therefore have high-confidence TSS/TES and SJs for those which are spliced or alternatively spliced. The majority of the novel genes were lncRNAs with only 109 genes coding for proteins with a CDS of > 100 amino acids; 223 had more than one transcript and 1318 had single transcripts. The novel genes were either intergenic or antisense genes. For example, G12636 is an alternatively spliced intergenic lncRNA, G13263 is a spliced antisense gene with different TSS, and G14744 is an alternatively spliced antisense gene which covers two different protein-coding genes (Additional File 2: Fig. S12A, B and C, respectively). We carried out a functional annotation analysis of the transcripts from the novel genes identified in AtRTD3 using TRAPID 2.0 (http://bioinformatics.psb.ugent.be/trapid_02) [58]. Among the 1985 transcripts, a best similarity search using DIAMOND identified hits for 1320 transcripts from a range of plant species with 1131 (85.68%) coming from Arabidopsis thaliana and 49 (3.71%) from Arabidopsis lyrate (Additional File 1: Table S12). These transcripts were associated with 897 gene families and 4 RNA families. Thus, around two-thirds of the novel transcripts are related to known genes.

Iso-seq also defined 1197 genes with 3796 chimeric transcripts which extended over two or more genes (Additional file 1: Table S13). For example, Iso-seq detected only a single transcript of the upstream MEKK2 gene but multiple chimeric transcripts covering the tandemly arranged MEKK2 and MEKK3 genes (Additional File 2: Fig. S13). Thus, the high quality Iso-seq data increases transcript diversity and provides detailed information of transcript features. Chimeric transcripts have been identified previously in an fpa mutant of the flowering time control protein, FPA, using an algorithm based on reciprocal DRS read abundance at tandem protein-coding genes [59]. Forty-four chimeric RNAs were identified in the fpa mutant of which 12 were confirmed; AtRTD3 contained 5 of the putative chimeric RNAs and two of those corroborated. Similarly, AtRTD3 contained two of the 52 putative chimeric/extended mRNAs were identified in a mutant of the NEW ENHANCER OF ROOT DWARFISM1 gene [60]. The small overlap between the chimeric genes in AtRTD3 and these studies is likely due to the mutants affecting transcription termination in the upstream gene and not being included among the Iso-seq samples in this study.

Finally, we compared the frequency of different AS event types among the different transcriptomes using SUPPA2 [61]. AtRTD3 had the highest number of AS events followed by AtIso (Additional File 1: Table S14). For the most part, the frequency of different AS events is similar with approximately double the number of alternative 3′ splice site (Alt 3′ss) than alternative 5′ splice site (Alt 5′ss) events and relatively few exon skipping events (6–7%). Intron retention (IR) is far more frequent in plants than in animals with around 40% of plant AS events being IR [62] as seen in AtRTD2 and Araport11 (Additional File 1: Table S10). However, AtIso contained a higher number of IR events (50%) which supported the observation that many Iso-seq transcripts from multi-exon genes contained different individual retained introns (e.g., Additional File 1: Fig. S8 and S9) such that Iso-seq appeared to identify more low-abundance transcript variants in highly expressed genes. Finally, the intermediate value of 44% IR events in AtRTD3 reflects the combination of unique transcripts from Iso-seq and short read-derived assemblies.

AtRTD3 and AtIso increase quantification accuracy at the transcript and alternative splicing levels

To evaluate AtRTD3 and AtIso in the performance of transcript and AS quantification, we used high-resolution (HR) RT-PCR data that we had used previously to evaluate AtRTD2 [39]. The HR RT-PCR data was generated using RNA samples of two time-points (T5 and T20) of Arabidopsis plants exposed to cold and which were also used to generate RNA-seq data for direct comparison [63]. Due to the increased transcript/AS diversity in AtRTD3 and AtIso, we were able to analyze 226 AS events from 71 Arabidopsis genes (three biological replicates of each of the T5 and T20 time-points). This generated 1349 data points, which represents a significant increase from the earlier study (127 AS events from 62 genes with a total of 762 data points). For the splicing ratios from HR RT-PCR, transcript structures from AtRTD3 and AtIso were compared to the amplicons in HR RT-PCR and the TPMs of individual transcripts covering the different AS outcomes were used to calculate splicing ratios for each of the AS events or event combinations in that region. For splicing ratios from RNA-seq data, each of the different reference transcriptomes (AtRTD2-QUASI, Araport11, AtIso, and AtRTD3) were used to quantify transcripts using Salmon. The splicing ratio for each AS event was calculated by comparing the abundance of individual AS transcripts with the AS event to the fully spliced (FS) transcript which is usually the most abundant transcript and codes for full-length protein (AS/FS). The scatter plot of splicing ratios from HR RT-PCR and RNA-seq using the different reference transcriptomes (Fig. 5; Additional File 2: Fig. S14) shows that AtRTD3 and AtIso achieve the highest concordance with HR RT-PCR data. This is likely due to the increased integrity of transcript structure (accurate characterization of SJs, TSSs, and TESs and their combinations) as well as increased transcript/AS diversity over AtRTD2 and Araport11. Although AtIso and AtRTD3 performed very similarly in this analysis, AtRTD3 is the transcriptome of choice for RNA-seq analyses due to its far greater gene coverage.

Fig. 5
figure 5

Correlation of splicing ratios calculated from the RNA-seq using different RTDs and HR RT-PCR data. Splicing ratios for 226 AS events from 71 Arabidopsis genes (three biological replicates of the time-points T5 and T20) generated 1349 data points in total. The splicing ratio of individual AS transcripts to the cognate fully spliced (FS) transcript was calculated from TPMs generated by Salmon and A Araport11, B AtRTD2-QUASI, C AtIso, and D AtRTD3 and compared to the ratio from HR RT-PCR. E Correlation coefficients are given for each plot. Note that for clarity of the figures, data points with values that lie substantially outside the range of the graphs are not included in A–D but are included in the correlation values and shown in Additional File 2: Fig. S11

High-resolution gene and transcript expression profiling with AtRTD3

AtRTD3 contains many more transcripts (169,503) than AtRTD2 (82,190). This reflects increased numbers of transcripts with intron retention and other AS events as well as defined TSS and TES variation. For some highly expressed genes with multiple introns, the combination of TSS/TES variation and intron retention events often led to tens of transcript isoforms from a single gene. Although more complex than AtRTD2, we predicted that the majority of isoforms with intron retention represent intermediates of splicing where an intron(s) had not been removed at the time of RNA extraction and that they would therefore have low levels of expression. Similarly, some isoforms with novel AS events would be NMD-sensitive again potentially with low expression levels. In contrast, novel AS isoforms or isoforms with different TSS or TES with significant expression levels would be expected to alter the transcript expression profiles compared to analysis with AtRTD2 where these isoforms were absent (we showed previously the impact of missing transcripts in transcript quantification [39]). To demonstrate the increased resolution obtained with the more complex and diverse AtRTD3, we compared gene and transcript expression profiles using RNA-seq data from an RNA-seq time-course of 5-week-old Arabidopsis plants grown in 12 h dark:12 h light in the transition from 20 to 4 °C [63, 64]. Briefly, transcripts were re-quantified with Salmon using AtRTD3 as reference and the RNA-seq data from 26 time-points (3 biological replicates) was re-analyzed. Time-points were taken every 3 h for the last day at 20 °C (T1–T9), the first day at 4 °C (T10–T17), and the fourth day at 4 °C (T18–T26) (see Fig. 6). Expression profiles were directly compared between AtRTD2 and AtRTD3.

Fig. 6
figure 6

Differential TSS and TES usage. Pairs of transcript isoforms with significant isoform switches and different TSS (A–D) and TES (E, F). A AT1G11280—the shorter .6 transcript is cold-responsive. B AT3G13110—single-exon gene with different TSS where the .1 transcript has rapid cold-induced expression compared to the .2 transcript. C AT1G55960—both transcripts peak at dusk but have different expression behavior with the .11 isoform showing large increases of expression at 20 °C and day 1 at 4 °C declining with continued cold exposure. D AT5G53420—isoforms with very different TSS - .7 isoform expressed rhythmically peaking during the day (light-responsive) at 20 °C before declining rapidly in the cold while the .12 transcript has increased expression in the cold, peaking during the dark. E AT4G14400—the isoforms differ only in their TES but are expressed rhythmically with different phase (3 h offset) at 20 °C and reduced at 4 °C. F AT3G56860—very different TES and expression behavior—antiphasic at 20 °C with cold-induced switch to the shorter .12 isoform. Error bars on points are standard errors of the mean

The more comprehensive nature and accuracy of AtRTD3 is clearly illustrated by the THIAMIN C SYNTHASE (THIC) gene (AT2G29630) which is involved in regulation of thiamin biosynthesis via a riboswitch in the 3′ UTR that controls expression through alternative 3′-end processing or splicing [65, 66]. Three types of transcripts have been identified previously: type I transcripts represent precursor transcripts; type II transcripts have been processed at a polyadenylation site in the second 3′UTR intron (3′-2) and type III transcripts have splicing of intron 3′-2 [65] (Additional File 2: Fig. S15A). Low levels of THIC expression reduce vitamin B1 (thiamin diphosphate—TPP) levels. Low levels of TPP allow the structure of the RNA aptamer to interact with the 5′ splice site of the 3′-2 intron to inhibit splicing and promote processing at the polyadenylation site in the intron. The resultant type II RNA transcripts have relatively short 3′ UTRs, are stable, and give high expression of THIC [65]( Additional File 2: Fig. S15A). With increased levels of TPP, TPP binds to the aptamer leading to structural changes in the riboswitch RNA such that it can no longer interact with and inhibit use of the 5′ splice site of 3′-2. Subsequent splicing of the 3′-2 intron removes the poly(A) site and type III transcripts with longer 3′ UTRs of various lengths are generated leading to increased RNA degradation and reduced expression of THIC (Additional File 2: Fig. S15A). AtRTD3 contained 32 THIC transcript isoforms (Additional File 2: Fig. S15B). The majority have very low expression and either have retention of different introns within the CDS and are likely intermediates of splicing or have other AS events that disrupt the ORF and introduce PTCs. Type I, II, and III transcripts [65] were clearly distinguished by their 3′UTR structures (Additional File 2: Fig. S15B). The 3′ processed type II mRNAs have a shorter 3′UTR than types I and II due to processing at the pA site within intron 3′-2 while type III transcripts have splicing of the 3′-2 intron (removes the first seven nucleotides of the aptamer sequence) and longer 3′UTRs with a range of 3′ends sites [65]. In addition to the type I, II, and II isoforms found in AtRTD3, we observed a novel AS variant where splicing removed only the first aptamer nucleotide. We detected three type I precursor transcript isoforms among the 32 THIC isoforms in AtRTD3 (Additional File 2: Fig. S15B). In contrast, Araport and AtRTD2 contained 4 and 10 transcripts, respectively. Neither AtRTD2 nor Araport contained type II transcripts and possible type I transcripts were much longer than those obtained with Iso-seq suggesting that the 3′UTRs of the transcripts were incorrectly assembled. THIC is highly expressed and under circadian control [66]. In the cold time-series analyzed with AtRTD3 as reference, THIC expression increased during the day and decreased in the dark (Additional File 2: Fig. S15C). The major isoform was the AT2G29630.28 type II RNA; the highest expressed minor isoforms seen during the light period are a type I isoform and another type II isoform (Additional File 2: Fig. S15C). Although the total expression profiles using AtRTD3 and AtRTD2 are very similar, the underlying transcript profiles were quite different and reflect incorrectly assembled transcripts and the absence of type II transcripts in AtRTD2 (Additional File 2: Fig. S15D). Thus, the more comprehensive transcript set in AtRTD3 along with the ability of Iso-seq to identify TES, successfully distinguished the different THIC RNA classes and showed that a type II isoform is the most abundant class [65]. The impact of increased diversity and transcript profiling resolution were also illustrated by the identification of a novel cold-induced isoform with shorter TSS and TES in AT3G17510 (CBL-INTERACTING PROTEIN KINASE 1 - CIPK1) and a novel isoform (AT4G25080.13) encoding an N-terminally truncated protein of AT4G25080 (MAGNESIUM-PROTOPORPHYRIN IX METHYLTRANSFERASE - CHLM) in AtRTD3 (Additional File 2: Fig. S16 and S17, respectively).

Cold- and blue light-induced differential TSS and poly(A) site usage

Differential TSS and TES usage was observed among the expressed isoforms of AT3G17510 (CBL-INTERACTING PROTEIN KINASE 1) (Additional File 2: Fig. S16). To examine differential TSS and TES usage more widely, we first generated lists of genes from AtRTD3 which contained alternative TSS and TES which were more than 100 bp apart (2251 and 1753 genes, respectively). Initially, to show differential TSS usage of some of these genes, we compared the 2251 genes with alternative TSS to 220 genes which had previously been shown to have blue light-induced differential TSS usage [55]. Eighty-two of the genes with alternative TSS defined here had blue light-induced differential TSS usage. We next re-analyzed the RNA-seq time-course data [63] with AtRTD3 as reference and applied the Time-series Isoform Switch (TSIS) program [67] to identify genes with significant isoform switches (IS) (p < 0.001). To identify IS in genes with alternative TSS and TES, we filtered the IS with the lists of genes containing alternative TSS and TES more than 100 bp apart. This identified 2136 significant IS with alternative TSS and 1723 with alternative TES from 583 and 450 different genes, respectively (160 genes had IS involving isoforms with alternative TSS and TES). Genes could contain > 1 isoform switch if they involved different pairs of isoforms from the same gene or where multiple IS occurred between different time-points (the time-series had 26 time-points). However, the IS analysis did not distinguish between IS due alternative splicing or to differential usage of alternative TSS and/or TES. We, therefore, selected prominent IS events where the isoforms had large negative correlation values < −0.5 or where the difference in expression levels of the isoforms was >20TPM. These were then manually inspected to identify transcript isoforms with no AS such that the IS only involved isoforms with alternative TSS or TES (Fig. 6A–D) and TES usage (Fig. 6E, F). For example, the AT1G11280.11 isoform had a TSS 123 bp upstream of the .6 isoform and their poly(A) sites differed by only 3 nt. The .11 transcript (3473 nt including introns) has an intron in the extended region and codes for a protein of 830 amino acids with 10 additional amino acids at the N-terminal end compared to the .6 isoform (Fig. 6A). At 20 °C, the .6 isoform peaked 3 h after dusk (T2) and then declined in expression; cold rapidly induced expression of this transcript in the dark while expression of the .11 transcript does not change significantly in response to light-dark or cold (Fig. 6A). AT3G13110 is a single-exon gene. The .1 and .2 isoforms have the same poly(A) sites but the TSS of .2 is 272 bp upstream of .1. The .2 transcript codes for a protein with a 55-amino-acid N-terminal extension. At 20 °C, there was little expression of the .1 transcript but cold caused a rapid, transient increase in day 1 at 4 °C peaking at dawn (T13) while the .2 transcript showed a modest increase at low temperature. Thus, at 20 °C, the .2 promoter drives expression and cold induces a rapid switch to the .1 promoter (Fig. 6B). The .11 isoform of AT1G55960 has a TSS 104 bp upstream of the .7 isoform and slightly different poly(A) sites (differing by 12 nt); the isoforms code for identical proteins (Fig. 6C). At 20 °C, both isoforms were expressed in the light peaking 3 h after dawn (T5). However, expression levels of .11 were lower than .7 in the dark but showed a large increase in expression in the light at both 20 and 4 °C (day 1) which was lost by day 4 at 4 °C (Fig. 6C). Thus, AT1G55960 has a light- and cold-regulated promoter switch. The TSS of the .12 isoform of AT5G53420 is 717 bp upstream of that of the .7 isoform. The poly(A) site of .12 is also longer by 47 nt and codes for a 265-amino-acid protein including a 79-amino-acid N-terminal extension (Fig. 6D). At 20 °C, the shorter .7 transcript was expressed rhythmically during the day and declined in the cold with a rapid switch to higher expression of the longer .12 isoform mainly in the dark with different phasing of expression (Fig. 6D). This suggests that the promoter driving expression of the .7 transcript is light-responsive and negatively regulated by low temperature while that of the .12 isoform is cold-responsive.

Differential TES usage was shown for the .26 and .27 isoforms of AT4G14400 which have identical TSS and code for the same protein but have different poly(A) sites, 194 nt apart. At 20 °C, expression of .27 was significantly higher than .26 peaking at dusk (T1) while .26 peaked 3 h later in the dark (T2). Expression of the isoforms increased during the day but in day 1 at 4 °C, the .26 isoform increased to a similar level to the .27 isoform (Fig. 6E). The differential phasing of expression of the isoforms was more pronounced at 4 °C (Fig. 6E). The isoforms only differ by their poly(A) sites suggesting that phasing of expression and the cold response of .26 are mediated by alternative polyadenylation. Finally, the .24 and .12 isoforms of AT3G56860 have identical TSS and CDS but very different poly(A) sites with that of the .24 isoform being 1218 nt downstream (Fig. 6F). Both isoforms were expressed at 20 °C in an almost complementary way, but at 4 °C there was a rapid increase in expression of the shorter .12 isoform and decline of the .24 isoform. Thus, the very different cold responses of the two isoforms may be controlled by alternative polyadenylation. The TSIS method only identified a subset of potential differential TSS and TES usage because it was limited to genes which had TSS or TES sites that were > 100 bp apart and where different isoform abundances switched significantly.

Besides defining alternative polyadenylation in 3′UTRs, the TES analysis also identified premature polyadenylation sites. Premature polyadenylation is an important mechanism in regulating gene expression as shown for FCA and FPA [59, 68, 69]. Such polyadenylation events occur in either exonic or intronic sequences with different consequences. Premature polyadenylation that occurs in exons can result in non-stop mRNA transcripts where there is no stop codon in the transcript after the translation start site and ribosomes reaching the 3′ end of the transcript trigger the non-stop decay pathway [70]. Most transcripts from premature polyadenylation in introns have a stop codon before the end of the transcripts but depending on the polyadenylation site can give rise to non-stop RNAs. Recently, the non-stop decay pathway has been shown to function in plants [71] and non-stop RNA transcripts have been identified in disease resistance genes which require FPA for premature polyadenylation [72]. We identified 214 non-stop RNA transcripts from 169 protein-coding genes in AtRTD3 (Additional File 1: Table S15A and B). Disease resistance genes were the most common gene class and included 14 of the ca. 40 FPA-sensitive disease resistance genes with non-stop transcripts [72] as well as ten disease resistance genes with non-stop RNA transcripts not found in that study. Interestingly, two polyadenylation and cleavage factor homologs (PCFS1 and PCFS5) generated non-stop RNAs from premature polyadenylation and one of the FPA transcripts (AT2G43410.8) was a non-stop RNA (Additional File 1: Table S15A and B). The list of genes with non-stop RNAs is unlikely to be complete as only around one-third of the FPA-sensitive disease resistance genes were identified which may reflect the specific effect of the fpa mutant compared to the range of samples used here or differential coverage of genes in the Iso-seq and Oxford nanopore datasets. Nevertheless, defining TSS and TES by Iso-seq allows detailed investigation of mechanisms of post-transcriptional regulation of expression and developmental stage- and condition-specific changes in TSS and TES usage.

Discussion

The accuracy of differential gene expression and differential alternative splicing analyses of RNA-seq data depends on the quality and comprehensiveness of the reference transcriptome. Here, we present a new Arabidopsis RTD (AtRTD3) which has extensive support from single-molecule sequencing (PacBio Iso-seq). Data was generated from a wide range of organs/tissues, abiotic and biotic treatments, and RNA-processing mutants to increase the number and diversity of transcripts. Novel methods were developed to identify high-confidence SJs and TES/TSSs to overcome (1) the sequencing errors particularly around splice junctions which generate thousands of false transcript structures/annotations and (2) the impact of degradation and truncated transcripts/reads on accurate end determination. In AtRTD3, 77.9% of transcripts (from 61.68% of genes) are high-quality Iso-seq-derived transcripts with accurately defined SJs and start and end sites. For those genes with little or no Iso-seq coverage, transcript isoforms were taken from AtRTD2 (14.65%) and Araport11 (7.38%). AtRTD3 contains 169,503 unique transcripts from 40,932 genes reflecting novel genes (mostly lncRNA genes), novel AS transcripts, and defined TSS/TES compared to the short read-derived AtRTD2 [39]. AtRTD3 represents a high-quality, diverse, and comprehensive transcriptome which improves gene and transcript quantification for differential expression and AS analysis and now allows alternative TSS and TES usage to be addressed.

In the production of AtRTD3, we applied a hybrid analysis pipeline using PacBio IsoSeq3 and TAMA and developed new methods of single-molecule sequencing analysis which are generally applicable and will improve downstream analysis and the quality of transcript and transcriptome annotations. We showed previously that redundant or missing transcripts, transcript fragments, and mis-annotation in the 5′ and 3′ ends of transcripts seriously impacted the accuracy of transcript and gene expression quantification with Salmon and Kallisto which require prior knowledge of transcripts [39]. Initial analysis of the Iso-seq data identified issues with false splice junctions, degraded or fragmentary reads/transcripts and that error correction methods using short-read data often trim or split whole transcripts sequences in fragments or generated new errors (over-correction). In addition, the IsoSeq3 analysis pipeline from PacBio used polishing steps which removed splice site variation with small differences such as alternative splicing of a few nucleotides (e.g., NAGNAG sites). These observations provided the motivation to improve methods of analysis of PacBio Iso-seq data. Firstly, we used the IsoSeq3 pipeline up to the generation of FLNCs and then switched to TAMA which gave greater control over transcript processing and was the basis of developing the SJ-centric approach. Secondly, we clearly demonstrated that mismatches in the vicinity of SJs generated transcripts with false splice junctions. We defined criteria to identify high-confidence splice junctions and remove poorly supported SJs. The number of rejected SJs and the high overlap with the accurate short read-determined SJs illustrated the value of the splice junction-centric approach. Thirdly, even with 5′-cap capture, there is extensive variation in transcript start and end sites, much of which reflects degradation of RNA. Distinguishing high-confidence TSS and TES from such degradation products required different methods that take into account the effects of different gene expression levels and the stochastic nature of transcription start and end sites. The high-confidence TSS and TES defined in AtRTD3 were supported by the frequency, position, and distribution of conserved promoter, polyadenylation, and translation start motifs and by good agreement with experimentally defined TSS and poly(A) sites [41, 43, 44]. Such experimental determinations are often limited in the number of genes for which data is generated and the number of transcripts where both the 5′ and 3′ ends are defined. The new pipeline addresses the major issues of accuracy of splice site and TSS/TSS determination in Iso-seq analysis. The methods have three main advantages: (1) the generation of high-confidence SJs removed the need for error correction using short reads and therefore avoided splitting or trimming of the original sequences as well as over-correction, (2) both TSS and TES are generated for a very high proportion of transcripts, and (3) they are determined directly from the single-molecule data without the need for parallel experimental approaches. To date, Iso-seq has been applied to a wide range of plant species (see “Background”); the novel methods here will improve analysis of transcripts in future studies and allow re-analysis of existing data. In addition, AtRTD3 can evolve further with the addition of new or existing Iso-seq datasets analyzed using the methods described here.

The Iso-seq-derived transcripts in AtRTD3 (ca. 80% of transcripts) were full-length with accurate SJs and TES/TSS and correct combinations of TES/TSS and AS events but only covered ca. two-thirds of genes in Arabidopsis. This represents good coverage for Iso-seq in comparison to other studies. For example, a recent study of Iso-seq of nine tissues in rice covered only ca. one-third of rice genes [29]. Coverage of the other genes and transcripts in AtRTD3 came from Araport11 and, primarily, from AtRTD2 due to its far greater transcript diversity [39]. The transcripts from AtRTD2 and Araport11 are of high quality in terms of splice sites but their 5′ and 3′ ends are likely to be inaccurate and are often artificially extended [39]. The quality of SJs in the AtRTD2 transcripts is evidenced by 57.8 k of the 82 k AtRTD2 transcripts being redundant to Iso-seq transcripts in having identical SJs such that the Iso-seq transcripts were preferentially selected. Thus, AtRTD3 has full coverage of the genes in Arabidopsis with two-thirds of genes made up predominantly of Iso-seq transcripts and one-third of high-quality RNA-seq assembled transcripts. AtRTD3 is unique in that all of its transcript annotations have undergone extensive quality controls. As higher accuracy and throughput of single-molecule sequencing technologies improve, the new analysis pipeline exploited here will enable the rapid determination of SJs, TSS, and TES for fully comprehensive transcriptomes.

AtRTD3 contains greatly increased numbers of unique transcripts and particularly transcripts coding for protein variants and unproductive transcripts from protein-coding genes compared to AtRTD2. Although transcript numbers more than doubled in AtRTD3, 60.4% of multi-exonic protein-coding genes had AS agreeing with previous estimates [39, 62]. The increased number of protein variant transcripts includes transcripts from the same genes with alternative TSS and pA sites and the identification of novel AS events which alter coding sequences. The increased unproductive transcripts also included transcripts with the same PTC-generating AS event but with alternative TSS and TES sites and the majority contained classic NMD characteristics. Iso-seq identified novel AS events and, in particular, high numbers of intron retention events. The majority of transcripts with intron retention most likely reflect partially spliced pre-mRNAs and why such transcripts should be more prevalent in Iso-seq is unknown but may be due to lower efficiency of obtaining full short-read coverage of introns in short-read assembly. In plants, transcripts with intron retention have been shown to avoid NMD and to be retained in the nucleus [54, 73]. In contrast, human intron retention transcripts are generally degraded by the NMD pathway [74] but numerous examples of intron retention as a regulatory mechanism have been described [75]. For example, intron detention where partially spliced transcripts remain in the nucleus until required and are then spliced and mRNAs exported and translated represent novel gene regulation mechanisms [75]. In this regard, we have identified ca. 20 k protein-coding transcript isoforms with AS only in the 5′ and/or 3′ UTR such that isoforms coded for the same protein. AS in UTRs can be involved in regulation of expression by introducing short or overlapping uORFs to trigger NMD or affecting translation [54] or nuclear retention of mRNAs determining export of mRNAs [75]. The detailed characterization of such transcripts here provides a basis for future investigation into the regulatory roles of AS in UTRs.

The power of exploiting comprehensive RTDs in analyzing differential expression and differential alternative splicing was demonstrated in Arabidopsis using a cold time-series dataset and AtRTD2 [63, 64]. Thousands of genes with rapid cold-induced significant changes in expression and AS were identified due to the transcript level resolution of expression [63, 64]. AtRTD3 is more comprehensive and for most transcripts (ca. 80%) there is detailed structural information in terms of AS events and TSS/TES which increase the resolution of the analysis. Direct comparison of transcript quantification using AtRTD2 and AtRTD3 showed an increase in accuracy and the impact of missing transcripts and incorrectly assembled transcripts as seen previously [39]. More importantly, the defined TSS and TES clearly demonstrated variation in TSS and TES for many genes and re-analysis of the cold time-series data with AtRTD3 identified differential TSS and TES usage due to low temperature and light/dark conditions. It will now be possible to examine transcriptional and post-transcriptional regulation of gene expression involving differential TSS and TES usage demonstrated here and the impact of AS in UTRs [54, 76] during development and in response to abiotic and biotic stresses. Differential TSS and TES usage illustrates novel regulatory mechanisms. For example, Kurihara et al. [55] identified differential TSS usage in response to blue light and proposed a mechanism whereby blue light induces use of a TSS downstream of an uORF to produce a transcript that avoids NMD and allows expression. As mentioned above, over 20 k transcripts in AtRTD3 have AS only in the UTRs and interplay between TSS/TES usage and AS in the UTRs may have important regulatory roles affecting stability of transcripts, whether they are retained in the nucleus or exported and avoid NMD or are degraded to fine tune gene expression.

The main use of AtRTD3 is in analysis of RNA-seq data and rapid and accurate differential gene expression and differential alternative splicing. A key element of its functionality is in the accurate quantification of transcripts using Salmon or kallisto and AtRTD3 aims to be as comprehensive as possible and to minimize factors that can bias quantification of transcripts.

Despite the increased number of transcripts, one of the limitations of AtRTD3 is the incomplete coverage of genes and transcripts by Iso-seq as seen in the saturation curve (Additional File 2: Figure S6A). Importantly, ca. 80% of protein-coding, unproductive mRNA, and ncRNA transcripts in AtRTD3 were derived from Iso-seq. The transcripts are full-length with defined 5′ and 3′ ends and transcript fragments have been removed to ensure accurate quantification. Nevertheless, gaps in gene and transcript Iso-seq coverage have been filled from the other transcriptomes and some of these short read-based genes will have mis-annotations in the 5′ and 3′ ends of transcripts which can affect transcript quantification [39]. As more single-molecule sequences become available, the short read-based transcripts will be replaced by long-read versions using the methods described here such that AtRTD3 will continue to evolve. A second consideration is whether the greatly increased number of transcripts in AtRTD3 may affect transcript quantification. On the one hand, increased numbers and definition of isoforms give greater resolution of gene expression and the contribution of each isoform (Additional File 2: Figures S15-S17). On the other hand, biological systems are complex, and the increased number of transcripts included higher numbers of novel AS isoforms (protein-coding or targets of NMD), intron retention isoforms which may represent intermediates of splicing or mis-spliced transcripts. Due to the quality control filters used to construct AtRTD3 to address factors affecting accurate transcript quantification [39], we expect transcripts which are intermediates of splicing (with one or more retained introns) or which have splicing errors to have low abundance and little effect on quantification of other isoforms. However, some intron retention transcripts (e.g., exitrons) are regulatory and have higher levels of expression [62, 77]. There is substantial variation in the abundance of NMD transcripts [54] and particular isoforms may be prominent in specific cell types or conditions. It is not possible to predict such variation in transcript expression levels and therefore it is important to capture expression of all transcripts and exploit the ability of RNA-seq data to distinguish the relative contribution of each transcript to the overall expression of a gene and obtain accurate expression levels of, for example, protein-coding isoforms. Ultimately, when all transcript isoforms are full-length with defined 5′ and 3′ ends, we expect accurate quantification of all transcripts irrespective of the complexity of the reference transcriptome. Finally, it is increasingly important with single cell transcriptomics to have a complete and comprehensive transcriptome reference for analysis of RNA-seq data.

Conclusions

In this study, we generated AtRTD3, the most comprehensive and accurate Arabidopsis transcriptome to date. We sequenced a diverse set of samples with different tissues, different environmental conditions, and mutants so that AtRTD3 captured a much greater transcript diversity. We developed novel computational methods to examine the sequencing evidence for splice junctions as well as TSS and TES so that the transcripts derived from this study is well supported from start to the end. AtRTD3 improved the precision of differential gene and transcript expression, differential alternative splicing, and transcription start/end site usage analysis from RNA-seq data. The novel methods for identifying accurate splice junctions and transcription start/end sites are widely applicable and will improve single-molecule sequencing analysis for other species.

Materials and methods

Plant material

Plant samples for RNA extraction and Iso-seq sequencing were all from Arabidopsis Col-0 and are summarized in Additional File 1: Table S1 and described below.

  • Different organ samples: flower, silique, and root materials. Col-0 was used for all samples. Roots: roots were harvested from 5-week-old plants grown in liquid culture (12 h light/12 h dark) and harvested at dawn and dusk and pooled. Siliques and inflorescence/flowers: plants were grown in soil in 16 h light/8 h dark conditions at 23 °C; siliques of different sizes (stages) up to early browning and inflorescences containing flowers from buds to mature flowers were harvested from 6-week-old plants and each pooled. For etiolated seedling samples, seedlings were grown for 3, 4, 5, and 6 days in darkness on petri dishes (½ Murashige and Skoog medium) without sugar and samples were pooled.

  • Plants exposed to different abiotic stresses/cues: cold, heat, flood, and time-of-day. Cold: 5-week-old rosettes grown in 12 h light/12 h dark and 20 °C were exposed to 4 °C at dusk for different lengths of time (12 h and 66 h) and samples were pooled; Heat: 5-week-old rosettes and 12-day-old seedlings grown in 16 h light/8 h dark at 23 and 20 °C, respectively, were exposed to high temperatures (27 and 37 °C, respectively) for different lengths of time (1 week and 12 h, respectively), harvested (4 h after dawn) and pooled. Flooded: 5-week-old rosettes grown on soil with 16 h light/8 h dark at 23 °C were either flooded or completely submerged under water for two different time exposures (24 h and 6 days) and pooled; time-of-day: 5-week-old rosettes were grown under 12 h light/12 h dark at 20 °C and were harvested at dawn and 6 h after dawn.

  • UV-C treatment was done as follows: Col-0 seedlings were grown on ½ Murashige and Skoog agar plates at 22 °C under 12 h light/12 h dark conditions until the first pair of true leaves was expanded (9 days after germination). The ultraviolet treatment was performed using a Stratalinker (Stratagene) at 254 nm with 1 kJ/m2. Subsequently, seedlings were incubated in either light or dark. Whole seedlings were collected after 1 and 4 h of incubation and frozen in liquid nitrogen. Equal amounts of RNA from UV-C treated samples were pooled.

  • Plants infected with different pathogens: Botrytis cinerea, Hyaloperonospora arabidopsidis, and Pseudomonas syringae. For B. cinerea infection, detached 5-week-old Arabidopsis (Col-0) leaves (grown at 22 °C, 12 h light/12 h dark, 60% humidity) were placed on agar, and inoculated with 5 × 7 μL droplets of 100,000 spores per mL in 50% grape juice. Infected trays were sealed and kept at 22 °C, 12 h light/12 h dark, 80% humidity. Samples (two infected leaves) were collected by flash freezing in liquid nitrogen at 24, 30, and 36 h post-inoculation. For H. arabidopsidis infection, 14-day-old Col-0 seedlings (grown at 22 °C, 12 h light/12 h dark) were sprayed with 30,000 spores per mL in water of Hpa isolate Noks1, 15 mL per P40 tray (0.375 mL per module), sealed and grown at 18 °C, 12 h light/12 h dark. Infected seedlings were harvested at 4, 5, and 7 days post-inoculation and flash frozen in liquid nitrogen. RNA was extracted and RNA samples pooled within each pathogen (final pool included 2 samples per time point). For P. syringae infection, 3-week-old plants were infected with P. syringae pv tomato DC3000 by infiltrating three leaves of five plants with 2 × 105 cfu/ml at ZT2 (12 h light/12 h dark). Infiltrated leaves were harvested 8 h and 24 h post-infiltration. RNA was extracted from both time-points and pooled.

Material from RNA-processing/degradation mutants (NMD and exosome) and nuclei

Mutants were an NMD double mutant combining the heterozygote of lba1 (upf1) and knockout upf3-1 and exosome mutants: xrn3-3, xrn4-6, and xrn2-1. Seedlings were grown on petri dishes and those of the exosome mutants pooled together. Nuclei were prepared from leaves of 5-week-old plants.

RNA extraction and library construction

For the majority of samples, RNA was isolated with the RNeasy plant mini kit (QIAGEN – including on-column DNase I treatment) according to the manufacturer’s instructions. RNA was extracted from etiolated seedlings, the NMD double mutant and nuclear extracts with the Universal RNA purification kit (EURx). PacBio non-size selected Iso-seq libraries were constructed using Lexogen Teloprime, Teloprimev2 or Clontech kits following the manufacturer’s instructions (Additional File 1: Table S1). Each of the 27 libraries was sequenced on a single SMRT cell (1 M,v3 for Teloprimev2 and Clontech) on a PacBio Sequel machine using a 10-h (v3) movie.

Analysis of PacBio Iso-seq reads

The workflow of the analysis is shown in Fig. 1. The PacBio sequencing data was analyzed using the PacBio IsoSeq3 pipeline to generate and map full-length non-chimeric (FLNC) reads. Further analysis was performed using TAMA [7] (https://github.com/GenomeRIK/tama) to collapse and merge reads/transcripts and apply novel methods to define splice junctions (SJs) and transcript start and end sites (see below).

Processing of raw PacBio IsoSeq reads to FLNCs

The raw PacBio sequencing data (.subreads.bam) from each library was processed individually using the following procedures: (1) CCS calling was carried out using ccs 4.0.0 using the following parameters: --min-rq 0.9 -j 28. (2) Primer removal and demultiplexing was carried out using lima (version v1.10.0) with the parameters: --isoseq --peek-guess. (3) IsoSeq3 (v3.2.2) refine was used to trim poly(A) tails and for rapid concatemer identification and removal to produce the FLNC transcripts (Fig. 1A). For the Clontech libraries, --require-polya is used while for Teloprime 5′ captured reads, lima is run with this parameter turned off. We have deliberately avoided the clustering steps in the IsoSeq3 pipeline in order that small variances around the splice junctions, such as NAGNAG splice junctions can be preserved. The FLNCs were then converted to FASTA format using samtools and mapped to the TAIR10 genome reference using minimap2 (version 2.17-r941) using the following parameters -ax splice:hq -uf -G 6000. The mapping files (bam files) were then sorted and the non-mapped reads were filtered out.

Splice junction-centric approach for accurate splice junctions

From this point, we adopted the TAMA analysis pipeline for the next steps of transcript isoform analysis (Fig. 1B). To overcome the generation of false splice junctions due to mis-mapping of FLNCs to the genome, we developed a splice junction-centric approach to provide highly accurate alignment around splice junctions. An improvement of TAMA was developed that allowed us to examine the mapping mismatches (replacement and indels) between the FLNCs and genome reference. Using this new parameter (-sjt and -lde), we were able to extract the mapping details of any defined regions around the SJs. For each library, we ran the TAMA collapse using the following parameters “-d merge_dup -x no_cap -m 0 -a 0 -z 0 -sj sj_priority -lde 30 -sjt 30” so that (1) small variations of up to 1 nt at SJs, as well as transcription start and end sites, are preserved in the FLNC reads and (2) mapping details of 30 nt around each SJ were extracted. Then TAMA merge (merged -m 0 -a 0 -z 0 -d merge_dup) was used to merge all the transcripts from the libraries and all the redundant FLNCs were removed, while the small variations up to 1 nt at SJs, as well as transcription start and end sites, were preserved in the merged FLNC reads. To accurately determine splice junctions, we examined the high-resolution alignment information around the SJs and found that high-confidence SJs are always supported by at least one alignment with a perfect match between the FLNCs and the genome reference around the SJ. SJs were also compared to those of AtRTD2 and their sequences assessed using position weight matrix—PWM [78]. To derive a list of high-confidence SJs (Fig. 1C) (and thereby identify falsely aligned SJs), our SJ-centric approach employed the following criteria: (1) the presence of canonical splice junction motifs and (2) no mapping mismatches including substitution, deletions, and insertions, with 10 nt around the SJs with support of at least one read.

Determination of transcription start and end sites

For high-abundance genes, we assume that Iso-seq reads with authentic TSS/TES sites would be sequenced more often than those representing degradation products where start/end sites will occur randomly. We can use the binomial distribution to estimate the probability of having m Iso-seq reads underpinning one specific start/end by random.

For m Iso-seq read starts/ends at n genomic locations, with the assumption that the starts/ends of the degraded Iso-seq reads are random, we assume the probability of each read to have a start/end at particular genomic location (p) is equal among all the read start/end locations, thus \( p=\frac{1}{n} \). The probabilities of having k reads at one genomic location at random can be calculated as a bionomial probability \( \mathit{\Pr}\left(k,m,p\right)=\left(\genfrac{}{}{0pt}{}{m}{k}\right)\ {p}^k{\left(1-p\right)}^{m-k} \). A smaller probability would indicate that the start/end genomic location is unlikely to have such a number of reads (low and high) at random. We are interested in identifying the non-random start locations that have higher numbers, so we have applied the following criteria: (1) k should be higher than the average reads for all genomic locations for that gene \( >\frac{m}{n} \); (2) the probability of having k of reads at one genomic location should be small with Pr(k, m, p) < 0.05.

We define the 5′ location of the long read as RSGLs and 3′ location as REGLs. The non-random RSGLs and REGLs with higher-than-expected numbers of reads are defined as significant RSGLs and REGLs, which are likely to be TSS/TES sites. Additionally, we removed REGLs which could be a result of off-priming identified by the REGLs being followed by poly(A) sequences in the genome.

For low-abundance genes where we could not detect significant RSGLs and REGLs, we applied a different set of criteria. Reads were compared and a significant start or end site required at least two long reads supporting that site within a sliding window of 11 nt (5 nt on each side).

To account for the stochastic nature of the TSS/TES, a 100-nt window around significant RSGLs and RSGLs were defined as high-confidence TSS/TES regions. All the merged FLNCs from all of the libraries were then filtered based on the high-confidence SJs and high-confidence TSS/TES regions (Fig. 1C). Transcripts containing SJs, TSS, and TES which did not match the high-confidence set were removed. To generate high-level transcripts, transcripts with small variances in 5′ and 3′ UTR lengths were removed by further collapsing transcripts by running the TAMA merge on the filtered FLNCs using “-m 0 -a 50 -z 50 -d merge_dup” that allows transcripts with variations within 50 nt at UTR regions to be merged. Thus, to achieve accurate transcript isoforms from the PacBio data and generate AtIso (Fig. 1C), we have adopted a strategy that seeks evidence to support all SJs, TSSs, and TESs. Finally, to increase the gene coverage using existing annotations and make the maximum use of the Iso-seq long reads, we retained genes that overlapped with Araport annotation on the same strand (> 50%). These were combined with the genes with TSS/TES support to generate the final set of genes and transcript in AtIso.

TSS and TES motif enrichment analysis

To search for known TSS/TES-related motifs around significant RSGL and REGLs as well as the identified loci of interest in other datasets (potential TSS and TES sites), the following approach was taken. A number of motifs associated with TSS and TES sites were identified (Additional File 1: Table S6). For each identified TSS and TES, the sequence within ±500 nucleotides on each side was extracted from the genome. A regular expression search was carried out in the extracted sequences searching for the known enriched motifs related to TSS and TES. All matching motifs and their positions relative to the site of interest were extracted. From this, the number of instances of the motif was calculated for every position ±500 nucleotides relative to the TSS/TES. As a control, the same number of random sites were taken, and the above analysis was carried out.

Construction of AtRTD3

AtIso represents the most accurate and extensive representation of Arabidopsis transcripts to date. To overcome the low coverage of genetic regions and the lack of transcript diversity in genes with low expression, we integrated the transcripts from short-read assemblies AtRTD2 and Araport into AtIso to generate the comprehensive transcriptome, AtRTD3. In AtRTD3, we kept all the transcripts from AtIso and only introduced transcripts from AtRTD2 and Araport that (1) contained novel SJs (AtRTD2 and Araport) or (2) covered genomic loci in Araport not covered by Iso-seq. The novel SJs were identified in a pairwise fashion in sequential order by, firstly, comparing AtIso and AtRTD2, extracting the transcripts in AtRTD2 with novel SJs that were not in AtIso, and, secondly, repeating the process with transcripts in Araport containing unique SJs (not in AtIso and AtRTD2). The transcripts from Araport covering novel loci that did not overlap with AtIso are also extracted. Finally, all the extracted transcripts mentioned above were merged together with AtIso using TAMA merge (-m 0 -a 50 -z 50 -d merge_dup).

During merge, we give Iso-seq assembled transcripts the highest priority by setting the “cap_flag” as “capped” and “merge_priority” as “1,1,1”, indicating 5′ TSS, splice junctions as well as 3′ TES of Iso-seq assembly all take highest priority during merging. For short-read assemblies, we label “merge_priority” as “uncapped” and “merge_priority” as “2,1,2”. This means that only the SJs were given top priority as they have been validated by short reads. 5′ TSS and 3′ TES from the short-read assembly would be lower priority and contribute less to the determination of the TSS and TES when merging with Iso-seq transcripts.

Annotation of AtRTD3

To annotate AtRTD3, we examined the overlaps of AtRTD3 transcripts with Araport gene annotations using bedtools (intersect -wao). Transcripts were assigned to the Araport genes if they overlap on the same strand (where the overlap covers > 30% of either transcripts). Transcripts that overlap two Araport genes on the same strand would be assigned a gene ID with two concatenated gene names (e.g., AT1G18020-AT1G18030). This allows the identification of biological chimeric transcripts that run-through two or more genes. The origin of these transcripts (AtIso, AtRTD2, or Araport11) are also added in the bed annotation to allow users to distinguish high-confidence transcripts from long-read assemblies from less confident transcripts from short-read assemblies.

Identification of non-stop RNAs in AtRTD3

Transuite outputs the start and end coordinates for both coding sequences and transcripts. For non-stop RNA transcripts, translation proceeds to the end of the transcript so the end coordinate of the CDS would be close to the end coordinate of the transcript (< 3 nts). Firstly, transcripts where the end coordinates of the CDS and that of the same transcript were within 3 nt were extracted. Secondly, any transcripts which contained a stop codon at the end of the transcript (in the last 5 nt) were removed. Thirdly, the coordinates of the longest TES for each of the above gene was compared to the coordinates of the transcripts with no stop codon and if the difference was larger than 100 nt, then transcripts were classified as having premature polyadenylation and missing a stop codon and therefore as a non-stop RNA. Finally, any non-protein-coding genes (e.g., novel transcribed regions, antisense RNAs, pseudo genes) were removed.