Pilot test on 70S ribosome shows the benefits of denoising
We first tested the effect of denoising on the alignment and grouping of particle images. Because we found that using denoised particles directly in 2D classification could not generate reliable results, we propose a strategy based on the following heuristics for utilizing the denoised information. In this experiment, we used the 2SDR method to generate denoised surrogates of the 5000 70S ribosome particles. From the set of denoised particles, we randomly picked five as reference particles. For each reference in Supplementary Fig. 1 (column (a)), we searched the denoised set for the twenty most similar particles using the FRM2D algorithm13. The best alignment parameters (rotation angle and translational shifts in the x and y directions) were recorded for each particle but applied to the original non-denoised particles to generate an aligned average. As shown in Supplementary Fig. 1 (column (c)), these averages resemble the projections of the 70S ribosome (column (d)) and display more details than those obtained from the control experiment without denoising (column (b)). Closer investigation reveals that the particle set found using denoising does not overlap well with that found without denoising. We therefore designed a simulation study that allows the occurrence of true positives to be measured. With simulated noisy images of the 70S ribosome (SNR = 0.01, defocus range 1.5–2.0 μm) prepared as described in Supplementary, we found that the frequencies of true positives were 11.2% and 3.5% with and without denoising, respectively (Supplementary Fig. 2). With the SNR increased to 0.05, the frequencies increased to 94.4% and 61.5%, respectively (Supplementary Fig. 3). We further tested the effect of defocus: with the SNR kept the same (0.05) but the defocus lowered to 1.0–1.5 μm, the frequencies dropped to 93.7% and 51.3% (Supplementary Fig. 4).
These results show that the chances of grouping identical or similar particles are markedly higher when denoising is used, and that the gain from 2SDR is more pronounced when the SNR is lower or the defocus is smaller.
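The heuristic of this pilot test (rank particles and estimate alignment on the denoised images, then average the raw images) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the similarity scores and alignment parameters are assumed to come from an external routine such as FRM2D, the function names `rotate_shift` and `aligned_average` are hypothetical, and the nearest-neighbour rotation with an arbitrary sign convention merely stands in for proper interpolation.

```python
import numpy as np

def rotate_shift(img, angle_deg, dx, dy):
    """Rotate img about its centre, then shift by (dx, dy) pixels.

    Nearest-neighbour resampling via inverse mapping; pixels that map
    outside the source image are set to zero.
    """
    n_rows, n_cols = img.shape
    cy, cx = (n_rows - 1) / 2.0, (n_cols - 1) / 2.0
    yy, xx = np.mgrid[0:n_rows, 0:n_cols]
    # Undo the shift, then undo the rotation, to find each source pixel.
    x = xx - cx - dx
    y = yy - cy - dy
    t = np.deg2rad(angle_deg)
    xs = np.cos(t) * x + np.sin(t) * y + cx
    ys = -np.sin(t) * x + np.cos(t) * y + cy
    xi = np.rint(xs).astype(int)
    yi = np.rint(ys).astype(int)
    inside = (xi >= 0) & (xi < n_cols) & (yi >= 0) & (yi < n_rows)
    out = np.zeros(img.shape, dtype=float)
    out[inside] = img[yi[inside], xi[inside]]
    return out

def aligned_average(raw, scores, params, k=20):
    """Average the k raw particles whose denoised surrogates best match a reference.

    raw    : (N, H, W) original, non-denoised particle images
    scores : (N,) similarity of each denoised particle to the reference
             (assumed to be computed elsewhere, on the denoised set)
    params : (N, 3) angle in degrees, dx, dy estimated on the denoised set
    """
    top = np.argsort(scores)[::-1][:k]          # indices of the k best matches
    stack = [rotate_shift(raw[i], *params[i]) for i in top]
    return top, np.mean(stack, axis=0)
```

With `k=20` this mirrors the twenty-particle averages described above; the denoised images influence only which particles are selected and how they are aligned, while the pixel values in the average come entirely from the raw data.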
Pre-Pro can be coupled with RELION to give better results
To test whether this pre-processing can be coupled with a 2D classification algorithm, we used two small experimental datasets: the 70S ribosome used in the pilot experiment and beta-galactosidase (see “Methods”). We compared the result of feeding the images re-positioned by the pre-processing to a classification algorithm with that of using the original images without pre-processing. To evaluate the performance of a classification process, we used three performance indices: the number of good classes, the initial model resulting from the class averages, and the time spent on classification.
We began our test with RELION since it has been used to solve the largest number of cryo-EM structures deposited in the PDB. When the pre-processing is used prior to RELION, we abbreviate the procedure as P-RELION. For the two small datasets, we used RELION (2.0) with the number of classes set to 50 and the remaining parameters left at their default values. The class averages were sorted according to the quality index “rlnClassDistribution”, which places the most populated classes in the top rows (Fig. 2). Usually, these classes coincide with the classes of highest quality, so class quality can be evaluated by inspection starting from the top row. As shown in Fig. 2(b), the clear classes of the 70S ribosome obtained through P-RELION spill over into the third row. By contrast, without the pre-processing, there are still blurred classes in the second row (Fig. 2(a)). These apparent improvements are consistent with the statistics that report the accuracy in translations and rotations (see Supplementary Fig. 5). Overall, the number of clear classes increased by 30% for the 70S ribosome and 40% for beta-galactosidase. The yield of resulting particles increased only marginally, from 90% to 94% for the 70S ribosome and from 78% to 84% for beta-galactosidase. Together, these findings suggest that, with the aid of the pre-processor, the homogeneity of each class is potentially improved because a similar number of particles is dispersed into more classes. We further used the class averages to calculate the initial model of beta-galactosidase, for which we used PRIME14 because of its speed and robustness. Strikingly, in the absence of a symmetry constraint, the 3D model from the good classes generated by P-RELION (Fig. 2(f)) displays the symmetry character of beta-galactosidase, whereas the model from the class averages of RELION without the pre-processing does not (Fig. 2(e)).
Importantly, the model from P-RELION matches the golden 3D model better, as judged by the FSC between the initial model and the golden model (Fig. 2(g)) and by docking of an atomic model of beta-galactosidase (Supplementary Fig. 6). These tests demonstrate that the pre-processing can be successfully coupled with RELION to improve the classification results and yield a better initial model. We also tested RELION (3.0)15, which is 1.6X faster than RELION (2.0), and obtained similar findings.
Although RELION classification has been accelerated by GPU parallelism, we were still curious whether the pre-processing would have any measurable impact on the speed of classification. The overall time, as documented in Table 1, remains roughly the same with the pre-processing because it is set by the default number of iterations, which is 25. To explore whether the pre-processor could further accelerate RELION 2D classification, we examined how the particle yield and the initial model of beta-galactosidase evolved over the iterations. As shown in Supplementary Fig. 7, P-RELION with 10 iterations gives a particle yield and an initial model comparable to those from RELION with 25 iterations.
Pre-Pro makes ISAC faster and saves closer-to-focus particles
To test ISAC, we set the size limit of each class to 100 for both datasets while leaving the other parameters at their default values. When the pre-processing is used prior to ISAC, we abbreviate the procedure as P-ISAC. With the pre-processing, the number of stable classes increased from 40 to 45 for the 70S ribosome and from 37 to 41 for beta-galactosidase (Fig. 3). The pre-processing increased the occupancy in many classes for both datasets, as evidenced by the histograms in Supplementary Figs. 8 and 9a, d. The yield of particles for the 70S ribosome increased from 78% to 88%, while that for beta-galactosidase increased from 58% to 67%. Since ISAC was reported to have a tendency to lose lower-defocus particles11, we investigated the distribution of the defocus values of the harvested particles. Because each particle in this set was labeled with a defocus value, we searched for the median value and found that 2.5 μm was a good approximation. As shown in Supplementary Fig. 10, the yield of smaller-defocus particles (<2.5 μm) from ISAC is 73%, lower than the 85% yield of larger-defocus particles (>2.5 μm). Interestingly, with the addition of the pre-processor, the yield of smaller-defocus particles increased to 81% and that of larger-defocus particles to 90%, suggesting the potential of the pre-processor to save more closer-to-focus, lower-contrast particles.
ISAC is known to offer high-quality classes11, so one would not anticipate a significant improvement of the initial model from the extra pre-processing. Nonetheless, we proceeded to calculate the initial models of beta-galactosidase from the class averages. The findings show that the initial model from P-ISAC is better than that from ISAC (Fig. 3(e)).
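The defocus-dependent yield described above amounts to splitting the particles at a threshold, here the median defocus, and computing the retained fraction on each side. A minimal sketch under stated assumptions: each particle carries a defocus value in μm and a flag saying whether it survived classification, and the function name `yield_by_defocus` is hypothetical.

```python
from statistics import median

def yield_by_defocus(defocus_um, kept, threshold_um=None):
    """Return (threshold, yield below threshold, yield above threshold).

    defocus_um   : per-particle defocus values in micrometres
    kept         : per-particle booleans, True if retained after 2D classification
    threshold_um : split point; defaults to the median defocus of the set
    """
    if threshold_um is None:
        threshold_um = median(defocus_um)
    low = [k for d, k in zip(defocus_um, kept) if d < threshold_um]
    high = [k for d, k in zip(defocus_um, kept) if d > threshold_um]

    def fraction(flags):
        # Yield is the share of particles in the group that were retained.
        return sum(flags) / len(flags) if flags else float("nan")

    return threshold_um, fraction(low), fraction(high)

# Toy data only, not taken from the datasets in this study.
toy_defocus = [1.0, 2.0, 2.5, 3.0, 4.0]
toy_kept = [True, False, True, True, True]
print(yield_by_defocus(toy_defocus, toy_kept))
```

Particles lying exactly on the threshold are left out of both groups in this sketch; a real analysis would need to decide where they belong.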
Concerned about the time consumed by ISAC, we measured the time spent on classifying the two small datasets and found that the pre-processor reduced the time for 2D classification by approximately 30–40% (Table 1). Because 2D classification consumes approximately 60% of the time of the whole workflow for the small datasets, this corresponds to a saving of about 20% for the entire workflow.
Using the beta-galactosidase particles harvested through 2D classification, we further performed 3D refinement using the initial model and found that better final 3D results could be obtained with the pre-processor (Table 1).
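The step from a 30–40% saving in 2D classification to roughly 20% for the whole workflow is a simple product of two fractions. The short sketch below only restates that arithmetic with the approximate figures quoted in the text; the function name is hypothetical.

```python
def workflow_saving(step_share, step_saving):
    """Fraction of total workflow time saved when one step gets faster.

    step_share  : share of the total workflow time taken by the step
    step_saving : fractional time reduction achieved within that step
    """
    return step_share * step_saving

# 2D classification takes about 60% of the workflow for the small datasets.
for saving in (0.30, 0.40):
    total = workflow_saving(0.60, saving)
    print(f"{saving:.0%} faster classification -> {total:.0%} of the workflow saved")
```

The two bounds come out at 18% and 24%, which brackets the approximate 20% quoted above.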
Pre-Pro on 80S ribosome is cost-effective and lossless
Because the resolution attainable from the small datasets is limited by the particle number, the pixel resolution, or the quality of the data, we set out to assess the pre-processor using large datasets that contain information at near-atomic resolution. To this end, we first chose two datasets: the 80S ribosome (EMPIAR-10028)16 and the TRPV1 ion channel (EMPIAR-10005)17, both of which contain a large number of good particles that were reconstructed to better than 3.5 Å to support de novo building of atomic models.
The 80S ribosome particles, isolated from a malaria parasite (Plasmodium falciparum) and drugged with emetine16, are large and mostly rigid. This dataset contains a total of 105,247 particles and was processed by RELION, with radiation damage compensated by B-factor weighting, to report a structure with an average resolution of 3.2 Å (0.83 Nyquist), where the resolution of the 40S subunit is lower than the average16. Because the overall resolution is near the Nyquist limit, we do not expect a significant advance in the attainable resolution from this dataset. Concerned about the computational cost of an algorithm on such a large set, we first measured the time spent on the pre-processing. The denoising step and the 2D reference-free alignment step (Fig. 1(b)) took only 5 min and 50 min, respectively. In this section, we re-curated this large set with 2D classification and then calculated a 3D structure from the resultant particles using CryoSparc 3D refinement18 guided by the initial model generated from the 2D class-average images.
To perform RELION classification on this 80S dataset, we set the prescribed number of classes to 100, 200, or 520. RELION classification took a total of 12–32 h to complete (Table 1). When the pre-processing was included, the number of clear classes from RELION increased by roughly 10–20% (Table 1, Supplementary Figs. 11 and 12). Note that the increase in good classes does not markedly increase the yield of particles: with the prescribed number of 100, the yield changes from 99.0% to 99.4%. We further performed 3D reconstruction using the harvested particles in their original form, neither down-sampled nor re-positioned, for both cases with and without the pre-processing. As shown in Fig. 4(e) and Supplementary Fig. 13, the particle set obtained from P-RELION classification with 100 or 200 classes provides a 3D structure with an overall resolution of 3.12 Å (0.86 Nyquist).
To facilitate the test of ISAC on the 80S ribosome dataset, we increased the class size limit to 200 and binned the images by a factor of 4 (4X down-sampling) to reduce the image size to 90 × 90 pixels, since the original 80S ribosome images are very large. ISAC succeeded in classifying such a large dataset and produced 520 stable classes, but the process took a total of 124 h. Interestingly, adding the pre-processing saved 20 h (16% of the total time) (Table 1). In addition, the pre-processing helps ISAC produce only a few more stable classes from this large set (Table 1, Supplementary Fig. 14), and the change in particle yield is also insignificant, from 97.7% to 98.2% (Fig. 4(e): column 4). Regarding the final 3D structure, both ISAC and P-ISAC led to the same overall resolution of 3.10 Å (0.87 Nyquist).
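The resolutions in this section are quoted both in Å and as a fraction of the Nyquist frequency. As a consistency check, each pair implies a Nyquist limit (resolution multiplied by the fraction) and hence a pixel size of half that value. The sketch below uses only the three pairs quoted in the text; the pixel size is inferred rather than taken from the dataset metadata, and the 4X-binned limit is included only to show how much coarser the images used for ISAC classification are.

```python
def implied_nyquist(resolution_A, nyquist_fraction):
    """Nyquist limit in Å implied by a resolution quoted as a fraction of Nyquist.

    A resolution d reported at a fraction f of the Nyquist frequency
    corresponds to a Nyquist-limited resolution of d * f.
    """
    return resolution_A * nyquist_fraction

# (resolution in Å, fraction of Nyquist) as quoted for the 80S ribosome maps.
reported = {
    "original RELION": (3.20, 0.83),
    "P-RELION": (3.12, 0.86),
    "ISAC and P-ISAC": (3.10, 0.87),
}

for label, (res, frac) in reported.items():
    nyq = implied_nyquist(res, frac)
    pixel = nyq / 2.0    # Nyquist limit is twice the pixel size
    binned = 4.0 * nyq   # 4X binning makes the pixel, and the limit, 4 times larger
    print(f"{label}: Nyquist {nyq:.2f} Å, pixel {pixel:.2f} Å, 4X-binned Nyquist {binned:.1f} Å")
```

The three implied limits agree to within the rounding of the quoted fractions, which suggests the fractions refer to the unbinned pixel size used for the final 3D refinement.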
Since the quality of a cryo-EM map can vary from site to site, we used the local resolution method19 to further evaluate the maps. As shown by the heat maps provided by CryoSparc’s local resolution program, which re-implements the blocres20 program with GPU acceleration (Fig. 4(h)), the best maps exhibit a broad range of resolutions. Most parts of each map are resolved close to or better than 3.0 Å (deep blue), whereas the low-resolution (red) and medium-resolution (white) regions are sparsely distributed; most of them are localized to the 40S subunit, highlighted in the lower-left corner of Fig. 4(h). Guided by the heat maps, we found that the pre-processor could introduce noticeable modifications in the density map at some flexible elements, for example the protruding stalk of the 60S subunit and some parts of the 40S subunit.
In summary, the particle harvest tests on the 80S ribosome dataset demonstrate that our pre-processing can help preserve virtually the entire set of highly homogeneous particles. Importantly, the computational cost of the pre-processing on this large dataset is extremely low: the pre-processing takes less than 1 h, in contrast to the tens of hours used by RELION 2D classification and the much longer time used by ISAC.
Pre-Pro enhances map interpretability of curated TRPV1
The particles of the TRPV1 ion channel, cloned from rat (Rattus norvegicus) and expressed in and purified from a human cell line17, represent a tough dataset because the particle is smaller (300 kDa) and its protein features are obscured by the amphipol molecules on the surface. The collection of 35,645 particles is a highly curated set obtained using 2D and 3D classifications, from which a structure of 3.4 Å (0.70 Nyquist) was reported17. Note that the resolution of this very set was extended to 3.3 Å by reprocessing using CryoSparc18. In this test, we re-curated this set with 2D classification and used the resultant particles to calculate a 3D structure using CryoSparc18, guided by the initial model generated from the class averages.
To optimize RELION classification of the TRPV1 dataset, we screened three prescribed class numbers: 50, 100, and 200. Compared to the 80S ribosome (Fig. 4(e)), the yields of the TRPV1 particles vary considerably with the class number and span a broad range (62–77% of 35,645) (Fig. 5(e): column 1–3), while the overall resolutions of the resulting 3D structures vary little with the class number (Fig. 5(f): column 1–3). Note that the highest resolution, 3.31 Å, which is virtually the same as that from full CryoSparc processing, was obtained from the smallest number of particles, approximately 22,000 (62% of 35,645) (Fig. 5(e): column 1), suggesting the existence of a finer subset. When the pre-processing was added prior to RELION, the overall resolutions improved by 0.06–0.09 Å (Fig. 5(f): column 1–3; Table 1), lifting the resolutions beyond 3.3 Å. Interestingly, larger improvements were associated with the cases of lower resolution (Fig. 5(f): column 2; Table 1).
We then tested ISAC classification on the TRPV1 dataset, for which we set the class size to 200. The ISAC run failed to converge, so no class could be obtained (Fig. 5(e): column 4), yet this failure could be rescued with our pre-processing, which produced 67 classes (Supplementary Fig. 15) containing a total of 6,254 particles (Fig. 5(e): column 4) and yielded a 3D structure of 3.8 Å (Fig. 5(f): column 4). These classes are only half-filled, and top views are infrequent among them. To test whether increased contrast would help restore ISAC, we down-sampled the TRPV1 particle images by 2X. With this down-sampling, ISAC was rescued, producing more than a hundred classes (Supplementary Fig. 15) that contained approximately 12,000 particles (Fig. 5(e): column 5). However, the success of this restoration cannot be attributed entirely to the increased contrast, because contributions from confounding factors such as the reduced image dimension cannot be ruled out. When the 2X down-sampling was combined with the pre-processing, the total number of classes remained the same, but the occupancy in each class increased, yielding more than 20,000 particles (Fig. 5(e): column 5), approximately twice as many as without the pre-processing (Supplementary Fig. 15). In addition, with the aid of the pre-processing (Table 1), the time ISAC spent on the 2X down-sampled dataset was cut from 78 to 43 h, a reduction of 45%, which results in a 40% time saving for the entire workflow. Finally, when P-ISAC was applied to the 3X down-sampled data, a 3D structure of unprecedented resolution was produced for this dataset: 3.20 Å (0.75 Nyquist) (Fig. 5(f): column 6, Supplementary Fig. 15). As with RELION, the tests on ISAC show that larger improvements from the pre-processing are associated with the cases of lower resolution (Fig. 5(f): column 2; Table 1).
By comparing these best maps, we found that the pre-processor could introduce slight modifications in the density map of TRPV1, for example in the cytoplasmic region that includes the beta-sheets (red arrow in Fig. 5(h)) and the ankyrin repeats (blue arrow in Fig. 5(h)). Noticeable changes become evident when we zoom in on the map. In a protein loop of 13 residues (456–468), which exhibits gaps in the original map from RELION, the density of residues 464–468 is restored in the P-RELION map (indicated by a brown arrow in Fig. 5(i)) and in the ISAC map as well (Fig. 5(i)), while that of residues 456–458 is further restored in the P-ISAC map (green arrow in Fig. 5(i)).
Compared to the 80S ribosome, classifying this TRPV1 dataset of increased heterogeneity and lower contrast shows that fine-tuning the classification parameters leads to measurable improvements. Notably, when the pre-processor was added, the less optimized cases improved more, so that the final results became similar.
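The quoted time reductions follow directly from the run times. The sketch below recomputes the 45% figure for the 2X down-sampled TRPV1 set and, for comparison, the 16% figure quoted earlier for the 80S ribosome (20 h saved out of 124 h); the function name is hypothetical.

```python
def relative_reduction(before_h, after_h):
    """Fractional reduction in run time when going from before_h to after_h hours."""
    return (before_h - after_h) / before_h

# ISAC versus P-ISAC run times in hours, as quoted in the text.
trpv1_2x = relative_reduction(78, 43)         # 2X down-sampled TRPV1
ribosome_80s = relative_reduction(124, 104)   # 124 h with 20 h saved

print(f"TRPV1 (2X): {trpv1_2x:.1%}")
print(f"80S ribosome: {ribosome_80s:.1%}")
```

Both values round to the percentages given in the text.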
Pre-Pro improves resolution of non-curated TRPV1 in nanodisc
So far, all tests on large datasets have been restricted to curated datasets, for which only marginal improvements in the overall resolution were gained. We suspected that the pre-processor could have a larger impact on non-curated datasets that contain contaminants. To this end, we further tested two datasets: the non-curated full dataset of TRPV1 (EMPIAR-10005) and a ligand-bound TRPV1 channel embedded in nanodisc (EMPIAR-10059). The non-curated set of TRPV1 is referred to as “NC-TRPV1” and the nanodisc-embedded TRPV1 set as “NanoD-TRPV1”. Unlike the NC-TRPV1 data, where the later frames were eliminated17, the radiation damage in the NanoD-TRPV1 data was compensated by dose-weighting21.
The original set of NC-TRPV1 downloaded from the EMDB database contains slightly more than 80,000 particles. To perform 2D classification, we used 200 classes as the prescribed number for RELION and 3X down-sampling for ISAC, as they were the best settings found in the tests on the curated set of 35,645 particles. RELION and ISAC give 50,620 and 43,690 particles, respectively, yielding two structures with indistinguishable resolutions of 3.57 and 3.56 Å (Table 1 and Supplementary Figs. 15 and 16). The pre-processor increases the respective numbers to 55,269 and 52,661 and improves the resolutions to 3.42 and 3.39 Å, respectively. These resolutions are comparable to that reached by curating the original set with 2D classification followed by 3D classification17. We noticed that another pass of P-RELION gave 42,868 particles and extended the resolution to 3.37 Å.
The NanoD-TRPV1 set downloaded from the EMDB database contains 218,787 particles, from which the authors selected 73,929 particles using 2D and 3D classifications to obtain a final structure of 2.95 Å (0.82 Nyquist)21. For this dataset, the largest in this study, we used RELION (3.0)15 to speed up the tests. We set the prescribed class number to 100 to save time on 2D classification. With this setting, RELION (3.0) 2D classification finished within 10 h. As is typical with RELION, similar fractions of particles were retained without and with the pre-processor: 70% (153,839) and 76% (166,236), respectively. However, the resolutions of the resulting structures differ substantially: 3.01 versus 2.86 Å, with the latter obtained through P-RELION (Supplementary Fig. 17). When an additional run of P-RELION was applied to the set of 166,236 particles, the resolution was extended to 2.82 Å (0.87 Nyquist). Comparing the RELION map with the two P-RELION maps (Fig. 6), improvement of the maps is evident in the cytoplasmic part, including the ankyrin repeats. In summary, the tests on non-curated datasets demonstrate the potential of the pre-processor to make a larger impact on more heterogeneous data.
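The particle yields quoted for NanoD-TRPV1 can be reproduced from the particle counts given in the text. The sketch below is plain arithmetic on those counts; the dictionary labels are descriptive only, and the fraction for the authors' own selection is computed here rather than stated in the text.

```python
TOTAL_PARTICLES = 218_787  # NanoD-TRPV1 particles before any curation

retained = {
    "authors' 2D and 3D selection": 73_929,
    "RELION 2D classification": 153_839,
    "P-RELION 2D classification": 166_236,
}

def particle_yield(kept, total=TOTAL_PARTICLES):
    """Fraction of the starting particles retained after curation."""
    return kept / total

for label, kept in retained.items():
    print(f"{label}: {kept:,} particles, {particle_yield(kept):.0%} of the set")
```

The RELION and P-RELION fractions reproduce the 70% and 76% quoted above.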