# Ion mobility collision cross-section atlas for known and unknown metabolite annotation in untargeted metabolomics

Aug 28, 2020

### Curation of the unified CCS database

A total of 5119 experimental CCS values were collected from 14 datasets, 4 independent labs, and 2 instrument platforms (Supplementary Table 1), which were reported in recent publications from 2015–2018. To curate the unified CCS database, each dataset was cleaned and standardized with a five-step procedure as follows (Supplementary Fig. 1).

1. Collection of meta information. For each CCS record, the chemical translation service57 [http://cts.fiehnlab.ucdavis.edu/] was used to generate chemical identifiers for compounds, such as InChIKey, CAS number, and PubChem CID. Then, the SMILES structure for each compound was generated using the R package rinchi [https://github.com/CDK-R/rinchi]. The compound formula and exact masses for different adducts (Supplementary Table 8) were calculated using the R package rcdk [https://cran.r-project.org/web/packages/rcdk/index.html]. Finally, the chemical classification for each compound was obtained using ClassyFire34 [http://classyfire.wishartlab.com/].

2. Quality check. CCS records were removed if they lacked a chemical structure, had an ion adduct not included in Supplementary Table 8, or had a large m/z error (>10 ppm). Then, within each dataset, inconsistent CCS records from the same instrument platform were also removed. For each ion adduct with more than one CCS record, the maximum difference between the CCS records was calculated. If the maximum difference was >0.5%, the related CCS records were removed. Otherwise, the averaged CCS value was calculated and assigned as the CCS record.

3. Outlier removal. CCS outliers were further removed using CCS trend lines, similar to the approach used for the CCS compendium26. The trend line of each super class (n ≥ 10) was fitted with a power function, and CCS records falling outside the 99% prediction interval were removed. A total of 103 CCS outliers were removed, and two examples of outliers were confirmed in a recent publication30 (Supplementary Fig. 2).

4. Calculation of unified CCS values. The CCS values from different instrument platforms were further merged as unified CCS values. The unified CCS value is an average of CCS values from different instrument platforms and is specific to the compound and its adduct. Specifically, for one ion adduct, if it had multiple CCS records obtained from DTIM-MS, the unified CCS value was the average of the DTIM-MS CCS records. Otherwise, the unified CCS value was calculated using all CCS records from different platforms. A total of 3539 unified CCS values were generated.

5. Assignment of confidence levels. For each unified CCS value, we assigned a confidence level using the following rules: Level 1: the unified CCS is calculated using experimental CCS records from ≥2 independent datasets in DTIM-MS instruments, and the maximum CCS difference is ≤1%; Level 2: the unified CCS is calculated using experimental CCS records from ≥2 independent datasets in different commercial instruments (DTIM-MS, TWIM-MS, or TIMS-MS), and the maximum CCS difference is ≤3%; Level 3: the unified CCS is only reported in one dataset from commercial instruments (DTIM-MS, TWIM-MS, or TIMS-MS); Conflict: the unified CCS is calculated using experimental CCS records from ≥2 independent datasets in different commercial instruments (DTIM-MS, TWIM-MS, or TIMS-MS), but the maximum CCS difference is >3%. All predicted CCS values were assigned as level 4 in AllCCS.
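The merging and confidence-level rules in steps 4 and 5 can be read as a small decision procedure. The following Python sketch is one possible reading, not the authors' implementation. It assumes that each dataset contributes a single, already-averaged record per ion adduct (the outcome of step 2), that the "maximum CCS difference" is (max − min)/mean × 100 (the text does not give the formula), and that records from ≥2 DTIM-MS datasets differing by more than 1% fall through to the Level 2 / Conflict rules. The function names `max_rel_diff` and `unify` are hypothetical.

```python
from statistics import mean


def max_rel_diff(values):
    """Spread of CCS values as a percentage of their mean (assumed definition)."""
    return (max(values) - min(values)) / mean(values) * 100.0


def unify(records):
    """Merge the records of one compound-adduct pair.

    records: list of (dataset_id, platform, ccs) tuples, one per dataset.
    Returns (unified_ccs, confidence_level).
    """
    dtim = [ccs for _, platform, ccs in records if platform == "DTIM-MS"]
    all_ccs = [ccs for _, _, ccs in records]

    # Step 4: average the DTIM-MS records when several exist,
    # otherwise average the records from all platforms.
    unified = mean(dtim if len(dtim) >= 2 else all_ccs)

    # Step 5: confidence level from dataset count and CCS spread.
    if len({dataset for dataset, _, _ in records}) < 2:
        return unified, "Level 3"
    dtim_datasets = {d for d, platform, _ in records if platform == "DTIM-MS"}
    if len(dtim_datasets) >= 2 and max_rel_diff(dtim) <= 1.0:
        return unified, "Level 1"
    if max_rel_diff(all_ccs) <= 3.0:
        return unified, "Level 2"
    return unified, "Conflict"
```

For example, two DTIM-MS records of 200.0 and 201.0 Å² from different datasets differ by about 0.5% under this definition, so the pair would be averaged to 200.5 Å² and labeled Level 1.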

### Training and validation sets for CCS prediction

AllCCS employed the unified CCS values for CCS prediction and validation. Specifically, 80% of the unified CCS values (1851 and 795 CCS values in positive and negative modes, respectively) were randomly selected as the training set (Supplementary Data 1). Here, we kept only the seven most common adducts ([M + H]+, [M + Na]+, [M + NH4]+, and [M + H-H2O]+ for positive mode; [M-H]−, [M + Na-2H]−, and [M + HCOO]− for negative mode), and removed CCS values with the confidence level of conflict. In addition, two datasets were used for performance validation: (1) external validation set 1 (metabolites and lipids) consists of 463 and 199 CCS values in positive and negative modes, respectively (Supplementary Data 2); (2) external validation set 2 (drugs and natural products) consists of 107 and 122 CCS values in positive and negative modes, respectively (Supplementary Data 3). Both validation sets were acquired using chemical standards on an Agilent DTIM-MS 6560. The acquisition of CCS values and the standard MS/MS spectra followed our previous publication27.

### Molecular descriptor calculation and selection

For each compound, a total of 221 molecular descriptors (MDs) were calculated using the SMILES structure and the R package rcdk. Among them, non-differential MDs were first removed. The missing values for the remaining MDs were imputed using the KNN algorithm. All MD values were normalized to Z-scores and subjected to selection using the recursive feature elimination with cross validation (RFECV) algorithm (Supplementary Fig. 19). To eliminate the scale effect of the training set, 50%, 60%, 70%, 80%, or 90% of the training set was used for RFECV. For each condition, RFECV was performed 200 times (1000 times in total). In each RFECV, the least important MD was recursively removed according to the coefficient of the LASSO regression via a tenfold cross validation. The MD combination with the highest score in the cross validation was kept. Finally, MDs with a frequency >700 in the 1000 RFECV replications were selected. In positive and negative modes, 15 and 9 MDs were selected, respectively (Supplementary Table 9). We also demonstrated that the selected MDs showed smaller prediction errors than those obtained from step-wise selection or random selection (Supplementary Fig. 20 and Supplementary Table 10). The Python package sklearn [https://scikit-learn.org/stable/] was used for RFECV.
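The final voting step, keeping MDs retained in more than 700 of the 1000 RFECV replications, can be sketched as follows. This is a minimal illustration that takes the per-replicate outputs as given and does not run RFECV itself; the function name `select_stable_descriptors` and the descriptor names in the usage example are hypothetical.

```python
from collections import Counter


def select_stable_descriptors(runs, min_count=700):
    """Return descriptors retained in more than `min_count` replicates.

    runs: iterable of collections, one per RFECV replicate, each holding
    the names of the descriptors that replicate kept.
    """
    # Count each descriptor at most once per replicate.
    counts = Counter(name for kept in runs for name in set(kept))
    return sorted(name for name, n in counts.items() if n > min_count)


# Toy usage with made-up descriptor names and 1000 replicates.
runs = []
for i in range(1000):
    kept = ["XLogP"]              # kept in all 1000 replicates
    if i < 700:
        kept.append("TPSA")       # kept in exactly 700 (not > 700)
    if i < 701:
        kept.append("nAtom")      # kept in 701
    runs.append(kept)
stable = select_stable_descriptors(runs)
```

With the strict ">700" threshold stated in the text, a descriptor kept in exactly 700 replicates is excluded.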

### Support vector regression-based CCS prediction

The support vector regression (SVR) algorithm was used to develop the CCS prediction model using the selected MDs and CCS values in the training set. The general workflow was similar to that in our previous publication29. Briefly, two hyper-parameters, the cost of constraints violation (C) and gamma (γ), were optimized from 105 combinations via a tenfold cross validation with 100 repeats. Seven C values ((0.001, 0.005, 0.025, 0.05, 0.1, 0.25, 0.5)/N_MD) and 15 γ values (2¹ to 2¹⁵) were set for parameter optimization, where N_MD represents the number of selected MDs. The radial basis function was employed as the kernel function. Finally, the hyper-parameter combinations were selected as follows: C, 0.1/15 and 0.1/9 in positive and negative modes, respectively; γ, 2⁸ and 2¹³ in positive and negative modes, respectively. As a result, MREs of 1.67% and 1.72% were obtained for the training set in positive and negative modes, respectively (Supplementary Table 11). In addition, the high γ values indicated that the optimized parameters move the SVR model towards a linear regression, yet the model performed better than multiple linear regression (Supplementary Table 12).
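The size of the search grid follows directly from the two value lists: 7 C values × 15 γ values = 105 combinations. The sketch below enumerates that grid in Python, assuming the γ range is 2¹ to 2¹⁵ in integer powers of two (15 values, which matches the stated total of 105). The names `C_FACTORS`, `GAMMA_EXPONENTS`, and `svr_grid` are hypothetical, and the cross-validated model fitting itself is not shown.

```python
from itertools import product

# Numerators of the seven C values; each is divided by the number of MDs.
C_FACTORS = (0.001, 0.005, 0.025, 0.05, 0.1, 0.25, 0.5)
# Assumed exponents for gamma: 2^1 ... 2^15.
GAMMA_EXPONENTS = range(1, 16)


def svr_grid(n_md):
    """Return every (C, gamma) pair for a model with `n_md` descriptors."""
    return [(c / n_md, 2.0 ** k) for c, k in product(C_FACTORS, GAMMA_EXPONENTS)]


positive_grid = svr_grid(15)  # 15 MDs selected in positive mode
negative_grid = svr_grid(9)   # 9 MDs selected in negative mode
```

Under this reading, the reported optima (C = 0.1/15 with γ = 2⁸, and C = 0.1/9 with γ = 2¹³) are members of the respective grids.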

### Representative structure similarity

The representative structure similarity (RSS) was calculated to characterize the structure similarity between the inputted structure and the training set (Supplementary Fig. 21). The molecular fingerprint of the inputted structure was first computed using the R package rcdk. Then, the structure similarity between the inputted structure and each structure in the training set was calculated using the Tanimoto coefficient (TC) as follows:

$$\mathrm{TC}_{(\mathrm{StrA},\mathrm{StrB})} = \frac{N_{\mathrm{StrA} \cap \mathrm{StrB}}}{N_{\mathrm{StrA}} + N_{\mathrm{StrB}} - N_{\mathrm{StrA} \cap \mathrm{StrB}}}$$

(1)

where N_StrA and N_StrB were the molecular fingerprints of structures A and B, respectively (counted as the number of bits set), and TC(StrA,StrB) was the TC between structure A and structure B. Here, structure A was the inputted structure and structure B was a structure in the training set. N_StrA∩StrB was the intersection of the fingerprints of structures A and B. Then, the RSS score of the inputted structure was calculated as the average of the top five TCs:

$$\mathrm{RSS}_{\mathrm{StrA}} = \sum\limits_{i = 1}^{5} \mathrm{TC}_i / 5$$

(2)

where RSS_StrA was the RSS of the inputted structure A, and TC_i represented the i-th highest Tanimoto coefficient.
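Eqs. 1 and 2 can be illustrated with a short Python sketch. It assumes that each fingerprint is represented as the set of indices of its "on" bits, so that N is the set size; the fingerprint computation itself (done with rcdk in the text) is not reproduced, and the function names `tanimoto` and `rss` are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices (Eq. 1)."""
    a, b = set(bits_a), set(bits_b)
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def rss(query_bits, training_fingerprints, top_n=5):
    """Average of the `top_n` highest Tanimoto coefficients against the training set (Eq. 2)."""
    scores = sorted(
        (tanimoto(query_bits, fp) for fp in training_fingerprints), reverse=True
    )
    if len(scores) < top_n:
        raise ValueError("training set must contain at least top_n structures")
    return sum(scores[:top_n]) / top_n


# Toy fingerprints; the bit indices are arbitrary.
query = {1, 2, 3, 4}
training = [{1, 2, 3, 4}, {1, 2}, {1, 2, 3, 4, 5, 6, 7, 8}, {5, 6}, {1}, {1, 2, 3}]
score = rss(query, training)
```

In this toy case the five highest coefficients are 1.0, 0.75, 0.5, 0.5, and 0.25, giving an RSS of 0.6.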

### Benchmark of CCS prediction performance

The generation of CCS values using MetCCS27, DeepCCS30, and ISiCLE31 for compounds in the external validation sets was performed as follows. For MetCCS, the webserver [http://www.zhulab.cn/MetCCS/] was used to predict CCS values. The inputted molecular descriptors of each compound were calculated by ChemAxon MarvinSketch (Version 16.10.24) and ALOGPS [http://www.vcclab.org/web/alogps/]. For DeepCCS, CCS values were calculated using the SMILES structures and the python package downloaded from the internet (https://github.com/plpla/DeepCCS, on April 2nd, 2019). For ISiCLE, CCS values generated from ISiCLE Lite v0.1.0 were directly downloaded from the webserver [https://metabolomics.pnnl.gov/ccs/] on March 11th, 2019. All CCS values were provided in Supplementary Data 4.

### AllCCS webserver

The AllCCS webserver is hosted on a Linux server from Alibaba Cloud and is freely accessible for non-commercial use via http://allccs.zhulab.cn/. The AllCCS webserver has three major functions: (1) the unified and predicted CCS databases, (2) CCS prediction, and (3) metabolite annotation. The predicted AllCCS database contains a total of 1,670,596 compounds and 11,697,711 predicted CCS values. These compounds were downloaded from the KEGG35, HMDB36, LMSD37, MINE38, DrugBank39, DSSTox40, and UNPD41 databases (Supplementary Table 13). The CCS prediction function provides a visual interface for users to predict CCS values from inputted SMILES structures. The metabolite annotation provides a feature match function to search the AllCCS database with experimental m/z and CCS values. In addition, it provides a candidate rank function to perform multi-dimensional annotation by integrating the annotation results from in-silico MS/MS prediction tools. The tutorial of AllCCS is available on the website.

### CCS match, MS/MS match, and multi-dimensional match

A trapezoidal score function was developed to measure the CCS match. First, candidates with CCS values exceeding the maximum tolerance were removed; then the CCS match score (S_ccs) was calculated using a trapezoidal function as in Eq. 3:

$$S_{\mathrm{ccs}} = \left\{ \begin{array}{ll} 1, & \Delta_{\mathrm{rela}} < \mathrm{TOL}_{\mathrm{min}} \\ 1 - \frac{\Delta_{\mathrm{rela}} - \mathrm{TOL}_{\mathrm{min}}}{\mathrm{TOL}_{\mathrm{max}} - \mathrm{TOL}_{\mathrm{min}}}, & \mathrm{TOL}_{\mathrm{min}} \le \Delta_{\mathrm{rela}} \le \mathrm{TOL}_{\mathrm{max}} \\ 0, & \Delta_{\mathrm{rela}} > \mathrm{TOL}_{\mathrm{max}} \end{array} \right.$$

(3)

where TOL_min and TOL_max are the minimum and maximum tolerances, respectively. The default values for TOL_min and TOL_max are 2% and 4%, respectively. Δ_rela is the relative CCS error (in percent), calculated as in Eq. 4:

$$\Delta_{\mathrm{rela}} = \frac{\left| \mathrm{CCS}_{\mathrm{Pred}} - \mathrm{CCS}_{\mathrm{Exp}} \right|}{\mathrm{CCS}_{\mathrm{Exp}}} \times 100$$

(4)
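The trapezoidal score and the relative error of Eqs. 3 and 4 translate directly into code. The Python sketch below uses the default tolerances of 2% and 4% stated in the text; the function names `relative_ccs_error` and `ccs_match_score` are hypothetical and are not taken from the AllCCS software.

```python
def relative_ccs_error(ccs_pred, ccs_exp):
    """Relative CCS error in percent (Eq. 4)."""
    return abs(ccs_pred - ccs_exp) / ccs_exp * 100.0


def ccs_match_score(ccs_pred, ccs_exp, tol_min=2.0, tol_max=4.0):
    """Trapezoidal CCS match score (Eq. 3); tolerances are in percent."""
    delta = relative_ccs_error(ccs_pred, ccs_exp)
    if delta < tol_min:
        return 1.0  # flat top of the trapezoid
    if delta <= tol_max:
        # Linear descent from 1 at tol_min to 0 at tol_max.
        return 1.0 - (delta - tol_min) / (tol_max - tol_min)
    return 0.0      # candidate exceeds the maximum tolerance
```

For an experimental CCS of 200.0 Å², a candidate at 201.0 Å² (0.5% error) scores 1, a candidate at 206.0 Å² (3% error) scores 0.5, and a candidate at 210.0 Å² (5% error) scores 0.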

The experimental MS/MS spectra and their possible candidates were imported into in-silico MS/MS tools to perform the MS/MS match. Three in-silico MS/MS prediction tools, namely MetFrag44, CFM-ID43, and MS-FINDER45, were used in this work. The format of the imported data was modified according to the requirements of each tool. The procedures are briefly described as follows: (1) MetFrag: the command line version MetFragCL (version 2.4.5-CL) was downloaded from https://ipb-halle.github.io/MetFrag/, and the parameter file was generated via the R package ReSOLUTION [https://github.com/schymane/ReSOLUTION]; (2) CFM-ID: the software version 2.4 was downloaded from https://sourceforge.net/projects/cfm-id/files/. The pre-trained model params_se_cfm and the parameter file param_output0.log were used. The predicted MS/MS spectra were provided in MSP format in Supplementary Data 5; (3) MS-FINDER: the software version 3.24 was downloaded from http://prime.psc.riken.jp/Metabolomics_Software/MS-FINDER/index.html, and run with the console. The detailed parameters of each tool were provided in Supplementary Table 14. The experimental MS/MS spectral library was downloaded from GNPS with a total of 13,499 compounds (https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp; accessed on May 23rd, 2020). The spectral match utilized reverse dot-product scores, and its parameters were kept the same as in our previous publication32.

Multi-dimensional match was performed by integrating the CCS match score and MS/MS match score as Eq. 5:

$$S_{\mathrm{integrated}} = W_{\mathrm{ccs}} \times S_{\mathrm{ccs}} + W_{\mathrm{MS/MS}} \times S_{\mathrm{MS/MS}}$$

(5)

where S_ccs and S_MS/MS are the CCS and MS/MS match scores, respectively. Here, S_MS/MS is the similarity between the experimental MS/MS and in-silico MS/MS spectra, which is obtained from different in-silico MS/MS tools with different scoring methods. S_MS/MS is rescaled to 0–1 before integration. W_ccs and W_MS/MS are the weights for the CCS and MS/MS match scores, respectively, and were optimized as 0.3 and 0.7, respectively (Supplementary Fig. 22).
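A minimal Python sketch of Eq. 5 is given below, using the optimized weights of 0.3 and 0.7. The text does not state how the MS/MS scores are rescaled to 0–1; min-max scaling across the candidates of one feature is assumed here purely for illustration, as is the choice to map identical scores to 1. The function names `rescale` and `integrated_score` are hypothetical.

```python
def rescale(scores):
    """Min-max rescale raw MS/MS scores to 0-1 (assumed rescaling method)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates tie; mapping them to 1.0 is an arbitrary choice.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def integrated_score(s_ccs, s_msms, w_ccs=0.3, w_msms=0.7):
    """Weighted sum of the CCS and MS/MS match scores (Eq. 5)."""
    return w_ccs * s_ccs + w_msms * s_msms


# Toy usage: three candidates with raw tool scores and CCS match scores.
raw_msms = [2.0, 4.0, 6.0]
ccs_scores = [1.0, 0.5, 0.0]
combined = [
    integrated_score(c, m) for c, m in zip(ccs_scores, rescale(raw_msms))
]
```

In this toy case the candidate with the best MS/MS score still ranks first despite a CCS score of 0, which reflects the larger weight placed on MS/MS evidence.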

### Chemicals

LC–MS grade methanol (MeOH) and water (H2O) were purchased from Honeywell (Muskegon, MI, USA). LC–MS grade acetonitrile (ACN) was purchased from Merck (Darmstadt, Germany). LC–MS grade methylene dichloride (CH2Cl2) was purchased from Fisher Scientific (Morris Plains, NJ, USA). Ammonium hydroxide (NH4OH) and ammonium acetate (NH4OAc) were purchased from Sigma (St. Louis, MO, USA). The chemical standard succinoadenosine was purchased from J&K (Shanghai, China), while other chemical standards were purchased from TopScience (Shanghai, China).

### Sample preparation

Aging mouse liver tissues (C57BL/6J; 36-week vs. 104-week; n = 10 for each group) were dissected, frozen with liquid nitrogen, and stored at −80 °C. The mouse tissue studies were approved by the Animal Ethics and Welfare Management Committee of the Interdisciplinary Research Center on Biology and Chemistry, Chinese Academy of Sciences (Shanghai, China). Metabolite extraction followed our published protocol10. In brief, 10 mg of mouse liver tissue was first homogenized with 200 μL of H2O and 20 ceramic beads (diameter, 0.1 mm) using a homogenizer (Precellys 24, Bertin Technologies) under low-temperature conditions. The protein concentration of the homogenized solution was measured with the Pierce BCA Protein Assay Kit (Catalog No. 23225, Thermo Fisher) for normalization. One-hundred microliters of homogenized solution was used for metabolite extraction. A total of 100 μL of H2O and 800 μL of a solvent mixture of ACN:MeOH (1:1, v/v) was added, and the sample was vortexed for 30 s and sonicated for 10 min in a 4 °C water bath. After incubation for 1 h at −20 °C, the sample was centrifuged for 15 min at 16,200 × g and 4 °C. The supernatant was collected and evaporated to dryness at 4 °C. The dry extracts were then reconstituted in 100 μL of ACN:H2O (1:1, v/v), followed by sonication at 4 °C for 10 min and centrifugation at 16,200 × g and 4 °C for 5 min to remove insoluble debris before LC–IM–MS/MS analysis.

Other biological samples were prepared as follows. For plasma, 100 μL of human plasma (Catalog No. HPH-0500, Equitech-Bio. Inc, USA) was extracted using 400 μL of a solvent mixture of MeOH:ACN (1:1, v/v) in a centrifuge tube, and then the mixture was vortexed for 30 s and sonicated for 10 min in a 4 °C water bath. The rest of the procedure was the same as described for the mouse liver tissue sample. For cell samples, the RIPK1-/- mouse embryonic fibroblast (MEF) cell line (generated from RIPK1 KO mice) was provided by Prof. Junying Yuan’s Lab (Chinese Academy of Sciences, Shanghai). One milliliter of MeOH:ACN:H2O (2:2:1, v/v/v) solvent mixture was added to the samples, followed by vortexing for 30 s and sonication for 10 min in a 4 °C water bath. Then the samples were incubated in liquid nitrogen for 1 min, thawed on ice, and sonicated for 10 min in a 4 °C water bath. The above vortex–freeze–thaw cycle was repeated three times. The rest of the procedure was the same as described for the mouse liver tissue sample. For fruit fly head samples, the sample collection and extraction followed our previous publication10.

### LC–IM–MS/MS analysis

A UHPLC system (Agilent 1290 series) coupled to a quadrupole time-of-flight mass spectrometer equipped with an ion mobility drift tube (Agilent DTIM-QTOF-MS 6560, Agilent Technologies, USA) was used for LC–IM–MS/MS data acquisition. The LC separation was performed on a Waters BEH Amide column (particle size, 1.7 μm; 100 mm (length) × 2.1 mm (i.d.)) maintained at 25 °C. Solvent A was 100% H2O with 25 mM NH4OAc and 25 mM NH4OH, and solvent B was 100% ACN. The flow rate was 0.3 mL/min, and the gradient was as follows: 0–1 min: 95% B, 1–14 min: 95% B to 65% B, 14–16 min: 65% B to 40% B, 16–18 min: 40% B, 18–18.1 min: 40% B to 95% B, and 18.1–23 min: 95% B. The sample injection volume was 2 μL.
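For readers who want to reproduce the gradient programmatically, the segments listed above can be encoded as breakpoints. The Python sketch below assumes linear ramps between breakpoints, which is the usual pump behavior but is not stated explicitly in the text; the names `GRADIENT` and `percent_b` are hypothetical.

```python
# (time in min, % solvent B) breakpoints taken from the gradient description.
GRADIENT = [
    (0.0, 95.0),
    (1.0, 95.0),
    (14.0, 65.0),
    (16.0, 40.0),
    (18.0, 40.0),
    (18.1, 95.0),
    (23.0, 95.0),
]


def percent_b(t_min):
    """Percentage of solvent B at time `t_min`, interpolating linearly."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time is outside the 0-23 min run")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
```

Under this assumption, the mobile phase would be at 80% B halfway through the 1–14 min ramp (7.5 min) and at 52.5% B at 15 min.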

The data acquisition was operated in IM-Q-TOF mode. The source parameters were set as follows: sheath gas temperature, 325 or 275 °C in positive or negative modes, respectively; dry gas temperature, 300 °C; sheath gas flow, 11 L/min; dry gas flow, 8 L/min; capillary voltage, 4000 V or −3000 V in positive or negative modes, respectively; and nebulizer pressure, 20 or 25 psi in positive or negative modes, respectively. The TOF mass range was set as m/z 50–1700 Da. For ion mobility parameters, nitrogen (N2) was used as the drift gas. Other related IM parameters were set as follows: entrance and exit voltages of the drift tube, 1600 and 250 V; trap filling and trap release times, 20,000 and 150 μs. The pressure of the drift tube was set at 3.95 Torr. The MS/MS spectra were acquired in the “Alternating frames” mode, and the collision energy was fixed at 20 V in frame 2. The CCS values were calculated with the single electric field method. All data acquisitions were carried out using MassHunter Workstation Data Acquisition Software (Version B.08.00, Agilent Technologies, USA).

Chemical standards were first dissolved at 0.01 mg/mL in H2O, MeOH, CH2Cl2, DMSO, or their mixtures in different proportions depending on compound polarity and solubility, and then subjected to measurements of CCS values and MS/MS spectra. The CCS values were independently measured three times across 2 months using a single-field approach on the Agilent DTIM-QTOF-MS 6560 instrument according to our previous publication27. The MS/MS spectra were acquired using a targeted MS/MS method with three different collision energy levels (10, 20, and 40 V).

### Data processing and metabolite annotation

Raw MS data files (.d) were first recalibrated using IM–MS Reprocessor (Version B.08.00, Agilent Technologies). Then, smoothing and saturation repair were performed using PNNL PreProcessor (Version 2018.06.02). The CCS calibration was performed with IM–MS Browser software (Version B.08.01, Agilent Technologies). The pre-processed data files were submitted for feature finding, alignment, and MS/MS spectra extraction using Mass Profiler (Version 10.0, Agilent Technologies). Finally, the peak table and MS/MS spectra (CEF format) files were exported for metabolite annotation. The MS/MS spectrum with the highest intensity was selected for each feature, similar to the protocol in LipidIMMS Analyzer32. The detailed parameters of the data processing tools were provided in Supplementary Table 15. The metabolites were annotated using the multi-dimensional match described above. The m/z tolerance was set at 25 ppm, and only [M+H]+ and [M-H]− adducts were considered for positive and negative modes, respectively. MS-FINDER was used for the in-silico MS/MS match, and chemical structures within the top 3 formulas were kept for unknown metabolite annotation. The known metabolite database (KEGG and HMDB) and the extended database were used for known and unknown metabolite annotation, respectively.
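The m/z filter used in the annotation can be illustrated with a short Python sketch. It computes the theoretical m/z of the protonated or deprotonated ion from a neutral monoisotopic mass and checks it against the 25 ppm tolerance. The proton mass constant is a standard value; the function names `adduct_mz`, `ppm_error`, and `mz_matches` are hypothetical and do not come from the AllCCS software.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def adduct_mz(neutral_mass, mode):
    """Theoretical m/z of [M+H]+ (positive mode) or [M-H]- (negative mode)."""
    if mode == "positive":
        return neutral_mass + PROTON_MASS
    if mode == "negative":
        return neutral_mass - PROTON_MASS
    raise ValueError("mode must be 'positive' or 'negative'")


def ppm_error(mz_observed, mz_theoretical):
    """Signed m/z error in parts per million."""
    return (mz_observed - mz_theoretical) / mz_theoretical * 1e6


def mz_matches(mz_observed, neutral_mass, mode, tol_ppm=25.0):
    """True if the observed m/z lies within `tol_ppm` of the adduct m/z."""
    return abs(ppm_error(mz_observed, adduct_mz(neutral_mass, mode))) <= tol_ppm


# Glucose (C6H12O6, monoisotopic mass 180.06339 Da) as a worked example.
glucose = 180.06339
hit = mz_matches(181.0720, glucose, "positive")
miss = mz_matches(181.0800, glucose, "positive")
```

In this example the first observed m/z lies within a few ppm of the theoretical [M+H]+ value and passes the filter, whereas the second is off by roughly 50 ppm and is rejected.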

### Generation of unknown metabolites

Unknown metabolites were generated based on in-silico enzymatic reactions via BioTransformer46 (version 1.0.8). The command line tool was downloaded from [https://bitbucket.org/djoumbou/biotransformer/src/master/]. The SMILES structures of KEGG compounds were used for the in-silico reactions, and the “EC-based transformation” was used for metabolic transformation. The number of reaction steps was set to 2. All generated metabolites were merged by InChIKey, and their SMILES structures were converted via Open Babel58. Finally, a total of 100,404 unknowns were generated and included in the extended database (Supplementary Data 6). These compounds and their predicted CCS values were also provided in the AllCCS webserver.

### Metabolic pathway and structure enrichment analysis

For the analysis of the aging mouse samples, the peak intensity table from Mass Profiler was first normalized to the protein concentration from the BCA assay. Then, zero values were imputed using the KNN algorithm. Student’s t-test was used to calculate p-values. The metabolic pathway and chemical structure enrichment analyses were performed via the hypergeometric test59 and the Kolmogorov–Smirnov (KS) test60, respectively. All chemical classes of unknowns were obtained using ClassyFire. The quantitative analysis followed our previous publication10, and the z-scale normalization of peak intensities was used in this work.
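The hypergeometric test used for pathway enrichment can be written out in a few lines. The Python sketch below computes the standard one-sided (over-representation) p-value from the hypergeometric distribution; how the background and pathway sets were defined in this work is not specified here, so the argument names and the function name `hypergeom_enrichment_p` are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_pathway, n_selected, n_overlap):
    """One-sided p-value P(X >= n_overlap) for pathway over-representation.

    n_background: number of metabolites in the background set
    n_in_pathway: background metabolites that belong to the pathway
    n_selected:   number of significantly changed metabolites
    n_overlap:    changed metabolites that belong to the pathway
    """
    upper = min(n_in_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

As a sanity check, drawing 3 metabolites from a background of 10 in which 3 belong to the pathway gives p = 1/120 when all 3 drawn metabolites are pathway members.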

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.