SiGMA: Highly accurate and large-scale collision cross sections
prediction with graph neural networks
Communications Chemistry 2023 (Q1, IF 7.211)

  • Renfeng Guo 1 *
  • Youjia Zhang 1, 2 *
  • Yuxuan Liao 1 *
  • Qiong Yang 1
  • Ting Xie 1
  • Xiaqiong Fan 1
  • Zhonglong Lin 1

    1 Central South University    2 Huazhong University of Science and Technology    3 Yunnan Academy of Tobacco Agricultural Sciences

    *Equal contribution    Corresponding author

Abstract

The collision cross section (CCS) values derived from ion mobility spectrometry (IMS) can be used to improve the accuracy of compound identification. Here, we have developed the Structure included graph merging with adduct method for CCS prediction (SigmaCCS) based on graph neural networks using 3D conformers as inputs. A model was trained, evaluated, and tested with >5,000 experimental CCS values. It achieved a coefficient of determination of 0.9945 and a median relative error of 1.1751% on the test set. The model-agnostic interpretation method and the visualization of the learned representations were used to investigate the chemical rationality of SigmaCCS. An in-silico database with 282 million CCS values was generated for three different adduct types ([M+H]+, [M+Na]+, and [M-H]-) of 94 million compounds. Its source code is publicly available at https://github.com/zmzhang/SigmaCCS. Altogether, SigmaCCS is an accurate, rational, and off-the-shelf method to directly predict CCS values from molecular structures.

Overview

Workflow of SigmaCCS.
  (a) Dataset curation: a curated dataset with 5,597 experimental CCS values was used to train, validate, and test the SigmaCCS model. It was obtained from CCSbase through a five-step cleaning pipeline.
  (b) Conformer generation: the molecular object of each molecule was constructed from its SMILES string, and the 3D conformer was generated with ETKDG and optimized with MMFF94. The attributes of each atom and bond in the molecule were calculated by RDKit.
  (c) Molecular graph construction: the molecular graph of each molecule was built by initializing the node attribute matrix, the edge attribute matrix, and the adjacency matrix from the molecule's connection table and the attributes calculated in the previous step.
  (d) Edge-conditioned convolution: the atom vector of each atom in the molecule was learned from the curated dataset with edge-conditioned convolution, and the molecular vector was generated from the atom vectors through global sum pooling.
  (e) Adduct encoding: the adduct ion type ([M+H]+, [M+Na]+, or [M-H]-) was encoded as a one-hot vector. The molecular vector and the one-hot vector of the adduct type were concatenated to obtain the feature vector.
  (f) CCS prediction: the feature vector was fed into the fully connected layers and passed on to the output layer to predict the CCS value.
  (g) Database generation: the SigmaCCS model was used to predict CCS values for three adduct ions of each of 94,161,201 compounds in PubChem. The resulting database contains >280,000,000 predicted CCS values.
The complete workflow of SigmaCCS was implemented in Python (v3.7.7).
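Steps (c) to (f) can be illustrated with a small, dependency-free sketch. It is not the SigmaCCS implementation (see the repository linked in the abstract for that): the layer form (a root weight plus summed neighbour messages whose weight matrices are a linear function of the bond attributes), the layer sizes, the toy node and bond attributes, and all function names are assumptions made for illustration. The weights are random, so the predicted value is meaningless until the model is trained.

```python
import random

ADDUCTS = ["[M+H]+", "[M+Na]+", "[M-H]-"]


def one_hot_adduct(adduct):
    # Step (e): encode the adduct ion type as a one-hot vector.
    return [1.0 if adduct == name else 0.0 for name in ADDUCTS]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def ecc_layer(nodes, edges, w_root, w_edge):
    # Step (d), simplified: each atom sums messages from its bonded
    # neighbours; the weight matrix of each message is computed from the
    # bond attributes (here: a linear combination of the w_edge matrices).
    out = [matvec(w_root, h) for h in nodes]
    for (i, j), attrs in edges.items():
        theta = [[sum(a * w[r][c] for a, w in zip(attrs, w_edge))
                  for c in range(len(w_root[0]))]
                 for r in range(len(w_root))]
        message = matvec(theta, nodes[j])
        out[i] = [x + m for x, m in zip(out[i], message)]
    return [[max(0.0, x) for x in h] for h in out]  # ReLU activation


def global_sum_pool(atom_vectors):
    # Sum over atoms, so the molecular vector does not depend on atom order.
    return [sum(column) for column in zip(*atom_vectors)]


def feature_vector(nodes, edges, adduct, w_root, w_edge):
    atom_vectors = ecc_layer(nodes, edges, w_root, w_edge)
    return global_sum_pool(atom_vectors) + one_hot_adduct(adduct)


def predict_ccs(features, w_hidden, w_out):
    # Step (f): one hidden fully connected layer, then a scalar output.
    hidden = [max(0.0, x) for x in matvec(w_hidden, features)]
    return sum(w * h for w, h in zip(w_out, hidden))


# Step (c): toy 3-atom chain (atoms 0-1-2) with made-up attributes.
nodes = [[0.6, 0.0, 0.0], [0.6, 1.5, 0.0], [0.8, 2.0, 1.3]]
bond = [1.0, 0.0]  # hypothetical one-hot bond type: [single, double]
edges = {(0, 1): bond, (1, 0): bond, (1, 2): bond, (2, 1): bond}

rng = random.Random(0)
F_IN, F_OUT, HIDDEN = 3, 4, 5  # illustrative sizes, not the paper's
w_root = rand_matrix(rng, F_OUT, F_IN)
w_edge = [rand_matrix(rng, F_OUT, F_IN) for _ in bond]
w_hidden = rand_matrix(rng, HIDDEN, F_OUT + len(ADDUCTS))
w_out = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]

features = feature_vector(nodes, edges, "[M+H]+", w_root, w_edge)
ccs = predict_ccs(features, w_hidden, w_out)  # untrained, so not meaningful
print(len(features), ccs)
```

Because the pooling is a sum over atoms, renumbering the atoms leaves the molecular vector unchanged, which is one reason a sum (rather than a concatenation of atom vectors) is a natural readout for molecules of different sizes.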

Results

Performance evaluation of different methods. (a) SigmaCCS on the test set. (b) DeepCCS on the test set. (c) Performance comparison of SigmaCCS with AllCCS, MetCCS, DeepCCS, and ISiCLE on the external test set.
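The two metrics quoted in the abstract, the coefficient of determination and the median relative error, can be computed as in the sketch below. It assumes the usual definitions (R² = 1 − SS_res / SS_tot, and the median of |predicted − experimental| / experimental expressed in percent); the exact formulas used for the reported numbers are not given on this page, and the CCS values in the example are made up.

```python
from statistics import mean, median


def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 - residual SS / total SS.
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def median_relative_error(y_true, y_pred):
    # Median of |prediction - experiment| / experiment, in percent.
    return 100.0 * median(abs(p - t) / t for t, p in zip(y_true, y_pred))


# Made-up experimental and predicted CCS values, for illustration only.
experimental = [150.0, 200.0, 250.0]
predicted = [153.0, 198.0, 250.0]

print(r_squared(experimental, predicted))
print(median_relative_error(experimental, predicted))
```

The median is less sensitive than the mean to a few badly predicted compounds, so it is worth reading alongside R² rather than in place of it.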