CN111858570A

CN111858570A - CCS data standardization method, database construction method and database system

Info

Publication number: CN111858570A
Application number: CN202010642071.7A
Authority: CN
Inventors: 朱正江; 周智伟
Original assignee: Shanghai Institute of Organic Chemistry of CAS
Current assignee: Shanghai Institute of Organic Chemistry of CAS
Priority date: 2020-07-06
Filing date: 2020-07-06
Publication date: 2020-10-30

Abstract

The invention discloses a CCS data standardization method, a database construction method and a database system. The CCS data standardization method comprises the following steps: performing quality inspection on the collected CCS data, removing abnormal values, calculating a uniform CCS value, and assigning confidence levels. The database system comprises a standardization module, a prediction module and a database; the standardization module standardizes the collected, experimentally measured CCS data; the prediction model of the prediction module is trained on data processed by the standardization module and can predict the CCS of a compound from the input structural information of the compound to be predicted; the database contains CCS data processed by the standardization module and CCS data predicted by the prediction module. The invention provides users with a CCS data source with high coverage and high credibility.

Description

CCS data standardization method, database construction method and database system

Technical Field

The invention belongs to the technical field of databases, and relates to a data processing method, in particular to a data standardization and database construction method and a database system.

Background

The goal of non-targeted metabolomics is to comprehensively measure as many metabolites as possible in a complex system and to identify the key metabolites associated with phenotypic perturbations. Because living organisms are highly complex, the metabolites produced in life processes are numerous, structurally complex, rich in isomers and distributed over a wide concentration range. Liquid chromatography-mass spectrometry (LC-MS) is currently the main analytical method in non-targeted metabolomics. Metabolite identification remains a major bottleneck in LC-MS-based non-targeted metabolomics. The standard identification strategy is to match the experimentally determined primary mass spectra and tandem mass spectra (MS/MS or MS2) of biological samples against standard spectral libraries (such as METLIN, MassBank and NIST) or in silico predicted MS/MS spectra. However, standard spectral libraries have limited coverage, and in silico predictions lack high accuracy. Other bioinformatics methods (e.g., GNPS, MetDNA) also use MS2 spectra together with molecular network algorithms for metabolite annotation. All of these strategies require high-quality experimental MS2 spectra. However, the MS2 spectra of low-molecular-weight metabolites are very sparse and often lack the characteristic fragment ions needed for reliable identification, and some metabolite isomers have highly similar MS2 spectra. Furthermore, many experimental factors, such as the complexity of the biological matrix, low concentrations and co-elution of isomeric metabolites, make it very challenging to acquire high-quality MS2 spectra. These problems lead to low coverage and high error rates in metabolite annotation, so new physicochemical properties need to be introduced for metabolite annotation.

In recent years, ion mobility-mass spectrometry (IM-MS) has become a promising technology in non-targeted metabolomics research because it provides multidimensional separation and high selectivity. IM-MS rapidly separates ions with subtle structural differences on the basis of their different ion mobilities, and the ion mobility can be further characterized by the collision cross section (CCS) of the metabolite ion. IM-MS is able to distinguish the metabolite isomers that are common in biological samples. Unlike retention time (RT) and MS2 spectra, which are susceptible to many experimental factors, CCS is highly reproducible across instruments and laboratories. CCS is also a unique physicochemical property that can be used to improve the accuracy of metabolite annotation. Therefore, constructing a CCS database is of great significance for large-scale metabolite identification. In addition, a CCS database makes it possible to integrate IM-MS with the existing LC-MS/MS (liquid chromatography-tandem mass spectrometry) method, so that four-dimensional metabolomics data comprising MS1 (primary mass spectrum), RT (retention time), CCS (collision cross-sectional area) and MS/MS (tandem mass spectrum) can be acquired simultaneously in a single analysis, enabling multi-dimensional metabolite identification.

The existing CCS database establishment methods comprise two methods, namely, measurement of a standard substance through experiments and theoretical calculation.

Experimental measurement requires purchasing the corresponding standard and measuring its collision cross-sectional area on an IM-MS instrument. The coverage of a database built in this way is limited by the number of standards that can be purchased; standards for many compounds are not available, and those that are available are often expensive. Moreover, because different instrument platforms exist on the market, experimental measurements of standards are subject to systematic errors and are influenced by experimental conditions and operators. In addition, reported collision cross-sectional area data are scattered across different journals published at different times and are difficult to collect, and suitable methods for cross-validating the data are currently lacking. These deficiencies prevent experimental measurement of standards from yielding a high-coverage, high-confidence database.

In the theoretical calculation approach, the collision cross-sectional area of a metabolite is calculated with a computational chemistry tool (such as MOBCAL) and an ion-gas molecule collision model. The biggest limitation of this approach is its large calculation error: the calculated values deviate from experimentally measured values by 3-30%. In addition, it requires researchers to have a strong computational background and powerful computing resources. Even so, calculating the collision cross-sectional area of a single molecule often takes days or weeks. The time, manpower and financial cost of theoretical calculation are therefore very high, which limits the use of this approach for building a high-coverage, high-reliability collision cross-section database.

In addition, the existing data obtained by the above two methods are scattered across different journals or databases, and there is no platform for storing and managing them uniformly, which greatly hinders data query and use. Collision cross-sectional area data obtained by different methods may be inconsistent, the data are presented in inconsistent ways, and indicators of data credibility are lacking.

Disclosure of Invention

The invention aims to provide a CCS data standardization method, a database construction method and a database system, which provide a CCS data acquisition source with high coverage and high reliability for users.

A method of normalizing CCS data, comprising:

supplementing the CCS data of each experimental measurement collected from an ion mobility mass spectrometer platform with basic information related to the compound structure corresponding to the CCS data;

performing quality inspection and processing on the CCS data after the basic information is supplemented;

removing abnormal values of the CCS data subjected to quality inspection and processing;

for a compound having CCS data from multiple ion mobility mass spectrometer instrument platforms, calculating a uniform CCS for the compound.

The quality checking and processing the collected CCS data comprises one or more of the following operations:

Deleting CCS data for compounds having a chemical formula and/or adduct form and/or mass to charge ratio outside of specified ranges;

deleting CCS data from the same ion mobility mass spectrometer platform but with different CCSs;

calculating a maximum difference between the plurality of CCSs for an adduct ion having a plurality of CCSs; deleting CCS data for the adduct ion if the maximum difference is greater than a set threshold; otherwise, calculating an average of the plurality of CCSs as the CCS of the adduct ion;

wherein the maximum difference is a ratio between a difference between a maximum CCS and a minimum CCS and an average of the CCSs.

The removing abnormal values of the CCS data after the quality inspection and the processing comprises the following steps:

fitting a trend line for each compound class to the CCS data that have undergone quality inspection and processing, using a power function, and calculating a confidence interval; and deleting CCS data that fall outside the confidence interval at the set confidence level.

Calculating a uniform CCS for the same compound with CCS data from multiple ion mobility mass spectrometry instrument platforms comprises: averaging CCSs of the compound from all ion mobility mass spectrometer instrument platforms to obtain a uniform CCS for the compound.

The normalization method further includes assigning a confidence level to each CCS data as follows:

for a uniform CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer platforms whose instrument type is DTIM-MS, if the maximum CCS difference is smaller than a first set difference threshold, the confidence of the uniform CCS is a first level;

for a uniform CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer platforms of any instrument type, if the maximum CCS difference is smaller than a second set difference threshold, the confidence of the uniform CCS is a second level;

for CCS data collected from only one ion mobility mass spectrometer platform of any instrument type, the confidence is a third level;

for a uniform CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer platforms of any instrument type, if the maximum CCS difference is greater than the second set difference threshold, the confidence of the uniform CCS is marked as conflict;

the confidence decreases in the following order: first level, second level, third level, conflict.

A construction method of a CCS database comprises the following steps:

collecting CCS data of experimental measurement from an ion mobility mass spectrometer platform;

carrying out standardization processing on the collected CCS data by adopting the standardization method of the CCS data, storing the processed CCS data, and generating a training data set;

training a machine learning-based prediction model based on the training data set so that the trained prediction model can predict and output a series of CCS values of the compound based on the input structural information of the compound to be predicted;

and predicting the CCS value of the compound by the prediction model based on the structural information of the compound to be predicted, which is input by a user, and storing and/or returning the predicted CCS.

A CCS database system, comprising:

the standardization module is used for carrying out standardization processing on CCS data of experimental measurement collected from an ion mobility mass spectrometer platform according to the standardization method;

the prediction module comprises a machine learning-based prediction model and is used for predicting the CCS value of the compound according to the structural information of the compound to be predicted, which is input by a user; the prediction model is obtained through training, and a training data set is obtained through data processed by the standardization module;

And the database comprises CCS data of the experimental measurement processed by the standardization module and CCS data predicted by the prediction module.

The structural information of the compound comprises a plurality of molecular descriptors selected in the following manner: based on the training data set, selecting a plurality of molecular descriptors that contribute most to the predicted CCS value by a recursive feature elimination cross-validation method.

When the machine learning-based prediction model is trained, the accuracy of the predicted CCS data is evaluated as follows: calculating the structural similarity between the predicted compound and every compound in the training data set to obtain structural similarity scores; on the premise that the higher the structural similarity score, the more similar the CCS, calculating the average of the N highest structural similarity scores as a parameter for evaluating the accuracy of the CCS obtained by the prediction model, wherein the larger the value of the parameter, the more accurate the predicted CCS; wherein N is a positive integer and is a set value.

The CCS database system also comprises a compound card establishing module, which is used for establishing the compound card for the CCS data stored in the database by adopting the following format: the compound card takes the structure of a compound as a unique identifier, and generates and displays the following information for each compound: CCS data for the compound initially gathered, basic information for the compound, uniform CCS for the compound, experimentally measured CCS data for the compound, predicted CCS for the compound, linkage of the compound in other databases.

Due to the adoption of the technical scheme, the invention has the following beneficial effects:

a brand-new data cleaning and standardization method is provided, so that collected data are more reliable, the problem that due to the fact that different types of instrument platforms exist in the market, system errors possibly exist in CCS data from different sources is solved, and a standardized experimental measurement CCS database is built.

In the prior art, the formats of the CCS data obtained by different modes and different sources are not uniform, so that obstacles are brought to data retrieval. The invention sets uniform serial numbers and formats for CCS data of all substances in the database, so that target data can be quickly and accurately found through retrieval of related fields.

A brand-new CCS prediction model based on machine learning is disclosed, and high-quality and high-coverage experimental measurement CCS data are used as a training data set to obtain an optimized CCS prediction model based on machine learning (based on a support vector machine algorithm), so that collision cross-sectional areas of a large number of small molecules can be rapidly predicted through the model, and a high-coverage and high-accuracy prediction CCS database is obtained.

For a predicted collision cross-sectional area, there has previously been no way to estimate the prediction accuracy for each individual molecule. Previous methods can only evaluate the overall performance of a model on an external validation data set of known values: small molecules with known experimentally measured CCS are input into the prediction model, the predicted CCS is obtained, and the error between the predicted and experimental values is used to estimate the approximate performance of the model. This approach cannot estimate the accuracy for each compound, especially for compounds lacking experimental measurements. The invention obtains structural similarity scores by comparing the structure of the predicted molecule with all small-molecule compounds in the training data set and, on the premise that molecules with similar structures have relatively close collision cross-sectional areas, systematically estimates the accuracy of the predicted collision cross-sectional area of each molecule.

At present, no unified database integrates all experimental measurements and the collision cross-sectional area obtained by prediction calculation. The invention integrates the collision cross-sectional areas obtained by the two methods by utilizing a uniform platform, displays all information of the compound in a compound card mode, numbers all the compounds uniformly and is convenient for data query.

To date, the present invention has collected 14 experimentally measured CCS data sets, forming the largest experimentally measured CCS database currently known. A high-accuracy prediction model built on this experimental CCS database was used to predict 1670596 compounds from 7 mainstream databases (KEGG, HMDB, LMSD, MINE, DrugBank, DSSTox and UNPD), giving a total of 11697711 collision cross-sectional area records. This is the most comprehensive collision cross-sectional area database currently available. Finally, the database is applicable to, but not limited to, metabolites, drug molecules, natural products and pesticide molecules, thereby overcoming the small coverage and limited application range of conventional CCS databases.

Drawings

FIG. 1 is a flowchart of a method for standardizing CCS data according to a first embodiment of the present invention;

FIG. 2 is a power function fitting curve of lipids and lipid compounds according to one embodiment of the present invention;

FIG. 3 is a block diagram of a CCS database system according to a third embodiment of the present invention.

Detailed Description

The invention will be further described with reference to examples of embodiments shown in the drawings.

Example one

Because the CCS data obtained by different methods and different sources are not uniform in format, and there may be conflicts and errors between data, the data needs to be cleaned and standardized. The invention provides a method for standardizing CCS data, and FIG. 1 is a flow chart of the method for standardizing CCS data. The method comprises the following steps:

(1) Completing information

For the experimentally measured CCS data collected from ion mobility mass spectrometer platforms, the complete information (meta information, i.e., basic information related to the compound structure) is obtained as follows: each collected CCS record is assigned a uniform database number according to its corresponding compound; structural representations of the compound such as SMILES, InChI and InChIKey are obtained using the CTS tool and the R package rinchi; the molecular formula and exact mass of the compound are calculated using the R package rcdk; and classification information for the compound is obtained using the ClassyFire software.

(2) Checking data quality

The quality inspection and processing of the collected CCS data specifically comprises the following steps:

the following low-quality CCS data are deleted first, including: there is no corresponding chemical formula, the adduct form is not within the specified range, and the m/z (mass to charge ratio) error exceeds 10 ppm.

CCS data from the same ion mobility mass spectrometer platform but with different CCS were deleted.

For adduct ions with multiple CCS, the maximum difference between the multiple CCS is first calculated. If the maximum difference is greater than a set threshold (0.5% in this example), then all CCS data associated with the adduct ion are deleted; otherwise, the average of multiple CCS will be calculated as the CCS of the final compound in the adduct form. Wherein the maximum difference is a ratio of a difference between the maximum CCS and the minimum CCS to an average of the CCS.
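The checks in step (2) can be illustrated with a minimal Python sketch. The function names, the record layout and the use of a theoretical m/z as the reference for the ppm error are assumptions made for illustration only; the 10 ppm limit and the 0.5% threshold are the values given in this example.

```python
from statistics import mean
from typing import Optional, Sequence

PPM_TOLERANCE = 10.0    # m/z error limit stated in the text
MAX_REL_SPREAD = 0.005  # 0.5% threshold used in this example


def ppm_error(measured_mz: float, theoretical_mz: float) -> float:
    """Absolute m/z error in parts per million (reference value is an assumption)."""
    return abs(measured_mz - theoretical_mz) / theoretical_mz * 1e6


def relative_spread(ccs_values: Sequence[float]) -> float:
    """(max - min) / mean, i.e. the 'maximum difference' defined in the text."""
    return (max(ccs_values) - min(ccs_values)) / mean(ccs_values)


def merge_adduct_ccs(ccs_values: Sequence[float]) -> Optional[float]:
    """Return the averaged CCS of an adduct ion, or None if the values disagree."""
    if relative_spread(ccs_values) > MAX_REL_SPREAD:
        return None  # all CCS data for this adduct ion are deleted
    return mean(ccs_values)
```

A record would be kept only if `ppm_error(...)` does not exceed `PPM_TOLERANCE`; `merge_adduct_ccs` then either averages the remaining values of an adduct ion or signals that they should be deleted.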

(3) Removing outliers

Abnormal values are removed from the CCS data that have undergone quality inspection and processing as follows: a power function is used to fit a trend line to the CCS data of each compound class, and a confidence interval is calculated; CCS data falling outside the confidence interval at the set confidence level are deleted.

The following is a specific example of the present embodiment, including the steps of:

(a) the CCS values are grouped by compound class, and the number n of CCS values in each class is counted. For compound classes with n ≥ 10, the fitting of step (b) is performed; for compound classes with n < 10, no processing is performed;

(b) for each compound class, a power function of the mass-to-charge ratio (m/z) is fitted to the CCS data, $\mathrm{CCS} = a \times (m/z)^{b}$, where a and b are fitting coefficients, m is the mass of the corresponding compound and z is the charge carried by the corresponding compound, to obtain the trend line of the CCS for that class;

(c) the 99% confidence interval of the trend line is calculated, and CCS values that fall outside this interval are considered outliers (i.e., the abnormal values in this example). For example, in FIG. 2, methyl behenate lies outside the 99% confidence interval of the trend line for lipids and lipid compounds and was therefore judged to be an outlier and removed. In FIG. 2, the region between the two curves is the 99% confidence interval.
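Steps (a) to (c) can be sketched as follows. The text does not state how the fit or the interval is computed, so this sketch makes several assumptions: the power function is fitted by ordinary least squares in log-log space, the band is a prediction band for a single observation, and a normal quantile is used in place of the t quantile. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean


def fit_power_law(mz, ccs):
    """Least-squares fit of ln(CCS) = ln(a) + b * ln(m/z); returns (a, b)."""
    x = [math.log(v) for v in mz]
    y = [math.log(v) for v in ccs]
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = math.exp(y_bar - b * x_bar)
    return a, b


def flag_outliers(mz, ccs, level=0.99, min_points=10):
    """Return one boolean per point, True where the point lies outside the band."""
    n = len(mz)
    if n < min_points:
        return [False] * n  # classes with n < 10 are left untouched
    a, b = fit_power_law(mz, ccs)
    x = [math.log(v) for v in mz]
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Residuals in log space relative to the fitted trend line.
    resid = [math.log(c) - (math.log(a) + b * xi) for c, xi in zip(ccs, x)]
    s = math.sqrt(sum(r * r for r in resid) / (n - 2))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation (assumption)
    flags = []
    for xi, r in zip(x, resid):
        half_width = z * s * math.sqrt(1 + 1 / n + (xi - x_bar) ** 2 / sxx)
        flags.append(abs(r) > half_width)
    return flags
```

Fitting in log space turns the power function into a straight line, which keeps the sketch dependency-free; a nonlinear fit on the original scale would give slightly different coefficients and bands.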

(4) Calculating a unified CCS value

Calculating a uniform CCS for the same compound having CCS data from multiple ion mobility mass spectrometer instrument platforms, comprising: the CCS of the compound from all ion mobility mass spectrometer instrument platforms were averaged to obtain a uniform CCS for the compound.

For an adduct ion, if it has multiple CCSs obtained from DTIM-MS (name of ion mobility Mass Spectrometry Instrument type), the uniform CCS is the average of the CCSs in DTIM-MS. Otherwise, all CCS from different instrument platforms will be averaged to compute a uniform CCS. In this example, 3539 unified CCS were generated in total.
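The averaging rule above can be written as a short Python sketch. The record layout, a list of (instrument type, CCS) pairs for one adduct ion, is an assumption for illustration.

```python
from statistics import mean


def unified_ccs(records):
    """records: list of (instrument_type, ccs) pairs for one adduct ion."""
    dtim = [ccs for kind, ccs in records if kind == "DTIM-MS"]
    if len(dtim) > 1:
        # Several DTIM-MS values exist: average only those.
        return mean(dtim)
    # Otherwise average the values from all instrument platforms.
    return mean([ccs for _, ccs in records])
```

For example, two DTIM-MS values of 200.0 and 202.0 plus one TWIM-MS value of 210.0 give a uniform CCS of 201.0, because only the DTIM-MS values are averaged.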

(5) Assigning confidence levels

For each CCS data, a confidence is assigned as follows:

for a uniform CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer instrument platforms whose instrument type is DTIM-MS, if the maximum CCS difference is smaller than a first set difference threshold (1% in this embodiment), the confidence of the uniform CCS is a first level (Level 1);

for a uniform CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer instrument platforms of any instrument type (e.g., DTIM-MS, TWIM-MS or TIMS-MS), if the maximum CCS difference is smaller than a second set difference threshold (3% in this embodiment), the confidence of the uniform CCS is a second level (Level 2);

for CCS data collected from only one ion mobility mass spectrometer instrument platform of any instrument type (e.g., DTIM-MS, TWIM-MS or TIMS-MS), the confidence is a third level (Level 3);

for a uniform CCS calculated from experimentally measured CCS data collected from different ion mobility mass spectrometer instrument platforms of any instrument type (e.g., DTIM-MS, TWIM-MS or TIMS-MS), if the maximum CCS difference is greater than the second set difference threshold, the confidence of the uniform CCS is marked as conflict.

The confidence decreases in the following order: first level, second level, third level, conflict.
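The assignment rules can be summarized in a small Python sketch. It assumes that the maximum CCS difference is the same relative quantity defined in step (2), (max - min) / mean, and that each record carries a platform identifier and an instrument type; both are assumptions, as is the function name. The text does not say how a difference exactly equal to a threshold is handled; here it falls through to the next rule.

```python
def assign_confidence(records, level1_tol=0.01, level2_tol=0.03):
    """records: list of (platform_id, instrument_type, ccs) for one adduct ion."""
    platforms = {pid for pid, _, _ in records}
    if len(platforms) == 1:
        return "Level 3"  # data from only one instrument platform
    values = [ccs for _, _, ccs in records]
    # Relative maximum difference, assumed to match the definition in step (2).
    spread = (max(values) - min(values)) / (sum(values) / len(values))
    all_dtim = all(kind == "DTIM-MS" for _, kind, _ in records)
    if all_dtim and spread < level1_tol:
        return "Level 1"
    if spread < level2_tol:
        return "Level 2"
    return "Conflict"
```

Checking the single-platform case first keeps the later rules simple, since Levels 1 and 2 and the conflict label all presuppose data from different platforms.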

Example two

The embodiment discloses a method for constructing a CCS database, which comprises the following steps:

experimentally measured CCS data were collected from all ion mobility mass spectrometer instrument platforms.

The collected CCS data are standardized using the standardization method described above, the processed CCS data are stored, and a training data set is generated (e.g., by supplementing the data with a plurality of molecular descriptors). To address the problem that experimentally measured CCS databases are scattered and have limited coverage, this embodiment has so far collected 14 experimentally measured CCS data sets. These data sets came from 4 laboratories and 2 instrument platforms and yielded a total of 3539 uniform CCS values.

Based on the training data set, a machine learning-based prediction model is trained so that the trained model can predict and output a series of CCS values of a compound in different adduct forms from the input structural information of the compound to be predicted (such as a plurality of molecular descriptors of the compound). The molecular descriptors are calculated by inputting the compound structure into the rcdk package in the R software, and are selected as follows: based on all the standardized CCS data (dependent variable) and the molecular descriptors (independent variables), the molecular descriptors that contribute most to the predicted CCS are selected by recursive feature elimination with cross-validation (RFECV). In this example, 15 molecular descriptors in the positive ion mode and 9 molecular descriptors in the negative ion mode are selected as inputs to the prediction model.

When the prediction model is trained, the CCS values predicted by the model are evaluated as follows: the structural similarity between the predicted compound and every compound in the training data set is calculated to obtain structural similarity scores; on the premise that the higher the structural similarity score, the more similar the CCS value, the average of the N highest similarity scores is calculated to obtain a parameter called Representative Structural Similarity (RSS), which is used to evaluate the accuracy of the CCS value obtained by the prediction model. Specifically, structural similarity can be defined on a range from 0 to 1, with 1 representing an identical molecular structure and 0 representing a completely different molecular structure.
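The RSS calculation can be sketched in Python as follows. The text does not name the similarity measure, so the Tanimoto coefficient on fingerprint bit sets is an assumption chosen because it also ranges from 0 to 1; the fingerprints themselves, the default N and the function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprint bit sets (assumed measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def representative_structural_similarity(query, training, top_n=5):
    """Mean of the top_n highest similarity scores against the training set."""
    if not training:
        raise ValueError("training set must not be empty")
    scores = sorted((tanimoto(query, fp) for fp in training), reverse=True)
    top = scores[:top_n]
    return sum(top) / len(top)
```

A value close to 1 means the training set contains several compounds that are structurally very similar to the query, so the predicted CCS is expected to be more accurate.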

And predicting the CCS value of the compound by the prediction model based on a plurality of molecular descriptors of the compound to be predicted, which are input by a user, and storing and/or returning the predicted CCS data.

The CCS data measured by the experiment and the predicted CCS data obtained by the prediction model are used as data composition of the database.

Example three

This embodiment discloses a database system, and fig. 3 is a schematic structural diagram of the database system.

The database system includes:

the standardization module is used for carrying out standardization processing on CCS data of experimental measurement collected from an ion mobility mass spectrometer platform according to the standardization method of the CCS data;

The prediction module comprises a prediction model based on machine learning and is used for predicting the CCS value of the compound to be predicted according to the structural information of the compound input by a user and storing and/or returning the predicted CCS data; the prediction model is obtained through training, and a training data set is obtained through CCS data processed by a standardization module;

the database comprises CCS data of experimental measurement processed by the standardization module and CCS data predicted by the prediction module;

the compound card establishing module is used for establishing a compound card for the CCS data stored in the database in the following format: the compound card takes the structure of a compound as a unique identifier, and generates and displays the following information for each compound: basic information for the compound, the uniform CCS value for the compound, experimentally measured CCS values for the compound, predicted CCS values for the compound, and links to the compound in other databases. In this example, the basic information of the compound includes the name of the compound, molecular formula, exact mass, different types of structural representations, classification of the compound, and the like. For some compounds, items for which no data exist are left blank on the compound card; for example, for a compound that has only predicted CCS data, the experimentally measured CCS values and the links to other databases are shown as blank.
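One possible in-memory shape for such a compound card is sketched below in Python. The class name, field names and the rendering format are all hypothetical; the text specifies only which kinds of information the card shows and that missing items are left blank.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CompoundCard:
    structure_id: str  # structure-based unique identifier, e.g. an InChIKey
    basic_info: Dict[str, str] = field(default_factory=dict)
    unified_ccs: Dict[str, float] = field(default_factory=dict)             # adduct -> CCS
    experimental_ccs: Dict[str, List[float]] = field(default_factory=dict)  # adduct -> values
    predicted_ccs: Dict[str, float] = field(default_factory=dict)           # adduct -> CCS
    external_links: Dict[str, str] = field(default_factory=dict)            # database -> ID

    def render(self) -> Dict[str, str]:
        """Flatten the card for display; sections without data become ''."""
        def show(section):
            return "; ".join(f"{key}: {value}" for key, value in section.items())

        return {
            "ID": self.structure_id,
            "Basic information": show(self.basic_info),
            "Uniform CCS": show(self.unified_ccs),
            "Experimental CCS": show(self.experimental_ccs),
            "Predicted CCS": show(self.predicted_ccs),
            "Links": show(self.external_links),
        }
```

A card created with only `predicted_ccs` filled in renders its experimental CCS and link sections as empty strings, which corresponds to the blank display described above.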

In this embodiment, the structural information of the compound includes a plurality of molecular descriptors, which are selected as follows: based on the training data set, a plurality of molecular descriptors that contribute most to the predicted CCS value are selected by Recursive feature elimination Cross-validation (RFECV). In this embodiment, 15 molecular descriptors in the positive ion mode and 9 molecular descriptors in the negative ion mode are selected as the plurality of molecular descriptors selected above in this manner. In this example, a plurality of molecular descriptors were calculated by inputting the compound structure into the rcdk package of the R software.

When a prediction model based on machine learning is trained, accuracy evaluation is carried out on CCS data obtained through prediction according to the following method:

calculating the structural similarity between the predicted compound and every compound in the training data set to obtain structural similarity scores; on the premise that the higher the structural similarity score, the more similar the CCS, calculating the average of the N highest structural similarity scores as a parameter for evaluating the accuracy of the CCS obtained by the prediction model, wherein the larger the value of the parameter, the more accurate the predicted CCS; wherein N is a positive integer and is a set value.

In addition, if the standardization module has assigned confidence levels to the experimentally measured CCS data according to the CCS data standardization method of the present invention, the CCS database system in this embodiment further includes a predicted-value confidence assignment module, which labels the CCS values predicted by the prediction module with a fourth level (Level 4) of confidence; the fourth level is the lowest confidence level.

The embodiments described above are presented to enable those skilled in the art to make and use the invention. It will be readily apparent to those skilled in the art that various modifications may be made to these embodiments and that the generic principles described herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the embodiments described herein, and improvements and modifications made by those skilled in the art on the basis of this disclosure shall fall within the protection scope of the present invention.

Claims

1. A method for normalizing CCS data, comprising:

2. The method of normalizing CCS data according to claim 1, wherein:

3. The method of normalizing CCS data according to claim 1, wherein:

4. The method of normalizing CCS data according to claim 1, wherein: calculating a uniform CCS for the same compound with CCS data from multiple ion mobility mass spectrometry instrument platforms comprises:

averaging CCSs of the compound from all ion mobility mass spectrometer instrument platforms to obtain a uniform CCS for the compound.

5. The method of normalizing CCS data according to claim 1, wherein: the normalization method further includes assigning a confidence level to each CCS data as follows:

6. A construction method of a CCS database is characterized by comprising the following steps:

standardizing the collected CCS data by adopting the CCS data standardization method of any one of claims 1-5, storing the processed CCS data, and generating a training data set;

7. A CCS database system, comprising

A normalization module for normalizing CCS data of experimental measurements gathered from an ion mobility mass spectrometer instrument platform according to the method of any one of claims 1-5;

8. The CCS database system of claim 7, wherein:

The structural information of the compound comprises a plurality of molecular descriptors selected in the following manner:

based on the training data set, selecting a plurality of molecular descriptors that contribute most to the predicted CCS value by a recursive feature elimination cross-validation method.

9. The CCS database system of claim 7, wherein:

calculating the structural similarity of the predicted compound and all compounds in the training data set to obtain a structural similarity score; according to a rule that the higher the structural similarity score is, the more similar the CCS is, calculating the average value of the former N structural similarity scores with the highest score as a parameter for evaluating the accuracy of the CCS obtained by the prediction model, wherein the larger the numerical value of the parameter is, the more accurate the CCS obtained by prediction is;

wherein N is a positive integer and is a set value.

10. The CCS database system of claim 7, wherein: the system also comprises a compound card establishing module which is used for establishing the compound card by adopting the following format for the CCS data stored in the database:

The compound card takes the structure of a compound as a unique identifier, and generates and displays the following information for each compound: CCS data for the compound initially gathered, basic information for the compound, uniform CCS for the compound, experimentally measured CCS data for the compound, predicted CCS for the compound, linkage of the compound in other databases.