CN109726510B

CN109726510B - Protein saccharification site identification method

Info

Publication number: CN109726510B
Application number: CN201910061890.XA
Authority: CN
Inventors: 杨润涛; 陈金桂; 张承进; 张丽娜; 宋勇
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2019-01-23
Filing date: 2019-01-23
Publication date: 2022-12-23
Anticipated expiration: 2039-01-23
Also published as: CN109726510A

Abstract

The application provides a method for identifying protein glycation sites. The method comprises: collecting a glycation site training data set and extracting peptide chains from it; encoding each peptide chain by its digital vector, the accessible surface area of its amino acids, the secondary structure probabilities of its amino acids and its grey correlation degree; selecting an optimal feature set with the maximum correlation minimum redundancy (mRMR) feature selection algorithm; and training a support vector machine on the optimal feature set to obtain a predictor with which protein glycation sites are identified. Because the method takes into account the amino acid sequence of the peptide chain, the accessible surface area of its amino acids, the secondary structure probabilities of its amino acids and the grey correlation degree of the peptide chain, it helps improve the accuracy of protein glycation site identification.

Description

Protein saccharification site identification method

Technical Field

The application relates to the technical field of protein function prediction, and in particular to a method for identifying protein glycation sites.

Background

Glycation is the process by which reducing sugar molecules bind covalently to proteins in the absence of enzymes. It is one of the most important post-translational modifications (PTMs) of proteins and involves a two-step reaction: first, an unstable Schiff base forms and rearranges into a more stable Amadori product; then advanced glycation end products (AGEs) are produced. AGEs themselves or their cross-linked products can directly change protein structure and function, and AGEs can damage various organs of the body once they accumulate to a certain extent. More and more studies have shown that AGEs are present in eye proteins, plasma, erythrocytes, arteries and kidneys, and immunochemical methods have shown that the amount of AGEs in each tissue increases with age, which is associated with diseases such as diabetes, Alzheimer's disease and atherosclerosis. The main symptom of pre-diabetes is hyperglycemia, and two factors that induce hyperglycemia are insulin resistance and beta cell failure. There is increasing evidence that AGEs not only contribute to insulin resistance but also directly damage beta cells, leading to impaired function and even beta cell death. Since glycation reactions mostly occur between the epsilon amino group of lysine and the aldehyde or ketone group of reducing sugars, the synergistic interaction between lysine glycation and oxidation has attracted strong interest from researchers.

With the support of high-throughput sequencing technologies, the number of known proteins is growing exponentially, and identifying every glycation site of a protein solely by conventional experimental methods such as mass spectrometry is time-consuming and expensive. For this reason, researchers have developed various machine learning methods to predict protein glycation sites. For example: Johansen et al. manually collected 400 papers to obtain the first glycation site data set and built a neural-network-based glycation site predictor on it; based on the data set collected by Johansen et al., Liu et al. developed an improved predictor using a support vector machine algorithm; Xu et al. discussed the use of sequence order information and position-specific amino acid propensity in glycation site prediction and trained a predictor called "Gly-PseAAC" on another training data set; based on the data set collected by Xu et al., Zhao et al. encoded peptides using secondary structure information, AAindex and the composition of k-spaced amino acid pairs, then screened features and built a prediction model with a new two-step feature selection algorithm; Islam et al. proposed a method named iProtGly-SS that extracts features from sequence and secondary structure information, uses a feature selection algorithm to find the best feature set, and trains a predictor based on a support vector machine algorithm.

Although these models have been developed to predict glycation sites, several problems remain. First, some of the protein peptide chains in the data sets used in previous work have since been updated in Uniprot; continuing to use the outdated chains introduces unnecessary noise during training. Second, researchers have used only the features of individual peptide chains and ignored the relationships among peptide chains, so the extracted feature information is incomplete, which affects the accuracy of the results.

Disclosure of Invention

The application provides a protein saccharification site identification method, which is used for identifying a protein saccharification site and improving the accuracy of protein saccharification site identification.

The application provides a protein saccharification site identification method, which comprises the following steps:

collecting a glycation site training data set, and extracting from it a peptide chain $P = A_{-\eta} A_{-(\eta-1)} \ldots A_{-2} A_{-1} K A_{1} A_{2} \ldots A_{\eta-1} A_{\eta}$, where K is lysine, η is the number of amino acids upstream or downstream of the lysine, and each A is one of the 20 natural amino acids;

representing amino acids in the peptide chain by using 20-dimensional binary codes, and converting the peptide chain into a 20(2η+1)-dimensional digital vector;

calculating the accessible surface area of the amino acids in the peptide chain;

calculating the secondary structure probability of amino acids in the peptide chain;

calculating the grey correlation degree of the peptide chain;

obtaining a feature digital vector of the peptide chain, wherein the feature digital vector comprises the 20(2η+1)-dimensional digital vector, the accessible surface area of amino acids in the peptide chain, the secondary structure probability of amino acids in the peptide chain and the grey correlation degree of the peptide chain;

screening a plurality of features from the feature digital vector based on a maximum correlation minimum redundancy algorithm to obtain an optimal feature set;

training and obtaining a predictor based on a support vector machine according to the optimal feature set;

identifying protein glycation sites based on the predictor.

Alternatively, in the method for identifying a glycation site of a protein, the method further comprises:

padding the peptide chain with the symbol X when the number of amino acids upstream or downstream of the lysine in the peptide chain is less than η;

the 20-dimensional binary code of X is 00000000000000000000, the accessible surface area of X is 0, and the secondary structure probabilities of X are all 0.

Alternatively, in the method for identifying a protein glycation site, η =11.

Optionally, in the method for identifying a glycation site of a protein, the method further comprises:

collecting a saccharification site test data set, and extracting a test peptide chain from the saccharification site test data set;

evaluating the predictor on the test peptide chain by sensitivity (SEN), specificity (SPC), accuracy (ACC) and Matthews correlation coefficient (MCC);

adjusting the number of features in the optimal feature set, training a predictor based on a support vector machine according to the adjusted optimal feature set, and searching for the predictor with the highest accuracy (ACC);

and, for the predictor with the highest accuracy (ACC), counting the share of each feature type in the optimal feature set to obtain the feature type with the largest influence on the predictor.

The method for identifying the protein glycation sites comprises the steps of collecting a glycation site training data set, extracting a peptide chain from the glycation site training data set, coding and representing protein by utilizing a peptide chain digital vector, the accessible surface area of amino acid in the peptide chain, the secondary structure probability of the amino acid in the peptide chain and the gray correlation degree of the peptide chain, selecting a maximum correlation minimum redundancy (mRMR) feature selection algorithm to find an optimal feature set, and training on a support vector machine to obtain a predictor so as to identify the protein glycation sites.

Drawings

In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.

FIG. 1 is a schematic diagram showing the structure of a method for identifying a glycation site of a protein according to the present embodiment;

FIG. 2 shows the test accuracy of the predictor for different numbers of features in the optimal feature set according to an embodiment of the present application;

FIG. 3 shows the share of each feature type in the optimal feature set according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

FIG. 1 is a structural flow chart of a method for identifying a glycation site of a protein according to an embodiment of the present application. As shown in FIG. 1, the method for identifying a glycation site of a protein provided in the embodiment of the present application includes:

representing amino acids in the peptide chain by using 20-dimensional binary codes, and converting the peptide chain into a 20(2η+1)-dimensional numerical vector;

calculating the accessible surface area of amino acids in the peptide chain;

calculating the grey correlation degree of the peptide chain;

identifying protein glycation sites based on the predictor.

The method for identifying a protein glycation site provided in the examples of the present application will be described in detail with reference to specific examples.

The saccharification site training data set can be obtained in an existing paper or database. Specifically, the method comprises the following steps:

Johansen et al. manually screened the first glycation data set, referred to as data set A, from 400 articles. However, comparison with the protein peptide sequences in the Uniprot database showed that some peptide sequences in this data set have since been updated. We replaced the outdated peptide chains in data set A with the newer peptide chains from Uniprot, and deleted peptide chains with similarity greater than 60%. This yielded 68 positive samples and 90 negative samples.

Xu et al. extracted a data set from the CPLM database that contains 223 positive samples and 446 negative samples, referred to as data set B.

The last data set is the independent training set of Liu et al., referred to as data set C. Comparison showed a large overlap between data set C and data set B, so to prevent redundancy in the training data, the overlapping portion of data set C was deleted.

Altogether, we obtained a data set of 310 positive samples and 576 negative samples. However, a large excess of negative samples makes the training set imbalanced, so that the trained predictor is biased and tends to classify samples as negative. The K-nearest neighbor algorithm is therefore used to remove some redundant negative samples and reduce their statistical noise. For each negative sample, its K nearest neighbors are found, where K is the number of negative samples divided by the number of positive samples. The negative sample is removed if at least one of its K nearest neighbors belongs to the positive subset.
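The pruning rule described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the Euclidean distance, the rounding of K to the nearest integer (with a floor of 1) and the function names are assumptions, since the text does not specify them.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def prune_negatives(positives, negatives):
    """Keep only negatives whose K nearest neighbours are all negative."""
    # K = (#negatives / #positives); rounding and the floor of 1 are assumptions.
    k = max(1, round(len(negatives) / len(positives)))
    labelled = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    kept = []
    for idx, neg in enumerate(negatives):
        self_pos = len(positives) + idx  # index of this sample in `labelled`
        neighbours = sorted(
            (euclidean(neg, vec), label)
            for j, (vec, label) in enumerate(labelled)
            if j != self_pos
        )[:k]
        # Remove the sample if any of its K nearest neighbours is positive.
        if all(label == 0 for _, label in neighbours):
            kept.append(neg)
    return kept

positives = [(0.0, 0.0), (0.0, 1.0)]
negatives = [(0.2, 0.1), (5.0, 5.0), (5.2, 5.0), (5.0, 5.3)]
kept = prune_negatives(positives, negatives)
```

In this toy data the negative sample lying next to the positives is dropped, while the three negatives that cluster far away are kept.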

Finally, the collected glycation site training data set comprises 310 positive samples and 421 negative samples. Peptide chains are extracted from the collected glycation site training data set; a protein may contain many lysines, but not all of them are glycated. The peptide chain P is described as follows:

$$P = A_{-\eta} A_{-(\eta-1)} \ldots A_{-2} A_{-1} K A_{1} A_{2} \ldots A_{\eta-1} A_{\eta}$$

where K is the lysine at the center of the peptide chain, η is the number of upstream or downstream amino acids and is a natural number, and each of $A_{-\eta}, \ldots, A_{\eta}$ is one of the 20 natural amino acids.

In the embodiment of the present application, η = 11. If the number of amino acids upstream or downstream of the lysine is less than η, the peptide chain is padded with the special symbol "X" to the full length.
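A minimal sketch of this window extraction, assuming the protein is given as a one-letter amino acid string; the function name and the 1-based position it reports are illustrative choices, not part of the application.

```python
def extract_windows(sequence, eta=11, pad="X"):
    """Return (1-based position, window) for every lysine in `sequence`."""
    padded = pad * eta + sequence + pad * eta
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i sits at index i + eta in `padded`, so the window with
            # eta residues on each side is padded[i : i + 2 * eta + 1].
            windows.append((i + 1, padded[i : i + 2 * eta + 1]))
    return windows

# Small eta so the padding is easy to see.
example = extract_windows("MKTAYK", eta=2)
```

Each window has length 2η+1 with the lysine at its center; flanks that run past either end of the protein are filled with "X".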

In the examples of the present application, features were extracted from each peptide chain.

The amino acids in the peptide chain are represented using a 20-dimensional binary code, and the peptide chain is converted into a 20(2η+1)-dimensional numerical vector. Each peptide chain consists of a number of natural amino acids, and the amino acid sequence is converted to a numerical vector using the 20-dimensional binary code. The coding order of the 20 natural amino acids is "A", "R", "N", "D", "C", "Q", "E", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V". For example, alanine (A) is encoded as "10000000000000000000", arginine (R) as "01000000000000000000", and so on. The special symbol "X" is encoded as "00000000000000000000". By converting the peptide chain into a 20(2η+1)-dimensional numerical vector (AAS), the influence of the amino acids surrounding a glycation site can be taken into account when identifying protein glycation sites, which improves identification accuracy.
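The binary coding can be sketched as below, using the coding order listed above; the function and constant names are illustrative.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # coding order given in the text

def encode_peptide(peptide):
    """Concatenate one 20-bit code per residue; the padding symbol 'X' is all zeros."""
    vector = []
    for residue in peptide:
        bits = [0] * len(AMINO_ACIDS)
        if residue != "X":
            bits[AMINO_ACIDS.index(residue)] = 1
        vector.extend(bits)
    return vector

# A 23-residue window (eta = 11) yields a 20 * 23 = 460-dimensional vector.
aas = encode_peptide("X" * 11 + "K" + "A" * 11)
```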

The accessible surface area (ASA) of the amino acids in the peptide chain is calculated. In the examples of the present application, the ASA of each amino acid is computed with the SPIDER3 tool, and the value for the special symbol "X" is set to zero. The ASA of the amino acids is a key factor determining the properties of the peptide chain and reflects its basic structure; taking it into account in the identification of protein glycation sites improves identification accuracy.

The secondary structure of each amino acid in the peptide chain provides information about the local 3D structure of the protein. There are three types of secondary structure, namely alpha helix, beta sheet and random coil; P(h) denotes the probability of alpha helix, P(e) the probability of beta sheet and P(c) the probability of random coil. The secondary structure probability (SSP) of the amino acids in the peptide chain is calculated by running the SPIDER3 tool, which predicts the secondary structure probabilities of each amino acid. In the present application, P(h), P(e) and P(c) of the special symbol "X" are all 0.

The grey correlation degree (Gray) is used to measure the proximity between glycated and non-glycated peptide chains. In 1982, Deng et al. proposed grey system theory to study systems with uncertainty. According to this theory, a system whose information is completely known is called a "white system"; a system whose information is completely unknown is called a "black system"; and a system whose information is partly known is called a "grey system". The grey correlation degree, as described by Liu et al., is one of the main components of grey system theory. The biological features surrounding glycation sites are not completely understood, so glycation site prediction can be regarded as a grey system. To avoid completely losing sequence order information, the amino acid sequence is represented using the pseudo amino acid composition (PseAAC). Specifically, the web server built by Chou et al. may be used to generate the PseAAC values.

Peptide chain $P^i$ is expressed in pseudo amino acid composition form as:

$$P^i = \left[ p^i_1, p^i_2, \ldots, p^i_{20+\lambda} \right]^{T}$$

wherein: 20+λ is the dimension of the pseudo amino acid composition used to express the amino acid sequence; λ ranges from 0 to 6 and is preferably 6 in this application; i is the index of the peptide chain.

The grey correlation coefficient is defined as:

$$\gamma\left(p^q_j, p^i_j\right) = \frac{\Delta_{\min} + \rho\,\Delta_{\max}}{\left| p^q_j - p^i_j \right| + \rho\,\Delta_{\max}}$$

where $p^q_j$ is the j-th component of the pseudo amino acid composition of the q-th peptide chain, and $p^i_j$ is the j-th component of the pseudo amino acid composition of peptide chain i.

Wherein:

$$\Delta_{\min} = \min_i \min_j \left| p^q_j - p^i_j \right|, \qquad \Delta_{\max} = \max_i \max_j \left| p^q_j - p^i_j \right|$$

The grey correlation degree is defined as:

$$\Gamma\left(P^q, P^i\right) = \sum_{j=1}^{20+\lambda} \omega_j \, \gamma\left(p^q_j, p^i_j\right)$$

$P^i$ represents any peptide chain in the training set; $P^q$ is the target amino acid sequence from the glycation site training data set and is generally a peptide chain other than $P^i$. ρ is the distinguishing coefficient and takes a value between 0 and 1; in the embodiment of the present application, the intermediate value ρ = 0.5 is preferred.

$\omega_j$ is a weighting factor that must satisfy $\sum_{j=1}^{20+\lambda} \omega_j = 1$.

The grey correlation degree $\Gamma(P^q, P^i)$ represents the degree of similarity between the target peptide chain $P^q$ and any peptide chain $P^i$ in the training data set. When $P^q = P^i$, $\Gamma(P^q, P^i) = 1$, i.e. the two peptide chains are completely similar.
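As an illustration of the grey correlation degree, the sketch below assumes the standard Deng form of the grey correlation coefficient, γ = (Δmin + ρΔmax) / (|p^q_j − p^i_j| + ρΔmax), with Γ as the ω-weighted sum of γ over the components. Equal weights ω_j = 1/(20+λ) and the function name are assumptions; the inputs stand for PseAAC vectors obtained elsewhere.

```python
def grey_correlation(target, peptides, rho=0.5, weights=None):
    """Grey correlation degree between `target` and each vector in `peptides`."""
    dim = len(target)
    if weights is None:
        weights = [1.0 / dim] * dim  # equal weights summing to 1 (assumption)
    # Absolute differences |p^q_j - p^i_j| for every peptide i and component j.
    deltas = [[abs(t - p) for t, p in zip(target, vec)] for vec in peptides]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0:  # every peptide is identical to the target
        return [1.0] * len(peptides)
    degrees = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        degrees.append(sum(w * c for w, c in zip(weights, coeffs)))
    return degrees

# Toy 3-component vectors; real PseAAC vectors have 20 + lambda components.
degrees = grey_correlation([0.2, 0.5, 0.3], [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3]])
```

The first peptide equals the target, so its degree is 1, in line with the statement that Γ = 1 when the two chains coincide. Applying the function to one peptide against all training peptides gives one value per training peptide.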

The extracted peptide chain features are integrated to obtain the feature digital vector of the peptide chain. For example, a peptide chain contains 23 amino acids, so binary coding gives a 23 × 20 = 460-dimensional digital vector; the accessible surface area of the amino acids is represented by a 23 × 1 = 23-dimensional digital vector; the secondary structure probabilities are represented by a 23 × 3 = 69-dimensional digital vector; and, since the collected glycation site training data set contains 731 peptide chains, the grey correlation degree of the peptide chain is represented by a 731-dimensional digital vector. Thus each peptide chain has 460 + 23 + 69 + 731 = 1283 dimensions, i.e. each peptide chain is characterized by a 1283-dimensional feature digital vector.
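The dimension count can be checked with a few lines of arithmetic; the function name and parameter names are illustrative.

```python
def feature_dimension(eta=11, n_training_peptides=731):
    """Length of the combined feature vector of one peptide chain."""
    window = 2 * eta + 1          # residues per peptide chain
    aas = 20 * window             # binary-coded amino acid sequence
    asa = 1 * window              # one accessible surface area per residue
    ssp = 3 * window              # P(h), P(e), P(c) per residue
    grey = n_training_peptides    # one grey correlation degree per training peptide
    return aas + asa + ssp + grey

total = feature_dimension()
```

With η = 11 and 731 training peptides this gives 460 + 23 + 69 + 731 = 1283.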

Features in the feature digital vector are screened based on the maximum correlation minimum redundancy (mRMR) algorithm. The mRMR algorithm ranks the features; top-ranked features are considered "good" features, having maximum correlation with the classification target and minimum redundancy among themselves. These "good" features may provide more information for glycation site prediction.

The maximum correlation is defined as:

$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c)$$

The minimum redundancy is defined as:

$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)$$

where $I(x_i; c)$ is the mutual information between feature i and the target class c, $I(x_i; x_j)$ is the mutual information between feature i and feature j, and |S| is the number of features in the feature set S. Maximizing D(S, c) maximizes the correlation between the features in S and the class. Minimizing R(S) minimizes the mutual correlation among the features in S.

Combining the two criteria:

$$\max \Phi(D, R), \quad \Phi = D - R$$

where Φ is the difference between the correlation and the redundancy.

Suppose the feature set $S_{m-1}$ with m-1 features has already been obtained. The m-th feature is selected from the remaining feature set $X - S_{m-1}$ so as to maximize Φ. The incremental algorithm optimizes the following condition:

$$\max_{x_j \in X - S_{m-1}} \left[ I(x_j; c) - \frac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j; x_i) \right]$$

The resulting $S_m$ is the optimal feature set.
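A minimal sketch of the incremental selection, assuming the usual greedy criterion (relevance I(x_j; c) minus the mean mutual information with the features already selected) and discrete feature values. Continuous features such as accessible surface area would need to be discretized first, which is not shown; all names are illustrative.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

def mrmr(features, labels, n_select):
    """Greedy mRMR; `features` is a list of columns, one per feature."""
    relevance = [mutual_information(col, labels) for col in features]
    # Start with the single most relevant feature.
    selected = [max(range(len(features)), key=relevance.__getitem__)]
    while len(selected) < n_select:
        best, best_score = None, -math.inf
        for j in range(len(features)):
            if j in selected:
                continue
            redundancy = sum(
                mutual_information(features[j], features[i]) for i in selected
            ) / len(selected)
            score = relevance[j] - redundancy  # Phi = D - R for candidate j
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

labels = [0, 0, 0, 0, 1, 1, 1, 1]
f0 = [0, 0, 0, 1, 1, 1, 1, 1]   # strongly relevant
f1 = list(f0)                   # exact duplicate of f0, fully redundant
f2 = [0, 1, 0, 0, 1, 0, 1, 1]   # weaker, but adds new information
chosen = mrmr([f0, f1, f2], labels, n_select=2)
```

In this toy example the duplicate column is skipped in favour of the less relevant but less redundant one, which is the behaviour the criterion is designed to produce.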

A predictor is obtained by training a support vector machine (SVM) on the selected optimal feature set. That is, for given training samples $x_i$ and their corresponding class labels $y_i$, the classification task can be described as:

$$\min_{\omega, b, \xi} \; \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{l} \xi_i$$

$$\text{s.t.} \quad y_i(\omega x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, \ldots, l$$

where ω represents the importance of the different features of the training samples in constructing the classification hyperplane, b is the bias of the hyperplane, $\xi_i$ is a non-negative slack variable, and C is a penalty parameter: the larger C is, the heavier the penalty for misclassification.

The support vector machine is a machine learning method based on statistical learning theory, and is well suited to classifying high-dimensional data sets that are not linearly separable. Training the predictor with a support vector machine therefore helps improve the accuracy of protein glycation site identification.
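To make the roles of the slack variables and the penalty parameter concrete, the sketch below evaluates the standard soft-margin objective ½‖ω‖² + C Σ ξ_i for a given linear hyperplane, with labels in {+1, −1}. It does not train an SVM; the function name is illustrative and the linear form is an assumption, since the application does not state which kernel is used.

```python
def soft_margin_objective(w, b, samples, labels, C=1.0):
    """Return (objective, slacks) of the soft-margin SVM for a fixed hyperplane."""
    slacks = []
    for x, y in zip(samples, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        # Smallest slack that satisfies y_i (w . x_i + b) >= 1 - xi_i, xi_i >= 0.
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks

samples = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
labels = [1, -1, 1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, samples, labels, C=1.0)
```

Only the third sample lies inside the margin and needs a non-zero slack; raising C increases the cost of that violation relative to the margin term.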

In the embodiments of the present application, the method for identifying a protein glycation site further comprises: collecting a glycation site test data set, and extracting test peptide chains from the glycation site test data set.

In the present application, comparing the glycation site training data set with the CPLM database yielded 51 new positive samples and 81 new negative samples; in addition, 3 newly included protein sequences with glycation sites were found in the CPLM database, providing 11 positive samples and 14 negative samples; and 2 new protein sequences were found in the PLMD database, providing 3 positive and 5 negative samples. The resulting 65 positive samples and 100 negative samples were used as the glycation site test data set.

The predictor is evaluated on the test peptide chains by sensitivity (SEN), specificity (SPC), accuracy (ACC) and Matthews correlation coefficient (MCC), which are defined as follows:

$$S_n = \frac{TP}{TP + FN}$$

$$S_p = \frac{TN}{TN + FP}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

wherein:

FN: false negative, a sample judged to be negative that is actually positive.

FP: false positive, a sample judged to be positive that is actually negative.

TN: true negative, a sample judged to be negative that is actually negative.

TP: true positive, a sample judged to be positive that is actually positive.

Sensitivity (SEN, $S_n$): the proportion of all positive examples that are predicted correctly; it measures the predictor's ability to recognize positive examples.

Specificity (SPC, $S_p$): the proportion of all negative examples that are predicted correctly; it measures the predictor's ability to recognize negative examples.

Accuracy (ACC): the number of correctly predicted samples divided by the number of all samples; generally speaking, the higher the accuracy, the better the predictor.

Matthews correlation coefficient (MCC): reflects prediction ability more fairly when the numbers of positive and negative samples differ greatly.
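The four measures can be computed from the confusion counts using their standard definitions, as in the sketch below. The counts in the usage line are hypothetical and chosen only to give round numbers.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sen = tp / (tp + fn)
    spc = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SEN": sen, "SPC": spc, "ACC": acc, "MCC": mcc}

# Hypothetical counts: 50 positives (40 found), 50 negatives (45 found).
scores = evaluate(tp=40, tn=45, fp=5, fn=10)
```

Here sensitivity is 0.8, specificity 0.9 and accuracy 0.85; the MCC is lower than the accuracy because it penalizes both kinds of error jointly.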

Based on the glycation site training data set and the glycation site test data set provided by the embodiment of the application, the number of features in the optimal feature set is adjusted, a predictor is trained for each setting, and the predictor with the highest accuracy (ACC) is selected. The test accuracy is shown in FIG. 2. In the embodiment of the present application, the predictor has the best classification ability when the optimal feature set contains 170 features, that is, when the data dimension is 170. In this case the classification accuracy on the independent test set reaches 69.091%: the model predicts 41 positive samples and 73 negative samples correctly, and misclassifies 24 positive samples and 27 negative samples.

The evaluation parameters of the predictor obtained in the embodiment of the application are compared with those of an existing predictor; the details are shown in Table 1.

Table 1:

Predictor	S_n	S_p	ACC	MCC
Predictor of the present application	63.120%	73.921%	69.091%	36.425%
Comparison predictor	54.085%	69.387%	63.038%	23.225%

Further, in the embodiment of the present application, for the predictor with the highest accuracy (ACC), the share of each feature type in the optimal feature set is counted to obtain the feature type with the largest influence on the predictor. In the present example, when the optimal feature set contains 170 features, the binary-coded features of the amino acid sequence have the greatest influence on glycation site prediction (more than half of the selected features), the second most important factor is the grey correlation degree, and the accessible surface area and the secondary structure probability also contribute to the identification of protein glycation sites, as shown in FIG. 3. This shows that the relationships among peptide chains cannot be neglected when identifying protein glycation sites.
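Counting the share of each feature type can be sketched as follows. The layout of the 1283-dimensional vector (binary code first, then accessible surface area, secondary structure probability and grey correlation degree) is an assumption based on the order in which the dimensions are summed in the description; the type labels and function names are illustrative.

```python
from collections import Counter

def feature_type(index, eta=11, n_training_peptides=731):
    """Map a 0-based feature index to its feature type (assumed layout)."""
    window = 2 * eta + 1
    blocks = [
        ("AAS", 20 * window),          # indices 0..459 for eta = 11
        ("ASA", window),               # 460..482
        ("SSP", 3 * window),           # 483..551
        ("Gray", n_training_peptides), # 552..1282
    ]
    start = 0
    for name, width in blocks:
        if index < start + width:
            return name
        start += width
    raise IndexError(f"feature index {index} out of range")

def count_feature_types(selected_indices):
    """Count how many selected features fall into each feature type."""
    return Counter(feature_type(i) for i in selected_indices)

# Hypothetical selection of four feature indices.
counts = count_feature_types([0, 5, 470, 600])
```

Applied to the indices returned by the feature selection step, the resulting counts give the per-type share that the description refers to.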

The method for identifying the protein glycation site comprises the steps of collecting a glycation site training data set, extracting a peptide chain from the glycation site training data set, coding and representing protein by utilizing a peptide chain digital vector, the accessible surface area of amino acid in the peptide chain, the secondary structure probability of the amino acid in the peptide chain and the gray correlation degree of the peptide chain, selecting a maximum correlation minimum redundancy (mRMR) feature selection algorithm to find an optimal feature set, and then training on a support vector machine to obtain a predictor, so that the protein glycation site is identified. The method for identifying the protein glycation site provided by the embodiment of the application fully considers the amino acid sequence in the peptide chain, the accessible surface area of the amino acid in the peptide chain, the secondary structure probability of the amino acid in the peptide chain and the gray correlation degree of the peptide chain, and is favorable for improving the accuracy of identifying the protein glycation site.

All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments, and the relevant parts are referred to the partial description of the method embodiment. Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A method for identifying a glycation site, the method comprising:

collecting a glycation site training data set, and extracting from it a peptide chain $P = A_{-\eta} A_{-(\eta-1)} \ldots A_{-2} A_{-1} K A_{1} A_{2} \ldots A_{\eta-1} A_{\eta}$, where K is lysine, η is the number of amino acids upstream or downstream of the lysine, and each A is one of the 20 natural amino acids;

representing amino acids in the peptide chain by using 20-dimensional binary codes, and converting the peptide chain into a 20(2η+1)-dimensional digital vector; wherein: the peptide chain is padded with the symbol X when the number of amino acids upstream or downstream of the lysine in the peptide chain is less than η; the 20-dimensional binary code of X is 00000000000000000000, the accessible surface area of X is 0, and the secondary structure probabilities of X are all 0;

calculating the accessible surface area of amino acids in the peptide chain;

calculating the grey correlation degree of the peptide chain;

identifying protein glycation sites based on the predictor.

2. The method of identifying a protein glycation site according to claim 1, wherein η =11.

3. The method for identifying a glycation site of a protein according to claim 1, further comprising:

evaluating the predictor on the test peptide chain by sensitivity SEN, specificity SPC, accuracy ACC and Matthews correlation coefficient MCC.

4. The method of identifying a protein glycation site according to claim 3, characterized by further comprising:

and adjusting the feature quantity of the optimal feature set, training based on a support vector machine according to the adjusted optimal feature set to obtain a predictor, and searching for the predictor with higher accuracy ACC.

5. The method for identifying a glycation site of a protein according to claim 4, further comprising:

and when the accuracy ACC of the predictor is high, counting the occupation amount of each feature type in the optimal feature set, and acquiring the feature type with the largest influence on the predictor.