CN117095754A

CN117095754A - Method for classifying proteins by machine learning

Info

Publication number: CN117095754A
Application number: CN202311353697.6A
Authority: CN
Inventors: 王曙蒙; 常天安
Original assignee: Jiangsu Zhengda Tianchuang Biological Engineering Co ltd
Current assignee: Jiangsu Zhengda Tianchuang Biological Engineering Co ltd
Priority date: 2023-10-19
Filing date: 2023-10-19
Publication date: 2023-11-21
Anticipated expiration: 2043-10-19
Also published as: CN117095754B

Abstract

The invention relates to the technical field of protein data processing and provides a method for classifying proteins using machine learning. The method comprises: acquiring graph data of protein molecules; obtaining an amino acid structure gradient sequence from the graph data of each protein molecule; obtaining a structure complexity index and a similarity position change index from the amino acid structure gradient sequence; obtaining a protein molecule space and a protein diversity index from the structure complexity index and the similarity position change index; obtaining a local density from the protein molecule space; obtaining a protein similarity from the protein diversity index and the local density; obtaining clusters of protein molecules from the protein similarity using the chameleon clustering algorithm; and obtaining a protein classification result from the clusters of protein molecules. By taking into account the structural variation within a protein and the change in position between similar amino acids, the invention improves the chameleon clustering algorithm and thereby the accuracy of protein classification.

Description

Method for classifying proteins by machine learning

Technical Field

The invention relates to the technical field of protein data processing, in particular to a method for classifying proteins by utilizing machine learning.

Background

A protein is a macromolecular compound in which amino acids are joined in a specific sequence into polypeptide chains that in turn combine in a specific manner; proteins are the main carriers of life activities. Graph data is non-linear data composed of nodes and edges. Because protein molecules are built from amino acids linked by dehydration condensation, each protein molecule can be stored naturally as graph data.

Existing protein classification methods use a graph neural network (GNN) to learn the graph data of proteins and thereby assist classification. However, these methods do not consider embedding protein structures into a uniform dimension, so the GNN learns the graph data poorly and the accuracy of protein classification suffers.

Disclosure of Invention

The invention provides a method for classifying proteins by machine learning, which solves the problem of poor accuracy of protein classification, and adopts the following technical scheme:

One embodiment of the present invention provides a method for protein classification using machine learning, the method comprising the steps of:

acquiring graph data of protein molecules;

acquiring the amino acid structure gradient sequence of each protein molecule according to the graph data of each protein molecule; acquiring the structure complexity index of each protein molecule according to the amino acid structure gradient sequence of each protein molecule; obtaining the similarity position change index of each protein molecule according to the graph data and the amino acid structure gradient sequence of each protein molecule; acquiring protein molecule space according to the structure complexity index and the similarity position change index of all protein molecules; obtaining the local density of each protein molecule according to the protein molecule space;

acquiring a protein diversity index of each protein molecule according to the structure complexity index and the similarity position change index of each protein molecule; obtaining the protein similarity between two protein molecules according to the protein diversity index difference and the local density difference between the two protein molecules; obtaining a clustering result of protein molecules based on the protein similarity by adopting a chameleon clustering algorithm;

and obtaining a protein diversity size sequence according to the clustering result of the protein molecules, and obtaining a protein classification result according to the protein diversity size sequence.

Preferably, the method for obtaining the amino acid structure gradient sequence of each protein molecule according to the graph data of each protein molecule comprises the following steps:

the number of edges connected to each node in the graph data of a protein molecule is taken as the structural complexity of that node, and the sequence formed by the structural complexity values of all nodes in the graph data, sorted in ascending order, is taken as the amino acid structure gradient sequence of the protein molecule.

Preferably, the method for obtaining the structural complexity index of each protein molecule according to the amino acid structure gradient sequence of each protein molecule comprises the following steps:

In the formula, the terms denote, in order: the structural complexity index of the x-th protein molecule in the protein set; the coefficient of variation of the values in the amino acid structure gradient sequence of the x-th protein molecule; the exponential function with the natural constant as its base; the number of values in the amino acid structure gradient sequence of the x-th protein molecule; the i-th and (i-1)-th values in that sequence; the maximum and minimum values in that sequence; and an error constant.

Preferably, the method for obtaining the similarity position change index of each protein molecule according to the graph data of each protein molecule and the amino acid structure gradient sequence comprises the following steps:

In the formula, the terms denote, in order: the similarity position change index of the x-th protein molecule in the protein set; the Euclidean distance function; the position coordinates, in the graph data of the x-th protein molecule, of the nodes represented by the i-th and (i-1)-th values in the amino acid structure gradient sequence of the x-th protein molecule; and the number of values in that sequence.

Preferably, the method for obtaining the protein molecule space according to the structural complexity index and the similarity position change index of all protein molecules comprises the following steps:

normalizing the structure complexity index of all the protein molecules, taking the normalization processing result of the structure complexity index of each protein molecule as the normalization structure complexity index of each protein molecule, and taking the normalization structure complexity index of each protein molecule as the abscissa of each protein molecule under a two-dimensional coordinate system;

normalizing the similarity position change indexes of all protein molecules, taking the normalization processing result of the similarity position change indexes of each protein molecule as the normalization similarity position change index of each protein molecule, and taking the normalization similarity position change index of each protein molecule as the ordinate of each protein molecule under a two-dimensional coordinate system;

and obtaining the two-dimensional positioning coordinates of each protein molecule according to the abscissa and the ordinate of each protein molecule under the two-dimensional coordinate system, and taking a two-dimensional data space formed by the two-dimensional positioning coordinates of all protein molecules as a protein molecule space.

Preferably, the method for obtaining the local density of each protein molecule according to the protein molecule space comprises the following steps:

taking the two-dimensional coordinates of all protein molecules in the protein molecule space as the input of a DPC density peak value clustering algorithm, taking a preset cut-off distance as the parameter of the DPC density peak value clustering algorithm, and taking the output of the DPC density peak value clustering algorithm as the local density of each protein molecule.

Preferably, the method for obtaining the protein diversity index of each protein molecule according to the structural complexity index and the similarity position change index of each protein molecule comprises the following steps:

for each protein molecule, the product of the structural complexity index of the protein molecule and the similarity position change index is taken as the protein diversity index of the protein molecule.

Preferably, the method for obtaining the protein similarity between two protein molecules according to the protein diversity index difference and the local density difference between the two protein molecules comprises the following steps:

In the formula, the terms denote, in order: the protein similarity between the x-th protein molecule and the g-th protein molecule in the protein set; the protein diversity indices of the x-th and g-th protein molecules; the local densities of the x-th and g-th protein molecules; and M, a proportionality constant.

Preferably, the method for obtaining the clustering result of the protein molecules based on the protein similarity by adopting a chameleon clustering algorithm comprises the following steps:

taking all protein molecules in a protein molecule space as input of a chameleon clustering algorithm, taking protein similarity among the protein molecules as measurement distance among the protein molecules, and obtaining a clustering result of the protein molecules based on the measurement distance by adopting the chameleon clustering algorithm.

Preferably, the method for obtaining the protein diversity size sequence according to the clustering result of the protein molecules and obtaining the protein classification result according to the protein diversity size sequence comprises the following steps:

calculating the protein diversity index average value of all protein molecules in each cluster in the protein molecule clustering result, and taking a sequence formed by the protein diversity index average values of all the clusters in the protein molecule clustering result according to the ascending order of the numerical values as a protein diversity size sequence;

taking the cluster represented by each element in the protein diversity size sequence as a protein class, and obtaining a protein classification result according to all protein classes.

The beneficial effects of the invention are as follows: according to the invention, an amino acid structure gradient sequence is obtained by using graph data of protein molecules, then, a structure complexity index and a similarity position change index of each protein molecule are obtained according to the amino acid structure gradient sequence and the graph data of the protein molecules, further, the local density of the protein molecules is obtained by using a DPC density peak clustering algorithm, a protein diversity index is obtained based on the structure complexity index and the similarity position change index, the measurement distance in a chameleon clustering algorithm is improved based on the local density and the protein diversity index, and further, the clustering result of the protein molecules in a protein set is obtained by using the chameleon clustering algorithm. The method has the advantages that the measurement distance in the chameleon clustering algorithm is improved by considering the structural change inside the protein and the similarity position change between amino acids, so that the characteristic distance between protein molecules is more accurate, and the precision of protein classification is improved.

Drawings

In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a method for classifying proteins using machine learning according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for classifying proteins using machine learning according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to FIG. 1, a flowchart of a method for classifying proteins using machine learning according to one embodiment of the present invention is shown, the method comprising the steps of:

Step S001, obtaining the graph data of the protein molecules.

The invention first extracts the graph data of protein molecules and then completes protein classification with a machine learning algorithm based on that graph data. A number of protein molecules are collected from an existing protein database; the number collected is denoted n, with an empirical value of 500. All collected protein molecules form the protein set, and each protein molecule in the set is converted into graph data. The graph data of a protein molecule is non-linear data consisting of nodes, which represent the amino acids in the protein, and edges, which represent the chemical bonds between amino acids. Conversion to graph data is a known technique and is not described further here.

Thus, the graph data of all protein molecules in the protein set is obtained.

Step S002, obtaining amino acid structure gradient sequence according to the graph data of protein molecules, obtaining structure complexity index and similarity position change index according to the graph data of protein molecules and the amino acid structure gradient sequence, obtaining protein molecule space according to the structure complexity index and similarity position change index, and obtaining the local density of each protein molecule by using a density peak clustering algorithm.

To obtain a better clustering result, the basic information of the protein molecules is analyzed on the basis of the graph data of all protein molecules in the protein set. In the graph data of a protein molecule, each node represents an amino acid and each edge represents a chemical bond between amino acids. Different amino acids have different constituent elements and therefore different structures; every amino acid has at least one amino group and one carboxyl group, which form chemical bonds through dehydration condensation. The number of edges connected to a node therefore reflects, to some extent, the structural complexity of the corresponding amino acid. The invention obtains the number of edges connected to each node in the graph data of each protein molecule and takes it as the structural complexity of the corresponding amino acid. A flow chart of an embodiment of the present invention is shown in FIG. 2.

Based on the above analysis, the structural complexity of each node in the graph data of each protein molecule, that is, of each amino acid, is obtained, and the sequence formed by the structural complexity values of all nodes, sorted in ascending order, is taken as the amino acid structure gradient sequence of the protein molecule. Because this sequence is non-decreasing and each of its values reflects the structural complexity of an amino acid, it reflects the structural characteristics of the protein molecule to some extent: the larger the differences between the values in the sequence, the more complex the structure of the protein molecule.
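The construction of the amino acid structure gradient sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is given as a list of undirected edges between integer node indices, the function name `gradient_sequence` is hypothetical, and nodes with equal degree keep their original index order (the source does not say how ties are ordered).

```python
from collections import Counter


def gradient_sequence(edges, num_nodes):
    """Return (sequence, order).

    sequence: node degrees (structural complexity values) in ascending order.
    order:    node indices arranged in that same sorted order.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # sorted() is stable, so ties keep index order (an assumption of this sketch).
    order = sorted(range(num_nodes), key=lambda node: degree[node])
    sequence = [degree[node] for node in order]
    return sequence, order


# Toy graph: a chain 0-1-2-3 with one extra edge 1-3.
edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
sequence, order = gradient_sequence(edges, num_nodes=4)
print(sequence)  # [1, 2, 2, 3]
print(order)     # [0, 2, 3, 1]
```

Returning the node order alongside the sorted degrees is a design choice of the sketch: later steps need to know which node each value of the sequence belongs to.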

Calculating the structural complexity index of each protein molecule in the protein collection:

In the formula, the terms denote, in order: the structural complexity index of the x-th protein molecule in the protein set; the coefficient of variation of the values in the amino acid structure gradient sequence of the x-th protein molecule; the exponential function with the natural constant as its base; the number of values in the amino acid structure gradient sequence of the x-th protein molecule; the i-th and (i-1)-th values in that sequence; the maximum and minimum values in that sequence; and an error constant. The error constant prevents the denominator from being 0; its empirical value is 0.01.

The larger the coefficient of variation of the values in the amino acid structure gradient sequence of the x-th protein molecule, the larger the difference between the i-th and (i-1)-th values in that sequence, and the larger the difference between the maximum and minimum values in that sequence, the more complex the structures of the amino acids in the protein molecule and the more distinct amino acids it contains. The structure of the protein molecule is then more complex, and its structural complexity index is larger.

Further, the similarity between amino acids is combined with the Euclidean distance: the farther apart amino acids of high similarity lie, the larger the position change between the amino acid structure gradient sequence and the amino acids in the graph data of the protein molecule, which better reflects the specific characteristics of that graph data.
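The quantities that enter the structural complexity index can be computed as in the following sketch. It does not reproduce the index itself, only its ingredients as named in the text: the coefficient of variation, the successive differences, the range and the number of values. The function name `complexity_ingredients` is hypothetical, and the use of the population standard deviation for the coefficient of variation is an assumption.

```python
import statistics


def complexity_ingredients(sequence):
    """Compute the quantities named in the text for one gradient sequence."""
    mean = statistics.fmean(sequence)
    # Coefficient of variation = standard deviation / mean.
    # Population standard deviation is an assumption of this sketch;
    # the mean must be non-zero (true when every node has at least one edge).
    cv = statistics.pstdev(sequence) / mean
    # Differences between the i-th and (i-1)-th values.
    diffs = [b - a for a, b in zip(sequence, sequence[1:])]
    value_range = max(sequence) - min(sequence)
    return {"cv": cv, "diffs": diffs, "range": value_range, "count": len(sequence)}


parts = complexity_ingredients([1, 2, 2, 3])
print(parts["diffs"])  # [1, 0, 1]
print(parts["range"])  # 2
```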

Calculating the similarity position change index of each protein molecule in the protein collection:

In the formula, the terms denote, in order: the similarity position change index of the x-th protein molecule in the protein set; the Euclidean distance function; the position coordinates, in the graph data of the x-th protein molecule, of the nodes represented by the i-th and (i-1)-th values in the amino acid structure gradient sequence of the x-th protein molecule; and the number of values in that sequence.

The closer the i-th and (i-1)-th values in the amino acid structure gradient sequence, the more similar, to some extent, the chemical formulas of the two amino acids. The larger the Euclidean distance between the position coordinates, in the graph data of the x-th protein molecule, of the nodes represented by these two values, the larger the similarity position change index of the protein molecule.

Further, to obtain a more accurate protein classification result, the structural complexity indices and the similarity position change indices of the protein molecules in the protein set are each normalized. Normalization is a known technique and is not described further here.
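The distances that drive the similarity position change index can be illustrated as follows. The sketch assumes that each node has known position coordinates (here a hypothetical 2-D layout) and that `order` lists the node indices in gradient-sequence order. Averaging the distances at the end is an assumption made for illustration; the text only states that the index grows with these distances and depends on the number of values in the sequence.

```python
import math


def consecutive_distances(order, positions):
    """Euclidean distance between the nodes behind consecutive sequence values."""
    return [math.dist(positions[a], positions[b]) for a, b in zip(order, order[1:])]


# Hypothetical node coordinates and a gradient-sequence node order.
positions = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (3.0, 0.0), 3: (0.0, 4.0)}
order = [0, 2, 3, 1]

dists = consecutive_distances(order, positions)
# Assumed aggregation: the mean of the consecutive distances.
mean_distance = sum(dists) / len(dists)
```

For this layout the three consecutive distances are 3, 5 and 3, so nodes that sit next to each other in the sequence but far apart in the graph raise the aggregate.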

Specifically, the normalized structural complexity index and the normalized similarity position change index of the x-th protein molecule are taken as the abscissa and the ordinate of its two-dimensional positioning coordinates. The two-dimensional positioning coordinates of all protein molecules in the protein set are mapped into a two-dimensional coordinate system to obtain a two-dimensional data space, which is taken as the protein molecule space. The two-dimensional coordinates of all protein molecules in the protein molecule space are used as the input of the DPC (density peak clustering) algorithm, with a preset cut-off distance as its input parameter; the empirical value of the preset cut-off distance is 10. The output of the DPC algorithm is taken as the local density of each protein molecule. The DPC algorithm is a known technique and is not described further here.

Thus, the local density of each protein molecule is obtained.
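A minimal sketch of the protein molecule space and the local density follows. Two assumptions are made: normalization is taken to be min-max scaling (the text does not name the scheme), and the local density is the cut-off-kernel density of density peak clustering, that is, the number of other points closer than the cut-off distance. The function names and the toy cut-off of 0.2 are illustrative and are not the empirical value given in the text.

```python
import math


def min_max(values):
    """Scale values to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def local_density(points, cutoff):
    """Cut-off kernel density: count of other points closer than cutoff."""
    density = []
    for i, p in enumerate(points):
        count = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) < cutoff
        )
        density.append(count)
    return density


# Hypothetical index values for four protein molecules.
complexity = [0.2, 0.4, 0.5, 3.0]
position_change = [1.0, 1.2, 1.1, 9.0]

# Abscissa: normalized complexity; ordinate: normalized position change.
space = list(zip(min_max(complexity), min_max(position_change)))
rho = local_density(space, cutoff=0.2)
print(rho)  # [2, 2, 2, 0]
```

The first three molecules lie close together in the space and each sees two neighbours, while the fourth is isolated and has density 0.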

Step S003, obtaining a protein diversity index based on the structure complexity index and the similarity position change index, improving the measurement distance in the chameleon clustering algorithm based on the local density and the protein diversity index, and further obtaining the clustering result of protein molecules in the protein set by using the chameleon clustering algorithm.

In general, protein structures are complex and protein molecules are diverse. The farther apart highly similar amino acids lie within the structure of a protein molecule, and the more complex that structure is, the greater, to some extent, the diversity of the protein molecule.

Thus, the protein diversity index is calculated for each protein molecule in the protein pool:

In the formula, the terms denote, in order: the protein diversity index of the x-th protein molecule in the protein set; the structural complexity index of the x-th protein molecule; and the similarity position change index of the x-th protein molecule. The protein diversity index is the product of the structural complexity index and the similarity position change index.

The larger the structural complexity index of a protein molecule and the larger its similarity position change index, the more diverse the protein molecule and the larger its protein diversity index.

The chameleon clustering algorithm is a hierarchical clustering algorithm that can achieve fast, high-quality clustering. However, when the neighbor graph is built, the metric distance is usually the Euclidean distance, which considers only the positions of the data points and ignores the relationships between their attributes, so the clustering result is poor. To obtain a better clustering result, the invention also takes the attribute relationships between data points into account.

The protein similarity between every two protein molecules is calculated:

In the formula, the terms denote, in order: the protein similarity between the x-th protein molecule and the g-th protein molecule in the protein set; the protein diversity indices of the x-th and g-th protein molecules; the local densities of the x-th and g-th protein molecules; and M, a proportionality constant with an empirical value of 2.

Thus, the protein similarity between protein molecules is determined from the difference between their protein diversity indices and the difference between their local densities. The larger the difference between the normalized protein diversity indices of the x-th and g-th protein molecules in the protein set, and the larger the difference between their normalized local densities, the larger the protein similarity value between the two protein molecules.

The protein similarity between protein molecules is used as the metric distance between them, and all protein molecules in the protein molecule space are used as the input of the chameleon clustering algorithm. The k value of the k-nearest-neighbor graph is set to 20, the weight value during clustering to 1.5 and the metric threshold to 5. The output of the chameleon clustering algorithm gives all clusters in the clustering result of the protein molecules. The chameleon clustering algorithm is a known technique and is not described further here.
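The pairwise computation can be organised as in the sketch below. It builds a full matrix from the two absolute differences the text names and leaves the way they are combined to a caller-supplied function. The `placeholder_combine` function, which scales the sum of the two differences by M, is a hypothetical stand-in chosen only so the example runs; it is not the formula of the invention. It merely respects the stated property that larger differences give a larger value.

```python
def pairwise_matrix(diversity, density, combine):
    """Matrix of combine(|diversity diff|, |density diff|) for all pairs."""
    n = len(diversity)
    return [
        [
            combine(abs(diversity[x] - diversity[g]), abs(density[x] - density[g]))
            for g in range(n)
        ]
        for x in range(n)
    ]


M = 2  # proportionality constant; empirical value taken from the text


def placeholder_combine(diversity_diff, density_diff):
    # Hypothetical stand-in, NOT the source formula.
    return M * (diversity_diff + density_diff)


# Hypothetical normalized diversity indices and local densities.
diversity = [0.0, 0.5, 1.0]
density = [1.0, 1.0, 0.0]
matrix = pairwise_matrix(diversity, density, placeholder_combine)
print(matrix[0])  # [0.0, 1.0, 4.0]
```

Keeping the combination step as a parameter means the matrix construction stays unchanged when the actual formula is substituted.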
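The chameleon algorithm starts from a k-nearest-neighbor graph, and the sketch below shows how such a graph can be built from a precomputed metric-distance matrix instead of Euclidean distances. Only this first phase is shown; the graph partitioning and cluster merging phases are omitted. The function name `knn_graph` and the toy matrix are illustrative assumptions, and k is set to 1 for the toy data rather than the value of 20 used in the text.

```python
def knn_graph(distance, k):
    """Map each item to its k nearest other items under a precomputed distance."""
    graph = {}
    for x, row in enumerate(distance):
        others = [g for g in range(len(row)) if g != x]
        # Stable sort: equally distant items keep index order.
        others.sort(key=lambda g: row[g])
        graph[x] = others[:k]
    return graph


# Toy symmetric metric-distance matrix for four protein molecules.
distance = [
    [0, 1, 4, 5],
    [1, 0, 3, 4],
    [4, 3, 0, 1],
    [5, 4, 1, 0],
]
graph = knn_graph(distance, k=1)
print(graph)  # {0: [1], 1: [0], 2: [3], 3: [2]}
```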

Step S004, obtaining a protein classification result according to the clustering result of the protein molecules.

According to all the clustering clusters in the obtained clustering result of the protein molecules, calculating protein diversity index average values of all the protein molecules in each clustering cluster, taking a sequence formed by the protein diversity index average values of all the clustering clusters according to the ascending order of numerical values as a protein diversity size sequence, and taking a class formed by the protein molecules in a first clustering cluster in the protein diversity size sequence as a first protein class; taking the class consisting of protein molecules in a second cluster in the protein diversity size sequence as a second protein class; and so on.

Therefore, the distribution of protein classes is completed according to the protein diversity index average, and the first protein class represents the class with the smallest protein diversity index average, which indicates that the protein molecules contained in the first protein class have lower diversity; and the last protein class represents the class with the largest average value of the protein diversity index, which indicates that the protein molecules contained in the last protein class have higher diversity.
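The ordering of clusters into protein classes can be sketched as follows, assuming the clustering result is available as one cluster label per protein molecule. The function name `rank_clusters` and the sample labels and index values are hypothetical.

```python
from statistics import fmean


def rank_clusters(labels, diversity):
    """Map each cluster label to its class number (1 = lowest mean diversity)."""
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)
    # Mean protein diversity index of each cluster.
    means = {
        label: fmean([diversity[i] for i in indices])
        for label, indices in members.items()
    }
    # Ascending order of cluster means gives the protein diversity size sequence.
    ordered = sorted(means, key=means.get)
    return {label: rank + 1 for rank, label in enumerate(ordered)}


labels = ["a", "b", "a", "c", "b"]
diversity = [0.9, 0.1, 0.7, 0.4, 0.3]
classes = rank_clusters(labels, diversity)
print(classes)  # {'b': 1, 'c': 2, 'a': 3}
```

Cluster "b" has the smallest mean diversity index and becomes the first protein class; cluster "a" has the largest and becomes the last.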

Thus, the method for protein classification using machine learning is completed.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the invention, but any modifications, equivalent substitutions, improvements, etc. within the principles of the present invention should be included in the scope of the present invention.

Claims

1. A method for classifying proteins using machine learning, the method comprising the steps of:

acquiring graph data of protein molecules;

2. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the amino acid structure gradient sequence of each protein molecule from the graph data of each protein molecule comprises the steps of:

3. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the structural complexity index of each protein molecule based on the amino acid structure transition sequence of each protein molecule comprises the steps of:

4. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the similarity position change index of each protein molecule from the graph data and the amino acid structure gradient sequence of each protein molecule comprises the steps of:

In the formula, the terms denote, in order: the similarity position change index of the x-th protein molecule in the protein set; the Euclidean distance function; the position coordinates, in the graph data of the x-th protein molecule, of the nodes represented by the i-th and (i-1)-th values in the amino acid structure gradient sequence of the x-th protein molecule; and the number of values in that sequence.

5. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining protein molecule space according to the structural complexity index and the similarity position variation index of all protein molecules comprises the steps of:

6. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the local density of each protein molecule from the protein molecule space is as follows:

7. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the protein diversity index of each protein molecule according to the structure complexity index and the similarity position variation index of each protein molecule comprises the steps of:

8. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the protein similarity between two protein molecules according to the protein diversity index difference and the local density difference between two protein molecules comprises the following steps:

In the formula, the terms denote, in order: the protein similarity between the x-th protein molecule and the g-th protein molecule in the protein set; the protein diversity indices of the x-th and g-th protein molecules; the local densities of the x-th and g-th protein molecules; and M, a proportionality constant.

9. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the clustering result of protein molecules based on the protein similarity by using a chameleon clustering algorithm is as follows:

10. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the protein diversity size sequence according to the clustering result of protein molecules and obtaining the protein classification result according to the protein diversity size sequence comprises the following steps: