CN117095754A - Method for classifying proteins by machine learning - Google Patents

Info

Publication number
CN117095754A
Authority
CN
China
Prior art keywords
protein
protein molecule
index
molecule
molecules
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311353697.6A
Other languages
Chinese (zh)
Other versions
CN117095754B (en)
Inventor
王曙蒙
常天安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Zhengda Tianchuang Biological Engineering Co ltd
Original Assignee
Jiangsu Zhengda Tianchuang Biological Engineering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Zhengda Tianchuang Biological Engineering Co ltd
Priority to CN202311353697.6A
Publication of CN117095754A
Application granted
Publication of CN117095754B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention relates to the technical field of protein data processing and provides a method for classifying proteins using machine learning. The method comprises: acquiring graph data of protein molecules; obtaining an amino acid structure gradient sequence from the graph data of each protein molecule; obtaining a structural complexity index and a similarity position change index from the amino acid structure gradient sequence; obtaining a protein molecule space and a protein diversity index from the structural complexity index and the similarity position change index; obtaining a local density from the protein molecule space; obtaining a protein similarity from the protein diversity index and the local density; obtaining clusters of protein molecules from the protein similarity using the Chameleon clustering algorithm; and obtaining a protein classification result from the clusters of protein molecules. By taking into account the structural variation within a protein and the positional variation of similar amino acids, the invention improves the Chameleon clustering algorithm and thereby the accuracy of protein classification.

Description

Method for classifying proteins by machine learning
Technical Field
The invention relates to the technical field of protein data processing, in particular to a method for classifying proteins by utilizing machine learning.
Background
A protein is a macromolecular compound in which amino acids are joined in a specific order into polypeptide chains and the chains are combined in a specific way; proteins are the main carriers of life activities. Graph data is nonlinear data composed of nodes and edges. Because protein molecules are built from amino acids joined by dehydration condensation, each protein molecule can be stored naturally as graph data.
Existing protein classification methods use a graph neural network (GNN) to learn the graph data of proteins and thereby assist protein classification. However, these methods give little consideration to embedding protein structures into a uniform dimension, which weakens the GNN's ability to learn the graph data and lowers the accuracy of protein classification.
Disclosure of Invention
The invention provides a method for classifying proteins by machine learning, which solves the problem of poor accuracy of protein classification, and adopts the following technical scheme:
one embodiment of the present invention provides a method for protein classification using machine learning, the method comprising the steps of:
acquiring graph data of protein molecules;
acquiring the amino acid structure gradient sequence of each protein molecule according to the graph data of each protein molecule; acquiring the structure complexity index of each protein molecule according to the amino acid structure gradient sequence of each protein molecule; obtaining the similarity position change index of each protein molecule according to the graph data and the amino acid structure gradient sequence of each protein molecule; acquiring protein molecule space according to the structure complexity index and the similarity position change index of all protein molecules; obtaining the local density of each protein molecule according to the protein molecule space;
acquiring a protein diversity index of each protein molecule according to the structure complexity index and the similarity position change index of each protein molecule; obtaining the protein similarity between two protein molecules according to the protein diversity index difference and the local density difference between the two protein molecules; obtaining a clustering result of protein molecules based on the protein similarity by adopting a chameleon clustering algorithm;
and obtaining a protein diversity size sequence according to the clustering result of the protein molecules, and obtaining a protein classification result according to the protein diversity size sequence.
Preferably, the method for obtaining the amino acid structure gradient sequence of each protein molecule according to the graph data of each protein molecule comprises the following steps:
the number of connecting edges of each node in the graph data of each protein molecule is taken as the structural complexity of each node in the graph data of each protein molecule, and a sequence formed by the structural complexity of all nodes in the graph data of each protein molecule according to the ascending order of numerical values is taken as the amino acid structure gradient sequence of each protein molecule.
Preferably, the method for obtaining the structural complexity index of each protein molecule according to the amino acid structure gradient sequence of each protein molecule comprises the following steps:
In the formula (whose image is not reproduced in this text), the quantities are: the structural complexity index of the x-th protein molecule in the protein set; the coefficient of variation of the values in the amino acid structure gradient sequence of the x-th protein molecule; the exponential function with the natural constant as its base; the number of values in the amino acid structure gradient sequence of the x-th protein molecule; the i-th and (i-1)-th values in that sequence; the maximum and minimum values in that sequence; and an error constant.
Preferably, the method for obtaining the similarity position change index of each protein molecule according to the map data of each protein molecule and the amino acid structure gradient sequence comprises the following steps:
In the formula (whose image is not reproduced in this text), the quantities are: the similarity position change index of the x-th protein molecule in the protein set; the Euclidean distance function; the position coordinates, on the graph data of the x-th protein molecule, of the nodes represented by the i-th and (i-1)-th values in the amino acid structure gradient sequence of the x-th protein molecule; and the number of values in that sequence.
Preferably, the method for obtaining the protein molecule space according to the structural complexity index and the similarity position change index of all protein molecules comprises the following steps:
normalizing the structure complexity index of all the protein molecules, taking the normalization processing result of the structure complexity index of each protein molecule as the normalization structure complexity index of each protein molecule, and taking the normalization structure complexity index of each protein molecule as the abscissa of each protein molecule under a two-dimensional coordinate system;
normalizing the similarity position change indexes of all protein molecules, taking the normalization processing result of the similarity position change indexes of each protein molecule as the normalization similarity position change index of each protein molecule, and taking the normalization similarity position change index of each protein molecule as the ordinate of each protein molecule under a two-dimensional coordinate system;
and obtaining the two-dimensional positioning coordinates of each protein molecule according to the abscissa and the ordinate of each protein molecule under the two-dimensional coordinate system, and taking a two-dimensional data space formed by the two-dimensional positioning coordinates of all protein molecules as a protein molecule space.
Preferably, the method for obtaining the local density of each protein molecule according to the protein molecule space comprises the following steps:
taking the two-dimensional coordinates of all protein molecules in the protein molecule space as the input of a DPC density peak value clustering algorithm, taking a preset cut-off distance as the parameter of the DPC density peak value clustering algorithm, and taking the output of the DPC density peak value clustering algorithm as the local density of each protein molecule.
Preferably, the method for obtaining the protein diversity index of each protein molecule according to the structural complexity index and the similarity position change index of each protein molecule comprises the following steps:
for each protein molecule, the product of the structural complexity index of the protein molecule and the similarity position change index is taken as the protein diversity index of the protein molecule.
Preferably, the method for obtaining the protein similarity between two protein molecules according to the protein diversity index difference and the local density difference between the two protein molecules comprises the following steps:
In the formula (whose image is not reproduced in this text), the quantities are: the protein similarity between the x-th and g-th protein molecules in the protein set; the protein diversity indices of the x-th and g-th protein molecules; the local densities of the x-th and g-th protein molecules; and M, a proportionality constant.
Preferably, the method for obtaining the clustering result of the protein molecules based on the protein similarity by adopting a chameleon clustering algorithm comprises the following steps:
taking all protein molecules in a protein molecule space as input of a chameleon clustering algorithm, taking protein similarity among the protein molecules as measurement distance among the protein molecules, and obtaining a clustering result of the protein molecules based on the measurement distance by adopting the chameleon clustering algorithm.
Preferably, the method for obtaining the protein diversity size sequence according to the clustering result of the protein molecules and obtaining the protein classification result according to the protein diversity size sequence comprises the following steps:
calculating the protein diversity index average value of all protein molecules in each cluster in the protein molecule clustering result, and taking a sequence formed by the protein diversity index average values of all the clusters in the protein molecule clustering result according to the ascending order of the numerical values as a protein diversity size sequence;
taking the cluster represented by each element in the protein diversity size sequence as a protein class, and obtaining a protein classification result according to all protein classes.
The beneficial effects of the invention are as follows: according to the invention, an amino acid structure gradient sequence is obtained by using graph data of protein molecules, then, a structure complexity index and a similarity position change index of each protein molecule are obtained according to the amino acid structure gradient sequence and the graph data of the protein molecules, further, the local density of the protein molecules is obtained by using a DPC density peak clustering algorithm, a protein diversity index is obtained based on the structure complexity index and the similarity position change index, the measurement distance in a chameleon clustering algorithm is improved based on the local density and the protein diversity index, and further, the clustering result of the protein molecules in a protein set is obtained by using the chameleon clustering algorithm. The method has the advantages that the measurement distance in the chameleon clustering algorithm is improved by considering the structural change inside the protein and the similarity position change between amino acids, so that the characteristic distance between protein molecules is more accurate, and the precision of protein classification is improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained according to these drawings without inventive faculty for a person skilled in the art.
FIG. 1 is a flow chart of a method for classifying proteins using machine learning according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for classifying proteins using machine learning according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, a flowchart of a method for classifying proteins using machine learning according to one embodiment of the present invention is shown, the method comprising the steps of:
Step S001, obtaining the graph data of the protein molecules.
In the invention, the graph data of protein molecules is first extracted, and protein classification is then completed by a machine learning algorithm based on that graph data. A plurality of protein molecules is collected from an existing protein database; the number of protein molecules collected is n, with an empirical value of 500. All collected protein molecules are taken as a protein set, and each protein molecule in the set is converted into graph data. The graph data of a protein molecule is nonlinear data in which nodes represent the amino acids of the protein and edges represent the chemical bonds between amino acids. Conversion to graph data is a known technique and is not described further here.
Thus, the graph data of all protein molecules in the protein set is obtained.
Step S002, obtaining amino acid structure gradient sequence according to the graph data of protein molecules, obtaining structure complexity index and similarity position change index according to the graph data of protein molecules and the amino acid structure gradient sequence, obtaining protein molecule space according to the structure complexity index and similarity position change index, and obtaining the local density of each protein molecule by using a density peak clustering algorithm.
To obtain a better clustering result, the basic information of the protein molecules is analyzed on the basis of the graph data of all protein molecules in the protein set. For each protein molecule, each node in its graph data represents an amino acid and each edge represents a chemical bond between amino acids. Different amino acids have different constituent elements and hence different structures; an amino acid has at least one amino group and one carboxyl group, which form chemical bonds through dehydration condensation. The number of edges connected to each node therefore reflects, to some extent, the structural complexity of the amino acid. The invention obtains the number of connected edges of each node in the graph data of each protein molecule and takes it as the structural complexity of the corresponding amino acid. A flowchart of an embodiment of the invention is shown in FIG. 2.
Based on this analysis, the structural complexity of each node in the graph data of each protein molecule, that is, the structural complexity of each amino acid, is obtained. The sequence formed by the structural complexities of all nodes in the graph data of a protein molecule, arranged in ascending order, is taken as the amino acid structure gradient sequence of that protein molecule. Because the sequence is ascending and each value reflects the structural complexity of one amino acid, the sequence reflects the structural characteristics of the protein molecule to some extent: the larger the differences between its values, the more complex the structure of the protein molecule.
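The construction of the amino acid structure gradient sequence can be sketched in a few lines of Python. This is a minimal illustration, assuming the graph data is available as a node count plus an undirected edge list; the function name `gradient_sequence` and the toy graph are hypothetical and do not come from the patent.

```python
def gradient_sequence(num_nodes, edges):
    """Return the node degrees of an undirected graph, sorted ascending.

    Each node stands for an amino acid and each edge for a chemical bond,
    so the degree is the "structural complexity" of the node.
    """
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(degree)


# Toy graph (assumed data): a chain 0-1-2-3 plus one extra bond 0-2.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
seq = gradient_sequence(4, edges)
print(seq)
```

Sorting discards which node produced which value; a step that needs the node identities (such as the position-based index described later) has to keep the sort order of the node indices as well.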
Calculating the structural complexity index of each protein molecule in the protein collection:
In the formula (whose image is not reproduced in this text), the quantities are: the structural complexity index of the x-th protein molecule in the protein set; the coefficient of variation of the values in the amino acid structure gradient sequence of the x-th protein molecule; the exponential function with the natural constant as its base; the number of values in the amino acid structure gradient sequence of the x-th protein molecule; the i-th and (i-1)-th values in that sequence; the maximum and minimum values in that sequence; and an error constant that keeps the denominator from being 0, with an empirical value of 0.01.
The larger the coefficient of variation of the values in the amino acid structure gradient sequence of the x-th protein molecule, the larger the difference between the i-th and (i-1)-th values in that sequence, and the larger the difference between the maximum and minimum values in that sequence, the more the amino acids of the protein molecule differ in structural complexity and the more distinct amino acids it contains. The structure of the protein molecule is then more complex, and its structural complexity index is larger.
Further, amino acid similarity is combined with Euclidean distance: the farther apart highly similar amino acids lie, the larger the positional change between neighboring entries of the amino acid structure gradient sequence when they are located on the graph data of the protein molecule, and the better this change reflects the specific characteristics of that graph data.
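The quantities that enter the structural complexity index can be computed as follows. This sketch only produces the ingredients named in the text (coefficient of variation, number of values, consecutive differences, range); it does not combine them, because the combining formula is not recoverable from this text. The function name and the use of the population standard deviation are assumptions.

```python
import statistics


def complexity_ingredients(seq):
    """Compute the quantities named for the structural complexity index.

    `seq` is an ascending amino acid structure gradient sequence.
    Assumption: the coefficient of variation uses the population
    standard deviation; the source does not say which variant is meant.
    """
    mean = statistics.fmean(seq)
    cv = statistics.pstdev(seq) / mean if mean else 0.0
    diffs = [seq[i] - seq[i - 1] for i in range(1, len(seq))]
    return {
        "cv": cv,                        # coefficient of variation
        "n": len(seq),                   # number of values
        "diffs": diffs,                  # i-th minus (i-1)-th value
        "range": max(seq) - min(seq),    # maximum minus minimum
    }


parts = complexity_ingredients([1, 2, 2, 3])
print(parts)
```

Because the sequence is ascending, the consecutive differences are all non-negative and sum to the range, so these two ingredients are not independent of each other.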
Calculating the similarity position change index of each protein molecule in the protein collection:
In the formula (whose image is not reproduced in this text), the quantities are: the similarity position change index of the x-th protein molecule in the protein set; the Euclidean distance function; the position coordinates, on the graph data of the x-th protein molecule, of the nodes represented by the i-th and (i-1)-th values in the amino acid structure gradient sequence of the x-th protein molecule; and the number of values in that sequence.
The i-th and (i-1)-th values are adjacent in the amino acid structure gradient sequence, so they are close in value, which indicates that the two corresponding amino acids are chemically similar to some extent. The larger the Euclidean distance between the position coordinates, on the graph data of the x-th protein molecule, of the nodes represented by these two values, the larger the similarity position change index of the protein molecule.
Further, to obtain a more accurate protein classification result, the structural complexity indices and the similarity position change indices of the protein molecules in the protein set are each normalized. Normalization is a known technique and is not described further here.
Specifically, the normalized structural complexity index and the normalized similarity position change index of the x-th protein molecule are taken as its abscissa and ordinate, respectively, and together give the two-dimensional positioning coordinates of the x-th protein molecule. The two-dimensional positioning coordinates of all protein molecules in the protein set are mapped into a two-dimensional coordinate system to obtain a two-dimensional data space, which is taken as the protein molecule space. The two-dimensional coordinates of all protein molecules in the protein molecule space are used as the input of the DPC (density peak clustering) algorithm, with a preset cutoff distance as its input parameter; the empirical value of the preset cutoff distance is 10. The output of the DPC algorithm is taken as the local density of each protein molecule. The DPC algorithm is a known technique and is not described further here.
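One way to read the similarity position change index is sketched below: nodes are ordered by ascending degree, and the Euclidean distances between the positions of consecutive nodes in that order are accumulated. Several points are assumptions rather than statements from the patent: that each node has 2D position coordinates, that ties in degree are broken by node index, and that the distances are summed (the source formula may instead average them or scale by the number of values).

```python
import math


def position_change_index(degrees, coords):
    """Sum the distances between nodes that are neighbors in degree order.

    degrees[i] is the number of edges at node i; coords[i] is the
    (assumed) position of node i on the graph data.
    """
    # Stable sort: nodes with equal degree keep their index order.
    order = sorted(range(len(degrees)), key=lambda i: degrees[i])
    return sum(
        math.dist(coords[order[i]], coords[order[i - 1]])
        for i in range(1, len(order))
    )


# Toy data (assumed): four nodes with degrees 2, 2, 3, 1.
degrees = [2, 2, 3, 1]
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
index = position_change_index(degrees, coords)
print(index)
```

In this toy case the degree order is node 3, node 0, node 1, node 2, so the result is the sum of three hops across the layout.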
To this end, the local density of each protein molecule is obtained.
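The protein molecule space and the local density can be sketched as follows. The sketch assumes min-max normalization (the text does not name the normalization method) and the cutoff-kernel definition of local density from density peak clustering, in which the density of a point is the number of other points closer than the cutoff distance. All function names and sample values are hypothetical.

```python
import math


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def local_density(points, cutoff):
    """Cutoff-kernel density: count the other points within `cutoff`."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) < cutoff)
        for i, p in enumerate(points)
    ]


# Assumed index values for four protein molecules.
complexity = [0.2, 0.4, 0.5, 3.0]
position = [1.0, 1.2, 1.1, 9.0]

# Abscissa: normalized complexity index; ordinate: normalized position index.
points = list(zip(min_max(complexity), min_max(position)))
density = local_density(points, cutoff=0.3)
print(density)
```

If min-max normalization to [0, 1] is indeed what is meant, every pairwise distance is at most the square root of 2, so the stated cutoff of 10 would give every molecule the same density. The sketch therefore uses a smaller, hypothetical cutoff of 0.3.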
Step S003, obtaining a protein diversity index based on the structure complexity index and the similarity position change index, improving the measurement distance in the chameleon clustering algorithm based on the local density and the protein diversity index, and further obtaining the clustering result of protein molecules in the protein set by using the chameleon clustering algorithm.
Generally, the structure of a protein is complex, protein molecules have diversity, and the further the distance between amino acids with higher similarity in the structure of the protein molecules is, the more complex the structure is, which is a representation of the diversity of the protein molecules to a certain extent.
Thus, the protein diversity index is calculated for each protein molecule in the protein pool:
$$D_x = S_x \times P_x$$
In the formula, $D_x$ is the protein diversity index of the x-th protein molecule in the protein set, $S_x$ is the structural complexity index of the x-th protein molecule, and $P_x$ is the similarity position change index of the x-th protein molecule.
The larger the structural complexity index of a protein molecule and, at the same time, the larger its similarity position change index, the greater the diversity of the protein molecule and the larger its protein diversity index.
The Chameleon clustering algorithm is a hierarchical clustering algorithm that can achieve fast, high-quality clustering. However, when the neighbor graph is built, the distance metric is usually the Euclidean distance, which considers only the positions of the data points and ignores the relationships between their attributes, so the resulting clustering is poor. Therefore, to obtain a better clustering result, the invention also considers the relationships between the attributes of the data points.
Protein similarity between every two protein molecules was calculated:
In the formula (whose image is not reproduced in this text), the quantities are: the protein similarity between the x-th and g-th protein molecules in the protein set; the protein diversity indices of the x-th and g-th protein molecules; the local densities of the x-th and g-th protein molecules; and M, a proportionality constant with an empirical value of 2.
Thus, the protein similarity between protein molecules is determined from the difference between their protein diversity indices and the difference between their local densities. The larger the difference between the normalized protein diversity indices of the x-th and g-th protein molecules, and the larger the difference between their normalized local densities, the larger the protein similarity between the x-th and g-th protein molecules. Because this quantity serves as the metric distance in the clustering step, it grows as the two molecules become less alike.
The protein similarity between protein molecules is used as the metric distance between them, and all protein molecules in the protein molecule space are used as the input of the Chameleon clustering algorithm. The value of k in the k-nearest-neighbor graph is set to 20, the weight used during clustering is 1.5, and the metric threshold is 5. The output of the Chameleon clustering algorithm gives all clusters in the clustering result of the protein molecules. The Chameleon clustering algorithm is a known technique and is not described further here.
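The point at which the custom metric distance enters Chameleon is the construction of the k-nearest-neighbor graph. The sketch below shows only that first phase, assuming a precomputed symmetric matrix of pairwise protein similarity values; the later graph-partitioning and cluster-merging phases of Chameleon are not shown. The function name and the toy matrix are hypothetical.

```python
def knn_graph(dist, k):
    """Map each item to its k nearest neighbors under a precomputed metric.

    dist[i][j] is the metric distance between items i and j (here the
    protein similarity value). Ties are broken by index order.
    """
    n = len(dist)
    graph = {}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        graph[i] = others[:k]
    return graph


# Toy matrix (assumed): items 0 and 1 are close, items 2 and 3 are close.
dist = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [1.0, 0.9, 0.2, 0.0],
]
graph = knn_graph(dist, k=1)
print(graph)
```

With the stated setting of k = 20 and 500 protein molecules, each molecule would be linked to the 20 molecules with the smallest metric distance to it.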
Step S004, obtaining a protein classification result according to the clustering result of the protein molecules.
According to all the clustering clusters in the obtained clustering result of the protein molecules, calculating protein diversity index average values of all the protein molecules in each clustering cluster, taking a sequence formed by the protein diversity index average values of all the clustering clusters according to the ascending order of numerical values as a protein diversity size sequence, and taking a class formed by the protein molecules in a first clustering cluster in the protein diversity size sequence as a first protein class; taking the class consisting of protein molecules in a second cluster in the protein diversity size sequence as a second protein class; and so on.
Therefore, the distribution of protein classes is completed according to the protein diversity index average, and the first protein class represents the class with the smallest protein diversity index average, which indicates that the protein molecules contained in the first protein class have lower diversity; and the last protein class represents the class with the largest average value of the protein diversity index, which indicates that the protein molecules contained in the last protein class have higher diversity.
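The assignment of class numbers to clusters can be sketched as below: the mean protein diversity index is computed per cluster, the clusters are sorted by that mean in ascending order, and the position in that order becomes the class number. The function name, the cluster labels, and the sample values are hypothetical.

```python
from collections import defaultdict


def rank_clusters(labels, diversity):
    """Map each cluster label to a class number (1 = lowest mean diversity)."""
    groups = defaultdict(list)
    for label, value in zip(labels, diversity):
        groups[label].append(value)
    means = {label: sum(vals) / len(vals) for label, vals in groups.items()}
    ordered = sorted(means, key=means.get)  # ascending mean diversity index
    return {label: rank + 1 for rank, label in enumerate(ordered)}


# Assumed clustering output and diversity indices for five molecules.
labels = ["a", "b", "a", "c", "b"]
diversity = [5.0, 1.0, 7.0, 3.0, 2.0]
classes = rank_clusters(labels, diversity)
print(classes)
```

Here cluster "b" has the smallest mean and becomes the first protein class, while cluster "a" has the largest mean and becomes the last.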
Thus, a method for protein classification using machine learning was completed.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the invention, but any modifications, equivalent substitutions, improvements, etc. within the principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for classifying proteins using machine learning, the method comprising the steps of:
acquiring graph data of protein molecules;
acquiring the amino acid structure gradient sequence of each protein molecule according to the graph data of each protein molecule; acquiring the structure complexity index of each protein molecule according to the amino acid structure gradient sequence of each protein molecule; obtaining the similarity position change index of each protein molecule according to the graph data and the amino acid structure gradient sequence of each protein molecule; acquiring protein molecule space according to the structure complexity index and the similarity position change index of all protein molecules; obtaining the local density of each protein molecule according to the protein molecule space;
acquiring a protein diversity index of each protein molecule according to the structure complexity index and the similarity position change index of each protein molecule; obtaining the protein similarity between two protein molecules according to the protein diversity index difference and the local density difference between the two protein molecules; obtaining a clustering result of protein molecules based on the protein similarity by adopting a chameleon clustering algorithm;
and obtaining a protein diversity size sequence according to the clustering result of the protein molecules, and obtaining a protein classification result according to the protein diversity size sequence.
2. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the amino acid structure gradient sequence of each protein molecule from the graph data of each protein molecule comprises the steps of:
the number of connecting edges of each node in the graph data of each protein molecule is taken as the structural complexity of each node in the graph data of each protein molecule, and a sequence formed by the structural complexity of all nodes in the graph data of each protein molecule according to the ascending order of numerical values is taken as the amino acid structure gradient sequence of each protein molecule.
3. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the structural complexity index of each protein molecule based on the amino acid structure gradient sequence of each protein molecule comprises the steps of:
in the formula, the quantities are, in order: the structural complexity index of the x-th protein molecule in the protein collection; the coefficient of variation of the values within the amino acid structure gradient sequence of the x-th protein molecule in the protein collection; an exponential function with the natural constant as its base; the number of values within the amino acid structure gradient sequence of the x-th protein molecule in the protein collection; the values of the i-th and (i-1)-th data in the amino acid structure gradient sequence of the x-th protein molecule in the protein collection; the maximum and minimum of the values in the amino acid structure gradient sequence of the x-th protein molecule in the protein collection; and an error constant.
4. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the similarity position change index of each protein molecule from the graph data and the amino acid structure gradient sequence of each protein molecule comprises the steps of:
in the formula, the quantities are, in order: the similarity position change index of the x-th protein molecule in the protein collection; the Euclidean distance function; the position coordinates, on the graph data of the x-th protein molecule, of the nodes represented by the i-th and (i-1)-th data in the amino acid structure gradient sequence of the x-th protein molecule in the protein collection; and the number of values in the amino acid structure gradient sequence of the x-th protein molecule in the protein collection.
5. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining protein molecule space according to the structural complexity index and the similarity position variation index of all protein molecules comprises the steps of:
normalizing the structure complexity index of all the protein molecules, taking the normalization processing result of the structure complexity index of each protein molecule as the normalization structure complexity index of each protein molecule, and taking the normalization structure complexity index of each protein molecule as the abscissa of each protein molecule under a two-dimensional coordinate system;
normalizing the similarity position change indexes of all protein molecules, taking the normalization processing result of the similarity position change indexes of each protein molecule as the normalization similarity position change index of each protein molecule, and taking the normalization similarity position change index of each protein molecule as the ordinate of each protein molecule under a two-dimensional coordinate system;
and obtaining the two-dimensional positioning coordinates of each protein molecule according to the abscissa and the ordinate of each protein molecule under the two-dimensional coordinate system, and taking a two-dimensional data space formed by the two-dimensional positioning coordinates of all protein molecules as a protein molecule space.
6. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the local density of each protein molecule from the protein molecule space is as follows:
taking the two-dimensional coordinates of all protein molecules in the protein molecule space as the input of the DPC (density peak clustering) algorithm, taking a preset cut-off distance as the parameter of the DPC algorithm, and taking the output of the DPC algorithm as the local density of each protein molecule.
7. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the protein diversity index of each protein molecule according to the structure complexity index and the similarity position variation index of each protein molecule comprises the steps of:
for each protein molecule, the product of the structural complexity index of the protein molecule and the similarity position change index is taken as the protein diversity index of the protein molecule.
8. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the protein similarity between two protein molecules according to the protein diversity index difference and the local density difference between two protein molecules comprises the following steps:
in the formula, the quantities are, in order: the protein similarity between the x-th protein molecule and the g-th protein molecule in the protein collection; the protein diversity indices of the x-th and g-th protein molecules in the protein collection; the local densities of the x-th and g-th protein molecules in the protein collection; and M, which represents the proportionality constant.
9. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the clustering result of protein molecules based on the protein similarity by using a chameleon clustering algorithm is as follows:
taking all protein molecules in a protein molecule space as input of a chameleon clustering algorithm, taking protein similarity among the protein molecules as measurement distance among the protein molecules, and obtaining a clustering result of the protein molecules based on the measurement distance by adopting the chameleon clustering algorithm.
10. The method for classifying proteins by machine learning according to claim 1, wherein the method for obtaining the protein diversity size sequence according to the clustering result of protein molecules and obtaining the protein classification result according to the protein diversity size sequence comprises the following steps:
calculating the protein diversity index average value of all protein molecules in each cluster in the protein molecule clustering result, and taking a sequence formed by the protein diversity index average values of all the clusters in the protein molecule clustering result according to the ascending order of the numerical values as a protein diversity size sequence;
taking the cluster represented by each element in the protein diversity size sequence as a protein class, and obtaining a protein classification result according to all protein classes.
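The steps of claims 2, 5, 6 and 7 that do not depend on the claimed formulas can be sketched as below. This is a hedged illustration under stated assumptions rather than the claimed implementation: all function names are hypothetical, min-max scaling is assumed for the unspecified normalization, and the local density is assumed to be the cut-off-kernel density of density peak clustering (the number of other points closer than the cut-off distance). The structural complexity index and the similarity position change index are treated as given inputs.

```python
import numpy as np


def degree_sequence(edges, n_nodes):
    """Claim 2: number of connecting edges per node, in ascending order."""
    degree = np.zeros(n_nodes, dtype=int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return np.sort(degree)


def min_max_normalize(values):
    """Claim 5 (assumed min-max scaling): map values onto [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)  # all values equal
    return (values - values.min()) / span


def local_density(coords, cutoff):
    """Claim 6 (assumed cut-off kernel): neighbors within the cut-off."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist < cutoff).sum(axis=1) - 1  # subtract the point itself


def diversity_index(complexity, position_change):
    """Claim 7: product of the two indices for each protein molecule."""
    return np.asarray(complexity, dtype=float) * np.asarray(position_change, dtype=float)
```

In this sketch the protein molecule space of claim 5 would be the pairs (min_max_normalize(complexity), min_max_normalize(position_change)), which are then passed to local_density together with the preset cut-off distance.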
CN202311353697.6A 2023-10-19 2023-10-19 Method for classifying proteins by machine learning Active CN117095754B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311353697.6A CN117095754B (en) 2023-10-19 2023-10-19 Method for classifying proteins by machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311353697.6A CN117095754B (en) 2023-10-19 2023-10-19 Method for classifying proteins by machine learning

Publications (2)

Publication Number Publication Date
CN117095754A true CN117095754A (en) 2023-11-21
CN117095754B CN117095754B (en) 2023-12-29

Family

ID=88773885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311353697.6A Active CN117095754B (en) 2023-10-19 2023-10-19 Method for classifying proteins by machine learning

Country Status (1)

Country Link
CN (1) CN117095754B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040185486A1 (en) * 2003-02-27 2004-09-23 The Regents Of The University Of California Local-global alignment for finding 3D similarities in protein structures
CN106960134A (en) * 2017-03-23 2017-07-18 江南大学 A kind of S FCM algorithms clustered suitable for xylanase amino acid interactive network
CN111128301A (en) * 2019-12-06 2020-05-08 北部湾大学 Overlapped protein compound identification method based on fuzzy clustering
US11256995B1 (en) * 2020-12-16 2022-02-22 Ro5 Inc. System and method for prediction of protein-ligand bioactivity using point-cloud machine learning
CN114974398A (en) * 2021-02-23 2022-08-30 腾讯科技(深圳)有限公司 Information processing method and device and computer readable storage medium
CN116741265A (en) * 2023-06-14 2023-09-12 中南大学 Machine learning-based nanopore protein sequencing data processing method and application thereof

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117352043A (en) * 2023-12-06 2024-01-05 江苏正大天创生物工程有限公司 Protein design method and system based on neural network
CN117352043B (en) * 2023-12-06 2024-03-05 江苏正大天创生物工程有限公司 Protein design method and system based on neural network

Also Published As

Publication number Publication date
CN117095754B (en) 2023-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant