WO2024016389A1 - Method, device, system and storage medium for identifying ubiquitination sites - Google Patents

Method, device, system and storage medium for identifying ubiquitination sites

Info

Publication number
WO2024016389A1
Authority
WO
WIPO (PCT)
Prior art keywords
lysine
information
training
protein
feature
Prior art date
Application number
PCT/CN2022/110318
Other languages
English (en)
French (fr)
Inventor
李坚强
陈杰
陈廷柏
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学
Publication of WO2024016389A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding

Definitions

  • the present application relates to the field of ubiquitination technology, and in particular to a method, device, system and storage medium for identifying ubiquitination sites.
  • Ubiquitination is a common post-translational protein modification in eukaryotic cells. It refers to the attachment of ubiquitin molecules to lysine residues of target protein molecules under the sequential action of ubiquitin-activating enzymes, ubiquitin-conjugating enzymes and ubiquitin ligases. Ubiquitination plays an important role in protein localization, metabolism, cell division, gene transcription, and DNA repair, so the accurate identification of ubiquitination sites is particularly important.
  • the existing method for identifying ubiquitination sites is the DeepUbi model, which consists of a word2vec model and a convolutional neural network. It learns embedded features from a protein fragment of length 31 centered on the site to be tested to infer whether that site can be ubiquitinated.
  • the above method considers only the sequence characteristics of the protein, which limits the identification accuracy of ubiquitination sites.
  • this application provides a method, device, system and storage medium for identifying ubiquitination sites to solve the problem of low identification accuracy in the prior art.
  • this application proposes a method, device, system and storage medium for identifying ubiquitination sites.
  • this application proposes a method for identifying ubiquitination sites, including:
  • the spatial structure feature information is processed based on the trained convolution model to obtain the lysine feature information of the lysine node; the convolution model is trained through the protein training set;
  • the corresponding lysine node is a ubiquitination site.
  • the step of training the convolutional model through the protein training set includes:
  • the protein training set contains at least a set of protein sample information and lysine sample information; calculate the lysine training information by using the protein sample information as an input parameter of the convolution model;
  • the loss value is calculated based on the weight parameters, the training feature parameters and the preset weighted loss function model;
  • the loss value is iteratively calculated; when the training of the convolution model is completed, the iterative calculation of the loss value is stopped.
  • the step of calculating the lysine training information by using the protein sample information as an input parameter of the convolution model includes:
  • the protein sample information is calculated through the first convolution layer to obtain a first feature matrix;
  • the protein sample information includes a training adjacency matrix and a training feature matrix;
  • the training adjacency matrix and the first feature matrix are calculated through the second convolution layer to obtain a second feature matrix
  • the second feature matrix is calculated through the self-attention mechanism layer to obtain a third feature matrix
  • the training adjacency matrix and the third feature matrix are calculated through the third convolution layer to obtain protein training information; the lysine training information is filtered out from the protein training information.
  • the step of calculating weight parameters and training feature parameters based on the lysine training information and the lysine sample information includes:
  • a first two-dimensional matrix in the training feature parameters is constructed based on the number of samples and the number of first lysine;
  • a second two-dimensional matrix in the training feature parameters is constructed based on the number of samples and the number of second lysines
  • the total amount of lysine in the weight parameter is obtained by summing the first lysine quantity and the second lysine quantity;
  • the number of ubiquitinable lysines in the lysine training information and the lysine sample information is calculated to obtain the ubiquitinable lysine in the weight parameter.
  • the step of extracting spatial structure feature information from the three-dimensional structure information includes:
  • when the distance information is less than the preset distance threshold, it is determined that the corresponding two amino acids are in a connection relationship, so as to generate the spatial structure feature information.
  • the method further includes:
  • the lysine node is arranged in front of the spatial structure feature information.
  • the method further includes:
  • the distance threshold corresponding to the highest ubiquitination accuracy rate is extracted to optimize the distance threshold.
  • this application proposes a recognition system for ubiquitination sites, including:
  • An extraction module used to extract spatial structure feature information from the three-dimensional structure information
  • a processing module configured to process the spatial structure feature information based on the trained convolution model to obtain the lysine feature information of the lysine node; the convolution model is trained through a protein training set;
  • a determination module configured to determine that the corresponding lysine node is a ubiquitination site if the lysine characteristic information matches the preset classification conditions.
  • the system further includes a training module for obtaining the protein training set;
  • the protein training set includes at least a set of protein sample information and lysine sample information;
  • a calculation module used to calculate the protein sample information as an input parameter of the convolution model to obtain lysine training information
  • a parameter module used to calculate weight parameters and training feature parameters based on the lysine training information and the lysine sample information
  • a loss value module used to calculate a loss value based on the weight parameters, the training feature parameters and a preset weighted loss function model
  • the judgment module is used to judge whether the training is completed based on the loss value and the preset training conditions; the judgment module is also used to iteratively calculate the loss value when the training of the convolution model is not completed, and to stop iteratively calculating the loss value when the training of the convolution model is completed.
  • the calculation module includes a first convolution layer unit, used to calculate the protein sample information through the first convolution layer to obtain a first feature matrix;
  • the protein sample information includes a training adjacency matrix and training feature matrix;
  • a second convolution layer unit is used to calculate the training adjacency matrix and the first feature matrix through the second convolution layer to obtain a second feature matrix
  • the self-attention mechanism layer unit is used to calculate the second feature matrix through the self-attention mechanism layer to obtain the third feature matrix
  • the third convolution layer unit is used to calculate the training adjacency matrix and the third feature matrix through the third convolution layer to obtain protein training information;
  • a screening unit is used to screen out the lysine training information from the protein training information.
  • the parameter module includes a sample unit for counting the number of lysine training information to obtain the number of samples;
  • the first lysine unit is used to count the lysines in each of the lysine training information to obtain the first lysine quantity;
  • the second lysine unit is used to count the lysines in each of the lysine sample information to obtain the number of second lysine;
  • a first matrix unit configured to construct a first two-dimensional matrix in the training feature parameters based on the number of samples and the number of first lysines;
  • a second matrix unit configured to construct a second two-dimensional matrix in the training feature parameters based on the number of samples and the number of second lysines;
  • a total amount unit used to sum the first lysine amount and the second lysine amount to obtain the total amount of lysine in the weight parameter
  • a statistics unit configured to count the number of ubiquitinable lysines in the lysine training information and the lysine sample information according to the lysine training information and a preset score threshold, and obtain the total number of ubiquitinable lysines and the total number of non-ubiquitinable lysines in the weight parameters.
  • the extraction module includes an identification unit for identifying the central carbon atom of each amino acid in the three-dimensional structure information based on a preset central carbon atom identification;
  • a position unit used to extract position information corresponding to each of the central carbon atoms from the three-dimensional structure information
  • a distance unit used to calculate distance information between each of the amino acids based on the position information
  • a generating unit configured to determine that the corresponding two amino acids are connected when the distance information is less than a preset distance threshold, so as to generate the spatial structure feature information.
  • the extraction module further includes a node unit for identifying the lysine node in the spatial structure feature information according to the lysine identity;
  • a configuration unit configured to configure the lysine node in front of the spatial structure feature information.
  • the system further includes a condition module for obtaining the non-ubiquitination accuracy rate and the optimization range of the distance threshold;
  • an accuracy rate module used to select the distance threshold from the optimization range and, in combination with the non-ubiquitination accuracy rate, use Bayesian optimization to iteratively calculate the ubiquitination accuracy rate;
  • an optimization module configured to extract the distance threshold corresponding to the highest ubiquitination accuracy rate after the preset iteration conditions are met, so as to optimize the distance threshold.
  • this application proposes a device for identifying ubiquitination sites, which includes a memory and a processor.
  • the memory stores a program of the method for identifying ubiquitination sites.
  • when executing the program, the processor performs the method for identifying ubiquitination sites described above.
  • the present application proposes a storage medium that stores a computer program that can be loaded by a processor and execute the above-mentioned method.
  • the three-dimensional structure information of the protein is used to obtain the spatial structure feature information, and then the convolution model is used to obtain the lysine feature information, and then it is judged whether the corresponding lysine node can be ubiquitinated. Since the three-dimensional structure of the protein is considered instead of using the two-dimensional sequence of the protein, the identification accuracy of ubiquitination sites is improved, and the quality of identification of ubiquitination sites is also improved.
  • the convolutional model is trained with a protein training set instead of a training set with equal proportions of ubiquitinable lysine and non-ubiquitinable lysine, which improves the training effect of the convolutional model and the accuracy of its calculation results.
  • Figure 1 is a flow chart of a method for identifying ubiquitination sites in one embodiment.
  • Figure 2 is a schematic diagram of visualizing spatial structure feature information in one embodiment.
  • Figure 3 is a flow chart of training a convolutional model in a method for identifying ubiquitination sites in one embodiment.
  • Figure 4 is a flow chart for calculating lysine training information in a method for identifying ubiquitination sites in one embodiment.
  • Figure 5 is a flow chart of the implementation principle of a method for identifying ubiquitination sites in one embodiment.
  • Figure 6 is a structural block diagram of a ubiquitination site recognition system in one embodiment.
  • Figure 7 is a schematic structural diagram of a ubiquitination site recognition device in one embodiment.
  • the embodiment of the present application discloses a method for identifying ubiquitination sites, as shown in Figure 1, including:
  • Protein is a substance with a certain spatial structure formed by twisting and folding of a polypeptide chain composed of amino acids through "dehydration condensation". Amino acids are the basic building blocks of proteins. A protein contains multiple amino acids; amino acids are divided into multiple types. Lysine is a type of amino acid.
  • the three-dimensional structural information is the three-dimensional structure of the protein. Since not all three-dimensional structural information of the protein is known, in one embodiment, the step of obtaining the three-dimensional structural information of the protein includes:
  • the protein three-dimensional structure prediction program includes but is not limited to AlphaFold2.
  • PDB is a file format used in bioinformatics to store three-dimensional protein structures.
  • the current execution subject can directly receive the transmitted three-dimensional structure information, or actively retrieve the three-dimensional structure information of the protein under the preset storage path.
  • the file format for storing three-dimensional structure information is PDB.
  • the spatial structure feature information includes the three-dimensional characteristics of the protein. Compared with the use of protein sequence information in the existing technology, the process of identifying ubiquitination sites is more comprehensive, which helps to improve the identification accuracy.
  • S103. Process the spatial structure feature information based on the trained convolution model to obtain the lysine feature information of the lysine node.
  • the spatial structure feature information is used as the input parameter of the trained convolution model to obtain the lysine feature information of the lysine node.
  • the input parameters of the trained convolution model are spatial structure feature information, not imaged spatial structure feature information.
  • Figure 2 is only a schematic diagram for ease of understanding.
  • the convolutional model is trained on a protein training set.
  • the protein training set uses the entire protein as a training sample to train the convolutional model, so that the input of the convolution model is closer to the real protein situation, which improves the quality of the trained convolution model and the authenticity of the calculation results.
  • S104 if the lysine characteristic information matches the preset classification conditions, the corresponding lysine node is a ubiquitination site.
  • Each lysine feature information corresponds to a lysine node, and each lysine node corresponds to a lysine in the protein.
  • if the lysine characteristic information matches the classification conditions, the corresponding lysine in the protein is judged able to undergo ubiquitination, and the lysine is thereby identified as a ubiquitination site.
  • the spatial structure feature information, which carries the three-dimensional characteristics of the protein, is obtained first; the convolution model trained on the protein training set is then used to obtain the lysine feature information and identify the ubiquitination sites in the protein. Because more factors are taken into account, the obtained lysine feature information is more accurate, which improves the identification accuracy of protein ubiquitination sites.
  • the classification condition is a value between 0 and 1. Specifically, in one application scenario, the classification condition is whether the value contained in the lysine characteristic information is greater than 0.6. If it is greater than 0.6, the corresponding lysine is determined to be a ubiquitination site; otherwise, it is a non-ubiquitination site.
  • the classification conditions are formulated based on lysine characteristic information.
  • the lysine characteristics can be obtained from the lysine characteristics information, and then the corresponding lysines are divided into two categories according to the lysine characteristics, one is ubiquitinable and the other is non-ubiquitinable.
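  • The classification step can be sketched in a few lines. This is a minimal illustration, assuming each lysine node yields one score between 0 and 1 and using the 0.6 threshold from the application scenario above; the function name `classify_lysines` and the example scores are hypothetical.

```python
def classify_lysines(scores, threshold=0.6):
    """Return True for each lysine whose score exceeds the threshold.

    True marks a predicted ubiquitination site, False a non-ubiquitination
    site. The threshold of 0.6 follows the application scenario in the text.
    """
    return [score > threshold for score in scores]


# Illustrative scores only; a score exactly equal to 0.6 is not a site.
labels = classify_lysines([0.91, 0.60, 0.12, 0.75])
```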
  • the steps of training a convolutional model through a protein training set include:
  • the protein training set contains at least one set of protein sample information and lysine sample information.
  • each batch contains multiple protein training sets; each protein training set contains multiple sets of protein sample information and lysine sample information.
  • the proteins used to train the convolution model are called sample proteins.
  • the amount of lysine contained in the sample protein, as well as the lysine available for ubiquitination, is known.
  • the protein sample information refers to the spatial structure feature information obtained by using the sample protein; the lysine sample information refers to the matrix information constructed based on the number of lysine in the sample protein.
  • the lysine training information refers to the number of lysine in the sample protein calculated using the convolution model, and the matrix information constructed based on the number of lysine.
  • the lysine sample information is a matrix information constructed based on the actual number of lysine in the sample protein; the lysine training information is constructed by calculating the number of lysine in the sample protein through the convolution model. matrix information. That is, the lysine sample information is the real value, and the lysine training information is the calculated value.
  • the weight parameters and training feature parameters are used to calculate the value of the loss function of the convolution model, that is, the loss value. Because the input parameters when training the convolution model are protein sample information, that is, the inputs are equivalent to whole sample proteins, the weight parameters are calculated to weight the loss function and improve the calculation accuracy of the loss value, thus improving the accuracy of the trained convolution model.
  • the loss value is calculated iteratively; when the training of the convolution model is completed, the iterative calculation of the loss value is stopped.
  • the protein sample information is used as the training input parameter of the convolutional model, so that the numbers of ubiquitinable and non-ubiquitinable lysines contained in the protein sample information are consistent with the situation in actual proteins.
  • This embodiment improves the training efficiency and training quality of the convolution model by setting a weighted loss function model, calculating weight parameters and training feature parameters, calculating the loss value, and evaluating the training progress of the convolution model.
  • the training condition is that when the loss value is iteratively calculated 500 times, the training is determined to be completed. In another embodiment, the training condition is that the loss value obtained for 20 consecutive times no longer decreases, and the training is determined to be completed. In other embodiments, 500 iterative calculations and 20 consecutive loss values that do not decrease can also be used as training conditions. If one of them is met, the training is determined to be completed.
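  • The two stopping rules can be combined in a small check that runs after every iteration. This is a sketch under one stated assumption: "no longer decreases for 20 consecutive times" is read as "none of the last 20 loss values is lower than the best value seen before them". The function name `training_finished` is hypothetical.

```python
def training_finished(losses, max_iters=500, patience=20):
    """Return True once either stopping rule described in the text is met.

    losses: loss values recorded so far, one per iteration.
    Rule 1: the loss has been computed max_iters times.
    Rule 2 (assumed reading): the last `patience` losses never dropped
    below the best loss seen before them.
    """
    if len(losses) >= max_iters:
        return True
    if len(losses) > patience:
        best_before = min(losses[:-patience])
        return min(losses[-patience:]) >= best_before
    return False
```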
  • the steps of using protein sample information as input parameters of the convolution model to calculate lysine training information include:
  • the first convolution layer is the first layer structure of the convolution model; the protein sample information includes the training adjacency matrix and the training feature matrix.
  • the first convolution layer is a GAT layer, that is, a Graph Attention Layer, denoted GAT Layer1 (128); the protein sample information consists of the training adjacency matrix and the training feature matrix; L is the number of amino acids in the sample protein, so the training adjacency matrix has size L × L; C is the feature dimension, so the training feature matrix has size L × C; in this embodiment the features are extracted using ESM-1b, and C is 1280.
  • the second convolutional layer is the second layer structure of the convolutional model.
  • the second convolutional layer is a GAT layer, denoted GAT Layer2 (128); the training adjacency matrix and the first feature matrix are used as the input parameters of the second convolution layer, and the second feature matrix is calculated.
  • the self-attention mechanism layer is the third layer structure of the convolution model.
  • the self-attention mechanism layer is represented by Self-attention Layer.
  • the input parameter of the self-attention mechanism layer is the second feature matrix, and the output parameter is the third feature matrix.
  • the third convolution layer is the fourth layer structure of the convolution model.
  • the third convolution layer is a GAT layer, denoted GAT Layer3 (1); it should be noted that the number of channels of the third convolution layer is 1, so the protein training information y_pred contains one value for each amino acid node.
  • an activation function is used to reduce the values in the protein training information to between 0 and 1.
  • the activation function is sigmoid.
  • step S305 is executed.
  • lysine in each protein sample information is ranked at the front of all amino acid nodes.
  • for example, a sample protein contains a total of 1,000 amino acid nodes, including 10 lysine nodes. Since all the information of the sample protein is known, and the protein sample information is actually a matrix, the 10 lysine nodes are arranged at the front of the matrix when the sample protein is converted into protein sample information, and the protein sample information is then used as the input parameter of the first convolutional layer. The first 10 amino acid nodes in the protein training information are therefore all lysine nodes, so the lysine training information can be filtered out directly.
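  • The reordering described here amounts to applying one permutation to both the adjacency matrix and the feature matrix. The sketch below assumes residues are labelled with three-letter codes (lysine is `LYS`) and that matrices are plain nested lists; the function name `lysines_first` is hypothetical.

```python
def lysines_first(residues, adj, feats):
    """Reorder nodes so that all lysine nodes come first.

    residues: three-letter residue names, one per node (assumed labelling).
    adj: L x L adjacency matrix; feats: L feature rows.
    Returns the permuted adjacency matrix, the permuted feature rows and
    the number of lysine nodes.
    """
    order = [i for i, name in enumerate(residues) if name == "LYS"]
    n_lys = len(order)
    order += [i for i, name in enumerate(residues) if name != "LYS"]
    # Rows and columns of the adjacency matrix must be permuted together.
    new_adj = [[adj[i][j] for j in order] for i in order]
    new_feats = [feats[i] for i in order]
    return new_adj, new_feats, n_lys
```

  • With this ordering, the lysine part of a per-node output list `y_pred` is simply `y_pred[:n_lys]`.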
  • lysine nodes are filtered out from protein training information through preset lysine node identifiers, thereby obtaining corresponding lysine training information.
  • the lysine node identification includes but is not limited to the element type and the position information of the central carbon atom.
  • the lysine training information is expressed as y_pred-part, where L′_i is the number of lysines in the i-th lysine training information.
  • each amino acid node in the protein sample information can learn more features about other nodes; the self-attention mechanism layer further expands the learning range of each amino acid node, improving the training accuracy of the convolution model.
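  • The text names GAT layers but does not spell out their internals. The sketch below shows one single-head graph attention layer in the commonly published form (linear projection, LeakyReLU-scored attention over neighbours, softmax, weighted sum). The self-loop, the LeakyReLU slope of 0.2, and the names `gat_layer`, `weight` and `attn` are assumptions for illustration, not details from the source.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(adj, feats, weight, attn):
    """Single-head graph attention layer on nested lists.

    adj: L x L 0/1 adjacency matrix; feats: L x C_in feature rows;
    weight: C_in x C_out projection; attn: attention vector of length 2*C_out.
    """
    n, c_out = len(feats), len(weight[0])
    # Project every node: h_i = x_i W
    proj = [[sum(x[k] * weight[k][o] for k in range(len(x)))
             for o in range(c_out)] for x in feats]
    out = []
    for i in range(n):
        # Neighbours of node i plus the node itself (self-loop is assumed).
        nbrs = [j for j in range(n) if adj[i][j] or j == i]
        scores = [leaky_relu(sum(a * v for a, v in zip(attn, proj[i] + proj[j])))
                  for j in nbrs]
        # Numerically stable softmax over the neighbourhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        out.append([sum(al * proj[j][o] for al, j in zip(alphas, nbrs))
                    for o in range(c_out)])
    return out
```

  • With an all-zero attention vector every neighbour receives the same weight, so the layer reduces to averaging over each node's neighbourhood, which is a convenient sanity check.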
  • the step of calculating weight parameters and training feature parameters based on lysine training information and lysine sample information includes:
  • step S401 can also be replaced by: counting the number of lysine sample information to obtain the number of samples.
  • the first lysine number is calculated by the convolution model in training, and the second lysine number is the actual number of lysine in the sample protein.
  • the two may be the same or different.
  • each lysine sample information corresponds to one protein sample information. The lysine sample information in the same batch may therefore contain the same or different numbers of lysines, so a three-dimensional matrix cannot be formed. The number of samples is used to convert the lysine sample information in the same batch into a second two-dimensional matrix.
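  • The text does not say how proteins with different lysine counts are packed into one two-dimensional matrix. One common approach, assumed here purely for illustration, is to pad each row to the largest lysine count in the batch and keep a mask of the real entries; the name `pad_to_matrix` is hypothetical.

```python
def pad_to_matrix(per_protein_values, pad_value=0.0):
    """Stack variable-length lists into a (samples x max_len) matrix.

    Returns the padded matrix and a 0/1 mask marking real (non-padding)
    entries, so padded positions can be excluded from later calculations.
    """
    width = max(len(values) for values in per_protein_values)
    matrix = [list(values) + [pad_value] * (width - len(values))
              for values in per_protein_values]
    mask = [[1] * len(values) + [0] * (width - len(values))
            for values in per_protein_values]
    return matrix, mask
```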
  • after the activation function converts the value corresponding to each lysine node in the lysine training information into a score between 0 and 1, lysine nodes whose score is greater than the score threshold are determined to be ubiquitination sites, and lysine nodes whose score is less than or equal to the score threshold are determined to be non-ubiquitination sites. The ubiquitinable lysines and the non-ubiquitinable lysines are then counted to obtain the total number of ubiquitinable lysines and the total number of non-ubiquitinable lysines. In this example, the total amount of lysine is represented by N, the total number of ubiquitinable lysines by N+, and the total number of non-ubiquitinable lysines by N−.
  • the weighted loss function model is:
  • loss is the loss value
  • N is the total amount of lysine
  • N+ is the total number of ubiquitinable lysines
  • N− is the total number of non-ubiquitinable lysines
  • y_true is the second two-dimensional matrix
  • y_pred-part is the first two-dimensional matrix.
  • the first two-dimensional matrix and the second two-dimensional matrix are constructed based on the number of samples and the number of lysines, which solves the problem that a three-dimensional matrix cannot be formed because the input sample proteins contain different numbers of lysines, and achieves accurate calculation of the loss value.
  • the simple conversion process and weight parameter calculation process are conducive to saving resources.
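  • The exact weighted loss formula does not appear in the text above, so the following is only a plausible stand-in: a class-weighted binary cross-entropy in which each class is weighted by N divided by twice its count. The weighting scheme, the use of the true labels to count N+ and N−, and the name `weighted_bce` are all assumptions.

```python
import math


def weighted_bce(y_true, y_pred, eps=1e-7):
    """Class-weighted binary cross-entropy over flat lists of lysines.

    y_true: 1 for ubiquitinable, 0 for non-ubiquitinable lysines.
    y_pred: predicted scores in [0, 1].
    Weights (assumed): N / (2 * N+) for positives, N / (2 * N-) for negatives.
    """
    n = len(y_true)
    n_pos = sum(y_true)
    n_neg = n - n_pos
    w_pos = n / (2 * n_pos) if n_pos else 0.0
    w_neg = n / (2 * n_neg) if n_neg else 0.0
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total -= w_pos * t * math.log(p) + w_neg * (1 - t) * math.log(1 - p)
    return total / n
```

  • Under this weighting the rare ubiquitinable class contributes as much to the loss as the abundant non-ubiquitinable class, which is the usual motivation for weighting when whole proteins are used as training samples.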
  • the step of extracting spatial structure feature information from three-dimensional structure information includes:
  • Three-dimensional structural information includes amino acid types, amino acid constituent elements, position information, etc.; among them, amino acid types such as MET, ARG, and LEU, and amino acid constituent elements such as nitrogen, carbon, oxygen, central carbon atom, etc.
  • the central carbon atom is represented by CA. Each amino acid has one central carbon atom, CA, also called the Cα atom. Therefore, each central carbon atom in the three-dimensional structural information is used as the representative of the corresponding amino acid. To identify the central carbon atom, it suffices to use the central carbon atom identifier.
  • the position information of the central carbon atom is used as the position information of the corresponding amino acid.
  • the position information is a coordinate
  • the distance between two amino acids can be calculated through the position information, that is, the distance information.
  • the distance threshold is represented by D. Because the key to the spatial structure feature information is the connection relationship between amino acids, the value of the distance threshold D is particularly important: it determines the accuracy of the spatial structure feature information. It should be noted that after determining that there is a connection relationship between two amino acids, a vector pointing from one amino acid to the connected amino acid is formed, thereby forming a matrix. That is, the spatial structure feature information includes an adjacency matrix and a feature matrix.
  • the connection relationship between the various amino acids in the protein is determined through the position information of the central carbon atom, thereby generating spatial structure feature information.
  • the spatial structure feature information includes all the amino acids of the protein, and then the spatial structure feature information is used as the input parameter of the trained convolution model to identify ubiquitinable sites and improve the recognition accuracy.
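  • The extraction steps (find each CA atom, read its coordinates, compare pairwise distances with D) can be sketched directly on PDB text. The sketch relies on the fixed column layout of PDB `ATOM` records; excluding self-connections and the function names `ca_atoms` and `adjacency` are assumptions for illustration.

```python
import math


def ca_atoms(pdb_lines):
    """Collect (residue name, (x, y, z)) for every C-alpha ATOM record."""
    atoms = []
    for line in pdb_lines:
        # Atom name is in columns 13-16, residue name in columns 18-20,
        # coordinates in columns 31-54 of a PDB ATOM record.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((line[17:20].strip(), xyz))
    return atoms


def adjacency(atoms, threshold):
    """Mark two different residues as connected when their C-alpha atoms
    are closer than the distance threshold."""
    n = len(atoms)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(atoms[i][1], atoms[j][1]) < threshold:
                adj[i][j] = adj[j][i] = 1
    return adj
```

  • Calling `adjacency(ca_atoms(lines), D)` on the lines of a PDB file gives the adjacency matrix for a chosen threshold D; the feature matrix is built separately.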
  • the identification method further includes:
  • letters representing the amino acid type lysine are used as lysine identifiers to identify lysine nodes.
  • S602. Arrange the lysine node in front of the spatial structure feature information.
  • the spatial structure feature information is an array with rows and columns, and the lysine nodes are ranked at the front.
  • the subsequent search for the lysine node is facilitated and the efficiency is improved.
  • the identification method further includes:
  • the protein used to train the convolution model is the sample protein; the number of amino acids, the number of lysines and the number of ubiquitinable lysines in the sample protein are all known. Therefore, by using the sample protein as the input of the trained convolution model, the calculation results of the trained convolution model can be obtained. These are then compared with the actual results for the sample protein to calculate the accuracy with which the trained convolution model identifies non-ubiquitination sites.
  • the non-ubiquitination accuracy rate is set to 0.95; the optimization range of the distance threshold D is [0, 20].
  • specific parameter values of the hyperparameters are selected from the optimization range; where the hyperparameter refers to the distance threshold.
  • the three-dimensional structure information of the sample protein is converted according to the specific parameter value of the hyperparameter, and the spatial structure feature information is obtained.
  • the spatial structure feature information is used as the input parameter of the trained convolution model, finally yielding the numbers of ubiquitinable and non-ubiquitinated lysines in the sample protein. The predicted non-ubiquitinated count is compared with the sample protein's actual non-ubiquitinated count to calculate the non-ubiquitination accuracy. If the non-ubiquitination accuracy is greater than 0.95, the ubiquitinable accuracy is calculated.
  • the unevaluated set R′ is obtained according to the optimization range and the evaluated set R.
  • the iteration condition is 20 iterations. That is, after 20 iterations, iterative calculation of the ubiquitinable accuracy stops. At this point 20 ubiquitinable accuracy values are stored, and the specific parameter value corresponding to each is saved in the evaluated set. The specific parameter value corresponding to the highest ubiquitinable accuracy is extracted from the evaluated set as the value of the distance threshold, completing the optimization of the distance threshold.
  • the process of optimizing the distance threshold is as follows:
  • o(r_best) is the accuracy of the optimal solution in the evaluated set R
  • u(·) is the mean function
  • σ(·) is the probability density function of the standard normal distribution.
  • the known evaluation set R is continuously and iteratively updated to obtain a better distance threshold D.
  • the three-dimensional structure information is converted into spatial structure feature information.
  • the spatial structure feature information includes adjacency matrix and feature matrix.
  • the output result of the trained convolution model is obtained, that is, the lysine feature information.
  • the lysine characteristic information is a matrix only about lysine.
  • the number of rows of the matrix is the number of lysines and the number of columns is 1. That is, a matrix composed of several values equal to the number of lysines is obtained.
  • the value ranges from 0-1 after being processed by the activation function. Then judge whether each value matches according to the classification conditions.
  • if the value is less than 0.5, the lysine corresponding to the value is a non-ubiquitination site; if the value is greater than or equal to 0.5, the lysine corresponding to the value is a ubiquitinable site, which completes the identification of ubiquitination sites.
  • the three-dimensional characteristics of the protein are taken into account, making the identification of ubiquitination sites more accurate.
  • the entire protein is used as the input parameter of the convolutional model, which improves the data processing accuracy of the convolutional model and further improves the identification accuracy of ubiquitination sites.
  • the weight parameters are obtained and the loss value is calculated using the weight parameters, which ensures the accuracy of the calculation of the loss value, which helps to ensure the training efficiency of the convolution model and reduce the number of iterations.
  • the embodiments of the present application also disclose a ubiquitination site identification system, as shown in Figure 6, including an acquisition module 1 for acquiring three-dimensional structural information of proteins;
  • Extraction module 2 used to extract spatial structure feature information from the three-dimensional structure information
  • Processing module 3 is used to process the spatial structure feature information based on the trained convolution model to obtain the lysine feature information of the lysine node; the convolution model is trained through the protein training set;
  • Determination module 4 is configured to determine that the corresponding lysine node is a ubiquitination site if the lysine characteristic information matches the preset classification conditions.
  • the system further includes a training module for obtaining the protein training set;
  • the protein training set includes at least a set of protein sample information and lysine sample information;
  • a calculation module used to calculate the protein sample information as an input parameter of the convolution model to obtain lysine training information
  • a parameter module used to calculate weight parameters and training feature parameters based on the lysine training information and the lysine sample information
  • a loss value module used to calculate a loss value based on the weight parameters, the training feature parameters and a preset weighted loss function model
  • the judgment module is used to judge whether the training is completed based on the loss value and the preset training conditions; the judgment module is also used to iteratively calculate the loss value while the training of the convolution model is not completed, and to stop iteratively calculating the loss value when the training of the convolution model is completed.
  • the calculation module includes a first convolution layer unit, used to calculate the protein sample information through the first convolution layer to obtain a first feature matrix;
  • the protein sample information includes a training adjacency matrix and training feature matrix;
  • a second convolution layer unit is used to calculate the training adjacency matrix and the first feature matrix through the second convolution layer to obtain a second feature matrix
  • the self-attention mechanism layer unit is used to calculate the second feature matrix through the self-attention mechanism layer to obtain the third feature matrix
  • the third convolution layer unit is used to calculate the training adjacency matrix and the third feature matrix through the third convolution layer to obtain protein training information;
  • a screening unit is used to screen out the lysine training information from the protein training information.
  • the parameter module includes a sample unit for counting the number of lysine training information to obtain the number of samples;
  • the first lysine unit is used to count the lysines in each of the lysine training information to obtain the first lysine quantity;
  • the second lysine unit is used to count the lysines in each of the lysine sample information to obtain the number of second lysine;
  • a first matrix unit configured to construct a first two-dimensional matrix in the training feature parameters based on the number of samples and the number of first lysines;
  • a second matrix unit configured to construct a second two-dimensional matrix in the training feature parameters based on the number of samples and the number of second lysines;
  • a total amount unit used to sum the first lysine amount and the second lysine amount to obtain the total amount of lysine in the weight parameter
  • a statistics unit configured to count the number of ubiquitinable lysines in the lysine training information and the lysine sample information according to the lysine training information and a preset score threshold, and obtain the total number of ubiquitinable lysines and the total number of non-ubiquitinated lysines in the weight parameters.
  • the extraction module 2 includes an identification unit for identifying the central carbon atom of each amino acid in the three-dimensional structure information based on a preset central carbon atom identification;
  • a position unit used to extract position information corresponding to each of the central carbon atoms from the three-dimensional structure information
  • a distance unit used to calculate distance information between each of the amino acids based on the position information
  • a generating unit configured to determine that the corresponding two amino acids are connected when the distance information is less than a preset distance threshold, so as to generate the spatial structure feature information.
  • the extraction module 2 further includes a node unit for identifying the lysine node in the spatial structure feature information according to the lysine identity;
  • a configuration unit configured to configure the lysine node in front of the spatial structure feature information.
  • system further includes a condition module for obtaining the non-ubiquitination accuracy rate and the optimization range of the distance threshold;
  • a correct rate module used to select the distance threshold from the optimization range and combine it with the non-ubiquitinating correct rate, and use Bayesian optimization to iteratively calculate the ubiquitinable correct rate;
  • An optimization module configured to extract the distance threshold corresponding to the highest ubiquitinizability accuracy rate after meeting the preset iteration conditions, so as to optimize the distance threshold.
  • the acquisition module obtains the three-dimensional structure information
  • the three-dimensional structure information is converted into spatial structure feature information through the extraction module, and then the lysine feature information is obtained through the processing module.
  • the first matrix unit and the second matrix unit use the number of samples, the number of first lysine and the number of second lysine to construct the first two-dimensional matrix and the second two-dimensional matrix respectively, realizing the calculation of training feature parameters,
  • the process is simple and helps save computing resources.
  • the optimization module optimizes the distance threshold and improves the conversion accuracy of spatial structure feature information, thus improving the identification accuracy and quality of ubiquitination sites.
  • the technical solutions of the embodiments of the present application can be embodied in the form of software products in essence or those that contribute to the existing technology.
  • the computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of this application.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read Only Memory), magnetic disk or optical disk and other media that can store program code.
  • embodiments of the present application are not limited to any specific combination of hardware and software.
  • embodiments of the present application also disclose a storage medium that stores a computer program that can be loaded by a processor and execute the above method.
  • the embodiment of the present application also discloses a device for identifying ubiquitination sites, as shown in Figure 7 , including a processor 100, at least one communication bus 200, user interface 300, at least one external communication interface 400 and memory 500.
  • the communication bus 200 is configured to implement connection communication between these components.
  • the user interface 300 may include a display screen, and the external communication interface 400 may include a standard wired interface and a wireless interface.
  • the memory 500 stores a method for identifying ubiquitination sites.
  • the processor 100 is configured to adopt the above method when performing the identification of ubiquitination sites stored in the memory 500 .
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be electrical, mechanical, or of other forms.
  • the units described above as separate components may or may not be physically separated; the components shown as units may or may not be physical units; they may be located in one place or distributed to multiple network units; Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • all functional units in the embodiments of the present application can be integrated into one processing unit, or each unit can be separately used as a unit, or two or more units can be integrated into one unit; the above-mentioned integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the aforementioned program can be stored in a computer-readable storage medium.
  • when the program is executed, it performs the steps of the above method embodiments.
  • the aforementioned storage media include: various media that can store program codes, such as mobile storage devices, ROMs, magnetic disks, or optical disks.
  • the integrated units mentioned above in this application are implemented in the form of software function modules and sold or used as independent products, they can also be stored in a computer-readable storage medium.
  • the technical solutions of the embodiments of the present application can be embodied in the form of software products in essence or those that contribute to the existing technology.
  • the computer software product is stored in a storage medium and includes a number of instructions that cause a device to perform all or part of the methods described in the various embodiments of this application.
  • the aforementioned storage media include: mobile storage devices, ROMs, magnetic disks or optical disks and other media that can store program codes.


Abstract

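A method, device, system and storage medium for identifying ubiquitination sites, belonging to the technical field of ubiquitination. The method includes: acquiring three-dimensional structure information of a protein (S101); extracting spatial structure feature information from the three-dimensional structure information (S102); processing the spatial structure feature information with a trained convolution model to obtain lysine feature information of the lysine nodes (S103); and, if the lysine feature information matches a preset classification condition, determining that the corresponding lysine node is a ubiquitination site (S104). Using the protein's three-dimensional structure information to identify whether each lysine in the protein can be ubiquitinated improves identification accuracy.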

Description

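Method, device, system and storage medium for identifying ubiquitination sites
Priority information
This application claims priority to Chinese patent application No. 202210850486.2, filed on July 20, 2022, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the technical field of ubiquitination, and in particular to a method, device, system and storage medium for identifying ubiquitination sites.
Background
Ubiquitination is a common post-translational modification of proteins in eukaryotic cells, in which ubiquitin molecules are attached to lysine residues of a target protein through the successive action of ubiquitin-activating enzymes, ubiquitin-conjugating enzymes and ubiquitin ligases. Ubiquitination plays an important role in protein localization, metabolism, cell division, gene transcription and DNA repair, so accurate identification of ubiquitination sites is particularly important.
The prior-art approach to identifying ubiquitination sites is the DeepUbi model, which consists of a word2vec model and a convolutional neural network. It learns embedded features from protein fragments of length 31 centered on the candidate site in order to infer whether the site can be ubiquitinated. However, this approach considers only the sequence features of the protein, which limits the accuracy of ubiquitination site identification.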
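Summary
In view of this, the present application provides a method, device, system and storage medium for identifying ubiquitination sites, to address the low identification accuracy of the prior art. To achieve one, some or all of the above objects, or other objects, the present application proposes a method, device, system and storage medium for identifying ubiquitination sites. In a first aspect, the present application proposes a method for identifying ubiquitination sites, including:
acquiring three-dimensional structure information of a protein;
extracting spatial structure feature information from the three-dimensional structure information;
processing the spatial structure feature information with a trained convolution model to obtain lysine feature information of the lysine nodes, the convolution model having been trained on a protein training set;
if the lysine feature information matches a preset classification condition, determining that the corresponding lysine node is a ubiquitination site.
In one embodiment, the step of training the convolution model on the protein training set includes:
acquiring the protein training set, which contains at least one group of protein sample information and lysine sample information; using the protein sample information as the input parameter of the convolution model to compute lysine training information;
computing weight parameters and training feature parameters from the lysine training information and the lysine sample information;
computing a loss value from the weight parameters, the training feature parameters and a preset weighted loss function model;
determining from the loss value and a preset training condition whether training is complete;
while training of the convolution model is not complete, iteratively computing the loss value; when training of the convolution model is complete, stopping the iterative computation of the loss value.
In one embodiment, the step of using the protein sample information as the input parameter of the convolution model to compute lysine training information includes:
computing on the protein sample information with a first convolution layer to obtain a first feature matrix, the protein sample information including a training adjacency matrix and a training feature matrix;
computing on the training adjacency matrix and the first feature matrix with a second convolution layer to obtain a second feature matrix;
computing on the second feature matrix with a self-attention layer to obtain a third feature matrix;
computing on the training adjacency matrix and the third feature matrix with a third convolution layer to obtain protein training information; and selecting the lysine training information from the protein training information.
In one embodiment, the step of computing weight parameters and training feature parameters from the lysine training information and the lysine sample information includes:
counting the lysine training information items to obtain the sample count;
counting the lysines in each lysine training information item to obtain a first lysine count;
counting the lysines in each lysine sample information item to obtain a second lysine count;
constructing the first two-dimensional matrix of the training feature parameters from the sample count and the first lysine count;
constructing the second two-dimensional matrix of the training feature parameters from the sample count and the second lysine count;
summing the first lysine count and the second lysine count to obtain the total lysine count of the weight parameters;
counting, according to the lysine training information and a preset score threshold, the ubiquitinable lysines in the lysine training information and the lysine sample information, to obtain the ubiquitinable total and the non-ubiquitinated total of the weight parameters.
In one embodiment, the step of extracting spatial structure feature information from the three-dimensional structure information includes:
identifying the central carbon atom of each amino acid in the three-dimensional structure information using a preset central carbon atom identifier;
extracting from the three-dimensional structure information the position information corresponding to each central carbon atom;
computing distance information between the amino acids from the position information;
when the distance information is less than a preset distance threshold, determining that the two corresponding amino acids are connected, so as to generate the spatial structure feature information.
In one embodiment, after the spatial structure feature information is generated, the method further includes:
identifying the lysine nodes in the spatial structure feature information using a lysine identifier;
placing the lysine nodes at the front of the spatial structure feature information.
In one embodiment, after the convolution model has been trained, the method further includes:
acquiring a non-ubiquitination accuracy and an optimization range for the distance threshold;
selecting a distance threshold from the optimization range and, in combination with the non-ubiquitination accuracy, iteratively computing the ubiquitinable accuracy using Bayesian optimization;
once a preset iteration condition is met, extracting the distance threshold corresponding to the highest ubiquitinable accuracy, thereby optimizing the distance threshold.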
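In a second aspect, the present application proposes a system for identifying ubiquitination sites, including:
an acquisition module, used to acquire three-dimensional structure information of a protein;
an extraction module, used to extract spatial structure feature information from the three-dimensional structure information;
a processing module, used to process the spatial structure feature information with a trained convolution model to obtain lysine feature information of the lysine nodes, the convolution model having been trained on a protein training set;
a determination module, used to determine that the corresponding lysine node is a ubiquitination site if the lysine feature information matches a preset classification condition.
In one embodiment, the system further includes a training module, used to acquire the protein training set, which contains at least one group of protein sample information and lysine sample information;
a calculation module, used to compute lysine training information with the protein sample information as the input parameter of the convolution model; a parameter module, used to compute weight parameters and training feature parameters from the lysine training information and the lysine sample information;
a loss value module, used to compute a loss value from the weight parameters, the training feature parameters and a preset weighted loss function model;
a judgment module, used to determine from the loss value and a preset training condition whether training is complete; the judgment module is also used to iteratively compute the loss value while training of the convolution model is not complete, and to stop the iterative computation of the loss value when training is complete.
In one embodiment, the calculation module includes a first convolution layer unit, used to compute on the protein sample information with the first convolution layer to obtain a first feature matrix, the protein sample information including a training adjacency matrix and a training feature matrix;
a second convolution layer unit, used to compute on the training adjacency matrix and the first feature matrix with the second convolution layer to obtain a second feature matrix;
a self-attention layer unit, used to compute on the second feature matrix with the self-attention layer to obtain a third feature matrix;
a third convolution layer unit, used to compute on the training adjacency matrix and the third feature matrix with the third convolution layer to obtain protein training information;
a screening unit, used to select the lysine training information from the protein training information.
In one embodiment, the parameter module includes a sample unit, used to count the lysine training information items to obtain the sample count;
a first lysine unit, used to count the lysines in each lysine training information item to obtain a first lysine count;
a second lysine unit, used to count the lysines in each lysine sample information item to obtain a second lysine count;
a first matrix unit, used to construct the first two-dimensional matrix of the training feature parameters from the sample count and the first lysine count;
a second matrix unit, used to construct the second two-dimensional matrix of the training feature parameters from the sample count and the second lysine count;
a total amount unit, used to sum the first lysine count and the second lysine count to obtain the total lysine count of the weight parameters;
a statistics unit, used to count, according to the lysine training information and a preset score threshold, the ubiquitinable lysines in the lysine training information and the lysine sample information, to obtain the ubiquitinable total and the non-ubiquitinated total of the weight parameters.
In one embodiment, the extraction module includes an identification unit, used to identify the central carbon atom of each amino acid in the three-dimensional structure information using a preset central carbon atom identifier;
a position unit, used to extract from the three-dimensional structure information the position information corresponding to each central carbon atom;
a distance unit, used to compute distance information between the amino acids from the position information;
a generating unit, used to determine that the two corresponding amino acids are connected when the distance information is less than a preset distance threshold, so as to generate the spatial structure feature information.
In one embodiment, the extraction module further includes a node unit, used to identify the lysine nodes in the spatial structure feature information using a lysine identifier;
a configuration unit, used to place the lysine nodes at the front of the spatial structure feature information.
In one embodiment, the system further includes a condition module, used to acquire a non-ubiquitination accuracy and an optimization range for the distance threshold;
an accuracy module, used to select a distance threshold from the optimization range and, in combination with the non-ubiquitination accuracy, iteratively compute the ubiquitinable accuracy using Bayesian optimization;
an optimization module, used to extract, once a preset iteration condition is met, the distance threshold corresponding to the highest ubiquitinable accuracy, thereby optimizing the distance threshold.
In a third aspect, the present application proposes a device for identifying ubiquitination sites, including a memory and a processor, wherein the memory stores a method for identifying ubiquitination sites and the processor adopts the above method when executing the method for identifying ubiquitination sites.
In a fourth aspect, the present application proposes a storage medium storing a computer program that can be loaded by a processor to execute the above method.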
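Implementing the embodiments of the present application has the following beneficial effects:
Spatial structure feature information is obtained from the three-dimensional structure information of the protein, and the convolution model is then used to obtain lysine feature information, from which it is determined whether the corresponding lysine node can be ubiquitinated. Because the three-dimensional structure of the protein is considered, rather than the protein's two-dimensional sequence, both the accuracy and the quality of ubiquitination site identification are improved. In addition, the convolution model is trained on a protein training set rather than on a training set containing equal proportions of ubiquitinable and non-ubiquitinated lysines, which improves the training of the convolution model and the accuracy of its results.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
In the drawings:
Figure 1 is a flowchart of a method for identifying ubiquitination sites in one embodiment.
Figure 2 is a schematic diagram visualizing the spatial structure feature information in one embodiment.
Figure 3 is a flowchart of training the convolution model in the method for identifying ubiquitination sites in one embodiment.
Figure 4 is a flowchart of computing the lysine training information in the method for identifying ubiquitination sites in one embodiment.
Figure 5 is a flowchart of the implementation principle of the method for identifying ubiquitination sites in one embodiment.
Figure 6 is a structural block diagram of a system for identifying ubiquitination sites in one embodiment.
Figure 7 is a schematic structural diagram of a device for identifying ubiquitination sites in one embodiment.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present application.
The following description refers to "some embodiments", which describes a subset of all possible embodiments. It should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and that they may be combined with one another where no conflict arises.
Unless otherwise defined, all technical and scientific terms used herein have the meaning commonly understood by a person skilled in the technical field of the present application. The terms used herein serve only to describe the embodiments of the present application and are not intended to limit it.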
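An embodiment of the present application discloses a method for identifying ubiquitination sites which, as shown in Figure 1, includes:
S101. Acquire three-dimensional structure information of a protein.
A protein is a substance with a defined spatial structure, formed when polypeptide chains, built from amino acids by dehydration condensation, coil and fold. Amino acids are the basic building blocks of proteins, and one protein contains many amino acids. Amino acids come in several types, and lysine is one of them.
The three-dimensional structure information is the three-dimensional structure of the protein. Because the three-dimensional structure is not known for every protein, in one embodiment the step of acquiring the three-dimensional structure information of the protein includes:
S1011. When the three-dimensional structure information of the protein is unknown, process the protein with a protein three-dimensional structure prediction program to obtain three-dimensional structure information in PDB (Protein Data Bank) format;
S1012. When the three-dimensional structure information of the protein is known, acquire the three-dimensional structure information.
In step S1011, the protein three-dimensional structure prediction program includes but is not limited to AlphaFold2. PDB is a file format used in bioinformatics to store three-dimensional protein structures. In step S1012, because the three-dimensional structure information is known, the executing entity may either directly receive transmitted three-dimensional structure information or actively retrieve it from a preset storage path. Note that in this embodiment the file format used to store three-dimensional structure information is always PDB.
S102. Extract spatial structure feature information from the three-dimensional structure information.
The spatial structure feature information captures the three-dimensional characteristics of the protein. Compared with the prior art, which uses protein sequence information, the identification process takes more into account, which helps improve identification accuracy.
S103. Process the spatial structure feature information with the trained convolution model to obtain lysine feature information of the lysine nodes.
The spatial structure feature information is used as the input parameter of the trained convolution model to obtain the lysine feature information of the lysine nodes. As an aid to understanding, Figure 2 shows that visualizing the spatial structure feature information yields multiple amino acid nodes, some of which are lysine nodes. Note that the input parameter of the trained convolution model is the spatial structure feature information itself, not its visualization; Figure 2 is only a schematic for ease of understanding.
In one embodiment, the convolution model is trained on a protein training set. In contrast to the prior art, which uses ubiquitinable and non-ubiquitinated lysines in a 1:1 ratio, a protein training set means that whole proteins serve as training samples. This brings the model's input closer to real proteins and improves the quality of the trained convolution model and the realism of its results.
As shown in Figure 1, S104. If the lysine feature information matches the preset classification condition, the corresponding lysine node is a ubiquitination site.
Each lysine feature information item corresponds to one lysine node, and each lysine node corresponds to one lysine in the protein. When the lysine feature information matches the classification condition, the corresponding lysine in the protein can be ubiquitinated, so that lysine is identified as a ubiquitination site. Spatial structure feature information carrying the protein's three-dimensional characteristics is obtained from the three-dimensional structure information, and a convolution model trained on a protein training set then yields the lysine feature information used to identify the ubiquitination sites. Because more is taken into account, the lysine feature information is more accurate, which improves the accuracy of identifying the protein's ubiquitination sites.
In one embodiment, the classification condition is a value between 0 and 1. Specifically, in one application scenario, the classification condition is whether the value contained in the lysine feature information is greater than 0.6: if it is, the corresponding lysine is determined to be a ubiquitinable site; otherwise it is a non-ubiquitination site.
In other embodiments, the classification condition is formulated from the lysine feature information. Lysine features can be obtained from the lysine feature information, and the corresponding lysines are then divided by these features into two classes, ubiquitinable and non-ubiquitinated.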
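In another embodiment of the present application, by way of further limitation and explanation, and as shown in Figure 3, the step of training the convolution model on the protein training set includes:
S201. Acquire the protein training set; the protein training set contains at least one group of protein sample information and lysine sample information.
In one embodiment, multiple batches of protein training sets are provided for training the convolution model, each batch containing multiple protein training sets, and each protein training set containing multiple groups of protein sample information and lysine sample information.
For ease of understanding, in the embodiments a protein used to train the convolution model is called a sample protein. The number of lysines in a sample protein and which of them are ubiquitinable are known quantities. Protein sample information is the spatial structure feature information obtained from a sample protein; lysine sample information is the matrix information constructed from the number of lysines in the sample protein.
S202. Use the protein sample information as the input parameter of the convolution model to compute lysine training information.
In one embodiment, lysine training information is the matrix information constructed from the number of lysines in the sample protein as computed by the convolution model. In other words, lysine sample information is matrix information built from the actual number of lysines in the sample protein, whereas lysine training information is matrix information built from the number of lysines computed by the convolution model. Lysine sample information is the true value and lysine training information is the computed value.
S203. Compute weight parameters and training feature parameters from the lysine training information and the lysine sample information.
The weight parameters and training feature parameters are used to compute the value of the convolution model's loss function, i.e. the loss value. Because the input parameter during training is the protein sample information, which amounts to feeding in a whole sample protein, weight parameters are computed to weight the loss function. This improves the accuracy of the loss value and therefore the accuracy of the trained convolution model.
S204. Compute the loss value from the weight parameters, the training feature parameters and the preset weighted loss function model.
S205. Determine from the loss value and the preset training condition whether training is complete.
While training of the convolution model is not complete, the loss value is computed iteratively; when training is complete, the iterative computation of the loss value stops.
To improve the computational accuracy of the convolution model, protein sample information is used as the training input, so that the numbers of ubiquitinable and non-ubiquitinated lysines it contains match those of the actual protein. The prior art, by contrast, selects from the non-ubiquitinated lysines as many as there are ubiquitinable lysines to reach a 1:1 ratio, so the number of non-ubiquitinated lysines it uses tends not to match reality. This embodiment sets a weighted loss function model, computes the weight parameters and training feature parameters, and computes the loss value to assess training progress, which improves the training efficiency and training quality of the convolution model.
In one embodiment, the training condition is that training is deemed complete once the loss value has been computed for 500 iterations. In another embodiment, the training condition is that training is deemed complete once the loss value has not decreased for 20 consecutive iterations. In other embodiments, the 500-iteration limit and the 20 consecutive iterations without a decrease may be used together as the training condition, with training deemed complete as soon as either is satisfied.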
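The combined stopping rule described above can be sketched as a small helper. This is a minimal illustration rather than the application's implementation: the function name `training_finished` and the reading of "no longer decreases" as "none of the last 20 losses is lower than the best loss seen before them" are assumptions.

```python
def training_finished(losses, max_iters=500, patience=20):
    """Return True when either stopping rule described above is met.

    losses: loss values, one per completed iteration, oldest first.
    """
    # Rule 1: the loss has been computed max_iters times.
    if len(losses) >= max_iters:
        return True
    # Rule 2 needs at least one earlier loss to compare against.
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    # Assumed reading of "no longer decreases": none of the last
    # `patience` losses is lower than the best loss seen before them.
    return all(loss >= best_before for loss in losses[-patience:])
```

A training loop would append each new loss value to `losses` and stop as soon as the helper returns True.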
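In another embodiment of the present application, by way of further limitation and explanation, and as shown in Figure 4, the step of using the protein sample information as the input parameter of the convolution model to compute lysine training information includes:
S301. Compute on the protein sample information with the first convolution layer to obtain the first feature matrix.
The first convolution layer is the first layer of the convolution model; the protein sample information includes a training adjacency matrix and a training feature matrix. In one embodiment, the first convolution layer is a GAT layer, i.e. a Graph Attention Layer, denoted GAT Layer1(128). The protein sample information is denoted by the expression in formula image PCTCN2022110318-appb-000001, in which the symbol in formula image PCTCN2022110318-appb-000002 is the training adjacency matrix and the symbol in formula image PCTCN2022110318-appb-000003 is the training feature matrix. L is the number of amino acids in the sample protein; C is the feature dimension, which in this embodiment is extracted with ESM-1b and equals 1280.
Note that when a batch of the protein training set contains multiple groups of protein sample information and lysine sample information, all protein sample information in the batch is fed to the first convolution layer, yielding a corresponding number of first feature matrices. L and C are computed per protein sample information item; L is not the total number of amino acids across all protein sample information in the batch. Because different sample proteins have different numbers of amino acids, the resulting values of L also differ.
S302. Compute on the training adjacency matrix and the first feature matrix with the second convolution layer to obtain the second feature matrix.
The second convolution layer is the second layer of the convolution model. In one embodiment, the second convolution layer is a GAT layer, denoted GAT Layer2(128). The training adjacency matrix (formula image PCTCN2022110318-appb-000004) and the first feature matrix are the input parameters of the second convolution layer, which computes the second feature matrix.
S303. Compute on the second feature matrix with the self-attention layer to obtain the third feature matrix.
The self-attention layer is the third layer of the convolution model. In one embodiment, it is denoted Self-attention Layer. Its input parameter is the second feature matrix and its output parameter is the third feature matrix.
S304. Compute on the training adjacency matrix and the third feature matrix with the third convolution layer to obtain the protein training information.
The third convolution layer is the fourth layer of the convolution model. In one embodiment, the third convolution layer is a GAT layer, denoted GAT Layer3(1). Note that the third convolution layer has one channel, so the protein training information is y_pred (formula image PCTCN2022110318-appb-000005).
In one embodiment, to make later identification of ubiquitination sites easier, after the third convolution layer outputs the protein training information, an activation function maps the values in the protein training information into the range 0 to 1. Specifically, in one application scenario, the activation function is sigmoid.
Note that when a batch of the protein training set contains multiple groups of protein sample information and lysine sample information, the convolution model outputs as many protein training information items as there are groups. After the activation function has been applied to each protein training information item, step S305 is executed.
S305. Select the lysine training information from the protein training information.
In one embodiment, the lysines in each protein sample information item are placed ahead of all other amino acid nodes. For example, suppose a sample protein has 1000 amino acid nodes, 10 of which are lysine nodes. Because everything about the sample protein is known and the protein sample information is in fact a matrix, once the sample protein has been converted into protein sample information, the 10 lysine nodes are placed at the front of the matrix, and the protein sample information is then fed to the first convolution layer. As a result, the first 10 amino acid nodes in the protein training information are all lysine nodes, and the lysine training information can be selected directly.
In other embodiments, lysine nodes are selected from the protein training information using a preset lysine node identifier, which gives the corresponding lysine training information. Specifically, the lysine node identifier includes but is not limited to the element type and the position information of the central carbon atom.
In one embodiment, there are multiple protein training information items and therefore multiple lysine training information items. To distinguish them, the lysine training information is denoted y_pred-part (formula image PCTCN2022110318-appb-000006), where L′_i is the number of lysines in the i-th lysine training information item.
The first and second convolution layers let each amino acid node in the protein sample information learn more features about other nodes; the self-attention layer then further widens the range each amino acid node can learn from, which improves the training accuracy of the convolution model.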
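In another embodiment of the present application, by way of further limitation and explanation, the step of computing weight parameters and training feature parameters from the lysine training information and the lysine sample information includes:
S401. Count the lysine training information items to obtain the sample count.
In one embodiment, when a batch contains multiple groups of protein sample information and lysine sample information, multiple lysine training information items are computed, and their number equals the number of protein sample information items. The number of lysine training information items therefore represents the number of protein sample information items in the batch. In this embodiment the sample count is denoted batchsize, so that i = 1, 2, ..., batchsize.
Note that because the numbers of sample proteins, protein sample information items and lysine sample information items are equal, step S401 may be replaced by: count the lysine sample information items to obtain the sample count. For example, suppose 300 batches of protein training sets are set up for training the convolution model, each batch containing 100 groups of protein sample information and lysine sample information. Counting either the protein sample information items or the lysine sample information items in the current batch then gives batchsize = 100.
S402. Count the lysines in each lysine training information item to obtain the first lysine count.
S403. Count the lysines in each lysine sample information item to obtain the second lysine count.
Note that the first lysine count is computed by the convolution model being trained, whereas the second lysine count is the actual number of lysines in the sample proteins; the two may or may not be equal. The first lysine count is the total obtained by summing the lysine nodes over all lysine training information items in the batch, i.e. first lysine count = L′_1 + L′_2 + ... + L′_batchsize. Likewise, the second lysine count is the total over all lysine sample information items, i.e. second lysine count = L″_1 + L″_2 + ... + L″_batchsize.
S404. Construct the first two-dimensional matrix of the training feature parameters from the sample count and the first lysine count.
Note that when a batch contains multiple groups of protein sample information and lysine sample information, each protein sample information item contains a different number of lysine nodes, so the resulting lysine training information items also contain different numbers of lysine nodes and cannot be merged into a single three-dimensional matrix. The sample count is therefore used to convert the lysine training information of the batch into the first two-dimensional matrix (formula image PCTCN2022110318-appb-000007).
S405. Construct the second two-dimensional matrix of the training feature parameters from the sample count and the second lysine count.
As in step S404, each lysine sample information item corresponds to a protein sample information item, so the lysine sample information items in a batch may contain equal or different numbers of lysines and cannot form a three-dimensional matrix. The sample count is used to convert the lysine sample information of the batch into the second two-dimensional matrix (formula image PCTCN2022110318-appb-000008).
S406. Sum the first lysine count and the second lysine count to obtain the total lysine count of the weight parameters.
S407. Count, according to the lysine training information and the preset score threshold, the ubiquitinable lysines in the lysine training information and the lysine sample information, to obtain the ubiquitinable total and the non-ubiquitinated total of the weight parameters.
In one embodiment, after the activation function has converted the values corresponding to the lysine nodes in the lysine training information into scores between 0 and 1, a lysine node whose score is greater than the score threshold is determined to be a ubiquitinable site, and a lysine node whose score is less than or equal to the score threshold is determined to be a non-ubiquitination site. Counting all ubiquitinable lysines and all non-ubiquitinated lysines then gives the ubiquitinable total and the non-ubiquitinated total. In this embodiment, the total lysine count is denoted N, the ubiquitinable total N+, and the non-ubiquitinated total N−.
The weight parameters and training feature parameters are simple to compute and not error-prone, which safeguards the accuracy of the loss value.
In one embodiment, the weighted loss function model is given by formula image PCTCN2022110318-appb-000009, where loss is the loss value, N is the total lysine count, N+ is the ubiquitinable total, N− is the non-ubiquitinated total, y_true is the second two-dimensional matrix, and y_pred-part is the first two-dimensional matrix.
By constructing the first and second two-dimensional matrices from the sample count and the lysine counts, this conversion solves the problem that sample proteins with different numbers of lysines cannot form a three-dimensional matrix, and allows the loss value to be computed accurately. The simple conversion and the simple computation of the weight parameters also help save resources.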
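The weighted loss itself survives here only as a formula image, so its exact form cannot be reproduced. As a hedged illustration of how N, N+ and N− can weight a loss over ragged per-protein lysine scores, the sketch below flattens the per-protein lists and applies one common class-weighted binary cross-entropy. The function name `weighted_bce`, the weights N/(2N+) and N/(2N−), and counting N+ from the true labels are all assumptions, not the application's formula.

```python
import math

def weighted_bce(pred_per_protein, true_per_protein, eps=1e-12):
    """Class-weighted binary cross-entropy over a batch of proteins.

    pred_per_protein: list of lists of lysine scores in (0, 1), one list per protein.
    true_per_protein: list of lists of 0/1 labels with the same ragged shape.
    """
    # Proteins have different lysine counts, so flatten the ragged batch.
    preds = [p for protein in pred_per_protein for p in protein]
    labels = [y for protein in true_per_protein for y in protein]
    if len(preds) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    n = len(labels)        # N: total lysines in the batch (assumed definition)
    n_pos = sum(labels)    # N+: ubiquitinable lysines
    n_neg = n - n_pos      # N-: non-ubiquitinated lysines
    # Assumed weighting: each class contributes equally regardless of its size.
    w_pos = n / (2 * n_pos) if n_pos else 0.0
    w_neg = n / (2 * n_neg) if n_neg else 0.0
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p)
    return -total / n
```

With balanced classes both weights equal 1 and the function reduces to ordinary binary cross-entropy; with few positives, each positive lysine counts proportionally more.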
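In another embodiment of the present application, by way of further limitation and explanation, the step of extracting spatial structure feature information from the three-dimensional structure information includes:
S501. Identify the central carbon atom of each amino acid in the three-dimensional structure information using the preset central carbon atom identifier.
The three-dimensional structure information contains amino acid types, the constituent elements of each amino acid, position information and so on. Amino acid types are, for example, MET, ARG and LEU; constituent elements are, for example, nitrogen, carbon, oxygen and the central carbon atom. The central carbon atom is denoted CA. Every amino acid has one central carbon atom CA, called the Cα (alpha-carbon) atom, so each central carbon atom in the three-dimensional structure information is taken to represent its amino acid. The central carbon atoms are identified simply by the central carbon atom identifier.
S502. Extract from the three-dimensional structure information the position information corresponding to each central carbon atom.
The position information of the central carbon atom is used as the position information of the corresponding amino acid.
S503. Compute distance information between the amino acids from the position information.
In one embodiment, the position information is a coordinate, from which the distance between two amino acids, i.e. the distance information, can be computed.
S504. When the distance information is less than the preset distance threshold, determine that the two corresponding amino acids are connected, so as to generate the spatial structure feature information.
The distance threshold is denoted D. Because the key to the spatial structure feature information is the connectivity between amino acids, the value of D is particularly important and determines the accuracy of the spatial structure feature information. Note that once two amino acids are determined to be connected, a vector pointing from one amino acid to the connected amino acid is formed, and these vectors together form a matrix. The spatial structure feature information thus contains an adjacency matrix and a feature matrix.
The connectivity between the amino acids in the protein is determined from the position information of the central carbon atoms, which generates the spatial structure feature information. The spatial structure feature information therefore covers all amino acids of the protein, and using it as the input parameter of the trained convolution model to identify ubiquitinable sites improves identification accuracy.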
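Steps S501 to S504 can be sketched with the standard fixed-column PDB `ATOM` record layout (atom name in columns 13-16, residue name in columns 18-20, coordinates in columns 31-54). This is a minimal sketch under assumptions: the helper names, the three-residue toy structure and the 5.0 threshold are illustrative only, self-loops are not added, and the feature matrix (extracted with ESM-1b in the embodiment) is not shown.

```python
import math

def make_atom_line(serial, atom, res_name, res_seq, x, y, z, chain="A"):
    # Build one fixed-column PDB ATOM record for the toy example below.
    return "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f" % (
        serial, atom, res_name, chain, res_seq, x, y, z)

def read_ca_atoms(pdb_text):
    """S501-S502: keep one CA atom per residue with its residue name and coordinates."""
    residues, coords = [], []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residues.append(line[17:20])
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return residues, coords

def contact_adjacency(coords, threshold):
    """S503-S504: connect two residues when their CA-CA distance is below the threshold."""
    n = len(coords)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency

# Toy structure: three residues spaced 3.8 apart along the x axis.
pdb_text = "\n".join([
    make_atom_line(1, "N", "MET", 1, 0.0, 1.0, 0.0),
    make_atom_line(2, "CA", "MET", 1, 0.0, 0.0, 0.0),
    make_atom_line(3, "CA", "LYS", 2, 3.8, 0.0, 0.0),
    make_atom_line(4, "CA", "ARG", 3, 7.6, 0.0, 0.0),
])
residues, coords = read_ca_atoms(pdb_text)
adjacency = contact_adjacency(coords, threshold=5.0)
```

In the toy structure, neighbouring residues are 3.8 apart and therefore connected, while the first and third residues are 7.6 apart and are not.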
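In another embodiment of the present application, by way of further limitation and explanation, after the spatial structure feature information is generated, the identification method further includes:
S601. Identify the lysine nodes in the spatial structure feature information using the lysine identifier.
In one embodiment, the letters that denote the amino acid type lysine are used as the lysine identifier to identify the lysine nodes.
S602. Place the lysine nodes at the front of the spatial structure feature information.
The spatial structure feature information is an array with rows and columns, and the lysine nodes are placed in the leading positions.
Changing the position of the lysine nodes within the spatial structure feature information makes it easier to look up the lysine nodes later and improves efficiency.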
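Moving the lysine nodes to the front amounts to applying the same permutation to the node list and to both the rows and the columns of the adjacency matrix. The sketch below assumes three-letter residue names with `LYS` as the lysine identifier; the function name `lysines_first` is hypothetical.

```python
def lysines_first(residues, adjacency, lysine_id="LYS"):
    """Reorder nodes so lysines come first, keeping the graph itself unchanged.

    Returns the reordered residue names, the reordered adjacency matrix and
    the number of lysine nodes (the length of the leading lysine block).
    """
    lys = [i for i, name in enumerate(residues) if name == lysine_id]
    rest = [i for i, name in enumerate(residues) if name != lysine_id]
    order = lys + rest
    new_residues = [residues[i] for i in order]
    # Permute rows and columns together so every edge follows its nodes.
    new_adjacency = [[adjacency[i][j] for j in order] for i in order]
    return new_residues, new_adjacency, len(lys)
```

After reordering, the lysine outputs of the model are simply the first `n_lys` rows, which is what makes the later screening step a plain slice. A feature matrix would need its rows permuted with the same `order`.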
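In another embodiment of the present application, by way of further limitation and explanation, after the convolution model has been trained, the identification method further includes:
S701. Acquire the non-ubiquitination accuracy and the optimization range of the distance threshold.
The proteins used to train the convolution model are sample proteins, for which the number of amino acids, the number of lysines and the number of ubiquitinable lysines are all known. Feeding a sample protein to the trained convolution model therefore yields the model's results, and comparing these with the sample protein's actual values gives the accuracy with which the trained convolution model identifies non-ubiquitination sites. In one embodiment, the non-ubiquitination accuracy is set to 0.95 and the optimization range of the distance threshold D is [0, 20].
S702. Select a distance threshold from the optimization range and, in combination with the non-ubiquitination accuracy, iteratively compute the ubiquitinable accuracy using Bayesian optimization.
In one embodiment, following the principle of Bayesian optimization, a specific value of the hyperparameter is selected from the optimization range; the hyperparameter here is the distance threshold. The three-dimensional structure information of the sample protein is converted according to this value to obtain spatial structure feature information, which is fed to the trained convolution model to obtain the numbers of ubiquitinable and non-ubiquitinated lysines in the sample protein. The non-ubiquitinated count is compared with the sample protein's true non-ubiquitinated count to compute the non-ubiquitination accuracy. If the non-ubiquitination accuracy is greater than 0.95, the ubiquitinable accuracy is computed, and the value selected in this round is recorded in the evaluated set R. Correspondingly, the unevaluated set R′ is obtained from the optimization range and the evaluated set R.
A specific value of the hyperparameter is then selected from the unevaluated set and the above steps are repeated, iteratively computing the ubiquitinable accuracy.
S703. Once the preset iteration condition is met, extract the distance threshold corresponding to the highest ubiquitinable accuracy, thereby optimizing the distance threshold.
In one embodiment, the iteration condition is 20 iterations, so the iterative computation of the ubiquitinable accuracy stops after 20 iterations. At that point 20 ubiquitinable accuracy values are stored, and the specific parameter value corresponding to each is saved in the evaluated set. The parameter value corresponding to the highest ubiquitinable accuracy is extracted from the evaluated set as the value of the distance threshold, which completes the optimization of the distance threshold.
Optimizing the distance threshold further improves the accuracy of ubiquitination site identification.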
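The final selection in S703 can be sketched as a filter followed by a maximum. The tuple layout of the evaluated records and the function name `pick_distance_threshold` are assumptions made for illustration; the accuracy values in any usage would come from running the trained model, not from this helper.

```python
def pick_distance_threshold(evaluated, min_non_ub_accuracy=0.95):
    """Pick the threshold with the best ubiquitinable accuracy.

    evaluated: list of (threshold, non_ub_accuracy, ub_accuracy) tuples,
               one per Bayesian-optimization iteration (assumed layout).
    Returns None when no candidate passes the non-ubiquitination constraint.
    """
    # Only thresholds whose non-ubiquitination accuracy exceeds the bound count.
    admissible = [rec for rec in evaluated if rec[1] > min_non_ub_accuracy]
    if not admissible:
        return None
    return max(admissible, key=lambda rec: rec[2])[0]
```

The constraint is applied first, so a threshold with a very high ubiquitinable accuracy is still rejected if it mislabels too many non-ubiquitinated lysines.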
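In one application scenario, the distance threshold is optimized as follows:
The objective function is modeled as a Gaussian process (formula image PCTCN2022110318-appb-000010), which serves as prior knowledge; the kernel is given in formula image PCTCN2022110318-appb-000011, where δ is a hyperparameter. Assume there is an evaluated set R = {r_1, r_2, ...} and an unevaluated set R′ = {r′_1, r′_2, ...}. The evaluated set holds known points, i.e. distance thresholds whose accuracy is known; the unevaluated set holds candidate hyperparameter values that have not yet been verified. From this prior knowledge, the mean u(r′) and variance σ(r′) of R′ are derived:
u(r′) = K_{R′R} K_{RR}^{-1} o(R)
σ(r′) = K_{R′R′} − K_{R′R} K_{RR}^{-1} K_{RR′}
Here K_{RR}, K_{R′R} (the transpose of K_{RR′}) and K_{R′R′} are the corresponding covariance matrices. Once the distributions of the evaluated and unevaluated sets are available, the acquisition function α_EI is used to select the r′ for the next verification iteration. The acquisition function α_EI is given in formula images PCTCN2022110318-appb-000012 and PCTCN2022110318-appb-000013, where o(r_best) is the accuracy of the best solution in the evaluated set R, u(·) is the mean function, the symbol in formula image PCTCN2022110318-appb-000014 is the cumulative distribution function, and σ(·) is the probability density function of the standard normal distribution.
The evaluated set R is updated iteratively by the above procedure until a good distance threshold D is obtained.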
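The acquisition function is only available as formula images, so the sketch below uses the standard expected-improvement form for a maximization problem, EI = (u − o_best)·Φ(z) + s·φ(z) with z = (u − o_best)/s, as an assumed stand-in. It takes the posterior mean and the posterior standard deviation (the square root of the variance above) for each candidate; the function names are hypothetical.

```python
import math

def normal_pdf(z):
    # Probability density function of the standard normal distribution.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    # Cumulative distribution function of the standard normal distribution.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mean, std, best):
    """Standard EI for maximization; `best` is o(r_best) from the evaluated set."""
    if std <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * normal_cdf(z) + std * normal_pdf(z)

def next_candidate(candidates, means, stds, best):
    """Return the unevaluated threshold with the largest expected improvement."""
    scores = [expected_improvement(m, s, best) for m, s in zip(means, stds)]
    return candidates[scores.index(max(scores))]
```

Candidates with a high predicted accuracy or a high uncertainty both score well, which is how the search balances exploiting known good thresholds against exploring untested ones.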
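The implementation principle of the method for identifying ubiquitination sites disclosed in the embodiments of the present application is as follows:
As shown in Figure 5, once the three-dimensional structure information of the protein has been obtained, it is converted into spatial structure feature information, which contains an adjacency matrix and a feature matrix. The spatial structure feature information is fed to the trained convolution model, whose output is the lysine feature information. The lysine feature information is a matrix covering only the lysines: its number of rows is the number of lysines and its number of columns is 1, so it consists of as many values as there are lysines. After the activation function has been applied, the values lie between 0 and 1. Each value is then checked against the classification condition: if the value is less than 0.5, the corresponding lysine is a non-ubiquitination site; if the value is greater than or equal to 0.5, the corresponding lysine is a ubiquitinable site. This completes the identification of ubiquitination sites.
The identification process takes the three-dimensional characteristics of the protein into account, which makes the identification of ubiquitination sites more accurate. Training uses whole proteins as the input of the convolution model, which improves the model's data processing accuracy and further improves identification accuracy. In addition, during training the weight parameters are computed and used to calculate the loss value, which safeguards the accuracy of the loss value, helps maintain training efficiency and reduces the number of iterations.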
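The final classification step can be sketched in a few lines. The sketch assumes the model's raw per-lysine outputs arrive as a flat list and that sigmoid is the activation function, as in the application scenario above; the function names are hypothetical.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function mapping any real value into (0, 1).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def classify_lysines(raw_outputs, threshold=0.5):
    """Return one flag per lysine: True for a ubiquitinable site.

    A score equal to the threshold counts as ubiquitinable, matching the
    "greater than or equal to 0.5" rule described above.
    """
    return [sigmoid(value) >= threshold for value in raw_outputs]
```

Passing `threshold=0.6` with a strict comparison instead would correspond to the alternative classification condition mentioned in the earlier application scenario.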
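An embodiment of the present application also discloses a system for identifying ubiquitination sites which, as shown in Figure 6, includes an acquisition module 1, used to acquire three-dimensional structure information of a protein;
an extraction module 2, used to extract spatial structure feature information from the three-dimensional structure information;
a processing module 3, used to process the spatial structure feature information with a trained convolution model to obtain lysine feature information of the lysine nodes, the convolution model having been trained on a protein training set;
a determination module 4, used to determine that the corresponding lysine node is a ubiquitination site if the lysine feature information matches a preset classification condition.
In one embodiment, the system further includes a training module, used to acquire the protein training set, which contains at least one group of protein sample information and lysine sample information;
a calculation module, used to compute lysine training information with the protein sample information as the input parameter of the convolution model; a parameter module, used to compute weight parameters and training feature parameters from the lysine training information and the lysine sample information;
a loss value module, used to compute a loss value from the weight parameters, the training feature parameters and a preset weighted loss function model;
a judgment module, used to determine from the loss value and a preset training condition whether training is complete; the judgment module is also used to iteratively compute the loss value while training of the convolution model is not complete, and to stop the iterative computation of the loss value when training is complete.
In one embodiment, the calculation module includes a first convolution layer unit, used to compute on the protein sample information with the first convolution layer to obtain a first feature matrix, the protein sample information including a training adjacency matrix and a training feature matrix;
a second convolution layer unit, used to compute on the training adjacency matrix and the first feature matrix with the second convolution layer to obtain a second feature matrix;
a self-attention layer unit, used to compute on the second feature matrix with the self-attention layer to obtain a third feature matrix;
a third convolution layer unit, used to compute on the training adjacency matrix and the third feature matrix with the third convolution layer to obtain protein training information;
a screening unit, used to select the lysine training information from the protein training information.
In one embodiment, the parameter module includes a sample unit, used to count the lysine training information items to obtain the sample count;
a first lysine unit, used to count the lysines in each lysine training information item to obtain a first lysine count;
a second lysine unit, used to count the lysines in each lysine sample information item to obtain a second lysine count;
a first matrix unit, used to construct the first two-dimensional matrix of the training feature parameters from the sample count and the first lysine count;
a second matrix unit, used to construct the second two-dimensional matrix of the training feature parameters from the sample count and the second lysine count;
a total amount unit, used to sum the first lysine count and the second lysine count to obtain the total lysine count of the weight parameters;
a statistics unit, used to count, according to the lysine training information and a preset score threshold, the ubiquitinable lysines in the lysine training information and the lysine sample information, to obtain the ubiquitinable total and the non-ubiquitinated total of the weight parameters.
In one embodiment, the extraction module 2 includes an identification unit, used to identify the central carbon atom of each amino acid in the three-dimensional structure information using a preset central carbon atom identifier;
a position unit, used to extract from the three-dimensional structure information the position information corresponding to each central carbon atom;
a distance unit, used to compute distance information between the amino acids from the position information;
a generating unit, used to determine that the two corresponding amino acids are connected when the distance information is less than a preset distance threshold, so as to generate the spatial structure feature information.
In one embodiment, the extraction module 2 further includes a node unit, used to identify the lysine nodes in the spatial structure feature information using a lysine identifier;
a configuration unit, used to place the lysine nodes at the front of the spatial structure feature information.
In one embodiment, the system further includes a condition module, used to acquire a non-ubiquitination accuracy and an optimization range for the distance threshold;
an accuracy module, used to select a distance threshold from the optimization range and, in combination with the non-ubiquitination accuracy, iteratively compute the ubiquitinable accuracy using Bayesian optimization;
an optimization module, used to extract, once a preset iteration condition is met, the distance threshold corresponding to the highest ubiquitinable accuracy, thereby optimizing the distance threshold.
After the acquisition module obtains the three-dimensional structure information, the extraction module converts it into spatial structure feature information, and the processing module then produces the lysine feature information. Because the three-dimensional characteristics of the protein are used, the accuracy of ubiquitination site identification is improved. The first and second matrix units use the sample count, the first lysine count and the second lysine count to construct the first and second two-dimensional matrices respectively, which yields the training feature parameters through a simple process that saves computing resources. The optimization module optimizes the distance threshold, which improves the accuracy of converting to spatial structure feature information and therefore the accuracy and quality of ubiquitination site identification.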
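It should be noted that the above description of the system embodiment for identifying ubiquitination sites is similar to the description of the method above and has the same beneficial effects as the method embodiments. For technical details not disclosed in the system embodiment, a person skilled in the art should refer to the description of the method embodiments of the present application.
Note that in the embodiments of the present application, if the above method is implemented in the form of software function modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. On this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the embodiments of the present application. The storage medium includes USB flash drives, removable hard disks, read-only memory (ROM), magnetic disks, optical disks and other media that can store program code. The embodiments of the present application are thus not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present application also discloses a storage medium storing a computer program that can be loaded by a processor to execute the above method.
An embodiment of the present application also discloses a device for identifying ubiquitination sites which, as shown in Figure 7, includes a processor 100, at least one communication bus 200, a user interface 300, at least one external communication interface 400 and a memory 500. The communication bus 200 is configured to provide connection and communication between these components. The user interface 300 may include a display screen, and the external communication interface 400 may include standard wired and wireless interfaces. The memory 500 stores a method for identifying ubiquitination sites. The processor 100 adopts the above method when executing the identification of ubiquitination sites stored in the memory 500.
The above description of the device and storage medium embodiments is similar to the description of the method embodiments and has similar beneficial effects. For technical details not disclosed in the device and storage medium embodiments, refer to the description of the method embodiments of the present application.
It should be understood that references throughout the specification to "one embodiment" or "an embodiment" mean that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present application. Occurrences of "in one embodiment" or "in an embodiment" throughout the specification therefore do not necessarily refer to the same embodiment. Furthermore, these particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments. It should also be understood that, in the various embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution; the order of execution is determined by the function and internal logic of each process and does not limit the implementation of the embodiments. The sequence numbers of the embodiments are for description only and do not indicate their relative merit.
Note that in this document the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article or device. Without further limitation, an element qualified by the phrase "includes a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes it.
In the embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in practice: multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or of other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected as needed to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may all be integrated into one processing unit, each unit may stand alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as hardware plus software functional units.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes removable storage devices, ROM, magnetic disks, optical disks and other media that can store program code.
Alternatively, if the integrated units of the present application are implemented as software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium. On this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a device to execute all or part of the methods described in the embodiments of the present application. The storage medium includes removable storage devices, ROM, magnetic disks, optical disks and other media that can store program code.
The above discloses only preferred embodiments of the present application and does not limit the scope of its claims; equivalent changes made in accordance with the claims of the present application remain within its scope.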

Claims (10)

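  1. A method for identifying ubiquitination sites, comprising:
    acquiring three-dimensional structure information of a protein;
    extracting spatial structure feature information from the three-dimensional structure information;
    processing the spatial structure feature information based on a trained convolutional model to obtain lysine feature information of lysine nodes, the convolutional model having been trained on a protein training set; and
    if the lysine feature information matches a preset classification condition, determining that the corresponding lysine node is a ubiquitination site.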
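  2. The method for identifying ubiquitination sites according to claim 1, wherein the step of training the convolutional model on the protein training set comprises:
    acquiring the protein training set, the protein training set containing at least one group of protein sample information and lysine sample information;
    computing lysine training information by using the protein sample information as an input parameter of the convolutional model;
    computing weight parameters and training feature parameters based on the lysine training information and the lysine sample information;
    computing a loss value based on the weight parameters, the training feature parameters, and a preset weighted loss function model;
    determining whether training is complete based on the loss value and a preset training condition; and
    when training of the convolutional model is not complete, iteratively computing the loss value; when training of the convolutional model is complete, stopping the iterative computation of the loss value.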
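  3. The method for identifying ubiquitination sites according to claim 2, wherein the step of computing lysine training information by using the protein sample information as an input parameter of the convolutional model comprises:
    computing on the protein sample information through a first convolutional layer to obtain a first feature matrix, the protein sample information comprising a training adjacency matrix and a training feature matrix;
    computing on the training adjacency matrix and the first feature matrix through a second convolutional layer to obtain a second feature matrix;
    computing on the second feature matrix through a self-attention layer to obtain a third feature matrix;
    computing on the training adjacency matrix and the third feature matrix through a third convolutional layer to obtain protein training information; and
    selecting the lysine training information from the protein training information.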
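  4. The method for identifying ubiquitination sites according to claim 2, wherein the step of computing weight parameters and training feature parameters based on the lysine training information and the lysine sample information comprises:
    counting the pieces of lysine training information to obtain a sample quantity;
    counting the lysines in each piece of lysine training information to obtain a first lysine quantity;
    counting the lysines in each piece of lysine sample information to obtain a second lysine quantity;
    constructing a first two-dimensional matrix of the training feature parameters based on the sample quantity and the first lysine quantity;
    constructing a second two-dimensional matrix of the training feature parameters based on the sample quantity and the second lysine quantity;
    summing the first lysine quantity and the second lysine quantity to obtain a total lysine quantity of the weight parameters; and
    counting, according to the lysine training information and a preset score threshold, the number of ubiquitinable lysines in the lysine training information and the lysine sample information, to obtain a total ubiquitinable count and a total non-ubiquitinated count of the weight parameters.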
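  5. The method for identifying ubiquitination sites according to claim 1, wherein the step of extracting spatial structure feature information from the three-dimensional structure information comprises:
    identifying the central carbon atom of each amino acid in the three-dimensional structure information based on a preset central carbon atom identifier;
    extracting, from the three-dimensional structure information, position information corresponding to each central carbon atom;
    computing distance information between the amino acids based on the position information; and
    when the distance information is less than a preset distance threshold, determining that the two corresponding amino acids are connected, so as to generate the spatial structure feature information.
  6. The method for identifying ubiquitination sites according to claim 5, further comprising, after generating the spatial structure feature information:
    identifying the lysine nodes in the spatial structure feature information according to a lysine identifier; and
    placing the lysine nodes at the front of the spatial structure feature information.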
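  7. The method for identifying ubiquitination sites according to claim 5, further comprising, after training of the convolutional model is complete:
    acquiring a non-ubiquitination accuracy and an optimization range of the distance threshold;
    selecting the distance threshold from the optimization range and, in combination with the non-ubiquitination accuracy, iteratively computing a ubiquitinable accuracy using Bayesian optimization; and
    after a preset iteration condition is met, extracting the distance threshold corresponding to the highest ubiquitinable accuracy, so as to optimize the distance threshold.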
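  8. A system for identifying ubiquitination sites, comprising:
    an acquisition module, configured to acquire three-dimensional structure information of a protein;
    an extraction module, configured to extract spatial structure feature information from the three-dimensional structure information;
    a processing module, configured to process the spatial structure feature information based on a trained convolutional model to obtain lysine feature information of lysine nodes, the convolutional model having been trained on a protein training set; and
    a determination module, configured to determine, if the lysine feature information matches a preset classification condition, that the corresponding lysine node is a ubiquitination site.
  9. An apparatus for identifying ubiquitination sites, comprising a memory and a processor, wherein the memory stores a ubiquitination site identification method, and the processor applies the method according to any one of claims 1 to 7 when executing the ubiquitination site identification method.
  10. A storage medium, wherein the storage medium stores a computer program that can be loaded by a processor to execute the method according to any one of claims 1 to 7.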
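
As an illustration only, the graph construction recited in claims 5 and 6 and the layer ordering recited in claim 3 can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names, the 8.0 distance threshold (in ångströms), the "LYS" residue label, the symmetric adjacency normalization, the dot-product form of self-attention, the sigmoid scoring with a 0.5 cutoff, and the random untrained weights and features are all hypothetical choices that the claims do not specify. Parsing of a structure file is omitted; central-carbon coordinates are passed in directly.

```python
import numpy as np


def build_contact_graph(residue_names, ca_coords, distance_threshold=8.0):
    """Connect amino acids whose central-carbon distance is below the threshold,
    then reorder the nodes so that lysines come first (claims 5 and 6)."""
    coords = np.asarray(ca_coords, dtype=float)
    # Pairwise Euclidean distances between central carbon atoms.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adjacency = (dist < distance_threshold).astype(float)
    np.fill_diagonal(adjacency, 0.0)  # assumption: no self-connections here
    # Stable sort: lysines (key False) first, other residues keep their order.
    order = sorted(range(len(residue_names)), key=lambda i: residue_names[i] != "LYS")
    adjacency = adjacency[np.ix_(order, order)]
    names = [residue_names[i] for i in order]
    n_lys = sum(name == "LYS" for name in names)
    return adjacency, names, n_lys


def gcn_layer(adjacency, features, weight, activate=True):
    """One graph convolution with symmetric normalization (an assumed variant)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    out = a_norm @ features @ weight
    return np.maximum(out, 0.0) if activate else out  # ReLU unless disabled


def self_attention(features):
    """Parameter-free scaled dot-product self-attention over the nodes."""
    scores = features @ features.T / np.sqrt(features.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ features


def lysine_scores(adjacency, features, weights, n_lys):
    """Conv -> conv -> self-attention -> conv, then keep the lysine rows (claim 3)."""
    w1, w2, w3 = weights
    h1 = gcn_layer(adjacency, features, w1)
    h2 = gcn_layer(adjacency, h1, w2)
    h3 = self_attention(h2)
    logits = gcn_layer(adjacency, h3, w3, activate=False)
    scores = 1.0 / (1.0 + np.exp(-logits[:, 0]))  # sigmoid score per node
    return scores[:n_lys]  # lysine nodes were placed first


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    residues = ["ALA", "LYS", "GLY", "LYS"]
    coords = [[0, 0, 0], [3, 0, 0], [20, 0, 0], [22, 0, 0]]
    adjacency, names, n_lys = build_contact_graph(residues, coords)
    features = rng.normal(size=(len(names), 5))  # placeholder node features
    weights = (rng.normal(size=(5, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 1)))
    scores = lysine_scores(adjacency, features, weights, n_lys)
    # Assumed classification condition: score above 0.5 marks a candidate site.
    print(names[:n_lys], scores > 0.5)
```

Placing the lysine nodes first means the lysine rows can be selected with a single slice after the last layer, which is one plausible reason for the reordering step in claim 6. With untrained random weights the printed labels carry no biological meaning.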
PCT/CN2022/110318 2022-07-20 2022-08-04 Ubiquitination site identification method, apparatus, system, and storage medium WO2024016389A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210850486.2 2022-07-20
CN202210850486.2A CN114927165B (zh) 2022-07-20 2022-07-20 Ubiquitination site identification method, apparatus, system, and storage medium

Publications (1)

Publication Number Publication Date
WO2024016389A1 true WO2024016389A1 (zh) 2024-01-25

Family

ID=82815711

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/110318 WO2024016389A1 (zh) 2022-07-20 2022-08-04 Ubiquitination site identification method, apparatus, system, and storage medium

Country Status (2)

Country Link
CN (1) CN114927165B (zh)
WO (1) WO2024016389A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190156915A1 (en) * 2017-08-31 2019-05-23 Shenzhen University Method, apparatus, device and storage medium for predicting protein binding site
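CN112447265A (zh) * 2020-11-25 2021-03-05 太原理工大学 Lysine acetylation site prediction method based on modular densely connected convolutional network
CN114283878A (zh) * 2021-08-27 2022-04-05 腾讯科技(深圳)有限公司 Method and apparatus for training a matching model, predicting amino acid sequences, and designing drugs
CN114496095A (zh) * 2022-01-20 2022-05-13 广东药科大学 Modification site identification method, system, apparatus, and storage medium
CN114613427A (zh) * 2022-03-15 2022-06-10 水木未来(北京)科技有限公司 Protein three-dimensional structure prediction method and apparatus, electronic device, and storage medium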

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1500927A4 (en) * 2002-04-26 2008-03-05 Ajinomoto Kk PROCEDURE FOR PROTEIN STRUCTURE ANALYSIS, PROTEIN STRUCTURE ANALYZER, PROGRAM AND RECORDING MEDIUM
US8729009B2 (en) * 2009-05-15 2014-05-20 Stichting Het Nederlands Kanker Instituut Lysine compounds and their use in site- and chemoselective modification of peptides and proteins
US9725490B2 (en) * 2011-07-29 2017-08-08 Tokushima University ERAP1-derived peptide and use thereof
WO2015030585A2 (en) * 2013-08-27 2015-03-05 Academisch Ziekenhuis Leiden H.O.D.N. Lumc Methods for detecting post-translationally modified lysines in a polypeptide
CN107058298B (zh) * 2017-06-06 2019-10-08 中国海洋大学 Assisted genome assembly method based on artificial meiosis
CN109524058B (zh) * 2018-11-07 2021-02-26 浙江工业大学 Protein dimer structure prediction method based on differential evolution
US20200158737A1 (en) * 2018-11-21 2020-05-21 Regents Of The University Of Minnesota Methods of measuring ubiquitin-like modifications
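JP7492524B2 (ja) * 2019-02-11 2024-05-29 フラッグシップ・パイオニアリング・イノベーションズ・ブイアイ,エルエルシー Machine learning-assisted polypeptide analysis
CN109785902B (zh) * 2019-02-20 2023-08-29 成都分迪科技有限公司 Method for predicting target proteins for ubiquitination-mediated degradation
CN110349628B (zh) * 2019-06-27 2021-06-15 广东药科大学 Protein phosphorylation site identification method, system, apparatus, and storage medium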
US20210104294A1 (en) * 2019-10-02 2021-04-08 The General Hospital Corporation Method for predicting hla-binding peptides using protein structural features
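CN113571124B (zh) * 2020-04-29 2024-04-23 中国科学院上海药物研究所 Method and apparatus for predicting ligand-protein interactions
CN112151128A (zh) * 2020-10-16 2020-12-29 腾讯科技(深圳)有限公司 Method, apparatus, device, and storage medium for determining interaction information
CN114765063A (zh) * 2021-01-12 2022-07-19 上海交通大学 Protein-nucleic acid binding site prediction method based on graph neural network representation
CN113192559B (zh) * 2021-05-08 2023-09-26 中山大学 Protein-protein interaction site prediction method based on deep graph convolutional network
CN113848259A (zh) * 2021-06-18 2021-12-28 上海交通大学医学院 High-precision mass spectrometry-based method for detecting protein ubiquitin-like modification sites and application thereof
CN113593633B (zh) * 2021-08-02 2023-07-25 中国石油大学(华东) Drug-protein interaction prediction model based on convolutional neural network
CN114333986A (zh) * 2021-09-06 2022-04-12 腾讯科技(深圳)有限公司 Method and apparatus for model training, drug screening, and affinity prediction
CN114420203A (zh) * 2021-12-08 2022-04-29 深圳大学 Method and model for predicting transcription factor-target gene interactions
CN114724636A (zh) * 2022-03-22 2022-07-08 腾讯科技(深圳)有限公司 Method, apparatus, and device for constructing a protein hypergraph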

Also Published As

Publication number Publication date
CN114927165B (zh) 2022-12-02
CN114927165A (zh) 2022-08-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22951657

Country of ref document: EP

Kind code of ref document: A1