WO2024027663A1 - Pre-training method for a recognition model, recognition method, apparatus, medium and device - Google Patents

Pre-training method for a recognition model, recognition method, apparatus, medium and device

Info

Publication number
WO2024027663A1
Authority
WO
WIPO (PCT)
Prior art keywords
protein
sequence
features
recognition model
knowledge
Prior art date
Application number
PCT/CN2023/110347
Other languages
English (en)
French (fr)
Inventor
边成
张志诚
李永会
Original Assignee
抖音视界有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 抖音视界有限公司 filed Critical 抖音视界有限公司
Publication of WO2024027663A1 publication Critical patent/WO2024027663A1/zh

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 - ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 - Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 - ICT programming tools or database systems specially adapted for bioinformatics
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the present disclosure relates to the field of electronic information technology, and specifically, to a pre-training method for a recognition model, a recognition method, an apparatus, a medium and a device.
  • Protein is the most basic component of body cells. The study of proteins helps to understand the nature of living things, thereby promoting the development of biotechnology and medical technology. For example, the structure of a protein can be used to judge its function, which facilitates the research and development of drugs and vaccines. The traditional way is to determine the protein structure in the laboratory through X-ray crystallography and nuclear magnetic resonance, which is time-consuming and labor-intensive. Since protein sequences have a certain degree of similarity with text, and inspired by NLP (Natural Language Processing) technology, known protein sequences can be used to pre-train a recognition model, so that the recognition model can be fine-tuned to complete the identification of proteins. However, protein sequences are usually much longer than texts, and the pre-training process requires a large amount of computing resources, making it difficult to implement in practical applications.
  • the present disclosure provides a pre-training method for a recognition model.
  • the method includes:
  • the pre-training sample set includes multiple pre-training protein sequences
  • the protein knowledge graph includes multiple triples, and each of the triples consists of a protein, a gene ontology, and the relationship between the protein and the gene ontology;
  • For each of the pre-trained protein sequences, perform a masking operation on the pre-trained protein sequence to obtain a mask sequence corresponding to the pre-trained protein sequence;
  • the recognition model is pre-trained according to the decoding result, the pre-trained protein sequence and the protein knowledge map.
  • the pre-trained recognition model can identify proteins after fine-tuning.
  • the present disclosure provides a protein identification method, which method includes:
  • the target sequence is input into a protein identification model to determine identification information of the target protein.
  • the identification information includes at least one of the following: secondary structure, residue contacts, remote homology, stability, and fluorescence of the target protein;
  • the protein recognition model is obtained by fine-tuning the recognition model described in the first aspect of the present disclosure based on a training sample set, and the training sample set includes a plurality of training protein sequences.
  • the present disclosure provides a pre-training device for a recognition model, which device includes:
  • An acquisition module is used to acquire a pre-training sample set and a protein knowledge map.
  • the pre-training sample set includes multiple pre-training protein sequences.
  • the protein knowledge map includes multiple triples, each of which is composed of a protein, a gene ontology, and the relationship between the protein and the gene ontology;
  • a masking module configured to perform a masking operation on each of the pre-trained protein sequences to obtain a masking sequence corresponding to the pre-trained protein sequence
  • the pre-training module is used to use a preset recognition model to perform feature extraction on the masked sequence corresponding to each pre-trained protein sequence to obtain sequence features, and to perform feature extraction on the triples containing the pre-trained protein sequence to obtain knowledge features; to use the recognition model to fuse the sequence features and the knowledge features, and decode according to the fusion result to obtain a decoding result; and to pre-train the recognition model according to the decoding result, the pre-trained protein sequence and the protein knowledge map, where the pre-trained recognition model can identify proteins after fine-tuning.
  • the present disclosure provides a protein recognition device, the device comprising:
  • the acquisition module is used to obtain the target sequence corresponding to the target protein to be identified
  • An identification module for inputting the target sequence into a protein identification model to determine identification information of the target protein.
  • the identification information includes at least one of the following: secondary structure of the target protein, residue contacts, remote homology, stability, and fluorescence;
  • the protein recognition model is obtained by fine-tuning the recognition model described in the first aspect of the present disclosure based on a training sample set, and the training sample set includes a plurality of training protein sequences.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method described in the first aspect of the present disclosure are implemented.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method described in the second aspect of the present disclosure are implemented.
  • an electronic device, including: a storage device on which a computer program is stored; and
  • a processing device configured to execute the computer program in the storage device to implement the steps of the method described in the first aspect of the present disclosure.
  • the present disclosure provides an electronic device, including: a storage device on which a computer program is stored; and
  • a processing device configured to execute the computer program in the storage device to implement the steps of the method described in the second aspect of the present disclosure.
  • the present disclosure first obtains a protein knowledge graph and multiple pre-trained protein sequences.
  • the protein knowledge graph includes multiple triples composed of proteins, gene ontology, and relationships.
  • a masking operation is performed on each pre-trained protein sequence to obtain the corresponding masked sequence.
  • the recognition model is used to extract features from the masked sequence corresponding to each pre-trained protein sequence and the triplet containing the pre-trained protein sequence to obtain sequence features and knowledge features, and then the recognition model is used to fuse the sequence features and knowledge features.
  • the pre-trained recognition model can be fine-tuned to identify proteins.
  • This disclosure fuses sequence features and knowledge features and then decodes them, so that the protein knowledge graph can directly affect the output results of the recognition model.
  • the recognition model can fully learn the information contained in the protein knowledge graph, improving the capability of the recognition model and thereby the accuracy of the recognition model on downstream tasks.
  • Figure 1 is a flow chart of a pre-training method for a recognition model according to an exemplary embodiment
  • Figure 2 is a schematic structural diagram of a recognition model according to an exemplary embodiment
  • Figure 3 is a flow chart of another pre-training method for a recognition model according to an exemplary embodiment
  • Figure 4 is a flow chart of another pre-training method for a recognition model according to an exemplary embodiment
  • Figure 5 is a flow chart of another pre-training method for a recognition model according to an exemplary embodiment
  • Figure 6 is a flow chart of another pre-training method for a recognition model according to an exemplary embodiment
  • Figure 7 is a flow chart of another pre-training method for a recognition model according to an exemplary embodiment
  • Figure 8 is a schematic structural diagram of a decoding layer according to an exemplary embodiment
  • Figure 9 is a flow chart of a protein identification method according to an exemplary embodiment
  • Figure 10 is a schematic structural diagram of a protein recognition model according to an exemplary embodiment
  • Figure 11 is a block diagram of a pre-training device for a recognition model according to an exemplary embodiment
  • Figure 12 is a block diagram of another pre-training device for a recognition model according to an exemplary embodiment
  • Figure 13 is a block diagram of another pre-training device for a recognition model according to an exemplary embodiment
  • Figure 14 is a block diagram of a protein recognition device according to an exemplary embodiment
  • FIG. 15 is a block diagram of an electronic device according to an exemplary embodiment.
  • the term “include” and its variations are open-ended, i.e., “including but not limited to.”
  • the term “based on” means “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; and the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • a prompt message is sent to the user to clearly remind the user that the operation requested will require the acquisition and use of the user's personal information. Therefore, users can autonomously choose whether to provide personal information to software or hardware such as electronic devices, applications, servers or storage media that perform the operations of the technical solution of the present disclosure based on the prompt information.
  • the method of actively requesting and sending prompt information to the user may be, for example, a pop-up window, and the prompt information may be presented in the form of text in the pop-up window.
  • the pop-up window can also contain a selection control for the user to choose "agree” or "disagree” to provide personal information to the electronic device.
  • Figure 1 is a flow chart of a pre-training method for a recognition model according to an exemplary embodiment. As shown in Figure 1, the method includes:
  • Step 101 Obtain a pre-training sample set and a protein knowledge graph.
  • the pre-training sample set includes multiple pre-trained protein sequences.
  • the protein knowledge graph includes multiple triples. Each triple consists of a protein, a gene ontology, and the relationship between the protein and the gene ontology.
  • Step 102 For each pre-trained protein sequence, perform a masking operation on the pre-trained protein sequence to obtain a mask sequence corresponding to the pre-trained protein sequence.
  • the sample input set includes multiple sample inputs
  • the sample output set includes sample output corresponding to each sample input.
  • a pre-training sample set including multiple pre-trained protein sequences can be obtained in advance.
  • the protein sequences included in the Swiss-Prot database can be used as the pre-training sample set.
  • the protein knowledge map includes multiple triples. Each triple includes proteins, gene ontology (English: Gene Ontology, abbreviation: GO), and the relationship between proteins and gene ontology. The triplet can be expressed as (protein, relation, gene ontology).
  • Gene ontology is a controlled vocabulary in a dynamic form, used to explain the role of eukaryotic genes or proteins in cells and in biomedical knowledge. Relationships are text descriptions used to describe how the protein and the gene ontology in a triple are related.
  • the protein knowledge graph contains information that can describe various properties of proteins.
  • a masking operation (English: mask) can be performed on each pre-trained protein sequence in the pre-training sample set to obtain the mask sequence corresponding to the pre-trained protein sequence.
  • The masking operation can be performed at one or more masking positions in each pre-training protein sequence, for example, at the 15th, 173rd, and 210th positions of the pre-training protein sequence.
  • The masking positions can be generated randomly or determined according to a preset algorithm, which is not specifically limited in this disclosure.
  • the masking sequence corresponding to a pre-trained protein sequence and the triplet containing the pre-trained protein sequence can be used as a sample input, and the pre-trained protein sequence can be used as the corresponding sample output, thereby obtaining a sample input set and a sample output set.
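  • As an illustration of how the masked sample inputs can be constructed, the following is a minimal Python sketch; the 15% masking ratio, the "<mask>" token and the helper names are assumptions made here for illustration and are not details specified by the present disclosure.

```python
# Hypothetical sketch of the masking operation on a pre-trained protein sequence.
import random

MASK_TOKEN = "<mask>"

def mask_sequence(sequence, mask_ratio=0.15, seed=None):
    """Mask one or more positions of a protein sequence; return the masked
    sequence (part of the sample input) and the original residues at the
    masked positions (used as pre-training labels)."""
    rng = random.Random(seed)
    tokens = list(sequence)
    num_to_mask = max(1, int(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), num_to_mask)
    labels = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK_TOKEN
    return tokens, labels

# A sample input would pair the masked sequence with the triples containing this
# protein, and the original sequence would serve as the corresponding sample output.
masked_tokens, masked_labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", seed=0)
```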
  • Step 103 Use a preset recognition model to perform feature extraction on the masked sequence corresponding to each pre-trained protein sequence to obtain sequence features, and perform feature extraction on triples containing the pre-trained protein sequence to obtain knowledge features.
  • Step 104 Use the recognition model to fuse the sequence features and the knowledge features, and perform decoding based on the fusion results to obtain the decoding results.
  • Step 105 Pre-train the recognition model based on the decoding results, the pre-trained protein sequence and the protein knowledge map.
  • the pre-trained recognition model can identify proteins after fine-tuning.
  • the sample input set can be used as the input of the preset recognition model, and the sample output set can be used as the output of the recognition model to pre-train the recognition model, so that when the sample input set is input, the output of the recognition model can match the sample output set.
  • the masking sequence corresponding to each pre-trained protein sequence included in the sample input set and the triplet containing the pre-trained protein sequence can be used as the input of the recognition model, and then based on the output of the recognition model and the pre-trained protein sequence and protein knowledge graph to pre-train the recognition model.
  • the recognition model is used to fuse the sequence features and the knowledge features.
  • the sequence features and the knowledge features can be spliced to obtain the fusion result.
  • the attention mechanism can also be used to fuse the sequence features and the knowledge features to obtain the fusion result.
  • The sequence features and the knowledge features can also be weighted and summed according to preset weights to obtain the fusion result, which is not specifically limited in this disclosure.
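  • For illustration only, the concatenation and weighted-sum fusion options mentioned above can be sketched as follows (the attention-based option is sketched later, in the decoding discussion); the tensor shapes and the use of PyTorch are assumptions.

```python
# Hypothetical sketch of two of the fusion options: concatenation and weighted sum.
import torch

def fuse_by_concat(sequence_features, knowledge_features):
    # Concatenate along the feature dimension.
    return torch.cat([sequence_features, knowledge_features], dim=-1)

def fuse_by_weighted_sum(sequence_features, knowledge_features, w_seq=0.5, w_know=0.5):
    # Weighted sum with preset weights; both feature tensors must share a shape.
    return w_seq * sequence_features + w_know * knowledge_features

sequence_features = torch.randn(1, 128, 768)   # (batch, length, dim), assumed shapes
knowledge_features = torch.randn(1, 128, 768)
fused_concat = fuse_by_concat(sequence_features, knowledge_features)     # (1, 128, 1536)
fused_sum = fuse_by_weighted_sum(sequence_features, knowledge_features)  # (1, 128, 768)
```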
  • decoding is performed based on the fusion result to obtain the decoding result.
  • the recognition model is pre-trained based on the decoding results, the pre-trained protein sequence and the protein knowledge map.
  • the loss function of the recognition model can be determined based on the decoding results, the pre-trained protein sequence and the protein knowledge map.
  • the back propagation algorithm can be used to correct the parameters of the neurons in the recognition model.
  • The parameters of the neurons can be, for example, the weights and biases of the neurons. The above steps are repeated until the loss function meets a preset condition, for example, the loss function is less than a preset loss threshold or the loss function converges, so as to achieve the purpose of pre-training the recognition model.
  • the pre-trained recognition model can be fine-tuned according to specific downstream tasks, so that the fine-tuned recognition model can complete the downstream tasks and identify proteins. Downstream tasks may be, for example, identifying the secondary structure of a protein, identifying residue contacts of a protein, or identifying long-range homologies of a protein, etc.
  • the recognition model extracts sequence features and knowledge features respectively, fuses the sequence features and knowledge features, and then decodes them, so that the protein knowledge map can directly affect the output results of the recognition model, that is, the decoding results.
  • While learning the pre-trained protein sequences, the recognition model can also fully learn the information contained in the protein knowledge map, improving the capability of the recognition model and thus the accuracy of the recognition model on downstream tasks.
  • the structure of the recognition model can include: a sequence encoder, a knowledge encoder and a decoder. The input of the sequence encoder and the input of the knowledge encoder serve as the input of the recognition model, the output of the sequence encoder and the output of the knowledge encoder are input into the decoder together, and the output of the decoder serves as the output of the recognition model, as shown in Figure 2.
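  • A minimal PyTorch-style skeleton of the structure in Figure 2 (a sequence encoder and a knowledge encoder whose outputs are fed jointly into a decoder) might look as follows; the layer types, sizes, shared embedding and concatenation-based fusion are assumptions for illustration, not the exact design of the present disclosure.

```python
# Hypothetical skeleton of the recognition model: two encoders and a decoder.
import torch
import torch.nn as nn

class RecognitionModel(nn.Module):
    def __init__(self, vocab_size=30, dim=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # shared embedding, an assumption
        make_layer = lambda: nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.sequence_encoder = nn.TransformerEncoder(make_layer(), n_layers)
        self.knowledge_encoder = nn.TransformerEncoder(make_layer(), n_layers)
        self.decoder = nn.TransformerEncoder(make_layer(), n_layers)  # stands in for the decoding layers
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, masked_seq_ids, knowledge_ids):
        seq_feat = self.sequence_encoder(self.embed(masked_seq_ids))   # sequence features
        know_feat = self.knowledge_encoder(self.embed(knowledge_ids))  # knowledge features
        fused = torch.cat([seq_feat, know_feat], dim=1)                # simple fusion
        decoded = self.decoder(fused)                                  # decoding result
        return self.lm_head(decoded[:, : masked_seq_ids.size(1)])
```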
  • FIG. 3 is a flow chart of another pre-training method for a recognition model according to an exemplary embodiment. As shown in Figure 3, the implementation of step 103 may include:
  • Step 1031 Use a sequence encoder to extract features from the masked sequence corresponding to the pre-trained protein sequence to obtain sequence features.
  • the masked sequence corresponding to the pre-trained protein sequence is input into the sequence encoder so that the sequence encoder can perform feature extraction.
  • the sequence features can be expressed as protein embedding
  • the feature extraction process can also be understood as an encoding process, and the sequence features can also be understood as a vector representation of the pre-trained protein sequence.
  • the output of the sequence encoder can be directly used as sequence features.
  • a masking token (which can be expressed as masked token) corresponding to the pre-trained protein sequence can also be generated based on the masking position where the masking operation is performed.
  • the masking tokens correspond one-to-one to the masking positions and can be understood as a shared, learnable vector; each masking token can also include a position vector that represents the masked location. That is, the masking token is used to represent the information at the masked location.
  • the output of the encoder can be fused with the masking token as a sequence feature.
  • the sequence encoder can be the Encoder in ProtBert or the Encoder in another PPLM (Protein Pre-trained Language Model). This disclosure does not specifically limit this.
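  • The fusion of the sequence encoder output with the shared, learnable masking tokens described above could be sketched as follows; the additive fusion and the dimensions are assumptions for illustration.

```python
# Hypothetical sketch: a shared learnable mask token plus a position vector is
# added to the encoder output at each masked position to form the sequence features.
import torch
import torch.nn as nn

class MaskTokenFusion(nn.Module):
    def __init__(self, dim=256, max_len=1024):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # shared, learnable vector
        self.pos_embed = nn.Embedding(max_len, dim)              # represents the masked location

    def forward(self, encoder_out, mask_positions):
        # encoder_out: (batch, seq_len, dim); mask_positions: (batch, n_masked) long tensor
        token = self.mask_token + self.pos_embed(mask_positions)      # (batch, n_masked, dim)
        batch_idx = torch.arange(encoder_out.size(0)).unsqueeze(-1)   # (batch, 1)
        fused = encoder_out.clone()
        fused[batch_idx, mask_positions] = fused[batch_idx, mask_positions] + token
        return fused  # used as the sequence features
```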
  • Step 1032 Use the knowledge encoder to extract features from the gene ontology and relationships in the triplet containing the pre-trained protein sequence to obtain gene ontology features and relationship features.
  • Step 1033 Determine knowledge features based on gene ontology features and relationship features, and the knowledge features are used to characterize triples containing the pre-trained protein sequence.
  • the gene ontology and the relationship in the triple containing the pre-trained protein sequence can be input into the knowledge encoder respectively, so that the knowledge encoder performs feature extraction and outputs the gene ontology features (which can be expressed as a GO embedding) and the relationship features (which can be expressed as a relation embedding). The feature extraction process can also be understood as an encoding process, and the gene ontology features can be understood as a vector representation of the gene ontology in the triple containing the pre-trained protein sequence.
  • Relational features can be understood as vector representations of relationships in triples containing this pre-trained protein sequence.
  • the knowledge feature of the triplet of the pre-trained protein sequence can be expressed as knowledge embedding.
  • gene ontology features and relationship features can be spliced (English: Concat), and the result of the splicing can be used as knowledge features.
  • the attention mechanism can also be used to fuse gene ontology features and relationship features to obtain knowledge features.
  • the recognition model can include an attention unit, as shown in Figure 2.
  • Gene ontology features and relationship features are input into the attention unit to obtain the knowledge features output by the attention unit.
  • For example, the knowledge features can be obtained as $E_{\text{knowledge}} = \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\big(QK^{\top}/\sqrt{d_k}\big)V$, with $Q = W_Q f_{GO}$, $K = W_K f_R$ and $V = W_V f_R$, where $E_{\text{knowledge}}$ represents the knowledge features, $f_{GO}$ represents the gene ontology features, $f_R$ represents the relationship features, and Attn represents the attention mechanism; $f_{GO}$ serves as the Query (Q) of the attention mechanism, and $f_R$ serves as the Key (K) and the Value (V) of the attention mechanism; $W_Q$, $W_K$ and $W_V$ represent the weight matrices corresponding to the Query, Key and Value, respectively; and $d_k$ represents the length of the Key.
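  • A minimal sketch of this attention-based fusion, with the gene ontology features as the Query and the relationship features as the Key and Value, is given below; the single-head form and the dimensions are assumptions.

```python
# Hypothetical sketch of the attention unit that fuses gene ontology features (Query)
# with relationship features (Key and Value) to produce the knowledge features.
import math
import torch
import torch.nn as nn

class KnowledgeAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)  # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)  # W_V

    def forward(self, f_go, f_r):
        # f_go: gene ontology features, f_r: relationship features, both (batch, length, dim)
        q, k, v = self.w_q(f_go), self.w_k(f_r), self.w_v(f_r)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        return torch.softmax(scores, dim=-1) @ v   # E_knowledge
```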
  • FIG. 4 is a flow chart of another pre-training method for a recognition model according to an exemplary embodiment. As shown in Figure 4, step 101 can be implemented through the following steps:
  • Step 1011 Obtain the pre-training sample set.
  • Step 1012 Align the pre-training sample set with the gene ontology knowledge map to obtain an initial knowledge map.
  • the initial knowledge map includes multiple positive triples, and the relationship between the proteins included in the positive triples and the gene ontology is true.
  • Step 1013 Negative sampling is performed on the initial knowledge graph to obtain multiple negative triples, and the relationship between the proteins and gene ontology included in the negative triples is false.
  • Step 1014 Obtain a protein knowledge map based on multiple negative triples and the initial knowledge map.
  • the protein knowledge map includes multiple triples and an identifier for each triple. The identifier is used to indicate whether the triple belongs to the positive triples or the negative triples.
  • the publicly available Gene Ontology knowledge graph can be aligned with the pre-training sample set to construct an initial knowledge graph.
  • the initial knowledge graph includes multiple positive triples. Each positive triple includes a protein, a gene ontology, and the relationship between the protein and the gene ontology, and the relationship is true; that is to say, the relationship between the protein and the gene ontology represented in the positive triple is consistent with biological properties.
  • each negative triplet includes proteins, gene ontology, and the relationship between proteins and gene ontology.
  • the relationship between the protein and the gene ontology included in a negative triple is false, which means that the relationship between the protein and the gene ontology represented in the negative triple is not consistent with biological properties.
  • a protein knowledge graph can be generated based on multiple negative triples and the initial knowledge graph.
  • the protein knowledge graph includes multiple triples and an identifier for each triple. The identifier is used to indicate whether the triple belongs to the positive triples or the negative triples, that is, the identifier is used to indicate whether the relationship included in the triple conforms to biological properties.
  • For example, the negative triples can be expressed as $T'_{\text{Protein-GO}} = \{(h, r, t') \mid t' \in E'\}$, where $T'_{\text{Protein-GO}}$ represents the set of negative triples, $h$ represents the head of a negative triple (i.e., the protein), $r$ represents the relationship in the negative triple, $t'$ represents the tail of the negative triple (i.e., the gene ontology), $t$ represents the tail of the corresponding positive triple, $E$ represents the set of gene ontologies in the initial knowledge graph, and $E'$ represents a set whose intersection with $E$ is empty.
  • For negative sampling, there are three classes of gene ontologies in total: biological processes, molecular functions, and cellular components. Negative sampling can be performed only from gene ontologies of the same class. For example, if the gene ontology in the positive triple is a biological process, then the gene ontology in the negative triple is also a biological process; if the gene ontology in the positive triple is a molecular function, then the gene ontology in the negative triple is also a molecular function.
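  • A minimal sketch of negative sampling restricted to gene ontologies of the same class is given below; the triple representation and the helper names are assumptions made here for illustration.

```python
# Hypothetical sketch: corrupt the tail of each positive triple with a gene ontology
# drawn from the same class (biological process, molecular function, cellular component).
import random

def negative_sample(positive_triples, go_by_class, go_class_of, n_neg=1, seed=0):
    """positive_triples: list of (protein, relation, go) tuples;
    go_by_class: dict mapping a GO class to the GO terms of that class;
    go_class_of: dict mapping a GO term to its class."""
    rng = random.Random(seed)
    positives = set(positive_triples)
    negatives = []
    for head, rel, tail in positive_triples:
        candidates = [g for g in go_by_class[go_class_of[tail]] if g != tail]
        if not candidates:
            continue
        for _ in range(n_neg):
            corrupted = rng.choice(candidates)
            if (head, rel, corrupted) not in positives:
                negatives.append(((head, rel, corrupted), 0))  # identifier 0: negative triple
    return negatives  # positives would carry identifier 1 in the protein knowledge graph
```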
  • FIG. 5 is a flow chart of another pre-training method for a recognition model according to an exemplary embodiment. As shown in Figure 5, step 105 can be implemented through the following steps:
  • Step 1051 Determine the predicted sequence based on the decoding result, and determine the prediction loss based on the predicted sequence and the pre-trained protein sequence.
  • The prediction loss can be determined based on the predicted sequence and the pre-trained protein sequence, and can be understood as an MLM (Masked Language Model) loss. For example, the prediction loss can be determined as follows:
  • $L_{MLM} = -\mathbb{E}\big[\sum_{i=1}^{M} \log P(x_i)\big]$, where $L_{MLM}$ represents the prediction loss, $\mathbb{E}$ represents the expectation operation, $x_i$ represents the i-th masked element in the pre-trained protein sequence, $M$ represents the number of masked elements in the pre-trained protein sequence, and $P(x_i)$ represents the probability that the i-th masked element in the predicted sequence is predicted to be $x_i$.
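  • A minimal sketch of computing this prediction loss over the masked positions only is given below; the tensor shapes are assumptions.

```python
# Hypothetical sketch of the MLM-style prediction loss: cross-entropy restricted to
# the masked positions, matching -E[ sum_i log P(x_i) ] up to averaging.
import torch
import torch.nn.functional as F

def mlm_loss(logits, target_ids, mask_bool):
    # logits: (batch, seq_len, vocab); target_ids: (batch, seq_len);
    # mask_bool: (batch, seq_len), True where the residue was masked.
    masked_logits = logits[mask_bool]        # (n_masked, vocab)
    masked_targets = target_ids[mask_bool]   # (n_masked,)
    return F.cross_entropy(masked_logits, masked_targets)
```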
  • Step 1052 Determine the predicted recognition result based on the decoding result, and determine the recognition loss based on the predicted recognition result and the identifier of the triple containing the pre-trained protein sequence.
  • For the second task, the decoding result can first be passed through a pooling layer and then input into an MLP, so that the MLP completes a binary classification task; that is, the predicted recognition result is determined based on the output of the MLP.
  • The predicted recognition result can be understood as the recognition model's prediction for the knowledge features, that is, a determination of whether the triple represented by the knowledge features is a positive triple or a negative triple.
  • the recognition loss can therefore be determined based on the predicted recognition results and the identity of the triplet containing the pretrained protein sequence. For example, the recognition loss can be determined according to Equation 4:
  • $L_{PFI} = -\big[\,y\log p + \sum_{i=1}^{N}\big(y_i\log p_i + (1-y_i)\log(1-p_i)\big)\big]$, where $L_{PFI}$ represents the recognition loss, $y$ represents the identifier of the positive triple (which can be 1), $p$ represents the probability that the recognition model recognizes the positive triple as a positive triple, $N$ represents the number of negative triples in the protein knowledge graph, $y_i$ represents the identifier of the i-th negative triple (all 0), and $p_i$ represents the probability that the recognition model recognizes the i-th negative triple as a positive triple.
  • Step 1053 Determine the total loss based on the predicted loss and identified loss.
  • Step 1054 with the goal of reducing the total loss, use the back propagation algorithm to pre-train the recognition model.
  • the total loss can be determined based on the prediction loss and the identification loss.
  • the sum of the prediction loss and the identification loss can be used as the total loss, or the prediction loss and the identification loss can be weighted and summed to obtain the total loss.
  • the back propagation algorithm can be used to pre-train the recognition model.
  • For example, the total loss can be determined as $L_{total} = L_{MLM} + \lambda L_{PFI}$, where $L_{total}$ represents the total loss and $\lambda$ represents the weight corresponding to the recognition loss, which can be set to 1, for example.
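  • A minimal sketch of combining the two losses and performing one pre-training step with back propagation is given below; it assumes the model also returns a triple score produced by the pooling layer and MLP described in step 1052, treats the recognition loss as binary cross-entropy over positive and negative triples, and reuses the mlm_loss helper sketched above.

```python
# Hypothetical sketch of the total loss L_total = L_MLM + lambda * L_PFI and one
# optimization step using back propagation.
import torch
import torch.nn.functional as F

def pfi_loss(triple_logits, triple_labels):
    # triple_logits: (n_triples,) raw scores; triple_labels: 1 for positive, 0 for negative.
    return F.binary_cross_entropy_with_logits(triple_logits, triple_labels.float())

def pretrain_step(model, optimizer, batch, lambda_pfi=1.0):
    # The model is assumed to return masked-token logits and triple scores.
    logits, triple_logits = model(batch["masked_ids"], batch["knowledge_ids"])
    l_mlm = mlm_loss(logits, batch["target_ids"], batch["mask_bool"])   # prediction loss
    l_pfi = pfi_loss(triple_logits, batch["triple_labels"])             # recognition loss
    total = l_mlm + lambda_pfi * l_pfi                                   # total loss
    optimizer.zero_grad()
    total.backward()   # back propagation to correct the weights and biases
    optimizer.step()
    return total.item()
```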
  • FIG. 6 is a flow chart of another pre-training method for a recognition model according to an exemplary embodiment. As shown in Figure 6, step 104 may include:
  • Step 1041 Splice sequence features and knowledge features to obtain comprehensive features.
  • Step 1042 Decode the comprehensive features to obtain the decoding result.
  • the decoding process can first concatenate sequence features and knowledge features (i.e., Concat) to obtain comprehensive features, and then input the comprehensive features into the decoder (English: Decoder), and the decoder decodes the comprehensive features to obtain the decoding result.
  • the decoder can be, for example, the Decoder in BERT, or the Decoder in another PPLM, which is not specifically limited in this disclosure.
  • FIG. 7 is a flow chart of another pre-training method for a recognition model according to an exemplary embodiment. As shown in Figure 7, step 104 may include:
  • Step 1043 Use the attention mechanism to fuse the information in the knowledge features that matches the sequence features with the sequence features to obtain cross-modal fusion features.
  • Step 1044 Use the self-attention mechanism to decode the cross-modal fusion features and obtain the decoding result.
  • the decoder includes multiple decoding layers connected in sequence (for example, 3 decoding layers).
  • Each decoding layer includes a multi-head cross-modal attention module, a multi-head self-attention module and an MLP, as shown in Figure 8 (only one decoding layer is shown in the figure for illustration; the multiple decoding layers are not shown).
  • the knowledge features and sequence features can be passed through the LN (English: Layer Normalization) layer respectively and then input into the multi-head cross-modal attention module.
  • the multi-head cross-modal attention module can fuse the information in the knowledge features that matches the sequence features with the sequence features to obtain cross-modal fusion features. After that, a residual unit can be set up, and the cross-modal fusion features and the sequence features can be input into the multi-head self-attention module and the MLP for decoding to obtain the decoding result.
  • The implementation of the multi-head cross-modal attention module can be expressed by Equation 6: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\big(QK^{\top}/\sqrt{d_k}\big)V$, with $Q = W_Q E_{\text{sequence}}$, $K = W_K E_{\text{knowledge}}$ and $V = W_V E_{\text{knowledge}}$, where the two inputs of the first decoding layer are $E_{\text{sequence}}$ (the sequence features) and $E_{\text{knowledge}}$ (the knowledge features), Attn represents the attention mechanism, $E_{\text{sequence}}$ serves as the Query (Q) of the attention mechanism, $E_{\text{knowledge}}$ serves as the Key (K) and the Value (V) of the attention mechanism, $W_Q$ represents the weight matrix corresponding to the Query, $W_K$ represents the weight matrix corresponding to the Key, $W_V$ represents the weight matrix corresponding to the Value, and $d_k$ represents the length of the Key.
  • the multi-head cross-modal attention module can filter out the information in the knowledge features that matches the sequence features, which can reduce the noise in the matching process of the two modalities (the protein sequence modality, and the gene ontology and relation text modality). At the same time, the two modalities can be aligned to improve the quality of the cross-modal fusion features, thereby improving the capability of the recognition model.
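  • A minimal sketch of one such decoding layer (layer normalization, multi-head cross-modal attention with the sequence features as the Query and the knowledge features as the Key and Value, residual connections, multi-head self-attention and an MLP) is given below; the exact placement of normalizations and residuals is an assumption.

```python
# Hypothetical sketch of a decoding layer combining cross-modal and self-attention.
import torch.nn as nn

class DecodingLayer(nn.Module):
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.ln_seq = nn.LayerNorm(dim)
        self.ln_know = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, e_sequence, e_knowledge):
        q, kv = self.ln_seq(e_sequence), self.ln_know(e_knowledge)
        cross, _ = self.cross_attn(q, kv, kv)   # cross-modal fusion features
        x = e_sequence + cross                   # residual over the sequence features
        sa, _ = self.self_attn(x, x, x)
        x = x + sa
        return x + self.mlp(x)                   # fed to the next decoding layer
```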
  • this disclosure first obtains a protein knowledge graph and multiple pre-trained protein sequences.
  • the protein knowledge graph includes multiple triples composed of proteins, gene ontologies, and relationships. Then, a masking operation is performed on each pre-trained protein sequence to obtain the corresponding masked sequence. The recognition model is then used to extract features from the masked sequence corresponding to each pre-trained protein sequence and from the triples containing the pre-trained protein sequence to obtain sequence features and knowledge features, after which the recognition model fuses the sequence features and the knowledge features and performs decoding. Finally, the preset recognition model is pre-trained according to the decoding results, the pre-trained protein sequences and the protein knowledge graph. The pre-trained recognition model can be fine-tuned to identify proteins.
  • This disclosure fuses sequence features and knowledge features and then decodes them, so that the protein knowledge graph can directly affect the output results of the recognition model.
  • the recognition model can fully learn the information contained in the protein knowledge graph, improving the capability of the recognition model and thereby the accuracy of the recognition model on downstream tasks.
  • Figure 9 is a flow chart of a protein identification method according to an exemplary embodiment. As shown in Figure 9, the method includes:
  • Step 201 Obtain the target sequence corresponding to the target protein to be identified.
  • Step 202 Input the target sequence into the protein identification model to determine the identification information of the target protein.
  • the identification information includes at least one of the following: secondary structure, residue contacts, remote homology, stability, and fluorescence of the target protein.
  • the protein recognition model is obtained by fine-tuning the recognition model obtained by the pre-training method of the above recognition model based on the training sample set.
  • the training sample set includes multiple training protein sequences.
  • the above-mentioned pre-training method of the recognition model can be used to complete the pre-training of the recognition model, and then a preset training sample set can be used to fine-tune the recognition model to obtain a protein recognition model, so that the protein recognition model can complete at least one of the following tasks: identifying the secondary structure, residue contacts, long-range homology, stability, and fluorescence of a protein.
  • the training sample set may be the same as the pre-training sample set or may be different, and this disclosure does not specifically limit this. Because the recognition model has fully learned the information contained in the protein knowledge map, the protein recognition model can be fine-tuned quickly and accurately, which shortens the fine-tuning time and reduces the computing resources consumed by fine-tuning.
  • the protein recognition model may use only the sequence encoder of the recognition model, with an MLP connected after the sequence encoder, as shown in Figure 10.
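  • A minimal sketch of the fine-tuning architecture of Figure 10 (the pre-trained sequence encoder followed by an MLP head) is given below; the number of output classes and the layer sizes are assumptions made for illustration.

```python
# Hypothetical sketch: reuse the pre-trained sequence encoder and attach an MLP head
# for a downstream task (e.g., per-residue secondary structure classification).
import torch.nn as nn

class ProteinRecognitionModel(nn.Module):
    def __init__(self, pretrained_sequence_encoder, dim=256, num_classes=3):
        super().__init__()
        self.sequence_encoder = pretrained_sequence_encoder   # taken from the pre-trained model
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))

    def forward(self, embedded_target_sequence):
        features = self.sequence_encoder(embedded_target_sequence)
        return self.mlp(features)   # identification information for the target protein
```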
  • the fine-tuning process can be to use the training sample set as the input of the protein recognition model and to use the output set corresponding to the training sample set as the output of the protein recognition model, so that when the training sample set is input, the output of the protein recognition model can match the corresponding output set.
  • the output set corresponding to the training sample set can be determined according to the specific task. For example, the task of the protein recognition model is to identify the secondary structure of the protein, then the corresponding output set can be the secondary structure corresponding to each training protein sequence in the training sample set. .
  • the loss function can be determined based on the output of the protein recognition model and the corresponding output set, and with the goal of reducing the loss function, the back propagation algorithm can be used to correct the parameters of the neurons in the protein recognition model. Repeat the above steps until the loss function meets the preset conditions, for example, the loss function is less than the preset loss threshold, or the loss function converges, so as to achieve the purpose of fine-tuning the protein identification model.
  • the target sequence corresponding to the target protein to be identified can be obtained, and then the target sequence can be input into the protein recognition model.
  • the protein recognition model can identify the target sequence to obtain the identification information of the target protein.
  • the information includes at least one of the following: secondary structure, residue contacts, remote homology, stability, and fluorescence of the target protein.
  • the protein recognition model used in this disclosure is obtained by fine-tuning the recognition model.
  • Because the recognition model has fully learned the information contained in the protein knowledge map, fine-tuning the recognition model requires little computation and is efficient, thereby improving the accuracy of the protein recognition model.
  • Figure 11 is a block diagram of a pre-training device for a recognition model according to an exemplary embodiment. As shown in Figure 11, the device 300 includes:
  • the acquisition module 301 is used to obtain a pre-training sample set and a protein knowledge map.
  • the pre-training sample set includes multiple pre-training protein sequences.
  • the protein knowledge map includes multiple triples, and each triple consists of a protein, a gene ontology, and the relationship between the protein and the gene ontology.
  • the masking module 302 is configured to perform a masking operation on each pre-trained protein sequence to obtain a masking sequence corresponding to the pre-trained protein sequence.
  • the pre-training module 303 is used to perform feature extraction on the masked sequence corresponding to each pre-trained protein sequence using a preset recognition model to obtain sequence features, and to perform feature extraction on the triples containing the pre-trained protein sequence to obtain knowledge features.
  • the recognition model is used to fuse sequence features and knowledge features, and decoding is performed based on the fusion results to obtain the decoding results. Based on the decoding results, the pre-trained protein sequence and the protein knowledge map, the recognition model is pre-trained.
  • the pre-trained recognition model can identify proteins after fine-tuning.
  • Figure 12 is a block diagram of another pre-training device for a recognition model according to an exemplary embodiment.
  • the recognition model includes: a sequence encoder and a knowledge encoder.
  • the pre-training module 303 may include:
  • the first extraction sub-module 3031 is used to use a sequence encoder to extract features from the masked sequence corresponding to the pre-trained protein sequence to obtain sequence features.
  • the second extraction sub-module 3032 is used to use the knowledge encoder to extract features of the gene ontology and relationships in the triplet containing the pre-trained protein sequence, respectively, to obtain gene ontology features and relationship features.
  • the determination sub-module 3033 is used to determine knowledge features based on gene ontology features and relationship features, and the knowledge features are used to characterize triples containing the pre-trained protein sequence.
  • the determination sub-module 3033 can be used to:
  • the attention mechanism is used to fuse gene ontology features and relationship features to obtain knowledge features.
  • Figure 13 is a block diagram of another pre-training device for a recognition model according to an exemplary embodiment.
  • the acquisition module 301 may include:
  • Acquisition sub-module 3011 is used to obtain the pre-training sample set.
  • Alignment submodule 3012 is used to align the pre-training sample set with the gene ontology knowledge map to obtain an initial knowledge map.
  • the initial knowledge map includes multiple positive triples, and the relationship between the proteins included in the positive triples and the gene ontology is true.
  • the negative sampling sub-module 3013 is used to perform negative sampling on the initial knowledge graph to obtain multiple negative triples.
  • the relationship between the proteins included in the negative triples and the gene ontology is false.
  • the processing submodule 3014 is used to obtain a protein knowledge graph based on multiple negative triples and the initial knowledge graph.
  • the protein knowledge graph includes multiple triples and an identifier of each triple, and the identifier is used to indicate whether the triple belongs to the positive triples or the negative triples.
  • the pre-training module 303 can be used to perform the following steps:
  • Step 1) Determine the predicted sequence based on the decoding result, and determine the prediction loss based on the predicted sequence and the pre-trained protein sequence.
  • Step 2 Determine the predicted recognition result based on the decoding result, and determine the recognition loss based on the predicted recognition result and the identity of the triplet containing the pre-trained protein sequence.
  • Step 3 Determine the total loss based on predicted losses and identified losses.
  • Step 4 With the goal of reducing the total loss, use the back propagation algorithm to pre-train the recognition model.
  • the pre-training module 303 can be used to perform the following steps:
  • Step 5 Splice sequence features and knowledge features to obtain comprehensive features.
  • Step 6 Decode the comprehensive features to obtain the decoding result.
  • the pre-training module 303 can be used to perform the following steps:
  • Step 7) Use the attention mechanism to fuse the information in the knowledge features that matches the sequence features with the sequence features to obtain cross-modal fusion features.
  • Step 8) Use the self-attention mechanism to decode the cross-modal fusion features and obtain the decoding results.
  • this disclosure first obtains a protein knowledge graph and multiple pre-trained protein sequences.
  • the protein knowledge graph includes multiple triples composed of proteins, gene ontologies, and relationships. Then, a masking operation is performed on each pre-trained protein sequence to obtain the corresponding masked sequence. The recognition model is then used to extract features from the masked sequence corresponding to each pre-trained protein sequence and from the triples containing the pre-trained protein sequence to obtain sequence features and knowledge features, after which the recognition model fuses the sequence features and the knowledge features and performs decoding. Finally, the preset recognition model is pre-trained based on the decoding results, the pre-trained protein sequences and the protein knowledge map. The pre-trained recognition model can be fine-tuned to identify proteins.
  • This disclosure fuses sequence features and knowledge features and then decodes them, so that the protein knowledge graph can directly affect the output of the recognition model.
  • the recognition model can fully learn the information contained in the protein knowledge map, improving the ability of the recognition model, thereby improving the accuracy of the recognition model for downstream tasks.
  • Figure 14 is a block diagram of a protein recognition device according to an exemplary embodiment. As shown in Figure 14, the device 400 includes:
  • the acquisition module 401 is used to acquire the target sequence corresponding to the target protein to be identified.
  • the identification module 402 is used to input the target sequence into the protein identification model to determine the identification information of the target protein.
  • the identification information includes at least one of the following: secondary structure of the target protein, residue contacts, remote homology, stability, and fluorescence.
  • the protein recognition model is obtained by fine-tuning the recognition model obtained by the pre-training method of the above recognition model based on the training sample set.
  • the training sample set includes multiple training protein sequences.
  • the protein recognition model used in this disclosure is obtained by fine-tuning the recognition model.
  • Because the recognition model has fully learned the information contained in the protein knowledge map, fine-tuning the recognition model requires little computation and is efficient, thereby improving the accuracy of the protein recognition model.
  • FIG. 15 shows a schematic structural diagram of an electronic device 600 suitable for implementing embodiments of the present disclosure (the electronic device may be, for example, the execution subject in the above embodiments, such as a terminal device or a server).
  • Terminal devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (tablets), PMPs (Portable Multimedia Players) and vehicle-mounted terminals (such as vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 15 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 600 may include a processing device (such as a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600.
  • the processing device 601, ROM 602 and RAM 603 are connected to each other via a bus 604.
  • An input/output (I/O) interface 605 is also connected to bus 604.
  • The following devices can be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speakers, vibrators, etc.; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609.
  • Communication device 609 may allow electronic device 600 to communicate wirelessly or wiredly with other devices to exchange data.
  • Although FIG. 15 illustrates the electronic device 600 with various means, it should be understood that it is not required to implement or provide all of the illustrated means; more or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication device 609, or from storage device 608, or from ROM 602.
  • When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device .
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
  • terminal devices and servers can communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium.
  • Examples of communication networks include local area networks (“LAN”), wide area networks (“WAN”), internetworks (e.g., the Internet) and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
  • the computer-readable medium carries one or more programs.
  • When the one or more programs are executed by the electronic device, the electronic device obtains a pre-training sample set and a protein knowledge graph, where the pre-training sample set includes a plurality of pre-trained protein sequences and the protein knowledge graph includes a plurality of triples, each of the triples consisting of a protein, a gene ontology, and the relationship between the protein and the gene ontology; for each of the pre-trained protein sequences, performs a masking operation on the pre-trained protein sequence to obtain the masked sequence corresponding to the pre-trained protein sequence; uses a preset recognition model to perform feature extraction on the masked sequence corresponding to each pre-trained protein sequence to obtain sequence features, and to perform feature extraction on the triples containing the pre-trained protein sequence to obtain knowledge features; uses the recognition model to fuse the sequence features and the knowledge features and performs decoding according to the fusion result to obtain a decoding result; and pre-trains the recognition model according to the decoding result, the pre-trained protein sequence and the protein knowledge graph, the pre-trained recognition model being capable of identifying proteins after fine-tuning.
  • the computer-readable medium carries one or more programs.
  • the electronic device obtains the target sequence corresponding to the target protein to be identified, and inputs the target sequence into a protein recognition model to determine identification information of the target protein, the identification information including at least one of the following: secondary structure, residue contacts, long-range homology, stability, and fluorescence of the target protein;
  • the protein recognition model is obtained by fine-tuning the recognition model trained by the pre-training method of the above recognition model according to a training sample set, and the training sample set includes a plurality of training protein sequences.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as “C” or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each box in the flowchart or block diagram may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or operations, or can be implemented by a combination of special purpose hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure can be implemented in software or hardware.
  • the name of the module does not constitute a limitation on the module itself under certain circumstances.
  • the acquisition module can also be described as "a module for acquiring pre-training sample sets and protein knowledge maps.”
  • For example, without limitation, exemplary types of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • Example 1 provides a pre-training method for a recognition model, including: obtaining a pre-training sample set and a protein knowledge graph, the pre-training sample set including a plurality of pre-trained protein sequences, and the protein knowledge graph including a plurality of triples, each triple consisting of a protein, a gene ontology, and the relationship between the protein and the gene ontology; for each pre-trained protein sequence, performing a masking operation on the pre-trained protein sequence to obtain a masked sequence corresponding to the pre-trained protein sequence; using a preset recognition model to perform feature extraction on the masked sequence corresponding to each pre-trained protein sequence to obtain sequence features, and performing feature extraction on the triple containing the pre-trained protein sequence to obtain knowledge features; using the recognition model to fuse the sequence features and the knowledge features, and decoding according to the fusion result to obtain a decoding result; and pre-training the recognition model according to the decoding result, the pre-trained protein sequence, and the protein knowledge graph; after fine-tuning, the pre-trained recognition model can recognize proteins.
  • Example 2 provides the method of Example 1, wherein the recognition model includes a sequence encoder and a knowledge encoder, and using the preset recognition model to perform feature extraction on the masked sequence corresponding to each pre-trained protein sequence to obtain sequence features, and performing feature extraction on the triple containing the pre-trained protein sequence to obtain knowledge features, includes: using the sequence encoder to perform feature extraction on the masked sequence corresponding to the pre-trained protein sequence to obtain the sequence features; using the knowledge encoder to respectively perform feature extraction on the gene ontology and the relationship in the triple containing the pre-trained protein sequence to obtain gene ontology features and relationship features; and determining the knowledge features according to the gene ontology features and the relationship features, the knowledge features being used to characterize the triple containing the pre-trained protein sequence.
  • Example 3 provides the method of Example 2, wherein determining the knowledge features according to the gene ontology features and the relationship features includes: using an attention mechanism to fuse the gene ontology features and the relationship features to obtain the knowledge features.
  • Example 4 provides the method of Example 1, wherein obtaining the pre-training sample set and the protein knowledge graph includes: obtaining the pre-training sample set; aligning the pre-training sample set with a Gene Ontology knowledge graph to obtain an initial knowledge graph, the initial knowledge graph including a plurality of positive triples in which the relationship between the protein and the gene ontology is true; performing negative sampling on the initial knowledge graph to obtain a plurality of negative triples in which the relationship between the protein and the gene ontology is false; and obtaining the protein knowledge graph according to the plurality of negative triples and the initial knowledge graph, the protein knowledge graph including a plurality of the triples and an identifier of each triple, the identifier being used to indicate whether the triple belongs to the positive triples or the negative triples.
  • Example 5 provides the method of Example 4, wherein pre-training the recognition model according to the decoding result, the pre-trained protein sequence, and the protein knowledge graph includes: determining a predicted sequence according to the decoding result, and determining a prediction loss according to the predicted sequence and the pre-trained protein sequence; determining a predicted recognition result according to the decoding result, and determining a recognition loss according to the predicted recognition result and the identifier of the triple containing the pre-trained protein sequence; determining a total loss according to the prediction loss and the recognition loss; and pre-training the recognition model using a back-propagation algorithm with the goal of reducing the total loss.
  • Example 6 provides the method of Example 1, wherein using the recognition model to fuse the sequence features and the knowledge features, and decoding according to the fusion result to obtain the decoding result, includes: splicing the sequence features and the knowledge features to obtain comprehensive features; and decoding the comprehensive features to obtain the decoding result.
  • Example 7 provides the method of Example 1, wherein using the recognition model to fuse the sequence features and the knowledge features, and decoding according to the fusion result to obtain the decoding result, includes: using an attention mechanism to fuse the information in the knowledge features that matches the sequence features with the sequence features to obtain cross-modal fusion features; and using a self-attention mechanism to decode the cross-modal fusion features to obtain the decoding result.
  • Example 8 provides a protein identification method, including: obtaining a target sequence corresponding to a target protein to be identified; and inputting the target sequence into a protein recognition model to determine identification information of the target protein, the identification information including at least one of the following: the secondary structure, residue contacts, long-range homology, stability, and fluorescence of the target protein; the protein recognition model is obtained by fine-tuning, according to a training sample set, the recognition model described in any one of Examples 1-7, and the training sample set includes a plurality of training protein sequences.
  • Example 9 provides a pre-training device for a recognition model, including: an acquisition module for acquiring a pre-training sample set and a protein knowledge graph, the pre-training sample set including a plurality of pre-trained protein sequences, and the protein knowledge graph including a plurality of triples, each triple consisting of a protein, a gene ontology, and the relationship between the protein and the gene ontology; a masking module for performing, for each pre-trained protein sequence, a masking operation on the pre-trained protein sequence to obtain a masked sequence corresponding to the pre-trained protein sequence; and a pre-training module for using a preset recognition model to perform feature extraction on the masked sequence corresponding to each pre-trained protein sequence to obtain sequence features, and performing feature extraction on the triple containing the pre-trained protein sequence to obtain knowledge features; using the recognition model to fuse the sequence features and the knowledge features, and decoding according to the fusion result to obtain a decoding result; and pre-training the recognition model according to the decoding result, the pre-trained protein sequence, and the protein knowledge graph; after fine-tuning, the pre-trained recognition model can recognize proteins.
  • Example 10 provides a protein identification device, including: an acquisition module for obtaining a target sequence corresponding to a target protein to be identified; and an identification module for inputting the target sequence into a protein recognition model to determine identification information of the target protein, the identification information including at least one of the following: the secondary structure, residue contacts, long-range homology, stability, and fluorescence of the target protein; the protein recognition model is obtained by fine-tuning, according to a training sample set, the recognition model described in any one of Examples 1-7, and the training sample set includes a plurality of training protein sequences.
  • Example 11 provides a computer-readable medium having a computer program stored thereon, which, when executed by a processing device, implements the steps of the method described in any one of Examples 1 to 8.
  • Example 12 provides an electronic device, including: a storage device having a computer program stored thereon; and a processing device configured to execute the computer program in the storage device to implement the steps of the method described in any one of Examples 1 to 8.

Landscapes

  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Genetics & Genomics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a pre-training method for a recognition model, a recognition method, an apparatus, a medium, and a device, and relates to the field of electronic information technology. The method includes: obtaining a pre-training sample set and a protein knowledge graph; for each pre-trained protein sequence, performing a masking operation on the pre-trained protein sequence to obtain a masked sequence corresponding to the pre-trained protein sequence; using a preset recognition model to perform feature extraction on the masked sequence corresponding to each pre-trained protein sequence to obtain sequence features, and performing feature extraction on the triple containing the pre-trained protein sequence to obtain knowledge features; using the recognition model to fuse the sequence features and the knowledge features, and decoding according to the fusion result to obtain a decoding result; and pre-training the recognition model according to the decoding result, the pre-trained protein sequence, and the protein knowledge graph. After fine-tuning, the pre-trained recognition model can recognize proteins.

Description

识别模型的预训练方法、识别方法、装置、介质和设备
本申请要求于2022年8月5日提交中国专利局、申请号为202210947783.9、发明名称为“识别模型的预训练方法、识别方法、装置、介质和设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本公开涉及电子信息技术领域,具体地,涉及一种识别模型的预训练方法、识别方法、装置、介质和设备。
背景技术
蛋白质是机体细胞最基本的组成部分,对蛋白质的研究有助于理解生物的本质,从而推动生物技术、医疗技术的发展,例如蛋白质的结构能够用于判断蛋白质的功能,有助于药物、疫苗的研发。传统的方式,是在实验室中通过X射线结晶学和核磁共振等方式计算出蛋白质的结构,耗时耗力。由于蛋白质序列与文本具有一定的相似度,受NLP(英文:Natural Language Processing,中文:自然语言处理)技术的启发,可以利用已知的蛋白质序列预训练识别模型,使得识别模型能够被微调(英文:Fine-tune)来完成对蛋白质的识别。然而,蛋白质序列通常比文本要长很多,预训练过程需要大量的计算资源,很难进行实际应用。
发明内容
提供该发明内容部分以便以简要的形式介绍构思,这些构思将在后面的具体实施方式部分被详细描述。该发明内容部分并不旨在标识要求保护的技术方案的关键特征或必要特征,也不旨在用于限制所要求的保护的技术方案的范围。
第一方面,本公开提供一种识别模型的预训练方法,所述方法 包括:
获取预训练样本集和蛋白质知识图谱,所述预训练样本集中包括多个预训练蛋白质序列,所述蛋白质知识图谱包括多个三元组,每个所述三元组由蛋白质、基因本体,以及蛋白质与基因本体之间的关系组成;
针对每个所述预训练蛋白质序列,对该预训练蛋白质序列进行掩蔽操作,得到该预训练蛋白质序列对应的掩蔽序列;
利用预设的识别模型对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征,并对包含该预训练蛋白质序列的三元组进行特征提取,得到知识特征;
利用所述识别模型对所述序列特征和所述知识特征进行融合,并根据融合结果进行解码,以得到解码结果;
根据所述解码结果、该预训练蛋白质序列和所述蛋白质知识图谱,对所述识别模型进行预训练,预训练后的所述识别模型,经过微调能够对蛋白质进行识别。
第二方面,本公开提供一种蛋白质识别方法,所述方法包括:
获取待识别的目标蛋白质对应的目标序列;
将所述目标序列输入蛋白质识别模型,以确定所述目标蛋白质的识别信息,所述识别信息包括以下至少一种:所述目标蛋白质的二级结构、残基接触、远程同源性、稳定性,以及荧光性;
所述蛋白质识别模型为根据训练样本集对本公开第一方面所述的识别模型进行微调得到的,所述训练样本集包括多个训练蛋白质序列。
第三方面,本公开提供一种蛋识别模型的预训练装置,所述装置包括:
获取模块,用于获取预训练样本集和蛋白质知识图谱,所述预训练样本集中包括多个预训练蛋白质序列,所述蛋白质知识图谱包括多个三元组,每个所述三元组由蛋白质、基因本体,以及蛋白质与基因本体之间的关系组成;
掩蔽模块,用于针对每个所述预训练蛋白质序列,对该预训练蛋白质序列进行掩蔽操作,得到该预训练蛋白质序列对应的掩蔽序列;
预训练模块,用于利用预设的识别模型对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征,并对包含该预训练蛋白质序列的三元组进行特征提取,得到知识特征;利用所述识别模型对所述序列特征和所述知识特征进行融合,并根据融合结果进行解码,以得到解码结果;根据所述解码结果、该预训练蛋白质序列和所述蛋白质知识图谱,对所述识别模型进行预训练,预训练后的所述识别模型,经过微调能够对蛋白质进行识别。
第四方面,本公开提供一种蛋白质识别装置,所述装置包括:
获取模块,用于获取待识别的目标蛋白质对应的目标序列;
识别模块,用于将所述目标序列输入蛋白质识别模型,以确定所述目标蛋白质的识别信息,所述识别信息包括以下至少一种:所述目标蛋白质的二级结构、残基接触、远程同源性、稳定性,以及荧光性;
所述蛋白质识别模型为根据训练样本集对本公开第一方面所述的识别模型进行微调得到的,所述训练样本集包括多个训练蛋白质序列。
第五方面,本公开提供一种计算机可读介质,其上存储有计算机程序,该程序被处理装置执行时实现本公开第一方面所述方法的步骤。
第六方面,本公开提供一种计算机可读介质,其上存储有计算机程序,该程序被处理装置执行时实现本公开第二方面所述方法的步骤。
第七方面,本公开提供一种电子设备,包括:
存储装置,其上存储有计算机程序;
处理装置,用于执行所述存储装置中的所述计算机程序,以实现本公开第一方面所述方法的步骤。
第八方面,本公开提供一种电子设备,包括:
存储装置,其上存储有计算机程序;
处理装置,用于执行所述存储装置中的所述计算机程序,以实现本公开第二方面所述方法的步骤。
通过上述技术方案,本公开首先获取蛋白质知识图谱和多个预训练蛋白质序列,蛋白质知识图谱包括多个由蛋白质、基因本体、关系组成的三元组。之后对每个预训练蛋白质序列进行掩蔽操作,得到对应的掩蔽序列。然后利用识别模型分别对每个预训练蛋白质序列对应的掩蔽序列和包含该预训练蛋白质序列的三元组进行特征提取,得到序列特征和知识特征,再利用识别模型对序列特征和知识特征进行融合并进行解码,最后根据解码结果、该预训练蛋白质序列和蛋白质知识图谱对预设的识别模型进行预训练。预训练后的识别模型,经过微调能够对蛋白质进行识别。本公开将序列特征和知识特征进行融合再解码,使得蛋白质知识图谱能够直接影响到识别模型的输出结果,这样在预训练的过程中,识别模型能够充分学习蛋白质知识图谱所包含的信息,提升了识别模型的能力,从而提高识别模型用于下游任务的准确度。
本公开的其他特征和优点将在随后的具体实施方式部分予以详细说明。
附图说明
结合附图并参考以下具体实施方式,本公开各实施例的上述和其他特征、优点及方面将变得更加明显。贯穿附图中,相同或相似的附图标记表示相同或相似的元素。应当理解附图是示意性的,原件和元素不一定按照比例绘制。在附图中:
图1是根据一示例性实施例示出的一种识别模型的预训练方法的流程图;
图2是根据一示例性实施例示出的一种识别模型的结构示意图;
图3是根据一示例性实施例示出的另一种识别模型的预训练方 法的流程图;
图4是根据一示例性实施例示出的另一种识别模型的预训练方法的流程图;
图5是根据一示例性实施例示出的另一种识别模型的预训练方法的流程图;
图6是根据一示例性实施例示出的另一种识别模型的预训练方法的流程图;
图7是根据一示例性实施例示出的另一种识别模型的预训练方法的流程图;
图8是根据一示例性实施例示出的一种解码层的结构示意图;
图9是根据一示例性实施例示出的一种蛋白质识别方法的流程图;
图10是根据一示例性实施例示出的一种蛋白质识别模型的结构示意图;
图11是根据一示例性实施例示出的一种识别模型的预训练装置的框图;
图12是根据一示例性实施例示出的另一种识别模型的预训练装置的框图;
图13是根据一示例性实施例示出的另一种识别模型的预训练装置的框图;
图14是根据一示例性实施例示出的一种蛋白质识别装置的框图;
图15是根据一示例性实施例示出的一种电子设备的框图。
具体实施方式
下面将参照附图更详细地描述本公开的实施例。虽然附图中显示了本公开的某些实施例,然而应当理解的是,本公开可以通过各种形式来实现,而且不应该被解释为限于这里阐述的实施例,相反提供这些实施例是为了更加透彻和完整地理解本公开。应当理解的 是,本公开的附图及实施例仅用于示例性作用,并非用于限制本公开的保护范围。
应当理解,本公开的方法实施方式中记载的各个步骤可以按照不同的顺序执行,和/或并行执行。此外,方法实施方式可以包括附加的步骤和/或省略执行示出的步骤。本公开的范围在此方面不受限制。
本文使用的术语“包括”及其变形是开放性包括,即“包括但不限于”。术语“基于”是“至少部分地基于”。术语“一个实施例”表示“至少一个实施例”;术语“另一实施例”表示“至少一个另外的实施例”;术语“一些实施例”表示“至少一些实施例”。其他术语的相关定义将在下文描述中给出。
需要注意,本公开中提及的“第一”、“第二”等概念仅用于对不同的装置、模块或单元进行区分,并非用于限定这些装置、模块或单元所执行的功能的顺序或者相互依存关系。
需要注意,本公开中提及的“一个”、“多个”的修饰是示意性而非限制性的,本领域技术人员应当理解,除非在上下文另有明确指出,否则应该理解为“一个或多个”。
本公开实施方式中的多个装置之间所交互的消息或者信息的名称仅用于说明性的目的,而并不是用于对这些消息或信息的范围进行限制。
可以理解的是,在使用本公开各实施例公开的技术方案之前,均应当依据相关法律法规通过恰当的方式对本公开所涉及个人信息的类型、使用范围、使用场景等告知用户并获得用户的授权。
例如,在响应于接收到用户的主动请求时,向用户发送提示信息,以明确地提示用户,其请求执行的操作将需要获取和使用到用户的个人信息。从而,使得用户可以根据提示信息来自主地选择是否向执行本公开技术方案的操作的电子设备、应用程序、服务器或存储介质等软件或硬件提供个人信息。
作为一种可选的但非限定性的实现方式,响应于接收到用户的 主动请求,向用户发送提示信息的方式例如可以是弹窗的方式,弹窗中可以以文字的方式呈现提示信息。此外,弹窗中还可以承载供用户选择“同意”或者“不同意”向电子设备提供个人信息的选择控件。
可以理解的是,上述通知和获取用户授权过程仅是示意性的,不对本公开的实现方式构成限定,其它满足相关法律法规的方式也可应用于本公开的实现方式中。
同时,可以理解的是,本技术方案所涉及的数据(包括但不限于数据本身、数据的获取或使用)应当遵循相应法律法规及相关规定的要求。
图1是根据一示例性实施例示出的一种识别模型的预训练方法的流程图,图1所示,该方法包括:
步骤101,获取预训练样本集和蛋白质知识图谱,预训练样本集中包括多个预训练蛋白质序列,蛋白质知识图谱包括多个三元组,每个三元组由蛋白质、基因本体,以及蛋白质与基因本体之间的关系组成。
步骤102,针对每个预训练蛋白质序列,对该预训练蛋白质序列进行掩蔽操作,得到该预训练蛋白质序列对应的掩蔽序列。
举例来说,对识别模型的预训练过程,首先要获取用于预训练识别模型的样本输入集和样本输出集。样本输入集中包括多个样本输入,样本输出集中包括了与每个样本输入对应的样本输出。可以预先获取包括多个预训练蛋白质序列的预训练样本集,例如可以将Swiss-Prot数据库中包括的蛋白质序列作为预训练样本集。还可以获取蛋白质知识图谱,蛋白质知识图谱中包括多个三元组,每个三元组包括了蛋白质、基因本体(英文:Gene Ontology,缩写:GO),以及蛋白质与基因本体之间的关系,三元组可以表示为(蛋白质,relation,基因本体)。基因本体是具有动态形式的控制字汇,用于解释真核生物的基因或者蛋白质在细胞内所扮演的角色及生物医学方面的知识,关系是文字描述,用于描述三元组中蛋白质与基因本 体之间的关联,因此蛋白质知识图谱包含了能够描述蛋白质各种特性的信息。
之后可以对预训练样本集中的每个预训练蛋白质序列进行掩蔽操作(英文:mask),得到该预训练蛋白质序列对应的掩蔽序列。每个预训练蛋白质序列进行掩蔽操作的掩蔽位置可以是一个也可以是多个,例如可以是预训练蛋白质序列中的第15位、第173位、第210位等。掩蔽位置的确定,可以是随机生成的,也可以按照预设算法确定的,本公开对此不作具体限定。可以将一个预训练蛋白质序列对应的掩蔽序列,与包含该预训练蛋白质序列的三元组作为一个样本输入,将该预训练蛋白质序列作为对应的样本输出,从而得到样本输入集和样本输出集。
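By way of illustration, a minimal Python sketch of the masking operation described above is given below; the mask ratio, the `<mask>` token symbol, and the use of random positions are illustrative assumptions rather than values fixed by the present disclosure, which only requires that one or more positions be masked.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder symbol; the actual mask token is implementation-specific

def mask_sequence(sequence, mask_ratio=0.15, seed=None):
    """Mask one or more randomly chosen positions of a protein sequence.

    Returns the masked sequence as a list of tokens together with the masked
    positions, which are kept as the supervision signal for pre-training.
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    num_to_mask = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    for pos in positions:
        tokens[pos] = MASK_TOKEN
    return tokens, positions

masked_tokens, masked_positions = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", seed=0)
```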
步骤103,利用预设的识别模型对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征,并对包含该预训练蛋白质序列的三元组进行特征提取,得到知识特征。
步骤104,利用识别模型对序列特征和知识特征进行融合,并根据融合结果进行解码,以得到解码结果。
步骤105,根据解码结果、该预训练蛋白质序列和蛋白质知识图谱,对识别模型进行预训练,预训练后的识别模型,经过微调能够对蛋白质进行识别。
示例的,可以将样本输入集作为预设的识别模型的输入,然后再将样本输出集作为识别模型的输出,来对识别模型进行预训练,使得在输入样本输入集时,识别模型的输出,能够和样本输出集匹配。具体的,可以将样本输入集包括的每个预训练蛋白质序列对应的掩蔽序列和包含该预训练蛋白质序列的三元组作为识别模型的输入,然后根据识别模型的输出,以及该预训练蛋白质序列和蛋白质知识图谱,对识别模型进行预训练。例如,可以先利用识别模型分别对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征,并对包含该预训练蛋白质序列的三元组进行特征提取,得到知识特征。序列特征用于表征该预训练蛋白质序列对应的掩蔽序 列,知识特征用于表征包含该预训练蛋白质序列的三元组。
之后,利用识别模型对序列特征和知识特征进行融合,例如可以将序列特征和知识特征进行拼接,得到融合结果,也可以利用注意力机制对序列特征和知识特征进行融合,得到融合结果,还可以按照预设权重对序列特征和知识特征进行加权求和,得到融合结果,本公开对此不作具体限定。然后根据融合结果进行解码,以得到解码结果。最后根据解码结果、该预训练蛋白质序列和蛋白质知识图谱,对识别模型进行预训练。例如,可以根据解码结果、该预训练蛋白质序列和蛋白质知识图谱确定识别模型的损失函数,以降低损失函数为目标,利用反向传播算法来修正识别模型中的神经元的参数,神经元的参数例如可以是神经元的权重(英文:Weight)和偏置量(英文:Bias)。重复上述步骤,直至损失函数满足预设条件,例如损失函数小于预设的损失阈值,或者损失函数收敛,以达到预训练识别模型的目的。预训练得到的识别模型,可以根据具体的下游任务进行微调,使得微调后的识别模型能够完成下游任务,对蛋白质进行识别。下游任务例如可以是识别蛋白质的二级结构,识别蛋白质的残基接触,或者识别蛋白质的远程同源性等。
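The pre-training loop sketched below follows the procedure described above (forward pass, loss computed from the decoding result, back-propagation, repeat until a stopping condition); the optimizer, learning rate, and batch layout are illustrative assumptions.

```python
import torch

def pretrain(model, dataloader, loss_fn, num_epochs=10, lr=1e-4, loss_threshold=None):
    """Generic pre-training loop: loss_fn is expected to combine the decoding
    result with the original pre-trained protein sequence and the protein
    knowledge graph labels carried in each batch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss = None
    for _ in range(num_epochs):
        for batch in dataloader:
            optimizer.zero_grad()
            decoding_result = model(batch["masked_sequence"], batch["triple"])
            loss = loss_fn(decoding_result, batch)
            loss.backward()   # back-propagation adjusts the weights and biases
            optimizer.step()
        if loss_threshold is not None and loss is not None and loss.item() < loss_threshold:
            break
    return model
```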
识别模型分别提取序列特征和知识特征,并对序列特征和知识特征进行融合再解码,使得蛋白质知识图谱能够直接影响到识别模型的输出结果,即解码结果,这样在预训练的过程中,识别模型在学习预训练蛋白质序列的同时,还能够充分学习蛋白质知识图谱所包含的信息,提升了识别模型的能力,从而提高识别模型用于下游任务的准确度。
在一种应用场景中,识别模型的结构可以包括:序列编码器、知识编码器和解码器,序列编码器的输入和知识编码器的输入作为识别模型的输入,序列编码器的输出和知识编码器的输出一同输入解码器,解码器的输出作为识别模型的输出,如图2所示。
图3是根据一示例性实施例示出的另一种识别模型的预训练方法的流程图,如图3所示,步骤103的实现方式可以包括:
步骤1031,利用序列编码器对该预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征。
示例的,将该预训练蛋白质序列对应的掩蔽序列输入序列编码器,以使序列编码器进行特征提取,可以根据序列编码器的输出确定序列特征(可以表示为protein embedding),特征提取过程也可以理解为编码过程,序列特征也可以理解为该预训练蛋白质序列的向量表示。在一种实现方式中,可以将序列编码器的输出直接作为序列特征。在另一种实现方式中,还可以根据进行掩蔽操作的掩蔽位置,生成该预训练蛋白质序列对应的掩蔽令牌(可以表示为masked token),掩蔽令牌与掩蔽位置一一对应,可以理解为一个共享,可学习的向量,同时,掩蔽令牌中还可以包括用于表征掩蔽位置的位置向量。也就说是,掩蔽令牌用于表示掩蔽位置处的信息。可以将编码器的输出与掩蔽令牌进行融合得到的结果作为序列特征。序列编码器可以是采用ProtBer中的Encoder,也可以采用其他PPLM(英文:Protein Pre-trained Language Models)中的Encoder,本公开对此不作具体限定。
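A minimal PyTorch sketch of this step is shown below: any Transformer-style protein encoder (for example a ProtBert-like encoder) produces per-residue features, and a shared learnable mask token plus a position embedding is added at the masked positions; the hidden size and maximum sequence length are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SequenceFeatureExtractor(nn.Module):
    """Combines the output of a sequence encoder with learnable masked tokens."""
    def __init__(self, encoder, hidden_dim, max_len=1024):
        super().__init__()
        self.encoder = encoder                                     # maps token ids (B, L) to features (B, L, D)
        self.mask_token = nn.Parameter(torch.zeros(hidden_dim))    # shared, learnable vector
        self.pos_embed = nn.Embedding(max_len, hidden_dim)         # encodes the masked position

    def forward(self, token_ids, mask_bool):
        # mask_bool: (B, L) boolean tensor, True where the residue was masked
        feats = self.encoder(token_ids)                            # (B, L, D)
        positions = torch.arange(feats.size(1), device=feats.device)
        addition = self.mask_token + self.pos_embed(positions)     # (L, D)
        return feats + mask_bool.unsqueeze(-1) * addition          # sequence features
```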
步骤1032,利用知识编码器分别对包含该预训练蛋白质序列的三元组中的基因本体和关系进行特征提取,得到基因本体特征和关系特征。
步骤1033,根据基因本体特征和关系特征,确定知识特征,知识特征用于表征包含该预训练蛋白质序列的三元组。
示例的,可以分别将包含该预训练蛋白质序列的三元组中的基因本体和关系输入知识编码器,以使知识编码器进行特征提取,得到知识编码器输出的基因本体特征(可以表示为GO embedding)和关系特征(可以表示为relation embedding),特征提取过程也可以理解为编码过程,基因本体特征可以理解为包含该预训练蛋白质序列的三元组中的基因本体的向量表示,同样的,关系特征可以理解为包含该预训练蛋白质序列的三元组中的关系的向量表示。
之后,可以根据基因本体特征和关系特征,确定用于表征包含 该预训练蛋白质序列的三元组的知识特征(可以表示为knowledge embedding)。具体的,可以将基因本体特征和关系特征进行拼接(英文:Concat),将拼接得到的结果作为知识特征。也可以利用注意力机制对基因本体特征和关系特征进行融合,得到知识特征。例如,识别模型中可以包括一个注意力单元,如图2所示,将基因本体特征和关系特征输入该注意力单元,得到注意力单元输出的知识特征。注意力单元的实现可以通过公式1来表示:
$E_{knowledge}=f_{GO}+\mathrm{Attn}\left(f_{GO}W_Q,\,f_{R}W_K,\,f_{R}W_V\right)$   公式1
其中，$E_{knowledge}$表示知识特征，$f_{GO}$表示基因本体特征，$f_{R}$表示关系特征，$\mathrm{Attn}$表示注意力机制（即缩放点积注意力$\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\left(QK^{\top}/\sqrt{d_k}\right)V$），可以将$f_{GO}$作为注意力机制的Query（表示为$Q$），将$f_{R}$作为注意力机制的Key（表示为$K$）和Value（表示为$V$），$W_Q$表示Query对应的权重矩阵，$W_K$表示Key对应的权重矩阵，$W_V$表示Value对应的权重矩阵，$d_k$表示Key的长度。
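A minimal PyTorch sketch of the attention unit in Formula 1 is shown below: the gene ontology feature serves as the query, the relation feature supplies the key and value, and a residual connection keeps the original gene ontology feature; single-head attention and the feature dimension are illustrative simplifications.

```python
import math
import torch
import torch.nn as nn

class KnowledgeFusionAttention(nn.Module):
    """Fuse gene ontology features and relation features into knowledge features."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_V

    def forward(self, f_go, f_rel):
        # f_go: (B, N_go, D) gene ontology features; f_rel: (B, N_rel, D) relation features
        q, k, v = self.w_q(f_go), self.w_k(f_rel), self.w_v(f_rel)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))   # scaled dot-product
        return f_go + torch.softmax(scores, dim=-1) @ v            # E_knowledge
```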
图4是根据一示例性实施例示出的另一种识别模型的预训练方法的流程图,如图4所示,步骤101可以通过以下步骤来实现:
步骤1011,获取预训练样本集。
步骤1012,将预训练样本集与基因本体知识图谱对齐,得到初始知识图谱,初始知识图谱包括多个正三元组,正三元组中包括的蛋白质与基因本体之间的关系为真。
步骤1013,对初始知识图谱进行负采样,得到多个负三元组,负三元组包括的蛋白质与基因本体之间的关系为假。
步骤1014,根据多个负三元组和初始知识图谱,得到蛋白质知识图谱,蛋白质知识图谱包括多个三元组,和每个三元组的标识,标识用于指示该三元组属于正三元组或负三元组。
举例来说,可以将能够公开获得的基因本体知识图谱与预训练样本集进行对齐,从而构建初始知识图谱。初始知识图谱中包括多 个正三元组,每个正三元组包括了蛋白质、基因本体,以及蛋白质与基因本体之间的关系,并且关系为真,也就是说正三元组中表征的蛋白质与基因本体之间的关系为符合生物学特性的关系。
之后,可以对初始知识图谱进行负采样,得到多个负三元组,每个负三元组包括了蛋白质、基因本体,以及蛋白质与基因本体之间的关系,负三元组包括的蛋白质与基因本体之间的关系为假,也就是说正三元组中表征的蛋白质与基因本体之间的关系为不符合生物学特性的关系。之后,可以根据多个负三元组和初始知识图谱,生成蛋白质知识图谱,蛋白质知识图谱包括多个三元组以及每个三元组的标识,标识用于指示该三元组属于正三元组或负三元组,也就是说标识用于指示该三元组中包括的关系是否服务生物学特性。
具体的,可以通过公式2来实现对初始知识图谱的负采样:
$\mathcal{T}'_{Protein\text{-}GO}=\left\{(h,r,t')\mid t'\in\mathcal{E}'\right\}$   公式2
其中，$\mathcal{T}'_{Protein\text{-}GO}$表示负三元组的集合，$(h,r,t')$表示负三元组，$h$表示负三元组的头部（即蛋白质），$r$表示负三元组中的关系，$t'$表示负三元组的尾部（即基因本体），$t$表示正三元组的尾部，$\mathcal{E}$表示初始知识图谱中基因本体的集合，$\mathcal{E}'$表示与$\mathcal{E}$的交集为空的集合。
负采样的具体实现:总共有3类基因本体:生物过程、分子功能以及细胞组成。可以只从同一类基因本体中进行负采样。例如,正三元组的基因本体为生物过程,那么负三元组中的基因本体也为生物过程。正三元组的基因本体为分子功能,那么负三元组中的基因本体也为分子功能。
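A minimal Python sketch of this negative-sampling rule is given below: the tail of a positive triple is replaced by another gene ontology term drawn from the same category (biological process, molecular function, or cellular component), and every triple carries an identifier marking it as positive (1) or negative (0); the data layout is an illustrative assumption.

```python
import random

def negative_sample(positive_triples, go_terms_by_category, num_negatives=1, seed=None):
    """Corrupt the tail of each positive (protein, relation, GO term) triple.

    `go_terms_by_category` maps a category name to the GO terms it contains, so a
    negative tail is always drawn from the same category as the true tail. Every
    tail in `positive_triples` is assumed to appear in exactly one category.
    """
    rng = random.Random(seed)
    positive_set = set(positive_triples)
    labelled = [(h, r, t, 1) for (h, r, t) in positive_triples]          # positive triples
    for h, r, t in positive_triples:
        category = next(c for c, terms in go_terms_by_category.items() if t in terms)
        candidates = [x for x in go_terms_by_category[category] if (h, r, x) not in positive_set]
        for t_neg in rng.sample(candidates, min(num_negatives, len(candidates))):
            labelled.append((h, r, t_neg, 0))                            # negative triples
    return labelled
```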
图5是根据一示例性实施例示出的另一种识别模型的预训练方法的流程图,如图5所示,步骤105可以通过以下步骤来实现:
步骤1051,根据解码结果确定预测序列,并根据预测序列与该预训练蛋白质序列确定预测损失。
举例来说,识别模型中可以采用两个Classifier来做多任务预训练,一个任务可以将解码结果输入MLP(英文:Multilayer Perceptron, 中文:多层感知器),将MLP输出的结果作为预测序列,预测序列可以理解为识别模型对蛋白质序列的预测,预测序列与该预训练蛋白质序列的长度相同。因此可以根据预测序列与该预训练蛋白质序列确定预测损失,预测损失可以理解为MLM(英文:Masked Language Model,中文:掩蔽语言模型)损失。例如,可以根据公式3确定预测损失:
$\mathcal{L}_{MLM}=-\mathbb{E}\left[\sum_{i=1}^{M}\log P\left(x_i\right)\right]$   公式3
其中，$\mathcal{L}_{MLM}$表示预测损失，$\mathbb{E}$表示期望运算，$x_i$表示该预训练蛋白质序列中第$i$个被掩蔽的元素，$M$表示该预训练蛋白质序列中被掩蔽（masked）的元素的数量，$P(x_i)$表示预测序列中第$i$个元素被预测为$x_i$的概率。
步骤1052,根据解码结果确定预测识别结果,并根据预测识别结果与包含该预训练蛋白质序列的三元组的标识,确定识别损失。
示例的,第二个任务可以先将解码结果经过一个池化层,再输入MLP,让MLP完成一个二分类任务,即根据MLP输出的结果确定预测识别结果,预测识别结果可以理解为识别模型对知识特征的预测,确定知识特征表征的三元组为正三元组还是负三元组。因此可以根据预测识别结果与包含该预训练蛋白质序列的三元组的标识,确定识别损失。例如,可以根据公式4确定识别损失:
$\mathcal{L}_{PFI}=-\left[y\log p+(1-y)\log(1-p)\right]-\sum_{i=1}^{N}\left[y_i\log p_i+\left(1-y_i\right)\log\left(1-p_i\right)\right]$   公式4
其中，$\mathcal{L}_{PFI}$表示识别损失，$y$表示正三元组的标识（可以为1），$p$表示识别模型将正三元组识别为正三元组的概率，$N$表示蛋白质知识图谱中负三元组的数量，$y_i$表示第$i$个负三元组的标识（均为0），$p_i$表示识别模型将第$i$个负三元组识别为正三元组的概率。通过负采样和识别损失，使得识别模型能够区分正确和错误的知识，从而辅助蛋白质知识图谱的嵌入。
步骤1053,根据预测损失和识别损失确定总损失。
步骤1054,以降低总损失为目标,利用反向传播算法对识别模型进行预训练。
示例的,可以根据预测损失和识别损失确定总损失,例如可以将预测损失和识别损失的和作为总损失,也可以对预测损失和识别损失进行加权求和,得到总损失。最后,可以以降低总损失为目标,利用反向传播算法对识别模型进行预训练。例如可以根据公式5确定总损失:
$\mathcal{L}_{total}=\mathcal{L}_{MLM}+\alpha\mathcal{L}_{PFI}$   公式5
其中，$\mathcal{L}_{total}$表示总损失，$\alpha$表示识别损失对应的权重，例如可以设置为1。
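The loss computation can be sketched as follows: a masked-language-model loss over the masked positions plus a binary triple-classification loss, combined with the weight α; the tensor shapes and the use of logits are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(token_logits, target_ids, mask_bool, triple_logit, triple_label, alpha=1.0):
    """Combine the prediction loss (Formula 3) and the recognition loss (Formula 4).

    token_logits: (B, L, V) per-position predictions decoded from the fused features
    target_ids:   (B, L)    original residue ids of the pre-trained protein sequence
    mask_bool:    (B, L)    True at the masked positions
    triple_logit: (B,)      score that the attached triple is a positive triple
    triple_label: (B,)      1.0 for positive triples, 0.0 for negative triples
    """
    mlm_loss = F.cross_entropy(token_logits[mask_bool], target_ids[mask_bool])
    pfi_loss = F.binary_cross_entropy_with_logits(triple_logit, triple_label)
    return mlm_loss + alpha * pfi_loss   # total loss (Formula 5)
```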
图6是根据一示例性实施例示出的另一种识别模型的预训练方法的流程图,如图6所示,步骤104可以包括:
步骤1041,将序列特征和知识特征进行拼接,得到综合特征。
步骤1042,对综合特征进行解码,得到解码结果。
示例的,解码过程可以先将序列特征和知识特征进行拼接(即Concat),得到综合特征,然后将综合特征输入解码器(英文:Decoder),由解码器对综合特征进行解码,得到解码结果。解码器例如可以是Bert中的Decoder,也可以采用其他PPLM中的Decoder,本公开对此不作具体限定。
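For example (shapes are illustrative assumptions), the splicing step can be a plain concatenation of the two feature tensors before they are handed to the decoder:

```python
import torch

e_seq = torch.randn(2, 128, 768)        # sequence features: (batch, sequence length, dim)
e_knowledge = torch.randn(2, 16, 768)   # knowledge features: (batch, knowledge tokens, dim)
comprehensive = torch.cat([e_seq, e_knowledge], dim=1)   # comprehensive features: (2, 144, 768)
# decoding_result = decoder(comprehensive)
```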
图7是根据一示例性实施例示出的另一种识别模型的预训练方法的流程图,如图7所示,步骤104可以包括:
步骤1043,利用注意力机制,将知识特征中与序列特征匹配的信息,与序列特征进行融合,得到跨模态融合特征。
步骤1044,利用自注意力机制,对跨模态融合特征进行解码,得到解码结果。
示例的,解码器中包括可以多个依次连接的解码层(例如可以包括3个解码层),每个解码层包括:多头跨模态注意力模块、多头自注意力(英文:Multi-head self-attention)模块、MLP,如图8所示(图中仅以一个解码层作为示意,并未展示出多个解码层)。 可以将知识特征与序列特征分别经过LN(英文:Layer Normalization)层再输入多头跨模态注意力模块,多头跨模态注意力模块能够将知识特征中与序列特征匹配的信息,与序列特征进行融合,得到跨模态融合特征。之后,可以设置一个残差单元,将跨模态融合特征和序列特征一起输入多头自注意力模块、MLP进行解码,得到解码结果。
多头跨模态注意力模块的实现可以通过公式6来表示：
$\hat{E}^{i}=E^{i-1}+\mathrm{Attn}\left(E^{i-1}W_Q,\,E_{knowledge}W_K,\,E_{knowledge}W_V\right)$   公式6
其中，$\hat{E}^{i}$表示第$i$个解码层中多头跨模态注意力模块的输出，即跨模态融合特征，$E^{i-1}$表示第$i-1$个解码层的输出，即第$i$个解码层的一个输入（另外一个输入为$E_{knowledge}$），相应的，第1个解码层的两个输入分别为$E^{0}$（即为序列特征）和$E_{knowledge}$，$E_{knowledge}$表示知识特征，$\mathrm{Attn}$表示注意力机制（其计算方式与公式1相同），将$E^{i-1}$作为注意力机制的Query（表示为$Q$），将$E_{knowledge}$作为注意力机制的Key（表示为$K$）和Value（表示为$V$），$W_Q$表示Query对应的权重矩阵，$W_K$表示Key对应的权重矩阵，$W_V$表示Value对应的权重矩阵，$d_k$表示Key的长度。多头跨模态注意力模块，可以筛选出知识特征中与序列特征匹配的信息，这样能够减少两种模态（蛋白质序列的模态，基因本体、关系的文本模态）匹配过程中的噪声，同时还可以对齐两种模态，提升跨模态融合特征的质量，从而提高识别模型的能力。
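A minimal PyTorch sketch of one such decoder layer is given below: layer-normalised multi-head cross-modal attention (sequence features attend to knowledge features, as in Formula 6) with a residual connection, followed by multi-head self-attention and an MLP; the number of heads and the MLP expansion factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalDecoderLayer(nn.Module):
    """One decoder layer: cross-modal attention, then self-attention, then an MLP."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm_seq = nn.LayerNorm(dim)
        self.norm_kno = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mid = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_out = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, e_seq, e_knowledge):
        # cross-modal attention: query = sequence features, key/value = knowledge features
        q, kv = self.norm_seq(e_seq), self.norm_kno(e_knowledge)
        fused, _ = self.cross_attn(q, kv, kv)
        x = e_seq + fused                        # residual keeps the original sequence features
        y = self.norm_mid(x)
        attn_out, _ = self.self_attn(y, y, y)    # multi-head self-attention
        x = x + attn_out
        return x + self.mlp(self.norm_out(x))    # decoding output of this layer
```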
综上所述,本公开首先获取蛋白质知识图谱和多个预训练蛋白质序列,蛋白质知识图谱包括多个由蛋白质、基因本体、关系组成的三元组。之后对每个预训练蛋白质序列进行掩蔽操作,得到对应的掩蔽序列。然后利用识别模型分别对每个预训练蛋白质序列对应 的掩蔽序列和包含该预训练蛋白质序列的三元组进行特征提取,得到序列特征和知识特征,再利用识别模型对序列特征和知识特征进行融合并进行解码,最后根据解码结果、该预训练蛋白质序列和蛋白质知识图谱对预设的识别模型进行预训练。预训练后的识别模型,经过微调能够对蛋白质进行识别。本公开将序列特征和知识特征进行融合再解码,使得蛋白质知识图谱能够直接影响到识别模型的输出结果,这样在预训练的过程中,识别模型能够充分学习蛋白质知识图谱所包含的信息,提升了识别模型的能力,从而提高识别模型用于下游任务的准确度。
图9是根据一示例性实施例示出的一种蛋白质识别方法的流程图,如图9所示,该方法包括:
步骤201,获取待识别的目标蛋白质对应的目标序列。
步骤202,将目标序列输入蛋白质识别模型,以确定目标蛋白质的识别信息,识别信息包括以下至少一种:目标蛋白质的二级结构、残基接触、远程同源性、稳定性,以及荧光性。
蛋白质识别模型为根据训练样本集对上述识别模型的预训练方法得到的识别模型进行微调得到的,训练样本集包括多个训练蛋白质序列。
举例来说,可以利用上述识别模型的预训练方法完成对识别模型的预训练,之后可以利用预设的训练样本集对识别模型进行微调,得到蛋白质识别模型,使得蛋白质识别模型能够完成以下至少一种任务:识别蛋白质的二级结构、残基接触、远程同源性、稳定性,以及荧光性。训练样本集可以和预训练样本集相同,也可以不相同,本公开对此不作具体限定。因为识别模型充分学习了蛋白质知识图谱所包含的信息,能够快速准确地微调得到蛋白质识别模型,缩短微调时间,减少微调消耗的计算资源。
蛋白质识别模型可以只采用识别模型中的序列编码器,并在序列编码器之后连接一个MLP,如图10所示,微调过程可以是将训练样本集作为蛋白质识别模型的输入,然后再将训练样本集对应的输 出集作为识别模型的输出,来对蛋白质识别模型进行微调,使得在输入训练样本集时,蛋白质识别模型的输出,能够和对应的输出集匹配。训练样本集对应的输出集可以根据具体的任务来确定,例如蛋白质识别模型的任务是要识别蛋白质的二级结构,那么对应的输出集可以是训练样本集中每个训练蛋白质序列对应的二级结构。可以根据蛋白质识别模型的输出与对应的输出集确定损失函数,并以降低损失函数为目标,利用反向传播算法来修正蛋白质识别模型中的神经元的参数。重复上述步骤,直至损失函数满足预设条件,例如损失函数小于预设的损失阈值,或者损失函数收敛,以达到微调蛋白质识别模型的目的。
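A minimal PyTorch sketch of such a fine-tuned protein recognition model is shown below: the pre-trained sequence encoder is reused and a small MLP head is attached for one downstream task; the hidden size, the per-residue secondary-structure task, and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProteinRecognitionModel(nn.Module):
    """Pre-trained sequence encoder plus an MLP head for a downstream task."""
    def __init__(self, pretrained_encoder, hidden_dim, num_classes=3):
        super().__init__()
        self.encoder = pretrained_encoder     # taken from the pre-trained recognition model
        self.head = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, num_classes))

    def forward(self, token_ids):
        feats = self.encoder(token_ids)       # (B, L, D) sequence features
        return self.head(feats)               # (B, L, num_classes) per-residue predictions

# a typical fine-tuning step (illustrative):
# logits = model(batch_token_ids)
# loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)), batch_labels.reshape(-1))
# loss.backward(); optimizer.step()
```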
在完成对蛋白质识别模型的微调后,可以获取待识别的目标蛋白质对应的目标序列,然后将目标序列输入蛋白质识别模型,蛋白质识别模型能够对目标序列进行识别,以得到目标蛋白质的识别信息,识别信息包括以下至少一种:目标蛋白质的二级结构、残基接触、远程同源性、稳定性,以及荧光性。
综上所述,本公开中使用的蛋白质识别模型,是对识别模型进行微调得到的,而识别模型在预训练过程中,充分学习了蛋白质知识图谱所包含的信息,因此对识别模型进行微调的计算量低、效率高,从而提高了蛋白质识别模型的准确度。
图11是根据一示例性实施例示出的一种识别模型的预训练装置的框图,如图11所示,该装置300包括:
获取模块301,用于获取预训练样本集和蛋白质知识图谱,预训练样本集中包括多个预训练蛋白质序列,蛋白质知识图谱包括多个三元组,每个三元组由蛋白质、基因本体,以及蛋白质与基因本体之间的关系组成。
掩蔽模块302,用于针对每个预训练蛋白质序列,对该预训练蛋白质序列进行掩蔽操作,得到该预训练蛋白质序列对应的掩蔽序列。
预训练模块303,用于利用预设的识别模型对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征,并对包含该预 训练蛋白质序列的三元组进行特征提取,得到知识特征。利用识别模型对序列特征和知识特征进行融合,并根据融合结果进行解码,以得到解码结果。根据解码结果、该预训练蛋白质序列和蛋白质知识图谱,对识别模型进行预训练,预训练后的识别模型,经过微调能够对蛋白质进行识别。
图12是根据一示例性实施例示出的另一种识别模型的预训练装置的框图,如图12所示,识别模型包括:序列编码器和知识编码器。
预训练模块303可以包括:
第一提取子模块3031,用于利用序列编码器对该预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征。
第二提取子模块3032,用于利用知识编码器分别对包含该预训练蛋白质序列的三元组中的基因本体和关系进行特征提取,得到基因本体特征和关系特征。
确定子模块3033,用于根据基因本体特征和关系特征,确定知识特征,知识特征用于表征包含该预训练蛋白质序列的三元组。
在一种实现方式中,确定子模块3033可以用于:
利用注意力机制对基因本体特征和关系特征进行融合,得到知识特征。
图13是根据一示例性实施例示出的另一种识别模型的预训练装置的框图,如图13所示,获取模块301可以包括:
获取子模块3011,用于获取预训练样本集。
对齐子模块3012,用于将预训练样本集与基因本体知识图谱对齐,得到初始知识图谱,初始知识图谱包括多个正三元组,正三元组中包括的蛋白质与基因本体之间的关系为真。
负采样子模块3013,用于对初始知识图谱进行负采样,得到多个负三元组,负三元组包括的蛋白质与基因本体之间的关系为假。
处理子模块3014,用于根据多个负三元组和初始知识图谱,得到蛋白质知识图谱,蛋白质知识图谱包括多个三元组,和每个三元组的标识,标识用于指示该三元组属于正三元组或负三元组。
在另一种实现方式中,预训练模块303可以用于执行以下步骤:
步骤1)根据解码结果确定预测序列,并根据预测序列与该预训练蛋白质序列确定预测损失。
步骤2)根据解码结果确定预测识别结果,并根据预测识别结果与包含该预训练蛋白质序列的三元组的标识,确定识别损失。
步骤3)根据预测损失和识别损失确定总损失。
步骤4)以降低总损失为目标,利用反向传播算法对识别模型进行预训练。
在又一种实现方式中,预训练模块303可以用于执行以下步骤:
步骤5)将序列特征和知识特征进行拼接,得到综合特征。
步骤6)对综合特征进行解码,得到解码结果。
在又一种实现方式中,预训练模块303可以用于执行以下步骤:
步骤7)利用注意力机制,将知识特征中与序列特征匹配的信息,与序列特征进行融合,得到跨模态融合特征。
步骤8)利用自注意力机制,对跨模态融合特征进行解码,得到解码结果。
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
综上所述,本公开首先获取蛋白质知识图谱和多个预训练蛋白质序列,蛋白质知识图谱包括多个由蛋白质、基因本体、关系组成的三元组。之后对每个预训练蛋白质序列进行掩蔽操作,得到对应的掩蔽序列。然后利用识别模型分别对每个预训练蛋白质序列对应的掩蔽序列和包含该预训练蛋白质序列的三元组进行特征提取,得到序列特征和知识特征,再利用识别模型对序列特征和知识特征进行融合并进行解码,最后根据解码结果、该预训练蛋白质序列和蛋白质知识图谱对预设的识别模型进行预训练。预训练后的识别模型,经过微调能够对蛋白质进行识别。本公开将序列特征和知识特征进行融合再解码,使得蛋白质知识图谱能够直接影响到识别模型的输 出结果,这样在预训练的过程中,识别模型能够充分学习蛋白质知识图谱所包含的信息,提升了识别模型的能力,从而提高识别模型用于下游任务的准确度。
图14是根据一示例性实施例示出的一种蛋白质识别装置的框图,如图14所示,该装置400包括:
获取模块401,用于获取待识别的目标蛋白质对应的目标序列。
识别模块402,用于将目标序列输入蛋白质识别模型,以确定目标蛋白质的识别信息,识别信息包括以下至少一种:目标蛋白质的二级结构、残基接触、远程同源性、稳定性,以及荧光性。
蛋白质识别模型为根据训练样本集对上述识别模型的预训练方法得到的识别模型进行微调得到的,训练样本集包括多个训练蛋白质序列。
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
综上所述,本公开中使用的蛋白质识别模型,是对识别模型进行微调得到的,而识别模型在预训练过程中,充分学习了蛋白质知识图谱所包含的信息,因此对识别模型进行微调的计算量低、效率高,从而提高了蛋白质识别模型的准确度。
下面参考图15,其示出了适于用来实现本公开实施例的电子设备(例如可以上述实施例中的执行主体,可以是终端设备或服务器)600的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图15示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图15所示,电子设备600可以包括处理装置(例如中央处理器、图形处理器等)601,其可以根据存储在只读存储器(ROM)602 中的程序或者从存储装置608加载到随机访问存储器(RAM)603中的程序而执行各种适当的动作和处理。在RAM 603中,还存储有电子设备600操作所需的各种程序和数据。处理装置601、ROM 602以及RAM 603通过总线604彼此相连。输入/输出(I/O)接口605也连接至总线604。
通常,以下装置可以连接至I/O接口605:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置606;包括例如液晶显示器(LCD)、扬声器、振动器等的输出装置607;包括例如磁带、硬盘等的存储装置608;以及通信装置609。通信装置609可以允许电子设备600与其他设备进行无线或有线通信以交换数据。虽然图15示出了具有各种装置的电子设备600,但是应理解的是,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。
特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置609从网络上被下载和安装,或者从存储装置608被安装,或者从ROM 602被安装。在该计算机程序被处理装置601执行时,执行本公开实施例的方法中限定的上述功能。
需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存 储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(射频)等等,或者上述的任意合适的组合。
在一些实施方式中,终端设备、服务器可以利用诸如HTTP(HyperText Transfer Protocol,超文本传输协议)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(“LAN”),广域网(“WAN”),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。
上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备:获取预训练样本集和蛋白质知识图谱,所述预训练样本集中包括多个预训练蛋白质序列,所述蛋白质知识图谱包括多个三元组,每个所述三元组由蛋白质、基因本体,以及蛋白质与基因本体之间的关系组成;针对每个所述预训练蛋白质序列,对该预训练蛋白质序列进行掩蔽操作,得到该预训练蛋白质序列对应的掩蔽序列;利用预设的识别模型对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特 征,并对包含该预训练蛋白质序列的三元组进行特征提取,得到知识特征;利用所述识别模型对所述序列特征和所述知识特征进行融合,并根据融合结果进行解码,以得到解码结果;根据所述解码结果、该预训练蛋白质序列和所述蛋白质知识图谱,对所述识别模型进行预训练,预训练后的所述识别模型,经过微调能够对蛋白质进行识别。
或者,上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备:获取待识别的目标蛋白质对应的目标序列;将所述目标序列输入蛋白质识别模型,以确定所述目标蛋白质的识别信息,所述识别信息包括以下至少一种:所述目标蛋白质的二级结构、残基接触、远程同源性、稳定性,以及荧光性;所述蛋白质识别模型为根据训练样本集对上述识别模型的预训练方法训练的识别模型进行微调得到的,所述训练样本集包括多个训练蛋白质序列。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言——诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)——连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个 用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的模块可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,模块的名称在某种情况下并不构成对该模块本身的限定,例如,获取模块还可以被描述为“获取预训练样本集和蛋白质知识图谱的模块”。
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上系统(SOC)、复杂可编程逻辑设备(CPLD)等等。
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。
根据本公开的一个或多个实施例,示例1提供了一种识别模型的预训练方法,包括:获取预训练样本集和蛋白质知识图谱,所述 预训练样本集中包括多个预训练蛋白质序列,所述蛋白质知识图谱包括多个三元组,每个所述三元组由蛋白质、基因本体,以及蛋白质与基因本体之间的关系组成;针对每个所述预训练蛋白质序列,对该预训练蛋白质序列进行掩蔽操作,得到该预训练蛋白质序列对应的掩蔽序列;利用预设的识别模型对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征,并对包含该预训练蛋白质序列的三元组进行特征提取,得到知识特征;利用所述识别模型对所述序列特征和所述知识特征进行融合,并根据融合结果进行解码,以得到解码结果;根据所述解码结果、该预训练蛋白质序列和所述蛋白质知识图谱,对所述识别模型进行预训练,预训练后的所述识别模型,经过微调能够对蛋白质进行识别。
根据本公开的一个或多个实施例,示例2提供了示例1的方法,所述识别模型包括:序列编码器和知识编码器;所述利用预设的识别模型对对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征,并对包含该预训练蛋白质序列的三元组进行特征提取,得到知识特征,包括:利用所述序列编码器对该预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征;利用所述知识编码器分别对包含该预训练蛋白质序列的三元组中的基因本体和关系进行特征提取,得到基因本体特征和关系特征;根据所述基因本体特征和所述关系特征,确定所述知识特征,所述知识特征用于表征包含该预训练蛋白质序列的三元组。
根据本公开的一个或多个实施例,示例3提供了示例2的方法,所述根据所述基因本体特征和所述关系特征,确定所述知识特征,包括:利用注意力机制对所述基因本体特征和所述关系特征进行融合,得到所述知识特征。
根据本公开的一个或多个实施例,示例4提供了示例1的方法,所述获取预训练样本集和蛋白质知识图谱,包括:获取所述预训练样本集;将所述预训练样本集与基因本体知识图谱对齐,得到初始知识图谱,所述初始知识图谱包括多个正三元组,所述正三元组中 包括的蛋白质与基因本体之间的关系为真;对所述初始知识图谱进行负采样,得到多个负三元组,所述负三元组包括的蛋白质与基因本体之间的关系为假;根据多个所述负三元组和所述初始知识图谱,得到所述蛋白质知识图谱,所述蛋白质知识图谱包括多个所述三元组,和每个所述三元组的标识,所述标识用于指示该三元组属于所述正三元组或所述负三元组。
根据本公开的一个或多个实施例,示例5提供了示例4的方法,所述根据所述解码结果、该预训练蛋白质序列和所述蛋白质知识图谱,对所述识别模型进行预训练,包括:根据所述解码结果确定预测序列,并根据所述预测序列与该预训练蛋白质序列确定预测损失;根据所述解码结果确定预测识别结果,并根据所述预测识别结果与包含该预训练蛋白质序列的三元组的标识,确定识别损失;根据所述预测损失和所述识别损失确定总损失;以降低所述总损失为目标,利用反向传播算法对所述识别模型进行预训练。
根据本公开的一个或多个实施例,示例6提供了示例1的方法,所述利用所述识别模型对所述序列特征和所述知识特征进行融合,并根据融合结果进行解码,以得到解码结果,包括:将所述序列特征和所述知识特征进行拼接,得到综合特征;对所述综合特征进行解码,得到所述解码结果。
根据本公开的一个或多个实施例,示例7提供了示例1的方法,所述利用所述识别模型对所述序列特征和所述知识特征进行融合,并根据融合结果进行解码,以得到解码结果,包括:利用注意力机制,将所述知识特征中与所述序列特征匹配的信息,与所述序列特征进行融合,得到跨模态融合特征;利用自注意力机制,对所述跨模态融合特征进行解码,得到所述解码结果。
根据本公开的一个或多个实施例,示例8提供了一种蛋白质识别方法,包括:获取待识别的目标蛋白质对应的目标序列;将所述目标序列输入蛋白质识别模型,以确定所述目标蛋白质的识别信息,所述识别信息包括以下至少一种:所述目标蛋白质的二级结构、残 基接触、远程同源性、稳定性,以及荧光性;所述蛋白质识别模型为根据训练样本集对示例1-7中任一项所述的识别模型进行微调得到的,所述训练样本集包括多个训练蛋白质序列。
根据本公开的一个或多个实施例,示例9提供了一种识别模型的预训练装置,包括:获取模块,用于获取预训练样本集和蛋白质知识图谱,所述预训练样本集中包括多个预训练蛋白质序列,所述蛋白质知识图谱包括多个三元组,每个所述三元组由蛋白质、基因本体,以及蛋白质与基因本体之间的关系组成;掩蔽模块,用于针对每个所述预训练蛋白质序列,对该预训练蛋白质序列进行掩蔽操作,得到该预训练蛋白质序列对应的掩蔽序列;预训练模块,用于利用预设的识别模型对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征,并对包含该预训练蛋白质序列的三元组进行特征提取,得到知识特征;利用所述识别模型对所述序列特征和所述知识特征进行融合,根据融合结果进行解码,以得到解码结果;根据所述解码结果、该预训练蛋白质序列和所述蛋白质知识图谱,对所述识别模型进行预训练,预训练后的所述识别模型,经过微调能够对蛋白质进行识别。
根据本公开的一个或多个实施例,示例10提供了一种蛋白质识别方法,包括:获取模块,用于获取待识别的目标蛋白质对应的目标序列;识别模块,用于将所述目标序列输入蛋白质识别模型,以确定所述目标蛋白质的识别信息,所述识别信息包括以下至少一种:所述目标蛋白质的二级结构、残基接触、远程同源性、稳定性,以及荧光性;所述蛋白质识别模型为根据训练样本集对示例1-7中任一项所述的识别模型进行微调得到的,所述训练样本集包括多个训练蛋白质序列。
根据本公开的一个或多个实施例,示例11提供了一种计算机可读介质,其上存储有计算机程序,该程序被处理装置执行时实现示例1至示例8中所述方法的步骤。
根据本公开的一个或多个实施例,示例12提供了一种电子设备, 包括:存储装置,其上存储有计算机程序;处理装置,用于执行所述存储装置中的所述计算机程序,以实现示例1至示例8中所述方法的步骤。
以上描述仅为本公开的较佳实施例以及对所运用技术原理的说明。本领域技术人员应当理解,本公开中所涉及的公开范围,并不限于上述技术特征的特定组合而成的技术方案,同时也应涵盖在不脱离上述公开构思的情况下,由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本公开中公开的(但不限于)具有类似功能的技术特征进行互相替换而形成的技术方案。
此外,虽然采用特定次序描绘了各操作,但是这不应当理解为要求这些操作以所示出的特定次序或以顺序次序执行来执行。在一定环境下,多任务和并行处理可能是有利的。同样地,虽然在上面论述中包含了若干具体实现细节,但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实施例中。相反地,在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适的子组合的方式实现在多个实施例中。
尽管已经采用特定于结构特征和/或方法逻辑动作的语言描述了本主题,但是应当理解所附权利要求书中所限定的主题未必局限于上面描述的特定特征或动作。相反,上面所描述的特定特征和动作仅仅是实现权利要求书的示例形式。关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。

Claims (12)

  1. 一种识别模型的预训练方法,其特征在于,所述方法包括:
    获取预训练样本集和蛋白质知识图谱,所述预训练样本集中包括多个预训练蛋白质序列,所述蛋白质知识图谱包括多个三元组,每个所述三元组由蛋白质、基因本体,以及蛋白质与基因本体之间的关系组成;
    针对每个所述预训练蛋白质序列,对该预训练蛋白质序列进行掩蔽操作,得到该预训练蛋白质序列对应的掩蔽序列;
    利用预设的识别模型对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征,并对包含该预训练蛋白质序列的三元组进行特征提取,得到知识特征;
    利用所述识别模型对所述序列特征和所述知识特征进行融合,并根据融合结果进行解码,以得到解码结果;
    根据所述解码结果、该预训练蛋白质序列和所述蛋白质知识图谱,对所述识别模型进行预训练,预训练后的所述识别模型,经过微调能够对蛋白质进行识别。
  2. 根据权利要求1所述的方法,其特征在于,所述识别模型包括:序列编码器和知识编码器;所述利用预设的识别模型对对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征,并对包含该预训练蛋白质序列的三元组进行特征提取,得到知识特征,包括:
    利用所述序列编码器对该预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征;
    利用所述知识编码器分别对包含该预训练蛋白质序列的三元组中的基因本体和关系进行特征提取,得到基因本体特征和关系特征;
    根据所述基因本体特征和所述关系特征,确定所述知识特征,所述知识特征用于表征包含该预训练蛋白质序列的三元组。
  3. 根据权利要求2所述的方法,其特征在于,所述根据所述基 因本体特征和所述关系特征,确定所述知识特征,包括:
    利用注意力机制对所述基因本体特征和所述关系特征进行融合,得到所述知识特征。
  4. 根据权利要求1所述的方法,其特征在于,所述获取预训练样本集和蛋白质知识图谱,包括:
    获取所述预训练样本集;
    将所述预训练样本集与基因本体知识图谱对齐,得到初始知识图谱,所述初始知识图谱包括多个正三元组,所述正三元组中包括的蛋白质与基因本体之间的关系为真;
    对所述初始知识图谱进行负采样,得到多个负三元组,所述负三元组包括的蛋白质与基因本体之间的关系为假;
    根据多个所述负三元组和所述初始知识图谱,得到所述蛋白质知识图谱,所述蛋白质知识图谱包括多个所述三元组,和每个所述三元组的标识,所述标识用于指示该三元组属于所述正三元组或所述负三元组。
  5. 根据权利要求4所述的方法,其特征在于,所述根据所述解码结果、该预训练蛋白质序列和所述蛋白质知识图谱,对所述识别模型进行预训练,包括:
    根据所述解码结果确定预测序列,并根据所述预测序列与该预训练蛋白质序列确定预测损失;
    根据所述解码结果确定预测识别结果,并根据所述预测识别结果与包含该预训练蛋白质序列的三元组的标识,确定识别损失;
    根据所述预测损失和所述识别损失确定总损失;
    以降低所述总损失为目标,利用反向传播算法对所述识别模型进行预训练。
  6. 根据权利要求1所述的方法,其特征在于,所述利用所述识别模型对所述序列特征和所述知识特征进行融合,并根据融合结果进行解码,以得到解码结果,包括:
    将所述序列特征和所述知识特征进行拼接,得到综合特征;
    对所述综合特征进行解码,得到所述解码结果。
  7. 根据权利要求1所述的方法,其特征在于,所述利用所述识别模型对所述序列特征和所述知识特征进行融合,并根据融合结果进行解码,以得到解码结果,包括:
    利用注意力机制,将所述知识特征中与所述序列特征匹配的信息,与所述序列特征进行融合,得到跨模态融合特征;
    利用自注意力机制,对所述跨模态融合特征进行解码,得到所述解码结果。
  8. 一种蛋白质识别方法,其特征在于,所述方法包括:
    获取待识别的目标蛋白质对应的目标序列;
    将所述目标序列输入蛋白质识别模型,以确定所述目标蛋白质的识别信息,所述识别信息包括以下至少一种:所述目标蛋白质的二级结构、残基接触、远程同源性、稳定性,以及荧光性;
    所述蛋白质识别模型为根据训练样本集对权利要求1-7中任一项所述的识别模型进行微调得到的,所述训练样本集包括多个训练蛋白质序列。
  9. 一种识别模型的预训练装置,其特征在于,所述装置包括:
    获取模块,用于获取预训练样本集和蛋白质知识图谱,所述预训练样本集中包括多个预训练蛋白质序列,所述蛋白质知识图谱包括多个三元组,每个所述三元组由蛋白质、基因本体,以及蛋白质与基因本体之间的关系组成;
    掩蔽模块,用于针对每个所述预训练蛋白质序列,对该预训练蛋白质序列进行掩蔽操作,得到该预训练蛋白质序列对应的掩蔽序列;
    预训练模块,用于利用预设的识别模型对每个预训练蛋白质序列对应的掩蔽序列进行特征提取,得到序列特征,并对包含该预训练蛋白质序列的三元组进行特征提取,得到知识特征;利用所述识别模型对所述序列特征和所述知识特征进行融合,并根据融合结果进行解码,以得到解码结果;根据所述解码结果、该预训练蛋白质 序列和所述蛋白质知识图谱,对所述识别模型进行预训练,预训练后的所述识别模型,经过微调能够对蛋白质进行识别。
  10. 一种蛋白质识别装置,其特征在于,所述装置包括:
    获取模块,用于获取待识别的目标蛋白质对应的目标序列;
    识别模块,用于将所述目标序列输入蛋白质识别模型,以确定所述目标蛋白质的识别信息,所述识别信息包括以下至少一种:所述目标蛋白质的二级结构、残基接触、远程同源性、稳定性,以及荧光性;
    所述蛋白质识别模型为根据训练样本集对权利要求1-7中任一项所述的识别模型进行微调得到的,所述训练样本集包括多个训练蛋白质序列。
  11. 一种计算机可读介质,其上存储有计算机程序,其特征在于,该程序被处理装置执行时实现权利要求1-8中任一项所述方法的步骤。
  12. 一种电子设备,其特征在于,包括:
    存储装置,其上存储有计算机程序;
    处理装置,用于执行所述存储装置中的所述计算机程序,以实现权利要求1-8中任一项所述方法的步骤。
PCT/CN2023/110347 2022-08-05 2023-07-31 识别模型的预训练方法、识别方法、装置、介质和设备 WO2024027663A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210947783.9A CN115312127B (zh) 2022-08-05 2022-08-05 识别模型的预训练方法、识别方法、装置、介质和设备
CN202210947783.9 2022-08-05

Publications (1)

Publication Number Publication Date
WO2024027663A1 true WO2024027663A1 (zh) 2024-02-08

Family

ID=83860964

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/110347 WO2024027663A1 (zh) 2022-08-05 2023-07-31 识别模型的预训练方法、识别方法、装置、介质和设备

Country Status (2)

Country Link
CN (1) CN115312127B (zh)
WO (1) WO2024027663A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115312127B (zh) * 2022-08-05 2023-04-18 抖音视界有限公司 识别模型的预训练方法、识别方法、装置、介质和设备
CN115937689B (zh) * 2022-12-30 2023-08-11 安徽农业大学 一种农业害虫智能识别与监测技术

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114333982A (zh) * 2021-11-26 2022-04-12 北京百度网讯科技有限公司 蛋白质表示模型预训练、蛋白质相互作用预测方法和装置
CN114780691A (zh) * 2022-06-21 2022-07-22 安徽讯飞医疗股份有限公司 模型预训练及自然语言处理方法、装置、设备及存储介质
CN115312127A (zh) * 2022-08-05 2022-11-08 抖音视界有限公司 识别模型的预训练方法、识别方法、装置、介质和设备

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11587644B2 (en) * 2017-07-28 2023-02-21 The Translational Genomics Research Institute Methods of profiling mass spectral data using neural networks
CN110263324B (zh) * 2019-05-16 2021-02-12 华为技术有限公司 文本处理方法、模型训练方法和装置
CN111401534B (zh) * 2020-04-29 2023-12-05 北京晶泰科技有限公司 一种蛋白质性能预测方法、装置和计算设备
CN111462822B (zh) * 2020-04-29 2023-12-05 北京晶泰科技有限公司 一种蛋白质序列特征的生成方法、装置和计算设备
CN112614538A (zh) * 2020-12-17 2021-04-06 厦门大学 一种基于蛋白质预训练表征学习的抗菌肽预测方法和装置
CN113535972B (zh) * 2021-06-07 2022-08-23 吉林大学 一种融合上下文语义的知识图谱链路预测模型方法及装置
CN114218926A (zh) * 2021-12-17 2022-03-22 中山大学 一种基于分词与知识图谱的中文拼写纠错方法及系统

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114333982A (zh) * 2021-11-26 2022-04-12 北京百度网讯科技有限公司 蛋白质表示模型预训练、蛋白质相互作用预测方法和装置
CN114780691A (zh) * 2022-06-21 2022-07-22 安徽讯飞医疗股份有限公司 模型预训练及自然语言处理方法、装置、设备及存储介质
CN115312127A (zh) * 2022-08-05 2022-11-08 抖音视界有限公司 识别模型的预训练方法、识别方法、装置、介质和设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG NINGYU, BI ZHEN, LIANG XIAOZHUAN, CHENG SIYUAN, HONG HAOSEN, DENG SHUMIN, LIAN JIAZHANG, ZHANG QIANG, CHEN HUAJUN: "OntoProtein: Protein Pretraining With Gene Ontology Embedding", ICLR 2022, 23 January 2022 (2022-01-23), pages 1 - 18, XP093136495, ISSN: 2331-8422 *

Also Published As

Publication number Publication date
CN115312127B (zh) 2023-04-18
CN115312127A (zh) 2022-11-08

Similar Documents

Publication Publication Date Title
WO2024027663A1 (zh) 识别模型的预训练方法、识别方法、装置、介质和设备
US11775761B2 (en) Method and apparatus for mining entity focus in text
CN112115257B (zh) 用于生成信息评估模型的方法和装置
WO2022247562A1 (zh) 多模态数据检索方法、装置、介质及电子设备
WO2023273578A1 (zh) 语音识别方法、装置、介质及设备
CN111144124B (zh) 机器学习模型的训练方法、意图识别方法及相关装置、设备
WO2023274187A1 (zh) 基于自然语言推理的信息处理方法、装置和电子设备
WO2023040742A1 (zh) 文本数据的处理方法、神经网络的训练方法以及相关设备
CN112668339A (zh) 语料样本确定方法、装置、电子设备及存储介质
CN112883968A (zh) 图像字符识别方法、装置、介质及电子设备
CN117114063A (zh) 用于训练生成式大语言模型和用于处理图像任务的方法
CN113255327B (zh) 文本处理方法、装置、电子设备及计算机可读存储介质
CN111090993A (zh) 属性对齐模型训练方法及装置
WO2022012178A1 (zh) 用于生成目标函数的方法、装置、电子设备和计算机可读介质
CN112580343A (zh) 模型生成方法、问答质量判断方法、装置、设备及介质
CN116258911A (zh) 图像分类模型的训练方法、装置、设备及存储介质
CN116244431A (zh) 文本分类方法、装置、介质及电子设备
CN115858732A (zh) 实体链接方法及设备
CN116186220A (zh) 信息检索方法、问答处理方法、信息检索装置及系统
CN117743555B (zh) 答复决策信息发送方法、装置、设备和计算机可读介质
CN111522887B (zh) 用于输出信息的方法和装置
CN111681660B (zh) 语音识别方法、装置、电子设备和计算机可读介质
CN116343905B (zh) 蛋白质特征的预处理方法、装置、介质及设备
CN114625876B (zh) 作者特征模型的生成方法、作者信息处理方法和装置
CN115240042B (zh) 多模态图像识别方法、装置、可读介质和电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23849370

Country of ref document: EP

Kind code of ref document: A1