WO2023109714A1 - Multi-mode information fusion method and system for protein representative learning, and terminal and storage medium - Google Patents

Multi-mode information fusion method and system for protein representative learning, and terminal and storage medium

Info

Publication number
WO2023109714A1
WO2023109714A1 (PCT/CN2022/138208)
Authority
WO
WIPO (PCT)
Prior art keywords
protein
multimodal
modal
learning model
feature extractor
Application number
PCT/CN2022/138208
Other languages
French (fr)
Chinese (zh)
Inventor
胡奕绅
殷鹏
胡帆
Original Assignee
深圳先进技术研究院
Application filed by 深圳先进技术研究院
Publication of WO2023109714A1 publication Critical patent/WO2023109714A1/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Definitions

  • the application belongs to the technical field of medical data processing, and specifically relates to a multimodal information fusion method, system, terminal and storage medium for protein representation learning.
  • Protein representation learning is a very important research topic in the field of bioinformatics. It plays a key role in predicting protein-protein interactions, protein-drug interactions, and protein-gene interactions. A good data representation should be able to cover the information of the object itself in multiple directions, so that the reasoning process of downstream tasks has more available feature support.
  • in the computational study of proteins, proteins must be converted into data that computers can process, and features must be extracted from the raw data before it is input into a model; this process is called representation learning, and good representation learning is of major help to the performance of downstream tasks.
  • the representation learning of proteins can be divided into single-modal representation and multi-modal representation.
  • the sequence of a protein is similar to a text sequence, which can be modeled using techniques from the NLP field.
  • some studies used CNNs to perform one-dimensional convolution on protein sequences and extract sequence features for subsequent tasks; other studies used RNN models, which excel at time-series data, and likewise achieved good results.
  • recently, many researchers have tried the Transformer, which has made breakthroughs in the NLP and CV fields, pre-training it on large-scale protein sequences and achieving better results in downstream tasks.
  • the structural modality of the protein is also crucial to understanding the protein itself.
  • there are fewer studies on modeling protein structure than on sequences: some studies convert 3D protein structures into images and then use CNNs to extract features that represent the protein, while other studies flatten the 3D structure into an adjacency matrix of amino acid nodes and then model it with graph neural network algorithms.
  • the key is how to fuse unimodal information.
  • most studies use different feature extractors to extract the unimodal information and then concatenate or sum the embeddings of the different modalities to obtain a new embedding as the multimodal representation; some instead feed the concatenated or summed embedding into a new interaction network, such as a Transformer, to obtain an interactive embedding.
  • One of the purposes of the present application is to provide a multimodal information fusion method for protein representation learning, including the following steps:
  • the unimodal feature extractor is used as a feature extractor for protein sequences
  • the multimodal fusion module updates the amino acid token embedding of the single-modal feature extractor, so that each single modality carries multimodal information, which then serves as the input of the single-modal feature extractor;
  • the training set trains the learning model; the validation set measures the effect of the learning model and selects the best-performing parameters as the parameters of the learning model; and the test set independently tests the generalization ability of the learning model.
  • the step of preprocessing the open source protein data specifically includes the following steps:
  • the sequence data of the protein are extracted from the open-source protein data sets; each sequence is composed of 20 letters representing 20 kinds of amino acids, and the 3D structure of the protein is converted into an adjacency matrix graph.
  • in the step of constructing a single-modal feature extractor, the method specifically includes:
  • the unimodal feature extractor is a pre-trained Transformer model.
  • in the step of constructing the multimodal fusion module, the following steps are specifically included:
  • average pooling is performed on the sequence feature matrix and the structure feature matrix so that each amino acid's feature vector is reduced to one representative value, where D_seq denotes the feature dimension of each amino acid in the sequence and D_struc denotes the feature dimension of each amino acid in the structure;
  • the pooled vectors of the sequence and the structure are concatenated and then transformed by a fully connected network into a vector M_comp containing multimodal information;
  • the multimodal information compression vector M_comp is redistributed to each modality to calibrate the single-modal information, the redistribution introducing a fully connected transformation layer for each modality, with the formulas T_seq = W_seq · M_comp + b_seq and T_struc = W_struc · M_comp + b_struc;
  • the redistributed modal vectors are activated through an activation function and used as gating switches that limit the contribution of each amino acid to the overall task, where σ refers to the sigmoid function and ⊙ refers to the Hadamard product;
  • after multiplication with the activated gating vectors, the reconstructed unimodal vectors are obtained as the input of the next unimodal feature extractor layer.
  • the following steps are specifically included:
  • the original protein data passes through N_e layers of early unimodal feature extractors: the sequence passes through the encoding layers of the Transformer model, the structure passes through graph attention network layers, and the output represents unimodal vector representations from which high-level semantics have been extracted;
  • the single modality is calibrated by the multimodal information and continues through N_l feature extractor layers for further feature mining on the calibrated representations;
  • the [cls] vectors of the two modalities after the calibrated feature mining are concatenated, passed through a feedforward neural network, and then concatenated with the [cls] vectors obtained from the early unimodal feature extractors;
  • after a second feedforward neural network, the learning model is obtained.
  • An auxiliary loss is added to update the parameters of the learning model.
  • the second purpose of this application is to provide a multimodal information fusion system for protein representation learning, including:
  • Data processing unit: used to preprocess open-source protein data;
  • Classification unit: used to divide the protein data set into a training set, a validation set and a test set;
  • Single-modal feature extractor construction unit: used to construct a single-modal feature extractor that serves as the feature extractor for protein sequences;
  • Multimodal fusion module construction unit: used to construct a multimodal fusion module, which updates the amino acid token embedding of the single-modal feature extractor so that each single modality carries multimodal information, which then serves as the input of the single-modal feature extractor;
  • Learning model construction unit: used to build a learning model based on the multimodal fusion module;
  • Training unit: the training set trains the learning model; the validation set measures the effect of the learning model and selects the best-performing parameters as the parameters of the learning model; and the test set independently tests the generalization ability of the learning model.
  • the third purpose of the present application is to provide a terminal, the terminal including a processor and a memory coupled to the processor, wherein
  • the memory stores program instructions for implementing the multimodal information fusion method for protein representation learning;
  • the processor is configured to execute the program instructions stored in the memory to control the multimodal information fusion.
  • the fourth purpose of the present application is to provide a storage medium storing program instructions executable by a processor, the program instructions being used to execute the multimodal information fusion method for protein representation learning.
  • the multimodal information fusion method, system, terminal and storage medium for protein representation learning provided by this application use a strategy of early extraction, mid-term fusion and late prediction, so that each single-modal model can fully extract the high-level semantic information of its own modality before fusion, after which a feedforward neural network performs task prediction at the late stage; at the same time, a multimodal fusion module is proposed that enables fine-grained interaction of the different modal information at every network layer during mid-term fusion, so that the modalities are better fused and propagated;
  • at the last layer of the feature extractor in the late prediction stage, the fused multimodal embedding and the earlier single-modal embeddings are concatenated together as the representation of the protein itself, so that the original single-modal information is preserved to the greatest extent.
  • in addition, when the loss function is designed, the feature extraction networks of the different layers in the late prediction stage each predict a result, which serves as an auxiliary loss added to the final loss; the introduction of the auxiliary loss helps the model converge faster and reach better performance.
  • Fig. 1 is a flow chart of the steps of the multimodal information fusion method for protein representation learning provided by the embodiment of the present application.
  • Fig. 2 is an adjacency matrix diagram of proteins provided in the examples of the present application.
  • Fig. 3 is a schematic diagram of a multi-modal fusion module provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a learning module provided by an embodiment of the present application.
  • Fig. 5 is a schematic structural diagram of the multi-modal information fusion system for protein representation learning provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a terminal structure provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a storage medium provided by an embodiment of the present application.
  • first and second are used for descriptive purposes only, and cannot be interpreted as indicating or implying relative importance or implicitly specifying the quantity of indicated technical features. Thus, a feature defined as “first” and “second” may explicitly or implicitly include one or more of these features.
  • “plurality” means two or more, unless otherwise specifically defined.
  • Figure 1 is a flow chart of the steps of the multimodal information fusion method for protein representation learning provided by this application, which includes the following steps:
  • Step S110: Preprocess the open-source protein data.
  • these data sets cover various tasks, including predicting protein fluorescence, protein secondary structure, remote protein homology and protein stability; protein sequence data are extracted from these data sets, each sequence consisting of 20 letters (representing 20 amino acids), and the 3D structure of the protein is converted into an adjacency matrix, also known as a contact map.
  • Step S120: Divide the protein data set into a training set, a validation set and a test set.
  • the processed data set is divided into training, validation and test sets: the training set is used to adjust the model parameters to fit the target, the validation set is used to select the optimal parameters, and the test set is used to evaluate the final effect of the model.
  • Step S130: Construct a single-modal feature extractor, which serves as the feature extractor for protein sequences.
  • this application selects TAPE, a pre-trained Transformer model, as the feature extractor for protein sequences.
  • the pre-training strategy gives the model prior information before training, which benefits the model's inference; the Transformer can capture amino acid relationships across the whole sequence and supports parallelization.
  • the topological nature of protein structure makes it well suited to graph algorithms.
  • this application uses an effective graph neural network, specifically the graph attention network GAT, which likewise uses the attention mechanism to capture the relationships between neighbor nodes and target nodes.
  • Step S140: Construct a multimodal fusion module; the multimodal fusion module updates the amino acid token embedding of the single-modal feature extractor, so that each single modality carries multimodal information, which then serves as the input of the single-modal feature extractor.
  • Step 1: perform average pooling on the sequence feature matrix and the structure feature matrix, so that each amino acid's feature vector is reduced to one representative value, where
  • D_seq represents the feature dimension of each amino acid in the sequence,
  • D_struc represents the feature dimension of each amino acid in the structure;
  • Step 2: concatenate the pooled vectors of the sequence and the structure, and transform them through a fully connected network into a vector containing multimodal information, with the formula M_comp = W[M_seq, M_struc] + b;
  • this step is a process of multimodal information interaction and compression.
  • Step 3: redistribute the multimodal information compression vector M_comp to each modality to calibrate the single-modal information; the redistribution introduces a fully connected transformation layer for each modality, with the formulas T_seq = W_seq · M_comp + b_seq and T_struc = W_struc · M_comp + b_struc;
  • Step 4: activate the redistributed modal vectors through an activation function and use them as gating switches to limit the contribution of each amino acid to the overall task, where
  • σ refers to the sigmoid function,
  • ⊙ refers to the Hadamard product;
  • Step 5: after multiplication with the activated gating vectors, the reconstructed unimodal vectors are obtained as the input of the next unimodal feature extractor layer.
  • in the step of constructing the multimodal fusion module, this application applies a method of calibration and reconstruction; specifically, multimodal information interaction is used to update the amino acid token embedding of each single modality, so that single-modal information that might originally be ambiguous carries multimodal guidance and becomes clearer for pattern recognition.
  • Step S150: Construct a learning model based on the multimodal fusion module.
  • Figure 4 is a schematic diagram of the principle of building a learning model based on the multimodal fusion module, which specifically includes the following steps:
  • Step S151: Add a special token, named [cls], to the original inputs of the protein sequence and structure; the [cls] of the sequence is placed at the front of the entire sequence, and the [cls] of the structure establishes a virtual full connection with all amino acids.
  • Step S152: The original protein data passes through N_e layers of early unimodal feature extractors; the sequence passes through the encoding layers of the Transformer model, the structure passes through graph attention network layers, and the output represents unimodal vector representations from which high-level semantics have been extracted.
  • Step S153: Insert the multimodal fusion module for mid-term fusion.
  • each layer adds interaction between the modalities by inserting the multimodal fusion network described in Fig. 3, passing through a total of N_m layers.
  • Step S154: After the mid-term fusion, the single modality has been calibrated by the multimodal information and continues through N_l feature extractor layers for further feature mining on the calibrated representations.
  • Step S155: Concatenate the [cls] vectors of the two modalities after the calibrated feature mining, pass them through a feedforward neural network, and then concatenate the result with the [cls] vectors obtained from the early unimodal feature extractors.
  • the concatenated feature vector, passed through a learnable feedforward neural network, yields a more holistic feature vector and more accurate prediction results.
  • since multimodal processing may lose some single-modal information during transmission, concatenation with the single-modal vectors replenishes that information.
  • Step S156: Obtain the learning model through a second feedforward neural network.
  • the multimodal fusion strategy provided by the embodiments of the present application, through early extraction, mid-term fusion and late prediction, lets the model learn single-modal and multimodal information more fully;
  • the multimodal representation obtained at the late stage is not used directly for prediction; instead, the early unimodal representations are added, so that the unimodal information lost during network propagation can be supplemented at the end.
  • Step S157: Add an auxiliary loss to update the parameters of the learning model.
  • each feature extraction layer in the later prediction stage of this application will output the results to predict the final goal.
  • the resulting loss is used as an auxiliary loss, which is added to the main loss to update the parameters of the model.
  • Step S160: the training set trains the learning model; the validation set measures the effect of the learning model and selects the best-performing parameters as the parameters of the learning model; and the test set independently tests the generalization ability of the learning model.
  • FIG. 5 is a schematic structural diagram of the multimodal information fusion system for protein representation learning provided by this application, including: a data processing unit 110, used to preprocess open-source protein data; a classification unit 120, used to divide the protein data set into a training set, a validation set and a test set; a single-modal feature extractor construction unit 130, used to construct a single-modal feature extractor that serves as the feature extractor for protein sequences; a multimodal fusion module construction unit 140, used to construct a multimodal fusion module that updates the amino acid token embedding of the single-modal feature extractor so that each single modality carries multimodal information, which then serves as the input of the single-modal feature extractor; a learning model construction unit 150, used to build a learning model based on the multimodal fusion module; and a training unit 160, in which the training set trains the learning model, the validation set measures the effect of the learning model and selects the best-performing parameters as the parameters of the learning model, and the test set independently tests the generalization ability of the learning model.
  • FIG. 6 is a schematic diagram of a terminal structure according to an embodiment of the present application.
  • the terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51 .
  • the memory 52 stores program instructions for implementing the multimodal information fusion method for protein representation learning.
  • the processor 51 is configured to execute the program instructions stored in the memory 52 to control the multimodal information fusion.
  • the processor 51 may also be referred to as a CPU (central processing unit).
  • the processor 51 may be an integrated circuit chip with signal processing capabilities.
  • the processor 51 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • FIG. 7 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • the storage medium of the embodiment of the present application stores a program file 61 capable of implementing all of the above methods; the program file 61 can be stored in the storage medium in the form of a software product and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage media include media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks or optical discs, as well as terminal devices such as computers, servers, mobile phones and tablets.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Physiology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The present application provides a multimodal information fusion method and system for protein representation learning, together with a terminal and a storage medium. A strategy of early extraction, mid-term fusion and late prediction is used, so that each single-modal model fully extracts the high-level semantic information of its own modality before fusion, and task prediction is then performed by a feedforward neural network at the late stage. A multimodal fusion module is further provided, so that during mid-term fusion the different modal information at each network layer can interact in a fine-grained way and the modalities are better fused and propagated. At the last layer of the feature extractor in the late prediction stage, the fused multimodal embedding and the earlier single-modal embeddings are concatenated together to represent the protein itself, so that the original single-modal information is retained to the greatest extent.

Description

Multimodal information fusion method, system, terminal and storage medium for protein representation learning

Technical Field
The present application belongs to the technical field of medical data processing, and specifically relates to a multimodal information fusion method, system, terminal and storage medium for protein representation learning.
Background Art
Protein representation learning is a very important research topic in bioinformatics; it plays a key role in predicting protein-protein, protein-drug and protein-gene interactions. A good data representation should cover the information of the object itself from multiple directions, so that the reasoning process of downstream tasks has more usable features to rely on.
In the computational study of proteins, proteins must be converted into data that computers can process, and features must be extracted from the raw data before it is input into a model. This process is called representation learning, and good representation learning is of major help to the performance of downstream tasks. The representation learning of proteins can be divided into single-modal representation and multimodal representation.
On the single-modal side, the features of sequences and structures are mainly learned separately. The sequence of a protein resembles a text sequence and can be modeled with techniques borrowed from the NLP field. In the past, some studies applied CNNs to perform one-dimensional convolution on protein sequences and used the extracted sequence features for subsequent tasks; other studies used RNN models, which excel at time-series data, and likewise achieved good results. Recently, many researchers have tried the Transformer, which has made breakthroughs in the NLP and CV fields, pre-training it on large-scale protein sequences and achieving better results in downstream tasks. In contrast to the sequence modality, the structural modality of a protein is equally crucial to understanding the protein itself. Modeling studies of protein structure are fewer than those of sequences: some convert the 3D protein structure into images and then use CNNs to extract features that represent the protein, while others flatten the 3D structure into an adjacency matrix of amino acid nodes and model it with graph neural network algorithms.
On the multimodal side, the key is how to fuse the single-modal information. Most studies use different feature extractors to extract the single-modal information and then concatenate or sum the embeddings of the different modalities to obtain a new embedding as the multimodal representation; some instead feed the concatenated or summed embedding into a new interaction network, such as a Transformer, to obtain an interactive embedding.
Many current multimodal fusion methods simply concatenate or sum the single-modal representations; such methods cannot learn the interaction information between modalities in a fine-grained way, and the resulting representation vector loses a great deal of information. Some studies do consider learning the interaction between modalities: they concatenate the data of the two modalities in the initial embedding layer of the raw data and then pass it into the Transformer's encoding layers to learn the relationships between tokens. Nevertheless, because this approach fuses the modalities at an early stage, each modality is fused with the others before its high-level semantic information has been fully extracted, and the performance on subsequent tasks is unsatisfactory. In addition, essentially all studies use the extracted multimodal representation directly downstream, but no matter how well the multimodal features are learned, some single-modal information is always lost in the transmission process.
Summary of the Invention
In view of this, it is necessary, given the defects of the prior art, to provide a multimodal information fusion method for protein representation learning that can preserve the original single-modal information to the greatest extent.
To solve the above problems, the present application adopts the following technical solutions:
One purpose of the present application is to provide a multimodal information fusion method for protein representation learning, including the following steps:
preprocessing open-source protein data;

dividing the protein data set into a training set, a validation set and a test set;

constructing a single-modal feature extractor, the single-modal feature extractor serving as the feature extractor for protein sequences;

constructing a multimodal fusion module, the multimodal fusion module updating the amino acid token embeddings of the single-modal feature extractors so that each single modality carries multimodal information, which then serves as the input of the single-modal feature extractors;

building a learning model based on the multimodal fusion module;

the training set training the learning model, the validation set measuring the effect of the learning model and selecting the best-performing parameters as the parameters of the learning model, and the test set being used to independently test the generalization ability of the learning model.
In some of these embodiments, the step of preprocessing the open-source protein data specifically includes the following step:

extracting protein sequence data from the open-source protein data sets, each sequence being composed of 20 letters representing 20 kinds of amino acids, and converting the 3D structure of the protein into an adjacency matrix graph.
In some of these embodiments, the step of constructing the single-modal feature extractor specifically includes:

the single-modal feature extractor is a pre-trained Transformer model.
In some of these embodiments, the step of constructing the multimodal fusion module specifically includes the following steps:

Average pooling is performed on the sequence feature matrix and the structure feature matrix, so that the feature vector of each amino acid is reduced to one representative value. Let $F_{seq} \in \mathbb{R}^{L_{seq} \times D_{seq}}$ and $F_{struc} \in \mathbb{R}^{L_{struc} \times D_{struc}}$ respectively denote the sequence feature matrix and the structure feature matrix before they enter the multimodal module, where $D_{seq}$ denotes the feature dimension of each amino acid in the sequence, $D_{struc}$ denotes the feature dimension of each amino acid in the structure, and $L_{seq}$ and $L_{struc}$ denote the amino acid lengths of the sequence and the structure; the two are in fact equal, i.e., $L_{seq} = L_{struc} = L$. The pooling is

$$M_{seq} = \operatorname{AvgPool}(F_{seq}) \in \mathbb{R}^{L_{seq}}, \qquad M_{struc} = \operatorname{AvgPool}(F_{struc}) \in \mathbb{R}^{L_{struc}}.$$

The pooled vectors of the sequence and the structure are concatenated and then transformed by a fully connected network into a vector containing multimodal information:

$$M_{comp} = W[M_{seq}, M_{struc}] + b,$$

where $W \in \mathbb{R}^{D_{comp} \times (L_{seq} + L_{struc})}$ and $b \in \mathbb{R}^{D_{comp}}$, with $D_{comp} = (L_{seq} + L_{struc})/5$.

The multimodal information compression vector $M_{comp}$ is then redistributed to each modality to calibrate the single-modal information; the redistribution introduces a fully connected transformation layer for each modality:

$$T_{seq} = W_{seq} M_{comp} + b_{seq} \in \mathbb{R}^{L_{seq}}, \qquad T_{struc} = W_{struc} M_{comp} + b_{struc} \in \mathbb{R}^{L_{struc}}.$$

The redistributed modal vectors are activated through an activation function and used as gating switches that limit the contribution of each amino acid to the overall task:

$$\widetilde{F}_{seq} = F_{seq} \odot \sigma(T_{seq}), \qquad \widetilde{F}_{struc} = F_{struc} \odot \sigma(T_{struc}),$$

where $\sigma$ denotes the sigmoid function and $\odot$ denotes the Hadamard product, the gate value of each amino acid being broadcast across its feature dimension.

After multiplication with the activated gating vectors, the reconstructed unimodal vectors are obtained and serve as the input of the next unimodal feature extractor layer.
In some of these embodiments, the step of building a learning model based on the multimodal fusion module specifically includes the following steps:

A special token named [cls] is added to the original inputs of the protein sequence and structure; the [cls] of the sequence is placed at the front of the entire sequence, and the [cls] of the structure establishes a virtual full connection with all amino acids.

The original protein data passes through $N_e$ layers of early unimodal feature extractors; the sequence passes through the encoding layers of the Transformer model, the structure passes through graph attention network layers, and the output represents unimodal vector representations from which high-level semantics have been extracted.

The multimodal fusion module is inserted for mid-term fusion.

After the mid-term fusion, the single modality has been calibrated by the multimodal information and continues through $N_l$ feature extractor layers for further feature mining on the calibrated representations.

The [cls] vectors of the two modalities after the calibrated feature mining are concatenated, passed through a feedforward neural network, and then concatenated with the [cls] vectors obtained from the early unimodal feature extractors.

A second feedforward neural network then yields the learning model.
In some of these embodiments, after the step of building a learning model based on the multimodal fusion module is completed, the following step is further included:

An auxiliary loss is added to update the parameters of the learning model.
The second purpose of the present application is to provide a multimodal information fusion system for protein representation learning, including:

a data processing unit, used to preprocess open-source protein data;

a classification unit, used to divide the protein data set into a training set, a validation set and a test set;

a single-modal feature extractor construction unit, used to construct a single-modal feature extractor that serves as the feature extractor for protein sequences;

a multimodal fusion module construction unit, used to construct a multimodal fusion module that updates the amino acid token embeddings of the single-modal feature extractors, so that each single modality carries multimodal information, which then serves as the input of the single-modal feature extractors;

a learning model construction unit, used to build a learning model based on the multimodal fusion module;

a training unit, in which the training set trains the learning model, the validation set measures the effect of the learning model and selects the best-performing parameters as the parameters of the learning model, and the test set independently tests the generalization ability of the learning model.
The third purpose of the present application is to provide a terminal, the terminal including a processor and a memory coupled to the processor, wherein

the memory stores program instructions for implementing the multimodal information fusion method for protein representation learning, and

the processor is configured to execute the program instructions stored in the memory to control the multimodal information fusion.
The fourth purpose of the present application is to provide a storage medium storing program instructions executable by a processor, the program instructions being used to execute the multimodal information fusion method for protein representation learning.
The above technical solutions of the present application have the following effects:

The multimodal information fusion method, system, terminal and storage medium for protein representation learning provided by the present application use a strategy of early extraction, mid-term fusion and late prediction, so that each single-modal model fully extracts the high-level semantic information of its own modality before fusion, after which a feedforward neural network performs task prediction at the late stage. At the same time, a multimodal fusion module is proposed that enables fine-grained interaction of the different modal information at every network layer during mid-term fusion, so that the modalities are better fused and propagated. At the last layer of the feature extractor in the late prediction stage, the fused multimodal embedding and the earlier single-modal embeddings are concatenated together as the representation of the protein itself, which preserves the original single-modal information to the greatest extent.

In addition, when the loss function is designed, the feature extraction networks of the different layers in the late prediction stage each predict a result, which serves as an auxiliary loss added to the final loss; the introduction of the auxiliary loss helps the model converge faster and reach better performance.
Description of Drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the accompanying drawings needed in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the steps of the multimodal information fusion method for protein representation learning provided by an embodiment of the present application.

Fig. 2 is an adjacency matrix diagram of a protein provided by an embodiment of the present application.

Fig. 3 is a schematic diagram of the multimodal fusion module provided by an embodiment of the present application.

Fig. 4 is a schematic diagram of the learning module provided by an embodiment of the present application.

Fig. 5 is a schematic structural diagram of the multimodal information fusion system for protein representation learning provided by an embodiment of the present application.

Fig. 6 is a schematic diagram of a terminal structure provided by an embodiment of the present application.

Fig. 7 is a schematic structural diagram of a storage medium provided by an embodiment of the present application.
Detailed Description

The embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals denote identical or similar elements or elements with identical or similar functions throughout. The embodiments described below with reference to the drawings are exemplary and are intended to explain the present application; they should not be construed as limiting it.

In the description of the present application, it should be understood that orientation or position terms such as "upper", "lower", "horizontal", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings, are only for convenience and simplification of the description, and do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be construed as limiting the present application.

In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly specifying the quantity of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features. In the description of the present application, "plurality" means two or more, unless otherwise specifically defined.
Please refer to Fig. 1, a flow chart of the steps of the multimodal information fusion method for protein representation learning provided by the present application, which includes the following steps:

Step S110: Preprocess the open-source protein data.
In this embodiment, the open-source protein data sets cover various tasks, including predicting protein fluorescence, protein secondary structure, remote protein homology and protein stability. Protein sequence data are extracted from these data sets; each sequence consists of 20 letters (representing 20 amino acids), and the 3D structure of each protein is converted into an adjacency matrix, also called a contact map.

As shown in Fig. 2, the contact map indicates whether amino acids are in contact with each other in space: white regions indicate contact and black regions indicate no contact.
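The patent does not specify the contact criterion; a common convention in the literature is to mark two residues as contacting when their Cα atoms lie within a fixed distance such as 8 Å. Below is a minimal sketch under that assumption; the function name and threshold are illustrative, not taken from the application.

```python
import numpy as np

def contact_map(ca_coords: np.ndarray, threshold: float = 8.0) -> np.ndarray:
    """Binary adjacency matrix from (L, 3) C-alpha coordinates: entry (i, j)
    is 1 when residues i and j lie within `threshold` angstroms of each other."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]  # (L, L, 3) displacements
    dist = np.linalg.norm(diff, axis=-1)                  # (L, L) pairwise distances
    return (dist < threshold).astype(np.float32)          # diagonal is 1 (self-contact)
```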
Step S120: Divide the protein data set into a training set, a validation set and a test set.

It can be understood that the processed data set is divided into training, validation and test sets: the training set is used to adjust the model parameters to fit the target, the validation set is used to select the optimal parameters, and the test set is used to evaluate the final effect of the model.
Step S130: Construct single-modal feature extractors; one single-modal feature extractor serves as the feature extractor for protein sequences.

In this embodiment, the application selects TAPE, a pre-trained Transformer model, as the feature extractor for protein sequences. The pre-training strategy gives the model prior information before training, which benefits the model's inference; the Transformer can capture amino acid relationships across the whole sequence and supports parallelization. For protein structure, the topological nature of the data makes it well suited to graph algorithms; this application selects an effective graph neural network, specifically the graph attention network GAT, which likewise uses the attention mechanism to capture the relationships between neighbor nodes and target nodes.
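As an illustration of the structure branch, the sketch below implements one dense graph-attention layer over the contact map in PyTorch. It follows the general GAT formulation (project node features, score every residue pair, mask to contacting pairs, softmax, aggregate) rather than the exact network used in this application; the class name and sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseGATLayer(nn.Module):
    """One graph-attention layer over a dense (L, L) contact map."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out, bias=False)
        self.att_src = nn.Linear(d_out, 1, bias=False)
        self.att_dst = nn.Linear(d_out, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (L, d_in) amino-acid features; adj: (L, L) binary contact map.
        h = self.proj(x)                                      # (L, d_out)
        scores = self.att_src(h) + self.att_dst(h).T          # (L, L) pair logits
        scores = F.leaky_relu(scores, negative_slope=0.2)
        scores = scores.masked_fill(adj == 0, float("-inf"))  # attend to contacts only
        alpha = torch.softmax(scores, dim=-1)                 # attention weights
        return F.elu(alpha @ h)                               # aggregate neighbors
```

Because the contact map's diagonal is nonzero (a residue always contacts itself), every softmax row has at least one valid entry.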
Step S140: Construct a multimodal fusion module; the multimodal fusion module updates the amino acid token embeddings of the single-modal feature extractors, so that each single modality carries multimodal information, which then serves as the input of the single-modal feature extractors.

Please refer to Fig. 3. The step of constructing the multimodal fusion module consists mainly of four stages, pooling, compression, redistribution and reconstruction, which specifically include the following steps:
Step 1: Perform average pooling on the sequence feature matrix and the structure feature matrix, so that the feature vector of each amino acid is reduced to one representative value. Assume $F_{seq} \in \mathbb{R}^{L_{seq} \times D_{seq}}$ and $F_{struc} \in \mathbb{R}^{L_{struc} \times D_{struc}}$ respectively denote the sequence feature matrix and the structure feature matrix before they enter the multimodal module, where $D_{seq}$ denotes the feature dimension of each amino acid in the sequence, $D_{struc}$ denotes the feature dimension of each amino acid in the structure, and $L_{seq}$ and $L_{struc}$ denote the amino acid lengths of the sequence and the structure; the two are in fact equal, i.e., $L_{seq} = L_{struc} = L$. The pooling is

$$M_{seq} = \operatorname{AvgPool}(F_{seq}) \in \mathbb{R}^{L_{seq}}, \qquad M_{struc} = \operatorname{AvgPool}(F_{struc}) \in \mathbb{R}^{L_{struc}}.$$

Step 2: Concatenate the pooled vectors of the sequence and the structure, and transform them through a fully connected network into a vector containing multimodal information:

$$M_{comp} = W[M_{seq}, M_{struc}] + b,$$

where $W \in \mathbb{R}^{D_{comp} \times (L_{seq} + L_{struc})}$ and $b \in \mathbb{R}^{D_{comp}}$, with $D_{comp} = (L_{seq} + L_{struc})/5$.

It can be understood that these choices limit the size of the model and improve its generalization ability; this step is a process of multimodal information interaction and compression.

Step 3: Redistribute the multimodal information compression vector $M_{comp}$ back to each modality to calibrate the single-modal information. The redistribution introduces a fully connected transformation layer for each modality:

$$T_{seq} = W_{seq} M_{comp} + b_{seq} \in \mathbb{R}^{L_{seq}}, \qquad T_{struc} = W_{struc} M_{comp} + b_{struc} \in \mathbb{R}^{L_{struc}}.$$

Step 4: Activate the redistributed modal vectors through an activation function and use them as gating switches that limit the contribution of each amino acid to the overall task:

$$\widetilde{F}_{seq} = F_{seq} \odot \sigma(T_{seq}), \qquad \widetilde{F}_{struc} = F_{struc} \odot \sigma(T_{struc}),$$

where $\sigma$ refers to the sigmoid function and $\odot$ refers to the Hadamard product, the gate value of each amino acid being broadcast across its feature dimension.
Step 5: After multiplication with the activated gating vectors, the reconstructed unimodal vectors are obtained and serve as the input of the next unimodal feature extractor layer.

It can be understood that the step of constructing the multimodal fusion module in this application is a method of calibration and reconstruction: multimodal information interaction is used to update the amino acid token embedding of each single modality, so that single-modal information that might originally be ambiguous carries multimodal guidance and becomes clearer for pattern recognition.
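To make the five steps concrete, here is a minimal PyTorch sketch of the module, assuming $L_{seq} = L_{struc} = L$ and using nn.Linear for the fully connected layers; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Pooling -> compression -> redistribution -> gated reconstruction
    (Steps 1-5 above), for one sequence/structure feature-matrix pair."""
    def __init__(self, L: int):
        super().__init__()
        d_comp = (2 * L) // 5                     # D_comp = (L_seq + L_struc) / 5
        self.compress = nn.Linear(2 * L, d_comp)  # W, b            (Step 2)
        self.to_seq = nn.Linear(d_comp, L)        # W_seq, b_seq    (Step 3)
        self.to_struc = nn.Linear(d_comp, L)      # W_struc, b_struc

    def forward(self, f_seq: torch.Tensor, f_struc: torch.Tensor):
        # f_seq: (L, D_seq), f_struc: (L, D_struc) per-amino-acid features.
        m = torch.cat([f_seq.mean(dim=1), f_struc.mean(dim=1)])       # Step 1: pool
        m_comp = self.compress(m)                                     # Step 2: compress
        g_seq = torch.sigmoid(self.to_seq(m_comp)).unsqueeze(-1)      # Steps 3-4:
        g_struc = torch.sigmoid(self.to_struc(m_comp)).unsqueeze(-1)  # gating vectors
        return f_seq * g_seq, f_struc * g_struc                       # Step 5: recalibrate
```

The module returns recalibrated feature matrices with the same shapes as its inputs, so it can be slotted between consecutive unimodal extractor layers, as the following steps describe.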
Step S150: Build a learning model based on the multimodal fusion module.

Please refer to Fig. 4, a schematic diagram of the principle of building a learning model based on the multimodal fusion module, which specifically includes the following steps:

Step S151: Add a special token, named [cls], to the original inputs of the protein sequence and structure; the [cls] of the sequence is placed at the front of the entire sequence, and the [cls] of the structure establishes a virtual full connection with all amino acids.

It can be understood that the purpose of introducing [cls] is to let [cls] represent the entire modality in the subsequent prediction.
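A sketch of this step, assuming per-modality learnable [cls] embeddings; all names are illustrative:

```python
import torch

def add_cls(f_seq, f_struc, adj, cls_seq, cls_struc):
    """Prepend [cls] embeddings (cls_seq: (D_seq,), cls_struc: (D_struc,))
    and give the structure [cls] a virtual full connection to all residues."""
    f_seq = torch.cat([cls_seq.unsqueeze(0), f_seq], dim=0)        # (L+1, D_seq)
    f_struc = torch.cat([cls_struc.unsqueeze(0), f_struc], dim=0)  # (L+1, D_struc)
    L = adj.shape[0]
    adj_cls = torch.ones(L + 1, L + 1, dtype=adj.dtype)            # row/col 0 = [cls]
    adj_cls[1:, 1:] = adj                                          # keep real contacts
    return f_seq, f_struc, adj_cls
```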
Step S152: The original protein data passes through $N_e$ layers of early unimodal feature extractors; the sequence passes through the encoding layers of the Transformer model, the structure passes through graph attention network layers, and the output represents unimodal vector representations from which high-level semantics have been extracted.

Step S153: Insert the multimodal fusion module for mid-term fusion.

It can be understood that in the multimodal fusion stage, i.e. the mid-term fusion stage, building on the earlier early extraction, every layer adds interaction between the modalities by inserting the multimodal fusion network described in Fig. 3; a total of $N_m$ such layers are passed.

Step S154: After the mid-term fusion, each single modality has been calibrated by the multimodal information and continues through $N_l$ feature extractor layers for further feature mining on the calibrated representations.

Step S155: Concatenate the [cls] vectors of the two modalities after the calibrated feature mining, pass them through a feedforward neural network, and then concatenate the result with the [cls] vectors obtained from the early unimodal feature extractors.

It can be understood that, because the concatenated vector is relatively fragmented, passing it through a learnable feedforward neural network yields a more holistic feature vector and more accurate prediction results.

It can be understood that, because multimodal processing may lose some single-modal information during transmission, concatenation with the single-modal vectors replenishes that information.

Step S156: Pass the result through a second feedforward neural network to obtain the learning model.

It can be understood that the multimodal fusion strategy provided by the embodiments of the present application, through early extraction, mid-term fusion and late prediction, lets the model learn single-modal and multimodal information more fully; the multimodal representation obtained at the late stage is not used directly for prediction, but is combined with the early unimodal representations, so that the unimodal information lost during network propagation can be supplemented at the end.
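Putting the stages together, below is a hedged sketch of the overall forward pass that reuses the MultimodalFusion sketch above. It assumes a sequence layer callable as layer(features) and a structure layer callable as layer(features, adj), with the [cls] tokens occupying row 0; the layer counts and all names are illustrative, not taken from the patent.

```python
import copy
import torch
import torch.nn as nn

class ProteinFusionModel(nn.Module):
    """Early extraction -> mid-term fusion -> late prediction (Fig. 4 sketch)."""
    def __init__(self, seq_layer, struc_layer, L, d_seq, d_struc, d_out,
                 n_e=4, n_m=4, n_l=4):
        super().__init__()
        def clone(m, n):
            return nn.ModuleList([copy.deepcopy(m) for _ in range(n)])
        self.seq_early, self.struc_early = clone(seq_layer, n_e), clone(struc_layer, n_e)
        self.seq_mid, self.struc_mid = clone(seq_layer, n_m), clone(struc_layer, n_m)
        self.seq_late, self.struc_late = clone(seq_layer, n_l), clone(struc_layer, n_l)
        self.fusions = nn.ModuleList([MultimodalFusion(L + 1) for _ in range(n_m)])
        self.ffn1 = nn.Linear(d_seq + d_struc, d_seq + d_struc)    # first FFN (S155)
        self.ffn2 = nn.Linear(2 * (d_seq + d_struc), d_out)        # second FFN (S156)

    def forward(self, f_seq, f_struc, adj):
        for ls, lt in zip(self.seq_early, self.struc_early):       # early extraction
            f_seq, f_struc = ls(f_seq), lt(f_struc, adj)
        cls_early = torch.cat([f_seq[0], f_struc[0]])              # early [cls] snapshot
        for fuse, ls, lt in zip(self.fusions, self.seq_mid, self.struc_mid):
            f_seq, f_struc = fuse(ls(f_seq), lt(f_struc, adj))     # mid-term fusion
        for ls, lt in zip(self.seq_late, self.struc_late):         # calibrated mining
            f_seq, f_struc = ls(f_seq), lt(f_struc, adj)
        cls_late = self.ffn1(torch.cat([f_seq[0], f_struc[0]]))    # fused [cls] pair
        return self.ffn2(torch.cat([cls_late, cls_early]))         # supplement + predict
```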
In some embodiments, after the step of building the learning model based on the multimodal fusion module, the method further includes the following step:
Step S157: add auxiliary losses to update the parameters of the learning model.
It can be understood that, because the main network has many parameters and the model is complex, training converges slowly. Therefore, in this application every feature extraction layer in the late prediction stage also outputs its result to predict the final target; each resulting loss serves as an auxiliary loss that is added to the main loss to update the model parameters.
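A sketch of step S157 under the assumption of a binary prediction target; the head shapes and the unweighted summation of losses are illustrative choices.

```python
import torch
import torch.nn as nn

D, N_l = 128, 3
aux_heads = nn.ModuleList(nn.Linear(D, 1) for _ in range(N_l))   # one head per late layer

def total_loss(main_pred, layer_cls, target, criterion=nn.BCEWithLogitsLoss()):
    loss = criterion(main_pred, target)                   # main loss
    for head, cls_vec in zip(aux_heads, layer_cls):
        loss = loss + criterion(head(cls_vec), target)    # auxiliary losses, summed in
    return loss

# Usage: layer_cls holds the [cls] output of each late-stage extractor layer.
target = torch.ones(2, 1)
loss = total_loss(torch.randn(2, 1), [torch.randn(2, D) for _ in range(N_l)], target)
```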
Step S160: train the learning model on the training set, use the validation set to measure the learning model's performance and select the best-performing parameters as the learning model's parameters, and use the test set to independently test the learning model's generalization ability.
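A minimal, self-contained sketch of this protocol with toy stand-ins for the model and data loaders; only the select-by-validation logic is the point here.

```python
import torch
import torch.nn as nn

# Toy stand-ins so the protocol itself runs; the real model and data loaders
# come from the preceding steps.
model = nn.Linear(8, 1)
criterion = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
make = lambda n: [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(n)]
train_loader, val_loader, test_loader = make(8), make(2), make(2)

def train_one_epoch(loader):
    for x, y in loader:
        opt.zero_grad()
        criterion(model(x), y).backward()
        opt.step()

def evaluate(loader):
    with torch.no_grad():
        return sum(criterion(model(x), y).item() for x, y in loader) / len(loader)

best_val, best_state = float("inf"), None
for epoch in range(5):
    train_one_epoch(train_loader)                  # train on the training set
    val = evaluate(val_loader)                     # measure on the validation set
    if val < best_val:                             # keep the best-performing parameters
        best_val = val
        best_state = {k: v.clone() for k, v in model.state_dict().items()}

model.load_state_dict(best_state)
test_metric = evaluate(test_loader)                # independent test of generalization
```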
Referring to Figure 5, a schematic structural diagram of the multimodal information fusion system for protein representation learning provided by this application, which includes: a data processing unit 110, for preprocessing open-source protein data; a classification unit 120, for dividing the protein dataset into a training set, a validation set, and a test set; a unimodal feature extractor construction unit 130, for constructing a unimodal feature extractor that serves as the feature extractor for protein sequences; a multimodal fusion module construction unit 140, for constructing a multimodal fusion module that updates the amino acid token embeddings of the unimodal feature extractor so that each single modality carries multimodal information, which then serves as the input to the unimodal feature extractor; a learning model construction unit 150, for building the learning model based on the multimodal fusion module; and a training unit 160, which trains the learning model on the training set, measures its performance on the validation set, selects the best-performing parameters as the model's parameters, and uses the test set to independently test the model's generalization ability. The detailed implementation has been described in the method description above and is not repeated here.
Referring to Figure 6, a schematic diagram of a terminal structure according to an embodiment of this application. The terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the above multimodal information fusion method for protein representation learning.
The processor 51 is configured to execute the program instructions stored in the memory 52 to control the multimodal information fusion.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capability. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or any conventional processor, or the like.
Referring to Figure 7, a schematic structural diagram of a storage medium according to an embodiment of this application. The storage medium of this embodiment stores a program file 61 capable of implementing all of the above methods. The program file 61 may be stored in the storage medium in the form of a software product and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or some of the steps of the methods of the embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, or terminal devices such as computers, servers, mobile phones, and tablets.
The above are merely embodiments of this application and are not intended to limit it. Various modifications and variations will occur to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall fall within the scope of its claims.

Claims (9)

  1. A multimodal information fusion method for protein representation learning, characterized in that it comprises the following steps:
    preprocessing open-source protein data;
    dividing the protein dataset into a training set, a validation set, and a test set;
    constructing a unimodal feature extractor, the unimodal feature extractor serving as the feature extractor for protein sequences;
    constructing a multimodal fusion module, the multimodal fusion module updating the amino acid token embeddings of the unimodal feature extractor so that each single modality carries multimodal information, which then serves as the input to the unimodal feature extractor;
    building a learning model based on the multimodal fusion module; and
    training the learning model on the training set, measuring the learning model's performance with the validation set and selecting the best-performing parameters as the learning model's parameters, and independently testing the learning model's generalization ability with the test set.
  2. The multimodal information fusion method for protein representation learning according to claim 1, characterized in that the step of preprocessing open-source protein data specifically comprises the following step:
    extracting protein sequence data from the open-source protein dataset, the sequence consisting of 20 English letters representing the 20 amino acids, and converting the 3D structure of the protein into an adjacency matrix graph.
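Outside the claim language, a minimal sketch of this preprocessing, assuming C-alpha coordinates and an 8 Å contact cutoff (a common convention, not a value stated in the claim):

```python
import torch

AA = "ACDEFGHIKLMNPQRSTVWY"                       # the 20 amino acid letters
aa_to_id = {a: i for i, a in enumerate(AA)}

def encode_sequence(seq: str) -> torch.Tensor:
    """Map a protein sequence string to integer tokens."""
    return torch.tensor([aa_to_id[a] for a in seq])

def contact_map(coords: torch.Tensor, cutoff: float = 8.0) -> torch.Tensor:
    """coords: (L, 3) C-alpha positions; adjacency is 1 within the cutoff."""
    dist = torch.cdist(coords, coords)
    return (dist < cutoff).float()
```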
  3. The multimodal information fusion method for protein representation learning according to claim 2, characterized in that in the step of constructing a unimodal feature extractor, specifically:
    the unimodal feature extractor is a pre-trained Transformer model.
  4. The multimodal information fusion method for protein representation learning according to claim 3, characterized in that the step of constructing a multimodal fusion module specifically comprises the following steps:
    performing average pooling on the sequence feature matrix and the structure feature matrix so that each amino acid's feature vector yields one representative value:

    $M_{seq} = \mathrm{AvgPool}(X_{seq}) \in \mathbb{R}^{L_{seq}}, \quad M_{struc} = \mathrm{AvgPool}(X_{struc}) \in \mathbb{R}^{L_{struc}}$

    where $X_{seq} \in \mathbb{R}^{L_{seq} \times D_{seq}}$ and $X_{struc} \in \mathbb{R}^{L_{struc} \times D_{struc}}$ denote the sequence feature matrix and the structure feature matrix before they enter the multimodal module, $D_{seq}$ and $D_{struc}$ denote the feature dimension of each amino acid in the sequence and in the structure, respectively, and $L_{seq}$ and $L_{struc}$ denote the amino acid length in the sequence and in the structure; the two are in fact equal, i.e. $L_{seq} = L_{struc} = L$;
    concatenating the pooled vectors of the sequence and the structure and transforming them through a fully connected network into a vector containing multimodal information:

    $M_{comp} = W[M_{seq}, M_{struc}] + b$

    where $[\cdot,\cdot]$ denotes concatenation, $W \in \mathbb{R}^{D_{comp} \times (L_{seq}+L_{struc})}$, $b \in \mathbb{R}^{D_{comp}}$, and $D_{comp} = (L_{seq}+L_{struc})/5$;
    redistributing the multimodal compressed vector $M_{comp}$ to each modality to calibrate the unimodal information, the redistribution introducing a separate fully connected transformation layer for each modality:

    $T_{seq} = W_{seq} M_{comp} + b_{seq}, \quad T_{seq} \in \mathbb{R}^{L_{seq}}$

    $T_{struc} = W_{struc} M_{comp} + b_{struc}, \quad T_{struc} \in \mathbb{R}^{L_{struc}}$
    activating the redistributed modal vectors through an activation function that acts as a gate limiting each amino acid's contribution to the overall task:

    $\hat{X}_{seq} = \sigma(T_{seq}) \odot X_{seq}$

    $\hat{X}_{struc} = \sigma(T_{struc}) \odot X_{struc}$

    where $\sigma$ denotes the sigmoid function and $\odot$ the Hadamard product; after multiplication with the activated gating vectors, the reconstructed unimodal vectors serve as the input to the next layer of unimodal feature extractors.
  5. The multimodal information fusion method for protein representation learning according to claim 4, characterized in that the step of building a learning model based on the multimodal fusion module specifically comprises the following steps:
    adding a special token named [cls] to the raw inputs of the protein sequence and structure, the sequence [cls] being placed at the very front of the sequence and the structure [cls] being given a virtual full connection to all amino acids;
    passing the raw protein data through N_e layers of early unimodal feature extractors, the sequence passing through the encoding layers of the Transformer model and the structure passing through graph attention network layers, the output representing unimodal vector representations from which high-level semantics have been extracted;
    inserting the multimodal fusion module to perform mid-term fusion;
    after the mid-term fusion, each single modality having been calibrated by multimodal information, continuing through N_l layers of feature extractors for further feature mining on the calibrated representations;
    concatenating the [cls] vectors of the two modalities from the calibrated feature mining, passing the result through a feedforward neural network, and then concatenating it with the [cls] vectors obtained from the early unimodal feature extractors; and
    passing the result through a second feedforward neural network to obtain the learning model.
  6. The multimodal information fusion method for protein representation learning according to claim 5, characterized in that, after the step of building a learning model based on the multimodal fusion module, the method further comprises the following step:
    adding auxiliary losses to update the parameters of the learning model.
  7. A multimodal information fusion system for protein representation learning, characterized in that it comprises:
    a data processing unit, configured to preprocess open-source protein data;
    a classification unit, configured to divide the protein dataset into a training set, a validation set, and a test set;
    a unimodal feature extractor construction unit, configured to construct a unimodal feature extractor that serves as the feature extractor for protein sequences;
    a multimodal fusion module construction unit, configured to construct a multimodal fusion module that updates the amino acid token embeddings of the unimodal feature extractor so that each single modality carries multimodal information, which then serves as the input to the unimodal feature extractor;
    a learning model construction unit, configured to build a learning model based on the multimodal fusion module; and
    a training unit, configured to train the learning model on the training set, measure the learning model's performance with the validation set, select the best-performing parameters as the learning model's parameters, and independently test the learning model's generalization ability with the test set.
  8. A terminal, characterized in that it comprises a processor and a memory coupled to the processor, wherein
    the memory stores program instructions for implementing the multimodal information fusion method for protein representation learning according to any one of claims 1-6; and
    the processor is configured to execute the program instructions stored in the memory to control multimodal information fusion.
  9. A storage medium, characterized in that it stores program instructions executable by a processor, the program instructions being used to execute the multimodal information fusion method for protein representation learning according to any one of claims 1 to 6.
PCT/CN2022/138208 2021-12-15 2022-12-09 Multi-mode information fusion method and system for protein representative learning, and terminal and storage medium WO2023109714A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111536668.4A CN114388064A (en) 2021-12-15 2021-12-15 Multi-modal information fusion method, system, terminal and storage medium for protein characterization learning
CN202111536668.4 2021-12-15

Publications (1)

Publication Number Publication Date
WO2023109714A1

Family

ID=81197386

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/138208 WO2023109714A1 (en) 2021-12-15 2022-12-09 Multi-mode information fusion method and system for protein representative learning, and terminal and storage medium

Country Status (2)

Country Link
CN (1) CN114388064A (en)
WO (1) WO2023109714A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114388064A (en) * 2021-12-15 2022-04-22 深圳先进技术研究院 Multi-modal information fusion method, system, terminal and storage medium for protein characterization learning
CN115984622B (en) * 2023-01-10 2023-12-29 深圳大学 Multi-mode and multi-example learning classification method, prediction method and related device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200279156A1 (en) * 2017-10-09 2020-09-03 Intel Corporation Feature fusion for multi-modal machine learning analysis
CN108052911A (en) * 2017-12-20 2018-05-18 上海海洋大学 Multi-modal remote sensing image high-level characteristic integrated classification method based on deep learning
CN111584073A (en) * 2020-05-13 2020-08-25 山东大学 Artificial intelligence fusion multi-modal information-based diagnosis model for constructing multiple pathological types of benign and malignant pulmonary nodules
CN112837753A (en) * 2021-02-07 2021-05-25 中国科学院新疆理化技术研究所 MicroRNA-disease associated prediction method based on multi-mode stacking automatic coding machine
CN114388064A (en) * 2021-12-15 2022-04-22 深圳先进技术研究院 Multi-modal information fusion method, system, terminal and storage medium for protein characterization learning

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116913383A (en) * 2023-09-13 2023-10-20 鲁东大学 T cell receptor sequence classification method based on multiple modes
CN116913383B (en) * 2023-09-13 2023-11-28 鲁东大学 T cell receptor sequence classification method based on multiple modes
CN116935952A (en) * 2023-09-18 2023-10-24 浙江大学杭州国际科创中心 Method and device for training protein prediction model based on graph neural network
CN116935952B (en) * 2023-09-18 2023-12-01 浙江大学杭州国际科创中心 Method and device for training protein prediction model based on graph neural network
CN116933046A (en) * 2023-09-19 2023-10-24 山东大学 Deep learning-based multi-mode health management scheme generation method and system
CN116933046B (en) * 2023-09-19 2023-11-24 山东大学 Deep learning-based multi-mode health management scheme generation method and system
CN117173692A (en) * 2023-11-02 2023-12-05 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device
CN117173692B (en) * 2023-11-02 2024-02-02 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device

Also Published As

Publication number Publication date
CN114388064A (en) 2022-04-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22906461

Country of ref document: EP

Kind code of ref document: A1