CN116206676B

CN116206676B - Immunogen prediction system and method based on protein three-dimensional structure and graph neural network

Info

Publication number: CN116206676B
Application number: CN202310479207.0A
Authority: CN
Inventors: 宰晓东; 赵云祥; 徐俊杰; 任洪广; 陈薇
Original assignee: Academy of Military Medical Sciences AMMS of PLA
Current assignee: Academy of Military Medical Sciences AMMS of PLA
Priority date: 2023-04-28
Filing date: 2023-04-28
Publication date: 2023-09-26
Anticipated expiration: 2043-04-28
Also published as: CN116206676A

Abstract

The invention discloses an immunogen prediction system and method based on protein three-dimensional structure and a graph neural network. The system comprises: (1) a three-dimensional structural feature extraction module; (2) an immunogen structure data set processing module; (3) a machine learning classifier module; (4) an automatic prediction output module. The system applies protein three-dimensional structure information together with artificial intelligence algorithms such as graph neural networks and protein language models to the field of novel vaccine immunogen discovery, overcomes the limitation of traditional methods that predict immunogens only from one-dimensional amino acid sequence features, achieves universal (applicable to bacteria, viruses, parasites and the like) and high-precision immunogen prediction, and facilitates the efficient research and development of novel vaccines.

Description

Immunogen prediction system and method based on protein three-dimensional structure and graph neural network

Technical Field

The invention relates to an immunogen prediction system and method based on protein three-dimensional structure and a graph neural network, and belongs to the technical field of biomedicine and bioinformatics.

Background

Immunogens determine the targets attacked by the vaccine-induced immune response and are the decisive factor in the development of novel vaccines. Traditional vaccine immunogen identification is slow and inefficient and cannot meet the need for rapid development of novel vaccines. Reverse vaccinology provides a new way to search for novel immunogens of complex pathogens: starting from omics data, target antigens are screened through computational analysis and prediction over large amounts of data, and their protective responses are then validated (Rappuoli R. (2000) Curr. Opin. Microbiol., 3, 445-450). The approach has been successfully applied to vaccine development against complex pathogens such as Neisseria meningitidis and Staphylococcus aureus (Pizza M. (2000) Science, 287, 1816-1820). A variety of reverse vaccinology immunogen prediction methods and software systems were subsequently developed, represented by NERVE, the first standalone immunogen discovery system, and Vaxign, the first online immunogen discovery system. These mainly use rule-based filtering: protein properties (such as subcellular localization, molecular weight, adhesion, virulence probability, etc.) are analyzed in predetermined steps, and proteins that satisfy the rules pass to the next stage until the target immunogens are screened out (Vivona S. (2006) BMC Biotechnol., 6; 35; He Y. (2010) J. Biomed. Biotechnol., 1; 15).

With the rapid development of artificial intelligence, machine learning classification models have gradually been applied to reverse vaccinology immunogen prediction. Representative examples: Doytchinova et al. established the VaxiJen method with the discriminant analysis by partial least squares (DA-PLS) algorithm, based on 45 physicochemical parameter features obtained by annotating the one-dimensional amino acid sequence (Doytchinova I.A. (2007) BMC Bioinformatics, 8; 4); Zai et al. established the MPPA-ML method with algorithms such as voting, based on 6 core biological features obtained by one-dimensional amino acid sequence annotation (Zai X. (2021) Vet Res, 52; 75); and the Vaxign-ML method was established with algorithms such as the support vector machine (SVM), based on 509 biological and physicochemical parameter features obtained by one-dimensional amino acid sequence annotation (He Y. (2020) Bioinformatics, 36; 10: 3185-3191). Compared with traditional rule-based filtering, these methods greatly improve the accuracy of immunogen prediction. Recently, the Transformer-based unsupervised protein language model ESM-2 has shown better ability to extract one-dimensional amino acid sequence features than conventional algorithms (Rives A. (2021) Proc Natl Acad Sci U S A, 118 (15): e2016239118).

However, all of the above methods must annotate protein biological features and physicochemical parameter features from the one-dimensional amino acid sequence with a large number of bioinformatics tools, and the process is extremely cumbersome (Dalsass M. (2019) Front Immunol, 14; 10: 113). Moreover, the annotation methods and software for protein biological features differ between pathogen types such as bacteria, viruses and parasites, so the models generalize poorly across pathogens. In addition, the candidate lists produced by current methods are still too broad for immunological evaluation and verification to be practical, so prediction accuracy needs further improvement.

The three-dimensional structure of a protein ultimately determines its biological characteristics and physicochemical parameter characteristics, and carries far more information than the one-dimensional amino acid sequence. The invention aims to establish a novel immunogen prediction method based on three-dimensional structural features and to achieve universal (applicable to bacteria, viruses, parasites and the like) and high-precision immunogen prediction.

Disclosure of Invention

The invention aims to overcome the defect that the existing reverse vaccinology only develops an immunogen prediction technology based on one-dimensional amino acid sequence characteristics, and provides a brand-new immunogen high-precision prediction system and method based on a protein three-dimensional structure and a graph neural network.

The three-dimensional structure of a protein ultimately determines its biological characteristics and physicochemical parameter characteristics, and carries far more information than the one-dimensional amino acid sequence. Establishing the machine learning classification model for immunogen prediction on the three-dimensional structural features of proteins is expected to break through the limitations of existing models based on the one-dimensional amino acid sequence.

To establish an immunogen prediction model based on protein three-dimensional structural features, the first problem to solve is how to extract effective feature vectors from the protein structure. Protein three-dimensional structures, whether determined experimentally or predicted by software such as AlphaFold2, are typically stored in the PDB file format. A PDB file is essentially an ASCII text file containing information such as the coordinates of each atom of the amino acids that make up the protein. Structural data in this text format are difficult to use directly for feature extraction and machine learning model building.
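To illustrate what such a text file contains, the following is a minimal sketch that reads the alpha-carbon coordinates of each residue from PDB-format text. It relies only on the fixed-column layout of `ATOM` records in the PDB format; the function name `read_ca_atoms` and the choice to keep only the first alternate location are assumptions made for illustration, not part of the described system.

```python
from typing import List, Tuple

# (chain ID, residue name, residue number, (x, y, z))
Residue = Tuple[str, str, int, Tuple[float, float, float]]

def read_ca_atoms(pdb_text: str) -> List[Residue]:
    """Collect one alpha-carbon (CA) position per residue from PDB text."""
    residues = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue                      # skip HETATM, REMARK, TER, ...
        if line[12:16].strip() != "CA":
            continue                      # columns 13-16: atom name
        if line[16] not in (" ", "A"):
            continue                      # column 17: keep first alternate location
        res_name = line[17:20].strip()    # columns 18-20: residue name
        chain = line[21]                  # column 22: chain identifier
        res_seq = int(line[22:26])        # columns 23-26: residue number
        xyz = (float(line[30:38]),        # columns 31-38: x
               float(line[38:46]),        # columns 39-46: y
               float(line[46:54]))        # columns 47-54: z
        residues.append((chain, res_name, res_seq, xyz))
    return residues
```

The resulting list of residue positions is the kind of input a graph construction step needs: each entry can become one node.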

The three-dimensional structure of a protein consists of amino acids connected by chemical bonds. It can therefore be viewed as a "graph" in which the amino acids are nodes and the chemical bonds are edges. There are many methods for capturing protein three-dimensional structure, including abstracting it into a 3D grid, or abstracting the structure in three-dimensional space into one-dimensional or two-dimensional representations that are then analyzed. However, because a protein's three-dimensional structure is composed of atoms connected by chemical bonds, it is essentially a graph. The invention therefore selects the graph neural network (GNN, Graph Neural Network) as the tool for processing protein three-dimensional structural features and establishes an immunogen prediction system based on protein three-dimensional structure and a graph neural network.

The system specifically comprises the following modules:

(1) A three-dimensional structural feature extraction module: a graph neural network model designed for the structural characteristics of immunogens is used to characterize and extract the three-dimensional structural features of proteins; by pre-training on a PDB database of protein three-dimensional structures, a feature representation of the interactions of amino acids within a protein in three-dimensional space is obtained;

(2) An immunogen structure data set processing module: collecting known immunogens of pathogens as positive sample sets, randomly extracting non-immunogens of non-homologous proteins of the known immunogens from a pathogen protein database as negative sample sets, obtaining a protein three-dimensional structure PDB file through structure prediction software to form a first immunogen structure data set and a non-immunogen structure data set, inputting structure information into the pre-trained graph neural network model, and extracting to obtain feature vectors corresponding to the three-dimensional structure;

(3) A machine learning classifier module: after the feature vector is subjected to dimension reduction, the feature vector is combined with one-dimensional amino acid sequence features, a machine learning algorithm is adopted for classification model training, a trained immunogen classifier is obtained, and model prediction accuracy is evaluated based on a test set;

(4) An automatic prediction output module: the three-dimensional structural information and one-dimensional amino acid sequences of all annotated proteins of the pathogen to be predicted are input, the pre-trained graph neural network model and the immunogen classifier are applied, and a list of candidate immunogens is output automatically.

In a preferred embodiment, the protein structure database in module (1) includes, but is not limited to, the PDB (Protein Data Bank) and the AlphaFold Protein Structure Database.

In another preferred embodiment, the optimized and improved graph neural network model in module (1) is specifically an improved neighborhood enhanced graph convolutional network (NEGCN, Neighbor Enhanced Graph Convolutional Network). Based on the three-dimensional spatial distribution of amino acids in proteins, NEGCN constructs adjacency graphs according to, respectively, the spatial distance between amino acids, the chain connectivity of amino acids, the nearest-neighbor distance between amino acids, and so on. The amino acid adjacency graph is then converted into an edge graph based on the forces between amino acids, and direction-sensitive message passing is performed along the vector directions in space according to the different force edges. In a specific embodiment, the invention for the first time divides edge types into spatial position classes; more preferably, 8 classes are used in experiments, i.e. space is divided into 8 quadrants (octants) along the x, y and z axes, so as to better distinguish the different roles played by different adjacent edges in three-dimensional space. Finally, based on a contrastive learning mechanism, a feature representation of the interactions of amino acids within a protein in three-dimensional space is obtained, including but not limited to the spatial positions of amino acids, the interactions between different amino acids, and amino acid physicochemical properties. More preferably, a 4-layer neural network is used whose layers output vectors of dimensions [512, 256, 128, 64] respectively, for a total of 960 dimensions. The model has distinctive technical features designed for the structural characteristics of immunogen proteins and has enhanced protein three-dimensional structure characterization capability compared with the existing NEGCN model.

In another preferred embodiment, the pathogen in module (2) is a bacterial, viral or parasitic microorganism which is pathogenic to a human and/or animal host.

In another preferred embodiment, the known immunogens in module (2) are pathogen protein components that have been experimentally validated to elicit effective immune protection in a host. Known immunogens are collected by integrating database screening, literature investigation and experimental discovery, with databases including but not limited to the IEDB, AntigenDB and Protegen databases. Non-immunogens are pathogen protein components that have not been shown to elicit effective immune protection in a host. They are collected by downloading the complete protein sequences of all pathogens from the UniProt protein database, randomly extracting candidate proteins from them, and removing sequences homologous to known immunogens with a search tool based on a local alignment algorithm (such as BLAST, Basic Local Alignment Search Tool), which yields the non-immunogen data set, i.e. the negative sample set.
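The negative-set construction described above can be sketched as follows. This is only an illustration under stated assumptions: the homology search itself is not performed here, and `homolog_ids` is a hypothetical, precomputed set of protein identifiers that an alignment tool such as BLAST reported as homologous to known immunogens. The function name and parameters are likewise hypothetical.

```python
import random

def build_negative_set(proteome_ids, homolog_ids, n_candidates, rng):
    """Randomly draw candidates, then drop those homologous to known immunogens.

    proteome_ids : list of protein identifiers for the pathogen proteomes
    homolog_ids  : set of identifiers flagged as homologous (assumed precomputed)
    n_candidates : how many proteins to draw before filtering
    rng          : random.Random instance, passed in for reproducibility
    """
    n = min(n_candidates, len(proteome_ids))
    candidates = rng.sample(proteome_ids, n)   # sampling without replacement
    return [p for p in candidates if p not in homolog_ids]
```

Drawing first and filtering afterwards follows the order given in the text; the final set can therefore be smaller than `n_candidates`.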

In another preferred embodiment, the structure prediction software in module (2) includes, but is not limited to, AlphaFold2, RoseTTAFold and ESMFold.

In another preferred embodiment, the one-dimensional amino acid sequence features in module (3) are extracted by methods including, but not limited to, the protein language model ESM-2; those skilled in the art can also use other existing one-dimensional amino acid sequence feature extraction tools based on common technical knowledge.

In another preferred embodiment, the dimensionality reduction method in module (3) includes, but is not limited to, principal component analysis (PCA). An excessively large three-dimensional structural feature dimension may lead to overfitting and also increases model parameters and complexity; more preferably, the 960-dimensional feature vector extracted by the graph neural network is reduced to 20 to 100 dimensions to obtain optimal machine learning model performance.

In another preferred embodiment, the classification model algorithm in module (3) includes, but is not limited to, extreme gradient boosting (XGBoost), which is used for model training to obtain a trained classifier.

Secondly, the invention also provides an immunogen prediction method that uses the above immunogen prediction system based on protein three-dimensional structure and a graph neural network, comprising the following steps:

(1) Three-dimensional structural feature extraction step: a graph neural network model designed for the structural characteristics of immunogens is used to characterize and extract the three-dimensional structural features of proteins; by pre-training on a PDB database of protein three-dimensional structures, a feature representation of the interactions of amino acids within a protein in three-dimensional space is obtained;

(2) The immunogen structure data set processing step: collecting known immunogens of pathogens as positive sample sets, randomly extracting non-immunogens of non-homologous proteins of the known immunogens from a pathogen protein database as negative sample sets, obtaining a protein three-dimensional structure PDB file through structure prediction software to form a first immunogen structure data set and a non-immunogen structure data set, inputting structure information into the pre-trained graph neural network model, and extracting to obtain feature vectors corresponding to the three-dimensional structure;

(3) Machine learning classifier step: after the feature vector is subjected to dimension reduction, the one-dimensional amino acid sequence features extracted by the ESM-2 are combined, a machine learning algorithm is adopted to carry out classification model training, a trained immunogen classifier is obtained, and model prediction accuracy is evaluated based on a test set;

(4) An automatic prediction output step: the three-dimensional structural information of all annotated proteins of the pathogen to be predicted is input, the pre-trained graph neural network model and the immunogen classifier are applied, and a list of candidate immunogens is output automatically.

In a preferred embodiment, the protein structure database in step (1) includes, but is not limited to, the PDB (Protein Data Bank) and the AlphaFold Protein Structure Database.

In another preferred embodiment, the optimized and improved graph neural network model in step (1) is specifically an improved neighborhood enhanced graph convolutional network (NEGCN, Neighbor Enhanced Graph Convolutional Network). Based on the three-dimensional spatial distribution of amino acids in proteins, NEGCN constructs adjacency graphs according to, respectively, the spatial distance between amino acids, the chain connectivity of amino acids, the nearest-neighbor distance between amino acids, and so on. The amino acid adjacency graph is then converted into an edge graph based on the forces between amino acids, and direction-sensitive message passing is performed along the vector directions in space according to the different force edges. In a specific embodiment, the invention for the first time divides edge types into spatial position classes; more preferably, 8 classes are used in experiments, i.e. space is divided into 8 quadrants (octants) along the x, y and z axes, so as to better distinguish the different roles played by different adjacent edges in three-dimensional space. Finally, based on a contrastive learning mechanism, a feature representation of the interactions of amino acids within a protein in three-dimensional space is obtained, including but not limited to the spatial positions of amino acids, the interactions between different amino acids, and amino acid physicochemical properties. More preferably, a 4-layer neural network is used whose layers output vectors of dimensions [512, 256, 128, 64] respectively, for a total of 960 dimensions. The model has distinctive technical features designed for the structural characteristics of immunogen proteins and has enhanced protein three-dimensional structure characterization capability compared with the existing NEGCN model.

In another preferred embodiment, the pathogen in step (2) is a bacterial, viral or parasitic microorganism which is pathogenic to a human and/or animal host.

In another preferred embodiment, the known immunogens in step (2) are pathogen protein components that have been experimentally validated to elicit effective immune protection in a host. Known immunogens are collected by integrating database screening, literature investigation and experimental discovery, with databases including but not limited to the IEDB, AntigenDB and Protegen databases. Non-immunogens are pathogen protein components for which there is no experimental evidence of eliciting effective immune protection in a host. They are collected by downloading the complete protein sequences of all pathogens from the UniProt protein database, randomly extracting candidate proteins from them, and removing sequences homologous to known immunogens with a search tool based on a local alignment algorithm (BLAST, Basic Local Alignment Search Tool), which yields the non-immunogen data set, i.e. the negative sample set.

In another preferred embodiment, the structure prediction software in step (2) includes, but is not limited to, AlphaFold2, RoseTTAFold and ESMFold.

In another preferred embodiment, the one-dimensional amino acid sequence features in step (3) are extracted by methods including, but not limited to, the protein language model ESM-2; those skilled in the art can also use other existing one-dimensional amino acid sequence feature extraction tools based on common technical knowledge.

In another preferred embodiment, the dimensionality reduction method in step (3) includes, but is not limited to, principal component analysis (PCA). An excessively large three-dimensional structural feature dimension may lead to overfitting and also increases model parameters and complexity; more preferably, the 960-dimensional feature vector extracted by the graph neural network is reduced to 20 to 100 dimensions to obtain optimal machine learning model performance.

In another preferred embodiment, the classification model algorithm in step (3) includes, but is not limited to, model training with extreme gradient boosting (XGBoost) to obtain a trained classifier.

The invention is an innovative application of artificial intelligence algorithms in the field of novel vaccine immunogen discovery. Its core idea is to provide a high-precision immunogen prediction method based on protein three-dimensional structure and a graph neural network, overcoming the limitation of traditional machine learning classification based only on one-dimensional amino acid sequence features and achieving universal (applicable to bacteria, viruses, parasites and the like) and high-precision immunogen prediction. The method facilitates the rapid research and development of novel vaccines and has important application value in the field of biomedicine.

Drawings

FIG. 1 is a schematic flow diagram of the immunogen prediction system based on protein three-dimensional structure and a graph neural network.

FIG. 2 is a flow chart of learning protein three-dimensional structural features based on the graph neural network.

FIG. 3 is a diagram of feature learning in the neighborhood enhanced graph convolutional network (NEGCN) model.

FIG. 4 is a positional relationship classification diagram.

Detailed Description

The invention will be further described with reference to specific embodiments, and advantages and features of the invention will become apparent from the description. These examples are merely exemplary and do not limit the scope of the invention in any way.

Example 1: construction of immunogen prediction system based on protein three-dimensional structure and graph neural network

As shown in FIG. 1, the invention provides a high-precision immunogen prediction method based on protein three-dimensional structure and a graph neural network, which aims to overcome the limitation of traditional machine learning classification based on one-dimensional amino acid sequences and to achieve universal and high-precision immunogen prediction.

The method comprises the following steps:

1. Three-dimensional structural feature extraction module: pre-training is carried out on about 805,000 high-quality protein three-dimensional structures from the AlphaFold Protein Structure Database (https://www.alphafold.ebi.ac.uk/), and a novel graph neural network model designed for the structural characteristics of immunogens is established to characterize and extract protein three-dimensional structural features.

The specific process is as follows:

1.1 In the three-dimensional structure of a protein, the position of an amino acid in three-dimensional space is determined by the position of its alpha carbon.

1.2 Based on the three-dimensional structure of a randomly selected protein A, an amino acid a₁ is randomly selected in the protein. With a₁ as the center and R as the radius (R takes a random value, subject to a constraint set by X, and the circle must contain at least Y amino acids; the value range of X is [15, 50] and the value range of Y is [15, 50]; preferably, X = 15 and Y = 15 in the examples), the amino acids contained in the circle are cropped from the three-dimensional spatial structure of the protein (FIG. 2). Based on the cropped amino acids, an adjacency graph is constructed to obtain the target sample (A of FIG. 2).

1.3 Based on the three-dimensional structure of the same randomly selected protein A, another amino acid a₂ is randomly selected in the protein. With a₂ as the center and R as the radius, the amino acids contained in the circle are cropped from the three-dimensional spatial structure of the protein. Based on the cropped amino acids, an adjacency graph is constructed to obtain a positive sample of the target sample (A of FIG. 2).

1.4 The three-dimensional structure of another protein B is randomly selected, and an amino acid b is randomly selected in that protein. With b as the center and R as the radius, the amino acids contained in the circle are cropped from the three-dimensional spatial structure of the protein. Based on the cropped amino acids, an adjacency graph is constructed to obtain a negative sample of the target sample (B of FIG. 2).
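Steps 1.2 to 1.4 can be sketched as follows, assuming each protein is given as a list of alpha-carbon coordinates. The function names, the retry loop and the `max_tries` limit are assumptions made for illustration; how the radius R is drawn is left to the caller because the text only states that it is random within constraints.

```python
import math
import random

def crop_sphere(coords, center_idx, radius):
    """Indices of residues whose alpha carbon lies within `radius` of the center residue."""
    center = coords[center_idx]
    return [i for i, p in enumerate(coords) if math.dist(p, center) <= radius]

def sample_crop(coords, radius, min_residues, rng, max_tries=100):
    """Pick a random center residue; retry until the crop holds at least `min_residues`."""
    for _ in range(max_tries):
        idx = crop_sphere(coords, rng.randrange(len(coords)), radius)
        if len(idx) >= min_residues:
            return idx
    raise ValueError("no crop with enough residues found")

def sample_triplet(protein_a, protein_b, radius, min_residues, rng):
    """Target and positive crops come from protein A, the negative crop from protein B."""
    target = sample_crop(protein_a, radius, min_residues, rng)
    positive = sample_crop(protein_a, radius, min_residues, rng)
    negative = sample_crop(protein_b, radius, min_residues, rng)
    return target, positive, negative
```

Each returned index list would then be turned into an adjacency graph; that construction is not shown here.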

1.5 in the generated target samples, positive samples, and negative samples, nodes are amino acids in proteins, the properties of the nodes are categories of amino acids (21 types of which 20 types are common amino acids and all unusual amino acids are classified as one type), and edges in the figure are the adjacency relations between the corresponding amino acids. Based on the target sample, the positive sample and the negative sample respectively, the edges in the adjacent graph are randomly removed according to a given percentage (the value range is [5%,30% ], and the value is preferably set to 15%) by adopting a random Mask method (namely, a certain proportion of edges are randomly selected in a list of all edges without replacement), so that updated target samples, positive samples and negative samples are obtained.
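The node attributes and the random edge mask of step 1.5 can be illustrated with a short sketch. The three-letter residue codes are the standard 20 amino acids; the class numbering, the function names and the rounding of the number of removed edges are assumptions made for illustration.

```python
import random

STANDARD_RESIDUES = (
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
)

def residue_class(res_name):
    """Map a residue name to one of 21 classes; all non-standard residues share class 20."""
    name = res_name.upper()
    return STANDARD_RESIDUES.index(name) if name in STANDARD_RESIDUES else 20

def mask_edges(edges, fraction, rng):
    """Remove a given fraction of edges, chosen at random without replacement."""
    n_drop = round(len(edges) * fraction)
    dropped = set(rng.sample(range(len(edges)), n_drop))
    return [e for i, e in enumerate(edges) if i not in dropped]
```

With the preferred value of 15%, a graph with 20 edges keeps 17 of them after masking.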

1.6 target samples, positive samples, and negative samples were simultaneously input to our proposed improved Neighborhood Enhanced Graph Convolution Network (NEGCN) to obtain their graph representations. The process flow of NEGCN is shown in fig. 3.

1.6.1 In NEGCN, the initial graph A in FIG. 3 is first converted to B (the edge graph). In the edge graph B, the adjacency relations between amino acids are the nodes, and the connectivity between different adjacency relations forms the edges. In the newly generated edge graph B, the attribute of a node is the concatenation of the attributes of the two nodes linked by the corresponding edge in the original graph; the type of an edge is given by the angle formed between the two adjacency relations it connects. To better distinguish the different roles played by different adjacent edges in three-dimensional space, we propose for the first time to divide the edge types into spatial position classes; preferably, 8 classes are used in experiments, i.e. space is divided into 8 quadrants (octants) along the x, y and z axes (as shown in FIG. 4). Specifically, consider the three nodes <a, b, c> corresponding to two adjacent edges, where b is the node common to both edges. To find the position class of edge ab relative to edge bc, a reference vector is taken, a vector difference is computed, and the three-dimensional difference determines which quadrant of FIG. 4 the position class of edge ab relative to edge bc falls in. Conversely, to find the position class of edge bc relative to edge ab, the corresponding reference vector is taken, the vector difference is computed, and the three-dimensional difference determines which quadrant of FIG. 4 the position class of edge bc relative to edge ab falls in.
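The conversion to an edge graph and the 8-class position labeling can be sketched as below. The exact reference vector and vector difference used in the source are not recoverable from the text, so this sketch makes explicit assumptions: both edge vectors are taken from the shared node to the far endpoint, the first edge of the ordered pair is the reference, the class is read from the signs of the difference, and zero counts as non-negative. The octant numbering is arbitrary and need not match FIG. 4.

```python
def sub(p, q):
    """Component-wise difference p - q of two 3D points."""
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])

def octant(diff):
    """Class 0..7 from the signs of (x, y, z); zero is treated as non-negative (assumption)."""
    x, y, z = diff
    return (4 if x >= 0 else 0) + (2 if y >= 0 else 0) + (1 if z >= 0 else 0)

def build_edge_graph(coords, edges):
    """Nodes of the edge graph are the original edges; two are linked if they share a node.

    Returns {(i, j): position class of edge j relative to reference edge i}.
    """
    links = {}
    for i, (a, b) in enumerate(edges):
        for j, (c, d) in enumerate(edges):
            if i == j:
                continue
            shared = {a, b} & {c, d}
            if len(shared) != 1:
                continue
            s = next(iter(shared))
            far_i = b if a == s else a          # endpoint of edge i away from the shared node
            far_j = d if c == s else c          # endpoint of edge j away from the shared node
            v_i = sub(coords[far_i], coords[s])
            v_j = sub(coords[far_j], coords[s])
            links[(i, j)] = octant(sub(v_j, v_i))
    return links
```

Because the reference edge changes with the direction of the pair, the classes for (i, j) and (j, i) generally differ, which matches the text's separate treatment of "ab relative to bc" and "bc relative to ab".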

1.6.2 A multi-layer neural network is constructed on the edge graph. The number of layers and the output vector dimension of each layer can be set according to the specific requirements of the experiment; preferably, 4 layers are used with output dimensions of [512, 256, 128, 64] respectively, and the outputs of all layers are concatenated as the final three-dimensional structural feature representation (512 + 256 + 128 + 64 = 960 dimensions in total). Each node in the edge graph is updated with three different message passing methods, and the embeddings obtained from the three methods are concatenated. The three message passing methods are as follows:

1.6.2.1 In the node update at the m-th layer, taking a given node as the center, all nodes connected to that node within a given number of hops are selected (preferably, the number of hops is 2 in the embodiment), and the representation of the target node is updated with the representations of these nodes.

Here, the symbols in the update formula denote, respectively, a specific hop, the learnable weight corresponding to that hop, and the set of neighbor nodes of N_i that belong to the same hop.

1.6.2.2 In the node update at the m-th layer, taking a given node as the center, the k nodes nearest to it are selected, and the representation of the target node is updated with the representations of these k nodes (preferably, k is 10 in the embodiment).

Here, the set in the update formula denotes the k nodes nearest to node N_i.

1.6.2.3 In the node update at the m-th layer, a circle of radius r is drawn in space centered on a given node N_i (preferably, r is 10 in the embodiment), all neighbor nodes of N_i within the circle are taken into account, and their representations are used to update the representation of the target node.

Here, the set in the update formula denotes the neighbor nodes whose distance from node N_i is less than r.

1.6.2.4 Finally, the results of the three message passing methods are weighted and averaged to obtain the final representation. During the update, the neighbor nodes are classified according to the angular relation θ between each neighbor node and the target node N_i, message passing is carried out within each class, and the message passing results of the different classes are averaged to obtain the updated representation of node N_i.
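The three neighbor sets used in 1.6.2.1 to 1.6.2.3 can be sketched as follows. Only the neighbor selection is shown, not the learnable update itself, since the update formulas are not reproduced in the text. The function names are hypothetical, and the sketch assumes each edge-graph node has been assigned a 3D position (for example an edge midpoint, which is an assumption and not stated in the source).

```python
import math
from collections import deque

def hop_neighbors(adj, start, max_hops):
    """Group nodes by hop distance from `start`: {1: [...], 2: [...], ...} up to max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue                      # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    groups = {}
    for node, h in dist.items():
        if h > 0:
            groups.setdefault(h, []).append(node)
    return groups

def knn_neighbors(pos, i, k):
    """The k nodes closest to node i in space (node i itself excluded)."""
    others = [j for j in range(len(pos)) if j != i]
    others.sort(key=lambda j: math.dist(pos[i], pos[j]))
    return others[:k]

def radius_neighbors(pos, i, r):
    """All nodes whose distance from node i is less than r (node i itself excluded)."""
    return [j for j in range(len(pos)) if j != i and math.dist(pos[i], pos[j]) < r]
```

Grouping by hop in `hop_neighbors` mirrors the per-hop learnable weight described in 1.6.2.1: each group would be aggregated with its own weight.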

1.6.3 after obtaining the representation of each node in the edge graph at different levels, the representations of each node in the different levels are first concatenated and then the average of the representations of all nodes is taken as the representation G of the whole graph.

Here N denotes the total number of nodes in the graph, the function avg denotes averaging, and the symbol "||" denotes concatenation.
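The readout described in 1.6.3 (concatenate each node's per-layer representations, then average over all nodes) can be sketched in plain Python. The nested-list data layout and the function name are assumptions chosen for illustration.

```python
def graph_readout(layer_outputs):
    """Whole-graph representation G.

    layer_outputs[m][i] is the representation (list of floats) of node i at layer m.
    Each node's layer outputs are concatenated, then averaged over all N nodes.
    """
    n_nodes = len(layer_outputs[0])
    per_node = [
        [x for layer in layer_outputs for x in layer[i]]   # concatenation across layers
        for i in range(n_nodes)
    ]
    dim = len(per_node[0])
    return [sum(vec[d] for vec in per_node) / n_nodes for d in range(dim)]
```

With the preferred layer widths [512, 256, 128, 64], the concatenated vector and therefore G has 960 dimensions.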

1.7 Finally, contrastive learning is performed with a contrastive loss, so that the representations NEGCN learns for the positive sample and the target sample are closer in the high-dimensional space, while the representations of the negative sample and the target sample are farther apart.

The function sim computes the cosine similarity between different sample representations, B is the batch size (i.e. how many samples are processed simultaneously in each round; B can be set according to the specific task, and preferably B is 48 in the embodiment), and δ is the temperature coefficient (preferably, δ is 0.07 in the embodiment). G_t denotes the representation of the target sample, G_p denotes the representation of the positive sample of the target sample, and G_i denotes the representation of a negative sample of the target sample. Preferably, the number of training iteration rounds is 25, and it can be adjusted according to the specific task.
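Since the loss formula itself is not reproduced in the text, the following sketch assumes a standard InfoNCE-style contrastive loss built from the quantities defined above: cosine similarity as sim, temperature δ, one positive representation G_p and a list of negative representations G_i. Whether the source's denominator includes the positive term is an assumption here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(g_t, g_p, negatives, delta=0.07):
    """-log( exp(sim(t,p)/delta) / (exp(sim(t,p)/delta) + sum_i exp(sim(t,n_i)/delta)) )."""
    pos = cosine(g_t, g_p) / delta
    logits = [pos] + [cosine(g_t, g_n) / delta for g_n in negatives]
    top = max(logits)                                   # subtract max for numerical stability
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - pos
```

The loss approaches zero when the target is much more similar to the positive than to any negative, and grows when a negative is the closer one, which is the behavior step 1.7 describes.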

2. An immunogen structure data set processing module: known immunogens of pathogens are collected as the positive sample set, and non-immunogens that are not homologous to the known immunogens are randomly extracted from a pathogen protein database as the negative sample set; protein three-dimensional structure PDB files are obtained through structure prediction software to establish a first immunogen protein structure data set and a non-immunogen structure data set; and the protein structure information is input into the pre-trained graph neural network model to extract the feature vectors corresponding to the three-dimensional structures. The specific process is as follows:

2.1 Creation of an immunogen dataset

To build a reliable immunogen prediction model, it is important to construct suitable positive and negative data sets. A positive data set usually consists of known antigenic proteins, i.e. immunogens. A negative data set usually consists of proteins that are not antigenic, i.e. not immunogens, such as cytoplasmic or intranuclear proteins. The important pathogens named in the domestic list of human-to-human transmissible pathogenic microorganisms and by international institutions such as the World Health Organization (WHO), the National Institutes of Health (NIH) and the U.S. Centers for Disease Control and Prevention (CDC) were collected, and after summary and analysis a pathogen list covering bacteria, viruses, parasites and the like was established.

Information related to the vaccine target immunogens of these pathogens was retrieved from public online databases. Several databases, such as the Immune Epitope Database (IEDB), the anti-gen database and the Protegen database, were used to ensure high reliability and coverage of the retrieved information. To cover vaccine immunogens discovered in the latest research, a continuously updated literature survey was carried out in the PubMed database, with a defined search strategy (search keywords, restricted document types, publication years and the like) to ensure that the retrieved documents are highly relevant and reliable. The vaccine target immunogen information obtained from the literature survey and the database retrieval was integrated and screened to form the positive immunogen data set.

Whole-proteome information for the above pathogens was extracted from the UniProt database, and proteins were randomly sampled from it, excluding proteins homologous to the positive data set (sequence homology > 30%), to construct the negative immunogen data set. The numbers of proteins in the positive and negative immunogen data sets were kept comparable to avoid data bias.
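The negative-set construction described above (random sampling from the proteome while excluding sequences more than 30% homologous to any known immunogen) can be sketched as follows. The function names and toy sequences are hypothetical, and `toy_identity` is only a stand-in: a real pipeline would obtain sequence identity from an alignment-based search tool rather than from position-wise matching.

```python
import random

def build_negative_set(proteome, positives, identity, n, max_identity=0.30, seed=0):
    """Randomly draw up to n proteins from the proteome, skipping any candidate
    whose identity to some known immunogen exceeds max_identity."""
    rng = random.Random(seed)
    candidates = list(proteome)
    rng.shuffle(candidates)
    negatives = []
    for seq in candidates:
        if len(negatives) == n:
            break
        if all(identity(seq, pos) <= max_identity for pos in positives):
            negatives.append(seq)
    return negatives

def toy_identity(a, b):
    """Toy stand-in (assumption): fraction of matching positions over the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

positives = ["MKTAYIAKQR", "MSLLTEVETP"]
proteome = ["MKTAYIAKQW", "GGGGGGGGGG", "PPPPPLLLLL", "MSLLTEVETP", "QWERTYHNDC"]
negatives = build_negative_set(proteome, positives, toy_identity, n=2)
```

Setting n to the size of the positive set keeps the two data sets balanced, as the text requires.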

The final positive sample data set contains 1000 known immunogens (554 from bacteria, 281 from viruses and 165 from parasites); the negative sample data set contains 1000 non-immunogens (547 from bacteria, 287 from viruses and 166 from parasites). Both data sets are far larger than those of previously reported models, which gives wider coverage and helps improve the performance of the final model.

2.2 Immunogen three-dimensional structure prediction and feature extraction

Based on the amino acid FASTA sequences of the proteins in the positive and negative sample sets, the protein structure prediction software AlphaFold2 is used to predict the three-dimensional structure of each protein in the immunogen data set at the atomic level; the output PDB files are used to establish a first immunogen protein structure data set and a non-immunogen structure data set. AlphaFold2 adopts a method based on deep learning and convolutional neural networks and can predict the three-dimensional structure of a protein with high quality in a short time. AlphaFold2 also outputs an uncertainty distribution for the predicted structure, which is used to evaluate the reliability of the prediction.

The pre-trained graph neural network model is then used to extract features from the three-dimensional structures of the 1000 known immunogens in the positive sample set and the 1000 non-immunogens in the negative sample set, and these features are used to train the downstream machine learning model.

3. A machine learning classifier module: after dimension reduction of the three-dimensional structural feature vectors of the immunogen and non-immunogen proteins, these are combined with the one-dimensional amino acid sequence features extracted by the deep learning model ESM-2; classification models are trained with several machine learning algorithms to obtain a trained immunogen classifier, and the prediction accuracy of the model is evaluated on a test set.

The specific process is as follows:

3.1 The representation dimension is a hyperparameter that has a significant impact on the performance of the classification model: too few dimensions reduce the expressive power of the model, while too many can lead to overfitting and increase the number of model parameters and the model complexity. The extracted three-dimensional structural feature vectors of the proteins are reduced in dimension by PCA to obtain a low-dimensional three-dimensional structural feature representation (the dimension of the low-dimensional features lies in the range [20, 100]; the value in this embodiment is 40).

3.2 The one-dimensional amino acid sequence features extracted by the protein language model ESM are applied to immunogen prediction and discovery for the first time. ESM-2 is pre-trained on UniRef50 and uses a Transformer-based language model with an attention mechanism to learn the patterns of interaction between amino acid pairs in the input sequence. In the experiments, the pre-trained ESM-2 model with 650 million parameters is used; given the one-dimensional amino acid sequence of a protein, it automatically generates a feature vector of dimension 1280. The extracted one-dimensional sequence feature vectors are then reduced in dimension by PCA to the same dimension as the three-dimensional structural features, giving a low-dimensional sequence feature representation (the dimension of the low-dimensional features lies in the range [20, 100]; the value in this embodiment is 40).

3.3 The 1000 positive samples and 1000 negative samples are divided proportionally: the training set accounts for 75% of the total and is used to build the prediction model, and the test set accounts for 25% of the total and is used to evaluate and verify the model.
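The 75%/25% division can be sketched as below. Splitting each class separately (a stratified split) and the fixed random seed are assumptions made here so that both sets stay balanced; the text only specifies the proportions.

```python
import random

def stratified_split(samples, labels, train_fraction=0.75, seed=0):
    """Split samples into train and test sets, preserving class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [s for s, y in zip(samples, labels) if y == cls]
        rng.shuffle(members)
        cut = int(round(len(members) * train_fraction))
        train += [(s, cls) for s in members[:cut]]
        test += [(s, cls) for s in members[cut:]]
    return train, test

# 1000 positive (label 1) and 1000 negative (label 0) sample indices.
samples = list(range(2000))
labels = [1] * 1000 + [0] * 1000
train, test = stratified_split(samples, labels)
```

With these inputs the training set holds 1500 samples and the test set 500, each half positive and half negative.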

3.4 Model training is carried out with classification algorithms including, but not limited to, extreme gradient boosting (XGBoost; Chen T, Guestrin C. XGBoost: A scalable tree boosting system [C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016: 785-794.), based either on the one-dimensional amino acid sequence feature vectors of the proteins in the training set (40 dimensions) or on these combined with the three-dimensional structural feature vectors (40 dimensions + 40 dimensions, 80 dimensions in total), to obtain the trained classifiers.

3.5 The trained classifier models, based either on the one-dimensional amino acid sequence features (40 dimensions) or on the combined protein three-dimensional structural feature vectors (40 dimensions + 40 dimensions, 80 dimensions in total), are evaluated using the indexes precision, recall, F1 value (F1-score) and accuracy. The model evaluation reports are as follows:
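For reference, the four evaluation indexes can be computed from predicted and true labels as in the following sketch. The function name and the toy labels are hypothetical and do not reproduce the values of the evaluation reports.

```python
def classification_report(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy labels: 1 = immunogen, 0 = non-immunogen.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
report = classification_report(y_true, y_pred)
```

Precision measures how many predicted immunogens are true immunogens, recall measures how many true immunogens are recovered, and F1 is their harmonic mean.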

Table 1: immunogen prediction model classification evaluation report based on one-dimensional amino acid sequence characteristics alone

The one-dimensional amino acid sequence features extracted by the Transformer-based unsupervised protein language model ESM were applied to immunogen prediction and discovery for the first time. The results show that the immunogen prediction model built on the ESM-based one-dimensional amino acid sequence features (40 dimensions) reaches accuracies of 85%, 88% and 88% for bacteria, parasites and viruses, respectively, which is markedly better than the traditional immunogen prediction algorithm (He Y. (2020) Bioinformatics, 36(10), 3185-3191.).

Table 2: Classification evaluation report of the immunogen prediction model based on protein three-dimensional structure and a graph neural network combined with one-dimensional amino acid sequence features

On this basis, the novel graph neural network model designed for the structural characteristics of immunogens is used to characterize and extract the three-dimensional structural features of proteins, and an immunogen prediction model is built after the three-dimensional structural information is combined with the traditional one-dimensional amino acid sequence information (80 dimensions). The accuracy for bacteria, parasites and viruses reaches 91%, 91% and 93%, respectively, and every evaluation index is good. The performance of the model is thus clearly improved further, which verifies the effectiveness of the immunogen prediction method based on protein three-dimensional structure and a graph neural network.

4. An automatic prediction output module: the PDB files and one-dimensional amino acid sequences of all annotated proteins of the pathogen to be predicted are input, the pre-trained graph neural network model and the immunogen classifier are applied, and a candidate immunogen list is output automatically.

The three-dimensional structures of the annotated proteins of the pathogen are predicted with the protein structure prediction software AlphaFold2 to obtain the output PDB files. The structural feature vectors of the annotated proteins are extracted by the pre-trained graph neural network model and reduced from 960 to 40 dimensions by PCA. They are then combined with the one-dimensional amino acid sequence features (40 dimensions) extracted by ESM-2, the resulting 80-dimensional feature vectors are input into the trained classifier model, and a list of candidate immunogens of the pathogen is output automatically, giving potential vaccine target antigens.
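The final step of the output module (joining the 40-dimensional structural and 40-dimensional sequence features and ranking the proteins by classifier score) can be sketched as follows. The function `rank_candidates`, the protein names and `toy_score` are hypothetical; in the embodiment the scoring function would be the trained classifier, which is not reproduced here.

```python
def rank_candidates(proteins, score, top_k=10):
    """Rank proteins by the score of their combined feature vector.

    proteins: dict mapping name -> (structure_vec, sequence_vec)
    score: callable mapping a combined feature vector to a score
    """
    scored = []
    for name, (structure_vec, sequence_vec) in proteins.items():
        combined = list(structure_vec) + list(sequence_vec)  # e.g. 40 + 40 = 80
        scored.append((score(combined), name))
    scored.sort(reverse=True)  # highest score first
    return [name for _, name in scored[:top_k]]

def toy_score(features):
    """Toy stand-in for a trained classifier (assumption): mean feature value."""
    return sum(features) / len(features)

proteins = {
    "protA": ([0.9] * 40, [0.8] * 40),
    "protB": ([0.1] * 40, [0.2] * 40),
    "protC": ([0.5] * 40, [0.6] * 40),
}
top = rank_candidates(proteins, toy_score, top_k=2)
```

Sorting by score and truncating to top_k yields the candidate list; top_k=10 corresponds to a Top10 list.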

Example 2 Prediction of monkeypox virus candidate immunogens

The specific process is as follows:

1. Monkeypox virus is used as a model pathogen to demonstrate the application of the model. The three-dimensional structures of all 190 annotated proteins of monkeypox virus (MPXV_USA_2022_MA001 (ID: ON563414.3)) are predicted with the protein structure prediction software AlphaFold2 to obtain the output PDB files.

2. The structural feature vectors of the 190 monkeypox virus proteins are extracted by the pre-trained graph neural network model and reduced from 960 to 40 dimensions by PCA; they are then combined with the one-dimensional amino acid sequence features (40 dimensions) extracted by ESM, and the resulting 80-dimensional feature vectors are input into the trained classifier model. To avoid overfitting, the classifier is trained with the known monkeypox immunogens excluded from the data set. The automatically output list of monkeypox virus candidate immunogens nevertheless still contains several proteins with important physiological functions and known vaccine target immunogens, such as M1R, A35R, B R and the like, which again demonstrates in a practical case the efficacy of the immunogen prediction model based on protein three-dimensional structure and a graph neural network. The Top10 immunogens in the list are as follows:

Table 3: Top10 of the monkeypox candidate immunogen list automatically output by the classifier model

Immunogens determine the targets attacked by the vaccine-induced immune response and are the determining factor in the development of novel vaccines. The invention provides a high-precision immunogen prediction method based on protein three-dimensional structure and a graph neural network, which overcomes the limitation of traditional machine learning classification based only on one-dimensional amino acid sequences and realizes universal (applicable to bacteria, viruses, parasites and the like) and high-precision (better than currently reported models) immunogen prediction. The method facilitates the rapid research and development of novel vaccines and has important application value in the field of biological medicine.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of the method described in the foregoing embodiments may be implemented by a program to control related hardware, where the program may be stored on a computer readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc.

The foregoing embodiments are provided only to illustrate the general principles of the present invention by way of example and are not intended to be limiting; the invention may be practiced with modifications, equivalent arrangements and adaptations that fall within the spirit and scope of the appended claims.

Claims

1. An immunogen prediction system based on a protein three-dimensional structure and a graph neural network, characterized by comprising the following modules:

(1) The three-dimensional structure feature extraction module: characterizing and extracting three-dimensional structural characteristics of protein by using a graph neural network model designed based on the structural characteristics of an immunogen, and obtaining characteristic representation of interaction of amino acids in protein in a three-dimensional space by pre-training and learning a protein three-dimensional structure PDB database; in the three-dimensional structure feature extraction module, the graph neural network model is an improved neighborhood enhancement neural network model; the improved neighborhood enhanced neural network model is based on three-dimensional spatial distribution of amino acids in proteins, and an adjacency graph is built according to spatial distance among amino acids, chain connection of the amino acids and the relationship of the nearest neighbor distance among the amino acids; then, converting the amino acid adjacency graph into an edge graph based on acting force among amino acids, dividing the types of the edges into 8 quadrants according to the space positions and x, y and z axes, distinguishing different roles played by different adjacent edges in a three-dimensional space, and carrying out direction-sensitive message transmission on the edges in the vector direction of the space according to the different acting forces; finally, based on a contrast learning mechanism, obtaining the characteristic representation of the interaction of amino acids in the protein in a three-dimensional space, wherein the three-dimensional structural characteristics of the protein comprise the spatial positions of the amino acids, the interaction among different amino acids and the physical and chemical properties of the amino acids;

(2) An immunogen structure data set processing module: collecting known immunogens of pathogens as positive sample sets, randomly extracting non-immunogens of non-homologous proteins of the known immunogens from a pathogen protein database as negative sample sets, obtaining protein three-dimensional structure PDB files through structure prediction software to form immunogen structure data sets and non-immunogen structure data sets, inputting the protein three-dimensional structure PDB files of the immunogens and the non-immunogens into the pre-trained graph neural network model, and extracting to obtain feature vectors corresponding to the immunogen three-dimensional structures and feature vectors corresponding to the non-immunogen three-dimensional structures;

(3) A machine learning classifier module: after dimension reduction is carried out on the three-dimensional structural feature vectors of the immunogen and the non-immunogen proteins, the classification model training is carried out by combining the one-dimensional amino acid sequence features and adopting a machine learning algorithm, so that a trained immunogen classifier is obtained, and model prediction accuracy is evaluated based on a test set;

(4) An automatic prediction output module: a PDB file and a one-dimensional amino acid sequence file of all annotated proteins of the pathogen to be predicted are input, the pre-trained graph neural network model and the immunogen classifier are applied, and a candidate immunogen list is output automatically.

2. The system of claim 1, wherein in the immunogenic structure data set processing module, the pathogen is a bacterial, viral or parasitic microorganism pathogenic to a human and/or animal host.

3. The system of claim 1, wherein in the immunogen structure dataset processing module, the known immunogens refer to pathogen protein components that have been experimentally validated to elicit effective immunoprotection in a host, and wherein the known immunogens are collected by means of integrated Database screening, literature investigation, and experimental discovery, wherein the Database comprises IEDB Database, anti-gen Database, and/or Protegen Database; the non-immunogen refers to pathogen protein components which have no experimental evidence that can trigger effective immune protection in a host, the collection mode is that all pathogen protein sequences are downloaded from a Uniprot protein database, candidate proteins are obtained from the pathogen protein sequences by adopting a random extraction mode, and a non-immunogen data set, namely a negative sample set, is established after homologous sequences with known immunogens are removed by a search tool based on a local alignment algorithm.

4. The system of claim 1, wherein in the immunogen structure data set processing module, the structure prediction software is AlphaFold2, RoseTTAFold, and/or ESMFold.

5. The system of claim 1, wherein in the machine learning classifier module, the one-dimensional amino acid sequence features are extracted by a protein language model ESM-2.

6. The system of claim 1, wherein the method employed to dimension down the three-dimensional structural feature vectors of immunogenic and non-immunogenic proteins in the machine-learned classifier module is principal component analysis.

7. The system of claim 1, wherein in the machine learning classifier module, the machine learning algorithm is extreme gradient boosting.

8. An immunogen prediction method based on a protein three-dimensional structure and a graph neural network, characterized by comprising the following steps:

(1) Three-dimensional structural feature extraction: characterizing and extracting three-dimensional structural characteristics of protein by using a graph neural network model designed based on the structural characteristics of an immunogen, and obtaining characteristic representation of interaction of amino acids in protein in a three-dimensional space by pre-training and learning a protein three-dimensional structure PDB database; in the three-dimensional structural feature extraction step, the graph neural network model is an improved neighborhood enhancement neural network model; the improved neighborhood enhanced neural network model is based on three-dimensional spatial distribution of amino acids in proteins, and an adjacency graph is built according to spatial distance among amino acids, chain connection of the amino acids and the relationship of the nearest neighbor distance among the amino acids; then, converting the amino acid adjacency graph into an edge graph based on acting force among amino acids, dividing the types of the edges into 8 quadrants according to the space positions and x, y and z axes, distinguishing different roles played by different adjacent edges in a three-dimensional space, and carrying out direction-sensitive message transmission on the edges in the vector direction of the space according to the different acting forces; finally, based on a contrast learning mechanism, obtaining the characteristic representation of the interaction of amino acids in the protein in a three-dimensional space, wherein the three-dimensional structural characteristics of the protein comprise the spatial positions of the amino acids, the interaction among different amino acids and the physical and chemical properties of the amino acids;

(2) The immunogen structure data set processing step: collecting known immunogens of pathogens as positive sample sets, randomly extracting non-immunogens of non-homologous proteins of the known immunogens from a pathogen protein database as negative sample sets, obtaining protein three-dimensional structure PDB files through structure prediction software to form a first immunogen structure data set and a non-immunogen structure data set, inputting the protein three-dimensional structure PDB files of the immunogens and the non-immunogens into the pre-trained graph neural network model, and extracting to obtain feature vectors corresponding to the three-dimensional structures of the immunogens and feature vectors corresponding to the three-dimensional structures of the non-immunogens;

(3) Machine learning classifier step: after dimension reduction is carried out on the three-dimensional structural feature vectors of the immunogen protein and the non-immunogen, the classification model training is carried out by combining the one-dimensional amino acid sequence features and adopting a machine learning algorithm, so that a trained immunogen classifier is obtained, and model prediction accuracy is evaluated based on a test set;

(4) An automatic prediction output step: a PDB file and a one-dimensional amino acid sequence of all annotated proteins of the pathogen to be predicted are input, the pre-trained graph neural network model and the immunogen classifier are applied, and a candidate immunogen list is output automatically.

9. The method according to claim 8, wherein in the step of processing the immunogenic structure data set, the pathogen is a bacterial, viral or parasitic microorganism pathogenic to a human and/or animal host.

10. The method of claim 8, wherein in the step of processing the immunogen structure dataset, the known immunogens refer to pathogen protein components that have been experimentally verified to elicit potent immunoprotection in a host, and the known immunogens are collected by means comprising integrated Database screening, literature investigation, and experimental discovery, the Database comprising IEDB Database, anti-gen Database, and/or Protegen Database; the non-immunogen refers to pathogen protein components which have no experimental evidence that can trigger effective immune protection in a host, the collection mode is that all pathogen protein sequences are downloaded from a Uniprot protein database, candidate proteins are obtained from the pathogen protein sequences by adopting a random extraction mode, and a non-immunogen data set, namely a negative sample set, is established after homologous sequences with known immunogens are removed by a search tool based on a local alignment algorithm.

11. The method of claim 8, wherein in the immunogen structure dataset processing step, the structure prediction software is AlphaFold2, RoseTTAFold, and/or ESMFold.

12. The method of claim 8, wherein in the immunogen structure dataset processing step, the one-dimensional amino acid sequence features are extracted by a protein language model ESM-2.

13. The method according to claim 8, wherein in the machine learning classifier step, the dimension reduction method used for the three-dimensional structural feature vectors of the immunogenic and non-immunogenic proteins is principal component analysis.

14. The method of claim 8, wherein in the machine learning classifier step, the machine learning algorithm is extreme gradient boosting.