WO2023217290A1

WO2023217290A1 - Genophenotypic prediction based on graph neural network

Info

Publication number: WO2023217290A1
Application number: PCT/CN2023/095224
Authority: WO
Inventors: 章依依; 吴翠玲; 徐晓刚; 王军; 李萧缘; 虞舒敏
Original assignee: 之江实验室
Priority date: 2022-10-11
Filing date: 2023-05-19
Publication date: 2023-11-16
Also published as: CN115331732A; CN115331732B; JP2024524795A

Abstract

The present disclosure relates to genophenotypic prediction based on a graph neural network. According to a method embodiment of the present disclosure, for a specific species, a graph neural network comprising multiple network layers is constructed according to the correlation between gene loci and phenotypes of the species. Each network layer of the graph neural network uses a one-dimensional convolution kernel having a length of 3, and the convolution kernel is shared by neighborhoods; and in each network layer, nodes represent the gene loci, edges represent that two gene loci are related to the same phenotype, and the weight of each edge is used for reflecting the degree of association between the gene loci. A result obtained after sample gene data of the species is input into the graph neural network is input into a multi-layer perceptron, and a corresponding phenotype classification result can be obtained. The graph neural network and the multi-layer perceptron can be trained and verified on the basis of the difference between the classification result of the sample gene data and a true value, and phenotypic classification can be carried out by using the trained graph neural network and the trained multi-layer perceptron.

Description

Gene phenotype prediction based on graph neural network

Technical field

The invention relates to the field of intelligent computing breeding, and in particular to gene phenotype prediction based on graph neural networks and corresponding graph neural network training.

Background technique

Along with the development of human civilization, crop breeding has mainly gone through three historical stages: farmers' subjective judgment based on experience, the establishment of crop breeding disciplines, and molecular selection breeding. Currently, with the development of disciplines such as big data, artificial intelligence, and the gradual establishment of genome-directed and precise improvement technologies such as gene editing and synthetic biology, crop breeding has opened a new era of intelligent breeding.

As a high-oil and high-protein plant, soybeans are an important component of grain production. How to select and cultivate high-yield soybeans is a problem that agriculturists are currently studying. The proposal of the genome-wide selection algorithm provides a direction for genetic breeding. The existing representative methods include Best Linear Unbiased Prediction (BLUP), Genomic Best Linear Unbiased Prediction (Genomic Best Linear Unbiased Prediction, GBLUP), Ridge Regression Best Linear Unbiased Prediction (RR-BLUP), Least Absolute Shrinkage and Selection Operator (LASSO), etc. However, the performance of these methods is still far from what we expect in breeding.

With the development of deep learning, researchers have begun to try to apply it in the field of breeding. For example, the deep learning-based gene phenotype prediction (DeepGS) algorithm proposed by the Northwest A&F University team can predict the phenotypic traits of wheat by constructing a convolutional neural network, and exceeds the performance of traditional whole-genome selection algorithms. However, most of the existing genome-wide selection algorithms based on deep learning use simple convolutional neural networks and do not utilize gene-related prior knowledge.

Graph neural networks can be trained on the basis of prior knowledge graphs and achieve considerable results. Graph neural networks are divided into spectrum-based methods and spatial domain-based methods, including Graph Neural Network (GNN), Graph Convolutional Network (GCN), and Graph Attention Network (GAT) and other methods. Combining graph neural networks with breeding-related knowledge, the relevant research work is still in the initial stage and is a major trend in intelligent breeding in the future.

Contents of the invention

In order to solve the deficiencies of the existing technology and achieve the purpose of improving the gene phenotype prediction performance, the present invention adopts the following technical solutions:

A gene phenotype training method based on a graph neural network, including the following steps: for a specific species, based on the correlation between the gene loci of the species and the phenotype, construct a graph neural network including a multi-layer network, wherein, in In each layer of the graph neural network, nodes represent gene sites, edges represent two gene sites related to the same phenotype, and the weight of the edge is used to reflect the degree of association between the gene sites; collecting the The genetic data and phenotypic data of multiple samples of the species are used as training data; for the training data, the genetic data is encoded based on the probability value of site detection to obtain the gene sites and genes corresponding to the genetic data. type representation; input the encoded genetic data into the graph neural network to pass through each layer of the graph neural network in sequence, wherein each layer of the graph neural network uses a network with a length of 3 dimensional convolution kernel, and the convolution kernel is shared between neighbors; based on the output result of each node in the last layer of the graph neural network, a multi-layer perceptron is used to obtain the phenotypic classification result corresponding to the genetic data ; According to the phenotype classification results and the genotype representation corresponding to the genetic data, use a loss function to perform supervised training on the model parameters of the graph neural network and/or the multi-layer perceptron.

Further, encoding the genetic data based on the probability value of gene site detection includes converting the probability value PL based on the gene site detection genotype of 0/0, 0/1, 1/1 respectively according to the following formula: The probability P of supporting each described genotype:

Among them, for each gene site, the obtained probability P of the gene site is formed into a 3-dimensional vector [a, b, c], which is used as the genotype representation corresponding to the gene site, where, The above a, b, c respectively represent the probability that the genotype of the gene site is 0/0, 0/1, 1/1; for undetected gene sites, the genotype is represented by the vector [0, 0,0] represents.

Further, when the genetic data passes through each layer of the graph neural network in sequence, uniform sampling can be used to select node neighborhoods, and each node can be updated through the weight and convolution kernel parameters of the neighborhood nodes. Specifically, the following steps may be included: for each node c in the current layer of the graph neural network, construct m candidate nodes from the first-order adjacent nodes of the node c, where m is an integer greater than 0; None Sample n nodes from m candidate nodes of node c with replacement as neighbor nodes of node c, and when m is less than n, sample all m candidate nodes as neighbor nodes of node c; aggregate all the The information of all neighboring nodes of the node c is obtained to obtain the neighborhood information of the node c. For the neighborhood information of the node c Convolution and activation operations are performed on the information obtained by splicing the information hc of the node c, and the output information h ^, _c of the node c in the current layer network of the graph neural network is obtained as the next layer of the graph neural network. Input to one layer of the network.

Among them, the formula for aggregating the information of all neighbor nodes of the node c is as follows:

Among them, hi represents the information of the i-th neighbor node of node c, and wi represents the weight of the i-th neighbor node of node c.

Among them, the specific formulas for performing the convolution and activation operations are as follows:

Among them, h ^and _c represent the information output by node c from the current layer network, that is, the input of the next layer network, σ represents the activation function, W represents the convolution kernel parameters, and hc represents the information input by node c to the current network layer.

Further, based on the output result of each node in the last layer of the graph neural network, using a multi-layer perceptron to obtain the phenotype classification result corresponding to the genetic data includes the following steps: converting the graph neural network The vectors with a dimension of 3 output by all nodes in the last layer of the network are spliced to obtain the spliced vector; the spliced vector is input into the multi-layer perceptron to obtain the classification result output by the multi-layer perceptron as the Phenotypic classification results corresponding to the above genetic data.

Further, based on the phenotype classification results and the genotype representation corresponding to the genetic data, a loss function is used to perform supervised training on the model parameters of the graph neural network and/or the multi-layer perceptron, which may specifically include the following Steps: Divide s phenotypes into k intervals on average as categories, and obtain a genotype representation true value vector with dimensions s×k, where the dimension s×k is consistent with the dimension of the phenotype classification result; use loss Function to perform multi-phenotype supervised training based on the phenotype classification result and the genotype representation true value vector of the phenotype.

Wherein, the loss function may be a focal loss function, and the formula for calculating the classification loss based on the phenotype classification result and the genotype representation true value vector of the phenotype is:

Among them, p _{x, y} represents the confidence of the phenotype classification result at the abscissa x and ordinate y of the feature map, Represents the true category label of the genotype representation true value vector of the phenotype at this position, 1 represents a positive sample, 0 represents a negative sample; γ is a value greater than 0, α is a decimal between [0, 1], γ and α are both fixed values and do not participate in training.

A gene phenotype prediction method based on graph neural network, including the following steps: for the gene data to be classified, encoding the gene data based on the probability value of site detection to obtain the gene corresponding to the gene data to be classified Locus and genotype characterization; input the encoded genetic data to be classified into the trained graph neural network and multi-layer perceptron to obtain the phenotypic results corresponding to the genetic data to be classified. Among them, the graph neural network and the multi- The layer perceptron is a gene phenotype prediction network trained by the aforementioned method for the species to which the genetic data to be classified belongs.

A graph neural network-based gene phenotype training device, used to implement the graph neural network-based gene phenotype training method, including a graph neural network building module, a data acquisition module, a precoding module, a genetic data input module and Classification module; the graph neural network building module constructs a graph neural network for genes based on the correlation between gene sites and phenotypes: nodes represent gene sites, and edges represent two gene sites that are related to a certain phenotype at the same time. The weight of the edge is used to reflect the degree of association between gene sites; the data acquisition module collects the genetic data of the sample, obtains the phenotypic data corresponding to the sample, and divides the training set and the test set for training and verification. Graph neural network; the precoding module, for training data, precodes genetic data based on site detection to obtain gene sites and their corresponding genotypes; the genetic data input module converts the encoded genetic data into Input the graph neural network constructed. Each layer of the network uses a one-dimensional convolution kernel with a length of 3, and the convolution kernel is shared between neighborhoods; the classification module splices the output results of each node, and the spliced The results are input into the multi-layer perceptron, the phenotype classification results are output, and the model is supervised and trained according to the loss function.

A gene phenotype prediction device based on a graph neural network is used to implement the gene phenotype training device based on a graph neural network. After the genetic data to be classified is encoded by the precoding module, it is then passed through the genetic data input module. Input the trained classification module to obtain the phenotypic results corresponding to the genetic data to be classified.

The advantages and beneficial effects of the present invention are: first, by using the prior knowledge of the correlation between gene phenotypes to construct a gene graph neural network, and eliminating weakly correlated gene sites, it can effectively reduce the input gene dimension, thereby achieving reduction. The purpose of dimensional denoising; secondly, by dividing phenotypes into multiple intervals for classification prediction, it can effectively reduce the difficulty of training, increase the stability of the model algorithm, and support simultaneous training and prediction of multiple phenotypes; finally, compared with Compared with traditional whole-gene selection algorithms such as rrBLUP, the technical solution provided by the present invention has better performance in the prediction of various phenotypes, including a 20% to 30% improvement in the Pearson Correlation Coefficient.

Description of the drawings

FIG. 1 is a flow chart of a graph neural network training method for gene phenotype prediction according to an embodiment of the present invention.

Figure 2 is a flow chart of a gene phenotype prediction method based on graph neural network according to an embodiment of the present invention.

Figure 3 is a simplified model architecture diagram for classifying and identifying gene phenotypes based on graph neural networks according to an embodiment of the present invention.

Figure 4 is a schematic structural diagram of a graph neural network training device for gene phenotype prediction according to an embodiment of the present invention.

Detailed ways

Specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only used to illustrate and explain the present invention, and are not intended to limit the present invention.

As shown in Figure 1, a training method of a graph neural network for gene phenotype prediction may include the following steps S110 to S160.

Step S110: For a specific species, construct a graph neural network including a multi-layer network based on the correlation between the gene locus and the phenotype of the species. In each layer of the constructed graph neural network, nodes represent gene sites, edges represent two gene sites that are simultaneously related to a certain phenotype, and the weight of the edge is used to reflect the degree of association between gene sites.

In one embodiment of the present invention, a graph neural network of soybean genes can be constructed based on the correlation information between soybean gene loci and phenotypes shown in Table 1 below. Among them, there are 39 gene loci. If two gene loci are associated with the same phenotype more times, the weight of the edge will be higher. Therefore, the edge weight can reflect the degree of association between gene sites.

Table 1

Step S120: Collect genetic data and phenotypic data of multiple samples of the species as training data. Wherein, the training data can be divided into a training set and a test set to respectively train and verify the graph neural network.

In one embodiment of the present invention, the genetic data of 3,000 soybean samples, that is, Single Nucleotide Polymorphisms (SNP) site information, is collected. During training and testing, only the 39 genes involved in Table 1 need to be used. Gene locus information. Collect s phenotypic data corresponding to each soybean sample, including plant height, number of branches, number of beans, and number of stems, that is, s=4. The above data can be randomly divided into multiple training sets and at least one test set according to a 4:1 ratio.

Step S130: For the training data, encode the gene data based on the probability value of site detection to obtain the gene site and genotype representation corresponding to the gene data.

According to an embodiment of the present invention, the input genetic data is encoded based on the probability value PL of gene site detection, and the PL value based on the gene site detection genotype is 0/0, 0/1, 1/1 Convert to the probability P of supporting each genotype according to the following formula:

Among them, the probability P obtained for a certain gene site can be formed into a 3-dimensional vector [a, b, c], as the genotype representation corresponding to the gene site, where a, b, c represent in turn The genotype of this gene locus has a probability of 0/0, 0/1, and 1/1. For undetected gene loci, their genotype representation can be represented by the vector [0,0,0].

Step S140: Input the encoded genetic data into the constructed graph neural network, so as to pass through each layer of the graph neural network in sequence. Each layer of the graph neural network uses a one-dimensional convolution kernel with a length of 3, and the convolution kernel is shared between neighborhoods.

In one embodiment of the present invention, the encoded genetic data with a dimension of 39×3 is input into the constructed graph neural network. Wherein, the graph neural network can be a graph neural network with 8 network layers. Each layer of the network uses 3 one-dimensional convolution kernels with a length of 3, and the convolution kernels are shared between neighborhoods.

In each layer of the graph neural network, uniform sampling can be used to select node neighborhoods, and each node can be updated through the weight and convolution kernel parameters of the neighborhood nodes. Specifically, the steps S141 to S144 can be included.

In step S141: In the graph neural network, for each node in the current layer, m candidate nodes are constructed from its first-order neighboring nodes. Among them, m is an integer greater than 0.

In step S142: For node c as the central node, n nodes are sampled from m candidate nodes of node c without replacement as neighbor nodes of node c. Among them, when m is less than n, that is, there are less than n candidate nodes, then all candidate nodes are sampled as neighbor nodes. In this embodiment, n=4.

Step S143: Aggregate the information of all neighbor nodes of node c to obtain the neighbor information of node c

Specifically, the aggregation formula can be expressed as follows:

In an embodiment of the present invention, when n equals 4, the aggregated neighborhood information of node c The calculation formula can be:

Step S144: Aggregate neighborhood information of node c Perform vector splicing with the information hc of node c, and concatenate The subsequent information is subjected to convolution and activation operations to obtain the output information h ^, _c of the current layer of the graph neural network.

Specifically, the formula is as follows:

Among them, h ^and _c represent the information output by node c from the current network layer, that is, the input of the next layer of network, σ represents the activation function, W represents the convolution kernel parameters, CONCAT represents the splicing operation, and hc represents the input of node c to the current network layer. information.

Step S150: Based on the output result of each node in the last layer of the graph neural network, use a multi-layer perceptron to obtain the phenotype classification result corresponding to the genetic data.

Step S160: Use a loss function to perform supervised training on the model parameters of the graph neural network and/or the multi-layer perceptron based on the phenotypic classification results and genotype representation corresponding to the genetic data. Among them, the loss function is mainly used to calculate the loss value based on the phenotype classification result and the genotype characterization.

In an embodiment of the present invention, the above step S150 may specifically include the following step S151.

In step S151: splice the vectors with dimension 3 output by all nodes in the last layer of the graph neural network, and then input the spliced vector into the multi-layer perceptron to obtain the classification result output by the multi-layer perceptron as the above Phenotypic classification results.

In one embodiment of the present invention, the probability vectors with a dimension of 3 output for all 39 nodes are spliced, and a splicing probability vector with a dimension of 39×3=117 is output; then, the splicing probability vector is input into the 2-layer full Connect to the network to get the classification results. Among them, the first layer of the fully connected network inputs a splicing probability vector with a dimension of 117, and outputs an intermediate probability vector with a dimension of 80; then, the second layer of the fully connected network inputs an intermediate probability vector with a dimension of 80, and outputs a dimension of 80. The final probability vector of 20 is used as the phenotype classification result.

In an embodiment of the present invention, the above step S160 may specifically include the following steps S161 to S162.

In step S161: divide s phenotypes into k intervals on average as categories, and then obtain a genotype representation true value vector with dimensions s×k (hereinafter also referred to as the true value vector for short).

In one embodiment of the present invention, four phenotypes are divided into five intervals on average as categories, and the dimension of the true value vector is 4×5=20. In this way, the dimensions of the truth vector correspond one-to-one with the dimensions of the phenotype classification result output by the multi-layer perceptron network. Taking plant height as an example, it can be divided into five categories according to the average interval: extremely short, short, normal, high, and extremely high. Other phenotypes can be deduced in this way and will not be described here.

In step S162: Use the loss function to perform multi-table analysis based on the phenotype classification result and the true value vector of the phenotype. Supervised training.

According to an embodiment of the present invention, the loss function in supervised training can use the focal loss function Focal Loss, and the formula for calculating the classification loss can be:

Among them, p _{x, y} represents the confidence of the phenotype classification result at the abscissa x and ordinate y of the feature map, Represents the true category label of the target at the position of the true value vector of the phenotype, 1 represents a positive sample, 0 represents a negative sample; γ is a value greater than 0, α is a decimal between [0, 1], γ and α are all fixed values and do not participate in training. In an embodiment of the present invention, the training effect is optimal when α takes a value of 0.1 and γ takes a value of 2. For example, you can use Stochastic Gradient Descent (SGD) as the optimizer and train on 4 Graphics Processing Units (GPUs). The batch size is 16 and the number of training steps is 50k. The initial The learning rate is 0.01 and then scales down by a factor of 10 at 20k and 40k steps.

According to an embodiment of the present invention, a gene phenotype prediction method based on graph neural network is also provided. As shown in Figure 2, the gene phenotype prediction method may include the following steps S210 to S220.

In step S210, for the genetic data to be classified, the genetic data is encoded based on the probability value of site detection to obtain the gene site and genotype representation corresponding to the genetic data to be classified. The processing of step S210 is basically the same as that of step S130. For details, please refer to the above and will not be described again here.

In step S220, the encoded genetic data to be classified is input into the trained graph neural network and multi-layer perceptron to obtain the phenotypic results corresponding to the genetic data to be classified. Wherein, the graph neural network and the multi-layer perceptron can be a gene phenotype prediction network for the species to which the genetic data to be classified is obtained using the aforementioned training method, and the specific training method will not be described in detail here.

Taking a simplified graph neural network as an example, assuming that the species corresponding to the genetic data to be classified has 5 gene loci, each layer of the graph neural network includes 5 nodes. As shown in Figure 3, after the input genetic data to be classified undergoes convolution and activation operations through the multi-layer graph neural network 310, the output of the graph neural network is input to the multi-layer perceptron 320 and output by the multi-layer perceptron 320. The final classification result of the genetic data.

The implementation of this part is similar to the implementation of the above method embodiment, and will not be described again here.

A training device for graph neural network for gene phenotype prediction, including a graph neural network building module, a data acquisition module, a precoding module, a gene data input module and a classification module. Among them, the graph neural network building module is aimed at a specific species and constructs a graph neural network including a multi-layer network based on the correlation between the genetic loci and phenotype of the species. Through the network, in each layer of the graph neural network, the node represents a gene site, the edge represents two gene sites related to the same phenotype, and the weight of the edge is used to reflect the relationship between the gene sites. The degree of correlation; the data acquisition module collects the genetic data and phenotypic data of multiple samples of the species as training data; the precoding module, for the training data, performs on the genetic data based on the probability value of site detection Encoding to obtain the gene locus and genotype representation corresponding to the genetic data; the genetic data input module inputs the encoded genetic data into the graph neural network to pass through each layer of the graph neural network in sequence , wherein each layer of the network in the graph neural network uses a one-dimensional convolution kernel with a length of 3, and the convolution kernel is shared between neighborhoods; the classification module is based on each layer of the last layer of the graph neural network. Based on the output results of the nodes, a multi-layer perceptron is used to obtain the phenotypic classification result corresponding to the genetic data, so that the loss function can be used to compare the graph neural network and/or the multi-layered neural network according to the phenotypic classification result. The model parameters of the layer perceptron are supervised for training.

A graph neural network-based gene phenotype prediction device, used to implement the graph neural network-based gene phenotype prediction and classification, which may include the above-mentioned precoding module and be trained via the above-mentioned training method and/or training device The resulting gene phenotype prediction network. Specifically, after the genetic data to be classified is encoded by the precoding module, the encoded genetic data to be classified is input into the trained graph neural network and multi-layer perceptron, and the corresponding genetic data to be classified is obtained. Phenotypic results. Wherein, the graph neural network and the multi-layer perceptron are gene phenotype prediction networks obtained by using the aforementioned training method and/or training device to train the species to which the genetic data to be classified belongs.

Corresponding to the aforementioned embodiment of a gene phenotype prediction method based on a graph neural network, the present invention also provides an embodiment of a gene phenotype prediction device based on a graph neural network.

Referring to Figure 4, an embodiment of the present invention provides a device for gene phenotype prediction based on a graph neural network, including a memory (specifically, it may include a non-volatile memory 430 and/or a memory 440) and one or more processors 410 . Among them, executable codes are stored in the memories 430 and 440. When the one or more processors 410 execute the executable codes, they are used to implement a graph neural network-based gene phenotype prediction in the above embodiment. method. As shown in Figure 4, the device also includes an internal bus 420 to connect the processor 410 and the memories 430, 440. In addition, the device may also include a network interface 450 for the device to communicate with the outside.

The embodiment of the graph neural network-based gene phenotype prediction device of the present invention can be applied to any device with data processing capabilities, and any device with data processing capabilities can be a device or device such as a computer. The device embodiments may be implemented by software, or may be implemented by hardware or a combination of software and hardware. Taking software implementation as an example, as a logical device, it is formed by reading the corresponding computer program instructions in the non-volatile memory into the memory and running them through the processor of any device with data processing capabilities. From the hardware level, as shown in Figure 4, it is a hardware structure diagram of any device with data processing capabilities where the gene phenotype prediction device based on graph neural network of the present invention is located. In addition to the processor shown in Figure 4 , memory, network interface, and non-volatile memory, any device with data processing capabilities where the device in the embodiment is located may also include other hardware based on the actual functions of any device with data processing capabilities. In this regard No longer.

For details on the implementation process of the functions and effects of each unit in the above device, please refer to the implementation process of the corresponding steps in the above method, and will not be described again here.

As for the device embodiment, since it basically corresponds to the method embodiment, please refer to the partial description of the method embodiment for relevant details. The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated. The components shown as units may or may not be physical units, that is, they may be located in One location, or it can be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present invention. Persons of ordinary skill in the art can understand and implement the method without any creative effort.

Embodiments of the present invention also provide a computer-readable storage medium on which a program is stored. When the program is executed by a processor, the gene phenotype prediction method based on a graph neural network in the above embodiments is implemented.

The computer-readable storage medium may be an internal storage unit of any device with data processing capabilities as described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium can also be an external storage device of any device with data processing capabilities, such as a plug-in hard disk, smart memory card (Smart Media Card, SMC), SD card, flash memory card equipped on the device (Flash Card) etc. Furthermore, the computer-readable storage medium may also include both an internal storage unit and an external storage device of any device with data processing capabilities. The computer-readable storage medium is used to store the computer program and other programs and data required by any device with data processing capabilities, and can also be used to temporarily store data that has been output or is to be output.

The above embodiments are only used to illustrate the technical solutions of the present invention, but not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments. Modify the technical solution, or make equivalent substitutions for some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solution to depart from the scope of the technical solution of the embodiments of the present invention.

Claims

A method for training a graph neural network for gene phenotype prediction, including the following steps:

For a specific species, based on the correlation between the gene locus and the phenotype of the species, a graph neural network including a multi-layer network is constructed, wherein in each layer of the graph neural network, a node represents a gene locus , the edge represents that two gene loci are related to the same phenotype, and the weight of the edge is used to reflect the degree of association between the gene loci;

Collect genetic data and phenotypic data of multiple samples of the species as training data;

For the training data, the genetic data is encoded based on the probability value of site detection to obtain the gene site and genotype representation corresponding to the genetic data;

The encoded gene data is input into the graph neural network to pass through each layer of the graph neural network in sequence, wherein each layer of the graph neural network adopts a one-dimensional convolution with a length of 3. Kernel, and the convolution kernel is shared between neighborhoods;

Based on the output results of each node in the last layer of the graph neural network, use a multi-layer perceptron to obtain the phenotypic classification results corresponding to the genetic data;

According to the phenotype classification results and the genotype representation corresponding to the genetic data, a loss function is used to perform supervised training on the model parameters of the graph neural network and/or the multi-layer perceptron.
The method according to claim 1, characterized in that: encoding the genetic data based on the probability value of gene site detection includes:

The probability values PL based on the detected genotypes of 0/0, 0/1, and 1/1 based on the gene loci are converted into the probability P of supporting each genotype according to the following formula:

For each gene site, the obtained probability P of the gene site is formed into a 3-dimensional vector [a, b, c], as the genotype representation corresponding to the gene site, where a , b, c respectively represent the probability that the genotype of the gene locus is 0/0, 0/1, 1/1,

For undetected gene loci, the genotype representation is represented by the vector [0,0,0].
The method according to claim 1, characterized in that: when the genetic data passes through each layer of the graph neural network in sequence, uniform sampling is used to select node neighborhoods, and the weights of the neighborhood nodes and The convolution kernel parameters update each node.
The method according to claim 3, characterized in that using uniform sampling to select node neighborhoods, and updating each node through the weight of neighborhood nodes and convolution kernel parameters, includes the following steps:

For each node c in the current layer of the graph neural network,

Construct m candidate nodes from the first-order adjacent nodes of the node c, where m is an integer greater than 0;

Sampling n nodes from m candidate nodes of node c without replacement as neighbor nodes of node c, and sampling all m candidate nodes as neighbor nodes of node c when m is less than n;

Aggregate the information of all neighboring nodes of the node c to obtain the neighborhood information of the node c

For the neighborhood information of the node c Convolution and activation operations are performed on the information obtained by splicing the information hc of the node c, and the output information h' c of the node c in the current layer network of the graph neural network is obtained as the next layer of the graph neural network. The input of one layer of network,

Among them, the formula for aggregating the information of all neighbor nodes of the node c is as follows:

Among them, h i represents the information of the i-th neighbor node of node c,

w i represents the weight of the i-th neighbor node of node c,

Among them, the specific formulas for performing the convolution and activation operations are as follows:

Where, h' c represents the information output from the node c in the current layer network of the graph neural network,

σ represents the activation function,

W represents the convolution kernel parameters,

h c represents the information input to the node c in the current network layer of the graph neural network.
The method according to claim 1, characterized in that: based on the output result of each node in the last layer of the graph neural network, a multi-layer perceptron is used to obtain the phenotype classification result corresponding to the genetic data, including Follow these steps:

Splice the vectors with a dimension of 3 output by all nodes in the last layer of the graph neural network to obtain the spliced vector;

The spliced vector is input into the multi-layer perceptron, and the classification result output by the multi-layer perceptron is obtained as the phenotype classification result corresponding to the genetic data.
The method according to claim 5, characterized in that, based on the phenotypic classification results and the genotype representation corresponding to the genetic data, a loss function is used to perform the graph neural network and/or the multi-layer perception Supervised training of the model parameters of the controller, including the following steps:

Divide s phenotypes into k intervals on average as categories, respectively, to obtain a genotype representation true value vector with dimensions s×k, where the dimension s×k is consistent with the dimension of the phenotype classification result;

Using a loss function, multi-phenotype supervised training is performed based on the phenotype classification result and the genotype representation true value vector of the phenotype.
The method according to claim 6, wherein the loss function is a focal loss function, and the formula for calculating the classification loss based on the phenotype classification result and the genotype representation true value vector of the phenotype is: :

Among them, p x, y represents the confidence of the phenotype classification result at the abscissa x and ordinate y of the feature map,

The true class label at that position of the genotype representation ground truth vector representing the phenotype,

1 represents a positive sample, 0 represents a negative sample;

γ is a value greater than 0, α is a decimal between [0, 1], γ and α are both fixed values and do not participate in training.
A method for gene phenotype prediction based on graph neural network, including the following steps:

For the genetic data to be classified, the genetic data is encoded based on the probability value of site detection to obtain the gene locus and genotype representation corresponding to the genetic data to be classified;

Input the encoded genetic data to be classified into the trained graph neural network and multi-layer perceptron to obtain the phenotypic results corresponding to the genetic data to be classified, wherein the graph neural network and the multi-layer perceptron The layer perceptron is a gene phenotype prediction network trained by the method described in any one of claims 1 to 7 for the species to which the genetic data to be classified belongs.
A device for training a graph neural network for gene phenotype prediction, including a processor and a memory, characterized in that a program is stored on the memory, and when the program is executed by the processor, it can realize claims 1 to The steps in the method of any one of 7.
A device, processor and memory for gene phenotype prediction based on graph neural network, characterized in that a program is stored on the memory, and when the program is executed by the processor, the method of claim 8 can be realized A step of.