WO2023217290A1 - Genophenotypic prediction based on graph neural network - Google Patents

Genophenotypic prediction based on graph neural network Download PDF

Info

Publication number
WO2023217290A1
WO2023217290A1 (PCT application No. PCT/CN2023/095224)
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
graph neural
node
gene
layer
Prior art date
Application number
PCT/CN2023/095224
Other languages
French (fr)
Chinese (zh)
Inventor
章依依
吴翠玲
徐晓刚
王军
李萧缘
虞舒敏
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室 filed Critical 之江实验室
Priority to JP2023543455A priority Critical patent/JP2024524795A/en
Publication of WO2023217290A1 publication Critical patent/WO2023217290A1/en

Links

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 - Detection of binding sites or motifs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the invention relates to the field of intelligent computing breeding, and in particular to gene phenotype prediction based on graph neural networks and corresponding graph neural network training.
  • soybeans are an important component of grain production. How to select and cultivate high-yield soybeans is a problem that agriculturists are currently studying.
  • the proposal of the genome-wide selection algorithm provides a direction for genetic breeding.
  • the existing representative methods include Best Linear Unbiased Prediction (BLUP), Genomic Best Linear Unbiased Prediction (GBLUP), Ridge Regression Best Linear Unbiased Prediction (RR-BLUP), the Least Absolute Shrinkage and Selection Operator (LASSO), etc.
  • BLUP Best Linear Unbiased Prediction
  • GBLUP Genomic Best Linear Unbiased Prediction
  • RR-BLUP Ridge Regression Best Linear Unbiased Prediction
  • LASSO Least Absolute Shrinkage and Selection Operator
  • the deep learning-based gene phenotype prediction (DeepGS) algorithm proposed by the Northwest A&F University team can predict the phenotypic traits of wheat by constructing a convolutional neural network, and exceeds the performance of traditional whole-genome selection algorithms.
  • DeepGS deep learning-based gene phenotype prediction
  • most of the existing genome-wide selection algorithms based on deep learning use simple convolutional neural networks and do not utilize gene-related prior knowledge.
  • Graph neural networks can be trained on the basis of prior knowledge graphs and achieve considerable results.
  • Graph neural networks are divided into spectrum-based methods and spatial-domain-based methods, including the Graph Neural Network (GNN), Graph Convolutional Network (GCN), Graph Attention Network (GAT), and other methods.
  • GNN Graph Neural Network
  • GCN Graph Convolutional Network
  • GAT Graph Attention Network
  • the present invention adopts the following technical solutions:
  • a gene phenotype training method based on a graph neural network, including the following steps: for a specific species, construct a graph neural network comprising multiple network layers according to the correlation between the gene loci of the species and its phenotypes, wherein, in each layer of the graph neural network, nodes represent gene loci, edges represent two gene loci related to the same phenotype, and the weight of an edge reflects the degree of association between the gene loci; collect the genetic data and phenotypic data of multiple samples of the species as training data; for the training data, encode the genetic data based on the probability values of locus detection to obtain the gene loci and genotype representations corresponding to the genetic data; input the encoded genetic data into the graph neural network so that it passes through each layer in sequence, each layer using a one-dimensional convolution kernel of length 3 shared between neighborhoods; based on the output of each node in the last layer, use a multi-layer perceptron to obtain the phenotype classification result corresponding to the genetic data; and, according to the phenotype classification result and the genotype representation, use a loss function to perform supervised training on the model parameters of the graph neural network and/or the multi-layer perceptron.
  • encoding the genetic data based on the probability values of gene locus detection includes converting the probability value PL for each detected genotype 0/0, 0/1, 1/1 into the probability P of supporting that genotype, according to the formula P = 10^(-PL/10).
  • for each gene locus, the obtained probabilities P form a 3-dimensional vector [a, b, c] that serves as the genotype representation of the locus, where a, b and c respectively represent the probability that the genotype of the locus is 0/0, 0/1 and 1/1; for undetected gene loci, the genotype representation is the vector [0, 0, 0].
  • when the genetic data passes through each layer of the graph neural network in sequence, uniform sampling can be used for node neighborhood selection, and each node can be updated through the weights of its neighborhood nodes and the convolution kernel parameters.
  • this may specifically include the following steps: for each node c in the current layer of the graph neural network, construct m candidate nodes from the first-order adjacent nodes of node c, where m is an integer greater than 0; sample n nodes without replacement from the m candidate nodes of node c as the neighborhood nodes of node c, and when m is less than n, take all m candidate nodes as the neighborhood nodes of node c; aggregate the information of all neighborhood nodes of node c to obtain the neighborhood information of node c.
  • concatenate the neighborhood information of node c with the information h_c of node c, and perform convolution and activation operations on the concatenated information to obtain the output information h'_c of node c in the current layer of the graph neural network, which serves as the input to the next layer of the network.
  • the aggregation formula is h_N(c) = Σ_{i=1..n} w_i·h_i, where h_i represents the information of the i-th neighborhood node of node c and w_i represents the weight of the i-th neighborhood node of node c.
  • the convolution and activation operation is h'_c = σ(W·CONCAT(h_N(c), h_c)), where h'_c represents the information output by node c from the current layer network, that is, the input of the next layer network, σ represents the activation function, W represents the convolution kernel parameters, and h_c represents the information input by node c to the current network layer.
  • using a multi-layer perceptron to obtain the phenotype classification result corresponding to the genetic data includes the following steps: concatenate the 3-dimensional vectors output by all nodes in the last layer of the graph neural network to obtain a concatenated vector; input the concatenated vector into the multi-layer perceptron and take the classification result output by the multi-layer perceptron as the phenotype classification result corresponding to the genetic data.
  • a loss function is used to perform supervised training on the model parameters of the graph neural network and/or the multi-layer perceptron, which may specifically include the following steps: divide each of the s phenotypes evenly into k intervals as categories, obtaining a true-value vector of dimension s×k for the phenotypes, consistent with the dimension of the phenotype classification result; use the loss function to perform multi-phenotype supervised training based on the phenotype classification result and the true-value vector of the phenotypes.
  • the loss function may be a focal loss function, with the classification loss calculated from the phenotype classification result and the true-value vector of the phenotypes.
  • p_{x,y} represents the confidence of the phenotype classification result at the abscissa x and ordinate y of the feature map, and the corresponding entry of the true-value vector of the phenotypes gives the true category label at that position, 1 representing a positive sample and 0 a negative sample; γ is a value greater than 0, α is a decimal in [0, 1], and both γ and α are fixed values that do not participate in training.
  • a gene phenotype prediction method based on a graph neural network, including the following steps: for the genetic data to be classified, encode the genetic data based on the probability values of locus detection to obtain the gene loci and genotype representations corresponding to the genetic data to be classified; input the encoded genetic data to be classified into the trained graph neural network and multi-layer perceptron to obtain the phenotype results corresponding to the genetic data to be classified.
  • the graph neural network and the multi-layer perceptron are a gene phenotype prediction network trained by the aforementioned method for the species to which the genetic data to be classified belongs.
  • a graph neural network-based gene phenotype training device, used to implement the graph neural network-based gene phenotype training method, includes a graph neural network building module, a data acquisition module, a precoding module, a genetic data input module and a classification module.
  • the graph neural network building module constructs a graph neural network for the genes based on the correlation between gene loci and phenotypes: nodes represent gene loci, edges represent two gene loci that are simultaneously related to a certain phenotype, and the weight of an edge reflects the degree of association between gene loci; the data acquisition module collects the genetic data of the samples, obtains the corresponding phenotypic data, and divides them into a training set and a test set for training and verifying the graph neural network.
  • the precoding module, for the training data, precodes the genetic data based on locus detection to obtain the gene loci and their corresponding genotypes; the genetic data input module inputs the encoded genetic data into the constructed graph neural network, where each layer of the network uses a one-dimensional convolution kernel of length 3 and the convolution kernel is shared between neighborhoods; the classification module concatenates the output results of each node, inputs the concatenated result into the multi-layer perceptron, outputs the phenotype classification result, and supervises the training of the model according to the loss function.
  • a gene phenotype prediction device based on a graph neural network is provided, building on the above gene phenotype training device based on a graph neural network.
  • after the genetic data to be classified is encoded by the precoding module, it is input through the genetic data input module into the trained classification module to obtain the phenotype results corresponding to the genetic data to be classified.
  • the advantages and beneficial effects of the present invention are: first, by using prior knowledge of the correlations between genes and phenotypes to construct the gene graph neural network and eliminating weakly correlated gene loci, the input gene dimension can be effectively reduced, achieving dimensionality reduction and denoising; second, by dividing phenotypes into multiple intervals for classification prediction, the difficulty of training can be effectively reduced, the stability of the model and algorithm increased, and simultaneous training and prediction of multiple phenotypes supported; finally, compared with traditional whole-genome selection algorithms such as rrBLUP, the technical solution provided by the present invention performs better in predicting the various phenotypes, including a 20% to 30% improvement in the Pearson Correlation Coefficient.
  • FIG. 1 is a flow chart of a graph neural network training method for gene phenotype prediction according to an embodiment of the present invention.
  • Figure 2 is a flow chart of a gene phenotype prediction method based on graph neural network according to an embodiment of the present invention.
  • Figure 3 is a simplified model architecture diagram for classifying and identifying gene phenotypes based on graph neural networks according to an embodiment of the present invention.
  • Figure 4 is a schematic structural diagram of a graph neural network training device for gene phenotype prediction according to an embodiment of the present invention.
  • a training method of a graph neural network for gene phenotype prediction may include the following steps S110 to S160.
  • Step S110 For a specific species, construct a graph neural network including a multi-layer network based on the correlation between the gene locus and the phenotype of the species.
  • nodes represent gene sites
  • edges represent two gene sites that are simultaneously related to a certain phenotype
  • the weight of the edge is used to reflect the degree of association between gene sites.
  • a graph neural network of soybean genes can be constructed based on the correlation information between soybean gene loci and phenotypes shown in Table 1 below. There are 39 gene loci; the more often two gene loci are associated with the same phenotype, the higher the weight of the edge between them, so the edge weight reflects the degree of association between gene loci.
  • Step S120 Collect genetic data and phenotypic data of multiple samples of the species as training data.
  • the training data can be divided into a training set and a test set to respectively train and verify the graph neural network.
  • the genetic data of 3,000 soybean samples, that is, Single Nucleotide Polymorphism (SNP) locus information, is collected.
  • SNP Single Nucleotide Polymorphism
  • during training and testing, only the information of the 39 gene loci involved in Table 1 needs to be used.
  • the input genetic data is encoded based on the probability value PL of gene locus detection; the PL values for the genotypes 0/0, 0/1 and 1/1 are converted into the probability P of supporting each genotype according to the formula P = 10^(-PL/10).
  • the probabilities P obtained for a given gene locus can be formed into a 3-dimensional vector [a, b, c], as the genotype representation corresponding to that locus, where a, b and c are in turn the probabilities that the genotype of the locus is 0/0, 0/1 and 1/1.
  • for undetected gene loci, the genotype representation can be represented by the vector [0, 0, 0].
  • Step S140 Input the encoded genetic data into the constructed graph neural network, so as to pass through each layer of the graph neural network in sequence.
  • Each layer of the graph neural network uses a one-dimensional convolution kernel with a length of 3, and the convolution kernel is shared between neighborhoods.
  • the encoded genetic data, with dimension 39×3, is input into the constructed graph neural network.
  • the graph neural network can be a graph neural network with 8 network layers. Each layer of the network uses 3 one-dimensional convolution kernels with a length of 3, and the convolution kernels are shared between neighborhoods.
  • step S141: in the graph neural network, for each node in the current layer, m candidate nodes are constructed from its first-order adjacent nodes, where m is an integer greater than 0.
  • step S142: for node c as the central node, n nodes are sampled without replacement from the m candidate nodes of node c as the neighborhood nodes of node c; when m is less than n, all candidate nodes are taken as neighborhood nodes. In this embodiment, n = 4.
  • step S143: aggregate the information of all neighborhood nodes of node c to obtain the neighborhood information h_N(c) of node c.
  • the aggregation formula can be expressed as h_N(c) = Σ_{i=1..n} w_i·h_i, where h_i represents the information of the i-th neighborhood node of node c and w_i represents the weight of the i-th neighborhood node of node c.
  • when n equals 4, the aggregated neighborhood information of node c is h_N(c) = w_1·h_1 + w_2·h_2 + w_3·h_3 + w_4·h_4.
  • step S144: concatenate the aggregated neighborhood information h_N(c) of node c with the information h_c of node c, and perform convolution and activation operations on the concatenated information to obtain the output information h'_c of the current layer of the graph neural network.
  • the specific formula is h'_c = σ(W·CONCAT(h_N(c), h_c)), where h'_c represents the information output by node c from the current network layer, that is, the input of the next layer of the network, σ represents the activation function, W represents the convolution kernel parameters, CONCAT represents the concatenation operation, and h_c represents the input of node c to the current network layer.
  • Step S150 Based on the output result of each node in the last layer of the graph neural network, use a multi-layer perceptron to obtain the phenotype classification result corresponding to the genetic data.
  • Step S160 Use a loss function to perform supervised training on the model parameters of the graph neural network and/or the multi-layer perceptron based on the phenotypic classification results and genotype representation corresponding to the genetic data.
  • the loss function is mainly used to calculate the loss value based on the phenotype classification result and the genotype characterization.
  • step S150 may specifically include the following step S151.
  • step S151: concatenate the 3-dimensional vectors output by all nodes in the last layer of the graph neural network, then input the concatenated vector into the multi-layer perceptron to obtain the classification result output by the multi-layer perceptron as the phenotype classification result.
  • the first layer of the fully connected network takes the concatenated probability vector of dimension 117 as input and outputs an intermediate probability vector of dimension 80; the second layer of the fully connected network takes the intermediate probability vector of dimension 80 as input and outputs a final probability vector of dimension 20, which is used as the phenotype classification result.
  • the above step S160 may specifically include the following steps S161 to S162.
  • step S161: divide each of the s phenotypes evenly into k intervals as categories, obtaining a genotype-representation true-value vector of dimension s×k (hereinafter also referred to as the true-value vector for short).
  • the dimensions of the true-value vector correspond one-to-one with the dimensions of the phenotype classification result output by the multi-layer perceptron. Taking plant height as an example, it can be divided by equal intervals into five categories: extremely short, short, normal, tall, and extremely tall. Other phenotypes are handled analogously and are not described here.
  • step S162: use the loss function to perform multi-phenotype supervised training based on the phenotype classification result and the true-value vector of the phenotypes.
  • the loss function in supervised training can use the focal loss function (Focal Loss), with the classification loss computed from the phenotype classification result and the true-value vector of the phenotypes.
  • p_{x,y} represents the confidence of the phenotype classification result at the abscissa x and ordinate y of the feature map, and the corresponding entry of the true-value vector of the phenotypes gives the true category label of the target at that position, with 1 representing a positive sample and 0 a negative sample; γ is a value greater than 0, α is a decimal in [0, 1], and both γ and α are fixed values that do not participate in training.
  • the training effect is best when α takes a value of 0.1 and γ takes a value of 2.
  • SGD Stochastic Gradient Descent
  • GPUs Graphics Processing Units
  • the batch size is 16 and the number of training steps is 50k.
  • the initial learning rate is 0.01 and is then reduced by a factor of 10 at 20k and 40k steps.
  • a gene phenotype prediction method based on graph neural network is also provided.
  • the gene phenotype prediction method may include the following steps S210 to S220.
  • step S210 for the genetic data to be classified, the genetic data is encoded based on the probability value of site detection to obtain the gene site and genotype representation corresponding to the genetic data to be classified.
  • the processing of step S210 is basically the same as that of step S130; for details, refer to the description above, which is not repeated here.
  • step S220 the encoded genetic data to be classified is input into the trained graph neural network and multi-layer perceptron to obtain the phenotypic results corresponding to the genetic data to be classified.
  • the graph neural network and the multi-layer perceptron can be a gene phenotype prediction network, obtained using the aforementioned training method, for the species to which the genetic data to be classified belongs; the specific training method is not described in detail here.
  • in the simplified architecture of Figure 3, each layer of the graph neural network includes 5 nodes.
  • the output of the graph neural network is input to the multi-layer perceptron 320, and the multi-layer perceptron 320 outputs the final classification result of the genetic data.
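  • as an illustrative sketch only (component names and shapes are assumptions, not reference code from the patent), prediction with the trained model follows the Figure 3 pipeline: encode the sample's PL values into the 39×3 matrix, propagate it through the trained graph neural network, concatenate the node outputs, and read the phenotype intervals off the multi-layer perceptron output, as in the Python fragment below.

```python
import numpy as np

def predict_phenotypes(pl_by_locus, encode, gnn, mlp, s=4, k=5):
    """End-to-end inference sketch: PL values -> genotype matrix -> GNN -> MLP -> intervals.

    encode, gnn and mlp are placeholders for the trained components described above;
    this function only fixes the data flow, not their internals.
    """
    x = encode(pl_by_locus)                 # (39, 3) genotype representation
    node_out = gnn(x)                       # (39, 3) outputs of the last GNN layer
    scores = mlp(node_out.reshape(-1))      # (s * k,) = (20,) classification scores
    scores = scores.reshape(s, k)
    return scores.argmax(axis=1)            # predicted interval index per phenotype

# Toy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(2)
dummy_encode = lambda pls: rng.random((39, 3))
dummy_gnn = lambda x: x
dummy_mlp = lambda v: rng.random(20)
print(predict_phenotypes(None, dummy_encode, dummy_gnn, dummy_mlp))  # e.g. [3 0 4 1]
```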
  • a training device of a graph neural network for gene phenotype prediction, including a graph neural network building module, a data acquisition module, a precoding module, a genetic data input module and a classification module.
  • the graph neural network building module is aimed at a specific species and constructs a graph neural network including a multi-layer network based on the correlation between the genetic loci and phenotype of the species.
  • the node represents a gene site
  • the edge represents two gene sites related to the same phenotype
  • the weight of the edge is used to reflect the relationship between the gene sites.
  • the data acquisition module collects the genetic data and phenotypic data of multiple samples of the species as training data; the precoding module, for the training data, encodes the genetic data based on the probability values of locus detection to obtain the gene loci and genotype representations corresponding to the genetic data; the genetic data input module inputs the encoded genetic data into the graph neural network so that it passes through each layer of the graph neural network in sequence, where each layer uses a one-dimensional convolution kernel of length 3 and the convolution kernel is shared between neighborhoods.
  • based on the output of each node in the last layer of the graph neural network, the classification module uses a multi-layer perceptron to obtain the phenotype classification result corresponding to the genetic data, so that a loss function can be used to perform supervised training on the model parameters of the graph neural network and/or the multi-layer perceptron according to the phenotype classification result.
  • a graph neural network-based gene phenotype prediction device, used to implement the graph neural network-based gene phenotype prediction and classification, may include the above-mentioned precoding module together with the gene phenotype prediction network obtained via the above-mentioned training method and/or training device.
  • the encoded genetic data to be classified is input into the trained graph neural network and multi-layer perceptron, and the phenotype results corresponding to the genetic data to be classified are obtained.
  • the graph neural network and the multi-layer perceptron are a gene phenotype prediction network obtained by training with the aforementioned training method and/or training device for the species to which the genetic data to be classified belongs.
  • the present invention also provides an embodiment of a gene phenotype prediction device based on a graph neural network.
  • an embodiment of the present invention provides a device for gene phenotype prediction based on a graph neural network, including a memory (specifically, it may include a non-volatile memory 430 and/or a memory 440) and one or more processors 410 .
  • executable codes are stored in the memories 430 and 440.
  • when the one or more processors 410 execute the executable code, they implement the graph neural network-based gene phenotype prediction method of the above embodiment.
  • the device also includes an internal bus 420 to connect the processor 410 and the memories 430, 440.
  • the device may also include a network interface 450 for the device to communicate with the outside.
  • the embodiment of the graph neural network-based gene phenotype prediction device of the present invention can be applied to any device with data processing capabilities, which can be a device or apparatus such as a computer.
  • the device embodiments may be implemented by software, by hardware, or by a combination of software and hardware. Taking software implementation as an example, as a logical device it is formed by the processor of the device with data processing capabilities reading the corresponding computer program instructions from the non-volatile memory into memory and running them. At the hardware level, Figure 4 shows a hardware structure diagram of the device with data processing capabilities where the graph neural network-based gene phenotype prediction device of the present invention is located; in addition to the processor, memory, network interface and non-volatile memory shown in Figure 4, the device in the embodiment may also include other hardware according to its actual functions, which is not described further here.
  • since the device embodiment basically corresponds to the method embodiment, refer to the partial description of the method embodiment for relevant details.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated.
  • the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present invention. Persons of ordinary skill in the art can understand and implement this without creative effort.
  • Embodiments of the present invention also provide a computer-readable storage medium on which a program is stored.
  • when the program is executed by a processor, the gene phenotype prediction method based on a graph neural network in the above embodiments is implemented.
  • the computer-readable storage medium may be an internal storage unit of any device with data processing capabilities as described in any of the foregoing embodiments, such as a hard disk or a memory.
  • the computer-readable storage medium can also be an external storage device of any device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash Card provided on the device.
  • the computer-readable storage medium may also include both an internal storage unit and an external storage device of any device with data processing capabilities.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by any device with data processing capabilities, and can also be used to temporarily store data that has been output or is to be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • Mathematical Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present disclosure relates to genophenotypic prediction based on a graph neural network. According to a method embodiment of the present disclosure, for a specific species, a graph neural network comprising multiple network layers is constructed according to the correlation between gene loci and phenotypes of the species. Each network layer of the graph neural network uses a one-dimensional convolution kernel having a length of 3, and the convolution kernel is shared by neighborhoods; and in each network layer, nodes represent the gene loci, edges represent that two gene loci are related to the same phenotype, and the weight of each edge is used for reflecting the degree of association between the gene loci. A result obtained after sample gene data of the species is input into the graph neural network is input into a multi-layer perceptron, and a corresponding phenotype classification result can be obtained. The graph neural network and the multi-layer perceptron can be trained and verified on the basis of the difference between the classification result of the sample gene data and a true value, and phenotypic classification can be carried out by using the trained graph neural network and the trained multi-layer perceptron.

Description

Gene phenotype prediction based on graph neural network
Technical Field
The invention relates to the field of intelligent computing breeding, and in particular to gene phenotype prediction based on graph neural networks and the corresponding graph neural network training.
Background
Along with the development of human civilization, crop breeding has gone through three historical stages: farmers' subjective judgment based on experience, the establishment of crop breeding as a discipline, and molecular selection breeding. Currently, with the development of disciplines such as big data and artificial intelligence, and the gradual establishment of genome-directed precise improvement technologies such as gene editing and synthetic biology, crop breeding has entered a new era of intelligent breeding.
As a high-oil, high-protein crop, soybean is an important component of grain production. How to select and cultivate high-yield soybeans is a problem that agronomists are currently studying. The proposal of genome-wide selection algorithms provides a direction for genetic breeding; representative existing methods include Best Linear Unbiased Prediction (BLUP), Genomic Best Linear Unbiased Prediction (GBLUP), Ridge Regression Best Linear Unbiased Prediction (RR-BLUP), and the Least Absolute Shrinkage and Selection Operator (LASSO). However, the performance of these methods is still far from what is expected in breeding.
With the development of deep learning, researchers have begun to apply it in the field of breeding. For example, the deep learning-based gene phenotype prediction (DeepGS) algorithm proposed by the Northwest A&F University team predicts phenotypic traits of wheat by constructing a convolutional neural network and exceeds the performance of traditional genome-wide selection algorithms. However, most existing deep learning-based genome-wide selection algorithms use simple convolutional neural networks and do not utilize gene-related prior knowledge.
Graph neural networks can be trained on the basis of prior knowledge graphs and achieve considerable results. Graph neural networks are divided into spectrum-based methods and spatial-domain-based methods, including the Graph Neural Network (GNN), Graph Convolutional Network (GCN), Graph Attention Network (GAT), and other methods. Research combining graph neural networks with breeding-related knowledge is still at an initial stage and is a major trend for future intelligent breeding.
Summary of the Invention
In order to overcome the deficiencies of the prior art and improve gene phenotype prediction performance, the present invention adopts the following technical solutions:
A gene phenotype training method based on a graph neural network includes the following steps: for a specific species, construct a graph neural network comprising multiple network layers according to the correlation between the gene loci of the species and its phenotypes, wherein, in each layer of the graph neural network, nodes represent gene loci, edges represent two gene loci related to the same phenotype, and the weight of an edge reflects the degree of association between the gene loci; collect the genetic data and phenotypic data of multiple samples of the species as training data; for the training data, encode the genetic data based on the probability values of locus detection to obtain the gene loci and genotype representations corresponding to the genetic data; input the encoded genetic data into the graph neural network so that it passes through each layer of the graph neural network in sequence, wherein each layer of the graph neural network uses a one-dimensional convolution kernel of length 3 and the convolution kernel is shared between neighborhoods; based on the output of each node in the last layer of the graph neural network, use a multi-layer perceptron to obtain the phenotype classification result corresponding to the genetic data; and, according to the phenotype classification result and the genotype representation corresponding to the genetic data, use a loss function to perform supervised training on the model parameters of the graph neural network and/or the multi-layer perceptron.
Further, encoding the genetic data based on the probability values of gene locus detection includes converting the probability value PL for each detected genotype 0/0, 0/1, 1/1 into the probability P of supporting that genotype, according to the following formula:
P = 10^(-PL/10)
For each gene locus, the obtained probabilities P form a 3-dimensional vector [a, b, c] that serves as the genotype representation of the locus, where a, b and c respectively represent the probability that the genotype of the locus is 0/0, 0/1 and 1/1; for undetected gene loci, the genotype representation is the vector [0, 0, 0].
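As an illustrative sketch only (the patent publishes no reference code), the following Python fragment performs this per-locus encoding; the Phred-scaled conversion P = 10^(-PL/10) and all variable names are assumptions:

```python
import numpy as np

def encode_locus(pl_triplet):
    """Encode one gene locus from its PL values (Phred-scaled likelihoods for
    genotypes 0/0, 0/1, 1/1) into the 3-dimensional representation [a, b, c].
    An undetected locus (pl_triplet is None) is encoded as [0, 0, 0]."""
    if pl_triplet is None:
        return np.zeros(3)
    pl = np.asarray(pl_triplet, dtype=float)
    return 10.0 ** (-pl / 10.0)           # assumed conversion P = 10^(-PL/10)

def encode_sample(pl_by_locus):
    """Stack the per-locus vectors into the 39 x 3 input matrix of the embodiment."""
    return np.stack([encode_locus(pl) for pl in pl_by_locus])

# Example: one well-supported 0/0 locus, one undetected locus, 37 further loci.
sample = encode_sample([[0, 30, 60], None] + [[0, 20, 40]] * 37)
print(sample.shape)                        # (39, 3)
```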
Further, when the genetic data passes through each layer of the graph neural network in sequence, uniform sampling can be used for node neighborhood selection, and each node is updated through the weights of its neighborhood nodes and the convolution kernel parameters. Specifically, this may include the following steps: for each node c in the current layer of the graph neural network, construct m candidate nodes from the first-order adjacent nodes of node c, where m is an integer greater than 0; sample n nodes without replacement from the m candidate nodes of node c as the neighborhood nodes of node c, and when m is less than n, take all m candidate nodes as the neighborhood nodes of node c; aggregate the information of all neighborhood nodes of node c to obtain the neighborhood information of node c; concatenate the neighborhood information of node c with the information h_c of node c, and perform convolution and activation operations on the concatenated information to obtain the output information h'_c of node c in the current layer of the graph neural network, which serves as the input to the next layer of the network.
The formula for aggregating the information of all neighborhood nodes of node c is as follows:
h_N(c) = Σ_{i=1..n} w_i · h_i
where h_i represents the information of the i-th neighborhood node of node c, and w_i represents the weight of the i-th neighborhood node of node c.
The specific formula for the convolution and activation operations is as follows:
h'_c = σ(W · CONCAT(h_N(c), h_c))
where h'_c represents the information output by node c from the current layer network, that is, the input of the next layer network, σ represents the activation function, W represents the convolution kernel parameters, and h_c represents the information input by node c to the current network layer.
Further, based on the output of each node in the last layer of the graph neural network, using a multi-layer perceptron to obtain the phenotype classification result corresponding to the genetic data includes the following steps: concatenate the 3-dimensional vectors output by all nodes in the last layer of the graph neural network to obtain a concatenated vector; input the concatenated vector into the multi-layer perceptron, and take the classification result output by the multi-layer perceptron as the phenotype classification result corresponding to the genetic data.
Further, according to the phenotype classification result and the genotype representation corresponding to the genetic data, using a loss function to perform supervised training on the model parameters of the graph neural network and/or the multi-layer perceptron may specifically include the following steps: divide each of the s phenotypes evenly into k intervals as categories, obtaining a true-value vector of dimension s×k for the phenotypes, consistent with the dimension of the phenotype classification result; use the loss function to perform multi-phenotype supervised training based on the phenotype classification result and the true-value vector of the phenotypes.
The loss function may be the focal loss function (Focal Loss), with the classification loss calculated from the phenotype classification result and the true-value vector of the phenotypes, where p_{x,y} represents the confidence of the phenotype classification result at the abscissa x and ordinate y of the feature map, and the corresponding entry of the true-value vector of the phenotypes gives the true category label at that position, 1 representing a positive sample and 0 a negative sample; γ is a value greater than 0, α is a decimal in [0, 1], and both γ and α are fixed values that do not participate in training.
A gene phenotype prediction method based on a graph neural network includes the following steps: for the genetic data to be classified, encode the genetic data based on the probability values of locus detection to obtain the gene loci and genotype representations corresponding to the genetic data to be classified; input the encoded genetic data to be classified into the trained graph neural network and multi-layer perceptron to obtain the phenotype results corresponding to the genetic data to be classified. The graph neural network and the multi-layer perceptron are a gene phenotype prediction network trained by the aforementioned method for the species to which the genetic data to be classified belongs.
A gene phenotype training device based on a graph neural network, used to implement the above gene phenotype training method based on a graph neural network, includes a graph neural network building module, a data acquisition module, a precoding module, a genetic data input module and a classification module. The graph neural network building module constructs a graph neural network for the genes based on the correlation between gene loci and phenotypes: nodes represent gene loci, edges represent two gene loci that are simultaneously related to a certain phenotype, and the weight of an edge reflects the degree of association between gene loci. The data acquisition module collects the genetic data of the samples, obtains the corresponding phenotypic data, and divides them into a training set and a test set for training and verifying the graph neural network. The precoding module, for the training data, precodes the genetic data based on locus detection to obtain the gene loci and their corresponding genotypes. The genetic data input module inputs the encoded genetic data into the constructed graph neural network, where each layer of the network uses a one-dimensional convolution kernel of length 3 and the convolution kernel is shared between neighborhoods. The classification module concatenates the output results of each node, inputs the concatenated result into the multi-layer perceptron, outputs the phenotype classification result, and supervises the training of the model according to the loss function.
A gene phenotype prediction device based on a graph neural network is provided, building on the above gene phenotype training device based on a graph neural network: after the genetic data to be classified is encoded by the precoding module, it is input through the genetic data input module into the trained classification module to obtain the phenotype results corresponding to the genetic data to be classified.
The advantages and beneficial effects of the present invention are as follows. First, by using prior knowledge of the correlations between genes and phenotypes to construct the gene graph neural network and eliminating weakly correlated gene loci, the input gene dimension can be effectively reduced, achieving dimensionality reduction and denoising. Second, by dividing phenotypes into multiple intervals for classification prediction, the difficulty of training can be effectively reduced, the stability of the model and algorithm increased, and simultaneous training and prediction of multiple phenotypes supported. Finally, compared with traditional whole-genome selection algorithms such as rrBLUP, the technical solution provided by the present invention performs better in predicting the various phenotypes, including a 20% to 30% improvement in the Pearson Correlation Coefficient.
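For reference, the Pearson correlation cited above measures the linear agreement between predicted and observed phenotype values; a minimal sketch (illustrative only, with made-up numbers) is:

```python
import numpy as np

def pearson(pred, truth):
    """Pearson correlation coefficient between predicted and observed phenotypes."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    pc, tc = pred - pred.mean(), truth - truth.mean()
    return float((pc * tc).sum() / np.sqrt((pc ** 2).sum() * (tc ** 2).sum()))

# Hypothetical plant-height predictions versus measurements (cm).
print(round(pearson([80, 95, 110, 70], [78, 99, 105, 74]), 3))
```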
Description of the Drawings
Figure 1 is a flow chart of a training method of a graph neural network for gene phenotype prediction according to an embodiment of the present invention.
Figure 2 is a flow chart of a gene phenotype prediction method based on a graph neural network according to an embodiment of the present invention.
Figure 3 is a simplified model architecture diagram for classifying and identifying gene phenotypes based on a graph neural network according to an embodiment of the present invention.
Figure 4 is a schematic structural diagram of a training device of a graph neural network for gene phenotype prediction according to an embodiment of the present invention.
Detailed Description
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only used to illustrate and explain the present invention and are not intended to limit it.
As shown in Figure 1, a training method of a graph neural network for gene phenotype prediction may include the following steps S110 to S160.
Step S110: for a specific species, construct a graph neural network comprising multiple network layers according to the correlation between the gene loci of the species and its phenotypes. In each layer of the constructed graph neural network, nodes represent gene loci, edges represent two gene loci that are simultaneously related to a certain phenotype, and the weight of an edge reflects the degree of association between the gene loci.
In one embodiment of the present invention, a graph neural network of soybean genes can be constructed based on the correlation information between soybean gene loci and phenotypes shown in Table 1 below. There are 39 gene loci; the more often two gene loci are associated with the same phenotype, the higher the weight of the edge between them, so the edge weight reflects the degree of association between gene loci.
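As a construction sketch under stated assumptions (the locus and phenotype names below are hypothetical and the real Table 1 is not reproduced here), the adjacency structure can be derived from a Table-1-style association list by connecting any two loci that share an associated phenotype and counting the shared phenotypes as the edge weight:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical excerpt of a Table-1-style mapping: phenotype -> associated loci.
associations = {
    "plant_height":  ["locus_01", "locus_05", "locus_17"],
    "branch_number": ["locus_05", "locus_22"],
    "bean_number":   ["locus_01", "locus_05", "locus_30"],
    "stem_number":   ["locus_17", "locus_22"],
}

edge_weight = defaultdict(int)
for loci in associations.values():
    for a, b in combinations(sorted(loci), 2):   # every locus pair sharing this phenotype
        edge_weight[(a, b)] += 1                 # weight = number of shared phenotypes

for (a, b), w in sorted(edge_weight.items()):
    print(a, "--", b, "weight", w)
```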
Table 1 (correlation between the 39 soybean gene loci and the phenotypes)
Step S120: collect the genetic data and phenotypic data of multiple samples of the species as training data. The training data can be divided into a training set and a test set, used respectively to train and verify the graph neural network.
In one embodiment of the present invention, the genetic data of 3,000 soybean samples, that is, Single Nucleotide Polymorphism (SNP) locus information, is collected; during training and testing, only the information of the 39 gene loci involved in Table 1 needs to be used. The s kinds of phenotypic data collected for each soybean sample are plant height, number of branches, number of beans and number of stems, i.e. s = 4. The above data can be randomly divided into training sets and a test set at a ratio of 4:1.
Step S130: for the training data, encode the genetic data based on the probability values of locus detection to obtain the gene loci and genotype representations corresponding to the genetic data.
According to an embodiment of the present invention, the input genetic data is encoded based on the probability value PL of gene locus detection; the PL values for the genotypes 0/0, 0/1 and 1/1 are converted into the probability P of supporting each genotype according to the following formula:
P = 10^(-PL/10)
The probabilities P obtained for a given gene locus can be formed into a 3-dimensional vector [a, b, c] as the genotype representation corresponding to that locus, where a, b and c are in turn the probabilities that the genotype of the locus is 0/0, 0/1 and 1/1. For undetected gene loci, the genotype representation can be represented by the vector [0, 0, 0].
Step S140: input the encoded genetic data into the constructed graph neural network so that it passes through each layer of the graph neural network in sequence. Each layer of the graph neural network uses a one-dimensional convolution kernel of length 3, and the convolution kernel is shared between neighborhoods.
In one embodiment of the present invention, the encoded genetic data, with dimension 39×3, is input into the constructed graph neural network. The graph neural network may have 8 network layers; each layer uses 3 one-dimensional convolution kernels of length 3, and the convolution kernels are shared between neighborhoods.
In each layer of the graph neural network, uniform sampling can be used for node neighborhood selection, and each node is updated through the weights of its neighborhood nodes and the convolution kernel parameters, which may specifically include the following steps S141 to S144.
Step S141: in the graph neural network, for each node in the current layer, construct m candidate nodes from its first-order adjacent nodes, where m is an integer greater than 0.
Step S142: for node c as the central node, sample n nodes without replacement from the m candidate nodes of node c as the neighborhood nodes of node c. When m is less than n, i.e. there are fewer than n candidate nodes, all candidate nodes are sampled as neighborhood nodes. In this embodiment, n = 4.
Step S143: aggregate the information of all neighborhood nodes of node c to obtain the neighborhood information h_N(c) of node c.
Specifically, the aggregation formula can be expressed as follows:
h_N(c) = Σ_{i=1..n} w_i · h_i
where h_i represents the information of the i-th neighborhood node of node c, and w_i represents the weight of the i-th neighborhood node of node c.
In an embodiment of the present invention, when n equals 4, the aggregated neighborhood information of node c can be calculated as:
h_N(c) = w_1·h_1 + w_2·h_2 + w_3·h_3 + w_4·h_4
Step S144: concatenate the aggregated neighborhood information h_N(c) of node c with the information h_c of node c, and perform convolution and activation operations on the concatenated information to obtain the output information h'_c of the current layer of the graph neural network.
Specifically, the formula is as follows:
h'_c = σ(W · CONCAT(h_N(c), h_c))
where h'_c represents the information output by node c from the current network layer, that is, the input of the next layer of the network, σ represents the activation function, W represents the convolution kernel parameters, CONCAT represents the concatenation operation, and h_c represents the input of node c to the current network layer.
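A minimal sketch of this per-node update (steps S141 to S144), written in plain NumPy under stated assumptions: the activation is taken as ReLU, and a shared linear map W stands in for the length-3 convolution kernels, whose exact arrangement the text does not fully specify:

```python
import numpy as np

rng = np.random.default_rng(0)

def update_node(h_c, neighbor_feats, neighbor_weights, W, n=4):
    """One node update per steps S141-S144.

    h_c              : (3,) current representation of node c
    neighbor_feats   : (m, 3) representations of the m first-order candidates
    neighbor_weights : (m,) edge weights to node c
    W                : (3, 6) shared parameters standing in for the length-3
                       convolution kernels (exact conv layout is an assumption)
    """
    m = len(neighbor_feats)
    idx = rng.choice(m, size=min(n, m), replace=False)                 # sample without replacement
    h_nbr = sum(neighbor_weights[i] * neighbor_feats[i] for i in idx)  # weighted aggregation
    concat = np.concatenate([h_nbr, h_c])                              # CONCAT(h_N(c), h_c), length 6
    return np.maximum(W @ concat, 0.0)                                 # sigma taken as ReLU (assumption)

# Toy example: a node with 5 candidate neighbours.
h_c = np.array([0.9, 0.05, 0.05])
feats = rng.random((5, 3))
weights = np.array([2.0, 1.0, 1.0, 3.0, 1.0])
W = rng.standard_normal((3, 6)) * 0.1
print(update_node(h_c, feats, weights, W))
```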
Step S150: based on the output result of each node in the last layer of the graph neural network, use a multi-layer perceptron to obtain the phenotype classification result corresponding to the genetic data.
Step S160: according to the phenotype classification result and genotype representation corresponding to the genetic data, use a loss function to perform supervised training on the model parameters of the graph neural network and/or the multi-layer perceptron. The loss function is mainly used to calculate the loss value based on the phenotype classification result and the genotype representation.
In an embodiment of the present invention, the above step S150 may specifically include the following step S151.
Step S151: concatenate the 3-dimensional vectors output by all nodes in the last layer of the graph neural network, input the concatenated vector into the multi-layer perceptron, and take the classification result output by the multi-layer perceptron as the phenotype classification result.
In one embodiment of the present invention, the 3-dimensional probability vectors output for all 39 nodes are concatenated, giving a concatenated probability vector of dimension 39×3 = 117; this vector is then input into a 2-layer fully connected network to obtain the classification result. The first layer of the fully connected network takes the concatenated probability vector of dimension 117 as input and outputs an intermediate probability vector of dimension 80; the second layer takes the intermediate probability vector of dimension 80 as input and outputs a final probability vector of dimension 20 as the phenotype classification result.
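The classification head of this embodiment (117-dimensional concatenated input, an intermediate layer of 80, an output of 20 = 4 phenotypes x 5 intervals) can be sketched as below; the activation functions and initialization are assumptions, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)

class MLPHead:
    """Two fully connected layers: 117 -> 80 -> 20 (= 4 phenotypes x 5 classes)."""
    def __init__(self, d_in=117, d_hidden=80, d_out=20):
        self.W1 = rng.standard_normal((d_hidden, d_in)) * 0.05
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.standard_normal((d_out, d_hidden)) * 0.05
        self.b2 = np.zeros(d_out)

    def __call__(self, x):
        h = np.maximum(self.W1 @ x + self.b1, 0.0)   # hidden layer, ReLU assumed
        logits = self.W2 @ h + self.b2
        # one score per (phenotype, interval) cell; sigmoid assumed here
        return 1.0 / (1.0 + np.exp(-logits))

node_outputs = rng.random((39, 3))   # 3-dim output of each of the 39 nodes
concat = node_outputs.reshape(-1)    # 39 x 3 = 117
print(MLPHead()(concat).shape)       # (20,)
```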
In an embodiment of the present invention, the above step S160 may specifically include the following steps S161 to S162.
In step S161: each of the s phenotypes is evenly divided into k intervals used as categories, yielding a genotype representation truth vector of dimension s×k (hereinafter also referred to simply as the truth vector).
In an embodiment of the present invention, each of 4 phenotypes is evenly divided into 5 intervals used as categories, so the dimension of the truth vector is 4×5=20. In this way, the dimensions of the truth vector correspond one-to-one to the dimensions of the phenotype classification result output by the multi-layer perceptron network. Taking plant height as an example, it can be divided by equal intervals into 5 classes: extremely short, short, normal, tall, and extremely tall. The other phenotypes can be handled analogously and are not described here.
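The construction of such a truth vector could be sketched as follows; the one-hot encoding, the equal-width binning and the example phenotype value ranges are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def phenotype_truth_vector(values, mins, maxs, k=5):
    """Bin each of the s phenotypes into k equal-width intervals and return a one-hot
    truth vector of dimension s * k (4 x 5 = 20 in the embodiment above)."""
    s = len(values)
    truth = np.zeros(s * k)
    for i, (v, lo, hi) in enumerate(zip(values, mins, maxs)):
        # equal-width binning; values outside [lo, hi] fall into the first or last class
        idx = int(np.clip((v - lo) / (hi - lo) * k, 0, k - 1))
        truth[i * k + idx] = 1.0
    return truth

# e.g. plant height binned into: extremely short, short, normal, tall, extremely tall
truth = phenotype_truth_vector(values=[92.0, 35.0, 18.0, 2.1],
                               mins=[40.0, 10.0, 5.0, 0.5],
                               maxs=[160.0, 60.0, 40.0, 4.0])
```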
In step S162: using the loss function, multi-phenotype supervised training is performed based on the phenotype classification result and the phenotype truth vector.
According to an embodiment of the present invention, the loss function used in the supervised training can be the focal loss (Focal Loss), whose classification loss can be computed as:

L_cls = − Σ_{x,y} { α · (1 − p_{x,y})^γ · log(p_{x,y}),       if y*_{x,y} = 1
                    (1 − α) · (p_{x,y})^γ · log(1 − p_{x,y}),  if y*_{x,y} = 0 }

where p_{x,y} denotes the confidence of the phenotype classification result at abscissa x and ordinate y of the feature map, and y*_{x,y} denotes the true class label of the target at that position in the phenotype truth vector, with 1 denoting a positive sample and 0 a negative sample; γ is a value greater than 0, α is a decimal in [0, 1], and both γ and α are fixed values that do not participate in training. In an embodiment of the present invention, the training works best with α set to 0.1 and γ set to 2. For example, Stochastic Gradient Descent (SGD) can be used as the optimizer, training on 4 Graphics Processing Units (GPUs) with a batch size of 16 and 50k training steps; the initial learning rate is 0.01 and is reduced by a factor of 10 at 20k and 40k steps.
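As a rough, non-authoritative sketch of this training setup, assuming the standard focal-loss weighting (α for positive samples, 1 − α for negative samples) and treating the model and the momentum value as placeholders:

```python
import torch
import torch.nn as nn
import torch.optim as optim

def focal_loss(p, y, alpha=0.1, gamma=2.0, eps=1e-8):
    """Focal-loss sketch: p are predicted confidences in [0, 1], y are 0/1 truth labels."""
    pos = -alpha * (1.0 - p) ** gamma * torch.log(p + eps)        # positive-sample term
    neg = -(1.0 - alpha) * p ** gamma * torch.log(1.0 - p + eps)  # negative-sample term
    return torch.where(y == 1, pos, neg).sum()

model = nn.Linear(117, 20)   # placeholder for the full GNN + MLP described above
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# reduce the learning rate by a factor of 10 at 20k and 40k of the 50k training steps
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20000, 40000], gamma=0.1)
```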
According to an embodiment of the present invention, a gene phenotype prediction method based on a graph neural network is also provided. As shown in Fig. 2, the gene phenotype prediction method may include the following steps S210 to S220.
In step S210, for the genetic data to be classified, the genetic data is encoded based on the probability values of locus detection to obtain the gene loci and genotype representations corresponding to the genetic data to be classified. The processing in step S210 is essentially the same as in the aforementioned step S130; reference may be made to the description above, which is not repeated here.
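One way this encoding could look in code, assuming the PL values are Phred-scaled genotype likelihoods as in standard variant-calling output (the conversion P = 10^(−PL/10) and the normalization are assumptions, and encode_locus is a hypothetical helper name):

```python
import numpy as np

def encode_locus(pl_values):
    """Convert locus-detection probability values PL for genotypes 0/0, 0/1 and 1/1 into
    the 3-dim genotype representation [a, b, c]; an undetected locus becomes [0, 0, 0]."""
    if pl_values is None:                                      # locus not detected
        return np.zeros(3)
    p = 10.0 ** (-np.asarray(pl_values, dtype=float) / 10.0)   # assumes Phred-scaled PL values
    return p / p.sum()                                         # normalization is an assumption

print(encode_locus([0, 30, 300]))   # strongly supports genotype 0/0
print(encode_locus(None))           # [0. 0. 0.]
```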
In step S220, the encoded genetic data to be classified is input into the trained graph neural network and multi-layer perceptron to obtain the phenotype result corresponding to the genetic data to be classified. The graph neural network and the multi-layer perceptron may be the gene phenotype prediction network, obtained with the aforementioned training method, for the species to which the genetic data to be classified belongs; the specific training procedure is not repeated here.
Taking a simplified graph neural network as an example, assuming the species corresponding to the genetic data to be classified has 5 gene loci, each layer of the graph neural network contains 5 nodes. As shown in Fig. 3, after the input genetic data to be classified has passed through the convolution and activation operations of the multi-layer graph neural network 310, the output of the graph neural network is fed into the multi-layer perceptron 320, which outputs the final classification result for the genetic data.
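For orientation only, the simplified 5-locus example of Fig. 3 could be exercised end to end roughly as follows, reusing the hypothetical encode_locus and GraphLayer sketches above; the 3-layer depth, the random edge weights and the small stand-in classification head are assumptions:

```python
import numpy as np
import torch
import torch.nn as nn

pl_per_locus = [[0, 30, 300], [50, 0, 50], None, [200, 40, 0], [0, 10, 100]]
x = torch.from_numpy(np.stack([encode_locus(pl) for pl in pl_per_locus])).float()  # (5, 3)

adj_weights = torch.rand(5, 5)              # placeholder locus-association edge weights
gnn = nn.ModuleList([GraphLayer() for _ in range(3)])
head = nn.Linear(5 * 3, 20)                 # small stand-in for the multi-layer perceptron 320

with torch.no_grad():
    for layer in gnn:                       # graph neural network 310: layer-by-layer update
        x = layer(x, adj_weights)
    phenotype_scores = head(x.flatten())    # final classification result for this sample
```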
The implementation of this part is similar to that of the above method embodiment and is not repeated here.
A training apparatus for a graph neural network used for gene phenotype prediction includes a graph neural network construction module, a data acquisition module, a pre-encoding module, a genetic data input module and a classification module. The graph neural network construction module, for a specific species, constructs a graph neural network comprising a multi-layer network according to the correlations between the gene loci of the species and its phenotypes, wherein in each layer of the graph neural network a node represents a gene locus, an edge indicates that two gene loci are related to the same phenotype, and the weight of an edge reflects the degree of association between the gene loci. The data acquisition module collects genetic data and phenotypic data of multiple samples of the species as training data. The pre-encoding module encodes the genetic data of the training data based on the probability values of locus detection to obtain the gene loci and genotype representations corresponding to the genetic data. The genetic data input module inputs the encoded genetic data into the graph neural network so that it passes through each layer of the graph neural network in turn, wherein each layer of the graph neural network uses a one-dimensional convolution kernel of length 3, and the convolution kernel is shared between neighborhoods. The classification module, based on the output result of each node in the last layer of the graph neural network, uses a multi-layer perceptron to obtain the phenotype classification result corresponding to the genetic data, so that the model parameters of the graph neural network and/or the multi-layer perceptron can be supervised-trained with a loss function according to the phenotype classification result.
The implementation of this part is similar to that of the above method embodiment and is not repeated here.
A gene phenotype prediction apparatus based on a graph neural network, used to implement the above gene phenotype prediction and classification based on a graph neural network, may include the above pre-encoding module and a gene phenotype prediction network obtained by training with the above training method and/or training apparatus. Specifically, after the genetic data to be classified has been encoded by the pre-encoding module, the encoded genetic data to be classified is input into the trained graph neural network and multi-layer perceptron to obtain the phenotype result corresponding to the genetic data to be classified. The graph neural network and the multi-layer perceptron are the gene phenotype prediction network obtained by training with the aforementioned training method and/or training apparatus for the species to which the genetic data to be classified belongs.
The implementation of this part is similar to that of the above method embodiment and is not repeated here.
Corresponding to the aforementioned embodiment of the gene phenotype prediction method based on a graph neural network, the present invention also provides an embodiment of a gene phenotype prediction device based on a graph neural network.
Referring to Fig. 4, an embodiment of the present invention provides a device for gene phenotype prediction based on a graph neural network, which includes a memory (specifically, it may include a non-volatile storage 430 and/or a memory 440) and one or more processors 410. Executable code is stored in the memories 430 and 440, and when the one or more processors 410 execute the executable code, they implement the gene phenotype prediction method based on a graph neural network of the above embodiments. As shown in Fig. 4, the device further includes an internal bus 420 connecting the processor 410 and the memories 430, 440. In addition, the device may further include a network interface 450 through which the device communicates with the outside.
The embodiment of the gene phenotype prediction device based on a graph neural network of the present invention can be applied to any device with data processing capability, which may be a device or apparatus such as a computer. The apparatus embodiments may be implemented by software, or by hardware or a combination of software and hardware. Taking a software implementation as an example, the apparatus, in the logical sense, is formed by the processor of the device with data processing capability in which it resides reading the corresponding computer program instructions from the non-volatile memory into memory and running them. At the hardware level, Fig. 4 shows a hardware structure diagram of a device with data processing capability in which the gene phenotype prediction device based on a graph neural network of the present invention resides; in addition to the processor, memory, network interface and non-volatile memory shown in Fig. 4, the device with data processing capability in which the apparatus of the embodiment resides may also include other hardware according to its actual functions, which is not described further here.
For the implementation of the functions and roles of the units in the above apparatus, reference may be made to the implementation of the corresponding steps in the above method, which is not repeated here.
Since the apparatus embodiments essentially correspond to the method embodiments, reference may be made to the relevant parts of the description of the method embodiments. The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. A person of ordinary skill in the art can understand and implement it without creative effort.
An embodiment of the present invention also provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the gene phenotype prediction method based on a graph neural network of the above embodiments is implemented.
The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of any device with data processing capability, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card or a flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of a device with data processing capability. The computer-readable storage medium is used to store the computer program and other programs and data required by the device with data processing capability, and may also be used to temporarily store data that has been output or is to be output.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. A method of training a graph neural network for gene phenotype prediction, comprising the following steps:
    for a specific species, constructing a graph neural network comprising a multi-layer network according to correlations between gene loci of the species and phenotypes, wherein in each layer of the graph neural network, a node represents a gene locus, an edge indicates that two gene loci are related to the same phenotype, and the weight of an edge reflects the degree of association between the gene loci;
    collecting genetic data and phenotypic data of multiple samples of the species as training data;
    for the training data, encoding the genetic data based on probability values of locus detection to obtain gene loci and genotype representations corresponding to the genetic data;
    inputting the encoded genetic data into the graph neural network so as to pass through each layer of the graph neural network in turn, wherein each layer of the graph neural network uses a one-dimensional convolution kernel of length 3, and the convolution kernel is shared between neighborhoods;
    based on the output result of each node in the last layer of the graph neural network, obtaining the phenotype classification result corresponding to the genetic data by means of a multi-layer perceptron;
    according to the phenotype classification result and the genotype representation corresponding to the genetic data, performing supervised training of model parameters of the graph neural network and/or the multi-layer perceptron with a loss function.
  2. The method according to claim 1, wherein encoding the genetic data based on probability values of gene-locus detection comprises:
    converting the probability values PL, obtained from locus detection, that the genotype is 0/0, 0/1 or 1/1 into the probability P supporting each genotype according to the following formula:

    P = 10^(−PL/10)
    for each of the gene loci, forming the obtained probabilities P of the gene locus into a 3-dimensional vector [a, b, c] as the genotype representation corresponding to the gene locus, wherein a, b and c respectively denote the probabilities that the genotype of the gene locus is 0/0, 0/1 and 1/1,
    for an undetected gene locus, representing its genotype representation by the vector [0, 0, 0].
  3. The method according to claim 1, wherein, when the genetic data passes through each layer of the graph neural network in turn, node neighborhoods are selected by uniform sampling, and each node is updated through the weights of the neighborhood nodes and the convolution kernel parameters.
  4. The method according to claim 3, wherein selecting node neighborhoods by uniform sampling and updating each node through the weights of the neighborhood nodes and the convolution kernel parameters comprises the following steps:
    for each node c in the current layer of the graph neural network,
    constructing m candidate nodes from the first-order adjacent nodes of the node c, m being an integer greater than 0;
    sampling n nodes without replacement from the m candidate nodes of the node c as the neighborhood nodes of the node c, and, when m is less than n, sampling all m candidate nodes as the neighborhood nodes of the node c;
    aggregating the information of all neighborhood nodes of the node c to obtain the neighborhood information h_N(c) of the node c;
    performing convolution and activation operations on the information obtained by concatenating the neighborhood information h_N(c) of the node c with the information h_c of the node c, to obtain the output information h'_c of the node c in the current layer of the graph neural network as the input of the next layer of the graph neural network,
    wherein the formula for aggregating the information of all neighborhood nodes of the node c is as follows:

    h_N(c) = Σ_{i=1}^{n} w_i · h_i
    wherein h_i denotes the information of the i-th neighborhood node of the node c,
    w_i denotes the weight of the i-th neighborhood node of the node c,
    wherein the specific formula for performing the convolution and activation operations is as follows:

    h'_c = σ(W · CONCAT(h_N(c), h_c))
    wherein h'_c denotes the information output from the node c in the current layer of the graph neural network,
    σ denotes the activation function,
    W denotes the convolution kernel parameters,
    h_c denotes the information of the node c input to the current network layer of the graph neural network.
  5. The method according to claim 1, wherein obtaining the phenotype classification result corresponding to the genetic data by means of a multi-layer perceptron based on the output result of each node in the last layer of the graph neural network comprises the following steps:
    concatenating the 3-dimensional vectors output by all nodes in the last layer of the graph neural network to obtain a concatenated vector;
    inputting the concatenated vector into the multi-layer perceptron, and taking the classification result output by the multi-layer perceptron as the phenotype classification result corresponding to the genetic data.
  6. The method according to claim 5, wherein performing supervised training of the model parameters of the graph neural network and/or the multi-layer perceptron with a loss function according to the phenotype classification result and the genotype representation corresponding to the genetic data comprises the following steps:
    evenly dividing each of s phenotypes into k intervals used as categories, to obtain a genotype representation truth vector of dimension s×k, wherein the dimension s×k is consistent with the dimension of the phenotype classification result;
    performing multi-phenotype supervised training with the loss function based on the phenotype classification result and the genotype representation truth vector of the phenotypes.
  7. The method according to claim 6, wherein the loss function is the focal loss (Focal Loss), and the formula for computing the classification loss based on the phenotype classification result and the genotype representation truth vector of the phenotypes is:

    L_cls = − Σ_{x,y} { α · (1 − p_{x,y})^γ · log(p_{x,y}),       if y*_{x,y} = 1
                        (1 − α) · (p_{x,y})^γ · log(1 − p_{x,y}),  if y*_{x,y} = 0 }
    wherein p_{x,y} denotes the confidence of the phenotype classification result at abscissa x and ordinate y of the feature map,
    y*_{x,y} denotes the true class label, at that position, of the genotype representation truth vector of the phenotype,
    1 denotes a positive sample and 0 denotes a negative sample;
    γ is a value greater than 0, α is a decimal in [0, 1], and both γ and α are fixed values that do not participate in training.
  8. A gene phenotype prediction method based on a graph neural network, comprising the following steps:
    for genetic data to be classified, encoding the genetic data based on probability values of locus detection to obtain gene loci and genotype representations corresponding to the genetic data to be classified;
    inputting the encoded genetic data to be classified into a trained graph neural network and multi-layer perceptron to obtain the phenotype result corresponding to the genetic data to be classified, wherein the graph neural network and the multi-layer perceptron are a gene phenotype prediction network, trained with the method according to any one of claims 1 to 7, for the species to which the genetic data to be classified belongs.
  9. An apparatus for training a graph neural network for gene phenotype prediction, comprising a processor and a memory, wherein a program is stored on the memory, and when the program is executed by the processor, the steps of the method according to any one of claims 1 to 7 are implemented.
  10. A gene phenotype prediction apparatus based on a graph neural network, comprising a processor and a memory, wherein a program is stored on the memory, and when the program is executed by the processor, the steps of the method according to claim 8 are implemented.
PCT/CN2023/095224 2022-10-11 2023-05-19 Genophenotypic prediction based on graph neural network WO2023217290A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2023543455A JP2024524795A (en) 2022-10-11 2023-05-19 Gene phenotype prediction based on graph neural networks

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211238697.7 2022-10-11
CN202211238697.7A CN115331732B (en) 2022-10-11 2022-10-11 Gene phenotype training and predicting method and device based on graph neural network

Publications (1)

Publication Number Publication Date
WO2023217290A1 true WO2023217290A1 (en) 2023-11-16

Family

ID=83915021

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095224 WO2023217290A1 (en) 2022-10-11 2023-05-19 Genophenotypic prediction based on graph neural network

Country Status (3)

Country Link
JP (1) JP2024524795A (en)
CN (1) CN115331732B (en)
WO (1) WO2023217290A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331732B (en) * 2022-10-11 2023-03-28 之江实验室 Gene phenotype training and predicting method and device based on graph neural network
CN116072214B (en) * 2023-03-06 2023-07-11 之江实验室 Phenotype intelligent prediction and training method and device based on gene significance enhancement
CN116580767B (en) * 2023-04-26 2024-03-12 之江实验室 Gene phenotype prediction method and system based on self-supervision and transducer
CN117198406B (en) * 2023-09-21 2024-06-11 亦康(北京)医药科技有限公司 Feature screening method, system, electronic equipment and medium
CN116959561B (en) * 2023-09-21 2023-12-19 北京科技大学 Gene interaction prediction method and device based on neural network model
CN116992919B (en) * 2023-09-28 2023-12-19 之江实验室 Plant phenotype prediction method and device based on multiple groups of science

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096327A (en) * 2016-06-07 2016-11-09 广州麦仑信息科技有限公司 Gene character recognition methods based on the study of the Torch supervised degree of depth
CN113593635A (en) * 2021-08-06 2021-11-02 上海市农业科学院 Corn phenotype prediction method and system
CN114333986A (en) * 2021-09-06 2022-04-12 腾讯科技(深圳)有限公司 Method and device for model training, drug screening and affinity prediction
CN114649097A (en) * 2022-03-04 2022-06-21 广州中医药大学(广州中医药研究院) Medicine efficacy prediction method based on graph neural network and omics information
CN114765063A (en) * 2021-01-12 2022-07-19 上海交通大学 Protein and nucleic acid binding site prediction method based on graph neural network characterization
CN114783524A (en) * 2022-06-17 2022-07-22 之江实验室 Path abnormity detection system based on self-adaptive resampling depth encoder network
US20220301658A1 (en) * 2021-03-19 2022-09-22 X Development Llc Machine learning driven gene discovery and gene editing in plants
CN115331732A (en) * 2022-10-11 2022-11-11 之江实验室 Gene phenotype training and predicting method and device based on graph neural network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5438644A (en) * 1991-09-09 1995-08-01 University Of Florida Translation of a neural network into a rule-based expert system
CN108388768A (en) * 2018-02-08 2018-08-10 南京恺尔生物科技有限公司 Utilize the biological nature prediction technique for the neural network model that biological knowledge is built
AU2019403566A1 (en) * 2018-12-21 2021-08-12 TeselaGen Biotechnology Inc. Method, apparatus, and computer-readable medium for efficiently optimizing a phenotype with a specialized prediction model
CN110010201A (en) * 2019-04-16 2019-07-12 山东农业大学 A kind of site recognition methods of RNA alternative splicing and system
CN114360654A (en) * 2022-01-05 2022-04-15 重庆邮电大学 Construction method of graph neural network data set based on gene expression
CN114637923B (en) * 2022-05-19 2022-09-02 之江实验室 Data information recommendation method and device based on hierarchical attention-graph neural network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096327A (en) * 2016-06-07 2016-11-09 广州麦仑信息科技有限公司 Gene character recognition methods based on the study of the Torch supervised degree of depth
CN114765063A (en) * 2021-01-12 2022-07-19 上海交通大学 Protein and nucleic acid binding site prediction method based on graph neural network characterization
US20220301658A1 (en) * 2021-03-19 2022-09-22 X Development Llc Machine learning driven gene discovery and gene editing in plants
CN113593635A (en) * 2021-08-06 2021-11-02 上海市农业科学院 Corn phenotype prediction method and system
CN114333986A (en) * 2021-09-06 2022-04-12 腾讯科技(深圳)有限公司 Method and device for model training, drug screening and affinity prediction
CN114649097A (en) * 2022-03-04 2022-06-21 广州中医药大学(广州中医药研究院) Medicine efficacy prediction method based on graph neural network and omics information
CN114783524A (en) * 2022-06-17 2022-07-22 之江实验室 Path abnormity detection system based on self-adaptive resampling depth encoder network
CN115331732A (en) * 2022-10-11 2022-11-11 之江实验室 Gene phenotype training and predicting method and device based on graph neural network

Also Published As

Publication number Publication date
CN115331732A (en) 2022-11-11
CN115331732B (en) 2023-03-28
JP2024524795A (en) 2024-07-09

Similar Documents

Publication Publication Date Title
WO2023217290A1 (en) Genophenotypic prediction based on graph neural network
CN112232413B (en) High-dimensional data feature selection method based on graph neural network and spectral clustering
US20210081798A1 (en) Neural network method and apparatus
JP2010157214A (en) Gene clustering program, gene clustering method, and gene cluster analyzing device
CN111899882A (en) Method and system for predicting cancer
WO2023124342A1 (en) Low-cost automatic neural architecture search method for image classification
Lin et al. A novel chromosome cluster types identification method using ResNeXt WSL model
CN115985503B (en) Cancer prediction system based on ensemble learning
CN113627471A (en) Data classification method, system, equipment and information data processing terminal
CN114154557A (en) Cancer tissue classification method, apparatus, electronic device, and storage medium
CN109063418A (en) Determination method, apparatus, equipment and the readable storage medium storing program for executing of disease forecasting classifier
CN116798652A (en) Anticancer drug response prediction method based on multitasking learning
CN117591953A (en) Cancer classification method and system based on multiple groups of study data and electronic equipment
CN115812210A (en) Method and apparatus for enhancing performance of machine learning classification tasks
CN117611974B (en) Image recognition method and system based on searching of multiple group alternative evolutionary neural structures
CN113192556B (en) Genotype and phenotype association analysis method in multigroup chemical data based on small sample
CN105913085A (en) Tensor model-based multi-source data classification optimizing method and system
CN112687329A (en) Cancer prediction system based on non-cancer tissue mutation information and construction method thereof
CN113838519B (en) Gene selection method and system based on adaptive gene interaction regularization elastic network model
CN115661498A (en) Self-optimization single cell clustering method
CN109308936B (en) Grain crop production area identification method, grain crop production area identification device and terminal identification equipment
CN107766887A (en) A kind of local weighted deficiency of data mixes clustering method
CN114334168A (en) Feature selection algorithm of particle swarm hybrid optimization combined with collaborative learning strategy
CN108304546B (en) Medical image retrieval method based on content similarity and Softmax classifier
Cudic et al. Prediction of sorghum bicolor genotype from in-situ images using autoencoder-identified SNPs

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2023543455

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23803058

Country of ref document: EP

Kind code of ref document: A1