CN116992919A - Plant phenotype prediction method and device based on multi-omics - Google Patents

Info

Publication number
CN116992919A
Authority
CN
China
Prior art keywords
phenotype
data
sample
features
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311269915.8A
Other languages
Chinese (zh)
Other versions
CN116992919B (en
Inventor
吴翠玲
徐晓刚
冯献忠
于慧
王军
韩强
何鹏飞
曹卫强
马寅星
李萧缘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Institute of Geography and Agroecology of CAS
Zhejiang Lab
Original Assignee
Northeast Institute of Geography and Agroecology of CAS
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeast Institute of Geography and Agroecology of CAS and Zhejiang Lab
Priority to CN202311269915.8A
Publication of CN116992919A
Application granted
Publication of CN116992919B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Biotechnology (AREA)
  • Databases & Information Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a plant phenotype prediction method and device based on multi-omics. The method is built on a graph convolutional neural network: multi-omics data such as the genome, transcriptome and metabolome serve as graph nodes, and the degree of association between different omics serves as the graph edges, so that graph structure data are constructed for each plant. The graph structure data are input into the graph convolutional neural network to extract node features, the node features are updated through a Transformer network, and the node features are then concatenated and input into fully connected layers to output a phenotype prediction value, so that phenotype prediction is achieved by fusing multi-omics features over the whole graph structure. The invention creatively combines a graph convolutional neural network with a Transformer network to predict phenotype from genes, and fuses multi-omics data through a multi-omics graph structure, which alleviates, to a certain extent, the inaccuracy of phenotype prediction from a single omics and improves the phenotype prediction effect.

Description

Plant phenotype prediction method and device based on multi-omics
Technical Field
The invention relates to the fields of deep learning and bioinformatics, and in particular to a plant phenotype prediction method and device based on multi-omics.
Background
Gene-to-phenotype prediction plays an important role in disease prediction, pathogenic gene mining and crop yield prediction, and is a core problem in genetics. Current phenotype prediction methods predict phenotype from the genome alone. However, the expression of a phenotype is related not only to the genome but also to intermediate omics such as the transcriptome and the metabolome; predicting phenotype directly from genes spans too large a gap, so prediction performance is poor. Associating multiple omics for phenotype prediction can make the prediction more accurate. A graph structure can associate multiple types of nodes, and a graph neural network can mine the association relations among those nodes, so that multi-omics data are fused for phenotype prediction. In recent years the Transformer has been widely applied in computer vision, where its attention mechanism learns association relations among different local features. There is related work applying the Transformer to protein coding region prediction, but its application to phenotype prediction has rarely been studied. The graph neural network mines association relations among different nodes and the Transformer mines association relations among different local features, and the two can be combined to enhance feature mining.
Therefore, a phenotype prediction method combining a graph convolutional neural network and a Transformer is proposed, which fuses multi-omics data for phenotype prediction and addresses the poor performance of phenotype prediction from a single omics.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a plant phenotype prediction method and device based on multi-omics. The invention constructs multi-omics graph structure data and uses a graph convolutional neural network and a Transformer network to mine the association relations among the multiple omics, so as to predict phenotype from multi-omics data.
The aim of the invention is achieved by the following technical scheme: an embodiment of the present invention provides a multi-omics-based plant phenotype prediction method, including the following steps:
(1) Acquiring multi-omics data and phenotype data of a target species, performing quality control and dimension reduction on the multi-omics data, and handling outliers and missing values in the phenotype data;
(2) Normalizing the multi-omics features obtained in step (1), calculating a first similarity between the omics features of each sample of the target species, and constructing graph structure data for each sample using the first similarity;
(3) Inputting the graph structure data of each sample obtained in step (2) into a three-layer graph neural network to extract the features of each node;
(4) Concatenating the features of each node of each sample extracted in step (3) in multi-omics order, and inputting the concatenated features into two first fully connected layers to obtain a first phenotype prediction value; calculating a first loss from the first phenotype prediction value and the corresponding real phenotype data;
(5) Inputting a global token embedding feature, together with the features of each node obtained in step (3) taken as local token embedding features, into a Transformer network to obtain an updated global token embedding feature and updated local token embedding features of each node; calculating a second similarity between the global token embedding feature and the local token embedding feature of each node; selecting the local token embedding features of the two nodes with the largest second similarity, concatenating them with the global token embedding feature, and inputting the result into two second fully connected layers to obtain a second phenotype prediction value; calculating a second loss from the second phenotype prediction value and the corresponding real phenotype data;
(6) Repeating steps (3) to (5) for training, and, according to the total loss obtained by adding the first loss and the second loss, adjusting the network parameters of the graph neural network, the first fully connected layers, the Transformer network and the second fully connected layers by backpropagation with stochastic gradient descent, so as to obtain the trained graph neural network, first fully connected layers, Transformer network and second fully connected layers;
(7) Obtaining a first phenotype prediction value through the trained graph neural network and first fully connected layers, obtaining a second phenotype prediction value through the trained graph neural network, Transformer network and second fully connected layers, and taking the average of the first phenotype prediction value and the second phenotype prediction value as the final predicted phenotype.
Further, the multi-omics include a genome, a transcriptome and a metabolome;
the graph neural network includes a graph convolution layer and a ReLU activation layer.
Further, in step (1), performing quality control and dimension reduction on the multi-omics data and handling outliers and missing values in the phenotype data specifically comprises the following sub-steps:
(1.1) removing missing values and outliers from the phenotype data of each sample of the target species;
(1.2) for the multi-omics data of each sample of the target species, deleting positions whose missing rate exceeds 90%, and deleting loci of the genome data whose minor allele frequency is less than 5%, so as to complete quality control filtering of the multi-omics data;
(1.3) first encoding the multi-omics data filtered in step (1.2), wherein the genome data are encoded as -1, 0 and 1 for the three states homozygous, heterozygous and homozygous variant, and the other omics data are not encoded and keep their measured values; then, for each position of each omics, calculating the Pearson correlation coefficient between the encoded feature over all samples and the corresponding phenotype data to obtain the degree of association between that position and the phenotype data, sorting the positions by degree of association in descending order, and selecting the first L positions of each omics as its dimension-reduced omics feature, thereby obtaining the dimension-reduced multi-omics features.
Further, the Pearson correlation coefficient is calculated as:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $\rho_{X,Y}$ is the Pearson correlation coefficient, $X$ and $Y$ are the encoded features and the corresponding phenotype data respectively, $\operatorname{cov}$ denotes covariance, and $\sigma$ denotes standard deviation.
Further, step (2) includes the following sub-steps:
(2.1) normalizing each dimension-reduced omics feature obtained in step (1.3), specifically as follows: assuming there are M samples in total and each sample has N omics, the j-th dimension-reduced omics feature of the i-th sample is denoted $x_i^j$, $i = 1, \dots, M$, $j = 1, \dots, N$, and normalization is performed according to:
$$\hat{x}_i^j = \frac{x_i^j - \mu_j}{\sigma_j}$$
where $\hat{x}_i^j$ denotes the normalized feature of the j-th omics of the i-th sample, $\mu_j$ is the mean over all samples of omics j, and $\sigma_j$ is the standard deviation over all samples of omics j;
(2.2) for each sample, taking the cosine similarity between the normalized features of different omics obtained in step (2.1) as the first similarity, calculated as:
$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
where $\cos(A, B)$ is the cosine similarity, $A$ and $B$ denote the normalized features of two different omics, $A \cdot B$ is their inner product, and $\lVert A \rVert$ denotes the norm of the normalized omics feature $A$;
(2.3) for each sample, constructing the graph structure data of the sample by taking its different omics as graph nodes and the first similarity between different omics features obtained in step (2.2) as the edge weights of the graph.
Further, the first loss comprises a prediction regression loss and a Pearson correlation coefficient loss. In its expression, Loss1 denotes the first loss, $M$ is the number of samples, $y_i$ is the true phenotype data of sample $i$, $\hat{y}_i$ is the first phenotype prediction value, $\bar{y}$ is the mean of the true phenotype data over all samples, $\bar{\hat{y}}$ is the mean of the first phenotype prediction values, and $\lambda$ is the adjustment coefficient between the prediction regression loss and the Pearson correlation coefficient loss.
Further, step (5) includes the following sub-steps:
(5.1) initializing a global token embedding feature with random normally distributed parameters, wherein the value of each dimension of the global token embedding feature is a random value drawn from a normal distribution, and its feature length is the same as the feature length of each node extracted in step (3);
(5.2) for each sample, calculating a first attention matrix of the sample from the features of all nodes extracted in step (3);
(5.3) taking the features of each node extracted in step (3) as local token embedding features, and inputting the global token embedding feature obtained in step (5.1) together with the local token embedding features of each node into a Transformer network, wherein the Transformer network comprises a self-attention module and a feedforward neural network module; a second attention matrix for the sample is calculated by the self-attention module and combined with the first attention matrix obtained in step (5.2) to update the features, and the updated global token embedding feature and local token embedding features of each node are obtained through the feedforward neural network module;
(5.4) calculating the cosine similarity between the updated global token embedding feature obtained in step (5.3) and the local token embedding feature of each node as the second similarity, selecting the local token embedding features of the two nodes with the largest second similarity and concatenating them with the global token embedding feature to obtain a second concatenated feature, and inputting the second concatenated feature into two second fully connected layers to obtain a second phenotype prediction value; and calculating a second loss from the second phenotype prediction value and the corresponding real phenotype data.
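The selection in sub-step (5.4) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the concatenation order (selected local features first, then the global feature) is an assumption, since the text does not fix the order.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_and_concat(global_token, local_tokens, k=2):
    """Pick the k local tokens most similar to the global token and concatenate.

    Returns (indices of selected nodes, concatenated feature).
    Assumed order: selected local tokens, then the global token.
    """
    sims = [cosine_similarity(global_token, t) for t in local_tokens]
    top = sorted(range(len(local_tokens)), key=lambda j: sims[j], reverse=True)[:k]
    concat = []
    for j in top:
        concat.extend(local_tokens[j])
    concat.extend(global_token)
    return top, concat

# Toy example with 4 nodes and 2-dimensional features
nodes = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
chosen, feature = select_and_concat([1.0, 0.0], nodes)
```

With 1000-dimensional node features, the concatenated feature under this assumption would have 3000 dimensions.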
Further, the second loss is a prediction regression loss. In its expression, Loss2 denotes the second loss, $M$ is the number of samples, $y_i$ is the true phenotype data of sample $i$, and $\hat{y}_i$ is the second phenotype prediction value predicted through the Transformer network.
A second aspect of embodiments of the present invention provides a multi-omics-based plant phenotype prediction device comprising one or more processors and a memory coupled to the processors; the memory is used for storing program data, and the processors are used for executing the program data to implement the multi-omics-based plant phenotype prediction method described above.
A third aspect of embodiments of the present invention provides a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the multi-omics-based plant phenotype prediction method described above.
The beneficial effects of the invention are as follows: the multiple omics serve as nodes and a graph structure is constructed from the correlations among them, so that the graph neural network performs a regression task over the whole graph; at the same time, the graph neural network and the Transformer are combined to mine the correlations among different omics, the attention of the graph nodes is integrated into the attention update of the Transformer, and the omics features that contribute more to phenotype prediction are dynamically selected according to their degree of association with the global feature. Accurate phenotype prediction is thereby achieved, and the low accuracy of phenotype prediction from a single omics is addressed. By combining a graph neural network with a Transformer, the multi-omics-based phenotype prediction method helps improve the phenotype prediction effect.
Drawings
FIG. 1 is a detailed structural diagram of the multi-omics-based phenotype prediction network of the present invention;
FIG. 2 is a flow chart of the multi-omics-based plant phenotype prediction method of the present invention;
FIG. 3 is a test flow chart of the multi-omics-based plant phenotype prediction method of the present invention;
FIG. 4 is a comparison of the experimental results of the present invention;
FIG. 5 is a schematic structural diagram of the multi-omics-based plant phenotype prediction device of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the invention. The word "if" as used herein may be interpreted as "when", "upon" or "in response to determining", depending on the context.
The present invention will be described in detail with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other without conflict.
Referring to FIG. 2, the multi-omics-based plant phenotype prediction method of the present invention specifically comprises the following steps:
(1) Acquiring multi-omics data and phenotype data of the target species, performing quality control and dimension reduction on the multi-omics data, and handling outliers and missing values in the phenotype data.
Further, the multi-omics include, but are not limited to, the genome, transcriptome, metabolome, and the like.
In this embodiment, an open-source tomato dataset is used, from which the multi-omics data and phenotype data of the target species can be obtained. The open-source tomato dataset includes 332 tomato accessions, each comprising 6971059 SNPs (single nucleotide polymorphisms), 657549 InDels (insertions and deletions), 54838 SVs (structural variations), and RNA-Seq gene expression data of about 17 Mo Zuo.
Taking the multi-omics data and phenotype data of the target species obtained from the open-source tomato dataset as an example, performing quality control and dimension reduction on the multi-omics data and handling outliers and missing values in the phenotype data specifically comprises the following sub-steps:
(1.1) Removing missing values and outliers from the phenotype data of each sample of the target species in the tomato dataset.
(1.2) For the multi-omics data of each sample of the target species in the tomato dataset, deleting positions whose missing rate exceeds 90%, and deleting loci of the genome data whose minor allele frequency is less than 5%, so as to complete quality control filtering of the multi-omics data.
(1.3) First encoding the multi-omics data filtered in step (1.2), wherein the genome data are encoded as -1, 0 and 1 for the three states homozygous, heterozygous and homozygous variant, and the other omics data are not encoded and keep their measured values; then, for each position of each omics, calculating the Pearson correlation coefficient between the encoded feature over all samples and the corresponding phenotype data to obtain the degree of association between that position and the phenotype data, sorting the positions by degree of association in descending order, and selecting the first L positions of each omics as its dimension-reduced omics feature, thereby obtaining the dimension-reduced multi-omics features.
It should be understood that the tomato data of this embodiment contain four omics, so each plant sample yields a multi-omics feature of shape 4×L, which is the dimension-reduced multi-omics feature. The value of L may be selected according to actual needs; for example, in this embodiment the first L = 1000 positions of each omics are selected as its dimension-reduced omics feature.
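The quality control filtering of sub-step (1.2) can be sketched in Python as follows. This is an illustrative sketch only: the function names, the use of `None` for a missing call, and the reading of -1/0/1 as homozygous reference, heterozygous and homozygous variant are assumptions made for the example.

```python
def missing_rate(column):
    """Fraction of samples with a missing value (None) at one position."""
    return sum(v is None for v in column) / len(column)

def minor_allele_frequency(genotypes):
    """MAF for genotypes coded -1 (hom. ref), 0 (het), 1 (hom. variant)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    # g + 1 is the number of variant alleles carried: 0, 1 or 2
    variant_alleles = sum(g + 1 for g in called)
    freq = variant_alleles / (2 * len(called))
    return min(freq, 1 - freq)

def quality_filter(matrix, max_missing=0.9, min_maf=0.05, is_genome=True):
    """Return the indices of positions (columns) that pass quality control.

    matrix: list of samples, each a list of per-position values.
    Positions with missing rate above max_missing are dropped; for genome
    data, positions with MAF below min_maf are dropped as well.
    """
    kept = []
    for p in range(len(matrix[0])):
        column = [row[p] for row in matrix]
        if missing_rate(column) > max_missing:
            continue
        if is_genome and minor_allele_frequency(column) < min_maf:
            continue
        kept.append(p)
    return kept

# Toy genotype matrix: 4 samples x 3 positions
genotypes = [[-1, None, 1], [-1, None, 0], [-1, None, -1], [-1, None, 1]]
kept_positions = quality_filter(genotypes)
```

In the toy matrix, the first position is monomorphic and the second is entirely missing, so only the third position is kept.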
Further, the Pearson correlation coefficient is calculated as:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $\rho_{X,Y}$ is the Pearson correlation coefficient, $X$ and $Y$ are the encoded features and the corresponding phenotype data respectively, $\operatorname{cov}$ denotes covariance, and $\sigma$ denotes standard deviation. The Pearson correlation coefficient calculated by this formula ranges from -1 to 1: a value less than 0 indicates negative correlation, a value greater than 0 indicates positive correlation, and a value equal to 0 indicates no linear correlation.
It should be appreciated that the greater the absolute value of the Pearson correlation coefficient, the more closely the two variables are related, i.e., the greater the degree of association; the degree of association between each position of each omics and the phenotype data is thus derived from the Pearson correlation coefficient.
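The correlation-based dimension reduction of sub-step (1.3) can be sketched as follows. This is a minimal Python illustration with hypothetical function names; ranking by the absolute value of the coefficient follows the statement that a larger absolute value means a greater degree of association.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0  # constant column: treated as uncorrelated in this sketch
    return cov / (sd_x * sd_y)

def select_top_positions(features, phenotype, top_l):
    """Indices of the top_l positions ranked by |Pearson r| with the phenotype.

    features: list of samples, each a list of per-position values.
    """
    n_pos = len(features[0])
    scores = [abs(pearson([row[p] for row in features], phenotype))
              for p in range(n_pos)]
    order = sorted(range(n_pos), key=lambda p: scores[p], reverse=True)
    return order[:top_l]

# Toy example: 4 samples x 3 positions, keep the 2 most associated positions
features = [[1, 5, 2], [2, 3, 1], [3, 4, 3], [4, 1, 1]]
phenotype = [1.0, 2.0, 3.0, 4.0]
selected = select_top_positions(features, phenotype, top_l=2)
```

In the embodiment, the same ranking would be applied to each omics separately with L = 1000.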
(2) Normalizing the multi-omics features obtained in step (1), calculating a first similarity between the omics features of each sample of the target species, and constructing graph structure data for each sample using the first similarity; the nodes of the graph represent the different omics, and the edges of the graph represent the first similarity between different omics.
(2.1) Normalizing each dimension-reduced omics feature obtained in step (1.3), specifically as follows: assuming there are M samples in total and each sample has N omics, the j-th dimension-reduced omics feature of the i-th sample is denoted $x_i^j$, $i = 1, \dots, M$, $j = 1, \dots, N$, and normalization is performed according to:
$$\hat{x}_i^j = \frac{x_i^j - \mu_j}{\sigma_j}$$
where $\hat{x}_i^j$ denotes the normalized feature of the j-th omics of the i-th sample, $\mu_j$ is the mean over all samples of omics j, and $\sigma_j$ is the standard deviation over all samples of omics j.
It should be understood that each sample, i.e., each plant, has N omics and therefore N sets of omics data. Illustratively, M = 332 and N = 4 in this embodiment.
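The normalization of sub-step (2.1) is a z-score across samples. The sketch below is a minimal Python illustration for one omics; it assumes the mean and standard deviation are computed per position across samples (the text does not say whether the statistics are per position or pooled), and the function name is hypothetical.

```python
import math

def zscore_columns(matrix):
    """Z-score each position (column) of one omics across all samples.

    matrix: list of M samples, each a list of L feature values.
    Columns with zero standard deviation are mapped to 0.0.
    """
    n_samples, n_pos = len(matrix), len(matrix[0])
    out = [[0.0] * n_pos for _ in range(n_samples)]
    for p in range(n_pos):
        column = [row[p] for row in matrix]
        mu = sum(column) / n_samples
        sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / n_samples)
        for i in range(n_samples):
            out[i][p] = (matrix[i][p] - mu) / sigma if sigma > 0 else 0.0
    return out

# Toy example: 2 samples x 2 positions
normalized = zscore_columns([[1.0, 2.0], [3.0, 2.0]])
```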
(2.2) For each sample, taking the cosine similarity between the normalized features of different omics obtained in step (2.1) as the first similarity, calculated as:
$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
where $\cos(A, B)$ is the cosine similarity, $A$ and $B$ denote the normalized features of two different omics, $A \cdot B$ is their inner product, and $\lVert A \rVert$ denotes the norm of the normalized omics feature $A$.
It should be understood that $\cos(A, B)$ ranges from -1 to 1; the larger its value, the higher the feature similarity and the higher the degree of association.
(2.3) For each sample, constructing the graph structure data of the sample by taking its different omics as graph nodes and the first similarity between different omics features obtained in step (2.2) as the edge weights of the graph.
It should be appreciated that the weight of an edge reflects the degree of association between different omics, i.e., the greater the first similarity, the higher the degree of association between the omics.
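The per-sample graph of sub-steps (2.2) and (2.3) can be represented as an N×N matrix of edge weights. The following Python sketch is illustrative only; the function names are hypothetical, and leaving the diagonal at zero (no self-edges) is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_sample_graph(omics_features):
    """Build the weighted adjacency matrix for one sample.

    omics_features: list of N normalized omics vectors (one per graph node).
    Entry [a][b] holds the first similarity between omics a and omics b.
    """
    n = len(omics_features)
    weights = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            w = cosine_similarity(omics_features[a], omics_features[b])
            weights[a][b] = w
            weights[b][a] = w  # undirected graph: symmetric weights
    return weights

# Toy sample with 3 omics and 2-dimensional features
adjacency = build_sample_graph([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

For the tomato embodiment, each sample would give a 4×4 matrix built from four 1000-dimensional vectors.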
(3) Inputting the graph structure data of each sample obtained in step (2) into a three-layer graph neural network to extract the features of each node, wherein the graph neural network includes graph convolution layers (GraphConvolution) and ReLU activation layers.
Specifically, in the three-layer graph neural network, each graph convolution layer is followed by a ReLU activation layer, and the input and output feature dimensions of the three graph convolution layers are all set to 1000. The graph structure data of each sample are input into the three-layer graph neural network to extract the features of each node, as shown in FIG. 1.
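The forward pass of step (3) can be sketched in plain Python as below. This is a simplified illustration, not the patented network: each layer aggregates neighbour features using the edge weights plus a self-loop, applies a linear map, then ReLU. The self-loop and the absence of degree normalization are assumptions, since the text does not state the propagation rule, and all names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu(matrix):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in matrix]

def graph_conv(adjacency, features, weight):
    """One simplified graph convolution layer followed by ReLU.

    Computes ReLU((A + I) @ H @ W); the added identity is an assumed self-loop.
    """
    n = len(adjacency)
    with_self = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
    return relu(matmul(matmul(with_self, features), weight))

def three_layer_gnn(adjacency, features, weights):
    """Apply the graph convolution + ReLU block once per weight matrix."""
    hidden = features
    for weight in weights:
        hidden = graph_conv(adjacency, hidden, weight)
    return hidden

# Toy example: 2 nodes, 2-dimensional features, identity weights in all 3 layers
identity = [[1.0, 0.0], [0.0, 1.0]]
node_features = three_layer_gnn([[0.0, 1.0], [1.0, 0.0]],
                                [[1.0, -2.0], [3.0, 1.0]],
                                [identity, identity, identity])
```

In the embodiment the weight matrices would be 1000×1000 and learned by backpropagation; a tensor library would replace the list-based matrix product.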
(4) Concatenating the features of each node of each sample extracted in step (3) in multi-omics order to obtain a first concatenated feature; inputting the first concatenated feature into two first fully connected layers to obtain a first phenotype prediction value; and calculating a first loss from the first phenotype prediction value and the corresponding real phenotype data.
It should be understood that the real phenotype data are the phenotype data after the outlier and missing value handling of step (1).
Specifically, the features of each node of each sample extracted in step (3) are concatenated in multi-omics order; for the open-source tomato dataset of this embodiment, the four omics are concatenated in the order SNPs, InDels, SVs, RNA-Seq to obtain a 4000-dimensional first concatenated feature. The 4000-dimensional first concatenated feature is then input into the two first fully connected layers, which output the first phenotype prediction value, as shown in FIG. 1; the first of the two layers has an input dimension of 4000 and an output dimension of 32, and the second has an input dimension of 32 and an output dimension of 1. Finally, the first loss is calculated from the first phenotype prediction value and the corresponding real phenotype data; the first loss comprises two parts, a prediction regression loss and a Pearson correlation coefficient loss.
In the expression of the first loss, Loss1 denotes the first loss, $M$ is the number of samples (332 in this embodiment), $y_i$ is the true phenotype data of sample $i$, $\hat{y}_i$ is the first phenotype prediction value, $\bar{y}$ is the mean of the true phenotype data over all samples, $\bar{\hat{y}}$ is the mean of the first phenotype prediction values, and $\lambda$ is the adjustment coefficient between the prediction regression loss and the Pearson correlation coefficient loss. $\lambda$ ranges from 0 to 1 and is set to 0.5 in this embodiment, although other values can be used according to actual needs.
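One plausible form of a loss that combines a regression term with a Pearson correlation term is sketched below in Python. The exact expression of the first loss is not recoverable from this text, so the form used here, mean squared error plus the adjustment coefficient times (1 minus the Pearson coefficient), is an assumption for illustration, and the function names are hypothetical.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(x, y):
    """Pearson correlation coefficient (0.0 if either input is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0
    return cov / (sd_x * sd_y)

def first_loss(y_true, y_pred, lam=0.5):
    """Assumed form: MSE + lam * (1 - Pearson r between truth and prediction)."""
    regression = mean_squared_error(y_true, y_pred)
    correlation_penalty = 1.0 - pearson(y_true, y_pred)
    return regression + lam * correlation_penalty

# A perfect prediction gives a loss close to zero under this assumed form
loss = first_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

The correlation term rewards predictions that rank samples in the same order as the true phenotype, even when their scale is off, which is why it is paired with a regression term.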
(5) Inputting a global identification embedded feature, together with the features of each node extracted in step (3) as local identification embedded features, into a Transformer network to obtain an updated global identification embedded feature and updated local identification embedded features of each node; calculating a second similarity between the global identification embedded feature and the local identification embedded feature of each node; selecting the local identification embedded features of the two nodes with the largest second similarity, splicing them with the global identification embedded feature, and inputting the result into two second fully connected layers to obtain a second phenotype predicted value; and calculating a second loss based on the second phenotype predicted value and the corresponding real phenotype data.
(5.1) Initializing and generating a global identification embedded feature with random normally distributed parameters, wherein the value of each dimension of the global identification embedded feature is a random value drawn from a normal distribution, and its feature length is the same as the feature length of each node extracted in step (3).
Specifically, a random normal distribution can be obtained with the xavier_normal function of the PyTorch framework, and the global identification embedded feature is then initialized with these random normally distributed parameters; its feature length is the same as the feature length of each node extracted in step (3), i.e., 1000.
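The initialization in step (5.1) can be illustrated without PyTorch. The sketch below is a standard-library stand-in, not the embodiment's code: it draws each dimension from a zero-mean normal distribution whose standard deviation follows the Xavier rule, gain * sqrt(2 / (fan_in + fan_out)). Treating the global feature as a 1 x 1000 tensor (fan_in = 1000, fan_out = 1) is an assumption, as are the names `xavier_normal_vector` and `global_token`.

```python
import math
import random


def xavier_normal_vector(length, fan_out=1, gain=1.0, seed=None):
    """Draw `length` values from N(0, std^2) with the Xavier-normal std."""
    rng = random.Random(seed)
    # Xavier rule: std = gain * sqrt(2 / (fan_in + fan_out)); fan_in = length here.
    std = gain * math.sqrt(2.0 / (length + fan_out))
    return [rng.gauss(0.0, std) for _ in range(length)]


# Same length as each node feature from step (3), i.e. 1000 in this embodiment.
global_token = xavier_normal_vector(1000, seed=0)
```

In a PyTorch implementation the vector would normally be registered as a trainable parameter, so that it is updated together with the other network parameters in step (6).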
(5.2) For each sample, calculating a first attention matrix of the sample from the features of all nodes extracted in step (3), using the following formula:
where $A_i^{(1)}$ denotes the first attention matrix of the $i$-th sample, $\sigma$ denotes the softmax activation function, and $X_i$ denotes the feature matrix of all nodes obtained in step (3), in which each node feature vector is $x_j$, with $j = 1, 2, 3, 4$ in this embodiment.
(5.3) Taking the features of each node extracted in step (3) as local identification embedded features, and inputting the global identification embedded feature G obtained in step (5.1) together with the local identification embedded features of all nodes into a Transformer network. The Transformer network comprises a self-attention module and a feed-forward neural network module: the self-attention module computes a second attention matrix for the sample, the second attention matrix is combined with the first attention matrix obtained in step (5.2) to update the features, and the feed-forward neural network module then yields the updated global identification embedded feature and the updated local identification embedded features of all nodes.
The calculation formula of the second attention matrix is as follows:
where $A^{(2)}$ denotes the second attention matrix of the sample, $Q$ and $K$ are each the feature matrix composed of the global identification embedded feature and the local identification embedded features of each node, and $d$ is the feature dimension of $K$.
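For orientation, the standard scaled dot-product form of self-attention, softmax(Q K^T / sqrt(d)), can be sketched as follows. That the second attention matrix takes exactly this form, and that Q and K are the raw token matrix without learned projections, are assumptions made for illustration; the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(tokens):
    """tokens: equal-length vectors, e.g. the global feature followed by one
    local feature per omics node. Returns a row-stochastic attention matrix."""
    d = len(tokens[0])
    scale = math.sqrt(d)
    # Assumption: Q = K = tokens (no learned projection matrices).
    scores = [
        [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        for q in tokens
    ]
    return [softmax(row) for row in scores]
```

For one global feature and four omics nodes the result is a 5 x 5 matrix whose rows each sum to 1.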
The updated global identification embedded feature and the updated local identification embedded features of each node are expressed as follows:
where $\tilde{V}$ denotes the updated global identification embedded feature and the updated local identification embedded features of each node, and $V$ denotes the global identification embedded feature G together with the local identification embedded features of each node.
(5.4) Calculating the cosine similarity between the updated global identification embedded feature obtained in step (5.3) and the local identification embedded feature of each node to obtain the second similarity; selecting the local identification embedded features of the two nodes with the largest second similarity and splicing them with the global identification embedded feature to obtain a second spliced feature; inputting the second spliced feature into two second fully connected layers to obtain a second phenotype predicted value; and calculating a second loss based on the second phenotype predicted value and the corresponding real phenotype data.
Specifically, the second similarity between each omics feature and the global feature is calculated, i.e., the cosine similarity between the updated global identification embedded feature obtained in step (5.3) and the local identification embedded feature of each node. The local identification embedded features of the two nodes with the largest second similarity are spliced with the global identification embedded feature to obtain a 3000-dimensional second spliced feature; the splicing method is the same as in step (4). The second spliced feature is input into two second fully connected layers to obtain the second phenotype predicted value, as shown in Fig. 1; the first of the second fully connected layers has an input dimension of 3000 and an output dimension of 32, and the second of the second fully connected layers has an input dimension of 32 and an output dimension of 1. Finally, the second loss is calculated from the second phenotype predicted value and the corresponding real phenotype data; the second loss is a predictive regression loss, expressed as follows:
where Loss2 is the second loss, $M$ is the number of samples (332 in this embodiment), $y_i$ is the real phenotype data of sample $i$, and $\hat{y}_i$ is the second phenotype predicted value output by the Transformer network.
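The selection and splicing in step (5.4) can be sketched in plain Python as below. The order used inside the spliced vector (the two selected local features in omics order, followed by the global feature) is an assumption, as are the function names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def second_spliced_feature(global_feat, local_feats, k=2):
    """Pick the k local features most similar to the global feature and
    concatenate them with it. Returns (spliced vector, chosen node indices)."""
    sims = [cosine(global_feat, f) for f in local_feats]
    top = sorted(range(len(local_feats)), key=lambda j: sims[j], reverse=True)[:k]
    top.sort()  # assumption: keep the selected nodes in omics order
    spliced = []
    for j in top:
        spliced.extend(local_feats[j])
    spliced.extend(global_feat)  # assumption: global feature goes last
    return spliced, top
```

With 1000-dimensional features and k = 2, the spliced vector has 3 x 1000 = 3000 dimensions, which matches the input dimension of the first of the second fully connected layers.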
(6) Repeating steps (3) to (5) for iterative training, and using stochastic gradient descent with back-propagation to adjust, according to the total loss obtained from the first loss and the second loss, the network parameters of the three-layer graph neural network in step (3), the two first fully connected layers in step (4), and the Transformer network and the two second fully connected layers in step (5), so as to obtain the trained three-layer graph neural network, two first fully connected layers, Transformer network and two second fully connected layers.
It should be understood that steps (3) to (5) are repeated for iterative training, and training stops once the set number of iterations is reached, yielding the final network parameters; for example, the number of iterations in this embodiment is set to 100.
(7) Obtaining a first phenotype predicted value through the trained three-layer graph neural network and the two first fully connected layers, obtaining a second phenotype predicted value through the trained three-layer graph neural network, the Transformer network and the two second fully connected layers, and taking the mean of the first phenotype predicted value and the second phenotype predicted value as the final predicted phenotype result.
In this embodiment, predicting phenotypes with the trained three-layer graph neural network, two first fully connected layers, Transformer network and two second fully connected layers, as shown in Fig. 3, specifically comprises the following steps:
(7.1) Acquiring the multi-omics data of the material: SNPs (single nucleotide polymorphisms), InDels (base insertions and deletions), SVs (structural variations) and RNA-Seq data.
(7.2) Deleting, from the multi-omics data of each sample, the sites whose missing rate exceeds 90%, and deleting from the genomic data the sites whose minor allele frequency is below 5%, thereby completing quality control filtering of the multi-omics data.
(7.3) Encoding the multi-omics data after quality control filtering, wherein the genomic data are encoded as -1, 0 and 1 according to the three states homozygous, heterozygous and homozygous variant, and the other omics data are not encoded and keep their measured values; then reducing the dimension of each omics according to the L sites selected for each omics data in step (1.3); and then constructing the graph structure data according to the method of step (2).
(7.4) Inputting the graph structure data into the trained three-layer graph neural network, two first fully connected layers, Transformer network and two second fully connected layers, whose network parameters are those obtained by training in step (6), to obtain the first phenotype predicted value and the second phenotype predicted value respectively, and taking the mean of the two as the final predicted phenotype result.
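The quality control and encoding of steps (7.2) and (7.3) can be sketched as follows. The input representation is an assumption: each genomic site is a list with one entry per sample holding the count of variant alleles (0, 1 or 2), with `None` for a missing call. The function names are hypothetical.

```python
def missing_rate(site):
    """Fraction of samples with no call at this site."""
    return sum(v is None for v in site) / len(site)


def minor_allele_frequency(site):
    """MAF from variant-allele counts (0/1/2) of diploid samples."""
    calls = [v for v in site if v is not None]
    if not calls:
        return 0.0
    p = sum(calls) / (2 * len(calls))  # frequency of the variant allele
    return min(p, 1.0 - p)


def qc_filter(sites, genomic=True, max_missing=0.9, min_maf=0.05):
    """Return the indices of sites that survive quality control."""
    kept = []
    for idx, site in enumerate(sites):
        if missing_rate(site) > max_missing:
            continue  # missing rate above 90%
        if genomic and minor_allele_frequency(site) < min_maf:
            continue  # minor allele frequency below 5% (genomic data only)
        kept.append(idx)
    return kept


def encode_genotype(count):
    """0 -> -1 (homozygous), 1 -> 0 (heterozygous), 2 -> 1 (homozygous variant)."""
    return None if count is None else count - 1
```

For non-genomic omics such as RNA-Seq, `qc_filter(sites, genomic=False)` applies only the missing-rate criterion, and the measured values are used without encoding.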
The experimental results of this embodiment on the tomato dataset are shown in Table 1, where the entries for SNPs (single nucleotide polymorphisms), InDels (base insertions and deletions), SVs (structural variations) and RNA-Seq gene expression are the results of the conventional method RRBLUP. The Pearson correlation coefficient results for each are shown in Fig. 4; the method of the present invention improves on the conventional method by 11.94% to 84.02%.
Table 1 experimental results
Referring to Fig. 5, a multi-omics-based plant phenotype prediction apparatus provided in an embodiment of the present invention includes one or more processors and a memory coupled to the processors; the memory is configured to store program data, and the processors are configured to execute the program data to implement the multi-omics-based plant phenotype prediction method of the above embodiment.
The embodiments of the multi-omics-based plant phenotype prediction apparatus of the present invention may be applied to any device with data processing capability, such as a computer. The apparatus embodiments may be implemented in software, in hardware, or in a combination of the two. Taking a software implementation as an example, the apparatus in a logical sense is formed by the processor of the device with data processing capability on which it resides reading the corresponding computer program instructions from non-volatile storage into memory. In terms of hardware, Fig. 5 shows a hardware structure diagram of the device with data processing capability on which the multi-omics-based plant phenotype prediction apparatus of the present invention resides; in addition to the processor, memory, network interface and non-volatile storage shown in Fig. 5, the device on which the apparatus of this embodiment resides may generally include other hardware according to its actual function, which is not described in detail here.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
Since the apparatus embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant details. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art can understand and implement the present invention without undue effort.
The embodiment of the present invention also provides a computer-readable storage medium having a program stored thereon which, when executed by a processor, implements the multi-omics-based plant phenotype prediction method of the above embodiment.
The computer-readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in any of the previous embodiments. The computer-readable storage medium may also be an external storage device of such a device, for example a plug-in hard disk, a Smart Media Card (SMC), an SD card or a flash memory card (Flash Card) provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of any device with data processing capability. The computer-readable storage medium is used for storing the computer program and the other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A multi-omics-based plant phenotype prediction method, comprising the following steps:
(1) Acquiring multi-omics data and phenotype data of a target species, performing quality control and dimension reduction on the multi-omics data, and performing outlier and missing value processing on the phenotype data;
(2) Normalizing the multi-omics features obtained in step (1), calculating a first similarity between the omics features of each sample of the target species, and constructing graph structure data of each sample using the first similarity;
(3) Inputting the graph structure data of each sample obtained in step (2) into a three-layer graph neural network to extract the features of each node;
(4) Splicing the features of each node of each sample extracted in step (3) in multi-omics order, and then inputting the spliced features into two first fully connected layers to obtain a first phenotype predicted value; calculating a first loss based on the first phenotype predicted value and the corresponding real phenotype data;
(5) Inputting the features of each node obtained in step (3), as local identification embedded features, together with a global identification embedded feature into a Transformer network to obtain an updated global identification embedded feature and updated local identification embedded features of each node; calculating a second similarity between the global identification embedded feature and the local identification embedded feature of each node; selecting the local identification embedded features of the two nodes with the largest second similarity, splicing them with the global identification embedded feature, and then inputting the result into two second fully connected layers to obtain a second phenotype predicted value; calculating a second loss based on the second phenotype predicted value and the corresponding real phenotype data;
(6) Repeating steps (3) to (5) for training, and using stochastic gradient descent with back-propagation to adjust the network parameters of the graph neural network, the first fully connected layers, the Transformer network and the second fully connected layers according to the total loss obtained by adding the first loss and the second loss, so as to obtain the trained graph neural network, first fully connected layers, Transformer network and second fully connected layers;
(7) Obtaining a first phenotype predicted value through the trained graph neural network and the first fully connected layers, obtaining a second phenotype predicted value through the trained graph neural network, the Transformer network and the second fully connected layers, and taking the mean of the first phenotype predicted value and the second phenotype predicted value as the final predicted phenotype result.
2. The multi-omics-based plant phenotype prediction method according to claim 1, wherein the multiple omics comprise the genome, the transcriptome and the metabolome;
the graph neural network includes a graph convolutional layer and a ReLU activation layer.
3. The multi-omics-based plant phenotype prediction method according to claim 1, wherein in step (1), performing quality control and dimension reduction on the multi-omics data and performing outlier and missing value processing on the phenotype data specifically comprises the following sub-steps:
(1.1) removing missing values and abnormal outliers from the phenotype data of each sample of the target species;
(1.2) deleting, from the multi-omics data of each sample of the target species, the sites whose missing rate exceeds 90%, and deleting from the genomic data the sites whose minor allele frequency is below 5%, so as to complete quality control filtering of the multi-omics data;
(1.3) first encoding the multi-omics data after the quality control filtering of step (1.2), wherein the genomic data are encoded as -1, 0 and 1 according to the three states homozygous, heterozygous and homozygous variant, and the other omics data are not encoded and keep their measured values; then, for the data at each site of each omics, calculating the Pearson correlation coefficient between the encoded features of all samples and the corresponding phenotype data to obtain the degree of association between each site of each omics data and the phenotype data, sorting the sites by degree of association in descending order, and selecting the first L sites of each omics data as the dimension-reduced omics features, so as to obtain the dimension-reduced multi-omics features.
4. The multi-omics-based plant phenotype prediction method according to claim 3, wherein the Pearson correlation coefficient is calculated by the formula:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
where $\rho_{X,Y}$ is the Pearson correlation coefficient, $X$ and $Y$ are the encoded features and the corresponding phenotype data respectively, $\operatorname{cov}$ denotes covariance, and $\sigma$ denotes standard deviation.
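By way of illustration only, the ranking of sites by Pearson correlation and the selection of the first L sites described in sub-step (1.3) could be sketched as below. Ranking by the absolute value of the coefficient is an assumption (the text speaks only of a degree of association), and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # The common 1/n factors cancel, so they are omitted above.
    return cov / (sd_x * sd_y) if sd_x > 0 and sd_y > 0 else 0.0


def top_l_sites(sites, phenotype, l):
    """sites: one list of encoded values per site (one entry per sample).
    Returns the indices of the l sites most associated with the phenotype."""
    scores = [abs(pearson(site, phenotype)) for site in sites]  # assumption: |r|
    order = sorted(range(len(sites)), key=lambda j: scores[j], reverse=True)
    return order[:l]
```

The indices returned for the training samples would be stored and reused when new samples are reduced to the same L sites at prediction time.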
5. The multi-omics-based plant phenotype prediction method according to claim 1, wherein step (2) comprises the following sub-steps:
(2.1) normalizing each dimension-reduced omics feature obtained in step (1.3), specifically as follows: assuming a total of M samples, each sample having N omics, the j-th dimension-reduced omics feature of the i-th sample is denoted $x_{ij}$, $i = 1, \dots, M$, $j = 1, \dots, N$, and normalization is performed according to the following formula:
$$\hat{x}_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$
where $\hat{x}_{ij}$ denotes the normalized feature of the j-th omics of the i-th sample, $\mu_j$ is the mean over all samples of the j-th omics, and $\sigma_j$ is the standard deviation over all samples of the j-th omics;
(2.2) for each sample, calculating the cosine similarity between the normalized features of different omics obtained in step (2.1) as the first similarity, using the following formula:
$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
where $\cos(A, B)$ is the cosine similarity, $A$ and $B$ denote the normalized features of two different omics, $A \cdot B$ denotes the inner product, and $\lVert A \rVert$ denotes the norm of the normalized omics feature $A$;
(2.3) for each sample, constructing the graph structure data of the sample by taking the different omics of the sample as graph nodes and the first similarity between the different omics features obtained in step (2.2) as the edge weights of the graph.
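By way of illustration only, sub-steps (2.1) to (2.3) could be sketched as follows. Standardizing each site separately across samples, using the population standard deviation, and setting the diagonal of the edge-weight matrix to zero (no self-loops) are assumptions; the function names are hypothetical.

```python
import math


def zscore_columns(matrix):
    """matrix: M samples x L sites for one omics. Standardizes each site
    across samples (assumption: per-site mean and population std)."""
    m = len(matrix)
    out = [row[:] for row in matrix]
    for c in range(len(matrix[0])):
        col = [row[c] for row in matrix]
        mu = sum(col) / m
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / m)
        for r in range(m):
            out[r][c] = (matrix[r][c] - mu) / sd if sd > 0 else 0.0
    return out


def cosine(a, b):
    """Cosine similarity A.B / (|A| |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def sample_graph(omics_feats):
    """omics_feats: the N normalized omics feature vectors of one sample.
    Returns an N x N edge-weight matrix (nodes = omics)."""
    n = len(omics_feats)
    return [
        [cosine(omics_feats[a], omics_feats[b]) if a != b else 0.0  # no self-loops
         for b in range(n)]
        for a in range(n)
    ]
```

For a sample with four omics of L sites each, `sample_graph` yields a symmetric 4 x 4 matrix whose off-diagonal entries are the first similarities used as edge weights.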
6. The multi-omics-based plant phenotype prediction method according to claim 1, wherein the first loss comprises a predictive regression loss and a Pearson correlation coefficient loss, expressed as:
where Loss1 represents the first loss, $M$ is the number of samples, $y_i$ is the real phenotype data of sample $i$, $\hat{y}_i$ is the first phenotype predicted value of sample $i$, $\bar{y}$ is the mean of the real phenotype data over all samples, $\bar{\hat{y}}$ is the mean of the first phenotype predicted values, and $\lambda$ is the adjustment coefficient between the predictive regression loss and the Pearson correlation coefficient loss.
7. The multi-omics-based plant phenotype prediction method according to claim 1, wherein step (5) comprises the following sub-steps:
(5.1) initializing and generating a global identification embedded feature with random normally distributed parameters, wherein the value of each dimension of the global identification embedded feature is a random value drawn from a normal distribution, and its feature length is the same as the feature length of each node extracted in step (3);
(5.2) for each sample, calculating a first attention matrix of the sample from the features of all nodes extracted in step (3);
(5.3) taking the features of each node extracted in step (3) as local identification embedded features, and inputting the global identification embedded feature obtained in step (5.1) together with the local identification embedded features of each node into a Transformer network, wherein the Transformer network comprises a self-attention module and a feed-forward neural network module; a second attention matrix corresponding to the sample is calculated by the self-attention module, the second attention matrix is combined with the first attention matrix obtained in step (5.2) to update the features, and the updated global identification embedded feature and the updated local identification embedded features of each node are obtained through the feed-forward neural network module;
(5.4) calculating the cosine similarity between the updated global identification embedded feature obtained in step (5.3) and the local identification embedded feature of each node as the second similarity; selecting the local identification embedded features of the two nodes with the largest second similarity and splicing them with the global identification embedded feature to obtain a second spliced feature; inputting the second spliced feature into two second fully connected layers to obtain a second phenotype predicted value; and calculating a second loss based on the second phenotype predicted value and the corresponding real phenotype data.
8. The multi-omics-based plant phenotype prediction method according to claim 1 or 7, wherein the second loss is a predictive regression loss expressed as:
where Loss2 is the second loss, $M$ is the number of samples, $y_i$ is the real phenotype data of sample $i$, and $\hat{y}_i$ is the second phenotype predicted value output by the Transformer network.
9. A multi-omics-based plant phenotype prediction apparatus, comprising one or more processors and a memory coupled to the processors, wherein the memory is configured to store program data and the processors are configured to execute the program data to implement the multi-omics-based plant phenotype prediction method of any one of claims 1-8.
10. A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the multi-omics-based plant phenotype prediction method of any one of claims 1-8.
CN202311269915.8A 2023-09-28 2023-09-28 Plant phenotype prediction method and device based on multiple groups of science Active CN116992919B (en)

Publications (2)

Publication Number Publication Date
CN116992919A true CN116992919A (en) 2023-11-03
CN116992919B CN116992919B (en) 2023-12-19

Family

ID=88530695

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant