CN116072214B - Phenotype intelligent prediction and training method and device based on gene significance enhancement - Google Patents

Publication number
CN116072214B
Authority
CN
China
Prior art keywords
gene
phenotype
value
morphology
chi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310202392.9A
Other languages
Chinese (zh)
Other versions
CN116072214A (en
Inventor
应志文
章依依
徐晓刚
王军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310202392.9A priority Critical patent/CN116072214B/en
Publication of CN116072214A publication Critical patent/CN116072214A/en
Application granted granted Critical
Publication of CN116072214B publication Critical patent/CN116072214B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a phenotype intelligent prediction and training method and device based on gene significance enhancement. An actual distribution list is constructed from gene morphology and phenotype level (high or low); an expected distribution list of gene morphology and phenotype level is constructed under the chi-square hypothesis; a chi-square test is performed between each gene locus and the phenotype; the probability that the chi-square hypothesis holds is obtained from the chi-square table, which gives the significance value of the gene locus for the phenotype; and the genes are encoded. The code of each gene is then amplified according to the significance value of its gene locus, which strengthens the association between the gene data and the phenotype and improves the accuracy of predicting the phenotype from gene loci. The invention is aimed at diploid organisms and adopts a deep learning training method, improving the prediction accuracy from gene loci to phenotype by enhancing the gene locus data.

Description

Phenotype intelligent prediction and training method and device based on gene significance enhancement
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a phenotype intelligent prediction and training method and device based on gene significance enhancement.
Background
In predicting phenotype from genes, prediction with deep learning models has received wide attention and application. One current mainstream method uses a convolutional neural network to extract convolutional features from gene data and thereby trains a model that predicts phenotype from genes. However, this approach ignores the fact that individual genes contribute to the phenotype to different degrees, which lowers the accuracy of phenotype prediction.
Disclosure of Invention
In order to overcome the shortcomings of the prior art and improve the accuracy of predicting phenotype from genes, the invention adopts the following technical scheme:
the phenotype intelligent prediction training method based on gene significance enhancement comprises the following steps:
step one: obtaining the phenotype value and the corresponding gene sequence of each gene sample, wherein the gene sequence comprises a group of gene loci;
step two: calculating a phenotype mean from the phenotype values, classifying the phenotypes by the phenotype mean, and constructing an actual distribution list of gene morphology and phenotype category; obtaining an expected distribution list of gene morphology and phenotype category based on the hypothesis that gene morphology is uncorrelated with phenotype category; calculating the chi-square statistic under this hypothesis from the actual distribution list and the expected distribution list of the samples, obtaining the probability value that the hypothesis holds by querying the chi-square table, and calculating the significance value of the gene locus for the phenotype based on the probability value;
step three: scaling the encoded gene loci by the significance values to obtain the gene enhancement data corresponding to the gene samples;
step four: constructing a neural network model and performing phenotype prediction training on the gene enhancement data set to obtain a trained model for predicting phenotype from gene enhancement data.
Step two comprises the following steps:
step 2.1: letting k be the gene locus whose significance value is to be calculated, stating the chi-square hypothesis $H_{0k}$: gene locus k has no significant relationship with phenotype y;
step 2.2: calculating the mean of the N gene sample phenotypes, classifying all gene samples according to the mean, and, combining the multiple morphologies of the gene locus, constructing an actual distribution list of gene morphology and phenotype category to obtain the actual distribution $O_{mn}$ of each gene morphology under the different phenotype categories;
step 2.3: under the chi-square hypothesis $H_{0k}$ that gene morphology has no significant relationship with phenotype, constructing an expected distribution list of gene morphology and phenotype to obtain the expected distribution $E_{mn}$ of each gene morphology under the different phenotype categories; calculating the chi-square statistic:
$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
wherein m represents the number of gene locus morphologies and n represents the number of phenotype categories of the gene samples; the chi-square table is queried with the chi-square statistic to obtain the probability $P_k$ that the chi-square hypothesis holds, and the significance values of all gene loci are calculated from the probabilities $P_k$.
In step 2.2, the gene locus has three morphologies, AA, Aa and aa; deletions (missing genotypes) are not counted.
In step 2.2, all gene samples are divided into two classes according to the mean:
$$\text{class}(y_i) = \begin{cases} \text{high}, & y_i \ge \bar{y} \\ \text{low}, & y_i < \bar{y} \end{cases}$$
wherein $y_i$ denotes the phenotype value of gene sample i and $\bar{y}$ denotes the mean; samples greater than or equal to the mean form one class, HN samples in total, and samples less than the mean form the other class, LN samples in total;
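The two-class split of step 2.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the patented implementation: the function name `split_by_mean` and the string labels `"high"` and `"low"` are introduced here and do not appear in the source.

```python
from statistics import fmean

def split_by_mean(phenotypes):
    """Label each sample 'high' (>= mean) or 'low' (< mean).

    Returns (labels, HN, LN), where HN and LN are the two class sizes.
    """
    mean = fmean(phenotypes)
    labels = ["high" if y >= mean else "low" for y in phenotypes]
    hn = labels.count("high")
    ln = len(labels) - hn
    return labels, hn, ln

# Hypothetical phenotype values; their mean is 3.0.
labels, hn, ln = split_by_mean([1.0, 2.0, 3.0, 6.0])
```

A sample exactly at the mean falls into the high class, matching the "greater than or equal to" rule in the text.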
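in step 2.3, under the chi-square hypothesis $H_{0k}$ that gene morphology has no significant relationship with phenotype:
$$O_{11}:O_{12} \approx O_{21}:O_{22} \approx O_{31}:O_{32} \approx HN:LN$$
thus obtaining the expected distribution of gene morphology and phenotype:
$$E_{11} = (O_{11} + O_{12}) \cdot \frac{HN}{HN + LN}$$
$$E_{12} = (O_{11} + O_{12}) \cdot \frac{LN}{HN + LN}$$
$$E_{21} = (O_{21} + O_{22}) \cdot \frac{HN}{HN + LN}$$
$$E_{22} = (O_{21} + O_{22}) \cdot \frac{LN}{HN + LN}$$
$$E_{31} = (O_{31} + O_{32}) \cdot \frac{HN}{HN + LN}$$
$$E_{32} = (O_{31} + O_{32}) \cdot \frac{LN}{HN + LN}$$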
wherein $O_{11}$ and $E_{11}$ denote the actual and expected values of the high phenotype for gene morphology AA; $O_{12}$ and $E_{12}$ denote the actual and expected values of the low phenotype for gene morphology AA; $O_{21}$ and $E_{21}$ denote the actual and expected values of the high phenotype for gene morphology Aa; $O_{22}$ and $E_{22}$ denote the actual and expected values of the low phenotype for gene morphology Aa; $O_{31}$ and $E_{31}$ denote the actual and expected values of the high phenotype for gene morphology aa; and $O_{32}$ and $E_{32}$ denote the actual and expected values of the low phenotype for gene morphology aa.
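The chi-square test of step 2.3 can be sketched as follows, using only the Python standard library. Several points are assumptions made for illustration and are not stated in the source: the function name `chi_square_p_value`, the choice of (rows − 1) × (2 − 1) degrees of freedom, the handling of genotypes that do not occur at a locus, and the use of closed-form tail probabilities in place of a printed chi-square table.

```python
import math

def chi_square_p_value(observed, hn, ln):
    """Chi-square test for one locus.

    observed: rows [O_high, O_low] for genotypes AA, Aa, aa.
    Expected counts keep each row total and follow E_high : E_low = HN : LN.
    Returns (chi2, p), where p is the probability that the hypothesis holds.
    """
    if hn == 0 or ln == 0:
        return 0.0, 1.0  # only one phenotype class: nothing to test
    total = hn + ln
    chi2 = 0.0
    rows_used = 0
    for o_high, o_low in observed:
        row_total = o_high + o_low
        if row_total == 0:
            continue  # genotype absent at this locus (assumption: skip it)
        rows_used += 1
        e_high = row_total * hn / total
        e_low = row_total * ln / total
        chi2 += (o_high - e_high) ** 2 / e_high + (o_low - e_low) ** 2 / e_low
    df = rows_used - 1  # (rows - 1) * (2 classes - 1)
    if df <= 0:
        return chi2, 1.0
    if df == 1:
        # Chi-square tail probability with 1 degree of freedom.
        return chi2, math.erfc(math.sqrt(chi2 / 2.0))
    # df == 2: the tail probability is exactly exp(-chi2 / 2).
    return chi2, math.exp(-chi2 / 2.0)

# Hypothetical counts: 120 samples, 60 high and 60 low.
chi2, p = chi_square_p_value([[30, 10], [20, 20], [10, 30]], hn=60, ln=60)
```

With these counts every expected cell is 20, so the statistic is 20 and the probability is exp(−10). A library routine such as SciPy's chi-square distribution could replace the closed forms when more genotype or phenotype classes are used.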
The gene enhancement data in step three is:
$$X_k = -\log_{10} P_k \cdot x_k$$
wherein $x_k$ represents the encoded gene locus, $P_k$ represents the probability that the chi-square hypothesis holds, and $-\log_{10} P_k$ represents the significance value.
The gene loci are encoded with one-hot encoding.
In step four, the gene enhancement data set X is divided into a training set and a test set, and the training set is input into the neural network model for training. First, the amount of data input to the network at a time is set as the batch size, so the input dimension is batch size × m × K, where m represents the number of gene locus morphologies and K represents the sequence length. The neural network extracts features from the gene enhancement data, and a fully connected layer connects the features to output a predicted phenotype value. The real and predicted phenotype values are input into the loss network for loss calculation, the resulting loss is back-propagated, and the corresponding parameters are updated. After repeated iterative updates the loss value converges and iteration stops, giving the trained model for predicting phenotype from gene enhancement data.
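The input dimension batch size × m × K can be illustrated with a small batching sketch. It assumes, purely for illustration, that each enhanced sample is stored as a list of K locus codes of length m and is transposed to m × K before entering the network; the function name `make_batches` is hypothetical.

```python
def make_batches(samples, batch_size):
    """Group enhanced samples into batches shaped (batch, m, K).

    Each sample is a list of K locus codes; each code is a list of m floats.
    """
    batches = []
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        # Transpose each sample from K x m to m x K.
        batch = [[list(channel) for channel in zip(*sample)] for sample in chunk]
        batches.append(batch)
    return batches

# Hypothetical data: 5 samples, K = 4 loci, m = 3 morphologies.
samples = [[[float(i), 0.0, 0.0] for _ in range(4)] for i in range(5)]
batches = make_batches(samples, batch_size=2)
```

Here the first two batches have shape 2 × 3 × 4 and the last batch holds the single remaining sample.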
The phenotype intelligent prediction training device based on the gene significance enhancement comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the phenotype intelligent prediction training method based on the gene significance enhancement when executing the executable codes.
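The phenotype intelligent prediction method based on gene significance enhancement predicts the phenotype of a gene sample with the model for predicting phenotype from gene enhancement data that has been trained by the above phenotype intelligent prediction training method based on gene significance enhancement.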
The phenotype intelligent prediction device based on gene significance enhancement comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the phenotype intelligent prediction method based on gene significance enhancement when executing the executable codes.
The invention has the advantages that:
According to the phenotype intelligent prediction and training method and device based on gene significance enhancement, the significance value of each SNP locus is calculated by a chi-square test, the significance value is used as the contribution of the gene locus to scale the gene coding data, and a deep learning neural network then extracts features from the scaled gene data. Existing intelligent prediction simply applies a deep learning network to the gene data to extract features. By contrast, the invention first scales the gene coding data by the significance of each locus for the phenotype and only then extracts features with the deep learning network; adding the contribution of each gene to the phenotype in this way improves the accuracy of gene-based phenotype prediction.
Drawings
FIG. 1 is a flow chart of a method for intelligently predicting phenotypes based on gene significance enhancement in an embodiment of the invention.
FIG. 2 is a schematic diagram of a saliency enhancement process in an embodiment of the invention.
FIG. 3 is a schematic structural diagram of a phenotype intelligent prediction device based on gene significance enhancement in an embodiment of the invention.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
As shown in fig. 1, the intelligent phenotype prediction method based on gene significance enhancement comprises the following steps:
step one: obtaining a phenotype value and a corresponding gene sequence of a gene sample, wherein the gene sequence comprises a group of gene loci;
In the embodiment of the invention, the phenotype values of N gene samples are collected; the gene sequence corresponding to each gene sample has length K and consists of gene loci (single nucleotide polymorphisms, SNPs).
Step two: calculating a phenotype mean from the phenotype values, classifying the phenotypes by the phenotype mean, and constructing an actual distribution list of gene morphology and phenotype category; obtaining an expected distribution list of gene morphology and phenotype category based on the hypothesis that gene morphology is uncorrelated with phenotype category; calculating the chi-square statistic under this hypothesis from the actual distribution list and the expected distribution list of the samples, obtaining the probability value that the hypothesis holds by querying the chi-square table, and calculating the significance value of the gene locus for the phenotype based on the probability value;
In the embodiment of the invention, the significance of each gene locus for the phenotype is calculated. First, the phenotype mean $\bar{y}$ of the N samples in the data set is calculated. The phenotypes are classified according to the phenotype mean $\bar{y}$, and combining the phenotype classes with the three gene morphologies gives an actual distribution list of gene sample phenotypes under the three gene morphologies. The hypothesis is made that gene morphology is unrelated to phenotype, which gives an expected distribution list of gene sample phenotypes under the three gene morphologies. The chi-square statistic under this hypothesis is calculated from the actual and expected distribution lists of the samples, the probability value that the hypothesis holds is obtained by querying the chi-square table, and the significance value of the gene locus for the phenotype is calculated from this probability value. As shown in fig. 2, this specifically comprises the following steps:
Step 2.1: letting k be the gene locus whose significance value is to be calculated, stating the chi-square hypothesis $H_{0k}$: gene locus k has no significant relationship with phenotype y.
Step 2.2: calculating the mean of the N gene sample phenotypes, classifying all gene samples according to the mean, and, combining the multiple morphologies of the gene locus, constructing an actual distribution list of gene morphology and phenotype category to obtain the actual distribution $O_{mn}$ of each gene morphology under the different phenotype categories.
Specifically, the mean $\bar{y}$ of the N gene sample phenotypes is calculated:
$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$$
According to the mean $\bar{y}$, all gene samples are divided into two classes: samples greater than or equal to the mean form one class, HN samples in total, and samples less than the mean form the other class, LN samples in total. The three morphologies of the gene locus are denoted AA, Aa and aa; deletions are not counted. Counting the samples then gives the actual distribution list of gene morphology and phenotype:
Gene morphology    High phenotype    Low phenotype
AA                 $O_{11}$          $O_{12}$
Aa                 $O_{21}$          $O_{22}$
aa                 $O_{31}$          $O_{32}$
wherein $O_{11}$ denotes the actual value of the high phenotype for gene morphology AA, $O_{12}$ the actual value of the low phenotype for gene morphology AA, $O_{21}$ the actual value of the high phenotype for gene morphology Aa, $O_{22}$ the actual value of the low phenotype for gene morphology Aa, $O_{31}$ the actual value of the high phenotype for gene morphology aa, and $O_{32}$ the actual value of the low phenotype for gene morphology aa.
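Counting the actual distribution list for one locus can be sketched as below. The representation is an assumption made for illustration: genotypes are given as the strings `"AA"`, `"Aa"`, `"aa"`, a missing genotype is `None`, and phenotype classes are the strings `"high"` and `"low"`; the function name `count_observed` is hypothetical.

```python
def count_observed(genotypes, labels):
    """Build the 3 x 2 actual distribution list for one locus.

    genotypes: per-sample 'AA', 'Aa', 'aa', or None when missing.
    labels: per-sample phenotype class, 'high' or 'low'.
    Rows are AA, Aa, aa; columns are high, low.
    """
    row = {"AA": 0, "Aa": 1, "aa": 2}
    col = {"high": 0, "low": 1}
    table = [[0, 0], [0, 0], [0, 0]]
    for genotype, label in zip(genotypes, labels):
        if genotype not in row:
            continue  # deletion / missing genotype: not counted
        table[row[genotype]][col[label]] += 1
    return table

# Hypothetical locus with six samples, one of them missing.
observed = count_observed(
    ["AA", "AA", "Aa", None, "aa", "aa"],
    ["high", "low", "high", "high", "low", "low"],
)
```

The missing sample is skipped, so the table entries sum to five rather than six.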
Step 2.3: under the chi-square hypothesis $H_{0k}$ that gene morphology has no significant relationship with phenotype, constructing an expected distribution list of gene morphology and phenotype to obtain the expected distribution $E_{mn}$ of each gene morphology under the different phenotype categories; calculating the chi-square statistic, querying the chi-square table with the chi-square statistic to obtain the probability $P_k$ that the chi-square hypothesis holds, and calculating the significance values of all gene loci from the probabilities $P_k$.
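Specifically, under the chi-square hypothesis $H_{0k}$ that gene morphology has no significant relationship with phenotype, it follows theoretically that:
$$O_{11}:O_{12} \approx O_{21}:O_{22} \approx O_{31}:O_{32} \approx HN:LN$$
thus obtaining the expected distribution list of gene morphology and phenotype:
Gene morphology    High phenotype                                        Low phenotype
AA                 $E_{11} = (O_{11} + O_{12}) \cdot HN / (HN + LN)$     $E_{12} = (O_{11} + O_{12}) \cdot LN / (HN + LN)$
Aa                 $E_{21} = (O_{21} + O_{22}) \cdot HN / (HN + LN)$     $E_{22} = (O_{21} + O_{22}) \cdot LN / (HN + LN)$
aa                 $E_{31} = (O_{31} + O_{32}) \cdot HN / (HN + LN)$     $E_{32} = (O_{31} + O_{32}) \cdot LN / (HN + LN)$
wherein $E_{11}$ denotes the expected value of the high phenotype for gene morphology AA, $E_{12}$ the expected value of the low phenotype for gene morphology AA, $E_{21}$ the expected value of the high phenotype for gene morphology Aa, $E_{22}$ the expected value of the low phenotype for gene morphology Aa, $E_{31}$ the expected value of the high phenotype for gene morphology aa, and $E_{32}$ the expected value of the low phenotype for gene morphology aa.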
Calculating the chi-square statistic:
$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
wherein m represents the number of gene locus morphologies and n represents the number of phenotype categories of the gene samples; the chi-square table is queried to obtain the probability $P_k$ that the chi-square hypothesis holds, and the significance values of all gene loci are calculated from the probabilities $P_k$.
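For the 3 × 2 table used in this embodiment, the table lookup has a closed form. The derivation below relies on a standard property of the chi-square distribution rather than on anything stated in the patent, and it assumes the usual degrees of freedom $(m-1)(n-1)$ for an independence test:

```latex
\nu = (m-1)(n-1) = (3-1)(2-1) = 2, \qquad
P_k = \Pr\!\left(\chi^2_{\nu=2} \ge \chi^2\right) = e^{-\chi^2/2}, \qquad
-\log_{10} P_k = \frac{\chi^2}{2\ln 10} \approx 0.217\,\chi^2
```

As a worked example, a statistic of $\chi^2 = 20$ gives $P_k = e^{-10} \approx 4.5 \times 10^{-5}$ and a significance value of about 4.34, so under this assumption the significance value grows linearly with the chi-square statistic.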
Step three: scaling the encoded gene loci by the significance values to obtain the gene enhancement data corresponding to the gene samples;
In the embodiment of the invention, the significance values of all K gene loci for the phenotype are calculated in step two, and each gene locus of each gene sample is one-hot encoded to obtain the gene locus code $x_k$, which gives the three gene morphologies equal weight. For example, the gene locus morphology AA is encoded as [1,0,0], Aa as [0,1,0], aa as [0,0,1], and a deletion as [0,0,0]. The encoded gene data is then scaled: the code $x_k$ of each gene locus is multiplied by the significance value $-\log_{10} P_k$ of the corresponding gene, giving the gene enhancement data corresponding to the gene sample:
$$X_k = -\log_{10} P_k \cdot x_k$$
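The encoding and scaling of step three can be sketched as follows. The names `ONE_HOT`, `MISSING` and `enhance_sample` are hypothetical, and the sketch assumes every $P_k$ is strictly positive; a probability that has underflowed to zero would need to be clamped before taking the logarithm.

```python
import math

# One-hot codes from the embodiment; a deletion is the all-zero vector.
ONE_HOT = {"AA": [1, 0, 0], "Aa": [0, 1, 0], "aa": [0, 0, 1]}
MISSING = [0, 0, 0]

def enhance_sample(genotypes, p_values):
    """Return X_k = -log10(P_k) * x_k for every locus k of one sample."""
    enhanced = []
    for genotype, p in zip(genotypes, p_values):
        weight = -math.log10(p)  # significance value of locus k (requires p > 0)
        code = ONE_HOT.get(genotype, MISSING)
        enhanced.append([weight * bit for bit in code])
    return enhanced

# Hypothetical sample with three loci; the third genotype is missing.
X = enhance_sample(["AA", "Aa", None], [0.01, 0.1, 0.001])
```

A locus with a small $P_k$ receives a large weight, while a missing genotype stays all-zero regardless of its significance value.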
step four: and constructing a neural network model, and performing phenotype prediction training through the gene enhancement data set to obtain a trained model of the phenotype predicted by the gene enhancement data.
The gene enhancement data set X is divided into a training set and a test set, and the training set is input into the neural network model for training. First, the amount of data input to the network at a time, the batch size, is set; the input dimension is batch size × m × K, where m represents the number of gene locus morphologies and K represents the sequence length. The neural network extracts features from the gene enhancement data, and a fully connected layer connects the features to output a predicted phenotype value. The real and predicted phenotype values are input into the loss network for loss calculation, the resulting loss is back-propagated, and the corresponding parameters are updated. After repeated iterative updates the loss value converges and iteration stops, giving the trained model for predicting phenotype from gene enhancement data.
In the embodiment of the invention, a neural network model is constructed: a convolutional neural network (CNN) for feature extraction is combined with a fully connected neural network, and L1 loss is used as the loss network of the model. The parameters of the whole neural network are initialized, including the conditions for stopping iteration and the hyperparameters. The gene enhancement data set X obtained in step three is divided into a training set and a test set at a ratio of 7:3, and the training set is input into the deep learning network for training. First, the amount of data input to the network at a time, the batch size, is set. The convolutional neural network extracts features from the gene enhancement data, and a fully connected layer connects the features to output a predicted phenotype value. The real and predicted phenotype values are input into the loss network for loss calculation, the resulting loss is back-propagated, and the corresponding network parameters are updated. One pass over all the data counts as one training iteration; the number of iterations may be set to, for example, 200 or 300, the main requirement being that the loss value converges. Iteration stops once the loss value converges or the stopping condition is reached, giving the trained model for predicting phenotype from gene enhancement data.
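Two pieces of this training setup are simple enough to sketch without a deep learning framework: the 7:3 split and the L1 loss. The function names `split_7_3` and `l1_loss`, the shuffling before the split and the fixed seed are assumptions for illustration; the source does not say whether the data is shuffled or whether the loss is averaged, and the CNN itself is not shown.

```python
import random

def split_7_3(dataset, seed=0):
    """Shuffle a list of (X, y) pairs and split it 70% train / 30% test."""
    items = list(dataset)
    random.Random(seed).shuffle(items)  # assumption: shuffle before splitting
    cut = len(items) * 7 // 10
    return items[:cut], items[cut:]

def l1_loss(predicted, actual):
    """Mean absolute error between predicted and real phenotype values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical data: ten (sample id, phenotype) pairs.
train, test = split_7_3([(i, float(i)) for i in range(10)])
loss = l1_loss([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])  # (0 + 1 + 2) / 3
```

In a framework such as PyTorch the same loss is available as a built-in module, and the split would normally be done once so that the test set stays fixed across training runs.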
Step five: predicting the phenotype of a gene sample with the trained model for predicting phenotype from gene enhancement data.
Corresponding to the foregoing embodiments of the phenotype intelligent prediction method based on gene significance enhancement, the invention also provides embodiments of a phenotype intelligent prediction device based on gene significance enhancement.
Referring to fig. 3, the phenotype intelligent prediction device based on gene significance enhancement provided by the embodiment of the invention comprises a memory and one or more processors, wherein executable code is stored in the memory, and the one or more processors, when executing the executable code, implement the phenotype intelligent prediction method based on gene significance enhancement of the above embodiment.
The embodiment of the phenotype intelligent prediction device based on gene significance enhancement can be applied to any device with data processing capability, such as a computer. The device embodiment may be implemented in software, in hardware, or in a combination of hardware and software. Taking software implementation as an example, the device in the logical sense is formed when the processor of the device with data processing capability reads the corresponding computer program instructions from nonvolatile memory into memory and runs them. At the hardware level, fig. 3 shows the hardware structure of the device with data processing capability on which the phenotype intelligent prediction device based on gene significance enhancement is located; in addition to the processor, memory, network interface and nonvolatile memory shown in fig. 3, that device generally includes other hardware according to its actual function, which is not described here.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
Since the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art can understand and implement the invention without creative effort.
The embodiment of the invention also provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the phenotype intelligent prediction method based on gene significance enhancement in the above embodiment.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in any of the previous embodiments. The computer readable storage medium may also be an external storage device of such a device, for example a plug-in hard disk, a Smart Media Card (SMC), an SD card or a flash memory card (Flash Card) provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device with data processing capability. The computer readable storage medium is used for storing the computer program and the other programs and data required by the device with data processing capability, and may also be used for temporarily storing data that has been output or is to be output.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present invention.

Claims (7)

1. The phenotype intelligent prediction training method based on gene significance enhancement is characterized by comprising the following steps of:
step one: obtaining a phenotype value and a corresponding gene sequence of a gene sample, wherein the gene sequence comprises a group of gene loci;
step two: calculating a phenotype mean from the phenotype values, classifying the phenotypes by the phenotype mean, and constructing an actual distribution list of gene morphology and phenotype category; obtaining an expected distribution list of gene morphology and phenotype category based on the hypothesis that gene morphology is uncorrelated with phenotype category; calculating the chi-square statistic from the actual distribution list and the expected distribution list of the samples, obtaining the probability value that the hypothesis holds by querying the chi-square table, and calculating the significance value of the gene locus for the phenotype based on the probability value; this comprises the following steps:
step 2.1: letting k be the gene locus whose significance value is to be calculated, stating the chi-square hypothesis $H_{0k}$: gene locus k has no significant relationship with phenotype y;
step 2.2: calculating the mean of the N gene sample phenotypes, classifying all gene samples according to the mean, and, combining the multiple morphologies of the gene locus, constructing an actual distribution list of gene morphology and phenotype category to obtain the actual distribution $O_{mn}$ of each gene morphology under the different phenotype categories; the gene locus has three morphologies, AA, Aa and aa, and deletions are not counted; all gene samples are classified according to the mean, samples greater than or equal to the mean forming one class and samples less than the mean forming the other class;
step 2.3: under the chi-square hypothesis $H_{0k}$ that gene morphology has no significant relationship with phenotype, constructing an expected distribution list of gene morphology and phenotype to obtain the expected distribution $E_{mn}$ of each gene morphology under the different phenotype categories; calculating the chi-square statistic:
$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
wherein m represents the number of gene locus morphologies and n represents the number of phenotype categories of the gene samples; the chi-square table is queried with the chi-square statistic to obtain the probability $P_k$ that the chi-square hypothesis holds, and the significance values of all gene loci are calculated from the probabilities $P_k$;
step three: scaling the encoded gene loci by the significance values to obtain the gene enhancement data corresponding to the gene samples; the gene data is scaled by multiplying the code of each gene locus by the significance value of the corresponding gene, giving the gene enhancement data corresponding to the gene sample:
$$X_k = -\log_{10} P_k \cdot x_k$$
wherein $x_k$ represents the encoded gene locus, $P_k$ represents the probability that the chi-square hypothesis holds, and $-\log_{10} P_k$ represents the significance value of the gene;
step four: and constructing a neural network model, and performing phenotype prediction training through the gene enhancement data set to obtain a trained model of the phenotype predicted by the gene enhancement data.
2. The phenotype intelligent prediction training method based on gene significance enhancement according to claim 1, wherein: in step 2.2, all the gene samples are classified into two categories according to the mean value:
Figure FDA0004239156990000021
wherein
Figure FDA0004239156990000028
represents the mean value; samples whose phenotype value is greater than or equal to the mean form one class, HN samples in total; samples whose phenotype value is less than the mean form the other class, LN samples in total;
in step 2.3, under the chi-square null hypothesis $H_{0k}$ that gene morphology has no significant relationship with phenotype:
$$O_{11}:O_{12} \approx O_{21}:O_{22} \approx O_{31}:O_{32} \approx HN:LN$$
the expected distribution of gene morphology and phenotype is thus obtained:
Figure FDA0004239156990000022
Figure FDA0004239156990000023
Figure FDA0004239156990000024
Figure FDA0004239156990000025
Figure FDA0004239156990000026
Figure FDA0004239156990000027
wherein $O_{11}$, $E_{11}$ represent the actual and expected values of the high phenotype for gene morphology AA; $O_{12}$, $E_{12}$ represent the actual and expected values of the low phenotype for gene morphology AA; $O_{21}$, $E_{21}$ represent the actual and expected values of the high phenotype for gene morphology Aa; $O_{22}$, $E_{22}$ represent the actual and expected values of the low phenotype for gene morphology Aa; $O_{31}$, $E_{31}$ represent the actual and expected values of the high phenotype for gene morphology aa; and $O_{32}$, $E_{32}$ represent the actual and expected values of the low phenotype for gene morphology aa.
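One reading of the expected values that is consistent with the ratio HN:LN stated in this claim is given below. It assumes that the total count of each gene morphology is preserved between the actual and expected tables; that constraint is an assumption of this sketch and is not spelled out in the claim text. The numbers in the worked example are invented for illustration.

```latex
% Assumed form: row m keeps its total O_{m1} + O_{m2} and is split as HN : LN.
E_{m1} = (O_{m1} + O_{m2})\,\frac{HN}{HN + LN}, \qquad
E_{m2} = (O_{m1} + O_{m2})\,\frac{LN}{HN + LN}, \qquad m \in \{1, 2, 3\}

% Worked example with made-up counts: HN = 60, LN = 40, O_{11} = 30, O_{12} = 10.
E_{11} = 40 \cdot \tfrac{60}{100} = 24, \qquad
E_{12} = 40 \cdot \tfrac{40}{100} = 16

% Contribution of the AA row to the chi-square statistic:
\frac{(30 - 24)^2}{24} + \frac{(10 - 16)^2}{16} = 1.5 + 2.25 = 3.75
```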
3. The phenotype intelligent prediction training method based on gene significance enhancement according to claim 1, wherein: the gene loci are encoded using one-hot encoding.
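A minimal sketch of one-hot encoding for the three locus forms follows. The position assigned to each form and the all-zero code for missing loci are assumptions made for illustration; the claim only states that one-hot encoding is used.

```python
# Assumed mapping: the position of the 1 for each form is an arbitrary choice.
ONE_HOT = {
    "AA": [1, 0, 0],
    "Aa": [0, 1, 0],
    "aa": [0, 0, 1],
}
MISSING = [0, 0, 0]  # assumption: a missing locus gets an all-zero code


def encode_sample(genotypes):
    """Encode a sequence of K loci as K one-hot vectors of length 3."""
    return [list(ONE_HOT.get(g, MISSING)) for g in genotypes]
```

For example, `encode_sample(["AA", "aa"])` yields one length-3 vector per locus; transposing the result gives the m by K layout that claim 4 describes for the network input.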
4. The phenotype intelligent prediction training method based on gene significance enhancement according to claim 1, wherein: in step four, the gene enhancement data set X is divided into a training set and a test set, and the training set is input into the neural network model for training; first, the number of samples input to the network at a time is set as the batch size, so that the input dimension is batch size × m × K, where m represents the number of gene locus forms and K represents the sequence length; features of the gene enhancement data are extracted by the neural network and connected through a fully connected layer to output a predicted phenotype value; the actual phenotype value and the predicted phenotype value are compared by inputting them into the loss network for loss calculation, the obtained loss value is back-propagated, and the corresponding parameters are updated; after repeated iterative updating, the loss value converges and iteration stops, yielding the trained model for predicting phenotype from gene enhancement data.
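The training loop in this claim can be illustrated with a deliberately small stand-in. The sketch below replaces the neural network with a single fully connected layer on the flattened m × K input and uses mean squared error as the loss; the function names, learning rate, tolerance and toy data are all assumptions, and a real model would add a feature extractor and a held-out test set.

```python
import random


def predict(weights, bias, x):
    """Fully connected layer: weighted sum of the flattened input plus bias."""
    return bias + sum(w * v for w, v in zip(weights, x))


def train(samples, targets, batch_size=4, lr=0.05, tol=1e-6, max_epochs=500, seed=0):
    """Mini-batch gradient descent; stops once the epoch loss has converged."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    order = list(range(len(samples)))
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad_w, grad_b = [0.0] * n_features, 0.0
            for i in batch:
                # Gradient of the squared error, averaged over the batch.
                err = predict(weights, bias, samples[i]) - targets[i]
                for j, v in enumerate(samples[i]):
                    grad_w[j] += 2.0 * err * v / len(batch)
                grad_b += 2.0 * err / len(batch)
            weights = [w - lr * g for w, g in zip(weights, grad_w)]
            bias -= lr * grad_b
        # Mean squared error over the whole training set after this epoch.
        loss = sum((predict(weights, bias, s) - t) ** 2
                   for s, t in zip(samples, targets)) / len(samples)
        if abs(prev_loss - loss) < tol:  # loss has converged: stop iterating
            break
        prev_loss = loss
    return weights, bias, loss


# Toy data: one locus (K = 1), one-hot code scaled by a made-up significance of 1.7.
samples = [[1.7, 0.0, 0.0]] * 4 + [[0.0, 0.0, 1.7]] * 4
targets = [10.0] * 4 + [0.0] * 4
weights, bias, final_loss = train(samples, targets)
```

The convergence test on the change in epoch loss is one simple way to realise "the loss value converges and iteration stops"; a validation-based early stop would be an equally reasonable choice.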
5. A phenotype intelligent prediction training device based on gene significance enhancement, characterized in that: it comprises a memory and one or more processors, the memory storing executable code which, when executed by the one or more processors, implements the phenotype intelligent prediction training method based on gene significance enhancement according to any one of claims 1-4.
6. A phenotype intelligent prediction method based on gene significance enhancement, characterized in that: the phenotype of a gene sample is predicted by the model for predicting phenotype from gene enhancement data that is trained by the phenotype intelligent prediction training method based on gene significance enhancement according to claim 1.
7. A phenotype intelligent prediction device based on gene significance enhancement, characterized in that: it comprises a memory and one or more processors, the memory storing executable code which, when executed by the one or more processors, implements the phenotype intelligent prediction method based on gene significance enhancement according to claim 6.
CN202310202392.9A 2023-03-06 2023-03-06 Phenotype intelligent prediction and training method and device based on gene significance enhancement Active CN116072214B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310202392.9A CN116072214B (en) 2023-03-06 2023-03-06 Phenotype intelligent prediction and training method and device based on gene significance enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310202392.9A CN116072214B (en) 2023-03-06 2023-03-06 Phenotype intelligent prediction and training method and device based on gene significance enhancement

Publications (2)

Publication Number Publication Date
CN116072214A CN116072214A (en) 2023-05-05
CN116072214B true CN116072214B (en) 2023-07-11

Family

ID=86182149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310202392.9A Active CN116072214B (en) 2023-03-06 2023-03-06 Phenotype intelligent prediction and training method and device based on gene significance enhancement

Country Status (1)

Country Link
CN (1) CN116072214B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256293A (en) * 2018-02-09 2018-07-06 哈尔滨工业大学深圳研究生院 A kind of statistical method and system of the disease association assortment of genes
CN115148278A (en) * 2022-07-21 2022-10-04 平安科技(深圳)有限公司 Training method and device of gene sequencing model, electronic equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060195266A1 (en) * 2005-02-25 2006-08-31 Yeatman Timothy J Methods for predicting cancer outcome and gene signatures for use therein
CN101210266A (en) * 2006-12-30 2008-07-02 苏州市长三角系统生物交叉科学研究院有限公司 Measuring method for relativity of interaction and genetic character between genome genetic markers
GB201408687D0 (en) * 2014-05-16 2014-07-02 Univ Leuven Kath Method for predicting a phenotype from a genotype
CN105936907B (en) * 2016-04-27 2017-12-12 湖南杂交水稻研究中心 A kind of breeding method for reducing rice grain cadmium content
CN110400597A (en) * 2018-04-23 2019-11-01 成都二十三魔方生物科技有限公司 A kind of genetype for predicting method based on deep learning
CN108959848A (en) * 2018-05-30 2018-12-07 广州普世医学科技有限公司 Based on genetic mutation and the matched hereditary disease forecasting system of disease phenotype auto-associating
CN109182538B (en) * 2018-09-29 2022-01-04 南京农业大学 Method for genotyping and analyzing key SNPs sites rs88640083 and 2b-RAD of dairy cow mastitis
AU2019370896A1 (en) * 2018-10-31 2021-06-17 Ancestry.Com Dna, Llc Estimation of phenotypes using DNA, pedigree, and historical data
CN113502293B (en) * 2021-08-25 2022-07-22 湖南工业大学 Camellia oleifera self-incompatibility related gene, SNP molecular marker and application
CN114373547A (en) * 2022-01-11 2022-04-19 平安科技(深圳)有限公司 Method and system for predicting disease risk
CN115547408A (en) * 2022-07-15 2022-12-30 宋炜宸 Method and equipment for predicting individual phenotype based on human whole genome genotype
CN115691661A (en) * 2022-09-26 2023-02-03 之江实验室 Gene coding breeding prediction method and device based on graph clustering
CN115331732B (en) * 2022-10-11 2023-03-28 之江实验室 Gene phenotype training and predicting method and device based on graph neural network

Also Published As

Publication number Publication date
CN116072214A (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN111832101B (en) Construction method of cement strength prediction model and cement strength prediction method
CN105488528B (en) Neural network image classification method based on improving expert inquiry method
CN110766044B (en) Neural network training method based on Gaussian process prior guidance
Zhu et al. DNA sequence compression using adaptive particle swarm optimization-based memetic algorithm
CN111898689B (en) Image classification method based on neural network architecture search
Anderson Large-scale parentage inference with SNPs: an efficient algorithm for statistical confidence of parent pair allocations
CN111985310B (en) Training method of deep convolutional neural network for face recognition
WO2023217290A1 (en) Genophenotypic prediction based on graph neural network
CN112580259B (en) Intelligent mine automatic ore blending method and system based on genetic algorithm
WO2023124342A1 (en) Low-cost automatic neural architecture search method for image classification
CN109214444B (en) Game anti-addiction determination system and method based on twin neural network and GMM
CN103914527B (en) Graphic image recognition and matching method based on genetic programming algorithms of novel coding modes
CN114118369A (en) Image classification convolution neural network design method based on group intelligent optimization
CN116401555A (en) Method, system and storage medium for constructing double-cell recognition model
CN114496069A (en) Method for predicting off-target of CIRPCAs 9 system based on Transformer architecture
CN107240100B (en) Image segmentation method and system based on genetic algorithm
CN116072214B (en) Phenotype intelligent prediction and training method and device based on gene significance enhancement
CN108509764A (en) A kind of extinct plants and animal pedigree evolution analysis method based on genetic property yojan
CN115908909A (en) Evolutionary neural architecture searching method and system based on Bayes convolutional neural network
CN110705704A (en) Neural network self-organizing genetic evolution algorithm based on correlation analysis
CN115063374A (en) Model training method, face image quality scoring method, electronic device and storage medium
CN110533186B (en) Method, device, equipment and readable storage medium for evaluating crowdsourcing pricing system
CN113762591A (en) Short-term electric quantity prediction method and system based on GRU and multi-core SVM counterstudy
Shuai et al. A Self-adaptive neuroevolution approach to constructing Deep Neural Network architectures across different types
CN107301040B (en) Software product line product derivation method based on subtree decomposition

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant