WO2024065070A1 - Gene coding breeding prediction method and device based on graph clustering - Google Patents

Gene coding breeding prediction method and device based on graph clustering

Info

Publication number
WO2024065070A1
Authority
WO
WIPO (PCT)
Prior art keywords
gene
clustering
information
breeding
genetic
Prior art date
Application number
PCT/CN2022/121174
Other languages
English (en)
French (fr)
Inventor
吕劲松
陈红阳
王浩
冯献忠
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室
Priority to PCT/CN2022/121174 (WO2024065070A1)
Priority to US18/454,036 (US20240119314A1)
Publication of WO2024065070A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B10/00 ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Definitions

  • the present invention relates to the field of gene breeding prediction for precision molecular breeding of crops, and in particular to a gene coding breeding prediction method and device based on graph clustering.
  • the current gene prediction methods for crop phenotypes mainly include traditional statistical analysis methods such as Bayesian methods, linear regression, and ridge regression.
  • the deep learning methods that have achieved great success in the fields of speech, images, and natural language cannot achieve good results because sample sizes in crop breeding are small.
  • the dimension of genetic data is very high, and traditional statistical analysis methods also find it difficult to use feature selection to quickly extract effective features from such high-dimensional genetic feature data. The existing popular methods therefore cannot cope with the high-dimensional, small-sample problem of crop molecular breeding.
  • the purpose of the present invention is to address the deficiencies of the prior art and propose a method and device for genetic coding breeding prediction based on graph clustering.
  • the method and device use the gene interaction network contained in the gene map to extract co-regulated genome clustering information through graph clustering, and propose a new gene cluster encoding that integrates gene allele information with gene map clustering information. The weight sharing of a deep convolutional neural network is then used to effectively extract the regulatory gene features that control biological phenotypic output. This solves the problem that the input encoding layer of classical models insufficiently encodes the gene interactions within the gene map, and ensures the accuracy of genetic breeding prediction of biological phenotypes.
  • the present invention provides the following technical solutions:
  • the present invention discloses a gene coding breeding prediction method based on graph clustering, comprising the following steps:
  • the gene map is clustered and solved to obtain the number of co-regulated genomes and the genome cluster number of each gene;
  • the allele information and genome cluster number information corresponding to each gene in the genotype data are fused and concatenated to obtain the gene cluster code of the sample;
  • the gene clustering code and gene position information are input into the gene coding breeding prediction model to obtain the biological phenotypic information of the offspring to be predicted; based on the predicted biological phenotypic information of the offspring, a high-quality seed collection is screened.
  • the gene-encoded breeding prediction model is obtained by training based on the collected data set, and each sample data of the data set includes the gene clustering code, gene location information and biological phenotype information of the sample.
  • the biological phenotypic information includes measurable information such as quantity, quality, percentage, classification, etc. related to the target phenotype
  • the allele information to be encoded includes SNP alleles, such as homozygous 0/0, 1/1 and heterozygous 0/1, etc.
  • the gene map is clustered to obtain the number of co-regulated genomes and the genome cluster number information of each gene, as follows:
  • the number of co-regulated genomes, i.e. the number of gene clusters, is estimated based on the spatial distribution characteristics of the gene map;
  • the intra-class distance and inter-class distance are calculated for each gene to determine the cluster to which the gene belongs;
  • each gene cluster is given a unique cluster number information as the genome cluster number of each gene in the corresponding gene cluster.
  • the strength of the correlation between genes is generally obtained by calculating the similarity of the multi-sample SNP site strings of every two genes.
  • Commonly used methods include Pearson correlation coefficient, Spearman correlation coefficient, Euclidean distance, cosine similarity, Manhattan distance, Hamming distance, edit distance, etc.; the weight of the adjacent edge is generally expressed by the strength of the correlation between genes or its normalized value.
  • gene clustering methods include spatial clustering (Kmeans, etc.), density clustering (DBSCAN, etc.), hierarchical clustering (bottom-up method and top-down method), spectral clustering, etc.
  • the estimation methods for determining the number of gene clusters include statistical methods, random methods, exhaustive methods, iterative methods, etc., wherein the iterative method mainly refers to the method of determining the number of clusters by iterative clustering from bottom to top or from top to bottom in hierarchical clustering.
  • the spectral clustering method mainly utilizes the Laplacian matrix and other methods to calculate the connected components of the graph for clustering; the calculation method of the intra-class distance and the inter-class distance includes the gene similarity calculation method as preferably described above, and the intra-class and inter-class distances defined by graph connectivity and neighborhood characteristics.
  • the gene clustering number information can be given by the clustering method itself, or in a random or sequential manner.
  • the preferred method for fusing gene allele information and genome cluster number information is string concatenation.
  • the structure of the gene-encoded breeding prediction model includes modules such as a gene clustering encoding input layer, embedding layer, convolution layer, pooling layer, fully connected layer, and output layer; strategies to improve the generalization ability of the neural network include L1/L2 regularization and Dropout; and optimized learning algorithms include Adam.
  • the input layer includes the gene cluster encoding information obtained in step 4, or the gene cluster encoding information plus gene position information
  • the output layer includes a classification layer or regression layer related to the target task, or a pre-trained multi-task classification and regression layer.
  • the genetically encoded breeding prediction model is obtained through two-stage learning and training, wherein in the first stage of learning, it serves as a pre-trained twin network, accepts encoding inputs from two gene strings, and simultaneously learns differential tasks and addition tasks in the output layer; in the second stage of learning, it serves as a pre-fixed weight network layer for continued training and participates in fine-tuning learning of the target task.
  • the method for screening high-quality seed sets is to obtain the preferred seed sets and their corresponding parent combinations by setting and optimizing reasonable thresholds.
  • the present invention is a breeding prediction method based on graph clustering. It first collects the biological phenotypic information to be predicted and the allele information to be encoded required for precision molecular breeding; then determines the gene map and the weight of adjacent edges based on the strength of the correlation between genes; then clusters the gene map to obtain the number of co-regulated genomes and the genome cluster number information of each gene; then fuses the gene allele information and the genome cluster number information to obtain the gene cluster coding of the sample; finally, based on the gene cluster coding information, optionally together with additional gene position information, and the biological phenotypic information to be predicted, it constructs a deep convolutional neural network to optimize the genetic breeding prediction performance. This method makes full use of the gene interaction network contained in the gene map, extracts the co-regulated genome clustering information through graph clustering, and newly proposes a gene clustering encoding method that integrates gene allele information and gene map clustering information, as well as additional gene position
  • FIG. 1 is a flow chart of a method for predicting genetically encoded breeding based on graph clustering provided by an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of cloud-edge-end collaborative deployment of a gene-encoded breeding prediction device based on graph clustering provided by an embodiment of the present invention.
  • FIG. 3 is a flow chart of collecting gene information provided by an embodiment of the present invention.
  • FIG. 4 is an architecture diagram of a single-stage deep convolutional neural network model provided by an embodiment of the present invention.
  • FIG. 5 is an architecture diagram of a two-stage deep convolutional neural network model provided by an embodiment of the present invention.
  • FIG. 6 is a structural block diagram of a gene-encoded breeding prediction device based on graph clustering provided in an embodiment of the present invention.
  • Reference numerals: 11 - information collection module; 12 - gene clustering module; 13 - encoding pre-training module; 14 - continued training module; 15 - breeding prediction and screening module.
  • Embodiment 1:
  • a gene-encoded breeding prediction method based on graph clustering which encodes genes in combination with graph clustering, and then predicts the biological phenotypic information of the offspring to be predicted through a gene-encoded breeding prediction model; based on the predicted biological phenotypic information of the offspring, a high-quality seed collection is screened.
  • the gene-encoded breeding prediction model is obtained by training based on the collected data set, and the training method refers to Figure 1, which specifically includes:
  • S101 Collect the biological phenotypic information, genotypic data and gene location information of each sample required for precision molecular breeding, and construct a data set.
  • the execution subject of the method is a computational breeding center. Specifically, if the computational breeding center is set on a computing cloud, then the computing cloud is the execution subject of the method; if the computational breeding center is set on a computing end node, then the computing end node is the execution subject of the method.
  • the computational breeding center predicts biological phenotypes from high-dimensional genotype data, with the goal of selecting the male and female parents based on the predicted values so as to produce excellent offspring.
  • ordinary gene encoding which is mainly based on numerical values, numerical mapping or one-hot encoding, cannot effectively extract multi-neighborhood structural features in gene network maps.
  • Gene encoding based on graph clustering can effectively extract this structural information, thereby improving the ability of breeding prediction.
  • the collection of phenotypic and genotypic information required for breeding prediction includes: collecting biological tissue samples, extracting DNA/RNA, preparing samples for library construction and sequencing.
  • the strength of the correlation between genes can be obtained by calculating the similarity of the multi-SNP site strings of two genes in the genotype data, including Pearson correlation coefficient, Jaccard correlation coefficient, Spearman correlation coefficient, Euclidean distance, angle cosine similarity, Manhattan distance, Hamming distance, edit distance, Chebyshev distance, Minkowski distance and information entropy; the calculated similarity is the adjacent edge weight of the undirected graph.
  • the strength of the correlation between genes is determined by the Pearson correlation coefficient, and the calculation formula is as follows:
  • $$r(X, K) = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(K_i - \bar{K})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (K_i - \bar{K})^2}}$$
  • X and K are vector representations of two genes, and $\bar{X}$ and $\bar{K}$ are the means of their components.
  • X_i and K_i are the i-th components of the vector representations of genes X and K, respectively.
  • N is the dimension of the gene vector. The weight of the adjacent edge is the correlation between genes when it lies in the interval from 0 to 1; if the correlation falls outside this interval, the weight is taken as 0 and the two genes are considered to have no edge.
  • the correlation between each gene locus is calculated to obtain a heat map of the correlation between gene loci, that is, the gene locus adjacent edge weights and the adjacency matrix.
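The graph construction described above can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the function names `pearson` and `adjacency_matrix`, the toy genotype values, the treatment of constant vectors, the exclusion of self-loops, and the choice to treat a correlation of exactly 0 as "no edge" are all assumptions made for the example.

```python
import math

def pearson(x, k):
    """Pearson correlation coefficient between two equal-length gene vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_k = sum(k) / n
    cov = sum((a - mean_x) * (b - mean_k) for a, b in zip(x, k))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_k = sum((b - mean_k) ** 2 for b in k)
    if var_x == 0 or var_k == 0:
        return 0.0  # assumption: a constant vector yields no usable correlation
    r = cov / math.sqrt(var_x * var_k)
    return max(-1.0, min(1.0, r))  # clamp away floating-point overshoot

def adjacency_matrix(genes):
    """Build a symmetric weight matrix; genes[g] holds one value per sample."""
    g = len(genes)
    adj = [[0.0] * g for _ in range(g)]
    for a in range(g):
        for b in range(a + 1, g):
            r = pearson(genes[a], genes[b])
            # keep correlations in (0, 1]; anything else means "no edge"
            w = r if 0.0 < r <= 1.0 else 0.0
            adj[a][b] = adj[b][a] = w
    return adj

# Toy data: 3 genes observed in 5 samples, alleles mapped to 0/1/2 (assumed).
genes = [
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0],   # identical to gene 0, so fully correlated
    [2, 1, 0, 1, 2],   # mirror image of gene 0, so anti-correlated
]
adj = adjacency_matrix(genes)
```

The clamp in `pearson` matters in practice: without it, rounding can push a perfect correlation slightly above 1 and the edge would be dropped by the interval check.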
  • S103: cluster the gene map constructed in step S102 to obtain the number of co-regulated genomes and the genome cluster number information of each gene.
  • Gene clustering methods include spatial clustering, density clustering, hierarchical clustering or spectral clustering. Specifically, this step includes the following sub-steps:
  • the number of co-regulated genomes i.e., the number of gene clusters
  • the methods for estimating the number of co-regulated genomes are statistical, random, exhaustive or iterative methods, among which the iterative method mainly refers to the method of determining the number of clusters by iterative clustering from bottom to top or from top to bottom in hierarchical clustering.
  • the intra-class distance and inter-class distance are calculated for each gene to determine the cluster to which the gene belongs;
  • each gene cluster is given a unique cluster number information as the genome cluster number of each gene in the corresponding gene cluster.
  • the gene cluster number information can be given by the clustering method itself, or in a random or sequential manner.
  • a connected subgraph algorithm is applied to obtain the gene locus clustering results, and the isolated gene points are connected together and numbered sequentially to obtain the number of co-regulated genomes and the genome cluster number information of each gene.
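One way to read the connected-subgraph step is sketched here in plain Python. It is only an interpretation under stated assumptions: the function name `cluster_by_components` is hypothetical, any positive weight counts as an edge, the adjacency matrix is assumed symmetric, and "isolated gene points are connected together" is taken to mean that all isolated genes share one final cluster.

```python
from collections import deque

def cluster_by_components(adj):
    """Return (number_of_clusters, cluster_id_per_gene) for a weight matrix."""
    n = len(adj)
    labels = [None] * n
    # genes with no positive-weight edge to any other gene
    isolated = [u for u in range(n)
                if not any(adj[u][v] > 0 for v in range(n) if v != u)]
    isolated_set = set(isolated)
    next_id = 0
    for start in range(n):
        if labels[start] is not None or start in isolated_set:
            continue
        # breadth-first search labels one connected component
        labels[start] = next_id
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if v != u and adj[u][v] > 0 and labels[v] is None:
                    labels[v] = next_id
                    queue.append(v)
        next_id += 1
    if isolated:
        # assumption: all isolated genes are gathered into one extra cluster
        for u in isolated:
            labels[u] = next_id
        next_id += 1
    return next_id, labels

adj = [
    [0.0, 0.9, 0.0, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
n_clusters, cluster_ids = cluster_by_components(adj)
```

Cluster numbers are assigned sequentially in order of discovery, which matches the "numbered sequentially" wording; any other unique numbering would serve the encoding equally well.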
  • the gene cluster coding of the fused sample can be obtained by element-by-element concatenation.
  • the gene cluster coding method is as follows:
  • S_ij is the gene of the j-th component of sample i
  • SNP(S_ij) is the corresponding allele feature, that is, the original code of the gene in the genotype data
  • Group(S_ij) is the corresponding cluster number
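The element-by-element concatenation of SNP(S_ij) and Group(S_ij) might look like the following sketch. The function names `gene_cluster_code` and `build_vocabulary`, the `_` separator, and the integer vocabulary (reserving 0 for padding) are assumptions added for illustration; the source only states that the allele code and the cluster number are fused by string concatenation.

```python
def gene_cluster_code(snp_row, cluster_ids, sep="_"):
    """Concatenate each gene's allele string with its cluster number."""
    if len(snp_row) != len(cluster_ids):
        raise ValueError("one cluster id is required per gene")
    return [f"{snp}{sep}{group}" for snp, group in zip(snp_row, cluster_ids)]

def build_vocabulary(encoded_samples):
    """Map each distinct token to an integer id usable by an embedding layer."""
    vocab = {}
    for sample in encoded_samples:
        for token in sample:
            # ids start at 1; 0 is left free for padding (assumption)
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

snp_row = ["0/0", "0/1", "1/1", "0/1"]   # one sample, four genes
cluster_ids = [0, 0, 1, 2]               # cluster number of each gene
codes = gene_cluster_code(snp_row, cluster_ids)
vocab = build_vocabulary([codes])
token_ids = [vocab[t] for t in codes]
```

Note that the same allele `0/1` yields two different tokens here because the two genes belong to different clusters, which is how the encoding carries graph-neighborhood information into the input layer.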
  • the gene clustering coding feature of the sample is obtained. Based on this feature, and with the addition of gene location information and biological phenotype information to be predicted, a gene coding breeding prediction model based on a deep convolutional neural network is constructed, and Dropout and L1/L2 regularization strategy optimization is added to optimize the gene breeding prediction performance.
  • the input and encoding layer (gene cluster encoding input layer) of the deep convolutional neural network is constructed in an element-by-element splicing manner, and then the embedding layer, SpatialDropout1D, 1D convolution, 1D maximum pooling, flattening, full connection are sequentially connected, and finally the classification/regression output layer and phenotypic output related to the target task are connected.
  • Stage 1 is the pre-training stage, which constitutes the input and encoding layer of a dual-channel multi-task deep convolutional neural network by fusing gene allele information, gene cluster number information, gene position information, etc., and then connects to the deep convolutional neural network as a shared twin network, and the output layer is connected to the difference task and the addition task.
  • the input of the entire network is a left and right dual channel, that is, the input of the gene string of two different samples
  • the output is a multi-task output
  • the difference task is the positive and negative polarity judgment task of the difference between the phenotypic values corresponding to the left and right channel gene strings
  • the addition task is the regression task of the sum of the phenotypic values corresponding to the left and right channel gene strings.
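The dual-channel training data for the two pre-training tasks can be sketched as follows. The function name `make_pairs` and the dictionary keys are hypothetical, pairing every unordered combination of samples is an assumption (the source does not say how pairs are drawn), and a difference of exactly zero is arbitrarily labeled 0.

```python
from itertools import combinations

def make_pairs(codes, phenotypes):
    """Build (left, right) sample pairs with difference and addition labels."""
    pairs = []
    for i, j in combinations(range(len(codes)), 2):
        diff = phenotypes[i] - phenotypes[j]
        pairs.append({
            "left": codes[i],
            "right": codes[j],
            # difference task: sign (polarity) of the phenotype difference
            "diff_label": 1 if diff > 0 else 0,
            # addition task: regression target is the phenotype sum
            "sum_label": phenotypes[i] + phenotypes[j],
        })
    return pairs

codes = ["sample_a", "sample_b", "sample_c"]   # stand-ins for gene strings
phenotypes = [3.0, 5.0, 4.0]
pairs = make_pairs(codes, phenotypes)
```

Pairing turns M samples into on the order of M² training examples, which is one plausible reason a twin network helps in the small-sample setting the document describes.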
  • the target loss functions of the difference task and the addition task are L_- and L_+, respectively.
  • L_+ can be taken as the mean square error (MSE).
  • L_- can be taken as the cross entropy.
  • Y_i1 and Y_i2 are the actual sample label values.
  • I(X) is the indicator function: when X is true, the function value is 1; when X is false, the function value is 0.
  • M represents the number of samples.
  • σ is the Sigmoid function, and the calculation formula is as follows: $\sigma(x) = \frac{1}{1 + e^{-x}}$
  • e is the natural constant, approximately equal to 2.71828.
  • The multi-task objective loss function of stage one is as follows:
  • $L_1 = \alpha L_+ + \beta L_-$
  • L_+, L_-, and L_1 are the loss functions of the addition task, the difference task, and the total task, respectively.
  • α and β are the weight hyperparameters of the loss functions of the addition task and the difference task, respectively.
  • the parameter values can be determined by the grid search method.
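A possible reading of the stage-one objective is sketched below: mean square error on the phenotype sum, binary cross entropy on the sign of the phenotype difference, and a weighted total. The function name `stage_one_loss`, the argument names, the default weights `alpha` and `beta`, and the clipping constant `eps` are assumptions; the exact per-task formulas are not recoverable from the text, so this is an illustration rather than the patent's definition.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stage_one_loss(y1, y2, sum_pred, diff_logit, alpha=1.0, beta=1.0, eps=1e-12):
    """Return (total, addition_loss, difference_loss) over M sample pairs."""
    m = len(y1)
    # addition task: MSE between the true phenotype sum and the predicted sum
    l_plus = sum((a + b - s) ** 2 for a, b, s in zip(y1, y2, sum_pred)) / m
    # difference task: cross entropy on the indicator I(Y_i1 - Y_i2 > 0)
    l_minus = 0.0
    for a, b, z in zip(y1, y2, diff_logit):
        target = 1.0 if a - b > 0 else 0.0
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # avoid log(0)
        l_minus -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    l_minus /= m
    return alpha * l_plus + beta * l_minus, l_plus, l_minus

y1 = [3.0, 5.0]          # left-channel phenotype values
y2 = [5.0, 4.0]          # right-channel phenotype values
sum_pred = [8.0, 9.0]    # network output for the addition task
diff_logit = [0.0, 0.0]  # pre-sigmoid output for the difference task
total, l_plus, l_minus = stage_one_loss(y1, y2, sum_pred, diff_logit)
```

A grid search over the two weights, as the text suggests, would simply call this function (or its framework equivalent) for each candidate weight pair and keep the pair with the best validation score.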
  • Phase 2 is the continued training phase, which loads the shared twin network that has been pre-trained in phase 1 and fixes the network weights so that it does not participate in the network weight tuning in the continued training phase, and then connects to the classification/regression output layer and phenotypic output related to the target task, thereby constructing a deep convolutional neural network in the continued training phase.
  • the target continued training phase model is learned and tuned to obtain the target prediction model for the breeding task.
  • for the phenotypic regression prediction task, the target loss function of phase 2 is the mean square error (MSE); for the phenotypic classification prediction task, the target loss function of phase 2 is the cross entropy. Taking the regression prediction task as an example, the target loss function of phase 2 is as follows:
  • for trait classification tasks, the seed pool predicted to belong to the specified trait class is screened directly with the constructed genetic breeding prediction model, which gives the optimal parent combination; for trait regression, i.e. numerical prediction tasks, the seed pool whose predicted value reaches or exceeds the specified threshold is screened with the constructed genetic breeding prediction model, which gives the optimal parent combination.
  • the specified screening threshold can be obtained through comprehensive optimization of the screening ratio and the scale of the experimental field.
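For the regression case, threshold-based screening could be sketched like this. The rule used to derive the threshold (keep the smaller of "ratio times candidates" and the field capacity) is a deliberately simple assumption standing in for the "comprehensive optimization" mentioned in the text, and the names `screening_threshold`, `screen_seeds`, and the parent-combination labels are hypothetical.

```python
def screening_threshold(predicted, screen_ratio, field_capacity):
    """Choose a cut-off so the kept seeds fit the ratio and the field size."""
    if not predicted:
        raise ValueError("no predictions to screen")
    keep = min(int(len(predicted) * screen_ratio), field_capacity)
    keep = max(keep, 1)  # always keep at least the best candidate
    ranked = sorted(predicted, reverse=True)
    return ranked[keep - 1]

def screen_seeds(predictions, threshold):
    """Return the candidates whose predicted value reaches the threshold."""
    return [seed for seed, value in predictions.items() if value >= threshold]

# Predicted phenotype values per candidate parent combination (toy numbers).
predictions = {"P1xP2": 3.1, "P1xP3": 4.7, "P2xP3": 4.2, "P2xP4": 2.5}
threshold = screening_threshold(list(predictions.values()),
                                screen_ratio=0.5, field_capacity=3)
selected = screen_seeds(predictions, threshold)
```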
  • the existing gene input feature encoding methods usually consider numerical format or one-hot format encoding.
  • the numerical format usually maps 0/0, 0/1 and 1/1 to -1, 0 and 1, or to 0, 1 and 2, which cannot express complex gene network features; and the one-hot encoding only discretizes the original features and does not increase the information content of the features.
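For comparison, the two baseline encodings mentioned above can be written in a few lines. The mapping of 0/0, 0/1 and 1/1 to 0, 1 and 2 follows the text; the function names and the column order of the one-hot vectors are assumptions.

```python
GENOTYPES = ["0/0", "0/1", "1/1"]
NUMERIC = {g: i for i, g in enumerate(GENOTYPES)}  # 0/0 -> 0, 0/1 -> 1, 1/1 -> 2

def numeric_encode(snps):
    """Numerical format: one integer per gene."""
    return [NUMERIC[s] for s in snps]

def one_hot_encode(snps):
    """One-hot format: one indicator vector per gene."""
    return [[1 if s == g else 0 for g in GENOTYPES] for s in snps]

snps = ["0/0", "0/1", "1/1", "0/1"]
numeric = numeric_encode(snps)
one_hot = one_hot_encode(snps)
```

In both encodings the second and fourth genes are indistinguishable, since neither format records which co-regulated group a gene belongs to.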
  • the gene encoding based on graph clustering combines the original allele features and the co-regulatory genome features, thereby capturing the upper-level graph neighborhood structure features, which greatly improves the prediction ability of the model based on deep neural network.
  • This method makes full use of the gene interaction relationship network contained in the gene map, extracts the co-regulatory genome clustering information through graph clustering, and newly proposes a gene clustering encoding method that integrates gene allele information and gene map clustering information, and uses the weight sharing of the dual-channel multi-task deep convolutional neural network to effectively extract the regulatory gene features used to control the biological phenotype output, solves the problem that the input encoding layer of the classic model does not adequately encode the gene interaction relationship between gene maps, and ensures the accuracy of genetic breeding prediction of biological phenotypes.
  • Embodiment 2:
  • a gene coding breeding prediction device based on graph clustering referring to Figures 2 and 6, the device comprises:
  • the information collection module 11 is used to collect the phenotypic information and genotypic data of the organisms to be predicted required for precision molecular breeding. This module is mainly completed by the intelligent terminal.
  • the gene clustering module 12 is used to determine the gene map and the weight of the adjacent edges based on the strength of the correlation between genes, cluster the gene map, and obtain the number of co-regulated genomes and the genome cluster number information of each gene. This module is completed by cloud-edge collaborative computing.
  • the encoding pre-training module 13 is used to fuse gene allele information, genome cluster number information and gene position information, and build a dual-channel multi-task deep convolutional neural network based on the fused encoding information, and simultaneously perform learning and training of difference tasks and addition tasks and network weight tuning for dual gene string enhanced data to obtain a shared twin network and its weights for subsequent training.
  • This module is completed by cloud-edge-end collaborative computing.
  • the continued training module 14 is used to load the shared twin network and its weights obtained by the encoding pre-training module and solidify the weights, and to continue training and tuning for the classification/regression tasks related to the breeding target traits to obtain a prediction model related to the breeding target traits.
  • This module is completed by cloud-edge-end collaborative computing.
  • the breeding prediction and screening module 15 is used to screen high-quality seed sets based on the genetic breeding prediction model constructed by the continued training module, that is, to select the optimal parent combination. This module is completed by cloud-edge-end collaborative computing.
  • the training set and prediction set data required for gene breeding prediction are first collected, wherein the training set data includes the biological phenotypic information to be predicted and the allele information to be encoded required for precision molecular breeding, and the prediction set data includes the allele information of the seeds to be predicted; then, the correlation between different gene loci is calculated, and the gene adjacency matrix and map are constructed based on the strength of the correlation, and the gene map is clustered and solved to obtain the number of co-regulated genomes and the genome clustering number information of each gene; then, the gene allele information, genome clustering number information and gene position information are integrated to construct a dual-channel multi-task deep convolutional neural network for pre-training to obtain a shared twin network and its weights; then, the shared twin network and its weights are loaded and the weights are solidified, and continued training and optimization are performed for classification/regression tasks related to breeding target traits to obtain a prediction model related to breeding target traits; finally
  • the device can make full use of the gene interaction relationship network contained in the gene map, extract the co-regulatory genome clustering information through graph clustering, and newly propose a gene clustering encoding method that integrates gene allele information and gene map clustering information. It also uses the weight sharing of a dual-channel multi-task deep convolutional neural network to effectively extract regulatory gene features used to control biological phenotypic outputs, solve the problem of insufficient encoding of gene interaction relationships between gene maps in the input coding layer of the classical model, and ensure the accuracy of genetic breeding predictions of biological phenotypes.
  • the computer program product of the graph clustering-based genetically encoded breeding prediction method and device provided in the embodiments of the present invention includes a computer-readable storage medium storing program code.
  • the instructions included in the program code can be used to execute the methods described in the previous method embodiments. The specific implementation can be found in the method embodiments, which will not be repeated here.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium, including several instructions for a computer device (which can be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in each embodiment of the present invention.
  • the aforementioned storage media include: USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Physiology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Ecology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

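A gene coding breeding prediction method and device based on graph clustering. A gene graph is constructed according to the strength of the correlations between genes; the gene graph is clustered to obtain the number of co-regulated gene groups and the gene-group cluster number of each gene; the allele information of each gene is fused with its gene-group cluster number to obtain the gene cluster encoding of a sample; and a deep convolutional neural network is built on the gene cluster encoding and the biological phenotype information to be predicted, so as to optimize breeding prediction performance. By exploiting the gene-gene interaction network contained in the gene graph, the method can effectively extract the regulatory gene features that control biological phenotype output, addresses the insufficient encoding of gene-gene interactions in the input encoding layer of classical models, and safeguards the accuracy of phenotype prediction in gene breeding, thereby improving the speed, efficiency and quality of gene breeding, and yield in particular.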

Description

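Gene coding breeding prediction method and device based on graph clustering
Technical Field
The present invention relates to the field of gene breeding prediction for precision molecular breeding of crops, and in particular to a gene coding breeding prediction method and device based on graph clustering.
Background
With the development of gene sequencing technology, laboratory technicians can obtain large-scale, multi-sample genetic data of value for data mining through wet-lab procedures such as sample collection, library preparation and sequencing, using sequencing, PCR (polymerase chain reaction), gene chips, optical mapping and similar techniques. After whole-genome sequencing, however, the accuracy of genomic prediction models remains very low. Taking soybean as an example, soybean has about 60,000 genes, and 80 million mutations have been found across 40,000 pairs of genes. Predicting phenotype from genotype permits only qualitative description, not quantitative analysis. This greatly limits improvements in the quantity, speed and quality of crop breeding, and in yield in particular.
To improve the accuracy of molecular breeding, current gene-based prediction methods for crop phenotypes mainly comprise traditional statistical methods such as Bayesian methods, linear regression and ridge regression. Deep learning methods, which have been highly successful in speech, image and natural language processing, do not perform well here because few samples are available in crop breeding. On the other hand, genetic data is very high-dimensional, and traditional statistical methods also struggle to use feature selection to quickly extract effective features from such high-dimensional data. Existing popular methods therefore cannot handle the high-dimensional, small-sample problem of crop molecular breeding.
To address this high-dimensional, small-sample problem, a novel gene breeding prediction method is needed that simultaneously solves the feature selection and extraction problem for high-dimensional features and the insufficient encoding of gene graph features in complex models.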
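Summary of the Invention
In view of the shortcomings of the prior art, the object of the present invention is to provide a gene coding breeding prediction method and device based on graph clustering. The method exploits the gene-gene interaction network contained in the gene graph, extracts co-regulated gene group cluster information through graph clustering, introduces a new gene cluster encoding that fuses allele information with gene graph cluster information, and uses the weight sharing of a deep convolutional neural network to effectively extract the regulatory gene features that control biological phenotype output. This addresses the insufficient encoding of gene-gene interactions in the input encoding layer of classical models and safeguards the accuracy of phenotype prediction in gene breeding.
To achieve the above object, the present invention provides the following technical solution:
The present invention discloses a gene coding breeding prediction method based on graph clustering, comprising the following steps:
obtaining genotype data and gene position information of the offspring to be predicted;
constructing an undirected graph as a gene graph based on the strength of the correlations between genes in the genotype data;
clustering the gene graph to obtain the number of co-regulated gene groups and the gene-group cluster number of each gene;
fusing the allele information of each gene in the genotype data with its gene-group cluster number, and concatenating them to obtain the gene cluster encoding of the sample;
inputting the gene cluster encoding and the gene position information into a gene coding breeding prediction model to obtain the biological phenotype information of the offspring to be predicted; and screening a set of high-quality seeds based on the predicted biological phenotype information of the offspring.
The gene coding breeding prediction model is obtained by training on a collected data set, each sample of which comprises the gene cluster encoding, gene position information and biological phenotype information of the sample.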
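Preferably, the biological phenotype information comprises measurable information related to the target phenotype, such as quantity, quality, percentage or class, and the allele information to be encoded comprises SNP alleles, such as homozygous 0/0 and 1/1 and heterozygous 0/1.
Preferably, clustering the gene graph to obtain the number of co-regulated gene groups and the gene-group cluster number of each gene specifically comprises:
estimating the number of co-regulated gene groups, i.e. the number of gene clusters, based on the spatial distribution features of the gene graph;
according to the estimated number of gene clusters, computing the intra-cluster and inter-cluster distances for each gene and determining the cluster to which the gene belongs;
after clustering, assigning each gene cluster a unique cluster number, which serves as the gene-group cluster number of every gene in that cluster.
Preferably, the strength of the correlation between genes is generally obtained by computing the similarity of the multi-sample SNP site strings of each pair of genes; common measures include the Pearson correlation coefficient, Spearman correlation coefficient, Euclidean distance, cosine similarity, Manhattan distance, Hamming distance and edit distance. The weight of an adjacency edge is generally given by the correlation strength or its normalized value.
Preferably, the gene clustering method comprises spatial clustering (K-means, etc.), density clustering (DBSCAN, etc.), hierarchical clustering (bottom-up and top-down) and spectral clustering.
Preferably, methods for estimating the number of gene clusters include statistical, random, exhaustive and iterative methods, where the iterative method mainly refers to determining the number of clusters by bottom-up or top-down iterative clustering in hierarchical clustering.
Preferably, the spectral clustering method mainly uses the Laplacian matrix or similar to compute the connected components of the graph for clustering; the intra-cluster and inter-cluster distances may be computed with the gene similarity measures listed above, or defined from graph connectivity and neighborhood features.
Preferably, the gene cluster numbers may be given by the clustering method itself, or assigned randomly or sequentially.
Preferably, the allele information and the gene-group cluster number are fused by string concatenation.
Preferably, the structure of the gene coding breeding prediction model comprises modules such as a gene cluster encoding input layer, an embedding layer, a convolutional layer, a pooling layer, a fully connected layer and an output layer, together with strategies for improving the generalization of the neural network, including L1/L2 regularization and Dropout; the optimization algorithm includes Adam and the like.
Preferably, the input layer takes the gene cluster encoding obtained in step 4, or the gene cluster encoding supplemented with gene position information; the output layer comprises a classification or regression layer for the target task, or multi-task classification and regression layers used for pre-training.
Preferably, the gene coding breeding prediction model is trained in two stages. In the first stage, it serves as a pre-trained twin network that receives the encoded inputs of two gene strings and learns a difference task and a sum task simultaneously at the output layer; in the second stage, it serves as a front-end network layer with fixed weights for continued training and takes part in fine-tuning on the target task.
Preferably, the set of high-quality seeds is screened by setting and optimizing a reasonable threshold, which yields the preferred seed set and its corresponding parental combinations.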
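Beneficial effects of the present invention: compared with the prior art, the gene coding breeding prediction method based on graph clustering of the present invention first collects the biological phenotype information to be predicted and the allele information to be encoded that are required for precision molecular breeding; then determines the gene graph and the adjacency edge weights based on the strength of the correlations between genes; then clusters the gene graph to obtain the number of co-regulated gene groups and the gene-group cluster number of each gene; then fuses the allele information with the gene-group cluster numbers to obtain the gene cluster encoding of each sample; and finally builds a deep convolutional neural network on the gene cluster encoding, optionally supplemented with gene position information, and the biological phenotype information to be predicted, so as to optimize breeding prediction performance. The method makes full use of the gene-gene interaction network contained in the gene graph, extracts co-regulated gene group cluster information through graph clustering, introduces a new gene cluster encoding that fuses allele information with gene graph cluster information and may be supplemented with gene position information, and exploits the weight sharing of the deep convolutional neural network. It can thus effectively extract the regulatory gene features that control biological phenotype output, addresses the insufficient encoding of gene-gene interactions in the input encoding layer of classical models, and safeguards the accuracy of phenotype prediction in gene breeding.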
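Brief Description of the Drawings
Fig. 1 is a flowchart of a gene coding breeding prediction method based on graph clustering according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the cloud-edge-end collaborative deployment of a gene coding breeding prediction device based on graph clustering according to an embodiment of the present invention;
Fig. 3 is a flowchart of collecting gene information according to an embodiment of the present invention;
Fig. 4 is an architecture diagram of the single-stage deep convolutional neural network model according to an embodiment of the present invention;
Fig. 5 is an architecture diagram of the two-stage deep convolutional neural network model according to an embodiment of the present invention;
Fig. 6 is a structural block diagram of the gene coding breeding prediction device based on graph clustering according to an embodiment of the present invention.
In the figures: 11 - information collection module; 12 - gene clustering module; 13 - encoding pre-training module; 14 - continued training module; 15 - breeding prediction and screening module.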
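Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood, however, that the specific embodiments described here serve only to explain the invention and are not intended to limit its scope. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the invention.
To facilitate understanding of this embodiment, the gene coding breeding prediction method based on graph clustering disclosed in the embodiments of the present invention is first described in detail.
Embodiment 1:
A gene coding breeding prediction method based on graph clustering. The method encodes genes with the aid of graph clustering and then uses a gene coding breeding prediction model to predict the biological phenotype information of the offspring to be predicted; based on the predicted phenotype information of the offspring, a set of high-quality seeds is screened. The gene coding breeding prediction model is obtained by training on a collected data set; the training method, shown in Fig. 1, specifically comprises: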
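S101. Collect the biological phenotype information, genotype data and gene position information of each sample required for precision molecular breeding, and construct a data set.
In the embodiments of the present invention, the method is executed by a computational breeding center. Specifically, if the computational breeding center is deployed on a computing cloud, the computing cloud executes the method; if it is deployed on a computing end node, the end node executes the method.
Specifically, in the embodiments of the present invention, the computational breeding center predicts biological phenotypes from high-dimensional genotype data, with the aim of screening male and female parents on the basis of the predicted values so as to produce superior offspring. Ordinary gene encodings, which are mainly based on numeric values, numeric mappings or one-hot encoding, cannot effectively extract the multi-neighborhood structural features of the gene network graph. Gene encoding based on graph clustering can effectively extract this structural information and thereby improves breeding prediction.
In an optional implementation, referring to Fig. 3, collecting the phenotype and genotype information required for breeding prediction comprises: collecting biological tissue samples, extracting DNA/RNA, preparing samples and building libraries, and sequencing.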
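S102. For the genotype data of each sample, construct an undirected graph as the gene graph based on the strength of the correlations between genes. The correlation strength can be obtained by computing the similarity of the multi-SNP site strings of each pair of genes in the genotype data, using measures such as the Pearson correlation coefficient, Jaccard coefficient, Spearman correlation coefficient, Euclidean distance, cosine similarity, Manhattan distance, Hamming distance, edit distance, Chebyshev distance, Minkowski distance and information entropy; the computed similarity is the weight of the adjacency edge in the undirected graph. In one implementation, the correlation strength is determined by the Pearson correlation coefficient, computed as follows:
$$\rho_{X,K} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(K_i - \bar{K})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (K_i - \bar{K})^2}}$$
where $X$ and $K$ are the vector representations of two genes, $X_i$ and $K_i$ are the $i$-th components of the vector representations of genes $X$ and $K$, and
$\bar{X}$ and $\bar{K}$
are the means of the corresponding gene vectors. $N$ is the dimension of the gene vectors. The adjacency edge weight takes the part of the correlation lying between 0 and 1; values outside this interval are set to 0, meaning that no edge exists.
In an optional implementation, the correlations between gene sites are computed from the gene string information of the samples in the development set using the gene correlation formula, yielding a heat map of correlations between gene sites, i.e. the adjacency edge weights and adjacency matrix of the gene sites.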
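The adjacency-matrix construction in S102 can be illustrated with a minimal pure-Python sketch. It is not taken from the patent: the function names `pearson` and `build_adjacency`, the numeric genotype mapping (0/0 to 0, 0/1 to 1, 1/1 to 2) and the handling of constant sites are assumptions made for illustration only.

```python
import math

def pearson(x, k):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_k = sum(k) / n
    cov = sum((a - mean_x) * (b - mean_k) for a, b in zip(x, k))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_k = sum((b - mean_k) ** 2 for b in k)
    if var_x == 0 or var_k == 0:
        # Assumption: a site with no variation is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_k)

def build_adjacency(genotypes):
    """Build a symmetric edge-weight matrix between gene sites.

    genotypes[s][g] is the numeric genotype of sample s at gene site g
    (assumed mapping: 0/0 -> 0, 0/1 -> 1, 1/1 -> 2).
    Correlations in (0, 1] become edge weights; everything else is 0,
    which means "no edge".
    """
    n_genes = len(genotypes[0])
    columns = [[row[g] for row in genotypes] for g in range(n_genes)]
    adj = [[0.0] * n_genes for _ in range(n_genes)]
    for a in range(n_genes):
        for b in range(a + 1, n_genes):
            r = pearson(columns[a], columns[b])
            # Clamp at 1.0 to absorb floating-point overshoot.
            weight = min(r, 1.0) if r > 0 else 0.0
            adj[a][b] = adj[b][a] = weight
    return adj

# Three samples, three gene sites: sites 0 and 1 vary together,
# site 2 varies in the opposite direction.
example = [[0, 0, 2], [1, 1, 1], [2, 2, 0]]
adjacency = build_adjacency(example)
```

In this toy input, sites 0 and 1 receive an edge of weight 1, while the negatively correlated site 2 is left unconnected, in line with the rule that correlations outside the 0 to 1 interval are set to 0.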
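S103. Cluster the gene graph constructed in step S102 to obtain the number of co-regulated gene groups and the gene-group cluster number of each gene.
After the gene graph is constructed, unsupervised clustering is performed on it. Gene clustering methods include spatial clustering, density clustering, hierarchical clustering and spectral clustering. Specifically, this step comprises the following sub-steps:
Estimate the number of co-regulated gene groups, i.e. the number of gene clusters, based on the spatial distribution features of the gene graph. The number may be estimated by a statistical, random, exhaustive or iterative method, where the iterative method mainly refers to determining the number of clusters by bottom-up or top-down iterative clustering in hierarchical clustering.
According to the estimated number of gene clusters, compute the intra-cluster and inter-cluster distances for each gene and determine the cluster to which the gene belongs.
After clustering, assign each gene cluster a unique cluster number, which serves as the gene-group cluster number of every gene in that cluster. The cluster numbers may be given by the clustering method itself, or assigned randomly or sequentially.
In an optional implementation, a connected-subgraph algorithm is applied to the gene graph corresponding to the gene-site adjacency matrix built above to obtain the gene-site clustering result; the connected components, together with isolated gene sites, are numbered sequentially, giving the number of co-regulated gene groups and the gene-group cluster number of each gene.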
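The connected-subgraph variant of S103 can be sketched as a breadth-first search over the adjacency matrix. This is a hedged illustration rather than the patent's implementation: the function name and the `min_weight` cut-off are hypothetical, and isolated sites are simply given their own cluster number in order of appearance.

```python
from collections import deque

def cluster_by_connected_components(adj, min_weight=0.0):
    """Label gene sites by connected component of the gene graph.

    adj is a symmetric matrix of edge weights; an edge exists where the
    weight exceeds min_weight (a hypothetical tuning parameter).
    Returns (number_of_clusters, labels), with clusters numbered
    sequentially from 0. Isolated sites form single-member clusters.
    """
    n = len(adj)
    labels = [-1] * n
    next_id = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to an earlier component
        labels[start] = next_id
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if labels[v] == -1 and adj[u][v] > min_weight:
                    labels[v] = next_id
                    queue.append(v)
        next_id += 1
    return next_id, labels

# Five gene sites: 0-1 strongly linked, 2-3 weakly linked, 4 isolated.
toy_adj = [
    [0.0, 0.9, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
n_groups, gene_groups = cluster_by_connected_components(toy_adj)
```

Raising `min_weight` prunes weak edges before the search, which splits loosely connected components and is one simple way to control how many co-regulated groups result.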
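S104. Fuse the allele information of each gene in the genotype data with its gene-group cluster number, and concatenate them to obtain the gene cluster encoding of the sample.
Once the gene cluster numbers are obtained, the fused gene cluster encoding of a sample is obtained by element-wise concatenation. The gene cluster encoding is defined as follows:
Figure PCTCN2022121174-appb-000003
where $S_{ij}$ is the gene at the $j$-th component of sample $i$, $\mathrm{SNP}(S_{ij})$ is the corresponding allele feature, i.e. the original encoding of that gene in the genotype data, $\mathrm{Group}(S_{ij})$ is the corresponding cluster number, and
Figure PCTCN2022121174-appb-000004
is the fusion operator, denoting string concatenation.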
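The element-wise concatenation in S104 can be sketched in a few lines of Python. The separator character, the function names and the integer vocabulary (which an embedding layer would typically need) are assumptions introduced for illustration; the patent only specifies that the allele feature and the cluster number are joined as strings.

```python
def encode_sample(snp_calls, cluster_labels, sep="_"):
    """Concatenate each allele call with the cluster number of its gene.

    snp_calls[j] is the allele string of gene j (for example "0/1");
    cluster_labels[j] is the gene-group cluster number of gene j.
    The separator is an illustrative choice, not specified by the source.
    """
    if len(snp_calls) != len(cluster_labels):
        raise ValueError("one cluster label is required per gene")
    return [f"{snp}{sep}{group}" for snp, group in zip(snp_calls, cluster_labels)]

def build_vocabulary(encoded_samples):
    """Map every distinct token to a positive integer id.

    Id 0 is left unused so it can serve as a padding or unknown token,
    a common convention assumed here for a downstream embedding layer.
    """
    vocab = {}
    for sample in encoded_samples:
        for token in sample:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab

groups = [0, 0, 1]  # cluster number of each of three genes
sample_a = encode_sample(["0/0", "0/1", "1/1"], groups)
sample_b = encode_sample(["0/0", "1/1", "1/1"], groups)
vocab = build_vocabulary([sample_a, sample_b])
ids_a = [vocab[token] for token in sample_a]
```

Because the cluster number is part of the token, the same allele call in two different co-regulated groups maps to two different embedding vectors, which is how the graph structure reaches the network input.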
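S105. Build and train the gene coding breeding prediction model based on the gene cluster encoding, gene position information and biological phenotype information, so as to optimize breeding prediction performance.
Once the samples have been encoded, the resulting gene cluster encoding features, supplemented with gene position information and the biological phenotype information to be predicted, can be used to build a gene coding breeding prediction model based on a deep convolutional neural network, with Dropout and L1/L2 regularization added to optimize breeding prediction performance.
In an optional implementation, referring to Fig. 4, the allele information and the gene cluster numbers are fused by element-wise concatenation to form the input and encoding layer of the deep convolutional neural network (the gene cluster encoding input layer), followed in sequence by an embedding layer, SpatialDropout1D, 1D convolution, 1D max pooling, flattening and a fully connected layer, and finally by a classification/regression output layer for the target task and the phenotype output.
In an optional implementation, referring to Fig. 5, model construction and learning are divided into two stages. Stage one is the pre-training stage: the allele information, gene cluster numbers and gene position information are fused to form the input and encoding layer of a dual-channel, multi-task deep convolutional neural network, followed by a deep convolutional neural network serving as the shared twin network, with a difference task and a sum task attached to the output layer. The input of the whole network has left and right channels, i.e. the gene strings of two different samples, and the output is multi-task: the difference task is to determine the sign of the difference between the phenotype values of the left-channel and right-channel gene strings, and the sum task is a regression on the sum of those phenotype values. The loss functions of the difference task and the sum task are $L_-$ and $L_+$ respectively; in general, $L_+$ may be the mean squared error (MSE) and $L_-$ the cross-entropy, as shown below: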
Figure PCTCN2022121174-appb-000005
Figure PCTCN2022121174-appb-000006
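where $Y_{i1}$ and $Y_{i2}$ are the actual label values of the samples,
Figure PCTCN2022121174-appb-000007
are the predicted values of the samples, $I(X)$ is the indicator function, equal to 1 when $X$ is true and 0 when $X$ is false; $M$ is the number of samples; and $\sigma$ is the Sigmoid function, computed as follows:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
where $e$ is the natural constant, approximately equal to 2.71828.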
阶段一的多任务目标损失函数如下所示:
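$$L_1 = \alpha L_+ + \beta L_-$$
where $L_+$, $L_-$ and $L_1$ are the loss functions of the sum task, the difference task and the overall task respectively, and $\alpha$ and $\beta$ are the weight hyperparameters of the sum-task and difference-task losses respectively, whose values can be determined by grid search.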
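The stage-one objective can be illustrated with a small pure-Python sketch. It assumes, for illustration only, that the twin network has two output heads: a regression head predicting the sum of the two phenotype values (`sum_pred`) and a classification head emitting a logit for "left phenotype is larger than right" (`diff_logit`). The exact form of the losses in the patent's formula images may differ; only the choice of MSE for the sum task, cross-entropy for the difference task, and the weighted combination follow the text.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def stage_one_loss(y_left, y_right, sum_pred, diff_logit,
                   alpha=1.0, beta=1.0, eps=1e-12):
    """Return (total, l_plus, l_minus) for a batch of sample pairs.

    l_plus  : MSE between predicted and true phenotype sums.
    l_minus : binary cross-entropy on the sign of the phenotype
              difference (target 1 when y_left - y_right > 0).
    total   : alpha * l_plus + beta * l_minus.
    """
    m = len(y_left)
    if not (m == len(y_right) == len(sum_pred) == len(diff_logit)) or m == 0:
        raise ValueError("all inputs must be non-empty and equally long")
    l_plus = sum((yl + yr - sp) ** 2
                 for yl, yr, sp in zip(y_left, y_right, sum_pred)) / m
    l_minus = 0.0
    for yl, yr, logit in zip(y_left, y_right, diff_logit):
        target = 1.0 if yl - yr > 0 else 0.0
        # Clip the probability to avoid log(0).
        p = min(max(sigmoid(logit), eps), 1.0 - eps)
        l_minus -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    l_minus /= m
    return alpha * l_plus + beta * l_minus, l_plus, l_minus

# Two pairs with perfect sum predictions and uninformative sign logits.
total, l_plus, l_minus = stage_one_loss(
    y_left=[3.0, 1.0], y_right=[1.0, 2.0],
    sum_pred=[4.0, 3.0], diff_logit=[0.0, 0.0],
    alpha=1.0, beta=2.0,
)
```

A grid search over `alpha` and `beta`, as the text suggests, would call a training loop with each candidate pair of weights and keep the pair that performs best on held-out data.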
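Stage two is the continued-training stage. The shared twin network pre-trained in stage one is loaded and its weights are frozen so that they are not tuned during continued training; a classification/regression output layer for the target task and the phenotype output are then attached on top, forming the deep convolutional neural network of the continued-training stage. This model is trained and tuned on the gene string samples and the corresponding phenotype data in the development set, yielding the final target prediction model for the breeding task. For regression prediction of a phenotype, the stage-two loss function is the mean squared error (MSE); for classification prediction of a phenotype, it is the cross-entropy. Taking regression as an example, the stage-two loss function is as follows: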
Figure PCTCN2022121174-appb-000009
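S106. Screen a set of high-quality seeds based on the constructed gene breeding prediction model, i.e. optimize the best parental combinations.
For trait classification tasks, the constructed model is used to directly screen the pool of seeds predicted to fall in the specified trait class, which gives the optimal parental combinations; for trait regression, i.e. numeric prediction tasks, the model is used to screen the pool of seeds whose predicted values reach or exceed a specified threshold, which gives the optimal parental combinations. For trait regression tasks, the screening threshold can be obtained by jointly optimizing the screening ratio and the size of the trial field.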
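The regression-based screening in S106 can be sketched as follows. The helper names, the `keep_ratio` parameter and the optional `field_capacity` cap are hypothetical; they stand in for the screening ratio and trial-field size mentioned above and are only one simple way to combine the two.

```python
def threshold_from_ratio(predictions, keep_ratio, field_capacity=None):
    """Pick a threshold so that about keep_ratio of the seeds pass.

    field_capacity (hypothetical) caps the number of retained seeds at
    what the trial field can hold. At least one seed is always kept.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    ranked = sorted(predictions, reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    if field_capacity is not None:
        k = max(1, min(k, field_capacity))
    return ranked[k - 1]

def screen_seeds(seed_ids, predictions, threshold):
    """Keep the seeds whose predicted value reaches or exceeds the threshold."""
    return [sid for sid, p in zip(seed_ids, predictions) if p >= threshold]

seed_ids = ["A", "B", "C", "D"]
predicted_yield = [5.0, 9.0, 7.0, 3.0]
cutoff = threshold_from_ratio(predicted_yield, keep_ratio=0.5)
selected = screen_seeds(seed_ids, predicted_yield, cutoff)
```

Ties at the threshold are all retained by the `>=` comparison, so the selected pool can be slightly larger than the requested ratio when several seeds share the cut-off value.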
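Existing encodings of gene input features usually use a numeric or one-hot format. The numeric format, for example, usually maps 0/0, 0/1 and 1/1 to the values -1, 0 and 1, or to 0, 1 and 2, and cannot express complex gene network features; one-hot encoding merely discretizes the original features and adds no information to them. Gene encoding based on graph clustering, by contrast, fuses the original allele features with the co-regulated gene group features and so captures higher-level graph neighborhood structure, which greatly improves the predictive ability of models based on deep neural networks. The method makes full use of the gene-gene interaction network contained in the gene graph, extracts co-regulated gene group cluster information through graph clustering, introduces a new gene cluster encoding that fuses allele information with gene graph cluster information, and exploits the weight sharing of the dual-channel, multi-task deep convolutional neural network. It can thus effectively extract the regulatory gene features that control biological phenotype output, addresses the insufficient encoding of gene-gene interactions in the input encoding layer of classical models, and safeguards the accuracy of phenotype prediction in gene breeding.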
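Embodiment 2:
A gene coding breeding prediction device based on graph clustering. Referring to Fig. 2 and Fig. 6, the device comprises:
An information collection module 11, configured to collect the biological phenotype information to be predicted and the genotype data required for precision molecular breeding. This module is mainly implemented by smart terminals.
A gene clustering module 12, configured to determine the gene graph and the adjacency edge weights based on the strength of the correlations between genes and to cluster the gene graph, obtaining the number of co-regulated gene groups and the gene-group cluster number of each gene. This module is implemented by cloud-edge-end collaborative computing.
An encoding pre-training module 13, configured to fuse the allele information, gene-group cluster numbers and gene position information, to build a dual-channel, multi-task deep convolutional neural network on the fused encoding, and to train it and tune its weights on the difference task and the sum task simultaneously using dual-gene-string augmented data, obtaining the shared twin network and its weights for subsequent continued training. This module is implemented by cloud-edge-end collaborative computing.
A continued training module 14, configured to load the shared twin network and its weights obtained by the encoding pre-training module, freeze the weights, and perform continued training and tuning on the classification/regression task related to the target breeding trait, obtaining a prediction model for the target breeding trait. This module is implemented by cloud-edge-end collaborative computing.
A breeding prediction and screening module 15, configured to screen a set of high-quality seeds, i.e. optimize the best parental combinations, based on the gene breeding prediction model built by the continued training module. This module is implemented by cloud-edge-end collaborative computing.
In the gene coding breeding prediction device based on graph clustering of this embodiment, the training-set and prediction-set data required for gene breeding prediction are collected first, where the training set comprises the biological phenotype information to be predicted and the allele information to be encoded that are required for precision molecular breeding, and the prediction set comprises the allele information of the seeds to be predicted. Next, the correlations between gene sites are computed, the gene adjacency matrix and gene graph are built from the correlation strengths, and the gene graph is clustered to obtain the number of co-regulated gene groups and the gene-group cluster number of each gene. Then the allele information, gene-group cluster numbers and gene position information are fused to build a dual-channel, multi-task deep convolutional neural network for pre-training, yielding the shared twin network and its weights. After that, the shared twin network and its weights are loaded, the weights are frozen, and continued training and tuning is performed on the classification/regression task related to the target breeding trait, yielding a prediction model for that trait. Finally, a set of high-quality seeds is screened based on the gene breeding prediction model, i.e. the best parental combinations are optimized. The device makes full use of the gene-gene interaction network contained in the gene graph, extracts co-regulated gene group cluster information through graph clustering, introduces a new gene cluster encoding that fuses allele information with gene graph cluster information, and exploits the weight sharing of the dual-channel, multi-task deep convolutional neural network. It can thus effectively extract the regulatory gene features that control biological phenotype output, addresses the insufficient encoding of gene-gene interactions in the input encoding layer of classical models, and safeguards the accuracy of phenotype prediction in gene breeding.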
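The computer program product of the gene coding breeding prediction method and device based on graph clustering provided by the embodiments of the present invention comprises a computer-readable storage medium storing program code; the instructions contained in the program code can be used to execute the method described in the foregoing method embodiment. For the specific implementation, refer to the method embodiment; it is not repeated here.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system and device described above may be found in the corresponding processes of the foregoing method embodiment and are not repeated here.
If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks and other media that can store program code.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.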

Claims (10)

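  1. A gene coding breeding prediction method based on graph clustering, characterized by comprising the following steps:
    obtaining genotype data and gene position information of the offspring to be predicted;
    constructing an undirected graph as a gene graph based on the strength of the correlations between genes in the genotype data;
    clustering the gene graph to obtain the number of co-regulated gene groups and the gene-group cluster number of each gene;
    fusing the allele information of each gene in the genotype data with its gene-group cluster number, and concatenating them to obtain the gene cluster encoding of the sample;
    inputting the gene cluster encoding and the gene position information into a gene coding breeding prediction model to obtain the biological phenotype information of the offspring to be predicted; and screening a set of high-quality seeds based on the predicted biological phenotype information of the offspring;
    wherein the gene coding breeding prediction model is obtained by training on a collected data set, each sample of which comprises the gene cluster encoding, gene position information and biological phenotype information of the sample.
  2. The method according to claim 1, characterized in that the strength of the correlation between genes is obtained by computing the similarity of the multi-SNP site strings of each pair of genes in the genotype data, using the Pearson correlation coefficient, Jaccard coefficient, Spearman correlation coefficient, Euclidean distance, cosine similarity, Manhattan distance, Hamming distance, edit distance, Chebyshev distance, Minkowski distance or information entropy; the computed similarity is used as the adjacency edge weight to construct the undirected graph.
  3. The method according to claim 1, characterized in that clustering the gene graph to obtain the number of co-regulated gene groups and the gene-group cluster number of each gene specifically comprises:
    estimating the number of co-regulated gene groups, i.e. the number of gene clusters, based on the spatial distribution features of the gene graph;
    according to the estimated number of gene clusters, computing the intra-cluster and inter-cluster distances for each gene and determining the cluster to which the gene belongs;
    after clustering, assigning each gene cluster a unique cluster number, which serves as the gene-group cluster number of every gene in that cluster.
  4. The method according to claim 3, characterized in that the gene clustering method comprises spatial clustering, density clustering, hierarchical clustering or spectral clustering.
  5. The method according to claim 3, characterized in that the number of co-regulated gene groups is estimated by a statistical, random, exhaustive or iterative method, where the iterative method mainly refers to determining the number of clusters by bottom-up or top-down iterative clustering in hierarchical clustering.
  6. The method according to claim 1, characterized in that the gene cluster numbers are given by the clustering method itself, or assigned randomly or sequentially.
  7. The method according to claim 1, characterized in that the biological phenotype information comprises a quantity, quality, percentage or class related to the target phenotype.
  8. The method according to claim 1, characterized in that the gene coding breeding prediction model comprises a gene cluster encoding input layer, an embedding layer, a convolutional layer, a pooling layer, a fully connected layer and an output layer.
  9. The method according to claim 1, characterized in that the gene coding breeding prediction model is obtained by two-stage training, wherein the first stage is based on a shared bridging network with a dual-channel gene cluster encoding input layer, which receives the gene cluster encodings of two samples respectively and learns a difference task and a sum task simultaneously at the output layer; and the second stage is based on the fixed network parameters trained in the first stage, retains only one gene cluster encoding input layer to receive the gene cluster encoding and gene position information of a single sample, and takes part in fine-tuning on the target task until training is complete.
  10. A gene coding breeding prediction device based on graph clustering, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the gene coding breeding prediction method based on graph clustering according to any one of claims 1 to 9.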
PCT/CN2022/121174 2022-09-26 2022-09-26 Gene coding breeding prediction method and device based on graph clustering WO2024065070A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/121174 WO2024065070A1 (zh) 2022-09-26 2022-09-26 Gene coding breeding prediction method and device based on graph clustering
US18/454,036 US20240119314A1 (en) 2022-09-26 2023-08-22 Gene coding breeding prediction method and device based on graph clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/121174 WO2024065070A1 (zh) 2022-09-26 2022-09-26 Gene coding breeding prediction method and device based on graph clustering

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/454,036 Continuation US20240119314A1 (en) 2022-09-26 2023-08-22 Gene coding breeding prediction method and device based on graph clustering

Publications (1)

Publication Number Publication Date
WO2024065070A1 true WO2024065070A1 (zh) 2024-04-04

Family

ID=90475104

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121174 WO2024065070A1 (zh) 2022-09-26 2022-09-26 Gene coding breeding prediction method and device based on graph clustering

Country Status (2)

Country Link
US (1) US20240119314A1 (zh)
WO (1) WO2024065070A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
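CN108596841A (zh) * 2018-04-08 2018-09-28 西安交通大学 Method for realizing image super-resolution and deblurring in parallel
CN112232413A (zh) * 2020-10-16 2021-01-15 东北大学 High-dimensional data feature selection method based on graph neural network and spectral clustering
CN113192556A (zh) * 2021-03-17 2021-07-30 西北工业大学 Genotype-phenotype association analysis method for small-sample multi-omics data
CN114360651A (zh) * 2021-12-28 2022-04-15 中国海洋大学 Genomic prediction method, prediction system and application
CN115083511A (zh) * 2022-06-24 2022-09-20 西安电子科技大学 Peripheral gene regulation feature extraction method based on graph representation learning and attention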
US20220301658A1 (en) * 2021-03-19 2022-09-22 X Development Llc Machine learning driven gene discovery and gene editing in plants

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
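CN108596841A (zh) * 2018-04-08 2018-09-28 西安交通大学 Method for realizing image super-resolution and deblurring in parallel
CN112232413A (zh) * 2020-10-16 2021-01-15 东北大学 High-dimensional data feature selection method based on graph neural network and spectral clustering
CN113192556A (zh) * 2021-03-17 2021-07-30 西北工业大学 Genotype-phenotype association analysis method for small-sample multi-omics data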
US20220301658A1 (en) * 2021-03-19 2022-09-22 X Development Llc Machine learning driven gene discovery and gene editing in plants
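CN114360651A (zh) * 2021-12-28 2022-04-15 中国海洋大学 Genomic prediction method, prediction system and application
CN115083511A (zh) * 2022-06-24 2022-09-20 西安电子科技大学 Peripheral gene regulation feature extraction method based on graph representation learning and attention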

Also Published As

Publication number Publication date
US20240119314A1 (en) 2024-04-11

Similar Documents

Publication Publication Date Title
Koo et al. A review for detecting gene-gene interactions using machine learning methods in genetic epidemiology
Wang et al. DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants
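WO2023217290A1 (zh) Gene phenotype prediction based on graph neural network
CN109448794B (zh) Epistatic locus mining method based on genetic tabu search and Bayesian network
CN109727640B (zh) Whole-genome prediction method and device based on automated machine learning
CN116168766A (zh) Variety identification method, system and terminal based on ensemble learning
CN103164631B (zh) Intelligent co-expression gene analyzer
CN105590039B (zh) Protein complex identification method based on BSO optimization
CN110796485A (zh) Method and device for improving the prediction accuracy of a prediction model
CN115691661A (zh) Gene coding breeding prediction method and device based on graph clustering
WO2024065070A1 (zh) Gene coding breeding prediction method and device based on graph clustering
CN111584010B (zh) Key protein identification method based on capsule neural network and ensemble learning
CN111462812B (zh) Multi-objective phylogenetic tree construction method based on feature hierarchy
Nelson et al. Higher order interactions: detection of epistasis using machine learning and evolutionary computation
CN114639446B (zh) Method for estimating genomic breeding values of aquatic animals based on an MCP sparse deep neural network model
US20210392836A1 (en) Associating pedigree scores and similarity scores for plant feature prediction
CN114446384A (zh) Prediction method and prediction system for chromosome topologically associating domains
Cudic et al. Prediction of sorghum bicolor genotype from in-situ images using autoencoder-identified SNPs
CN114512188B (zh) DNA-binding protein identification method based on improved protein sequence position-specific matrix
CN116992098B (zh) Citation network data processing method and system
CN116994652B (zh) Neural network-based information prediction method and device, and electronic equipment
CN113077038A (zh) Industrial data feature selection method and device, computer equipment and storage medium
Ding Machine learning for biological networks
TWI650664B (zh) Method for establishing a protein loss-of-function assessment model, and risk assessment method and system using the model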
Wu et al. Residual network improves the prediction accuracy of genomic selection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22959722

Country of ref document: EP

Kind code of ref document: A1