CN115019885A - Pig whole genome SNP site screening method, device and storage medium - Google Patents

Info

Publication number
CN115019885A
Authority
CN
China
Prior art keywords
snp
breeding value
breeding
feature selection
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210768898.1A
Other languages
Chinese (zh)
Inventor
马黎
张兴
丁偌楠
杨雨婷
牛安然
杨昌雨
张丽萍
荆晓燕
马康
陈康
徐泽玉
梁志刚
龚华忠
闫之春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinliu Agriculture And Animal Husbandry Technology Co ltd
Chengdu Xinjin Xinhao Agriculture And Animal Husbandry Co ltd
Gaotang Xinhao Agriculture And Animal Husbandry Co ltd
Jiangyou New Hope Haibo'er Pig Breeding Co ltd
Laibin New Hope Liuhe Agriculture And Animal Husbandry Technology Co ltd
New Hope Group Co ltd
New Hope Liuhe Co ltd Beijing Branch
Sichuan New Hope Liuhe Pig Breeding Technology Co ltd
Shandong New Hope Liuhe Group Co Ltd
New Hope Liuhe Co Ltd
Original Assignee
Beijing Xinliu Agriculture And Animal Husbandry Technology Co ltd
Chengdu Xinjin Xinhao Agriculture And Animal Husbandry Co ltd
Gaotang Xinhao Agriculture And Animal Husbandry Co ltd
Jiangyou New Hope Haibo'er Pig Breeding Co ltd
Laibin New Hope Liuhe Agriculture And Animal Husbandry Technology Co ltd
New Hope Group Co ltd
New Hope Liuhe Co ltd Beijing Branch
Sichuan New Hope Liuhe Pig Breeding Technology Co ltd
Shandong New Hope Liuhe Group Co Ltd
New Hope Liuhe Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xinliu Agriculture And Animal Husbandry Technology Co ltd, Chengdu Xinjin Xinhao Agriculture And Animal Husbandry Co ltd, Gaotang Xinhao Agriculture And Animal Husbandry Co ltd, Jiangyou New Hope Haibo'er Pig Breeding Co ltd, Laibin New Hope Liuhe Agriculture And Animal Husbandry Technology Co ltd, New Hope Group Co ltd, New Hope Liuhe Co ltd Beijing Branch, Sichuan New Hope Liuhe Pig Breeding Technology Co ltd, Shandong New Hope Liuhe Group Co Ltd, New Hope Liuhe Co Ltd

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The invention discloses a pig whole genome SNP site screening method, device and storage medium. The method comprises the following steps: extracting a plurality of sample sets from a gene data sample, and performing feature selection on all the sample sets to obtain SNP subsets with preset numbers of sites; screening out a preset number of feature selection methods with the highest stability ranking as optimal feature selection methods, and taking the SNP subsets selected by the optimal feature selection methods as machine learning model features; constructing a corresponding breeding value prediction model according to a specific machine learning algorithm; inputting the machine learning model features into the breeding value prediction model, and outputting a predicted breeding value; determining the final SNP sites according to the prediction accuracy; and generating a pig whole genome SNP site chip according to the final SNP sites. The pig whole genome SNP sites are screened using machine learning algorithms so that a low-cost pig whole genome SNP site chip can be manufactured.

Description

Pig whole genome SNP site screening method, device and storage medium
Technical Field
The invention relates to the field of pig genome selection, in particular to a pig whole genome SNP site screening method, a pig whole genome SNP site screening device and a storage medium.
Background
Whole genome selection estimates genomic breeding values using a large number of genetic markers distributed across the genome. It has advantages such as shortening the generation interval and accelerating genetic progress, but its application is limited by the high cost of genotype sequencing.
The most common site screening approach in existing whole genome selection methods is the genome-wide association study (GWAS), in which a mixed linear model is used to perform GWAS analysis on a reference population and detect SNP sites significantly associated with traits. This approach has several shortcomings: it cannot identify nonlinear relationships among SNP sites well; for phenotypic traits that are not directly measured, GWAS may fail to screen out significant SNPs, and a conversion from other directly measured phenotypic traits is required; because a quantitative trait results from the combined action of many minor-effect genes and major genes and is controlled by many SNP sites, GWAS cannot fully account for the genetic characteristics of the trait when the candidate phenotypic trait is quantitative, and only a few SNPs can be screened out; GWAS cannot fully account for gene-environment interaction effects, and the SNPs screened from different populations are very unstable; and GWAS is based on genetic linkage disequilibrium (LD) and cannot fully account for population stratification present in the sample population.
Disclosure of Invention
The invention provides a pig whole genome SNP locus screening method, which aims to solve the technical problem of high cost of an SNP locus chip.
In order to solve the technical problems, the embodiment of the invention provides a pig whole genome SNP site screening method, which comprises the following steps:
extracting a plurality of sample sets from a gene data sample, and applying a plurality of feature selection methods respectively to all the sample sets to obtain a plurality of first SNP subsets, each with a preset number of sites;
determining the stability level of each feature selection method by calculating the Hamming distance between the first SNP subsets for each preset number of sites; ranking the feature selection methods by stability level from high to low and taking the first N as the optimal feature selection methods, and selecting a plurality of second SNP subsets from the plurality of sample sets as a plurality of machine learning model features according to the optimal feature selection methods; wherein N is a positive integer;
constructing a plurality of corresponding initial breeding value models according to a plurality of machine learning algorithms respectively; tuning the parameters of each initial breeding value model by nested cross-validation, and determining the optimal parameters of each initial breeding value model; constructing each corresponding breeding value prediction model according to the optimal parameters;
inputting each machine learning model feature into each breeding value prediction model respectively, and outputting a plurality of predicted breeding values corresponding to each breeding value prediction model;
and calculating the average prediction accuracy of each machine learning model feature across the plurality of breeding value prediction models according to the number of machine learning model features, the predicted breeding values and the true breeding values of each breeding value prediction model, and determining the SNP sites with the highest average prediction accuracy as the final SNP sites.
Compared with the prior art, this scheme selects SNP subsets from the SNP sites through feature selection in machine learning, which helps prevent overfitting on high-dimensional data. When the features are highly redundant, key SNP sites in the gene data are selected by different feature selection methods and the SNP sites selected with high stability are retained, so that stability is high, the number of features screened can be chosen freely, and the stability and accuracy of the SNP sites are improved.
The method also uses machine learning algorithms to construct the breeding value prediction models. Machine learning algorithms cover a diverse range of techniques, including non-parametric models, for prediction and pattern recognition on large data sets, and they do not require the data distribution or model form to be assumed in advance, so that both linear and nonlinear relationships among the SNP sites can be identified well and the accuracy of SNP site selection is improved.
Preferably, the method further comprises the following step: performing quality control, variance filtering and high-collinearity filtering on the gene data in sequence, and deleting some SNP sites and some gene data to obtain the screened SNP sites and the gene data sample; wherein the screened SNP sites are the SNP sites on the gene sequences in the gene data sample.
Compared with the prior art, in this preferred scheme quality control, variance filtering and high-collinearity filtering are performed on the gene data before the SNP sites are screened through feature selection in machine learning. The SNP sites and gene data that would introduce large errors into the sample are deleted according to the distribution of the gene data, which reduces the errors caused by extreme distributions in the gene data, reduces the adverse effect on the subsequent feature selection, and improves the accuracy of SNP site screening.
As a preferred scheme, extracting a plurality of sample sets from a gene data sample and applying a plurality of feature selection methods respectively to all the sample sets to obtain a plurality of first SNP subsets, each with a preset number of sites, specifically comprises:
randomly extracting a plurality of samples from the gene data sample as a sample set, and repeating the random extraction a plurality of times with replacement to obtain a plurality of sample sets; and performing feature selection on all the sample sets using a first filter feature selection method, a second filter feature selection method, a first embedded feature selection method, a second embedded feature selection method and a random feature selection method respectively, so that each feature selection method selects, in each sample set, a plurality of first SNP subsets with preset numbers of sites.
Compared with the prior art, this preferred scheme randomly extracts a plurality of sample sets from the gene data, performs feature selection on the sample sets using different feature selection methods, and screens key SNP sites out of a very large number of SNP sites, so that the dimensionality of the SNP sites is greatly reduced and the accuracy of SNP site screening is improved.
As a preferred scheme, tuning the parameters of each initial breeding value model by nested cross-validation to determine the optimal parameters of each initial breeding value model specifically comprises:
dividing the data set of each initial breeding value model into a plurality of first training sets and a plurality of corresponding test sets, and dividing each first training set into a plurality of second training sets and a plurality of corresponding validation sets;
setting preset parameters for each initial breeding value model, and calculating the accuracy of each corresponding initial breeding value model using the data set; and determining the optimal parameters of each initial breeding value model according to the accuracies obtained with the different preset parameters.
Compared with the prior art, in constructing the breeding value prediction models this preferred scheme trains each model on a training set and tests it on a test set, and the error on the test set is used to approximate the generalization error of the model in a real scenario. Within each model, evaluation and hyperparameter tuning are repeated until the optimal parameters are selected and the optimal model is trained. This reduces the error of the model on the data set, fits the model to the data set, and improves prediction accuracy.
Preferably, inputting each machine learning model feature into each breeding value prediction model and outputting a plurality of predicted breeding values corresponding to each breeding value prediction model comprises:
calculating the breeding values of a phenotypic trait from gene data, pedigree data and phenotypic value data, and using the breeding values of the phenotypic trait as the prediction target of the breeding value prediction model;
inputting the machine learning model features into the breeding value prediction model so that the breeding value prediction model outputs the corresponding predicted breeding values according to the prediction target.
Compared with the prior art, this preferred scheme combines gene data, pedigree data and phenotypic data, calculates the breeding values of the phenotypic trait as the prediction target, and screens out the key SNP sites strongly associated with the phenotypic trait through the breeding value prediction model. This eliminates the influence of environmental effects, population stratification and the need for direct measurement, and improves the accuracy and generality of SNP site screening.
Correspondingly, the application also provides a method for manufacturing a pig whole genome SNP site chip, which comprises the following steps:
selecting key SNP sites and designing probes to manufacture the SNP site chip;
wherein the key SNP sites are screened using the pig whole genome SNP site screening method described above.
Compared with the prior art, this scheme manufactures the SNP chip from the selected key SNP sites only, which reduces cost; the chip can be applied to different pig populations, gives similar prediction performance across them, and is highly stable.
Correspondingly, the application also provides a pig whole genome SNP site screening device, comprising: a feature selection module, a model construction module and a site determination module;
the feature selection module is used for extracting a plurality of sample sets from a gene data sample, and applying a plurality of feature selection methods respectively to all the sample sets to obtain a plurality of first SNP subsets, each with a preset number of sites;
determining the stability level of each feature selection method by calculating the Hamming distance between the first SNP subsets for each preset number of sites; ranking the feature selection methods by stability level from high to low and taking the first N as the optimal feature selection methods, and selecting a plurality of second SNP subsets from the plurality of sample sets as a plurality of machine learning model features according to the optimal feature selection methods; wherein N is a positive integer;
the model construction module is used for constructing a plurality of corresponding initial breeding value models according to a plurality of machine learning algorithms respectively; tuning the parameters of each initial breeding value model by nested cross-validation, and determining the optimal parameters of each initial breeding value model; constructing each corresponding breeding value prediction model according to the optimal parameters;
the site determination module is used for inputting each machine learning model feature into each breeding value prediction model respectively, and outputting a plurality of predicted breeding values corresponding to each breeding value prediction model;
and calculating the average prediction accuracy of each machine learning model feature across the plurality of breeding value prediction models according to the number of machine learning model features, the predicted breeding values and the true breeding values of each breeding value prediction model, and determining the SNP sites with the highest average prediction accuracy as the final SNP sites.
Compared with the prior art, the feature selection module of the pig whole genome SNP site screening device in this scheme selects SNP subsets from the SNP sites through feature selection in machine learning, which helps prevent overfitting on high-dimensional data. When the features are highly redundant, key SNP sites in the gene data are selected by different feature selection methods and the SNP sites selected with high stability are retained, so that stability is high, the number of features screened can be chosen freely, and the stability and accuracy of the SNP sites are improved.
The model construction module and the site determination module use machine learning algorithms to construct the breeding value prediction models, and the key SNP sites are determined from the model results. Machine learning algorithms cover a diverse range of techniques, including non-parametric models, for prediction and pattern recognition on large data sets, and they do not require the data distribution or model form to be assumed in advance, so that both linear and nonlinear relationships among the SNP sites can be identified well and the accuracy of SNP site selection is improved.
Preferably, the feature selection module comprises: a site deletion unit and a site selection unit;
the site deletion unit is used for performing quality control, variance filtering and high-collinearity filtering on the gene data in sequence, and deleting some SNP sites and some gene data to obtain the screened SNP sites and the gene data sample; wherein the screened SNP sites are the SNP sites on the gene sequences in the gene data sample;
randomly extracting a plurality of samples from the gene data sample as a sample set, and repeating the random extraction a plurality of times with replacement to obtain a plurality of sample sets; performing feature selection on all the sample sets using a first filter feature selection method, a second filter feature selection method, a first embedded feature selection method, a second embedded feature selection method and a random feature selection method respectively, so that each feature selection method selects, in each sample set, a plurality of first SNP subsets with preset numbers of sites;
the site selection unit is used for determining the stability level of each feature selection method by calculating the Hamming distance between the first SNP subsets for each preset number of sites; ranking the feature selection methods by stability level from high to low and taking the first N as the optimal feature selection methods, and selecting a plurality of second SNP subsets from the plurality of sample sets as a plurality of machine learning model features according to the optimal feature selection methods; wherein N is a positive integer.
Compared with the prior art, the site deletion unit of the feature selection module in this preferred scheme performs quality control, variance filtering and high-collinearity filtering on the gene data before the SNP sites are screened through feature selection in machine learning. The SNP sites and gene data that would introduce large errors into the sample are deleted according to the distribution of the gene data, which reduces the errors caused by extreme distributions in the gene data, reduces the adverse effect on the subsequent feature selection, and improves the accuracy of SNP site screening.
The site selection unit performs feature selection on the plurality of sample sets using different feature selection methods and screens key SNP sites out of a very large number of SNP sites, thereby greatly reducing the dimensionality of the SNP sites and improving the accuracy of SNP site screening.
Preferably, inputting each machine learning model feature into each breeding value prediction model and outputting a plurality of predicted breeding values corresponding to each breeding value prediction model comprises:
calculating the breeding values of a phenotypic trait from gene data, pedigree data and phenotypic value data, and using the breeding values of the phenotypic trait as the prediction target of the breeding value prediction model;
inputting the machine learning model features into the breeding value prediction model so that the breeding value prediction model outputs the corresponding predicted breeding values according to the prediction target.
Compared with the prior art, the site determination module of the pig whole genome SNP site screening device in this preferred scheme processes the gene data, pedigree data and phenotypic data to calculate the breeding values of the phenotypic trait as the prediction target, and outputs predicted values for that target through machine learning algorithms. This data processing eliminates the influence of environmental effects on the phenotypic trait, population stratification and the need for direct measurement, is applicable to breeding value prediction and site screening for any phenotypic trait, and improves both generality and site screening accuracy.
Accordingly, the present application also proposes a computer-readable storage medium comprising a stored computer program; wherein the computer program controls the device on which the computer readable storage medium is located to execute the pig whole genome SNP site screening method according to any one of the embodiments.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the method for screening SNP sites of a pig whole genome provided by the invention;
FIG. 2 is a schematic flow chart of an embodiment of a method for manufacturing a pig genome-wide SNP site chip according to the invention;
FIG. 3 is a schematic structural diagram of an embodiment of the pig whole genome SNP site screening apparatus provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1, a method for screening a pig genome-wide SNP site provided by the embodiment of the present invention includes the following steps S101 to S105:
Step S101: extracting a plurality of sample sets from a gene data sample, and applying a plurality of feature selection methods respectively to all the sample sets to obtain a plurality of first SNP subsets, each with a preset number of sites;
In this embodiment, M samples are randomly extracted from the gene data sample as a sample set, and the random extraction is repeated X times with replacement to obtain a plurality of sample sets; feature selection is performed on all the sample sets using the f_regression filter feature selection method, the mutual_info_regression filter feature selection method, the Lasso embedded feature selection method, the FR embedded feature selection method and a random feature selection method, so that each feature selection method selects, in each sample set, a plurality of first SNP subsets containing 250, 500, 750, 1000, 1250 and 1500 sites respectively. M and X are positive integers.
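The sampling and selection in step S101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the toy genotype matrix, the function names and the sizes (200 animals, 20 SNPs, M = 150, X = 5, subsets of 2 sites) are hypothetical, and a plain ranking by absolute Pearson correlation stands in for a filter method (for a single predictor this ordering matches a ranking by univariate F-statistic). The embedded methods are omitted.

```python
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def bootstrap_sets(n_animals, m, x, rng):
    """Draw X sample sets of M animal indices each, with replacement."""
    return [[rng.randrange(n_animals) for _ in range(m)] for _ in range(x)]

def select_by_correlation(genotypes, target, rows, k):
    """Filter-style selection: keep the k SNP columns with the largest |r|."""
    y = [target[i] for i in rows]
    scores = []
    for j in range(len(genotypes[0])):
        col = [genotypes[i][j] for i in rows]
        scores.append((abs(pearson(col, y)), j))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return {j for _, j in scores[:k]}

def select_at_random(n_snps, k, rng):
    """Random baseline: k SNP columns drawn without regard to the target."""
    return set(rng.sample(range(n_snps), k))

# Hypothetical toy data: genotypes coded 0/1/2, target driven by SNPs 3 and 7.
rng = random.Random(0)
n_animals, n_snps = 200, 20
genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_snps)]
             for _ in range(n_animals)]
breeding_value = [2.0 * g[3] - 1.5 * g[7] + rng.gauss(0, 0.1) for g in genotypes]

sample_sets = bootstrap_sets(n_animals, m=150, x=5, rng=rng)
filter_subsets = [select_by_correlation(genotypes, breeding_value, rows, k=2)
                  for rows in sample_sets]
random_subsets = [select_at_random(n_snps, 2, rng) for _ in sample_sets]
```

Each method yields one subset per sample set, which is the input needed to compare how consistently a method picks the same sites across resamples.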
In this embodiment, the method further includes: performing quality control, variance filtering and high-collinearity filtering on the gene data in sequence, and deleting some SNP sites and some gene data to obtain the screened SNP sites and the gene data sample; wherein the screened SNP sites are the SNP sites on the gene sequences in the gene data sample.
Step S102: determining the stability level of each feature selection method by calculating the Hamming distance between the first SNP subsets for each preset number of sites; ranking the feature selection methods by stability level from high to low and taking the first N as the optimal feature selection methods, and selecting a plurality of second SNP subsets from the plurality of sample sets as a plurality of machine learning model features according to the optimal feature selection methods; wherein N is a positive integer;
In this embodiment, the Hamming distance between the plurality of first SNP subsets is calculated from the first SNP subsets selected in each sample set by each feature selection method;
The formula for the Hamming distance is:
[Equation image BDA0003726654270000081 (Hamming distance formula) is not reproduced in this text.]
where h represents the Hamming distance, N represents the number of sample sets, S represents a first SNP subset, and k represents the number of sites in the SNP subset. The value of h ranges from 0 to 1, and the closer h is to 0, the more stable the feature selection method.
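Because the formula image is not available here, the following is only one possible normalisation that agrees with the stated properties (h in 0 to 1, lower meaning more stable), not necessarily the formula in the filing. It assumes every subset has the same size k, measures each pair of subsets by the size of their symmetric difference divided by 2k, and averages over all pairs of the N sample sets. The method names and subsets are hypothetical.

```python
from itertools import combinations

def mean_hamming_distance(subsets, k):
    """Average normalised Hamming distance over all pairs of SNP subsets.

    Two subsets of size k differ in at most 2k positions of their indicator
    vectors, so dividing the symmetric difference by 2k maps each pair to [0, 1].
    """
    pairs = list(combinations(subsets, 2))
    return sum(len(a ^ b) / (2 * k) for a, b in pairs) / len(pairs)

# Hypothetical subsets chosen by three methods on three sample sets (k = 3).
selected = {
    "method_a": [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}],  # identical every time
    "method_b": [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}],  # mostly overlapping
    "method_c": [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}],  # no overlap at all
}
h = {name: mean_hamming_distance(sets, k=3) for name, sets in selected.items()}

# Lower h means more stable; keep the top-N methods (N = 2 here).
top_n = sorted(h, key=h.get)[:2]
```

In this toy case the three methods score 0, 1/3 and 1, so the first two would be kept as the optimal feature selection methods.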
Step S103: constructing a plurality of corresponding initial breeding value models according to a plurality of machine learning algorithms respectively; tuning the parameters of each initial breeding value model by nested cross-validation, and determining the optimal parameters of each initial breeding value model; constructing each corresponding breeding value prediction model according to the optimal parameters;
In this embodiment, quality control, variance filtering and high-collinearity filtering are performed on the gene data in sequence, specifically:
For quality control, the call rate threshold is set to 0.9, the minor allele frequency (maf) threshold to 0.05 and the Hardy-Weinberg equilibrium p-value (hwep) threshold to 10^-6, and some SNP sites and some gene data are deleted; the variance filtering parameter is set to 95/5, and variance filtering is performed to delete some SNP sites; the high-collinearity parameter cut_off is set to 0.8, and highly collinear SNP sites are deleted. Quality control, variance filtering and high-collinearity filtering are prior art and are not described further here.
In this embodiment, a plurality of corresponding initial breeding value models are constructed according to a plurality of machine learning algorithms respectively; the machine learning algorithms include, but are not limited to, the ridge regression algorithm, the support vector machine algorithm, the gradient boosting tree algorithm and the extreme gradient boosting (XGBoost) algorithm.
In this embodiment, tuning the parameters of each initial breeding value model by nested cross-validation to determine the optimal parameters of each initial breeding value model specifically includes:
dividing the data set of each initial breeding value model into a plurality of first training sets and a plurality of corresponding test sets, and dividing each first training set into a plurality of second training sets and a plurality of corresponding validation sets. This forms a double loop: the first training sets form the outer loop and provide data for the inner loop, the second training sets form the inner loop, and the optimal parameters of each algorithm are determined by iterating over the second training sets and their validation sets.
Preset parameters are set for each initial breeding value model, and the accuracy of each corresponding initial breeding value model is calculated using the data set; the optimal parameters of each initial breeding value model are determined according to the accuracies obtained with the different preset parameters.
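As a rough illustration of the SNP-level filters, the sketch below applies a call rate threshold of 0.9, a minor allele frequency threshold of 0.05 and a correlation cut_off of 0.8 to a toy table. Several things are assumptions for the example: genotypes are coded 0/1/2 with None for a missing call, SNPs are kept when call rate and MAF are at or above the thresholds, collinearity is measured by absolute Pearson correlation with a greedy keep-first rule, and the function names are hypothetical. The Hardy-Weinberg test, the variance filter and sample-level filtering are left out.

```python
def call_rate(col):
    """Fraction of animals with a non-missing genotype at this SNP."""
    return sum(g is not None for g in col) / len(col)

def minor_allele_freq(col):
    """MAF from 0/1/2 allele counts, ignoring missing calls."""
    called = [g for g in col if g is not None]
    p = sum(called) / (2 * len(called))
    return min(p, 1 - p)

def pearson(xs, ys):
    """Pearson correlation on pairwise-complete observations."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def quality_control(columns, min_call=0.9, min_maf=0.05):
    """Indices of SNPs passing the call rate and MAF thresholds."""
    return [j for j, col in enumerate(columns)
            if call_rate(col) >= min_call and minor_allele_freq(col) >= min_maf]

def drop_collinear(columns, candidates, cut_off=0.8):
    """Greedily keep a SNP only if |r| with every kept SNP is <= cut_off."""
    kept = []
    for j in candidates:
        if all(abs(pearson(columns[j], columns[i])) <= cut_off for i in kept):
            kept.append(j)
    return kept

# Hypothetical SNP-major toy data: one list of 10 genotypes per SNP.
columns = [
    [0, 1, 2, 1, 0, 2, 1, 1, 0, 2],        # passes all filters
    [0, 1, 2, 1, 0, 2, 1, 1, 0, 2],        # duplicate of SNP 0 (collinear)
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],        # monomorphic, MAF = 0
    [None, None, 1, 0, 2, 1, 0, 1, 2, 0],  # call rate 0.8
    [2, 0, 1, 0, 1, 1, 2, 0, 1, 1],        # passes all filters
]
after_qc = quality_control(columns)
after_collinearity = drop_collinear(columns, after_qc)
```

Here SNPs 2 and 3 fail quality control and SNP 1 is removed as a duplicate of SNP 0, leaving SNPs 0 and 4.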
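The double loop can be sketched in a few lines. This is an illustrative stand-in rather than the embodiment's code: a one-predictor ridge regression with a closed-form slope replaces the real models, mean squared error is used as the score, and the fold counts, candidate penalty values and all names are hypothetical.

```python
import random

def k_folds(indices, k):
    """Split indices into k roughly equal folds (round-robin)."""
    return [indices[i::k] for i in range(k)]

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for one centred predictor, no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the predictions w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def nested_cv(xs, ys, lambdas, outer_k=5, inner_k=3):
    """Return (chosen lambda, test MSE) for each outer fold."""
    idx = list(range(len(xs)))
    results = []
    for test in k_folds(idx, outer_k):               # outer loop
        held_out = set(test)
        train = [i for i in idx if i not in held_out]  # first training set

        def inner_error(lam):
            errs = []
            for val in k_folds(train, inner_k):      # inner loop
                val_set = set(val)
                sub = [i for i in train if i not in val_set]  # second training set
                w = fit_ridge_1d([xs[i] for i in sub], [ys[i] for i in sub], lam)
                errs.append(mse(w, [xs[i] for i in val], [ys[i] for i in val]))
            return sum(errs) / len(errs)

        best_lam = min(lambdas, key=inner_error)
        # Refit on the whole first training set, then score on the test set.
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], best_lam)
        results.append((best_lam, mse(w, [xs[i] for i in test],
                                      [ys[i] for i in test])))
    return results

# Hypothetical toy data: one centred SNP dosage and a noisy breeding value.
rng = random.Random(1)
xs = [rng.choice((-1, 0, 1)) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]
lambdas = (0.01, 1.0, 100.0)
results = nested_cv(xs, ys, lambdas)
```

Keeping the test fold out of the inner loop is the point of the nesting: the parameter is chosen on validation sets only, so the outer test error remains an estimate of generalization error.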
Step S104: inputting each machine learning model feature into each breeding value prediction model respectively, and outputting a plurality of predicted breeding values corresponding to each breeding value prediction model;
Step S105: calculating the average prediction accuracy of each machine learning model feature across the plurality of breeding value prediction models according to the number of machine learning model features, the predicted breeding values and the true breeding values of each breeding value prediction model, and determining the SNP sites with the highest average prediction accuracy as the final SNP sites.
In this embodiment, calculating the average prediction accuracy of each machine learning model feature across the plurality of breeding value prediction models according to the number of machine learning model features, the predicted breeding values and the true breeding values of each breeding value prediction model specifically comprises:
calculating the coefficient of determination R² and the root mean square error RMSE;
the formulas for the coefficient of determination R² and the root mean square error RMSE are:
[Equation image BDA0003726654270000101 (formulas for R² and RMSE) is not reproduced in this text.]
where m represents the number of machine learning model features, ŷ represents the predicted breeding value, ȳ represents the average of the breeding values, and y represents the true breeding value.
R² ranges from 0 to 1, and the closer it is to 1, the higher the prediction accuracy; RMSE ranges from 0 to infinity, and the closer it is to 0, the higher the prediction accuracy.
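The sketch below shows how the two metrics and the averaging across models might be computed. It uses the standard textbook definitions of R² and RMSE, averaged over observations; since the formula image is missing and the text defines m as the number of features, the filing's exact expressions may differ. The subset names, model names and numbers are hypothetical.

```python
from math import sqrt

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error over the observations."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical true breeding values and predictions from two models
# for two candidate SNP subsets.
y_true = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "subset_a": {"model_1": [1.0, 2.0, 3.0, 4.0], "model_2": [1.0, 2.0, 3.0, 5.0]},
    "subset_b": {"model_1": [2.5, 2.5, 2.5, 2.5], "model_2": [1.0, 2.0, 3.0, 5.0]},
}

# Average R² of each subset across all models, then pick the best subset.
average_r2 = {
    subset: sum(r_squared(y_true, p) for p in by_model.values()) / len(by_model)
    for subset, by_model in predictions.items()
}
best_subset = max(average_r2, key=average_r2.get)
```

Averaging over several models, rather than trusting a single one, rewards SNP subsets that predict well regardless of the learning algorithm.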
The embodiment of the invention has the following effects:
according to the method, the high-dimensional data can be prevented from being over-fitted through feature selection in machine learning, namely, SNP subsets are selected from SNP sites, the SNP site selection method can prevent the high-dimensional data from being over-fitted, when the features have high redundancy, key SNP sites in the gene data are selected through different feature selection methods, and the SNP sites with high stability are selected, so that the stability is high, the feature screening quantity is free, and the stability and the accuracy of the SNP sites are improved. A plurality of sample sets are randomly extracted from gene data, different feature selection methods are respectively used for feature selection of the sample sets, and the key SNP sites are screened out from the SNP sites with huge number, so that the dimensionality of the SNP sites is greatly reduced, and the accuracy of screening the SNP sites is improved.
The method also constructs the breeding value prediction models using machine learning algorithms. Machine learning algorithms encompass diverse techniques and non-parametric models for prediction and identification on large-scale data sets, and they do not require the data distribution or the model form to be assumed in advance, so both the linear and the non-linear relations of the SNP sites can be identified well, which improves the accuracy of SNP site selection. The method combines gene data, pedigree data and phenotype data, calculates the breeding values of the phenotypic characters as the prediction targets, and screens out, through the breeding value prediction models, the key SNP sites strongly related to the phenotypic characters; this eliminates the influence of environmental effects, population stratification and direct measurement, and improves the accuracy and universality of SNP site screening.
Before the SNP sites are screened through feature selection in machine learning, quality control, variance filtering and high collinearity operation are carried out on gene data, and partial SNP sites and partial gene data causing large errors to a sample are deleted according to the distribution condition of the gene data, so that errors caused by extreme distribution of the gene data are reduced, adverse effects on screening of the SNP sites through subsequent feature selection are reduced, and the accuracy of SNP site screening work is improved.
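The variance filtering and high collinearity steps described above could look roughly like the following sketch. It assumes genotypes coded 0/1/2 per animal, uses Pearson correlation as the collinearity measure, and uses illustrative thresholds (`min_variance`, `max_abs_corr`); none of these choices is stated in the embodiment, and the quality control step (for example, filtering on missing calls) is not shown.

```python
from math import sqrt
from statistics import pvariance

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def prefilter_snps(genotypes, min_variance=0.05, max_abs_corr=0.95):
    """genotypes: {snp_id: [0/1/2 code per animal]} (assumed coding).

    Drops SNPs whose genotype variance is below min_variance, then drops
    any SNP that is highly correlated with an SNP that was already kept.
    """
    kept = []
    for snp_id, column in genotypes.items():
        if pvariance(column) < min_variance:
            continue  # near-constant site carries little information
        if any(abs(pearson(column, genotypes[k])) > max_abs_corr for k in kept):
            continue  # redundant with an SNP already retained
        kept.append(snp_id)
    return kept

# Toy genotype table with hypothetical SNP identifiers.
genotypes = {
    "snp1": [0, 1, 2, 1, 0, 2],
    "snp2": [0, 1, 2, 1, 0, 2],  # identical to snp1, removed as collinear
    "snp3": [1, 1, 1, 1, 1, 1],  # zero variance, removed
    "snp4": [2, 0, 1, 0, 2, 1],
}
kept = prefilter_snps(genotypes)
```

The pairwise comparison is quadratic in the number of retained SNPs, so a real whole-genome data set would likely need a windowed or otherwise restricted comparison.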
Example two
As shown in FIG. 2, the invention also provides a manufacturing method of the pig whole genome SNP locus chip, which comprises the following steps:
Step S201: selecting key SNP sites to design probes.
Step S202: manufacturing an SNP site chip according to the probes.
The key SNP sites are screened using the pig whole genome SNP site screening method described in Example one.
The embodiment of the invention has the following effects:
The method manufactures the SNP chip from the selected key SNP sites, which reduces cost; the chip can be applied to different pig populations with similar prediction performance and strong stability.
Example three
As shown in fig. 3, the present invention also provides a pig whole genome SNP site screening device, including: a feature selection module 301, a model construction module 302, and a site determination module 303.
The feature selection module 301 is configured to extract a plurality of sample sets from a gene data sample, and to perform feature selection on all the sample sets with each of a plurality of feature selection methods to obtain a first SNP subset of a plurality of preset sites; to determine the stability level of each feature selection method by calculating the Hamming distance of the first SNP subset of each preset site; and to select the top N feature selection methods, ranked by stability level from high to low, as the optimal feature selection methods, and select a plurality of second SNP subsets from the plurality of sample sets as a plurality of machine learning model features according to the optimal feature selection methods; wherein N is a positive integer;
the model construction module 302 is used for respectively constructing a plurality of corresponding breeding value initial models according to a plurality of machine learning algorithms; adjusting parameters of each breeding value initial model according to a nested cross validation method, and determining optimal parameters of each breeding value initial model; constructing each corresponding breeding value prediction model according to the optimal parameters;
the locus determining module 303 is configured to input each of the machine learning model features into each of the breeding value prediction models, and output a plurality of predicted breeding values corresponding to each of the breeding value prediction models; and calculating the average prediction precision value of each machine learning model feature in a plurality of breeding value prediction models according to the machine learning model feature quantity, the predicted breeding value and the real breeding value of each breeding value prediction model, and determining the final SNP site with the highest average prediction precision value.
Wherein, the feature selection module includes: a site deletion unit and a site selection unit;
the site deleting unit is used for gradually carrying out quality control, variance filtering and high collinearity operation on the gene data, deleting partial SNP sites and partial gene data, and obtaining screened SNP sites and gene data samples; the screened SNP locus is an SNP locus on a gene sequence in the gene data sample;
randomly extracting a plurality of samples from the gene data samples to serve as a sample set, and repeatedly and randomly extracting the samples for a plurality of times in a replacement mode to obtain a plurality of sample sets; respectively carrying out feature selection on all sample sets by using a first filtering feature selection method, a second filtering feature selection method, a first embedded feature selection method, a second embedded feature selection method and a random feature selection method so as to obtain SNP subsets corresponding to five preset sites in each sample set;
the site selection unit is used for determining the stability level of each feature selection method by calculating the Hamming distance of the first SNP subset of each preset site; selecting the top N feature selection methods, ranked by stability level from high to low, as the optimal feature selection methods, and selecting a plurality of second SNP subsets from the plurality of sample sets as a plurality of machine learning model features according to the optimal feature selection methods; wherein N is a positive integer.
The site determination module is also used for calculating gene data, pedigree data and phenotypic value data to obtain a breeding value of the phenotypic character, the breeding value of the phenotypic character being used as the prediction target of a breeding value prediction model; and for inputting the machine learning model features into the breeding value prediction model so that the breeding value prediction model outputs the corresponding predicted breeding values according to the prediction target.
The pig whole genome SNP site screening device can implement the pig whole genome SNP site screening method of the method embodiment. The alternatives in the above method embodiment also apply to this embodiment; for the remaining details, refer to the above method embodiment, which are not repeated here.
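The stability ranking carried out by the site selection unit could be sketched as follows. The embodiment states only that the Hamming distance between the selected SNP subsets is calculated; the particular score used here (one minus the mean pairwise Hamming distance divided by the total number of SNPs) and the names `stability` and `top_n_methods` are assumptions made for illustration.

```python
from itertools import combinations

def hamming(subset_a, subset_b):
    """Hamming distance between two SNP subsets viewed as 0/1 masks.

    It equals the number of SNPs selected in exactly one of the two subsets.
    """
    return len(set(subset_a) ^ set(subset_b))

def stability(subsets, n_snps):
    """Assumed score: 1 - (mean pairwise Hamming distance / n_snps)."""
    pairs = list(combinations(subsets, 2))
    mean_distance = sum(hamming(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_distance / n_snps

def top_n_methods(selections, n_snps, n):
    """selections: {method_name: [SNP subset chosen on each sample set]}.

    Returns the n method names with the highest stability score.
    """
    scores = {name: stability(subsets, n_snps) for name, subsets in selections.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Toy example: 10 candidate SNPs, three sample sets, two hypothetical methods.
selections = {
    "filter_a": [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}],
    "random":   [{1, 5, 9}, {2, 6, 7}, {0, 3, 8}],
}
best_methods = top_n_methods(selections, n_snps=10, n=1)
```

A method that picks nearly the same SNPs on every sample set scores close to 1, while a random selector, which shares few SNPs between sample sets, scores lower and serves as a baseline.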
The embodiment of the invention has the following effects:
The feature selection module in the pig whole genome SNP site screening device selects SNP subsets from the SNP sites through feature selection in machine learning, which helps prevent over-fitting of high-dimensional data. When the features are highly redundant, the key SNP sites in the gene data are selected by different feature selection methods and the SNP sites with high stability are retained, so that the selection is stable, the number of screened features is flexible, and the stability and accuracy of the selected SNP sites are improved.
The model construction module and the site determination module construct a breeding value prediction model by utilizing a machine learning algorithm, and determine key SNP sites through calculation of the model. The machine learning algorithm comprises diversified technologies and non-parametric models and is used for predicting and identifying large data sets, and data distribution and models do not need to be assumed in advance in the algorithms, so that the linear and non-linear relations of the SNP sites can be well identified, and the accuracy of SNP site selection is improved.
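The parameter adjustment by nested cross validation performed by the model construction module can be illustrated with the following minimal sketch. A k-nearest-neighbour regressor stands in for the breeding value model purely to keep the example dependency-free; the model, the parameter name `n_neighbors`, the fold counts and the use of RMSE as the selection criterion are all assumptions, not details given in the embodiment.

```python
from math import sqrt

def k_folds(n, k):
    """Yield (train_positions, held_out_positions) for k contiguous folds of range(n)."""
    positions = list(range(n))
    size, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        stop = start + size + (1 if fold < extra else 0)
        yield positions[:start] + positions[stop:], positions[start:stop]
        start = stop

def knn_predict(train_x, train_y, x, n_neighbors):
    """Mean target of the n_neighbors training rows closest to x (squared Euclidean)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    nearest = order[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)

def fold_rmse(x, y, train, held_out, n_neighbors):
    """RMSE on held_out rows of a model fitted on train rows."""
    tx, ty = [x[i] for i in train], [y[i] for i in train]
    errors = [(y[i] - knn_predict(tx, ty, x[i], n_neighbors)) ** 2 for i in held_out]
    return sqrt(sum(errors) / len(errors))

def nested_cv(x, y, grid, outer_k=3, inner_k=2):
    """Outer loop: first training set / test set. Inner loop: second training set / validation set."""
    results = []
    for outer_train, outer_test in k_folds(len(x), outer_k):
        def inner_score(param):
            scores = [fold_rmse(x, y,
                                [outer_train[i] for i in tr],
                                [outer_train[i] for i in va], param)
                      for tr, va in k_folds(len(outer_train), inner_k)]
            return sum(scores) / len(scores)
        best_param = min(grid, key=inner_score)  # tuned on inner folds only
        results.append((best_param, fold_rmse(x, y, outer_train, outer_test, best_param)))
    return results

# Toy data: one feature, linear target. Real data would be shuffled before splitting.
x = [[float(i)] for i in range(12)]
y = [2.0 * i for i in range(12)]
results = nested_cv(x, y, grid=[1, 3])
```

Because the parameter is chosen using only the inner validation sets, the outer test set error is not biased by the tuning step, which is the usual motivation for nesting the two loops.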
The site deletion unit of the feature selection module of the pig whole genome SNP site screening device further performs quality control, variance filtering and high collinearity operation on gene data before screening the SNP sites through feature selection in machine learning, partial SNP sites and partial gene data causing large errors to samples are deleted according to the distribution situation of the gene data, errors caused by extreme distribution of the gene data are reduced, adverse effects on subsequent feature selection screening of the SNP sites are reduced, and the accuracy of SNP site screening work is improved.
The site selection unit respectively uses different feature selection methods to perform feature selection on a plurality of sample sets, and screens key SNP sites from SNP sites with huge number, thereby greatly reducing the dimensionality of the SNP sites and improving the accuracy of screening the SNP sites.
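The repeated random extraction of sample sets on which the feature selection methods are run might be sketched as below. It is assumed here that each sample set is drawn with replacement (a bootstrap sample) from the animal identifiers; the function name `draw_sample_sets`, the set size and the fixed seed are illustrative choices only.

```python
import random

def draw_sample_sets(animal_ids, n_sets, set_size, seed=0):
    """Draw n_sets sample sets, each of set_size animals sampled with replacement."""
    rng = random.Random(seed)  # local generator so results are reproducible
    return [rng.choices(animal_ids, k=set_size) for _ in range(n_sets)]

# Hypothetical population of 20 animals, 5 sample sets of 10 animals each.
animal_ids = list(range(20))
sample_sets = draw_sample_sets(animal_ids, n_sets=5, set_size=10, seed=1)
```

Each feature selection method would then be applied to every sample set, so that the agreement between the SNP subsets it returns can be compared across sets.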
Example four
Accordingly, the present invention also provides a computer readable storage medium, which includes a stored computer program, wherein when the computer program runs, the apparatus on which the computer readable storage medium is located is controlled to execute the pig genome-wide SNP site screening method according to any one of the above embodiments.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the terminal device.
The terminal device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The terminal device may include, but is not limited to, a processor, a memory.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor or the like; the processor is the control center of the terminal device and connects the various parts of the whole terminal device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the terminal device by running or executing the computer programs and/or modules stored in the memory and calling data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the mobile terminal, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
The modules/units integrated in the terminal device may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier wave signal, a telecommunications signal, a software distribution medium, and the like.
The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above-mentioned embodiments are only examples of the present invention and are not intended to limit the scope of the present invention. It should be understood that any modifications, equivalents, improvements and the like, which come within the spirit and principle of the invention, may occur to those skilled in the art and are intended to be included within the scope of the invention.

Claims (10)

1. A pig whole genome SNP site screening method is characterized by comprising the following steps:
extracting a plurality of sample sets from a gene data sample, and respectively selecting a plurality of feature selection methods to perform feature selection on all the sample sets to obtain a first SNP subset of a plurality of preset loci;
determining the stability level of each feature selection method by calculating the Hamming distance of the first SNP subset of each preset locus; selecting the top N feature selection methods, ranked by stability level from high to low, as the optimal feature selection methods, and selecting a plurality of second SNP subsets from the plurality of sample sets as a plurality of machine learning model features according to the optimal feature selection methods; wherein N is a positive integer;
respectively constructing a plurality of corresponding breeding value initial models according to a plurality of machine learning algorithms; adjusting parameters of each breeding value initial model according to a nested cross validation method, and determining optimal parameters of each breeding value initial model; constructing each corresponding breeding value prediction model according to the optimal parameters;
inputting the characteristics of each machine learning model into each breeding value prediction model respectively, and outputting a plurality of predicted breeding values corresponding to each breeding value prediction model;
and calculating the average prediction precision value of each machine learning model feature in a plurality of breeding value prediction models according to the machine learning model feature quantity, the predicted breeding value and the real breeding value of each breeding value prediction model, and determining the final SNP site with the highest average prediction precision value.
2. The method for screening the SNP loci of the whole genome of the pig as claimed in claim 1, further comprising:
gradually performing quality control, variance filtering and high collinearity operation on the gene data, deleting partial SNP loci and partial gene data, and obtaining screened SNP loci and gene data samples; wherein the screened SNP loci are SNP loci on the gene sequences in the gene data samples.
3. The method for screening the SNP loci of the whole genome of a pig as claimed in claim 1, wherein a plurality of sample sets are extracted from the gene data sample, a plurality of feature selection methods are respectively selected to perform feature selection on all the sample sets, and a first SNP subset of a plurality of preset loci is obtained, specifically:
randomly extracting a plurality of samples from the gene data samples to serve as a sample set, and repeatedly and randomly extracting the samples for a plurality of times in a replacement mode to obtain a plurality of sample sets; and respectively carrying out feature selection on all sample sets by using a first filtering feature selection method, a second filtering feature selection method, a first embedded feature selection method, a second embedded feature selection method and a random feature selection method, so that each feature selection method selects a plurality of first SNP subsets of preset sites in each sample set.
4. The method for screening the SNP loci of the whole genome of the pigs according to claim 1, wherein the parameters of each breeding value initial model are adjusted according to a nested cross validation method, and the optimal parameters of each breeding value initial model are determined, specifically:
dividing the data set of each breeding value initial model into a plurality of first training sets and a plurality of corresponding test sets, and dividing each first training set into a plurality of second training sets and a plurality of corresponding verification sets;
setting preset parameters of each breeding value initial model, and calculating the accuracy value of each corresponding breeding value initial model by using the data set; and determining the optimal parameters of each breeding value initial model according to the accuracy values of the breeding value initial models under different preset parameters.
5. The method for screening the SNP loci of the whole pig genome according to claim 4, wherein the method comprises the steps of inputting the characteristics of each machine learning model into each breeding value prediction model, and outputting a plurality of predicted breeding values corresponding to each breeding value prediction model, specifically:
calculating gene data, pedigree data and phenotypic value data to obtain breeding values of phenotypic characters, and using the breeding values of the phenotypic characters as prediction targets of a breeding value prediction model;
inputting the machine learning model features into the breeding value prediction model so that the breeding value prediction model outputs corresponding predicted breeding values according to the predicted targets.
6. A manufacturing method of a pig whole genome SNP locus chip is characterized by comprising the following steps:
selecting key SNP sites to design probes for manufacturing SNP site chips;
the method for screening the key SNP sites uses the method for screening the porcine whole genome SNP sites as set forth in any one of claims 1 to 5.
7. A pig whole genome SNP site screening device, characterized by comprising: a feature selection module, a model construction module and a site determination module;
the feature selection module is used for extracting a plurality of sample sets from a gene data sample, and respectively selecting a plurality of feature selection methods to perform feature selection on all the sample sets to obtain a first SNP subset of a plurality of preset sites;
determining the stability level of each feature selection method by calculating the Hamming distance of the first SNP subset of each preset locus; selecting the top N feature selection methods, ranked by stability level from high to low, as the optimal feature selection methods, and selecting a plurality of second SNP subsets from the plurality of sample sets as a plurality of machine learning model features according to the optimal feature selection methods; wherein N is a positive integer;
the model construction module is used for respectively constructing a plurality of corresponding breeding value initial models according to a plurality of machine learning algorithms; adjusting parameters of each breeding value initial model according to a nested cross validation method, and determining optimal parameters of each breeding value initial model; constructing each corresponding breeding value prediction model according to the optimal parameters;
the locus determining module is used for respectively inputting the characteristics of each machine learning model into each breeding value prediction model and outputting a plurality of predicted breeding values corresponding to each breeding value prediction model;
and calculating the average prediction precision value of each machine learning model feature in a plurality of breeding value prediction models according to the machine learning model feature quantity, the predicted breeding value and the real breeding value of each breeding value prediction model, and determining the final SNP locus with the highest average prediction precision value.
8. The pig whole genome SNP site screening device of claim 7, wherein the feature selection module comprises: a site deletion unit and a site selection unit;
the site deletion unit is used for gradually carrying out quality control, variance filtration and high collinearity operation on the gene data, deleting partial SNP sites and partial gene data, and obtaining screened SNP sites and gene data samples; the screened SNP locus is an SNP locus on a gene sequence in the gene data sample;
randomly extracting a plurality of samples from the gene data samples to serve as a sample set, and repeatedly and randomly extracting the samples for a plurality of times in a replacement mode to obtain a plurality of sample sets; respectively carrying out feature selection on all sample sets by using a first filtering feature selection method, a second filtering feature selection method, a first embedded feature selection method, a second embedded feature selection method and a random feature selection method so that each feature selection method selects a plurality of first SNP subsets of preset sites in each sample set;
the site selection unit is used for determining the stability level of each feature selection method by calculating the Hamming distance of the first SNP subset of each preset site; selecting the top N feature selection methods, ranked by stability level from high to low, as the optimal feature selection methods, and selecting a plurality of second SNP subsets from the plurality of sample sets as a plurality of machine learning model features according to the optimal feature selection methods; wherein N is a positive integer.
9. The apparatus for screening the SNP loci of the whole pig genome according to claim 7, wherein the characteristics of each machine learning model are inputted into each breeding value prediction model, and a plurality of predicted breeding values corresponding to each breeding value prediction model are outputted, specifically:
calculating gene data, pedigree data and phenotypic value data to obtain breeding values of phenotypic characters, and using the breeding values of the phenotypic characters as prediction targets of a breeding value prediction model;
inputting the machine learning model features into the breeding value prediction model so that the breeding value prediction model outputs corresponding predicted breeding values according to the predicted targets.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program; wherein the computer program controls the device on which the computer readable storage medium is located to execute the pig whole genome SNP site screening method according to any one of claims 1 to 5 when being executed.
CN202210768898.1A 2022-07-01 2022-07-01 Pig whole genome SNP site screening method, device and storage medium Pending CN115019885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210768898.1A CN115019885A (en) 2022-07-01 2022-07-01 Pig whole genome SNP site screening method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210768898.1A CN115019885A (en) 2022-07-01 2022-07-01 Pig whole genome SNP site screening method, device and storage medium

Publications (1)

Publication Number Publication Date
CN115019885A true CN115019885A (en) 2022-09-06

Family

ID=83078174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210768898.1A Pending CN115019885A (en) 2022-07-01 2022-07-01 Pig whole genome SNP site screening method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115019885A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115579060A (en) * 2022-12-08 2023-01-06 国家超级计算天津中心 Gene locus detection method, device, equipment and medium
CN116072226A (en) * 2023-01-17 2023-05-05 中国农业大学 Machine learning method and system for selecting laying hen egg-laying character genome

Similar Documents

Publication Publication Date Title
CN115019885A (en) Pig whole genome SNP site screening method, device and storage medium
CN109657805B (en) Hyper-parameter determination method, device, electronic equipment and computer readable medium
Pavlidis et al. A critical assessment of storytelling: gene ontology categories and the importance of validating genomic scans
Yuan et al. IntSIM: an integrated simulator of next-generation sequencing data
Holland et al. Accuracy of ancestral state reconstruction for non-neutral traits
CN115394358B (en) Single-cell sequencing gene expression data interpolation method and system based on deep learning
CN112579462B (en) Test case acquisition method, system, equipment and computer readable storage medium
CN112559374A (en) Test case sequencing method and electronic equipment
CN111028884A (en) Filling method and device for genotype data deletion and server
Li et al. SRHiC: a deep learning model to enhance the resolution of Hi-C data
CN116401555A (en) Method, system and storage medium for constructing double-cell recognition model
Bisschop et al. Sweeps in time: leveraging the joint distribution of branch lengths
CN111582315A (en) Sample data processing method and device and electronic equipment
Heydari et al. Deep learning in spatial transcriptomics: Learning from the next next-generation sequencing
CN112017730B (en) Cell screening method and device based on expression quantity prediction model
CN111666991A (en) Convolutional neural network-based pattern recognition method and device and computer equipment
Sun et al. Two stages biclustering with three populations
CN112259161A (en) Disease risk assessment system, method, device and storage medium
CN113782092B (en) Method and device for generating lifetime prediction model and storage medium
CN115423159A (en) Photovoltaic power generation prediction method and device and terminal equipment
EP4027271A1 (en) Information processing apparatus, information processing method, and information processing program
CN115691664A (en) Conservative calculation method of plant phosphate loci
CN113868939A (en) Wind power probability density evaluation method, device, equipment and medium
Ramachandran et al. Deep learning for better variant calling for cancer diagnosis and treatment
CN117632770B (en) Multipath coverage test case generation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination