CN117275575A - Liquid phase chip pair SNP-based deep learning discrimination method for pig variety identification - Google Patents

Publication number
CN117275575A
CN117275575A CN202311444798.4A CN202311444798A CN117275575A CN 117275575 A CN117275575 A CN 117275575A CN 202311444798 A CN202311444798 A CN 202311444798A CN 117275575 A CN117275575 A CN 117275575A
Authority
CN
China
Legal status
Pending
Application number
CN202311444798.4A
Other languages
Chinese (zh)
Inventor
叶伟健
张嘉楠
周靖航
黄菁菁
杜永旺
Current Assignee
Shijiazhuang Breeding Biotechnology Co ltd
Original Assignee
Shijiazhuang Breeding Biotechnology Co ltd
Application filed by Shijiazhuang Breeding Biotechnology Co ltd
Priority to CN202311444798.4A
Publication of CN117275575A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Analytical Chemistry (AREA)
  • Public Health (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a deep learning discrimination method for pig breed identification based on SNPs from a liquid-phase chip, and relates in particular to the field of intelligent discrimination of SNP data. By modeling with machine learning algorithms, an optimal combination of SNP loci is selected and a 1K liquid-phase SNP chip is designed. This improves the accuracy of identifying purebred and crossbred pigs, in particular Landrace pigs, Large White pigs and their crosses, markedly reduces the cost of breed identification, and helps prevent and combat the sale of counterfeit breeding pigs.

Description

Liquid phase chip pair SNP-based deep learning discrimination method for pig variety identification
Technical Field
The invention relates to the field of intelligent discrimination of SNP data, and in particular to a deep learning discrimination method for pig breed identification based on SNPs from a liquid-phase chip.
Background
Breed identification and ancestry identification of pigs are necessary steps in breeding work and in the live pig trade, and are a basic requirement for industrial application. The market value of different pig breeds differs considerably, and binary (two-way cross) sows are occasionally passed off as purebred breeding stock. Some binary pigs closely resemble purebred pigs in body shape and appearance, so inexperienced farmers are easily misled. For example, the binary offspring of Landrace and Large White pigs look very similar to their parents and are relatively difficult to distinguish. Besides counterfeit breeding stock on the market, misjudging the breed in actual breeding occasionally leads to incorrect mating plans being made and carried out, which disrupts the intended breeding program and delays genetic progress in the population.
Existing breed identification methods include pedigree identification, appearance identification and DNA molecular marker identification, each of which has shortcomings. Pedigree records are kept manually and in practice are often incomplete or inaccurate. In the breeding pig trade the seller may be unable to provide pedigree records, or may provide falsified ones, in which case the breed cannot be determined reliably from the pedigree file. Appearance identification demands considerable practical experience and professional skill from the assessor and is strongly affected by subjective judgment; when breeds with similar appearance are encountered, an accurate judgment is difficult without sufficient experience. There is also academic literature that uses genomic information to classify breeds by identifying and comparing DNA molecular markers in pigs. However, such studies use few marker loci, achieve low identification accuracy, cannot effectively resolve the ancestry proportions of binary pigs, and incur chip development and detection costs that are too high for commercial application.
Disclosure of Invention
The invention is aimed at the identification of breeding pigs, mainly distinguishing Large White, Landrace, Duroc, binary (two-way cross) and ternary (three-way cross) pigs. It addresses the problems encountered in practice when identifying purebred and binary pigs: low accuracy, heavy reliance on the practical experience and professional level of the assessor, and the high cost of other molecular identification technologies.
The deep learning discrimination method for pig breed identification based on liquid-phase chip SNPs comprises the following steps:
S1, collecting breed information and genotyping results for purebred pigs and their crossbred offspring;
S2, filtering the genotype data for quality control;
S3, building a deep learning model;
S4, tuning and optimizing the model parameters, and determining the optimal SNP locus set for breed identification;
S5, verifying the results on a test population.
Further, step S1 specifically comprises: collecting a number of individuals with breed information and genotype data as the training population of the breed identification model, the breeds including at least Large White, Landrace, Duroc, Landrace × Large White binary pigs and Duroc × Landrace × Large White ternary pigs.
Further, in step S2, the filtering specifically comprises SNP locus call rate filtering and sample call rate filtering;
wherein the SNP locus call rate filtering comprises the following steps:
S201, deleting loci with a minor allele frequency less than 0.05;
S202, deleting loci with a Hardy-Weinberg equilibrium p-value less than 0.0001;
S203, deleting loci with linkage disequilibrium R² less than 0.5;
the sample call rate filtering specifically comprises: deleting samples with an SNP locus call rate less than 0.9.
Further, step S3 specifically comprises the following substeps:
S301, genotype data encoding: encoding genotypes composed of A, T, C and G as 0, 1 or 2;
S302, phenotype label vectorization: replacing breed information, or phenotypes reflecting breed characteristics, with numerical label values and vectorizing them;
S303, converting the genotype feature values into numerical codes with the target encoder TargetEncoder, which replaces each categorical feature value with the mean label value for that feature value;
S304, preliminarily constructing a classification model from the encoded label values and feature values using XGBoost, a decision-tree-based algorithm, listing the features of all loci and ranking them by feature importance;
S305, selecting the top-ranked loci, up to a set number, as the locus set for the training model.
Further, in step S4, the optimization specifically comprises the following substeps:
S401, constructing the training model with a DataLoader, and splitting the test population into a training set and a validation set in a ratio of 8:2;
S402, building the deep learning network layers;
S403, using Adam as the optimizer to tune the parameters so as to minimize the model loss function, the loss being evaluated by cross entropy;
S404, setting the number of epochs to 100, recording the training set loss and validation set loss of every epoch, and optimizing model performance on the basis of these loss values.
Further, step S5 specifically comprises: using the genotype data of each pig breed as a reference database, and estimating the ancestry proportions of the individuals to be tested.
Further, the estimated ancestry proportions of an individual to be tested are interpreted as follows: if the sample's ancestry proportion for some breed is greater than or equal to 75%, the sample is judged to be a purebred pig of that breed; otherwise, it is further determined whether the sample's ancestry proportion for some breed is less than 20%; if so, the sample is judged to be a binary pig, and otherwise a ternary pig.
The beneficial effects of the invention are as follows:
The invention provides a machine learning method that reduces dimensionality and selects an optimal SNP combination for breed identification, including a set of SNP loci for breed and ancestry identification of purebred and crossbred pigs. By modeling with machine learning algorithms, an optimal combination of SNP loci is selected and a 1K liquid-phase SNP chip is designed. This improves the accuracy of identifying purebred and crossbred pigs, in particular Landrace pigs, Large White pigs and their crosses, markedly reduces the cost of breed identification, and helps prevent and combat the sale of counterfeit breeding pigs.
Drawings
FIG. 1 is a schematic flow chart of the deep learning discrimination method for pig breed identification based on liquid-phase chip SNPs;
FIG. 2 is a schematic diagram of the feature importance data in the method according to an embodiment of the invention;
FIG. 3 is a network structure diagram of the deep learning network layers in the method according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a terminal device according to an embodiment of the invention;
FIG. 5 is a schematic structural diagram of a product implementing the method according to an embodiment of the invention.
Detailed Description
The technical solution of the present invention will be described in further detail with reference to the accompanying drawings, but the scope of the present invention is not limited to the following description.
For the purpose of making the technical solution and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the particular embodiments described herein are illustrative only and are not intended to limit the invention, i.e., the embodiments described are merely some, but not all, of the embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention. It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The features and capabilities of the present invention are described in further detail below in connection with the examples.
As shown in FIG. 1, the deep learning discrimination method for pig breed identification based on liquid-phase chip SNPs comprises the following steps:
S1, collecting breed information and genotyping results for purebred pigs and their crossbred offspring;
S2, filtering the genotype data for quality control;
S3, building a deep learning model;
S4, tuning and optimizing the model parameters, and determining the optimal SNP locus set for breed identification;
S5, verifying the results on a test population.
Further, step S1 specifically comprises: collecting a number of individuals with breed information and genotype data as the training population of the breed identification model, the breeds including at least Large White, Landrace, Duroc, Landrace × Large White binary pigs and Duroc × Landrace × Large White ternary pigs.
Further, in step S2, the filtering specifically comprises SNP locus call rate filtering and sample call rate filtering;
wherein the SNP locus call rate filtering comprises the following steps:
S201, deleting loci with a minor allele frequency less than 0.05. A given SNP locus has two or more different alleles; the minor allele frequency is the frequency of the less common allele among all samples studied. Loci with a minor allele frequency below 0.05 are deleted to remove SNPs that are very rare in the sample, since such data may carry too little information for the analysis and may also result from measurement error.
S202, deleting loci with a Hardy-Weinberg equilibrium p-value less than 0.0001. Hardy-Weinberg equilibrium (HWE) is a genetic concept describing the equilibrium state of allele frequencies in a stable population; under HWE, allele frequencies do not change. When the HWE value of a locus (typically calculated using a statistical test) is less than 0.0001, the genotype frequencies at that locus are inconsistent with HWE, possibly because of selection bias, population structure or other factors.
S203, deleting loci with linkage disequilibrium R² less than 0.5. Linkage disequilibrium (LD) refers to the non-random association between alleles at two or more loci, i.e. whether they tend to be transmitted to offspring together. R² is a statistic measuring the strength of LD between two loci; it ranges from 0 to 1, where 0 indicates no LD and 1 indicates complete LD. The R² threshold of 0.5 is applied with the aim of removing highly correlated loci, so that the loci under analysis are relatively independent of one another and redundant information is avoided.
The sample call rate filtering specifically comprises: deleting samples with an SNP locus call rate less than 0.9.
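The thresholds in step S2 can be illustrated with a minimal Python sketch. It assumes genotypes are already coded as 0/1/2 copies of one allele with `None` for a failed call, and it assumes the Hardy-Weinberg value is the p-value of a one-degree-of-freedom chi-square test; the source does not name the test or the software used. The function names are hypothetical, and the linkage disequilibrium step is left out because it needs pairwise R² values between loci.

```python
import math

MISSING = None  # hypothetical marker for a failed genotype call


def maf(calls):
    """Minor allele frequency from 0/1/2 allele counts, ignoring missing calls."""
    observed = [g for g in calls if g is not MISSING]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)


def hwe_pvalue(calls):
    """Chi-square (1 d.f.) test of Hardy-Weinberg genotype proportions (assumed test)."""
    observed = [g for g in calls if g is not MISSING]
    n = len(observed)
    if n == 0:
        return 1.0
    p = sum(observed) / (2 * n)
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    counts = [observed.count(k) for k in (0, 1, 2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected) if e > 0)
    return math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 d.f.


def call_rate(calls):
    """Fraction of non-missing calls."""
    return sum(g is not MISSING for g in calls) / len(calls)


def quality_control(genotypes, maf_min=0.05, hwe_min=1e-4, sample_call_min=0.9):
    """genotypes: list of samples, each a list with one 0/1/2/None entry per locus.
    Returns (kept_sample_indices, kept_locus_indices)."""
    kept_loci = []
    for j in range(len(genotypes[0])):
        column = [sample[j] for sample in genotypes]
        if maf(column) >= maf_min and hwe_pvalue(column) >= hwe_min:
            kept_loci.append(j)
    kept_samples = [i for i, s in enumerate(genotypes)
                    if call_rate(s) >= sample_call_min]
    return kept_samples, kept_loci


# Toy data (hypothetical): locus 0 is in HWE, locus 1 is monomorphic,
# locus 2 has no heterozygotes; the last sample has no successful calls.
locus0 = [0] * 5 + [1] * 10 + [2] * 5
locus1 = [0] * 20
locus2 = [0] * 10 + [2] * 10
genotypes = [[locus0[i], locus1[i], locus2[i]] for i in range(20)]
genotypes.append([None, None, None])
kept_samples, kept_loci = quality_control(genotypes)
```

In this toy data only locus 0 passes both locus filters, and the sample without calls is removed by the 0.9 call rate threshold.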
Further, step S3 specifically comprises the following substeps:
S301, genotype data encoding: encoding genotypes composed of A, T, C and G as 0, 1 or 2. Genotype data are usually written with the letters A, T, C and G (adenine, thymine, cytosine and guanine), pairs of which make up the different genotypes. The genotype data are encoded as numerical values, typically 0, 1 or 2. Illustratively, 0 may represent genotype AA, 1 genotype Aa, and 2 genotype aa. This encoding allows the genotype data to be processed by a mathematical model.
S302, phenotype label vectorization: replacing breed information, or phenotypes reflecting breed characteristics, with numerical label values and vectorizing them. The phenotype data include breed information or other data reflecting the characteristics of the breed. Classification algorithms in machine learning typically require numerical input, so in this step the phenotype data are vectorized, representing the breed information or other features with numerical values. Illustratively, the categorical features are converted into binary vectors by one-hot encoding.
S303, converting the genotype feature values into numerical codes with the target encoder TargetEncoder, which replaces each categorical feature value with the mean label value for that feature value;
S304, preliminarily constructing a classification model from the encoded label values and feature values using XGBoost, a decision-tree-based algorithm, listing the features of all loci and ranking them by feature importance;
S305, the classification model preliminarily constructed with XGBoost is used to screen the features of the target set, which are ranked by feature importance as shown in FIG. 2. XGBoost is a gradient-boosted tree algorithm commonly used for classification and regression problems. In this step it provides preliminary feature selection and feature importance ranking;
S306, selecting the top-ranked loci, up to a set number, as the locus set for the training model. In genomics and genetics research, a subset of loci is usually selected from a large number of loci as the locus set for model training. Here the selection is based on the feature importance ranking from the XGBoost model in S305: the top-ranked loci are taken as the final locus set for constructing the final classification model.
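The encoding substeps S301 and S303 can be sketched in plain Python. This is a minimal illustration under stated assumptions: the reference and alternate alleles of each locus are known, genotype calls arrive as two-letter strings, and target encoding is the unsmoothed per-category label mean described in the text. The function names and toy labels are hypothetical; library target encoders typically add smoothing, which is omitted here.

```python
from collections import defaultdict


def encode_genotype(call, ref, alt):
    """Count copies of the alternate allele in a two-letter call such as 'AG'."""
    if len(call) != 2 or any(base not in (ref, alt) for base in call):
        return None  # missing or unexpected call
    return sum(base == alt for base in call)


def target_encode(feature_values, labels):
    """Replace each category by the mean label observed for that category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(feature_values, labels):
        totals[value] += label
        counts[value] += 1
    means = {value: totals[value] / counts[value] for value in counts}
    return [means[value] for value in feature_values]


calls = ["AA", "AG", "GG", "AG"]                    # one locus, four samples
codes = [encode_genotype(c, ref="A", alt="G") for c in calls]
labels = [0, 1, 1, 0]                               # hypothetical numeric labels
encoded = target_encode(codes, labels)
print(codes)    # [0, 1, 2, 1]
print(encoded)  # [0.0, 0.5, 1.0, 0.5]
```

Genotype code 1 appears with labels 1 and 0, so it is replaced by their mean 0.5; codes 0 and 2 each appear once and take the label of that single sample.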
Further, in step S4, the optimization specifically comprises the following substeps:
S401, constructing the training model with a DataLoader, and splitting the test population into a training set and a validation set in a ratio of 8:2;
S402, building the deep learning network layers with torch, as shown in FIG. 3;
S403, using Adam as the optimizer and cross entropy (CrossEntropyLoss) as the loss function. Cross entropy is used to measure the performance of the model. For a multi-class classification problem with C classes (C being the number of classes), the cross entropy loss function is:
$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$
where $y$ is the probability distribution of the true class label, usually a one-hot encoded vector in which only the element for the sample's true class is 1 and all other elements are 0; $\hat{y}$ is the class probability distribution predicted by the model, a vector in which each element is the predicted probability for one class; $i$ is the class index; and $\sum$ denotes summation over the classes. The cross entropy loss compares the model's predicted distribution $\hat{y}$ with the true label distribution $y$ and penalizes the probability differences for each class. If the prediction agrees completely with the true label, the cross entropy loss is 0; otherwise the loss is greater than 0, and the larger the prediction error, the higher the loss. In deep learning, an optimization algorithm (e.g. Adam) is generally used to minimize the cross entropy loss, adjusting the parameters of the model so that its predictions gradually approach the true class label distribution and its classification performance improves.
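A small worked example may make the cross entropy loss concrete. The sketch below computes it in plain Python for a hypothetical three-class case with a one-hot true label; the probability values are made up for illustration. The softmax helper is included because PyTorch's CrossEntropyLoss takes raw scores (logits) and applies the softmax internally.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative sum over classes of y_true[i] * log(y_pred[i])."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0, 1, 0]                      # the sample belongs to the second class
confident = [0.05, 0.90, 0.05]          # hypothetical good prediction
wrong = [0.70, 0.20, 0.10]              # hypothetical poor prediction

loss_confident = cross_entropy(y_true, confident)   # -log(0.9), about 0.105
loss_wrong = cross_entropy(y_true, wrong)           # -log(0.2), about 1.609
probs = softmax([2.0, 1.0, 0.1])                    # probabilities summing to 1
```

Because the true label is one-hot, only the predicted probability of the true class contributes, so the loss reduces to the negative log of that probability.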
S404, setting the number of epochs to 100, recording the training set loss and validation set loss of every epoch, and optimizing model performance on the basis of these loss values. Within one epoch the neural network is trained on all samples in the data set, and the network parameters are updated from the computed gradients. This helps the network gradually learn the patterns and features in the data set. Over multiple epochs the network sees the whole data set many times, so it adapts better to the data and the training error falls; after each epoch the parameters have been further adjusted on the training data.
The number of epochs also affects how well the network generalizes to new data. Overfitting is the problem of a model performing well on training data but poorly on test data. By choosing the number of epochs, the degree to which the model fits the training data can be controlled and its generalization ability improved. Training a neural network is an iterative process in which repeated epochs help the network converge gradually toward an optimal solution, i.e. a minimum of the loss function. As the number of epochs increases, the training set loss and the validation set loss continue to decrease; preferably, training proceeds until the loss reaches its lowest value or its decrease approaches 0.
The training set and validation set loss values in each epoch help monitor the training progress and performance of the model. It is generally desirable that the training set loss decreases gradually from epoch to epoch while the validation set loss also decreases, without the two diverging, so as to avoid overfitting. For example, the loss values may vary with the epoch as follows:
Epoch 1: training set loss 0.5, validation set loss 0.6;
Epoch 2: training set loss 0.4, validation set loss 0.55;
Epoch 50: training set loss 0.1, validation set loss 0.15;
Epoch 100: training set loss 0.05, validation set loss 0.12.
In this example both the training set loss and the validation set loss decrease gradually as the number of epochs grows. This is the ideal case: the model is gradually learning the patterns in the data and generalizes well to the validation set.
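The example loss values above can be inspected with a few lines of Python. This is only a sketch of one possible way to monitor training; the helper name and the idea of tracking the gap between validation and training loss are assumptions for illustration, not something the source prescribes.

```python
def summarize_losses(history):
    """history: list of (epoch, train_loss, val_loss) tuples.
    Returns the epoch with the lowest validation loss, that loss,
    and the validation-minus-training gap for every epoch."""
    best_epoch, _, best_val = min(history, key=lambda row: row[2])
    gaps = {epoch: round(val - train, 4) for epoch, train, val in history}
    return best_epoch, best_val, gaps


# Loss values taken from the example in the text.
history = [(1, 0.5, 0.6), (2, 0.4, 0.55), (50, 0.1, 0.15), (100, 0.05, 0.12)]
best_epoch, best_val, gaps = summarize_losses(history)
print(best_epoch, best_val)  # 100 0.12
print(gaps)                  # {1: 0.1, 2: 0.15, 50: 0.05, 100: 0.07}
```

A validation loss that keeps falling, as here, suggests training can continue; a gap that widens while the validation loss rises would be the usual sign of overfitting.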
Further, step S5 specifically comprises: using the genotype data of each pig breed as a reference database, and estimating the ancestry proportions of the individuals to be tested.
Further, the estimated ancestry proportions of an individual to be tested are interpreted as follows: if the sample's ancestry proportion for some breed is greater than or equal to 75%, the sample is judged to be a purebred pig of that breed; otherwise, it is further determined whether the sample's ancestry proportion for some breed is less than 20%; if so, the sample is judged to be a binary pig, and otherwise a ternary pig.
In the evaluation result, each sample is assigned genomic ancestry proportions for Duroc, Landrace and Large White, and the above rule is applied to these three proportions.
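The decision rule on ancestry proportions can be written as a short Python function. This is a minimal sketch of the stated 75% and 20% thresholds; the function name, the dictionary layout and the breed keys are assumptions made for illustration, and the percentages in the usage lines are three rows from Table 1.

```python
def classify_sample(proportions, pure_min=75.0, minor_max=20.0):
    """proportions: dict mapping breed name to ancestry percentage.
    Applies the rule: >= 75% in one breed -> purebred of that breed;
    otherwise any breed < 20% -> binary pig; otherwise ternary pig."""
    top_breed = max(proportions, key=proportions.get)
    if proportions[top_breed] >= pure_min:
        return f"purebred {top_breed}"
    if any(value < minor_max for value in proportions.values()):
        return "binary pig"
    return "ternary pig"


sample_a = {"Landrace": 0.31, "Duroc": 99.47, "Large White": 0.21}
sample_b = {"Landrace": 23.10, "Duroc": 49.98, "Large White": 26.92}
sample_c = {"Landrace": 37.90, "Duroc": 3.33, "Large White": 58.77}

print(classify_sample(sample_a))  # purebred Duroc
print(classify_sample(sample_b))  # ternary pig
print(classify_sample(sample_c))  # binary pig
```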
Furthermore, a specific embodiment of the method is provided to show its high prediction accuracy in a blind test. The blind test samples came from four pig farms and comprised 1197 breeding pigs of clearly recorded breed. After the blind test population was genotyped with the 1K liquid-phase chip customized in the invention, the ancestry proportions and breed of each individual were estimated with the prediction model. Comparison with the actual phenotype records showed that the predictions of the chip combined with the model agreed with the actual results in 99.16% of cases, and that purebred, binary and ternary pigs could be accurately distinguished. Specifically:
1. Selection of the model training population
8243 individuals with clear breed information and genotype data were selected as the training population of the breed identification model. The breeds involved were Large White, Landrace, Duroc, Landrace × Large White binary pigs and Duroc × Landrace × Large White ternary pigs, and each sample had 50697 SNP loci.
2. Genotype data filtering
Genotype data filtering comprises two parts: SNP locus call rate filtering and sample call rate filtering. The filtering criteria for SNP loci are:
1) Deleting loci with a minor allele frequency less than 0.05;
2) Deleting loci with a Hardy-Weinberg equilibrium p-value less than 0.0001;
3) Deleting loci with linkage disequilibrium R² less than 0.5.
In the sample call rate filtering, samples with a locus call rate less than 0.9 are deleted.
3. Population structure analysis
Based on the filtered genotype data, principal component analysis is used to reduce the dimensionality of the genetic information of the test population, and the stratification of the population is observed from the principal component analysis results. At this stage, clearly outlying individuals are removed and excluded from subsequent model building.
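The source does not state how an outlying individual is defined. As one possible reading, the NumPy sketch below projects 0/1/2 genotypes onto the leading principal components and flags samples whose score on any of them lies more than a chosen number of standard deviations from the mean. The function name, the number of components and the cut-off of 3 are all assumptions, and the data are synthetic.

```python
import numpy as np


def pca_outliers(genotypes, n_components=2, z_max=3.0):
    """Return indices of samples whose score on any leading principal
    component is more than z_max standard deviations from the mean."""
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                       # centre each locus
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    return np.where(np.abs(z).max(axis=1) > z_max)[0]


# Synthetic data (hypothetical): 30 similar samples plus one sample that is
# homozygous for the counted allele at all 200 loci.
rng = np.random.default_rng(0)
cluster = rng.binomial(2, 0.1, size=(30, 200))
outlier = np.full((1, 200), 2)
data = np.vstack([cluster, outlier])
flagged = pca_outliers(data)
print(flagged)  # the last sample (index 30) is expected to be flagged
```

In practice the principal component plot would usually be inspected visually as well, since a fixed cut-off can miss outliers when several breeds form separate clusters.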
4. Construction of the breed identification locus model
The aim of model construction is to select, by feature engineering on the filtered genotype data, the smallest set of SNP loci that still yields a highly accurate discrimination model for pig breed and ancestry identification. The specific steps are:
1) Genotype data encoding: encoding genotypes composed of A, T, C and G as 0, 1 or 2;
2) Phenotype label vectorization: replacing breed information, or phenotypes reflecting breed characteristics, with numerical values and vectorizing them;
3) Using the target encoder TargetEncoder, which replaces each categorical feature value with the mean label value for that feature value;
4) Preliminarily constructing a classification model with XGBoost, a decision-tree-based algorithm, screening the features of the target set and ranking them by feature importance; finally, the top 1200 loci are selected as the locus set for the training model.
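The final selection in step 4) amounts to sorting loci by importance and keeping the first N. The sketch below shows this in plain Python; the importance scores are hypothetical placeholders standing in for values that, in the described workflow, would come from the XGBoost model, and the function name is an assumption.

```python
def select_top_loci(importance, k):
    """importance: dict mapping SNP identifier to importance score.
    Returns the k highest-scoring SNP identifiers, best first."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [snp for snp, _ in ranked[:k]]


# Hypothetical importance scores for four loci.
importance = {"snp_a": 0.02, "snp_b": 0.31, "snp_c": 0.00, "snp_d": 0.12}
top = select_top_loci(importance, k=2)
print(top)  # ['snp_b', 'snp_d']
```

With the full data set described in the text, `k` would be 1200.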
5. Optimization of the breed identification locus model
Based on the above analysis, a DataLoader is used for model training, mainly as follows:
1) Constructing the training model with a DataLoader, and splitting the test population into a training set and a validation set in a ratio of 8:2;
2) Building the deep learning network layers. The network consists of three fully connected layers; its structure is shown in FIG. 3;
3) Using Adam as the optimizer to tune the parameters so as to minimize the model loss function, the loss being evaluated by cross entropy (CrossEntropyLoss);
4) One complete pass of the data set through the neural network, forward and then backward (backpropagation), is called an epoch. The number of epochs is set to 100.
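The training setup in steps 1) to 4) can be sketched with PyTorch as follows. Only the 8:2 split, the three fully connected layers, Adam, CrossEntropyLoss and the 100 epochs come from the text; the hidden layer widths, learning rate, batch size and the synthetic stand-in data are assumptions chosen so the example runs quickly.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)

# Synthetic stand-in data (hypothetical): 200 samples, 20 encoded SNP
# features, 3 classes. The label copies the first feature so that there
# is a pattern to learn.
n_samples, n_features, n_classes = 200, 20, 3
X = torch.randint(0, 3, (n_samples, n_features)).float()
y = X[:, 0].long()

# 8:2 split into training and validation sets.
dataset = TensorDataset(X, y)
n_train = int(0.8 * n_samples)
train_set, val_set = random_split(dataset, [n_train, n_samples - n_train])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

# Three fully connected layers; widths 64 and 32 are assumed.
model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, n_classes),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()


def mean_loss(loader, train):
    """Average loss over one pass; updates the weights only when train=True."""
    model.train(train)
    total, count = 0.0, 0
    with torch.set_grad_enabled(train):
        for features, labels in loader:
            loss = loss_fn(model(features), labels)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * len(labels)
            count += len(labels)
    return total / count


# 100 epochs, recording training and validation loss for each.
history = []
for epoch in range(1, 101):
    train_loss = mean_loss(train_loader, train=True)
    val_loss = mean_loss(val_loader, train=False)
    history.append((epoch, train_loss, val_loss))
```

The recorded `history` holds one training loss and one validation loss per epoch, which is the information the text uses to judge model performance.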
6. Estimation of the ancestry proportions of the population to be tested
The specific procedure is as follows: in the analysis, the genotype data of pigs of each breed are used as a reference database, and the ancestry proportions of the individuals to be tested are estimated.
Further, a specific embodiment of the deep learning discrimination method for pig breed identification based on liquid-phase chip SNPs is provided, applied to pig breed identification. Four pig farms provided 1197 breeding pigs with clear breed information, which were genotyped with the 1K liquid-phase chip. From the genotyping results, the prediction model optimized and determined on the genomic reference population of the invention was used to estimate the proportion of each purebred breed's ancestry in every test sample, and a breed identification report was issued on the basis of these estimates. The identification results agreed with the true results in 99.16% of cases, showing that the method can accurately distinguish purebred, binary and ternary pigs. Part of the data is shown in Table 1.
| Sample number | Landrace (long white pig) | Duroc | Large White (big white pig) | Model judgment result |
|---|---|---|---|---|
| 10200077 | 0.31% | 99.47% | 0.21% | Duroc |
| 10200082 | 0.34% | 99.24% | 0.42% | Duroc |
| 10200084 | 0.08% | 99.87% | 0.04% | Duroc |
| 10200087 | 0.03% | 99.91% | 0.06% | Duroc |
| 10200088 | 0.18% | 99.43% | 0.39% | Duroc |
| 10200089 | 0.38% | 99.29% | 0.32% | Duroc |
| 10200094 | 0.20% | 99.33% | 0.47% | Duroc |
| 10200115 | 23.10% | 49.98% | 26.92% | Ternary pig |
| 10200181 | 17.34% | 53.79% | 28.88% | Ternary pig |
| 20300990 | 98.88% | 0.05% | 1.08% | Landrace |
| 20300992 | 99.50% | 0.02% | 0.49% | Landrace |
| 20300993 | 98.75% | 0.05% | 1.19% | Landrace |
| 20300994 | 99.85% | 0.00% | 0.14% | Landrace |
| 20300995 | 99.09% | 0.07% | 0.84% | Landrace |
| 20300996 | 99.78% | 0.01% | 0.22% | Landrace |
| 20300997 | 99.81% | 0.00% | 0.19% | Landrace |
| BY015635 | 0.26% | 0.05% | 99.69% | Large White |
| CN902334G | 0.85% | 0.33% | 98.81% | Large White |
| CN903247G | 3.57% | 0.22% | 96.21% | Large White |
| CN903405G | 0.07% | 0.01% | 99.92% | Large White |
| CN903593G | 0.50% | 0.08% | 99.42% | Large White |
| CN903879G | 0.50% | 0.04% | 99.46% | Large White |
| CN904088G | 1.63% | 0.10% | 98.27% | Large White |
| HB59327E | 37.90% | 3.33% | 58.77% | Binary pig |

Table 1. Blood lineage proportion distribution of a subset of the samples
Further, as a preferred implementation of this embodiment, a system for intelligent discrimination of SNP data based on a liquid phase chip is provided, which includes:
a data acquisition module, used for collecting the variety characteristic information and genotype detection results of pure-bred pigs and their hybrid offspring;
a data filtering module, used for filtering and quality control of the genotype data;
a model building module, used for building the deep learning model;
a data optimization module, used for debugging and optimizing model parameters and determining the optimal SNP locus set for variety identification;
and a data verification module, used for verifying the results on the test group.
Further, the data filtering module further comprises an SNP locus detection rate filtering unit and a sample detection rate filtering unit;
wherein the SNP locus detection rate filtering unit includes:
a minor allele frequency filtering subunit, deleting loci with a minor allele frequency less than 0.05;
a Hardy-Weinberg equilibrium threshold filtering subunit, deleting loci with a Hardy-Weinberg equilibrium threshold (P value) less than 0.0001;
a linkage disequilibrium filtering subunit, deleting loci with linkage disequilibrium R² less than 0.5;
the sample detection rate filtering unit is used for deleting samples with an SNP locus detection rate (call rate) less than 0.9.
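The thresholds listed for the data filtering module can be illustrated with a small standard-library sketch. It assumes genotypes are already coded as allele counts 0, 1 or 2 with `None` for a missing call, and it uses a chi-square goodness-of-fit test (1 degree of freedom) for Hardy-Weinberg equilibrium, which is an assumption: the source does not say which test is used. The linkage disequilibrium filter is omitted, and the function names are hypothetical.

```python
import math


def call_rate(values):
    """Fraction of non-missing genotype calls."""
    return sum(v is not None for v in values) / len(values)


def minor_allele_frequency(genotypes):
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    p = sum(called) / (2 * len(called))      # frequency of the counted allele
    return min(p, 1 - p)


def hwe_p_value(genotypes):
    """Chi-square (1 d.f.) test of Hardy-Weinberg proportions; an approximation."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return 1.0
    observed = [called.count(k) for k in (0, 1, 2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    if min(expected) == 0:                   # monomorphic locus: test undefined
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi2 / 2))    # survival function of chi-square, 1 d.f.


def quality_control(matrix, maf_min=0.05, hwe_min=1e-4, sample_call_min=0.9):
    """matrix[i][j] is the genotype of sample i at SNP j. Returns kept SNP and sample indices."""
    kept_snps = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        if minor_allele_frequency(column) < maf_min:
            continue
        if hwe_p_value(column) < hwe_min:
            continue
        kept_snps.append(j)
    kept_samples = [i for i, row in enumerate(matrix)
                    if call_rate(row) >= sample_call_min]
    return kept_snps, kept_samples
```

In practice such filtering is usually done with a dedicated genetics tool rather than hand-written code; the sketch is only meant to make the three thresholds concrete.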
Further, the model building module further includes:
a genotype data coding unit, which codes genotypes composed of A, T, C and G into the values 0, 1 or 2;
a phenotype label feature vectorization unit, used for replacing the variety information, or phenotypes reflecting variety characteristics, with numerical label values and vectorizing them;
a feature classification unit, which uses target encoding (TargetEncoder) to replace each categorical feature value with the mean label value of that feature value;
a feature ranking unit, used for preliminarily constructing a classification model with the XGBoost decision tree algorithm, screening out the features of the target set and ranking them by feature importance;
and a locus set selecting unit, which selects the top-ranked loci as the locus set of the training model.
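The two encoding steps of the model building module can be sketched as below. Which allele is counted when coding to 0, 1 or 2 is not stated in the source, so the sketch assumes a known reference allele per locus and counts non-reference alleles. The target encoding shown is the plain per-category label mean; library implementations of TargetEncoder typically add smoothing, which is left out here. Both function names are hypothetical.

```python
from collections import defaultdict


def encode_genotype(genotype, ref_allele):
    """Return the number of non-reference alleles (0, 1 or 2), or None for a missing call."""
    if len(genotype) != 2 or any(allele not in "ACGT" for allele in genotype):
        return None
    return sum(allele != ref_allele for allele in genotype)


def target_encode(categories, labels):
    """Replace each category with the mean label observed for that category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, label in zip(categories, labels):
        totals[category] += label
        counts[category] += 1
    means = {category: totals[category] / counts[category] for category in totals}
    return [means[category] for category in categories], means


# Example: one SNP with reference allele "A" in four animals.
codes = [encode_genotype(g, "A") for g in ["AA", "AG", "GA", "GG"]]   # [0, 1, 1, 2]
encoded, mapping = target_encode(codes, [0, 0, 1, 1])                  # numeric breed labels
```

The encoded features would then be passed to a tree-based ranker such as XGBoost to order loci by importance, a step not shown here.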
Further, the data optimization module further includes:
a model construction unit, used for constructing the training model with a DataLoader, splitting the test population into a training set and a validation set at a ratio of 8:2, and establishing the deep learning network layers;
a loss function selecting unit, which uses Adam as the optimizer to adjust parameters so as to minimize the model loss function, the loss being evaluated with cross entropy (CrossEntropyLoss);
and an Epoch parameter optimizing unit, used for setting the model Epoch to 100, obtaining the training set loss and validation set loss of each Epoch, and optimizing model performance according to these loss values.
Further, as a preferred implementation manner of this example, a terminal device for intelligent discrimination of SNP data based on liquid phase chip is provided, as shown in fig. 4, the terminal device 200 includes at least one memory 210, at least one processor 220, and a bus 230 connected to different platform systems.
Memory 210 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 211 and/or cache memory 212, and may further include Read Only Memory (ROM) 213.
The memory 210 further stores a computer program that can be executed by the processor 220, so that the processor 220 performs any of the liquid phase chip SNP-based deep learning discrimination methods for pig variety identification described in the embodiments of the application. The specific implementation is consistent with the implementation and technical effects described in the above embodiments, and the details are not repeated here. Memory 210 may also include a program/utility 214 having a set (at least one) of program modules 215 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Accordingly, the processor 220 may execute the computer programs described above, as well as the program/utility 214.
Bus 230 may represent one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
Terminal device 200 can also communicate with one or more external devices 240, such as a keyboard, pointing device, bluetooth device, etc., as well as one or more devices capable of interacting with the terminal device 200, and/or with any device (e.g., router, modem, etc.) that enables the terminal device 200 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 250. Also, terminal device 200 can communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 260. Network adapter 260 may communicate with other modules of terminal device 200 via bus 230. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with terminal device 200, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage platforms, and the like.
Further, as a preferred implementation of this embodiment, a computer readable storage medium for intelligent discrimination of SNP data based on a liquid phase chip is provided. Instructions are stored on the computer readable storage medium, and when the instructions are executed by a processor, the liquid phase chip SNP-based deep learning discrimination method for pig variety identification is implemented. The specific implementation is consistent with the implementation and technical effects described in the above embodiments, and the details are not repeated here.
Fig. 5 shows a program product 300 provided by the present embodiment for implementing the above method, which may employ a portable compact disc read-only memory (CD-ROM) and comprise program code, and may be run on a terminal device, such as a personal computer. However, the program product 300 of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Program product 300 may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable storage medium may also be any readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The foregoing has shown and described the basic principles and main features of the present invention and the advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.
The present application describes the functional improvements and usage elements required by patent law. The above description and drawings are merely preferred embodiments of the present application and do not limit it; therefore, all structures, devices and features that are similar or identical to those of the present application, that is, all equivalent replacements and modifications made according to the scope of this patent application, fall within the scope of protection of the present application.
Through this embodiment, based on genotype data of pure-bred Landrace (long white) pigs, Large White (big white) pigs and Duroc pigs and of their binary and ternary hybrid offspring, an optimal locus set containing 1200 SNPs is found with a deep learning algorithm, which can accurately distinguish the pig varieties and the blood lineage proportions of hybrid individuals. A detection kit for this locus set is customized through the GBTS liquid phase chip technology system. Based on the background gene library and the optimized machine learning model, the liquid phase chip can accurately distinguish the Duroc, Landrace (Changbai) and Large White (Dabai) varieties and accurately estimate the blood lineage proportions of hybrid offspring.
The foregoing is merely a preferred embodiment of the invention. It is to be understood that the invention is not limited to the form disclosed herein, which is not to be construed as excluding other embodiments; the invention may be used in various other combinations, modifications and environments, and may be modified within the scope of the inventive concept described herein, through the above teachings or through the skill or knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.

Claims (7)

1. A liquid phase chip SNP-based deep learning discrimination method for pig variety identification, characterized by comprising the following steps:
S1, collecting the variety characteristic information and genotype detection results of pure-bred pigs and their hybrid offspring;
S2, filtering and quality control of the genotype data;
S3, building a deep learning model;
S4, debugging and optimizing model parameters, and determining the optimal SNP locus set for variety identification;
S5, verifying the results on the test group.
2. The liquid phase chip SNP-based deep learning discrimination method for pig variety identification according to claim 1, wherein step S1 specifically comprises: collecting a plurality of individuals with variety information and genotype data as the training group of the variety identification model, wherein the variety information at least comprises big white (Large White) pigs, long white (Landrace) pigs, Duroc pigs, long binary pigs and Dulong ternary pigs.
3. The liquid phase chip SNP-based deep learning discrimination method for pig variety identification according to claim 1, wherein in step S2 the filtering specifically comprises SNP locus detection rate filtering and sample detection rate filtering;
wherein the SNP locus detection rate filtering comprises the following steps:
S201, deleting loci with a minor allele frequency less than 0.05;
S202, deleting loci with a Hardy-Weinberg equilibrium threshold (P value) less than 0.0001;
S203, deleting loci with linkage disequilibrium R² less than 0.5;
and the sample detection rate filtering specifically comprises: deleting samples with an SNP locus detection rate (call rate) less than 0.9.
4. The liquid phase chip SNP-based deep learning discrimination method for pig variety identification according to claim 1, wherein step S3 specifically comprises the following substeps:
S301, genotype data encoding, namely encoding genotypes composed of A, T, C and G into the values 0, 1 or 2;
S302, vectorizing phenotype label features, namely replacing the variety information, or phenotypes reflecting variety characteristics, with numerical values as label values, and vectorizing them;
S303, converting the genotype feature values into numerical codes with target encoding (TargetEncoder), replacing each categorical feature value with the mean label value of that feature value;
S304, preliminarily constructing a classification model from the encoded label values and feature values with the XGBoost decision tree algorithm, listing the features of all loci and ranking them by feature importance;
S305, selecting the top-ranked loci as the locus set of the training model.
5. The liquid phase chip SNP-based deep learning discrimination method for pig variety identification according to claim 1, wherein in step S4 the optimizing specifically comprises the following substeps:
S401, constructing the training model with a DataLoader, and splitting the test group into a training set and a validation set at a ratio of 8:2;
S402, establishing the deep learning network layers;
S403, using Adam as the optimizer to adjust parameters so as to minimize the model loss function, the loss being evaluated with cross entropy;
S404, setting the model Epoch to 100, obtaining the training set loss and validation set loss of each Epoch, and optimizing model performance according to these loss values.
6. The liquid phase chip SNP-based deep learning discrimination method for pig variety identification according to claim 1, wherein step S5 specifically comprises: using the genotype data of each pig variety as a reference comparison database, and estimating the blood lineage proportion of each individual to be tested.
7. The liquid phase chip SNP-based deep learning discrimination method for pig variety identification according to claim 6, wherein the blood lineage proportion of the individual to be tested is evaluated as follows: if the blood lineage proportion of the sample in a certain variety is greater than or equal to 75%, the sample is judged to be a pure-bred pig of that variety; otherwise, it is further judged whether the blood lineage proportion of the sample in a certain variety is less than 20%; if so, the sample is judged to be a binary pig, otherwise a ternary pig.
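The decision rule of claim 7 can be written as a short function. This is a literal reading of the two thresholds (75% and 20%), offered as a sketch: the source does not say how borderline cases are resolved, and the function name, the dictionary layout and the breed keys (Landrace for the long white pig, Large White for the big white pig) are assumptions made for the example.

```python
def classify_by_lineage(proportions, pure_min=75.0, minor_max=20.0):
    """Classify one sample from its per-variety blood lineage proportions (in percent)."""
    variety, top = max(proportions.items(), key=lambda item: item[1])
    if top >= pure_min:
        return variety                 # pure-bred pig of that variety
    if min(proportions.values()) < minor_max:
        return "binary pig"            # one variety contributes less than 20%
    return "ternary pig"


# Values taken from three rows of Table 1.
print(classify_by_lineage({"Landrace": 0.08, "Duroc": 99.87, "Large White": 0.04}))    # Duroc
print(classify_by_lineage({"Landrace": 37.90, "Duroc": 3.33, "Large White": 58.77}))   # binary pig
print(classify_by_lineage({"Landrace": 23.10, "Duroc": 49.98, "Large White": 26.92}))  # ternary pig
```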
CN202311444798.4A 2023-11-02 2023-11-02 Liquid phase chip pair SNP-based deep learning discrimination method for pig variety identification Pending CN117275575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311444798.4A CN117275575A (en) 2023-11-02 2023-11-02 Liquid phase chip pair SNP-based deep learning discrimination method for pig variety identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311444798.4A CN117275575A (en) 2023-11-02 2023-11-02 Liquid phase chip pair SNP-based deep learning discrimination method for pig variety identification

Publications (1)

Publication Number Publication Date
CN117275575A true CN117275575A (en) 2023-12-22

Family

ID=89214483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311444798.4A Pending CN117275575A (en) 2023-11-02 2023-11-02 Liquid phase chip pair SNP-based deep learning discrimination method for pig variety identification

Country Status (1)

Country Link
CN (1) CN117275575A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107967409A (en) * 2017-11-24 2018-04-27 中国农业大学 One boar full-length genome low-density SNP chip and preparation method thereof and application
KR20220091223A (en) * 2020-12-23 2022-06-30 (주)인실리코젠 System for selection of genetic markers and method for breed identification
CN115838808A (en) * 2022-07-29 2023-03-24 江苏省家禽科学研究所科技创新有限公司 Molecular marker for identifying Wenshang Luhua chicken variety and application thereof
CN115651986A (en) * 2022-10-31 2023-01-31 华中农业大学 Method for rapidly identifying pig breeds by utilizing whole genome SNP information and application thereof
CN116597894A (en) * 2023-03-31 2023-08-15 之江实验室 XGBoost feature selection and deep learning combined gene-to-phenotype prediction method and system
CN116694776A (en) * 2023-06-09 2023-09-05 山西农业大学 Gene affecting reproductive performance of sow and screening method thereof

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
’HUMZ: "SNP基因数据质控调研", Retrieved from the Internet <URL:https://blog.csdn.net/weixin_43025542/article/details/106995745> *
六六六呀: "PLINK 阈值过滤汇总", Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/347695615> *
粟涛: "基于机器学习的猪育种表型预测及基因芯片位点筛选研究", 1 June 2023 (2023-06-01) *
要快乐_更要经历山河: "利用plink对snp质控的指标和基本流程", Retrieved from the Internet <URL:https://www.jianshu.com/p/6ba349af668b> *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination