CN112233722B

CN112233722B - Variety identification method, and method and device for constructing prediction model thereof

Info

Publication number: CN112233722B
Application number: CN202011120761.2A
Authority: CN
Inventors: 陈志强; 梁齐齐; 吴俊�; 曹志生; 李瑞强
Original assignee: Beijing Novogene Technology Co ltd
Current assignee: Beijing Novogene Technology Co ltd
Priority date: 2020-10-19
Filing date: 2020-10-19
Publication date: 2024-01-30
Anticipated expiration: 2040-10-19
Also published as: CN112233722A

Abstract

The invention provides a variety identification method, and a method and a device for constructing a prediction model thereof. The construction method comprises the following steps: acquiring an SNP data set; preprocessing the SNP data set to obtain an SNP data matrix; filtering the SNP data matrix by a gradient boosting method to obtain an SNP reduced set; taking the SNP reduced set as the feature values and the population name of each sample as the target value, and carrying out model training and model parameter adjustment to obtain a preliminary model; and evaluating the preliminary model to obtain a prediction model for variety identification. The method uses gradient boosting to reduce the dimensionality of the SNP data set used for modeling, thereby reducing computational complexity and the amount of computation and improving computation speed.

Description

Variety identification method, and method and device for constructing prediction model thereof

Technical Field

The invention relates to the field of variety identification, in particular to a variety identification method, a prediction model construction method and a prediction model construction device.

Background

A variety refers to a group within a species that has been bred according to human needs under certain ecological and economic conditions, that has relatively stable genetic characteristics and relatively consistent biological, morphological and economic traits, and that can be distinguished from other groups of the same species, i.e. it has specificity. Variety identification has broad and important application value: it facilitates efficient management of genetic information, lays a foundation for establishing and implementing breeding strategies, provides effective information for product authentication, and opens up a new way of addressing food safety problems.

Early variety identification relied mainly on phenotypes. With the spread of cross breeding, however, individuals of populations that have undergone several generations of crossing are phenotypically very similar to their parents, so identification based on phenotypic traits alone is not sufficiently accurate, comprehensive or scientific. Variety identification therefore moved from traditional phenotypic identification to DNA molecular marker technology, which provides an accurate and rapid route to variety identification. Early DNA molecular marker technologies used markers such as microsatellites and AFLP. The general process of variety identification using microsatellites includes: a) extracting DNA from the sample to be tested; b) labeling the microsatellite primers with a fluorescent group; c) touchdown PCR amplification; d) reading the genotype information of each sample, calculating genetic distances between individuals with genetic software, and drawing a cluster map from the genetic distances so as to identify varieties.

However, the above method has the following disadvantages: a) Poor versatility, because specific primers are required. The flanking sequences of microsatellites differ between species, so specific primers usually have to be designed for each species, which is time-consuming and laborious. b) High error. Alleles with identical microsatellite repeats may yield PCR products of different lengths, and alleles with different repeats may yield PCR products of the same length, so analysis based only on the length of the PCR fragments may give erroneous results. In addition, PCR amplification is affected by many factors; for example, mutations at the position corresponding to the 3' end of the primer can prevent some alleles from being amplified, which seriously affects PCR efficiency and thus the accuracy of the identification result. c) Low sensitivity. Because the error is high, when two varieties differ only slightly, the error of the detection method masks the difference between them.

Some researchers have used different statistical methods, in combination with genetic information, to screen SNP loci. Pfaff et al. used the delta method, which classifies using the absolute difference in allele frequencies between two populations as the criterion, and Weir et al. used Wright's FST method to maximize the difference in allele frequencies between two predefined populations. However, delta and Wright's FST can only be used to discriminate between two populations and lack a clear statistical definition. To address discrimination among two or more varieties, Rosenberg et al. proposed a correlation measure that uses mutual information (In) to describe the correlation, thereby representing the relationship between the FST values of different varieties. These methods, however, are computationally intensive and difficult to apply, and it is also hard to obtain useful SNP loci with them.

Disclosure of Invention

The invention mainly aims to provide a variety identification method, a prediction model construction method and a device thereof, so as to realize simple, high-throughput and automatic variety identification.

In order to achieve the above object, according to one aspect of the present invention, there is provided a method for constructing a variety identification prediction model, the construction method comprising: acquiring an SNP data set; preprocessing the SNP data set to obtain an SNP data matrix; filtering the SNP data matrix by a gradient boosting method to obtain an SNP reduced set; taking the SNP reduced set as the feature values and the population name of each sample as the target value, and carrying out model training and model parameter adjustment to obtain a preliminary model; and evaluating the preliminary model to obtain a prediction model for variety identification.

Further, preprocessing the SNP data set to obtain the SNP data matrix comprises: removing from the SNP data set SNP loci that have missing genotypes and/or a minor allele frequency lower than 5%, to obtain effective SNP loci; and digitally encoding the genotypes of the n effective SNP loci of the m samples to be identified to obtain the SNP data matrix, denoted X_m×n; wherein the wild-type homozygous genotype AA is encoded as 0, the heterozygous genotype AB as 1 and the mutant homozygous genotype BB as 2, and m and n are natural numbers, preferably each equal to or greater than 2.

Further, filtering the SNP data matrix by the gradient boosting method to obtain the SNP reduced set comprises the following steps: a. extracting the data of j SNP loci from the SNP data matrix without replacement to form a first data matrix K_m×j; b. taking the population name of each sample as the target value, scoring the contribution of each SNP locus in the first data matrix K_m×j by the gradient boosting method; c. sorting the SNP loci in the first data matrix K_m×j by contribution score from high to low, accumulating the contributions in that order, and retaining the SNP loci with an accumulated contribution greater than p, to generate a data matrix K_m×i; d. judging whether all SNP loci have been traversed; if so, taking the SNP loci retained in the last accumulation as the SNP reduced set; if not, taking the data matrix K_m×i as part of a new data matrix K_m×j, extracting the data of a further (j-i) SNP loci, K_m×(j-i), from the data matrix X_m×n without replacement to complete the new data matrix K_m×j, and repeating steps b and c until all SNP loci have been traversed, to obtain the SNP reduced set.

Further, in the step of performing model training and adjusting model parameters, the adjustment is performed by adopting a grid search method.

Further, evaluating the preliminary model to obtain the prediction model for variety identification comprises: dividing the SNP reduced set into a training set and a test set; evaluating the preliminary model by performing five-fold cross-validation on the training set and outputting an AUC value on the test set; if the evaluation result meets a preset standard, taking the preliminary model as the prediction model; if the evaluation result does not meet the preset standard, returning to the model training and model parameter adjustment steps and repeating them on the preliminary model until the evaluation result meets the preset standard.

Further, in addition to obtaining the prediction model for variety identification, the construction method comprises: exporting the prediction model and storing it under a cluster path, ranking the importance of each SNP locus returned by the prediction model, and exporting the importance of each SNP locus and storing it under the cluster path.

According to a second aspect of the present application, there is provided a variety identification method, comprising: sequentially preprocessing the SNP data set of a sample to be identified and filtering it by the gradient boosting method to obtain the SNP reduced set of the sample to be identified; and inputting the SNP reduced set of the sample to be identified into a prediction model constructed by any one of the above construction methods for prediction, thereby obtaining the population to which the sample to be identified belongs.

Further, the pretreatment is carried out according to the pretreatment step in the construction method; the filtering treatment is performed according to the filtering treatment steps in the construction method.

According to a third aspect of the present application, there is provided a device for constructing a variety identification prediction model, the construction device comprising: an SNP acquisition module for acquiring an SNP data set; a preprocessing module for preprocessing the SNP data set to obtain an SNP data matrix; an SNP filtering module for filtering the SNP data matrix by a gradient boosting method to obtain an SNP reduced set; a model training module for carrying out model training and model parameter adjustment, with the SNP reduced set as the feature values and the population name of each sample as the target value, to obtain a preliminary model; and an evaluation determination module for evaluating the preliminary model to obtain a prediction model for variety identification.

Further, the preprocessing module includes: a locus screening module for removing from the SNP data set SNP loci that have missing genotypes and/or a minor allele frequency lower than 5%, to obtain effective SNP loci; and an encoding conversion module for digitally encoding the genotypes of the n effective SNP loci of the m samples to be identified to obtain the SNP data matrix, denoted X_m×n; wherein the wild-type homozygous genotype AA is encoded as 0, the heterozygous genotype AB as 1 and the mutant homozygous genotype BB as 2, and m and n are natural numbers, preferably each equal to or greater than 2.

Further, the SNP filtering module includes: a first extraction module for extracting the data of j SNP loci from the SNP data matrix without replacement to form a first data matrix K_m×j; a contribution scoring module for scoring, by the gradient boosting method and with the population name of each sample as the target value, the contribution of each SNP locus in the first data matrix K_m×j; a sorting and accumulation selection module for sorting the SNP loci in the first data matrix K_m×j by contribution score from high to low, accumulating the contributions in that order, and retaining the SNP loci with an accumulated contribution greater than p, to generate a data matrix K_m×i; and a judging and traversing module for judging whether all SNP loci have been traversed; if so, the SNP loci retained in the last accumulation are taken as the SNP reduced set; if not, the data matrix K_m×i is taken as part of a new data matrix K_m×j, the data of a further (j-i) SNP loci, K_m×(j-i), are extracted from the data matrix X_m×n without replacement to complete the new data matrix K_m×j, and the scoring and the sorting and accumulation are repeated until all SNP loci have been traversed, to obtain the SNP reduced set.

Further, in the model training module, a grid searching method is adopted to adjust model parameters.

Further, the evaluation determination module includes: the evaluation module is used for performing five-fold cross validation on the training set by dividing the SNP reduction set into the training set and the testing set and evaluating the preliminary model in a mode of outputting an AUC value on the testing set; the first determining module is used for taking the preliminary model as a prediction model when the evaluation result meets a preset standard; and the second determining module is used for returning to the preliminary model when the evaluation result does not accord with the preset standard, and repeatedly executing model training and model parameter adjustment until the evaluation result accords with the preset standard.

Further, the construction apparatus further includes: and the export storage module is used for exporting and storing the prediction model under the cluster path, sequencing the importance of each SNP locus returned by the prediction model, and exporting and storing the importance of each SNP locus under the cluster path.

According to a fourth aspect of the present application, there is provided an apparatus for variety identification, the apparatus comprising: and a device for constructing any variety identification prediction model.

According to a fifth aspect of the present application, there is provided a storage medium comprising a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute any of the above methods for constructing a variety identification prediction model.

According to a sixth aspect of the present application, there is provided a processor for running a program, wherein the program executes a method for constructing any of the variety identification prediction models described above.

By applying the technical scheme of the invention, the preprocessed SNP matrix is filtered by the gradient boosting method, so that an SNP reduced set containing far fewer loci, each with a large contribution, is obtained by screening on SNP contribution; a preliminary model is obtained by model training and parameter adjustment on the SNP reduced set; and finally the prediction accuracy of the preliminary model is evaluated and verified using a training set, a test set and the like, thereby obtaining a prediction model that meets the expected standard. Because the method reduces the dimensionality of the SNP data set used for modeling, it reduces computational complexity and the amount of computation and improves computation speed, and the method and the prediction model built with it enable rapid, high-throughput and automatic variety identification.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:

FIG. 1 is a flow chart showing a construction method of a variety identification prediction model according to embodiment 1 of the present invention;

FIG. 2 is a detailed flow chart showing a construction method of a variety identification prediction model according to embodiment 2 of the present invention;

FIG. 3 shows the prediction accuracy for different numbers of trees (n_estimators) according to embodiment 4 of the present invention;

FIG. 4 shows the results of different sets of cross-validation accuracies according to embodiment 4 of the invention;

FIG. 5 is a view showing the results of SNP site contribution (importance) according to embodiment 4 of the invention;

FIG. 6 is a schematic diagram showing the construction device of a variety identification prediction model according to embodiment 6 of the present invention.

Detailed Description

It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present invention will be described in detail with reference to examples.

It should be noted that the terms "first," "second," and the like herein are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

As mentioned in the Background, variety identification methods in the prior art struggle to achieve rapid, high-throughput, automatic identification of multiple varieties. To improve on this, a preferred embodiment of the present application provides a method for constructing a prediction model for variety identification and a variety identification method that uses the model.

Example 1

The embodiment provides a method for constructing a prediction model for variety identification, and fig. 1 shows a flow diagram of the method for constructing the prediction model. The construction method comprises the following steps:

Step S101, acquiring an SNP data set;

Step S102, preprocessing the SNP data set to obtain an SNP data matrix;

Step S103, filtering the SNP data matrix by a gradient boosting method to obtain an SNP reduced set;

Step S104, carrying out model training and model parameter adjustment, with the SNP reduced set as the feature values and the population name of each sample as the target value, to obtain a preliminary model;

Step S105, evaluating the preliminary model to obtain a prediction model for variety identification.

The preprocessed SNP matrix is filtered by the gradient boosting method, so that an SNP reduced set containing far fewer loci, each with a large contribution, is obtained by screening on SNP contribution; a preliminary model is obtained by model training and parameter adjustment on the SNP reduced set; and finally the prediction accuracy of the preliminary model is evaluated and verified using a training set, a test set and the like, thereby obtaining a prediction model that meets the expected standard. Because the method reduces the dimensionality of the SNP data set used for modeling, it reduces computational complexity and the amount of computation and improves computation speed, and the method and the prediction model built with it enable rapid, high-throughput and automatic variety identification.

In the step of acquiring the SNP data set, the SNP data may be SNP molecular markers obtained by aligning whole-genome resequencing data to a reference genome (for the specific steps see FIG. 1: genomic DNA is first extracted, DNA sequencing is then performed, and the sequencing data are quality-controlled and aligned to the reference genome to obtain the SNP data of the sample), or SNP molecular markers obtained with an SNP chip. The population to which each sample in these SNP data belongs is known.

SNP (Single nucleotide polymorphism) refers to DNA sequence polymorphism caused by single nucleotide variation, and includes single base transition, transversion, insertion, deletion and other forms. Compared with other DNA molecular markers, SNP markers are widely applied to various biological related analyses with the advantages of high throughput, high integration, microminiaturization, automation and the like.

SNPs have the following characteristics: (1) They are numerous, dense and widely distributed; in the human genome there is on average 1 SNP site per 1 kb. (2) They are representative: SNPs located in gene coding regions may change gene function or affect gene expression and thereby affect individual traits, which provides a theoretical basis for genetic studies of traits. (3) They are genetically stable: the mutation probability of SNPs is low, SNPs in coding regions are especially stable, and genetic analyses are highly repeatable. (4) SNP typing is easy to automate.

Some SNP sites may have incomplete data, such as missing genotypes, as a result of the sequencing process. Other sites have a low minor allele frequency, for example below 0.05, which indicates that the variant is rare at that site; with a small sample size the alleles at such a site are difficult to detect, so including the site tends to reduce statistical power and cause false negative results. Furthermore, genotypes at different SNP sites may be the same or different, but for modeling purposes they cannot be represented by the specific genotype of each site and must instead be converted into numbers. Although genotypes differ between SNP loci, the genotype at each SNP can only be one of three types, AA, AB or BB. The wild-type homozygous genotype is therefore encoded as 0, the heterozygous genotype as 1 and the mutant homozygous genotype as 2, and this digital encoding allows the genotypes to be used in a variety prediction model.

Thus, in a preferred embodiment, preprocessing the SNP data set to obtain the SNP data matrix includes: removing from the SNP data set SNP loci that have missing genotypes and/or a minor allele frequency lower than 5%, to obtain effective SNP loci; and digitally encoding the genotypes of the n effective SNP loci of the m samples to obtain the SNP data matrix, denoted X_m×n; wherein genotype AA is encoded as 0, genotype AB as 1 and homozygous genotype BB as 2, and m and n are natural numbers, preferably each equal to or greater than 2.
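The preprocessing described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that genotype calls arrive as the strings "AA", "AB" and "BB", with None marking a missing call; the function names, the input layout and the toy locus names are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Optional, Tuple

# Assumed genotype labels; None marks a missing call (illustrative only).
GENOTYPE_CODE = {"AA": 0, "AB": 1, "BB": 2}


def minor_allele_frequency(calls: List[str]) -> float:
    """Minor allele frequency of one locus, given complete AA/AB/BB calls."""
    b_alleles = sum(GENOTYPE_CODE[c] for c in calls)  # number of B alleles
    freq_b = b_alleles / (2 * len(calls))             # two alleles per sample
    return min(freq_b, 1.0 - freq_b)


def preprocess(
    genotypes: Dict[str, List[Optional[str]]], maf_min: float = 0.05
) -> Tuple[List[str], List[List[int]]]:
    """Return (kept locus names, m x n matrix of 0/1/2 codes).

    `genotypes` maps a locus name to its calls across the m samples.
    """
    kept = []
    for locus, calls in genotypes.items():
        if any(c is None for c in calls):
            continue  # drop loci with missing genotypes
        if minor_allele_frequency(calls) < maf_min:
            continue  # drop loci whose MAF is below the threshold
        kept.append(locus)
    m = len(next(iter(genotypes.values()))) if genotypes else 0
    matrix = [[GENOTYPE_CODE[genotypes[l][s]] for l in kept] for s in range(m)]
    return kept, matrix


raw = {
    "snp1": ["AA", "AB", "BB", "AB"],  # MAF 0.5, kept
    "snp2": ["AA", "AA", None, "AB"],  # missing call, dropped
    "snp3": ["AA", "AA", "AA", "AA"],  # MAF 0, dropped
    "snp4": ["BB", "BB", "AB", "BB"],  # MAF 0.125, kept
}
loci, X = preprocess(raw)
print(loci)  # ['snp1', 'snp4']
print(X)     # [[0, 2], [1, 2], [2, 1], [1, 2]]
```

Each row of the result is one sample and each column one effective SNP locus, matching the X_m×n layout used in the text.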

Filtering the SNP data matrix by the gradient boosting method to obtain the SNP reduced set comprises the following steps:

a. extracting the data of j SNP loci from the SNP data matrix without replacement to form a first data matrix K_m×j;

b. taking the population name of each sample as the target value, scoring the contribution of each SNP locus in the first data matrix K_m×j by the gradient boosting method;

c. sorting the SNP loci in the first data matrix K_m×j by contribution score from high to low, accumulating the contributions in that order, and retaining the SNP loci with an accumulated contribution greater than p, to generate a data matrix K_m×i;

d. judging whether all SNP loci have been traversed; if so, taking the SNP loci retained in the last accumulation as the SNP reduced set; if not, taking the data matrix K_m×i as part of a new data matrix K_m×j, extracting the data of a further (j-i) SNP loci, K_m×(j-i), from the data matrix X_m×n without replacement to complete the new data matrix K_m×j, and repeating steps b and c until all SNP loci have been traversed, to obtain the SNP reduced set.
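The control flow of steps a to d can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the scoring function is injected, and the default `between_group_score` is a simple stand-in (share of variance explained by the population label), not gradient boosting. In practice a gradient boosting model's per-feature importances would be supplied instead. The sketch also assumes that "accumulated contribution greater than p" means keeping the shortest top-ranked prefix whose normalised scores sum to at least p, and that at least one new locus is drawn per round so the loop always advances. All function names are hypothetical.

```python
import random


def between_group_score(X, y, cols):
    """Stand-in scorer (NOT gradient boosting): fraction of each column's
    variance explained by the population label, normalised to sum to 1."""
    raw = {}
    for c in cols:
        values = [row[c] for row in X]
        mean = sum(values) / len(values)
        total = sum((v - mean) ** 2 for v in values)
        groups = {}
        for v, label in zip(values, y):
            groups.setdefault(label, []).append(v)
        between = sum(len(g) * (sum(g) / len(g) - mean) ** 2
                      for g in groups.values())
        raw[c] = between / total if total > 0 else 0.0
    norm = sum(raw.values())
    return {c: (s / norm if norm > 0 else 1.0 / len(cols))
            for c, s in raw.items()}


def keep_top_cumulative(scores, p):
    """Step c: rank loci by score, keep the shortest prefix summing to >= p."""
    kept, running = [], 0.0
    for col, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(col)
        running += s
        if running >= p:
            break
    return kept


def reduce_snps(X, y, j, p=0.99, scorer=between_group_score, seed=0):
    """Steps a-d: batch-wise filtering until every column has been drawn."""
    order = list(range(len(X[0])))
    random.Random(seed).shuffle(order)          # random order = no replacement
    batch, pos = order[:j], min(j, len(order))  # step a
    while True:
        kept = keep_top_cumulative(scorer(X, y, batch), p)  # steps b and c
        if pos >= len(order):                   # step d: all loci traversed
            return sorted(kept)
        take = max(1, j - len(kept))            # assumption: always advance
        batch = kept + order[pos:pos + take]    # retained loci + new loci
        pos += take


# Toy data: 8 samples, 6 loci; only loci 0 and 3 separate populations A and B.
columns = [
    [0, 0, 0, 0, 2, 2, 2, 2],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [2, 2, 2, 2, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 2, 0, 2, 0, 2, 0, 2],
]
X = [list(row) for row in zip(*columns)]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(reduce_snps(X, y, j=3))  # [0, 3]
```

Because only j columns are scored per round, memory and compute per round depend on j rather than on the total number of loci n, which is the resource saving the text describes.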

Compared with other filtering methods, this filtering method uses very little computing and storage resources, so the calculation can be carried out on a personal computer. Traditional filtering methods proposed for SNP loci include missing-value filtering, allele-frequency filtering and linkage-disequilibrium filtering. Although these methods remove some loci, the filtering is not targeted and many SNP loci remain afterwards; training a random forest model on SNP loci of that order of magnitude requires an extremely large amount of computation, so the training cannot be carried out on a personal computer.

The innovation of the method is that gradient boosting is used to retain a small number of loci in a targeted way, and that filtering only part of the loci at a time, drawn without replacement, greatly reduces the computing resources required.

In the step of performing model training and model parameter adjustment, the adjustment is preferably performed by a grid search method.
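Grid search itself is simple to sketch: every combination of candidate parameter values is scored and the best one is kept. The example below is a generic illustration. `n_estimators` is the tree-count parameter named in the description of FIG. 3, whereas `max_depth`, the candidate values and the toy scoring function are assumptions standing in for a real cross-validated accuracy.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Optional, Tuple


def grid_search(
    param_grid: Dict[str, List[Any]],
    score_fn: Callable[[Dict[str, Any]], float],
) -> Tuple[Optional[Dict[str, Any]], float]:
    """Score every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params: Dict[str, Any]) -> float:
    """Toy stand-in for cross-validated accuracy (peaks at 200 trees, depth 8)."""
    return (1.0
            - abs(params["n_estimators"] - 200) / 1000
            - abs(params["max_depth"] - 8) / 100)


grid = {"n_estimators": [50, 100, 200, 400], "max_depth": [4, 8, 16]}
best, score = grid_search(grid, toy_score)
print(best)  # {'n_estimators': 200, 'max_depth': 8}
```

The cost grows with the product of the grid sizes (12 evaluations here), which is one reason reducing the number of SNP features before tuning matters.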

The preliminary model obtained after model training and parameter adjustment is then evaluated, i.e. the accuracy of its predictions is checked. Existing model evaluation methods can be used. A model is evaluated mainly from two angles. One is its discrimination, or prediction accuracy, for which the indexes include AUC (the most commonly used index in classification problems; the larger the AUC value, the better the classification and the higher the accuracy). The other is its goodness of fit, or calibration. Discrimination should generally be considered first; it measures the accuracy of the model's predictions, i.e. its ability to classify correctly.
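As a reminder of what AUC measures, the sketch below computes it for a two-class case as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The text does not say how AUC is extended to more than two populations; averaging one-vs-rest AUCs over populations is one common convention and is mentioned here only as an assumption. The function name and toy scores are illustrative.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 1 = sample belongs to the population of interest, 0 = any other population.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]  # predicted membership probability
print(binary_auc(labels, scores))  # 8 of 9 pairs ranked correctly
```

An AUC of 1.0 means every positive outranks every negative, and 0.5 corresponds to random ranking.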

In a preferred embodiment, evaluating the preliminary model to obtain the prediction model for variety identification includes: dividing the SNP reduced set into a training set and a test set, and evaluating the preliminary model by performing five-fold cross-validation on the training set and outputting an AUC value on the test set; if the evaluation result meets the preset standard, taking the preliminary model as the prediction model; if the evaluation result does not meet the preset standard, returning to the model training and model parameter adjustment steps and repeating them on the preliminary model until the evaluation result meets the preset standard.

The preset standard may vary with the species of the classified samples; for example, it may be that the prediction accuracy reaches 90%, more preferably 95%, for example 96%, 97%, 98%, 99% or even 100%.

The training set and the test set are usually split at a ratio of 8:2; the ratio can of course be adjusted within the range 5:5 to 9:1 as required, for example 8:2, 7:3, 6:4 or 5:5.
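The splitting scheme can be sketched on sample indices alone. The example below performs an 8:2 split and then five-fold partitioning of the training part. It is a simplified illustration: the split is a plain random shuffle, whereas whether the split should be stratified by population is not stated in the text, and the function names are hypothetical.

```python
import random


def train_test_split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and cut them at the requested ratio."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = round(n_samples * train_fraction)
    return idx[:cut], idx[cut:]


def k_fold_indices(indices, k=5):
    """Yield (fit, validation) index lists; fold sizes differ by at most one."""
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        fit = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield fit, folds[f]


train_idx, test_idx = train_test_split_indices(100)
print(len(train_idx), len(test_idx))  # 80 20
for fit_idx, val_idx in k_fold_indices(train_idx):
    print(len(fit_idx), len(val_idx))  # 64 16 on every fold
```

Cross-validation uses only the 80 training samples, so the 20 test samples stay untouched until the final AUC is computed.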

In addition to obtaining the prediction model for variety identification, the construction method of the application comprises: exporting the prediction model and storing it under the cluster path, ranking the importance of each feature (i.e. SNP locus) returned by the prediction model (the importance being the value given by the random forest model in machine learning), and exporting the importance of each feature (i.e. SNP locus) and storing it under the cluster path. Because the prediction model ranks the scores of the SNP loci on the corresponding chromosomes during construction, SNP loci of different importance related to each variety can be obtained for different purposes, providing a reference for subsequent variety identification.
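Ranking and exporting the importances can be sketched as below. The output format (tab-separated text), the locus naming scheme and the importance values are assumptions made for illustration; the text only states that the ranked importances are stored under a cluster path. An in-memory buffer stands in for the output file.

```python
import csv
import io


def rank_importances(importances):
    """Sort (locus, importance) pairs from most to least important."""
    return sorted(importances.items(), key=lambda kv: kv[1], reverse=True)


def write_ranking(importances, handle):
    """Write the ranked importances as tab-separated text to an open handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["locus", "importance"])
    for locus, value in rank_importances(importances):
        writer.writerow([locus, value])


# Hypothetical locus names and importance values.
importances = {"chr1:1050": 0.12, "chr3:77": 0.55, "chr2:900": 0.33}
buffer = io.StringIO()  # stands in for a file opened under the output path
write_ranking(importances, buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the buffer would write the same table to disk.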

Example 2

The embodiment provides a more specific method for constructing a variety identification prediction model, as shown in fig. 2, and the detailed steps are as follows:

a. SNP acquisition: SNP molecular markers are generally obtained by aligning whole-genome resequencing data to a reference genome, or with an SNP chip.

b. Preprocessing the SNP data set: SNP loci with missing genotypes and SNP loci with a minor allele frequency lower than 5% are deleted, and the data are converted into numeric form according to SNP genotype, with genotype AA encoded as 0, genotype AB as 1 and genotype BB as 2. The preprocessed and encoded SNP locus data form a matrix X_m×n, where m is the number of samples to be identified and n is the total number of SNP loci.

c. Filtering the SNP data by the gradient boosting method: 1) extracting part of the data from the matrix X_m×n of step b without replacement to form a matrix K_m×j, where m is the number of samples to be identified and j is the number of SNP loci extracted; 2) taking the population name of each sample as the target value, scoring the contribution (importance) of each SNP in the matrix K_m×j by the gradient boosting method; 3) sorting the SNP loci in the matrix K_m×j by contribution (importance) from high to low, accumulating the contributions in that order, and retaining the loci with an accumulated contribution greater than p, to generate a matrix M_m×i; the accumulated contribution p is generally set to 0.99, i.e. the SNP loci accounting for an accumulated contribution of 99% are retained; 4) judging whether all SNPs have been traversed; if so, the SNP loci retained in the last accumulation are recorded as the SNP reduced set, which is used to train the candidate random forest model; if not, the matrix M_m×i is taken as part of a new matrix K_m×j, the data of a further (j-i) loci, K_m×(j-i), are extracted from the matrix X_m×n without replacement to complete the new matrix K_m×j, and steps 2) and 3) are repeated until all SNP loci have been traversed, giving the SNP reduced set used to train the candidate random forest model.

Here m, the number of samples, is a natural number; j and i, the numbers of SNP loci extracted and retained respectively, are natural numbers with i < j < m; and n, the total number of SNP loci, is also a natural number.

d. Model training and parameter tuning: a random forest model is selected from among machine learning algorithms, the reduction set from step c is taken as the feature X and the population name of each sample as the target value Y, and the model parameters are tuned by grid search to obtain a preliminary model.

e. Model evaluation and output: the preliminary model is evaluated by performing five-fold cross-validation on the training set and outputting the AUC value on the test set. If the result meets the preset standard, the preliminary model is taken as the prediction model and is exported and stored under the cluster path; at the same time, the contribution (importance) of each feature returned by the model is sorted, exported, and stored under the cluster path. If the result does not meet the preset standard, the procedure returns to step d, and the preliminary model is retrained and its parameters are tuned again.
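The iterative filtering of step c can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `score_fn` is a hypothetical callable standing in for the gradient boosting importance scoring, the function names are invented for the example, a batch with no positive score is assumed to retain nothing, and at least one new locus is drawn per round so that the loop always makes progress.

```python
import random
from typing import Callable, List


def select_by_cumulative(loci: List[int], scores: List[float],
                         p: float = 0.99) -> List[int]:
    """Sort loci by descending score and keep the leading loci until the
    normalised cumulative score first exceeds p."""
    total = sum(scores)
    if total <= 0:
        return []  # assumed fallback: no locus in this batch contributes
    ranked = sorted(zip(loci, scores), key=lambda pair: pair[1], reverse=True)
    kept, running = [], 0.0
    for locus, score in ranked:
        kept.append(locus)
        running += score / total
        if running > p:
            break
    return kept


def reduce_snps(n_loci: int, j: int,
                score_fn: Callable[[List[int]], List[float]],
                p: float = 0.99, seed: int = 0) -> List[int]:
    """Traverse all n_loci column indices in batches of about j loci,
    carrying the retained loci forward into the next batch."""
    rng = random.Random(seed)
    pool = list(range(n_loci))
    rng.shuffle(pool)  # popping from a shuffled pool = drawing without replacement
    current = [pool.pop() for _ in range(min(j, len(pool)))]
    while True:
        kept = select_by_cumulative(current, score_fn(current), p)
        if not pool:  # every locus has been traversed
            return sorted(kept)
        # Refill the batch with j - i new loci (at least one, an assumption).
        refill = min(max(1, j - len(kept)), len(pool))
        current = kept + [pool.pop() for _ in range(refill)]
```

In practice `score_fn` would fit a gradient boosting classifier on the columns of X_{m×n} named by the batch, with the population names as targets, and return one importance value per locus. Because only about j columns are scored at a time, memory use depends on j rather than on n.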

Example 3

In this example, a data set of 100 samples in total from 4 different red deer varieties was used, comprising 11343245 original SNPs. After preprocessing and filtering by the gradient boosting method, a reduced set of 1080 SNP loci was obtained, and model training and parameter tuning were performed on the reduced set of the 100 samples to obtain a preliminary model.

The preliminary model was then evaluated with a training set to test set size ratio of 8:2. The evaluation showed that the accuracy of the preliminary model was 85%, which does not meet the preset standard of more than 90%. The procedure therefore returned to the model training and parameter tuning step until the evaluation showed a model prediction accuracy of more than 98%. The preliminary model at that point is taken as the prediction model.
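The 8:2 split and the comparison against the preset standard can be sketched as below. This is only an illustration: the function names are hypothetical, the split is a plain random shuffle (the text does not say whether the split was stratified by variety), and the 90% threshold is the preset standard quoted in this example.

```python
import random
from typing import List, Sequence, Tuple


def split_8_2(n_samples: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the sample indices and split them 8:2 into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = n_samples * 8 // 10
    return indices[:cut], indices[cut:]


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted population matches the true one."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def meets_standard(acc: float, threshold: float = 0.90) -> bool:
    """True if the accuracy is above the preset standard."""
    return acc > threshold
```

With 100 samples this gives 80 training and 20 test samples. An accuracy of 0.85 fails the check and sends the procedure back to training and tuning, while 0.98 passes.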

Example 4

This example provides a method for identifying red deer varieties; the specific method is the same as in Example 2. FIG. 3 shows the prediction accuracy under different values of n_estimators, FIG. 4 shows the accuracy of the different cross-validation groups, and FIG. 5 shows the importance of the SNP loci (top 30).

FIG. 3 shows that the accuracy of the model increases as the number of trees n increases; the number of trees at which the accuracy stops increasing is fixed as the parameter for subsequent analysis.

FIG. 4 shows that the accuracy remains at a high level across repeated cross-validation, indicating that there is no problem with the model training and parameter settings and that the trained model can be saved for subsequent analysis.

FIG. 5 shows that the importance of different SNP sites to the model is different, and the higher the importance, the greater the contribution to the model.
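The rule described for FIG. 3, fixing the number of trees at the point where accuracy stops increasing, can be sketched as follows. The function name and the tolerance parameter `tol` are assumptions for illustration, and any accuracy values passed in are supplied by the caller; none are taken from FIG. 3.

```python
from typing import Dict


def pick_n_estimators(accuracy_by_n: Dict[int, float], tol: float = 0.0) -> int:
    """Return the smallest tree count after which accuracy no longer
    improves by more than tol as the tree count grows."""
    counts = sorted(accuracy_by_n)
    best_n = counts[0]
    best_acc = accuracy_by_n[best_n]
    for n in counts[1:]:
        if accuracy_by_n[n] > best_acc + tol:
            best_n, best_acc = n, accuracy_by_n[n]
    return best_n
```

Choosing the smallest tree count on the plateau keeps training and prediction cost down without giving up accuracy.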

Example 5

This example provides a variety identification method, which comprises the following steps: the SNP data set of a sample to be identified is preprocessed and then filtered by the gradient boosting method to obtain the SNP reduced set of the sample to be identified; the SNP reduced set of the sample to be identified is then input into the prediction model constructed in Example 1 for prediction, thereby obtaining the population to which the sample to be identified belongs.

The preprocessing and the gradient boosting filtering of the SNP data set of the sample to be identified are performed in the same way as the corresponding operations on the SNP data set used as the training set in Example 1.
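One possible reading of the prediction step is sketched below: the coded genotypes of the new sample are restricted to the loci of the reduced set, in the same order as during training, and passed to the saved model. The text does not spell out this detail, so the sketch is an assumption; the function names, the dictionary representation of a sample, the locus identifiers, and the `predict` callable standing in for the trained model are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def to_reduced_features(coded_sample: Dict[str, int],
                        reduced_loci: Sequence[str]) -> List[int]:
    """Pick the 0/1/2 codes of the reduced-set loci, in training order."""
    missing = [locus for locus in reduced_loci if locus not in coded_sample]
    if missing:
        raise KeyError(f"sample lacks reduced-set loci: {missing}")
    return [coded_sample[locus] for locus in reduced_loci]


def identify(coded_sample: Dict[str, int],
             reduced_loci: Sequence[str],
             predict: Callable[[List[int]], str]) -> str:
    """Return the population predicted for one preprocessed sample."""
    return predict(to_reduced_features(coded_sample, reduced_loci))
```

Keeping the feature order identical to the order used during training matters, because the model associates each column position with one specific locus.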

From the description of the above embodiments, it can be seen that the variety identification method of the present application has the following advantages:

a) The method can realize high-throughput and automatic variety identification flow, and can realize automatic variety identification of a large number of samples after a trained model is obtained;

b) Unlike other methods, the method calculates the contribution of each SNP locus during model training, and loci with high contribution can serve as a reference for subsequent genetic breeding;

c) The method is self-learning: as the number of test samples increases, the accuracy of model training keeps improving;

d) Subsets of SNP loci are extracted without replacement and filtered by the gradient boosting method, so very little computing and storage resource is required, and the computation can be carried out on a personal computer.

Example 6

This example provides a construction apparatus for a variety identification prediction model. As shown in FIG. 6, the construction apparatus includes an SNP acquisition module 10, a preprocessing module 20, an SNP filtering module 30, a model training module 40, and an evaluation determination module 50, wherein the SNP acquisition module 10 is used for acquiring an SNP data set;

a preprocessing module 20, configured to preprocess the SNP data set to obtain a SNP data matrix;

the SNP filtering module 30 is used for filtering the SNP data matrix by the gradient boosting method to obtain an SNP reduction set;

the model training module 40 is configured to perform model training and model parameter adjustment by using the SNP reduction set as a feature value and using the population name to which the sample belongs as a target value, so as to obtain a preliminary model;

the evaluation determining module 50 is configured to evaluate the preliminary model to obtain a prediction model for variety identification.

Preferably, the preprocessing module includes: a locus screening module for removing SNP loci with missing data and/or a minor allele frequency below 5% from the SNP data set to obtain effective SNP loci; and a coding conversion module for numerically coding the genotypes of the n effective SNP loci of the m samples to be identified to obtain an SNP data matrix X_{m×n}, wherein the wild-type homozygous genotype AA is recorded as 0, the heterozygous genotype AB as 1, and the mutant homozygous genotype BB as 2; m and n are each natural numbers, preferably natural numbers greater than or equal to 2.
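The locus screening and coding conversion can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function name `preprocess`, the use of `None` to represent a missing genotype, and computing the minor allele frequency from the count of B alleles are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Numeric codes stated in the text: AA -> 0, AB -> 1, BB -> 2.
CODES = {"AA": 0, "AB": 1, "BB": 2}


def preprocess(genotypes: List[List[Optional[str]]],
               maf_min: float = 0.05) -> Tuple[List[List[int]], List[int]]:
    """Return (X, kept): X is the coded m x n matrix and kept lists the
    indices of the original loci that survive screening.

    genotypes[s][locus] is the genotype of sample s at that locus, or
    None if the genotype is missing (an assumed representation).
    """
    m = len(genotypes)
    n_loci = len(genotypes[0]) if m else 0
    kept = []
    for locus in range(n_loci):
        column = [genotypes[s][locus] for s in range(m)]
        if any(g is None for g in column):
            continue  # drop loci with missing data
        b_alleles = sum(CODES[g] for g in column)  # each code counts B alleles
        freq_b = b_alleles / (2 * m)
        if min(freq_b, 1 - freq_b) < maf_min:
            continue  # drop loci whose minor allele frequency is below 5%
        kept.append(locus)
    X = [[CODES[genotypes[s][locus]] for locus in kept] for s in range(m)]
    return X, kept
```

Since the codes 0, 1 and 2 equal the number of B alleles carried, the column sum divided by 2m gives the B allele frequency directly.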

Preferably, the SNP filtering module comprises: a first extraction module for extracting j samples of data from the SNP data matrix without replacement to form a first data matrix K_{m×j}; a contribution scoring module for scoring, with the population name of the sample as the target value, the contribution of each SNP locus in the first data matrix K_{m×j} by the gradient boosting method; a sorting, accumulation and selection module for sorting the SNP loci in the first data matrix K_{m×j} by contribution score, accumulating the contributions in that order, and retaining the relevant SNP loci whose accumulated contribution is greater than p to generate a data matrix K_{m×i}; and a judging and traversing module for judging whether all SNP loci have been traversed, wherein, if all SNP loci have been traversed, the relevant SNP loci last retained with an accumulated contribution greater than p are taken as the SNP reduction set, and, if not all SNP loci have been traversed, the data matrix K_{m×i} is taken as part of the data matrix K_{m×j}, K_{m×(j-i)} data are extracted without replacement from the data matrix X_{m×n} to form a second data matrix K_{m×j}, and the contribution scoring and the sorting, accumulation and selection are repeated until all SNP loci have been traversed, giving the SNP reduction set.

Preferably, in the model training module, a grid searching method is adopted for model parameter adjustment.
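Grid search simply evaluates every combination of candidate parameter values and keeps the best one. The sketch below shows the idea in plain Python under stated assumptions: `evaluate` is a hypothetical callable that would train a random forest with the given parameters and return a validation score, and the parameter names in the usage note are illustrative (`n_estimators` appears in Example 4; `max_depth` is an assumed second parameter).

```python
from itertools import product
from typing import Any, Callable, Dict, List, Optional, Tuple


def grid_search(param_grid: Dict[str, List[Any]],
                evaluate: Callable[[Dict[str, Any]], float]
                ) -> Tuple[Optional[Dict[str, Any]], float]:
    """Try every combination in param_grid and return the best
    (parameters, score) pair according to evaluate."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:  # ties keep the first combination seen
            best_params, best_score = params, score
    return best_params, best_score
```

A call such as `grid_search({"n_estimators": [100, 200, 500], "max_depth": [3, 5, 10]}, evaluate)` runs nine evaluations; the cost grows multiplicatively with each added parameter, which is one reason to reduce the SNP set before tuning.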

Preferably, the evaluation determination module includes: an evaluation module for dividing the SNP reduction set into a training set and a test set, and evaluating the preliminary model by performing five-fold cross-validation on the training set and outputting an AUC value on the test set; a first determining module for taking the preliminary model as the prediction model when the evaluation result meets a preset standard; and a second determining module for returning to the preliminary model when the evaluation result does not meet the preset standard, and repeatedly executing model training and model parameter adjustment until the evaluation result meets the preset standard.
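The two evaluation ingredients named here, five-fold splitting of the training set and an AUC value, can be sketched in plain Python. The function names are hypothetical, and the AUC shown is the binary, rank-based form; applying it to several varieties would need an extension such as one-vs-rest averaging, which the text does not specify.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 5,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [x for f, fold in enumerate(folds) if f != i for x in fold]
        yield train, folds[i]


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a positive sample is scored above a negative one,
    counting ties as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Each sample serves as validation data exactly once across the five folds, so the averaged score uses the whole training set without ever scoring a sample the model was fitted on.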

Preferably, the construction apparatus further comprises: an export storage module for exporting and storing the prediction model under the cluster path, sorting the importance of each feature (i.e. SNP locus) returned by the prediction model, and exporting and storing the importance of each feature (i.e. SNP locus) under the cluster path, thereby providing a reference for subsequent variety identification.

Example 7

This example provides an apparatus for variety identification, which comprises the construction apparatus of the variety identification prediction model described above.

This example also provides a storage medium comprising a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute any of the above methods for constructing a variety identification prediction model.

This example also provides a processor for running a program, wherein the program, when run, executes any of the above methods for constructing a variety identification prediction model.

The memory may include forms of computer-readable media such as volatile memory, for example random access memory (RAM), and/or nonvolatile memory, for example read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

The order of the embodiments of the application described above does not indicate their relative merits.

In the foregoing embodiments of the present application, each embodiment is described with its own emphasis; for a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners.

The apparatus embodiments described above are merely exemplary. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between parts may be through some interfaces, units, or modules, and may be electrical or in other forms.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The construction method of the variety identification prediction model is characterized by comprising the following steps:

acquiring a SNP data set;

preprocessing the SNP data set to obtain an SNP data matrix;

filtering the SNP data matrix by the gradient boosting method to obtain an SNP reduction set;

taking the SNP reduced set as a characteristic value, taking the population name of a sample as a target value, and carrying out model training and model parameter adjustment to obtain a preliminary model;

evaluating the preliminary model to obtain a prediction model for variety identification;

wherein preprocessing the SNP data set to obtain the SNP data matrix comprises:

removing SNP loci with missing data and/or a minor allele frequency below 5% from the SNP data set to obtain effective SNP loci;

numerically coding the genotypes of the n effective SNP loci of the m samples to be identified to obtain the SNP data matrix, recorded as data matrix X_{m×n};

wherein the wild-type homozygous genotype AA is recorded as 0, the heterozygous genotype AB is recorded as 1, the mutant homozygous genotype BB is recorded as 2, and m and n are each natural numbers;

wherein filtering the SNP data matrix by the gradient boosting method to obtain the SNP reduction set comprises:

a. extracting j samples of data from the SNP data matrix without replacement to form a first data matrix K_{m×j};

b. taking the population name of the sample as a target value, scoring the contribution of each SNP locus in the first data matrix K_{m×j} by the gradient boosting method;

c. sorting the SNP loci in the first data matrix K_{m×j} by contribution score, accumulating the contributions in that order, and retaining the relevant SNP loci whose accumulated contribution is greater than p to generate a data matrix K_{m×i};

d. judging whether all SNP loci have been traversed; if so, taking the relevant SNP loci last retained with an accumulated contribution greater than p as the SNP reduction set; if not, taking the data matrix K_{m×i} as part of the data matrix K_{m×j}, extracting K_{m×(j-i)} data without replacement from the data matrix X_{m×n} to form a second data matrix K_{m×j}, and repeating steps b) and c) until all SNP loci have been traversed, to obtain the SNP reduction set.

2. The construction method according to claim 1, wherein the m and the n are natural numbers of 2 or more, respectively.

3. The method of claim 1, wherein in the step of performing model training and adjusting model parameters, the adjustment is performed by using a grid search method.

4. The construction method according to any one of claims 1 to 3, wherein evaluating the preliminary model to obtain the prediction model for variety identification comprises:

dividing the SNP reduced set into a training set and a testing set;

evaluating the preliminary model by performing five-fold cross-validation on the training set and outputting an AUC value on the test set;

if the evaluation result meets the preset standard, the preliminary model is used as the prediction model;

and if the evaluation result does not accord with the preset standard, returning to the preliminary model, and repeatedly executing the model training and model parameter adjusting steps until the evaluation result accords with the preset standard.

5. The construction method according to claim 4, wherein, in addition to obtaining the prediction model for variety identification, the construction method further comprises:

and exporting and storing the prediction model under a cluster path, sequencing the importance of each SNP locus returned by the prediction model, and exporting and storing the importance of each SNP locus under the cluster path.

6. A method of variety identification, the method comprising:

preprocessing an SNP data set of a sample to be identified and then filtering it by the gradient boosting method to obtain an SNP reduced set of the sample to be identified;

inputting the SNP reduced set of the sample to be identified into the prediction model constructed by the construction method according to any one of claims 1 to 5 for prediction, thereby obtaining the population to which the sample to be identified belongs;

wherein the preprocessing is performed according to the preprocessing step in the construction method of claim 1;

and the filtering is performed according to the filtering step in the construction method of claim 1.

7. A construction apparatus of a variety identification prediction model, characterized in that the construction apparatus includes:

the SNP acquisition module is used for acquiring an SNP data set;

the preprocessing module is used for preprocessing the SNP data set to obtain an SNP data matrix;

the SNP filtering module is used for filtering the SNP data matrix by the gradient boosting method to obtain an SNP reduction set;

the model training module is used for carrying out model training and model parameter adjustment by taking the SNP reduced set as a characteristic value and the population name of a sample as a target value to obtain a preliminary model;

the evaluation determining module is used for evaluating the preliminary model to obtain a prediction model of the variety identification;

the preprocessing module comprises:

the locus screening module is used for removing SNP loci with missing data and/or a minor allele frequency below 5% from the SNP data set to obtain effective SNP loci;

the coding conversion module is used for numerically coding the genotypes of the n effective SNP loci of the m samples to be identified to obtain the SNP data matrix, recorded as data matrix X_{m×n};

the SNP filtering module includes:

a first extraction module for extracting j samples of data from the SNP data matrix without replacement to form a first data matrix K_{m×j};

a contribution scoring module for scoring, with the population name of the sample as a target value, the contribution of each SNP locus in the first data matrix K_{m×j} by the gradient boosting method;

a sorting, accumulation and selection module for sorting the SNP loci in the first data matrix K_{m×j} by contribution score, accumulating the contributions in that order, and retaining the relevant SNP loci whose accumulated contribution is greater than p to generate a data matrix K_{m×i};

a judging and traversing module for judging whether all SNP loci have been traversed, wherein, if all SNP loci have been traversed, the relevant SNP loci last retained with an accumulated contribution greater than p are taken as the SNP reduction set; and, if not all SNP loci have been traversed, the data matrix K_{m×i} is taken as part of the data matrix K_{m×j}, K_{m×(j-i)} data are extracted without replacement from the data matrix X_{m×n} to form a second data matrix K_{m×j}, and the contribution scoring and the sorting, accumulation and selection are repeated until all SNP loci have been traversed, to obtain the SNP reduction set.

8. The construction apparatus according to claim 7, wherein the m and the n are natural numbers of 2 or more, respectively.

9. The construction apparatus according to claim 7, wherein the model training module performs model parameter adjustment by using a grid search method.

10. The construction apparatus according to any one of claims 7 to 9, wherein the evaluation determination module comprises:

the evaluation module is used for performing five-fold cross validation on the training set by dividing the SNP reduction set into the training set and the test set and evaluating the preliminary model in a mode of outputting an AUC value on the test set;

the first determining module is used for taking the preliminary model as the prediction model when the evaluation result meets a preset standard;

and the second determining module is used for returning to the preliminary model when the evaluation result does not meet the preset standard, and repeatedly executing the model training and the model parameter adjustment until the evaluation result meets the preset standard.

11. The construction apparatus of claim 10, wherein the construction apparatus further comprises:

and the export storage module is used for exporting and storing the prediction model under a cluster path, sequencing the importance of each SNP locus returned by the prediction model, and exporting and storing the importance of each SNP locus under the cluster path.

12. An apparatus for variety identification, the apparatus comprising: the construction apparatus of any one of claims 7 to 11.

13. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to execute the method of constructing the variety identification prediction model of any one of claims 1 to 5.

14. A processor for running a program, wherein the program, when run, performs the method of constructing the variety identification prediction model of any one of claims 1 to 5.