CN116168766A

CN116168766A - Variety identification method, system and terminal based on ensemble learning

Info

Publication number: CN116168766A
Application number: CN202211598660.5A
Authority: CN
Inventors: 产天龙; 袁箐; 韩峻松; 徐祎春; 丁岩汀
Original assignee: SHANGHAI BIOCHIP CO Ltd
Current assignee: SHANGHAI BIOCHIP CO Ltd
Priority date: 2022-12-12
Filing date: 2022-12-12
Publication date: 2023-05-26

Abstract

In the variety identification method, system and terminal based on ensemble learning, a feature data matrix is obtained by preprocessing the molecular genetic marker data of all varieties in a variety database; classifiers are constructed and screened based on the feature data matrix and the variety to which each sample belongs; a prediction model is constructed from the screened classifiers; and finally, variety identification and pedigree inference are completed based on the constructed prediction model. The invention enables variety identification and pedigree inference that are highly accurate, fast, high-throughput, parallelizable, automated and transferable.

Description

Variety identification method, system and terminal based on ensemble learning

Technical Field

The invention relates to the field of variety identification, in particular to a variety identification method, system and terminal based on ensemble learning.

Background

Variety identification refers to the process of evaluating an artificially bred population that has certain morphological characteristics and production traits. It plays an extremely important role in plant genetic breeding, seedling propagation, the breeding of purebred cats and dogs, and the like.

Early variety identification relied mainly on phenotype: the external characteristics of the sample to be tested were measured by eye or with instruments, and morphological or physiological traits and the like were used as genetic markers to study and identify its variety. However, this approach has three drawbacks. First, it is difficult to distinguish phenotypically similar varieties or to identify hybrid varieties. Second, the judgment is highly subjective, and it is often difficult to provide an objective scientific basis. Finally, it is difficult to quantify the purity of the sample to be tested.

Disclosure of Invention

In view of the above drawbacks of the prior art, an object of the present invention is to provide a variety identification method, system and terminal based on ensemble learning, which are used for solving the above technical problems in the prior art.

To achieve the above and other related objects, the present invention provides a variety identification method based on ensemble learning, the method comprising: preprocessing molecular genetic marker data of all varieties in a variety database to obtain a feature data matrix, wherein the feature data matrix comprises a sample feature matrix for each sample of each variety; constructing and screening classifiers based on the feature data matrix and the variety to which each sample belongs; constructing an integrated classifier for each molecular genetic marker locus based on the screened classifiers, so as to construct a prediction model; and, based on the constructed prediction model, preprocessing molecular genetic marker data of a sample to be tested and outputting a corresponding variety identification result according to the sample feature matrix obtained by the preprocessing.

In an embodiment of the present invention, the sample feature matrix includes a genotype feature matrix corresponding to each molecular genetic marker locus; the genotype feature matrix comprises feature values corresponding to the genotypes of the current molecular genetic marker locus; the types of genotypes include known types and unknown/undetected types; the feature value of a genotype of a known type is the number of that genotype contained in the sample; the feature value of a genotype of the unknown/undetected type is set according to a setting rule; and the setting rule includes: the feature value is a non-zero natural number that follows a specific distribution whose mean is a first threshold and whose maximum does not exceed a second threshold; if the feature value exceeds the second threshold, it is changed to the second threshold.

In an embodiment of the present invention, constructing and screening classifiers based on the feature data matrix and the variety to which each sample belongs includes: constructing a classifier for each selected algorithm based on the feature data matrix and the variety to which each sample belongs; and, based on the bidirectional selection rule, taking each classifier as a classifier to be screened and screening out one or more classifiers according to the prediction evaluation result of each classifier.

In an embodiment of the present invention, constructing an integrated classifier for each molecular genetic marker locus based on the screened classifiers includes: constructing an integrated classifier for each molecular genetic marker locus based on the algorithm corresponding to each screened classifier and its weighting coefficient, wherein the integrated classifier comprises sub-classifiers constructed with the algorithms corresponding to the screened classifiers; and determining the integrated classifier corresponding to each molecular genetic marker locus as an integrated classifier used to construct the prediction model.

In an embodiment of the present invention, constructing an integrated classifier for each molecular genetic marker locus based on the screened classifiers includes: constructing an integrated classifier for each molecular genetic marker locus based on the algorithm corresponding to each screened classifier and its weighting coefficient, wherein the integrated classifier comprises sub-classifiers constructed with the algorithms corresponding to the screened classifiers; and, based on the bidirectional selection rule, taking each integrated classifier as a classifier to be screened and screening out the integrated classifiers corresponding to several molecular genetic marker loci according to the prediction evaluation result of the integrated classifier corresponding to each molecular genetic marker locus, the screened integrated classifiers being the determined integrated classifiers used to construct the prediction model.

In one embodiment of the present invention, the prediction model includes: a locus classification module, which comprises all the determined integrated classifiers and is used for outputting, for a sample, a classification result for each variety according to the genotype feature matrix of the sample at the corresponding molecular genetic marker locus, wherein if the value of a classification result is zero, the result is assigned a non-zero constant smaller than one-half of s, where s is the total number of varieties in the variety database; a fusion module, connected with the locus classification module, for multiplying together the classification results output by the integrated classifiers for each variety to obtain an L value of the sample for each variety; and an identification result output module, connected with the fusion module, for calculating a probability value for each variety from the L values of the sample so as to output the corresponding variety identification result.

In an embodiment of the present invention, the bidirectional selection rule includes: step 1: sorting the classifiers to be screened by prediction capability, as determined from their prediction evaluation results, to form a candidate model set; step 2: selecting the two classifiers to be screened with the best prediction capability and integrating them into a preferred model set; step 3: from the classifiers in the candidate model set that are not in the preferred model set and have not yet been selected, selecting the one with the best prediction capability and adding it to the current preferred model set to form a new model set; step 4: comparing the prediction capability of the current preferred model set with that of the new model set; if the current preferred model set is better, the preferred model set is unchanged; otherwise, the new model set becomes the preferred model set; step 5: removing each classifier in turn from the current preferred model set to build a series of new classification integrated model sets; selecting, among those not yet selected, the classification integrated model set with the best prediction capability; comparing the prediction capability of the current preferred model set with that of the selected classification integrated model set; if the current preferred model set is better, the preferred model set is unchanged; otherwise, the selected classification integrated model set becomes the preferred model set; and repeating until all classification integrated model sets have been selected and the preferred model set remains unchanged; step 6: judging whether all classifiers to be screened in the candidate model set have been selected; if yes, taking the classifiers in the current preferred model set as the screened classifiers; if not, returning to step 3 until all classifiers to be screened in the candidate model set have been selected, and then taking the classifiers in the current preferred model set as the screened classifiers.

In an embodiment of the invention, the method further comprises filtering and optimizing the variety database as follows: if a sample in the variety database is misclassified by more than half of the constructed classifiers, or the probability with which the constructed integrated classifier predicts its correct variety is smaller than a set threshold, the sample is rejected; if the number of samples of a variety is less than a sample threshold, or more than half of its samples are judged to be rejected, all samples of the variety are rejected.

To achieve the above and other related objects, the present invention provides an ensemble learning-based variety identification system including: the data preprocessing module is used for preprocessing the molecular genetic marker data of all the samples of all the varieties in the variety database to obtain a characteristic data matrix; wherein the feature data matrix comprises: sample feature matrix of each sample corresponding to each variety; the classifier selecting module is connected with the data preprocessing module and is used for constructing and screening a classifier based on the characteristic data matrix and varieties to which each sample belongs; the model construction module is connected with the classifier selection module and is used for respectively constructing an integrated classifier for each molecular genetic marker locus based on the screened classifiers so as to construct a prediction model; and the variety identification module is connected with the model construction module and is used for preprocessing molecular genetic marker data of the sample to be detected based on the constructed prediction model and outputting a corresponding variety identification result according to a corresponding sample feature matrix obtained by preprocessing.

To achieve the above and other related objects, the present invention provides a variety identification terminal based on ensemble learning, including: one or more memories and one or more processors; the one or more memories are used for storing a computer program; the one or more processors are coupled to the one or more memories and execute the computer program to perform the ensemble learning-based variety identification method.

As described above, the variety identification method, system and terminal based on ensemble learning of the present invention have the following beneficial effects: a feature data matrix is obtained by preprocessing the molecular genetic marker data of all varieties in a variety database; classifiers are constructed and screened based on the feature data matrix and the variety to which each sample belongs; a prediction model is constructed from the screened classifiers; and finally, variety identification and pedigree inference are completed based on the constructed prediction model. The invention enables variety identification and pedigree inference that are highly accurate, fast, high-throughput, parallelizable, automated and transferable.

Drawings

Fig. 1 is a schematic flow chart of a variety identification method based on ensemble learning according to an embodiment of the present invention.

Fig. 2 is a flow chart of a bidirectional selection rule according to an embodiment of the invention.

Fig. 3 is a schematic flow chart of a variety identification method based on ensemble learning according to an embodiment of the present invention.

Fig. 4 is a flow chart of a bidirectional selection rule according to an embodiment of the invention.

Fig. 5 is a diagram showing the variety identification results in an embodiment of the present invention.

Fig. 6 is a schematic structural diagram of a variety identification system based on ensemble learning according to an embodiment of the present invention.

Fig. 7 is a schematic structural diagram of a variety identification terminal based on ensemble learning according to an embodiment of the present invention.

Detailed Description

Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may be practiced or carried out in other embodiments that depart from the specific details, and the details of the present description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict.

In the following description, reference is made to the accompanying drawings, which illustrate several embodiments of the invention. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present invention. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present invention is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Spatially relative terms, such as "upper," "lower," "left," "right," "below," "above," and the like, may be used herein to facilitate a description of one element or feature as illustrated in the figures relative to another element or feature.

Throughout the specification, when a portion is said to be "connected" to another portion, this includes not only the case of "direct connection" but also the case of "indirect connection" with other elements interposed therebetween. In addition, when a certain component is said to be "included" in a certain section, unless otherwise stated, other components are not excluded, but it is meant that other components may be included.

Although terms such as first, second, and third are used herein to describe various portions, components, regions, layers and/or sections, these are not limited by such terms. These terms are only used to distinguish one portion, component, region, layer or section from another portion, component, region, layer or section. Thus, a first portion, component, region, layer or section discussed below could be termed a second portion, component, region, layer or section without departing from the scope of the present invention.

Furthermore, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, operations, elements, components, items, categories, and/or groups, but do not preclude the presence, occurrence or addition of one or more other features, operations, elements, components, items, categories, and/or groups. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition will occur only when a combination of elements, functions or operations is in some way inherently mutually exclusive.

Molecular genetic markers are based on nucleotide sequence variation in the genetic material between individuals and directly reflect genetic polymorphism at the DNA level. They have the following advantages: genomic variation is extremely abundant, so the number of molecular markers is almost unlimited; marker analysis can be performed on DNA from different tissues and at different stages of biological development; and detection is simple and rapid. They therefore allow large-scale variety identification to be carried out scientifically, accurately, efficiently and rapidly.

Ensemble learning refers to a technique that uses multiple compatible learning algorithms/models to perform a single task in order to achieve better predictive performance. It combines multiple base learners to obtain better generalization performance than a single learner.

Therefore, the invention provides a variety identification method, system and terminal based on ensemble learning, in which a feature data matrix is obtained by preprocessing the molecular genetic marker data of all varieties in a variety database; classifiers are constructed and screened based on the feature data matrix and the variety to which each sample belongs; a prediction model is constructed from the screened classifiers; and finally, variety identification and pedigree inference are completed based on the constructed prediction model. The invention enables variety identification and pedigree inference that are highly accurate, fast, high-throughput, parallelizable, automated and transferable.

The embodiments of the present invention will be described in detail below with reference to the attached drawings so that those skilled in the art to which the present invention pertains can easily implement the present invention. This invention may be embodied in many different forms and is not limited to the embodiments described herein.

Fig. 1 shows a schematic flow chart of a variety identification method based on ensemble learning in an embodiment of the invention.

The method comprises the following steps:

Step S1: preprocessing molecular genetic marker data of all varieties in the variety database to obtain a feature data matrix.

In detail, the variety database has a plurality of variety samples; the feature data matrix includes: sample feature matrix of each sample corresponding to each variety.

In an embodiment, the sample feature matrix comprises a genotype feature matrix corresponding to each molecular genetic marker locus; the genotype feature matrix comprises feature values corresponding to the genotypes of the current molecular genetic marker locus; the types of genotypes include known types and unknown/undetected types; the feature value of a genotype of a known type is the number of that genotype contained in the sample; the feature value of a genotype of the unknown/undetected type is set according to a setting rule; and the setting rule includes: the feature value is a non-zero natural number that follows a specific distribution whose mean is a first threshold and whose maximum does not exceed a second threshold; if the feature value exceeds the second threshold, it is changed to the second threshold.

Specifically, preprocessing the molecular genetic marker data of the samples into a matrix comprises the following steps:

obtaining the molecular genetic marker data and converting it into matrix form, wherein each known polymorphism type of each independent molecular genetic marker locus is an independent feature; each matrix value indicates how many of that feature the sample contains, and 0 indicates that the sample does not contain the feature.

Because unknown and/or low-frequency polymorphism types exist at each molecular genetic marker locus, a feature representing the combination of unknown and/or low-frequency polymorphism types is added for each molecular genetic marker locus; its value is a non-zero natural number that follows a specific distribution X whose mean is the first threshold and whose maximum does not exceed m (the second threshold), and a value that exceeds m is replaced by m.

For example, the value follows a normal distribution with a mean of 0.01 and a standard deviation of 0.003, is non-zero, and does not exceed a maximum of 0.1.
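The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `encode_locus` and `other_feature_value`, the allele labels, and the choice to redraw non-positive draws are all hypothetical; the mean 0.01, standard deviation 0.003 and cap 0.1 are the example values given in the text.

```python
import random

def encode_locus(sample_alleles, known_types):
    """Count how many of each known polymorphism type a sample carries."""
    return [sample_alleles.count(t) for t in known_types]

def other_feature_value(rng, mean=0.01, sd=0.003, cap=0.1):
    """Draw the value of the 'unknown / low-frequency' feature.

    Assumption: non-positive draws are redrawn so the value stays non-zero.
    Values above the cap are replaced by the cap, as the text describes.
    """
    while True:
        value = rng.gauss(mean, sd)
        if value > 0:
            return min(value, cap)

# Hypothetical locus with three known allele types.
known = ["044", "048", "052"]
rng = random.Random(0)
row_heterozygous = encode_locus(["044", "052"], known) + [other_feature_value(rng)]
row_homozygous = encode_locus(["052", "052"], known) + [other_feature_value(rng)]
```

Each row holds one count per known type followed by the extra feature for unknown or low-frequency types, so every sample has the same row length at a given locus.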

Step S2: constructing and screening classifiers based on the feature data matrix and the variety to which each sample belongs.

In one embodiment, step S2 includes: constructing a classifier for each selected algorithm based on the feature data matrix and the variety to which each sample belongs. Specifically, the feature data matrix is used as the feature values, the variety to which each sample belongs is used as the target value, and algorithms capable of outputting a probability value for each class are selected for model training. One classifier is constructed per class of algorithm; each classifier is cross-validated to prevent overfitting and to determine its model parameters. Each classifier is then evaluated according to its prediction results to obtain a prediction evaluation result.

Based on the bidirectional selection rule, each classifier is used as a classifier to be screened, and one or more classifiers are screened according to the prediction evaluation result of each classifier.

For example, algorithms such as SVM, random forest, gradient boosting, extremely randomized trees, naive Bayes (multinomial, Bernoulli), LDA, logistic regression (L1, L2 regularization), XGBoost, nearest neighbor, and neural networks are selected to construct the classifiers.

In one embodiment, for the finally screened optimal combination of classifiers (algorithms), the probability with which each classifier predicts the correct variety is calculated, and each classifier's share of the sum of these correct-prediction probabilities is taken as its weighting coefficient.
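The weighting coefficients can be illustrated with a short sketch. It assumes that the "probability of predicting the correct variety" is summarized as a single number per classifier (for example, a mean over validation samples); that summary, the function name `weighting_coefficients` and the example values are assumptions made for illustration only.

```python
def weighting_coefficients(correct_prob):
    """Map each classifier to its share of the summed correct-variety probability."""
    total = sum(correct_prob.values())
    if total <= 0:
        raise ValueError("at least one classifier must have a positive probability")
    return {name: p / total for name, p in correct_prob.items()}

# Hypothetical per-classifier probabilities of predicting the correct variety.
weights = weighting_coefficients(
    {"svm": 0.6, "random_forest": 0.9, "naive_bayes": 0.5}
)
```

Because each coefficient is a share of the same total, the coefficients sum to one and a classifier that is more often right receives a larger weight.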

Step S3: constructing an integrated classifier for each molecular genetic marker locus based on the screened classifiers, so as to construct a prediction model.

In one embodiment, step S3 includes: constructing an integrated classifier for each molecular genetic marker locus based on the algorithm corresponding to each screened classifier and its weighting coefficient, wherein the integrated classifier comprises sub-classifiers constructed with the algorithms corresponding to the screened classifiers; the prediction result of the integrated model is obtained from the screened classifiers by weighted averaging with their weighting coefficients, or by stacking.

The integrated classifier corresponding to each molecular genetic marker locus is then determined as an integrated classifier used to construct the prediction model.
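The weighted-average variant of the integrated classifier might look like the sketch below for a single locus. The function name `weighted_average`, the classifier names and the probabilities are hypothetical, and the weights are assumed to sum to one; stacking is not shown.

```python
def weighted_average(probs_by_classifier, weights):
    """Combine per-variety probabilities from several sub-classifiers.

    probs_by_classifier: {classifier: {variety: probability}}
    weights: {classifier: weighting coefficient}, assumed to sum to 1.
    """
    varieties = set()
    for probs in probs_by_classifier.values():
        varieties.update(probs)
    return {
        v: sum(weights[c] * probs.get(v, 0.0)
               for c, probs in probs_by_classifier.items())
        for v in sorted(varieties)
    }

# Hypothetical outputs of two sub-classifiers for one sample at one locus.
locus_probs = weighted_average(
    {"svm": {"A": 0.7, "B": 0.3}, "random_forest": {"A": 0.9, "B": 0.1}},
    {"svm": 0.4, "random_forest": 0.6},
)
```

The result is one probability per variety for that locus, which is the kind of classification result the per-locus integrated classifier is described as outputting.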

In this embodiment, the prediction model includes:

the locus classification module, which comprises the integrated classifier constructed for each molecular genetic marker locus and is used for outputting, for a sample, a classification result for each variety according to the genotype feature matrix of the sample at each molecular genetic marker locus; if a classification result is zero, the result is assigned a non-zero constant smaller than one-half of s, where s is the total number of varieties in the variety database;

the fusion module, connected with the locus classification module, for multiplying together the classification results output by the integrated classifiers for each variety to obtain the L value of the sample for each variety;

the identification result output module, connected with the fusion module, for calculating the probability value of each variety from the L values of the sample so as to output the corresponding variety identification result.

Because the multi-locus results are multiplied together, the molecular genetic marker loci must be independent of each other and follow Mendel's laws of inheritance; that is, there is no genetic linkage between loci and no sex-linked inheritance.

Preferably, for each variety, the probability that the sample is predicted as that variety is the L value of that variety divided by the sum of the L values of all predicted varieties. According to the probability values predicted for the tested sample, it can be judged which pure variety the sample belongs to and how pure it is, or which varieties the sample is a cross of and in what proportions, thereby inferring its potential pedigree.
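The fusion and normalization described above can be sketched as follows. The function name `fuse_loci` and the example probabilities are hypothetical, and the default `zero_floor` is only a stand-in for the small non-zero constant that replaces a zero result; its value here is an assumption.

```python
def fuse_loci(per_locus_probs, zero_floor=1e-3):
    """Multiply per-locus probabilities for each variety, then normalize.

    per_locus_probs: list of {variety: probability}, one dict per locus.
    zero_floor: assumed small non-zero constant substituted for a zero result.
    """
    varieties = list(per_locus_probs[0])
    l_values = {}
    for v in varieties:
        product = 1.0
        for probs in per_locus_probs:
            p = probs[v]
            product *= p if p > 0 else zero_floor
        l_values[v] = product
    total = sum(l_values.values())
    return {v: l / total for v, l in l_values.items()}

# Two hypothetical loci and two varieties.
result = fuse_loci([{"A": 0.8, "B": 0.2}, {"A": 0.5, "B": 0.5}])
```

Replacing zeros before multiplying keeps a single locus from forcing a variety's L value to zero. With many loci the product can become very small, so summing logarithms instead of multiplying is a common alternative.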

In one embodiment, each molecular genetic marker locus selected can be modeled either individually as a single dataset and then integrated together, or all loci can be modeled together as a dataset.

To save cost, as few STR sites as possible are screened out, on the premise that the target varieties can still be accurately discriminated. The following embodiment explains the implementation and procedure.

In one embodiment, constructing an integrated classifier for each molecular genetic marker locus based on the screened classifiers comprises:

constructing an integrated classifier for each molecular genetic marker locus based on the algorithm corresponding to each screened classifier and its weighting coefficient, wherein the integrated classifier comprises sub-classifiers constructed with the algorithms corresponding to the screened classifiers;

based on the bidirectional selection rule, taking each integrated classifier as a classifier to be screened and screening out the integrated classifiers corresponding to several molecular genetic marker loci according to the prediction evaluation result of the integrated classifier corresponding to each molecular genetic marker locus, the screened integrated classifiers being the determined integrated classifiers used to construct the prediction model.

In this embodiment, the prediction model includes:

the locus classification module, which comprises the integrated classifier constructed for each screened molecular genetic marker locus and is used for outputting, for a sample, a classification result for each variety according to the genotype feature matrix of the sample at the corresponding molecular genetic marker locus; if a classification result is zero, the result is assigned a non-zero constant smaller than one-half of s, where s is the total number of varieties in the variety database;

the identification result output module, connected with the fusion module, for calculating the probability value of each variety from the L values of the sample so as to output the corresponding variety identification result.

In one embodiment, as shown in fig. 2, the bidirectional selection rule includes:

step 1: sorting the classifiers to be screened by prediction capability, as determined from their prediction evaluation results, to form a candidate model set;

step 2: selecting the two classifiers to be screened with the best prediction capability and integrating them into a preferred model set;

step 3: from the classifiers in the candidate model set that are not in the preferred model set and have not yet been selected, selecting the one with the best prediction capability and adding it to the current preferred model set to form a new model set;

step 4: comparing the prediction capability of the current preferred model set with that of the new model set; if the current preferred model set is better, the preferred model set is unchanged; otherwise, the new model set becomes the preferred model set;

step 5: comprising the following steps:

step 51: removing each classifier in turn from the current preferred model set to build a series of new classification integrated model sets;

step 52: selecting, among the classification integrated model sets not yet selected, the one with the best prediction capability;

step 53: comparing the prediction capability of the current preferred model set with that of the selected classification integrated model set; if the current preferred model set is better, the preferred model set is unchanged; otherwise, the selected classification integrated model set becomes the preferred model set;

step 54: judging whether all classification integrated model sets have been selected; if yes, proceeding to step 6; if not, returning to step 52, until all classification integrated model sets have been selected and the preferred model set remains unchanged;

step 6: judging whether all classifiers to be screened in the candidate model set have been selected; if yes, taking the classifiers in the current preferred model set as the screened classifiers; if not, returning to step 3 until all classifiers to be screened in the candidate model set have been selected, and then taking the classifiers in the current preferred model set as the screened classifiers.
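One possible reading of the bidirectional selection rule is sketched below. It is an interpretation, not the definitive procedure: the names `bidirectional_select`, `individual_score` and `ensemble_score` are hypothetical, the reduced sets are rebuilt whenever the preferred set changes, at least one classifier is always kept, and the toy scores exist only to exercise the code.

```python
def bidirectional_select(candidates, individual_score, ensemble_score):
    """Forward-add / backward-remove selection over candidate classifiers.

    individual_score(c) ranks single candidates; ensemble_score(members)
    rates a list of candidates combined together. Higher is better.
    """
    # Step 1: rank candidates by individual prediction capability.
    ranked = sorted(candidates, key=individual_score, reverse=True)
    # Step 2: the two best candidates form the initial preferred set.
    preferred = ranked[:2]
    best = ensemble_score(preferred)
    # Steps 3 and 6: offer every remaining candidate once, best first.
    for candidate in ranked[2:]:
        # Step 4: adopt the enlarged set unless the current set is better.
        trial = preferred + [candidate]
        trial_score = ensemble_score(trial)
        if not best > trial_score:
            preferred, best = trial, trial_score
        # Step 5: try removing members until the current set wins.
        while len(preferred) > 1:
            reduced = [preferred[:i] + preferred[i + 1:]
                       for i in range(len(preferred))]
            top = max(reduced, key=ensemble_score)
            top_score = ensemble_score(top)
            if best > top_score:
                break  # the best reduced set loses, so all reduced sets lose
            preferred, best = top, top_score
    return preferred

# Toy scores (hypothetical): "b" and "d" hurt the ensemble, "a" and "c" help.
solo = {"a": 9, "b": 8, "c": 7, "d": 6}
gain = {"a": 5, "b": -1, "c": 3, "d": -2}
selected = bidirectional_select(
    list(solo), solo.get, lambda members: sum(gain[m] for m in members)
)
```

In the toy run, "b" enters the initial pair because it ranks second on its own, and the removal pass drops it once "c" has been added, which is the behaviour the backward step is meant to provide.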

Step S4: based on the constructed prediction model, outputting a corresponding variety identification result according to the sample feature matrix obtained by preprocessing the molecular genetic marker data of the sample to be tested.

Specifically, the molecular genetic marker data of the sample to be tested is preprocessed as in step S1 to obtain the corresponding sample feature matrix, and the matrix is input into the prediction model to obtain the prediction result of the sample for each variety. The results (probability values) obtained at all loci are multiplied together to obtain a final value L; for each variety, the probability that the sample is predicted as that variety is the L value of that variety divided by the sum of the L values of all predicted varieties. The variety to which the sample belongs is thereby identified. According to the probability values predicted for the tested sample, it can be judged which pure variety the sample belongs to and how pure it is, or which varieties the sample is a cross of and in what proportions, thereby inferring its potential pedigree.

In an embodiment, the method further comprises:

filtering and optimizing the variety database in the following modes:

if a sample in the variety database is misclassified by more than half of the constructed classifiers, or the classification result predicted for it by the constructed integrated classifier is smaller than a set threshold, the sample is rejected; specifically, for a single sample of the database, if it is misclassified by more than half (including exactly half) of the classifiers, or if the probability with which the integrated classifier predicts its correct variety is less than a set threshold, the sample is culled.

If the number of samples of a variety is less than the sample threshold, or more than half of its samples are marked for culling, all samples of the variety are culled. Specifically, for each variety in the database, if the number of samples of the variety is less than 4, or more than half of its samples are marked for culling, the variety is culled.
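The two culling rules can be illustrated with a short sketch. The record layout (keys `variety`, `misclassified`, `n_classifiers`, `p_correct`), the function name `filter_database` and the value of `prob_threshold` are all hypothetical; the text does not state the probability threshold, and it is assumed here that the per-variety sample count is taken before individual samples are culled.

```python
from collections import Counter

def filter_database(samples, prob_threshold, min_samples=4):
    """Apply the sample-level and variety-level culling rules.

    samples: list of dicts with hypothetical keys
      'variety'       - variety label of the sample
      'misclassified' - number of classifiers that misclassified it
      'n_classifiers' - number of classifiers constructed
      'p_correct'     - ensemble probability of the correct variety
    """
    def is_culled(s):
        # Misclassified by half or more of the classifiers,
        # or ensemble probability of the correct variety below the threshold.
        return (2 * s["misclassified"] >= s["n_classifiers"]
                or s["p_correct"] < prob_threshold)

    total = Counter(s["variety"] for s in samples)
    culled = Counter(s["variety"] for s in samples if is_culled(s))
    # A variety is dropped if it has too few samples (assumed: counted before
    # culling) or if more than half of its samples are marked for culling.
    dropped = {v for v in total
               if total[v] < min_samples or 2 * culled[v] > total[v]}
    return [s for s in samples
            if s["variety"] not in dropped and not is_culled(s)]

def _mk(variety, misclassified, p_correct):
    # Helper for the hypothetical toy data below (4 classifiers assumed).
    return {"variety": variety, "misclassified": misclassified,
            "n_classifiers": 4, "p_correct": p_correct}

toy_db = ([_mk("A", 0, 0.9)] * 4 + [_mk("A", 2, 0.9)]      # one A sample culled
          + [_mk("B", 0, 0.9)] * 3                         # B has fewer than 4 samples
          + [_mk("C", 0, 0.1)] * 3 + [_mk("C", 0, 0.9)])   # most C samples culled
kept = filter_database(toy_db, prob_threshold=0.5)
```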

In order to better describe the variety identification method based on ensemble learning, the following specific embodiments are provided for explanation;

Example 1: a variety identification method based on ensemble learning. Fig. 3 is a flow chart of the variety identification method based on ensemble learning in the present embodiment.

The variety database used to construct the model in this example is an STR (microsatellite) database (haplotype) of purebred pet dogs, with 15 STR sites (246 genotypes) and 123 breeds (1772 samples).

Step 11: preprocessing of molecular genetic marker data.

To assist those skilled in the art in better understanding the technical scheme of this step, the following description is given with reference to table 1. Table 1 shows part of the matrix obtained by preprocessing STR site STR-1 of the purebred pet dog database. STR site STR-1 has 9 genotypes in total (G1, G2, G3, G4, G5, G6, G7, G8, G9); the 29 rows of the table represent 29 samples, and each entry gives the number of occurrences of the corresponding genotype in that sample. "Other" denotes other low-frequency or unknown genotypes, whose values are non-zero natural numbers conforming to a normal distribution with mean 0.01 and standard deviation 0.003 and with a maximum not exceeding 0.1. For example, the genotype of sample 0 is 044052, and the genotypes of sample 1 are all 052.

Table 1: STR data preprocessing
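Since the table itself is not reproduced here, the encoding it illustrates can be sketched as follows. This is only a sketch under stated assumptions: the function name `encode_site`, the representation of a genotype as a pair of allele strings, and the choice to give every sample an "other" value are hypothetical; the "other" values are treated as small positive real numbers drawn from the stated normal distribution and clipped to the stated maximum.

```python
import numpy as np

def encode_site(samples, known_genotypes, rng, mean=0.01, sd=0.003, cap=0.1):
    """Build the feature matrix for one STR site.

    samples: list of allele tuples, e.g. ("044", "052") (hypothetical layout).
    known_genotypes: list of known genotype labels for this site.
    Returns a matrix with one count column per known genotype plus a final
    'other' column for low-frequency or unknown genotypes.
    """
    column = {g: j for j, g in enumerate(known_genotypes)}
    features = np.zeros((len(samples), len(known_genotypes) + 1))
    for i, alleles in enumerate(samples):
        for allele in alleles:
            if allele in column:
                features[i, column[allele]] += 1  # count occurrences
    # 'other' column: draws from N(mean, sd), kept strictly positive and
    # clipped so that no value exceeds the cap (assumed handling).
    other = rng.normal(mean, sd, size=len(samples))
    features[:, -1] = np.clip(other, np.finfo(float).tiny, cap)
    return features

rng = np.random.default_rng(0)
X = encode_site([("044", "052"), ("052", "052")], ["044", "052"], rng)
```

In this toy call the first sample carries one copy each of 044 and 052, and the second carries two copies of 052, mirroring the sample 0 and sample 1 description above.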

Step 12: selection of a classifier (algorithm).

Step 11 yields a 1772×264 matrix, which is used as the feature values, with the varieties to which the 1772 samples belong as the target values. Algorithms such as SVM, random forest, gradient boosting, extremely randomized trees, naive Bayes (multinomial, Bernoulli), LDA, logistic regression (L1, L2 regularization), XGBoost, nearest neighbor and neural network are each used to construct a classifier, forming a classifier model set. As shown in fig. 4, screening is performed according to the bidirectional selection rule. The sub-models included in model a are the finally screened classifiers (algorithms). The optimal classifier (algorithm) combination obtained is SVM, LDA and Bernoulli naive Bayes. Their sums of probabilities of predicting the correct variety are 774.473509, 1589.794006 and 1700.910500 respectively, and each classifier's share of the total is 0.1905140, 0.3910761 and 0.4184098 respectively; that is, the weighting coefficients are 0.1905140, 0.3910761 and 0.4184098 respectively.
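The weighting coefficients quoted above follow directly from the three probability sums, as the short check below shows. The dictionary keys are illustrative labels only; the numbers are those given in the text.

```python
# Sums of the probabilities of predicting the correct variety (from the text).
prob_sums = {
    "SVM": 774.473509,
    "LDA": 1589.794006,
    "BernoulliNB": 1700.910500,
}

# Each weighting coefficient is the classifier's share of the total.
total = sum(prob_sums.values())
weights = {name: value / total for name, value in prob_sums.items()}

for name, weight in weights.items():
    print(f"{name}: {weight:.7f}")
```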

Step 13: construction of the prediction model.

As described in step 12, an integrated classifier is constructed for each of the 15 STR sites (each selected algorithm constructs one sub-classifier). The prediction result of the integrated model is a weighted average, with the weighting coefficients obtained in step 12. Any zero value among the results (probability values) of a site is replaced with 5.643341e-08. For each of the 123 varieties, the results (probability values) of the 15 sites are multiplied together to obtain a final value L.
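The per-site weighted average and the zero replacement can be sketched as below. The function name `site_ensemble` and the toy probabilities are hypothetical; the sub-classifier outputs are assumed to be already available as an array with one row per sub-classifier and one column per variety.

```python
import numpy as np

ZERO_REPLACEMENT = 5.643341e-08  # value given in the example for zero results

def site_ensemble(sub_probs, weights, zero_value=ZERO_REPLACEMENT):
    """Combine the sub-classifier outputs of one site.

    sub_probs: array (n_sub_classifiers, n_varieties) of probability values.
    weights:   array (n_sub_classifiers,) of weighting coefficients.
    """
    sub_probs = np.asarray(sub_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Weighted average over sub-classifiers, one value per variety.
    averaged = weights @ sub_probs
    # Replace exact zeros so that a single site cannot zero out the product L.
    return np.where(averaged == 0.0, zero_value, averaged)

# Hypothetical toy input: 3 sub-classifiers, 3 varieties.
toy_sub_probs = [[1.0, 0.0, 0.0],
                 [0.5, 0.5, 0.0],
                 [0.0, 1.0, 0.0]]
site_result = site_ensemble(toy_sub_probs, [0.2, 0.3, 0.5])
```

Replacing zeros matters because the final value L is a product over sites: without it, one site assigning probability zero to a variety would eliminate that variety regardless of the other 14 sites.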

Step 14: variety identification.

The molecular genetic marker data of the sample to be tested are preprocessed according to step 11 and input into the integrated model, giving prediction results for the 123 varieties, as shown in fig. 5. The sample is identified as GoldenRetriever, with a degree of purity of 95.83%.

Example 2: an optimization method of STR locus.

This example mainly concerns distinguishing 3 dog breeds: JackRussellTerrier, australianShepherd and ratTerrier. With 15 STR sites (73 samples), the prediction accuracy is 73/73. To save cost, as few STR sites as possible are screened out while still ensuring accurate discrimination of the target varieties.

For the 15 STR sites, an ensemble learning model is built for each site as a sub-model, and screening is performed according to the bidirectional selection rule, as shown in fig. 4. The site sub-models contained in model a correspond to the finally screened sites. In the end, a minimal set of 6 sites is obtained, whose accuracy in predicting the correct variety is 73/73.

As shown in table 2, the predict column gives the number of the 73 samples whose variety is predicted correctly, and the proba column gives the sum of the probability values with which the model predicts the correct variety for the 73 samples; 15STRs and 6STRs denote the prediction results using all 15 STR sites and the 6 screened STR sites, respectively. 6STRs-x (x is a number) denotes the prediction result of 6 STR sites randomly selected from the 15 STR sites. Table 2 shows that the ability of the 6 screened STR sites to discriminate JackRussellTerrier, australianShepherd and ratTerrier is similar to that of the 15 STR sites and clearly better than that of the random selections.

Table 2: Comparison of 6-STR and 15-STR prediction results

name	predict	proba
15STRs	73	72.99922092
6STRs	73	72.24644694
6STRs-1	64	61.654921
6STRs-2	61	58.108301
6STRs-3	62	60.017926
6STRs-4	61	57.855817
6STRs-5	68	62.808373
6STRs-6	63	58.412947
6STRs-7	67	65.870364
6STRs-8	69	63.265737
6STRs-9	69	66.151941
6STRs-10	63	60.046227
6STRs-11	62	58.025687
6STRs-12	63	60.824249
6STRs-13	69	65.842423
6STRs-14	59	54.627643
6STRs-15	60	59.257195
6STRs-16	69	62.773011
6STRs-17	60	57.787691
6STRs-18	60	58.768493
6STRs-19	61	55.49531
6STRs-20	60	55.848688

With the method of the above embodiments, the purebred database can be optimized and reliable molecular genetic markers can be screened during model construction, so that the model has better prediction and generalization capability while the amount of computation is reduced and the speed is improved; the model constructed by the method enables high-accuracy, high-speed, high-throughput, parallelizable, automated and transferable variety identification and strain inference.

Similar to the principles of the embodiments described above, the present invention provides an ensemble learning-based variety identification system.

Specific embodiments are provided below with reference to the accompanying drawings:

fig. 6 shows a schematic structural diagram of a variety identification system based on ensemble learning in an embodiment of the present invention.

The system comprises:

the data preprocessing module 61 is configured to preprocess the molecular genetic marker data of each sample of all varieties in the variety database to obtain a feature data matrix; wherein the feature data matrix comprises: sample feature matrix of each sample corresponding to each variety;

the classifier selecting module 62 is connected with the data preprocessing module 61 and is used for constructing and screening a classifier based on the characteristic data matrix and varieties to which each sample belongs;

The model construction module 63 is connected with the classifier selection module 62 and is used for respectively constructing an integrated classifier for each molecular genetic marker locus based on the screened classifiers so as to construct a prediction model;

the variety identification module 64 is connected to the model construction module 63, and is configured to output a corresponding variety identification result according to a corresponding sample feature matrix obtained by preprocessing molecular genetic marker data of the sample to be tested based on the constructed prediction model.

It should be noted that the division of the modules in the system embodiment of fig. 6 is merely a division of logical functions; in actual implementation the modules may be fully or partially integrated into one physical entity or may be physically separate. These modules may all be implemented as software invoked by a processing element, or all in hardware; alternatively, some modules may be implemented as software invoked by a processing element and the others in hardware.

Since the implementation principle of the variety identification system based on ensemble learning has been described in the foregoing embodiments, a detailed description is omitted here.

In one embodiment, the classifier selecting module 62 is configured to construct a classifier for each selected algorithm based on the feature data matrix and the variety to which each sample belongs; based on the bidirectional selection rule, each classifier is used as a classifier to be screened, and one or more classifiers are screened according to the prediction evaluation result of each classifier.

In an embodiment, the model building module 63 is configured to build an integrated classifier for each molecular genetic marker locus based on the algorithm corresponding to each classifier of the screening and the corresponding weighting coefficient; wherein the integrated classifier comprises: sub-classifiers constructed respectively corresponding to the algorithms corresponding to the selected classifiers; and determining the integrated classifier corresponding to each molecular genetic marker locus as an integrated classifier for constructing a prediction model so as to construct the prediction model.

In an embodiment, the model building module 63 is configured to build an integrated classifier for each molecular genetic marker locus based on the algorithm corresponding to each classifier of the screening and the corresponding weighting coefficient; wherein the integrated classifier comprises: sub-classifiers constructed respectively corresponding to the algorithms corresponding to the selected classifiers; based on a bidirectional selection rule, each integrated classifier is used as a classifier to be screened, and the integrated classifier corresponding to a plurality of molecular genetic marker loci is screened according to the prediction evaluation result of the integrated classifier corresponding to each molecular genetic marker locus to be used as a determined integrated classifier for constructing a prediction model so as to construct the prediction model.

In an embodiment, the predictive model includes: the locus classification module comprises all the determined integrated classifiers and is used for outputting classification results of all varieties of corresponding samples according to genotype feature matrixes corresponding to the corresponding molecular genetic marker loci of the samples; if the corresponding classification result is zero, giving the result a non-zero value constant smaller than one-half of s; wherein s is the total number of varieties in the variety database; the fusion module is connected with the locus classification module and is used for carrying out accumulation multiplication on classification results of the samples corresponding to the varieties output by the integrated classifiers to obtain L values of the samples corresponding to the varieties; and the identification result output module is connected with the fusion module and is used for calculating the probability value of each corresponding variety according to the L value of each variety of the corresponding sample so as to output the corresponding variety identification result.

In an embodiment, the bidirectional selection rule includes:

Step 1: sorting the classifiers to be screened by prediction capability, as determined from their prediction evaluation results, to form a model set to be selected;

Step 2: selecting the two classifiers to be screened with the optimal prediction capability and integrating them into a preferred model set;

Step 3: from the model set to be selected, excluding all classifiers already in the preferred model set, selecting the not-yet-selected classifier to be screened with the optimal prediction capability and adding it to the current preferred model set to integrate a new model set;

Step 4: comparing the prediction capabilities of the current preferred model set and the new model set; if the prediction capability of the current preferred model set is better than that of the new model set, the preferred model set remains unchanged; otherwise, the new model set becomes the preferred model set;

Step 5: removing each classifier to be screened from the current preferred model set in turn to construct a series of new classification integrated model sets, and selecting, from those not yet selected, the classification integrated model set with the optimal prediction capability; comparing the prediction capability of the current preferred model set with that of the selected classification integrated model set; if the prediction capability of the current preferred model set is better, the preferred model set remains unchanged; otherwise, the selected classification integrated model set becomes the preferred model set; this is repeated until all the classification integrated model sets have been selected and the preferred model set remains unchanged;

Step 6: judging whether all the classifiers to be screened in the model set to be selected have been selected; if yes, taking the classifiers in the current preferred model set as the screened classifiers; if not, returning to step 3 until all the classifiers to be screened in the model set to be selected have been selected, and then taking each classifier in the current preferred model set as a screened classifier.
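The six steps amount to a forward-backward stepwise search, which can be sketched as below. This is one reading of the rule, not a definitive implementation: the function name `bidirectional_select`, the `score` callable (higher is better, for instance the summed probability of the correct variety) and the toy coverage data are all hypothetical, and ties are resolved as the text literally states, keeping the current set only when it is strictly better.

```python
def bidirectional_select(candidates, score):
    """Forward-backward selection over candidate models.

    candidates: list of hashable model identifiers.
    score: hypothetical callable taking a frozenset of identifiers and
           returning a number (higher means better prediction capability).
    """
    # Step 1: rank the candidates by individual prediction capability.
    ranked = sorted(candidates, key=lambda c: score(frozenset([c])), reverse=True)
    # Step 2: start from the two best candidates.
    preferred = list(ranked[:2])
    # Steps 3 and 6: try each remaining candidate once, best first.
    for candidate in ranked[2:]:
        trial = preferred + [candidate]
        # Step 4: keep the current set only if it is strictly better.
        if not score(frozenset(preferred)) > score(frozenset(trial)):
            preferred = trial
        # Step 5: backward pass, dropping one member at a time.
        while len(preferred) > 1:
            reduced = [[m for m in preferred if m != drop] for drop in preferred]
            best = max(reduced, key=lambda s: score(frozenset(s)))
            if score(frozenset(preferred)) > score(frozenset(best)):
                break  # no removal helps; preferred set unchanged
            preferred = best
    return preferred

# Hypothetical toy score: number of samples covered, minus a size penalty.
coverage = {"a": {1, 2, 3}, "b": {1, 2}, "c": {4, 5}, "d": {5}}

def toy_score(models):
    covered = set().union(*(coverage[m] for m in models))
    return len(covered) - 0.1 * len(models)

selected = bidirectional_select(list("abcd"), toy_score)
```

In the toy run, "b" enters the initial pair but is removed in the backward pass once "c" has been added, which is the behavior that distinguishes bidirectional selection from purely forward selection.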

In one embodiment, the identification system is further configured to filter and optimize the variety database as follows: if a sample in the variety database is misclassified by more than half of the constructed classifiers, or the classification result predicted for it by the constructed integrated classifier is smaller than a set threshold, the sample is rejected; if the number of samples of a variety is less than the sample threshold, or more than half of its samples are marked for culling, all samples of the variety are culled.

Fig. 7 shows a schematic structural diagram of the variety identification terminal 70 based on ensemble learning in the embodiment of the present invention.

The ensemble learning-based variety identification terminal 70 includes: a memory 71 and a processor 72, the memory 71 for storing a computer program; the processor 72 runs the computer program to implement the ensemble learning-based variety identification method as described in fig. 1.

Alternatively, the number of the memories 71 may be one or more, and the number of the processors 72 may be one or more, and one is taken as an example in fig. 7.

Optionally, the processor 72 in the ensemble learning-based variety identification terminal 70 loads one or more instructions corresponding to the process of the application program into the memory 71 according to the steps described in fig. 1, and the processor 72 executes the application program stored in the memory 71, thereby implementing the various functions of the ensemble learning-based variety identification method described in fig. 1.

Optionally, the memory 71 may include, but is not limited to, high-speed random access memory and non-volatile memory, such as one or more disk storage devices, flash memory devices, or other non-volatile solid-state storage devices; the processor 72 may include, but is not limited to, a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component.

Alternatively, the processor 72 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component.

The present invention also provides a computer-readable storage medium storing a computer program that, when run, implements the ensemble learning-based variety identification method shown in fig. 1. The computer-readable storage medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disk-read only memories), magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read only memories), EEPROMs (electrically erasable programmable read only memories), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing machine-executable instructions. The computer readable storage medium may be an article of manufacture that is not accessed by a computer device or may be a component used by an accessed computer device.

In summary, in the variety identification method, system and terminal based on ensemble learning, the feature data matrix is obtained by preprocessing the molecular genetic marker data of each sample of all varieties in the variety database, the classifiers are constructed and screened based on the feature data matrix and the varieties to which each sample belongs, and the prediction model is constructed based on each screened classifier; finally, variety identification and pedigree inference are completed based on the constructed prediction model. The invention enables variety identification and strain inference with high accuracy, high speed, high throughput, parallelization, automation and transferability. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.

The above embodiments merely illustrate the principles of the present invention and its effects, and are not intended to limit the invention. Those skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. It is therefore intended that all equivalent modifications and changes made by those skilled in the art without departing from the spirit and technical ideas of the present invention shall be covered by the appended claims.

Claims

1. A variety identification method based on ensemble learning, the method comprising:

preprocessing molecular genetic marker data of all varieties in the variety database to obtain a characteristic data matrix; wherein the feature data matrix comprises: sample feature matrix of each sample corresponding to each variety;

constructing and screening a classifier based on the characteristic data matrix and varieties to which each sample belongs;

respectively constructing an integrated classifier for each molecular genetic marker locus based on each screened classifier so as to construct a prediction model;

based on the constructed prediction model, preprocessing molecular genetic marker data of the sample to be detected, and outputting a corresponding variety identification result according to a sample feature matrix obtained by preprocessing.

2. The ensemble learning-based variety identification method as claimed in claim 1, wherein the sample feature matrix includes:

genotype characteristic matrix corresponding to each molecular genetic marker locus; wherein the genotype characteristics matrix comprises: characteristic values of various genotypes corresponding to the current molecular genetic marker locus;

wherein the types of genotypes include known types and unknown/undetected types; wherein the characteristic value of a genotype of a known type is the number of occurrences of the corresponding genotype; the characteristic value of a genotype of the unknown/undetected type is set based on a setting rule;

and wherein the setting rule includes: the characteristic value is a non-zero natural number that conforms to a specific distribution with the first threshold as its mean and whose maximum does not exceed the second threshold; and if the characteristic value exceeds the second threshold, the characteristic value is changed to the second threshold.

3. The ensemble learning-based variety identification method as set forth in claim 1, wherein constructing and screening a classifier based on the feature data matrix and the variety to which each sample belongs includes:

based on the characteristic data matrix and varieties to which each sample belongs, respectively constructing classifiers for each selected algorithm;

4. The ensemble-based variety identification method as claimed in claim 3, wherein said each classifier based on screening constructs an ensemble classifier for each molecular genetic marker locus separately, so as to construct a prediction model comprising:

5. The ensemble-based variety identification method as claimed in claim 3, wherein said each classifier based on screening constructs an ensemble classifier for each molecular genetic marker locus separately, so as to construct a prediction model comprising:

6. The ensemble-based variety identification method as claimed in claim 4 or 5, wherein said prediction model includes:

the locus classification module comprises all the determined integrated classifiers and is used for outputting classification results of all varieties of corresponding samples according to genotype feature matrixes corresponding to the corresponding molecular genetic marker loci of the samples; if the value of the corresponding classification result is zero, giving the result as a non-zero constant smaller than one-half of s; wherein s is the total number of varieties in the variety database;

7. The ensemble-based variety identification method as claimed in claim 4 or 5, wherein said bidirectional selection rule includes:

step 5: sequentially removing all classifiers to be screened from the current preferred model set, reconstructing a series of new classification integrated model sets, and selecting the classification integrated model set with optimal prediction capability and unselected classification integrated model sets; comparing the prediction capacity of the current preferred model set and the selected classification integrated model set with the optimal prediction capacity; if the prediction capability of the current preferred model set is better than that of the classification integrated model set, the preferred model set is unchanged; otherwise, the classification integrated model set is used as a preferred model set; until all the classification integrated model sets are selected and the preferred model set remains unchanged;

8. The ensemble-based variety identification method as claimed in claim 1, wherein said method further comprises:

filtering and optimizing the variety database in the following modes:

if a sample in the variety database is misclassified by more than half of the constructed classifiers, or the classification result predicted for it by the constructed integrated classifier is smaller than a set threshold, rejecting the sample;

if the number of samples of a variety is less than the sample threshold, or more than half of its samples are marked for culling, culling all samples of the variety.

9. An ensemble learning-based variety identification system, the system comprising:

the data preprocessing module is used for preprocessing the molecular genetic marker data of all the samples of all the varieties in the variety database to obtain a characteristic data matrix; wherein the feature data matrix comprises: sample feature matrix of each sample corresponding to each variety;

The classifier selecting module is connected with the data preprocessing module and is used for constructing and screening a classifier based on the characteristic data matrix and varieties to which each sample belongs;

the model construction module is connected with the classifier selection module and is used for respectively constructing an integrated classifier for each molecular genetic marker locus based on the screened classifiers so as to construct a prediction model;

and the variety identification module is connected with the model construction module and is used for preprocessing molecular genetic marker data of the sample to be detected based on the constructed prediction model and outputting a corresponding variety identification result according to a sample feature matrix obtained by preprocessing.

10. A variety identification terminal based on ensemble learning, comprising: one or more memories and one or more processors;

the one or more memories are used for storing computer programs;

the one or more processors being connected to the memory for running the computer program to perform the method of any one of claims 1 to 8.