CN111191884A - Method for evaluating sample set partition quality based on data set distance

Method for evaluating sample set partition quality based on data set distance

Info

Publication number
CN111191884A
CN111191884A (application CN201911300236.6A)
Authority
CN
China
Prior art keywords
sample
distance
samples
calculating
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911300236.6A
Other languages
Chinese (zh)
Inventor
林兆洲
王大仟
张金霞
关竹君
姜迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hospital of Traditional Chinese Medicine Affiliated to Capital University of Medicine Sciences
Original Assignee
Beijing Hospital of Traditional Chinese Medicine Affiliated to Capital University of Medicine Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Hospital of Traditional Chinese Medicine Affiliated to Capital University of Medicine Sciences
Priority to CN201911300236.6A
Publication of CN111191884A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395 Quality analysis or management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193 Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Economics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a method for evaluating sample set partition quality based on data set distance. It overcomes the difficulty of quantifying conventional error analysis by building on the basic assumption that the training set and the test set must be mutually independent and drawn from the same distribution: the mean and variance of a sample set are estimated by decomposing the inter-sample distance matrix, and the distance between the distributions of the training set and the test set is computed. A probability distribution is then estimated from the distances obtained by repeated random splitting, the probability of a given partition is calculated, and this exact quantitative index is used to evaluate the quality of a data partition or the suitability of a partitioning method for specific data. Simple and practical, the method provides an assessment of the effectiveness of sample set partitioning methods, helping researchers in the biomedical field select an appropriate data partitioning method and determine the true generalization performance of a modeling method.

Description

Method for evaluating sample set partition quality based on data set distance
Technical Field
The invention relates to the technical field of biomedicine, in particular to a method for evaluating sample set partition quality based on data set distance.
Background
Sample division plays an important role in the biomedical field. Its objective is to generate a test set with which to estimate the generalization ability of a model. A model is a formal expression of the relations in the data, through which unknown samples can be predicted. The training error reflects the learning ability of the modeling method, while predictive ability on unknown samples is the real goal of modeling. The data set used for model training is therefore usually required to be large enough to cover the range of future applications. When estimating the generalization ability of a model, the test set must remain consistent with the training range of the model; that is, the two sets must be independent and identically distributed. For real data, however, the distribution is difficult to estimate, so the prediction performance on the test set, and its comparison with performance on the training set, are commonly used to characterize the quality of the sample set division indirectly. But a drop in test-set prediction performance cannot be attributed entirely to a distribution difference between the training and test sets, and the magnitude of the error varies considerably across data sets. A method that directly characterizes the division quality of a sample set is therefore needed, so that the quality can be evaluated objectively.
The division of a sample set is usually based on the characteristics of the actual data, the analysis requirements, and the mechanism of the available methods. Many methods are commonly used for sample set partitioning (data splitting), including random sampling (RS), the Kennard-Stone (KS) method, the SPXY method, the DUPLEX method, and others. The choice among them is largely empirical and there is no uniform quantitative index, so the suitability of a partitioning method is generally evaluated or optimized according to the errors (statistics) of the training and test sets generated by different partitioning methods, i.e., by error analysis.
Therefore, how to establish an objective quality evaluation method and assist in selecting an appropriate partitioning method is an urgent problem for practitioners in the field.
Disclosure of Invention
In view of the above problems, the present invention provides a method for evaluating sample set partition quality based on data set distance, which at least solves some of the above technical problems, and can achieve the purpose of evaluating sample partition quality.
The embodiment of the invention provides a method for evaluating sample set partition quality based on data set distance, which comprises the following steps:
1) according to a sample division method, dividing a sample set into a first training set and a first test set, two independent, non-overlapping sample subsets; the sample division method does not include the random division method;
2) calculating the distance between the first training set and the first test set using a KL divergence method in a reproducing kernel Hilbert space;
3) dividing the data set into a second training set and a second test set using the random division method;
4) calculating the distance between the second training set and the second test set using a KL divergence method in a reproducing kernel Hilbert space;
5) repeating steps 3) and 4) a preset number of times to obtain the probability distribution of the data set distance under random division;
6) calculating the probability of the distance from step 2) under the probability distribution from step 5), and taking the result P as the evaluation index of division quality; a smaller P indicates a higher division quality for the chosen method (e.g., the KS method or the SPXY method).
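The six steps above can be sketched end to end in Python. This is an illustrative outline only: the function names are assumptions, and the RKHS KL-divergence distance of steps 2) and 4) is replaced by a simple placeholder (the Euclidean distance between set means), with the empirical left-tail frequency standing in for the kernel-density probability of step 6).

```python
import numpy as np

def set_distance(a, b):
    # Placeholder for the patent's RKHS KL-divergence distance:
    # here simply the Euclidean distance between the set means (an assumption).
    return float(np.linalg.norm(np.mean(a, axis=0) - np.mean(b, axis=0)))

def partition_probability(X, train_idx, n_repeats=200, seed=0):
    """Steps 1)-6) in outline: compare a given split's distance with the
    distribution of distances from repeated random splits."""
    X = np.asarray(X, dtype=float)
    mask = np.zeros(len(X), dtype=bool)
    mask[train_idx] = True
    d_obs = set_distance(X[mask], X[~mask])        # step 2)
    rng = np.random.default_rng(seed)
    n_train = int(mask.sum())
    rand_d = []
    for _ in range(n_repeats):                     # steps 3)-5)
        perm = rng.permutation(len(X))
        rand_d.append(set_distance(X[perm[:n_train]], X[perm[n_train:]]))
    # Step 6): empirical probability that a random split's distance is at most
    # the observed one; a small value means the split is unusually compact
    # (this reading of the patent's P is an assumption).
    return float(np.mean(np.asarray(rand_d) <= d_obs))
```

A split whose distance sits low in the random-split distribution yields a small P and is preferred.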
Further, the step 1) includes:
1.1) denoting the first training set as Tr, initializing it as an empty set, and placing all samples into a candidate sample set denoted Tr_cand;
1.2) calculating the Euclidean distances between samples in Tr_cand, moving the two samples with the largest Euclidean distance into the first training set Tr, and deleting them from Tr_cand;
1.3) for each remaining sample in Tr_cand, calculating its distance to each selected sample in Tr and taking the minimum of these distances; moving the sample whose minimum distance is largest into Tr and deleting it from Tr_cand;
1.4) repeating step 1.3) until the number of samples in Tr reaches the set amount, and taking the remaining samples as the first test set Te.
Further, the step 2) comprises:
calculating the similarity between the samples in the first training set Tr and the first test set Te; the similarity includes: the similarity k(Tr, Tr) among samples in Tr, the similarity k(Te, Te) among samples in Te, and the similarity k(Tr, Te) between samples of Tr and Te;
the kernel function for the similarity calculation is a polynomial kernel or a radial basis kernel;
arranging all the sample similarities into a Gram matrix;
calculating a weight vector for each similarity matrix and calculating a centering matrix;
centering the Gram matrix, and defining the distance between the training set and the test set as
D(Tr, Te) [formula shown only as an image in the source],
where the first quantity [image] approximates the covariance of the projection vectors of sample set j on mean i and mean k, with i, j, k ∈ {1, 2}, 1 denoting Tr and 2 denoting Te, and the second quantity [image] is the covariance of the reconstructed Gram matrix.
Further, the step 3) comprises:
determining the sample size K of the data set and generating a non-repeating random number sequence of length K with maximum value K (i.e., a random permutation of 1..K);
taking the first n values of the random sequence as the indices of the training set, extracting the corresponding samples to form the training set, and forming the test set from the remaining samples.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
the method for evaluating the sample set division quality based on the data set distance can overcome the defects of quantization and difficult evaluation of the conventional error analysis, tightly grasps the basic assumption that the training set and the test set need to be mutually independent and come from the same distribution, estimates the mean value and the variance of the sample set by decomposing the distance matrix between the samples, and calculates the distance between the two distributions of the training set and the test set. And performing probability distribution estimation by using distance distribution obtained by random sampling, calculating the probability of different partitions, and evaluating the quality of data partition or the adaptability of the partition method to specific data by using an exact quantization index. On the basis of simplicity and practicality, the method provides the evaluation of the effectiveness of the sample set partitioning method, and provides a suitable method for helping researchers in the biomedical field to select a proper data partitioning method and determining the real generalization performance of the modeling method.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
fig. 1 is a flowchart of a method for evaluating sample set partition quality based on data set distance according to an embodiment of the present invention;
fig. 2 is a process diagram of step 1) provided in the embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Referring to fig. 1, a method for evaluating sample set partition quality based on data set distance according to an embodiment of the present invention includes:
1) according to a sample division method, dividing a sample set into a first training set and a first test set, two independent, non-overlapping sample subsets; the sample division method does not include the random division method;
2) calculating the distance between the first training set and the first test set using a KL divergence method in a reproducing kernel Hilbert space;
3) dividing the data set into a second training set and a second test set using the random division method;
4) calculating the distance between the second training set and the second test set using a KL divergence method in a reproducing kernel Hilbert space;
5) repeating steps 3) and 4) a preset number of times to obtain the probability distribution of the data set distance under random division;
6) calculating the probability of the distance from step 2) under the probability distribution from step 5), and taking the result P as the evaluation index of division quality; a smaller P indicates a higher sample division quality.
To avoid ambiguity, the qualifiers "first" and "second" are prepended to the names of the training and test sets in the steps above.
The above steps are described in detail below:
In step 1), the sample division method may be, for example, the KS method, the SPXY method, OS, or DUPLEX. As shown in fig. 2, the sample division includes:
1.1) denoting the first training set as Tr, initializing it as an empty set, and placing all samples into a candidate sample set denoted Tr_cand;
1.2) calculating the Euclidean distances between samples in Tr_cand, moving the two samples with the largest Euclidean distance into the first training set Tr, and deleting them from Tr_cand;
1.3) for each remaining sample in Tr_cand, calculating its distance to each selected sample in Tr and taking the minimum of these distances; moving the sample whose minimum distance is largest into Tr and deleting it from Tr_cand;
1.4) repeating step 1.3) until the number of samples in Tr reaches the set amount, and taking the remaining samples as the first test set Te.
In this embodiment, let the training set be Tr and initialize it as an empty set; put all samples into the candidate sample set, denoted Tr_cand. First, calculate the Euclidean distances between the samples in Tr_cand, move the two samples with the largest Euclidean distance into the training set Tr, and delete them from Tr_cand. Then, for each remaining sample in Tr_cand, calculate its distance to each selected sample in Tr and take the minimum of these distances; move the sample whose minimum distance is largest into Tr and delete it from Tr_cand. Repeat this selection until the number of samples in Tr reaches the set amount, and take the remaining samples as the test set Te.
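The selection procedure just described (the Kennard-Stone max-min criterion) can be sketched in Python. This is an illustrative implementation, not code from the patent; the function name and the NumPy-based distance computation are assumptions.

```python
import numpy as np

def kennard_stone(X, n_train):
    """Kennard-Stone sample selection (steps 1.1-1.4).

    Returns the indices of the training samples and of the test samples.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all candidate samples.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Step 1.2: start with the two most distant samples.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    # Step 1.3: repeatedly add the remaining sample whose nearest
    # selected neighbour is farthest away (max-min criterion).
    while len(selected) < n_train and remaining:
        min_d = D[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

On four collinear points, for example, the two extremes are picked first, then the point maximizing the minimum distance to the selected set.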
The method for calculating the distance between the sample subsets comprises the following steps: firstly, calculating the similarity between samples in a first training set Tr and a first test set Te, wherein the similarity comprises the similarity k (Tr, Tr) of the samples in Tr, the similarity k (Te, Te) of the samples in Te and the similarity k (Tr, Te) of the samples between Tr and Te;
The kernel function for the similarity calculation may be a polynomial kernel or a radial basis kernel:
k(x, y) = (xᵀy + ξ)^p (polynomial kernel)

k(x, y) = exp(−‖x − y‖² / (2σ²)) (radial basis kernel)
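The two kernels can be written directly in Python. A minimal sketch; the function names and default parameter values (ξ = 1, p = 2, σ = 1) are illustrative choices, not values from the patent.

```python
import numpy as np

def poly_kernel(x, y, xi=1.0, p=2):
    """Polynomial kernel k(x, y) = (x^T y + xi)^p."""
    return float(np.dot(x, y) + xi) ** p

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis (Gaussian) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = float(np.sum((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))
```

Both return 1-like maxima when x = y (exactly 1 for the RBF kernel), which is the usual sanity check for similarity functions.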
All sample similarities are arranged into a Gram matrix. Taking the first test set as an example:

K_Te = [k(x_i, x_j)]_{N×N},

where N is the sample size of the first test set. Taking the first training set as an example:

K_Tr = [k(x_i, x_j)]_{M×M},

where M is the first training set sample size.
Calculate the weight vector of each similarity matrix. Taking the first test set as an example:

s_{N×1} = N⁻¹ 1, where 1 is a vector of ones of length N;

taking the first training set as an example:

s_{M×1} = M⁻¹ 1.

Calculate the centering matrices. Taking the first test set as an example:

J_{N×N} = N^{−1/2} (I_N − s 1ᵀ);

taking the first training set as an example:

J_{M×M} = M^{−1/2} (I_M − s 1ᵀ).
Center the Gram matrices. Taking the first training set as an example, let

K̃_Tr = J_{M×M} K_Tr J_{M×M}ᵀ,

and taking the first test set as an example,

K̃_Te = J_{N×N} K_Te J_{N×N}ᵀ.

Perform an eigenvalue decomposition of the centered Gram matrix. Taking the first training set as an example, keep the first r1 eigenvectors, denoted V_r1 = [v₁, v₂, …, v_r1] (M×r1), with eigenvalues Λ_r1 = diag[λ₁, λ₂, …, λ_r1] (r1×r1). Taking the first test set as an example, keep the first r2 eigenvectors, denoted V_r2 = [v₁, v₂, …, v_r2] (N×r2), with eigenvalues Λ_r2 = diag[λ₁, λ₂, …, λ_r2] (r2×r2).
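The Gram-matrix construction, centering, and truncated eigendecomposition can be sketched with NumPy for a linear kernel. An illustrative sketch only: the function name is an assumption, and the patent may instead use a polynomial or radial basis kernel.

```python
import numpy as np

def centered_gram_eig(X, r):
    """Build a linear-kernel Gram matrix, center it with
    J = N^(-1/2) (I - s 1^T) where s = N^(-1) 1, and keep the
    r largest eigenpairs of the centered matrix."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    K = X @ X.T                                # Gram matrix of similarities
    s = np.full((N, 1), 1.0 / N)               # weight vector s = N^(-1) 1
    ones = np.ones((N, 1))
    J = N ** -0.5 * (np.eye(N) - s @ ones.T)   # centering matrix
    Kc = J @ K @ J.T                           # centered Gram matrix
    lam, V = np.linalg.eigh(Kc)                # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:r]          # indices of the r largest
    return lam[order], V[:, order]
```

For one-dimensional data [0, 1, 2], the centered Gram matrix has rank 1 and its single nonzero eigenvalue equals the centered sum of squares divided by N, i.e., 2/3.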
Let [the quantity defined by a formula shown only as an image in the source], where q is the number of retained eigenvectors, and [a second quantity, also shown only as an image], with i, j, k ∈ {1, 2}, 1 and 2 denoting the first training set and the first test set respectively.
Then the distance between the first training set and the first test set is defined as
D(Tr, Te) [formula shown only as an image in the source],
where the first quantity [image] approximates the covariance of the projection vectors of sample set j on mean i and mean k, with i, j, k ∈ {1, 2}, 1 denoting Tr and 2 denoting Te, and the second quantity [image] is the covariance of the reconstructed Gram matrix.
In step 3), the random division method is as follows: first determine the sample size K of the data set and generate a non-repeating random number sequence of length K with maximum value K (i.e., a random permutation of 1..K); then take the first n values of the random sequence as the indices of the training set, extract the corresponding samples to form the training set, and form the test set from the remaining samples.
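This random division step can be sketched as follows (illustrative; the seed parameter is an addition for reproducibility, and the indices are 0-based where the text counts from 1):

```python
import numpy as np

def random_split(K, n, seed=None):
    """Random division: draw a non-repeating random sequence of length K
    (a permutation of 0..K-1); the first n indices form the training set,
    the rest the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(K)
    return order[:n], order[n:]
```

Every sample lands in exactly one of the two subsets, which is the property the repeated-sampling steps 3) to 5) rely on.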
In steps 5) and 6), a kernel density estimation method is used for the probability estimation. If a Gaussian kernel is chosen, the density function can be written as

f(x) = (1 / (N h)) Σᵢ φ((x − xᵢ) / h),

where N is the sample size and φ denotes the Gaussian kernel; the bandwidth h (i.e., the variance of the Gaussian kernel) can be determined by cross-validation, or chosen to minimize the asymptotic integrated squared error. After a kernel is fitted at each point xᵢ (each data set distance), the final density curve is obtained as the accumulation of the kernels at all points. From this curve the probability P of a given data set distance can be calculated. The size of P is taken as the index for evaluating partition quality; that is, the partition with the lower probability is preferred. The kernel function may also be chosen as Uniform, Triangle, Epanechnikov, and so on.
The method provided by the invention overcomes the difficulty of quantifying conventional error analysis by building on the basic assumption that the training set and the test set must be mutually independent and drawn from the same distribution: it estimates the mean and variance of the sample set by decomposing the inter-sample distance matrix (the Gram matrix) and computes the distance between the distributions of the training set and the test set. A probability distribution is then estimated from the distances obtained by repeated random splitting, the probability of different partitions is calculated, and this exact quantitative index is used to evaluate the quality of a data partitioning method or its suitability for specific data. Simple and practical, the method provides an assessment of the effectiveness of sample set partitioning methods, helping researchers in the biomedical field select an appropriate data partitioning method and determine the true generalization performance of a modeling method.
The present invention is further described below using near-infrared spectroscopy data of Epsilon, in conjunction with the accompanying figures.
The data comprise 310 samples, 10 samples per batch, for a total of 31 batches: 12 batches were prepared in the laboratory, 12 in a pilot plant, and 7 in mass production. For each sample, 128 spectra were collected and averaged, at a resolution of 16 cm-1 over the wavelength range 700-.
1) The data set is first partitioned into a first training set and a first test set using the Kennard-Stone method.
The index of the first training set sample is:
281 2 18 149 276 41 252 76 262 221 182 224 131 250 201 102 205 298
142 84 175 295 296 168 11 60 231 264 203 178 66 270 21 193 209 215
192 260 225 91 254 172 155 308 268 272 302 164 188 13 58 306 171 57
218 106 74 244 151 3 286 4 253 271 33 79 220 279 154 137 173 242
101 12 241 116 186 227 267 72 283 285 110 54 124 179 280 83 237 255
99 143 289 32 153 211 132 126 157 261 62 284 28 120 274 166 236 69
223 282 35 249 238 109 245 204 52 46 219 127 47 246 240 42 232 214
138 163 31 94 239 61 159 304 14 174 207 170 156 16 258 147 36 19
287 217 140 130 167 305 169 310 139 293 25 20 299 216 105 210 117 266
162 1 5 8 34 78 114 98 243 134 195 158 95 300 206 7 257 146
the remaining samples were taken as the first test set.
2) Calculate the distance between the two sample sets based on the KL divergence in the reproducing kernel Hilbert space, with the parameters set as follows: kernel function, linear kernel; number of retained components for the first data set, r1 = 1; number of retained components for the second data set, r2 = 1. Note that r1 and r2 may be any integers and need not be equal. The kernel function may also be a Gaussian kernel, a polynomial kernel, etc. Under these parameter settings, the distance between the first training set and the first test set was 0.0082.
3) Dividing the data set into a second training set and a second test set using random division;
4) calculating the distance between sample sets (a second training set and a second testing set) obtained by random division by adopting the same parameter setting as the step 2);
5) Repeat steps 3) and 4) 500 times and estimate the distance distribution under random division using kernel density estimation. The optimal bandwidth of the kernel function is 0.0011.
6) The probability of sample division obtained by the KS method was 0.7542.
Changing the sample set division method in step 1) to the SPXY method yields a distance of 0.1210 between the training set and the test set; the probability of the SPXY partition, estimated with the probability density function obtained in step 5), is 1.
Ideally, the two data sets are drawn independently from the same distribution, and their distance should be close to 0. The smaller the value within the probability distribution generated by random division, the better the quality of the sample division. For this data, therefore, the quality of the KS division is superior to that of the SPXY method.
In error analysis, the prediction error of an ideal model on the test set is generally expected to be similar to its training error, but this closeness lacks quantification. In the embodiment of the invention, 21% of the models obtained by random division have a training-to-test error ratio greater than 2, and 20% have a ratio less than 0.5. The probability of a training-to-test prediction error ratio greater than 1.2 is 31.31%. For the model obtained from the KS division, the training-to-test error ratio is 0.7647, i.e., the test set prediction error is slightly higher than the training error; the probability of this ratio, estimated by kernel density estimation, is 0.3898. For the model obtained from the SPXY division, the ratio is 0.0947, i.e., the test error is much larger than the training error. Although error analysis also leads to the conclusion that the division quality of the KS method is superior to SPXY on this data set, the errors and their ratios are affected by the data, the differences obtained from different divisions do not correspond directly to degrees of quality difference, and interpreting the results involves some subjectivity. Constructing the distribution function of data set distances by applying random division to each data set avoids the difficulty that distance values obtained from different data sets are not directly comparable.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (4)

1. A method for evaluating sample set partition quality based on data set distance, characterized by comprising the following steps:
1) dividing a sample set, according to a sample partition method, into a first training set and a first test set, which are two disjoint sample subsets; the sample partition method is not a random partition method;
2) calculating the distance between the first training set and the first test set using a KL divergence method in a reproducing kernel Hilbert space;
3) dividing the data set into a second training set and a second test set using a random partition method;
4) calculating the distance between the second training set and the second test set using a KL divergence method in a reproducing kernel Hilbert space;
5) repeating steps 3) and 4) a preset number of times to obtain the probability distribution of the data set distance under random partitioning;
6) calculating the probability of the distance from step 2) under the distribution obtained in step 5), and taking the result P as the evaluation index of partition quality; a smaller P indicates higher partition quality.
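The evaluation loop of steps 1)–6) can be sketched as follows. This is an illustrative reading, not the patented implementation: `distance_fn` and `split_fn` are hypothetical stand-ins for the RKHS distance of step 2) and the non-random partition method of step 1), and P is computed as the empirical probability that a random split yields a distance no larger than the observed one.

```python
import numpy as np

def evaluate_split(distance_fn, split_fn, X, n_train, n_repeats=1000, rng=None):
    """Permutation-style evaluation of a proposed train/test split.

    distance_fn(A, B) -> scalar distance between two sample subsets.
    split_fn(X, n_train) -> (train, test) for the method under evaluation.
    """
    rng = np.random.default_rng(rng)
    # Steps 1)-2): distance for the proposed (non-random) split.
    tr, te = split_fn(X, n_train)
    d_obs = distance_fn(tr, te)
    # Steps 3)-5): distribution of the distance under random splitting.
    d_rand = np.empty(n_repeats)
    for i in range(n_repeats):
        idx = rng.permutation(len(X))
        d_rand[i] = distance_fn(X[idx[:n_train]], X[idx[n_train:]])
    # Step 6): empirical probability of a distance <= the observed one.
    p = float(np.mean(d_rand <= d_obs))
    return d_obs, p
```

Any subset-to-subset distance (the RKHS divergence of claim 3, or a simpler proxy) can be plugged in as `distance_fn`.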
2. The method for evaluating sample set partition quality based on data set distance according to claim 1, wherein said step 1) comprises:
1.1) denoting the first training set as Tr, initializing it as an empty set, and placing all samples into a candidate sample set denoted Tr_cand;
1.2) calculating the Euclidean distances between the samples in Tr_cand, moving the two samples in Tr_cand with the largest Euclidean distance into the first training set Tr, and deleting them from Tr_cand;
1.3) for each remaining sample in Tr_cand, calculating its distance to every selected sample in Tr and taking the minimum of these distances; moving the remaining sample whose minimum distance is largest into Tr, and deleting it from Tr_cand;
1.4) repeating step 1.3) until the number of samples in Tr reaches the set sample size, and taking the remaining samples as the first test set Te.
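Steps 1.1)–1.4) describe a max-min (Kennard-Stone-style) selection. A minimal sketch, assuming plain Euclidean distances and NumPy; the function and variable names are illustrative, not from the patent:

```python
import numpy as np

def max_min_split(X, n_train):
    """Select n_train samples by the max-min criterion of claim 2."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # 1.2) start from the two most distant samples.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    # 1.3)-1.4) repeatedly add the candidate whose nearest selected
    # sample is farthest away, until n_train samples are chosen.
    while len(selected) < n_train:
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected), np.array(remaining)  # Tr indices, Te indices
```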
3. The method for evaluating sample set partition quality based on data set distance according to claim 2, wherein said step 2) comprises:
calculating the similarities between the samples of the first training set Tr and the first test set Te, the similarities including: the similarity k(Tr, Tr) among the samples in Tr, the similarity k(Te, Te) among the samples in Te, and the similarity k(Tr, Te) between the samples of Tr and Te;
selecting a polynomial kernel function or a radial basis kernel function as the kernel function for the similarity calculation;
arranging all the sample similarities into Gram matrices;
calculating a weight vector for each similarity matrix and calculating a centering matrix;
centering the Gram matrices, and defining the distance between the training set and the test set by the formula given as image FDA0002320283370000021 in the original publication, in which the term shown as image FDA0002320283370000022 approximates the covariance of the projection of sample set j onto means i and k, where i, j, k ∈ {1, 2}, 1 denotes Tr and 2 denotes Te, and the term shown as image FDA0002320283370000023 is the covariance of the reconstructed Gram matrix.
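The exact RKHS divergence of claim 3 is given only as formula images in the publication, so it cannot be reproduced here. The sketch below shows the ingredients the claim does name, a radial-basis Gram matrix and its centering, together with the squared maximum mean discrepancy as one well-known stand-in for an RKHS distance between Tr and Te; the patent's own KL-divergence formula may differ:

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    """Gram matrix of the radial basis kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def center_gram(K):
    """Double-center a Gram matrix: subtract row and column means, add the grand mean."""
    return K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()

def mmd2(tr, te, gamma=1.0):
    """Squared maximum mean discrepancy between Tr and Te in the RKHS."""
    return (rbf_gram(tr, tr, gamma).mean()
            + rbf_gram(te, te, gamma).mean()
            - 2.0 * rbf_gram(tr, te, gamma).mean())
```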
4. The method for evaluating sample set partition quality based on data set distance according to claim 1, wherein said step 3) comprises:
determining the sample size K of the data set, and generating a random number sequence of length K with maximum value K and no repetitions;
taking the first n values of the random sequence as the indices of the training set, extracting the corresponding samples to form the training set, and forming the test set from the remaining samples.
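The random partition of claim 4 amounts to drawing a random permutation of the K sample indices. A minimal sketch (names are illustrative):

```python
import numpy as np

def random_split(X, n_train, rng=None):
    """Random train/test split: permute the K indices, take the first n_train."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(X))  # K distinct indices, i.e. no repetition
    return X[order[:n_train]], X[order[n_train:]]
```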
CN201911300236.6A 2019-12-16 2019-12-16 Method for evaluating sample set partition quality based on data set distance Pending CN111191884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911300236.6A CN111191884A (en) 2019-12-16 2019-12-16 Method for evaluating sample set partition quality based on data set distance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911300236.6A CN111191884A (en) 2019-12-16 2019-12-16 Method for evaluating sample set partition quality based on data set distance

Publications (1)

Publication Number Publication Date
CN111191884A true CN111191884A (en) 2020-05-22

Family

ID=70707392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911300236.6A Pending CN111191884A (en) 2019-12-16 2019-12-16 Method for evaluating sample set partition quality based on data set distance

Country Status (1)

Country Link
CN (1) CN111191884A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241832A (en) * 2020-09-28 2021-01-19 北京科技大学 Product quality grading evaluation standard design method and system
CN112241832B (en) * 2020-09-28 2024-03-05 北京科技大学 Product quality grading evaluation standard design method and system
CN112417113A (en) * 2020-11-10 2021-02-26 绿瘦健康产业集团有限公司 Intelligent question-answering method and system based on voice recognition technology

Similar Documents

Publication Publication Date Title
Wang et al. Combination of hyperband and Bayesian optimization for hyperparameter optimization in deep learning
CN112382352B (en) Method for quickly evaluating structural characteristics of metal organic framework material based on machine learning
CN111242206B (en) High-resolution ocean water temperature calculation method based on hierarchical clustering and random forests
CN108985335B (en) Integrated learning prediction method for irradiation swelling of nuclear reactor cladding material
CN111311401A (en) Financial default probability prediction model based on LightGBM
CN109643085A (en) Real-time industrial equipment production forecast and operation optimization
CN110751101B (en) Fatigue driving judgment method based on multiple clustering algorithm of unsupervised extreme learning machine
CN111191884A (en) Method for evaluating sample set partition quality based on data set distance
CN112289391B (en) Anode aluminum foil performance prediction system based on machine learning
Peltola et al. Hierarchical Bayesian Survival Analysis and Projective Covariate Selection in Cardiovascular Event Risk Prediction.
CN114169110B (en) Motor bearing fault diagnosis method based on feature optimization and GWAA-XGboost
Steingroever et al. Bayes factors for reinforcement-learning models of the Iowa gambling task.
CN113240113B (en) Method for enhancing network prediction robustness
Bağbaba Improving collective i/o performance with machine learning supported auto-tuning
CN109960146A (en) The method for improving soft measuring instrument model prediction accuracy
CN117861961A (en) Temperature control method and system of dispensing machine
CN113822336A (en) Cloud hard disk fault prediction method, device and system and readable storage medium
Dumont et al. Hyperparameter optimization of generative adversarial network models for high-energy physics simulations
CN112819085B (en) Model optimization method, device and storage medium based on machine learning
CN115344386A (en) Method, device and equipment for predicting cloud simulation computing resources based on sequencing learning
CN111160464B (en) Industrial high-order dynamic process soft measurement method based on multi-hidden-layer weighted dynamic model
JP2023521757A (en) Using a genetic algorithm to determine a model for identifying sample attributes based on Raman spectra
CN111523685A (en) Method for reducing performance modeling overhead based on active learning
CN117494573B (en) Wind speed prediction method and system and electronic equipment
CN113312988B (en) Signal feature screening and dimension reduction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200522