CN111191884A - Method for evaluating sample set partition quality based on data set distance

Method for evaluating sample set partition quality based on data set distance

Info

Publication number
CN111191884A
CN111191884A (application CN201911300236.6A)
Authority
CN
China
Prior art keywords
sample
distance
samples
calculating
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911300236.6A
Other languages
Chinese (zh)
Inventor
林兆洲
王大仟
张金霞
关竹君
姜迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hospital of Traditional Chinese Medicine Affiliated to Capital University of Medicine Sciences
Original Assignee
Beijing Hospital of Traditional Chinese Medicine Affiliated to Capital University of Medicine Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Hospital of Traditional Chinese Medicine Affiliated to Capital University of Medicine Sciences
Priority to CN201911300236.6A
Publication of CN111191884A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395 Quality analysis or management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193 Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Economics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a method for evaluating sample set partition quality based on data set distance. It overcomes the difficulty of quantifying conventional error analysis by building on the basic assumption that the training set and the test set must be mutually independent and drawn from the same distribution: the mean and variance of a sample set are estimated by decomposing the inter-sample distance matrix, and the distance between the distributions of the training set and the test set is computed. A probability distribution is then estimated from the distances obtained by repeated random splitting, the probability of a given partition is calculated, and this exact quantitative index is used to evaluate the quality of a data partition or the suitability of a partitioning method for specific data. Simple and practical, the method provides an assessment of the effectiveness of sample set partitioning methods, helping researchers in the biomedical field select an appropriate data partitioning method and determine the true generalization performance of a modeling method.

Description

Method for evaluating sample set partition quality based on data set distance
Technical Field
The invention relates to the technical field of biomedicine, in particular to a method for evaluating sample set partition quality based on data set distance.
Background
Sample division plays an important role in the biomedical field. Its objective is to generate a test set with which to estimate the generalization ability of a model. A model is a formal expression of the relations in the data, through which unknown samples can be predicted. The training error reflects the learning ability of the modeling method, while predictive ability on unknown samples is the real goal of modeling. The data set used for model training is therefore usually required to be large enough to cover the range of future applications. When estimating the generalization ability of a model, the test set must remain consistent with the training range of the model; that is, the two sets must be independent and identically distributed. For real data, however, the distribution is difficult to estimate, so the prediction performance on the test set, and its comparison with performance on the training set, are commonly used to characterize the quality of the sample set division indirectly. But a drop in test-set prediction performance cannot be attributed entirely to a distribution difference between the training and test sets, and the magnitude of the error varies considerably across data sets. A method that directly characterizes the division quality of a sample set is therefore needed, so that the quality can be evaluated objectively.
The division of a sample set is usually based on the characteristics of the actual data, the analysis requirements, and the mechanism of the available methods. Many methods are commonly used for sample set partitioning (data splitting), including random sampling (RS), the Kennard-Stone (KS) method, the SPXY method, the DUPLEX method, and others. The choice among them is largely empirical and there is no uniform quantitative index, so the suitability of a partitioning method is generally evaluated or optimized according to the errors (statistics) of the training and test sets generated by different partitioning methods, i.e., by error analysis.
Therefore, how to establish an objective quality evaluation method and assist in selecting an appropriate partitioning method is an urgent problem for practitioners in the field.
Disclosure of Invention
In view of the above problems, the present invention provides a method for evaluating sample set partition quality based on data set distance, which at least solves some of the above technical problems, and can achieve the purpose of evaluating sample partition quality.
The embodiment of the invention provides a method for evaluating sample set partition quality based on data set distance, which comprises the following steps:
1) according to a sample division method, dividing a sample set into a first training set and a first test set, two independent, non-overlapping sample subsets; the sample division method does not include the random division method;
2) calculating the distance between the first training set and the first test set using a KL divergence method in a reproducing kernel Hilbert space;
3) dividing the data set into a second training set and a second test set using the random division method;
4) calculating the distance between the second training set and the second test set using a KL divergence method in a reproducing kernel Hilbert space;
5) repeating steps 3) and 4) a preset number of times to obtain the probability distribution of the data set distance under random division;
6) calculating the probability of the distance from step 2) under the probability distribution from step 5), and taking the result P as the evaluation index of division quality; a smaller P indicates a higher division quality for the chosen method (e.g., the KS method or the SPXY method).
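The six steps above can be sketched end to end in Python. This is an illustrative outline only: the function names are assumptions, and the RKHS KL-divergence distance of steps 2) and 4) is replaced by a simple placeholder (the Euclidean distance between set means), with the empirical left-tail frequency standing in for the kernel-density probability of step 6).

```python
import numpy as np

def set_distance(a, b):
    # Placeholder for the patent's RKHS KL-divergence distance:
    # here simply the Euclidean distance between the set means (an assumption).
    return float(np.linalg.norm(np.mean(a, axis=0) - np.mean(b, axis=0)))

def partition_probability(X, train_idx, n_repeats=200, seed=0):
    """Steps 1)-6) in outline: compare a given split's distance with the
    distribution of distances from repeated random splits."""
    X = np.asarray(X, dtype=float)
    mask = np.zeros(len(X), dtype=bool)
    mask[train_idx] = True
    d_obs = set_distance(X[mask], X[~mask])        # step 2)
    rng = np.random.default_rng(seed)
    n_train = int(mask.sum())
    rand_d = []
    for _ in range(n_repeats):                     # steps 3)-5)
        perm = rng.permutation(len(X))
        rand_d.append(set_distance(X[perm[:n_train]], X[perm[n_train:]]))
    # Step 6): empirical probability that a random split's distance is at most
    # the observed one; a small value means the split is unusually compact
    # (this reading of the patent's P is an assumption).
    return float(np.mean(np.asarray(rand_d) <= d_obs))
```

A split whose distance sits low in the random-split distribution yields a small P and is preferred.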
Further, the step 1) includes:
1.1) denoting the first training set as Tr, initializing it as an empty set, and placing all samples into a candidate sample set denoted Tr_cand;
1.2) calculating the Euclidean distances between samples in Tr_cand, moving the two samples with the largest Euclidean distance into the first training set Tr, and deleting them from Tr_cand;
1.3) for each remaining sample in Tr_cand, calculating its distance to each selected sample in Tr and taking the minimum of these distances; moving the sample whose minimum distance is largest into Tr and deleting it from Tr_cand;
1.4) repeating step 1.3) until the number of samples in Tr reaches the set amount, and taking the remaining samples as the first test set Te.
Further, the step 2) comprises:
calculating the similarity between the samples in the first training set Tr and the first test set Te; the similarity includes: the similarity k(Tr, Tr) among samples in Tr, the similarity k(Te, Te) among samples in Te, and the similarity k(Tr, Te) between samples of Tr and Te;
the kernel function for the similarity calculation is a polynomial kernel or a radial basis kernel;
arranging all the sample similarities into a Gram matrix;
calculating a weight vector for each similarity matrix and calculating a centering matrix;
centering the Gram matrix, and defining the distance between the training set and the test set as
D(Tr, Te) [formula shown only as an image in the source],
where the first quantity [image] approximates the covariance of the projection vectors of sample set j on mean i and mean k, with i, j, k ∈ {1, 2}, 1 denoting Tr and 2 denoting Te, and the second quantity [image] is the covariance of the reconstructed Gram matrix.
Further, the step 3) comprises:
determining the sample size K of the data set and generating a non-repeating random number sequence of length K with maximum value K (i.e., a random permutation of 1..K);
taking the first n values of the random sequence as the indices of the training set, extracting the corresponding samples to form the training set, and forming the test set from the remaining samples.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
the method for evaluating the sample set division quality based on the data set distance can overcome the defects of quantization and difficult evaluation of the conventional error analysis, tightly grasps the basic assumption that the training set and the test set need to be mutually independent and come from the same distribution, estimates the mean value and the variance of the sample set by decomposing the distance matrix between the samples, and calculates the distance between the two distributions of the training set and the test set. And performing probability distribution estimation by using distance distribution obtained by random sampling, calculating the probability of different partitions, and evaluating the quality of data partition or the adaptability of the partition method to specific data by using an exact quantization index. On the basis of simplicity and practicality, the method provides the evaluation of the effectiveness of the sample set partitioning method, and provides a suitable method for helping researchers in the biomedical field to select a proper data partitioning method and determining the real generalization performance of the modeling method.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
fig. 1 is a flowchart of a method for evaluating sample set partition quality based on data set distance according to an embodiment of the present invention;
fig. 2 is a process diagram of step 1) provided in the embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Referring to fig. 1, a method for evaluating sample set partition quality based on data set distance according to an embodiment of the present invention includes:
1) according to a sample division method, dividing a sample set into a first training set and a first test set, two independent, non-overlapping sample subsets; the sample division method does not include the random division method;
2) calculating the distance between the first training set and the first test set using a KL divergence method in a reproducing kernel Hilbert space;
3) dividing the data set into a second training set and a second test set using the random division method;
4) calculating the distance between the second training set and the second test set using a KL divergence method in a reproducing kernel Hilbert space;
5) repeating steps 3) and 4) a preset number of times to obtain the probability distribution of the data set distance under random division;
6) calculating the probability of the distance from step 2) under the probability distribution from step 5), and taking the result P as the evaluation index of division quality; a smaller P indicates a higher sample division quality.
To avoid ambiguity, the qualifiers "first" and "second" are prepended to the names of the training and test sets in the steps above.
The above steps are described in detail below:
In step 1), the sample division method may be, for example, the KS method, the SPXY method, OS, or DUPLEX. As shown in fig. 2, the sample division includes:
1.1) denoting the first training set as Tr, initializing it as an empty set, and placing all samples into a candidate sample set denoted Tr_cand;
1.2) calculating the Euclidean distances between samples in Tr_cand, moving the two samples with the largest Euclidean distance into the first training set Tr, and deleting them from Tr_cand;
1.3) for each remaining sample in Tr_cand, calculating its distance to each selected sample in Tr and taking the minimum of these distances; moving the sample whose minimum distance is largest into Tr and deleting it from Tr_cand;
1.4) repeating step 1.3) until the number of samples in Tr reaches the set amount, and taking the remaining samples as the first test set Te.
In this embodiment, let the training set be Tr and initialize it as an empty set; put all samples into the candidate sample set, denoted Tr_cand. First, calculate the Euclidean distances between the samples in Tr_cand, move the two samples with the largest Euclidean distance into the training set Tr, and delete them from Tr_cand. Then, for each remaining sample in Tr_cand, calculate its distance to each selected sample in Tr and take the minimum of these distances; move the sample whose minimum distance is largest into Tr and delete it from Tr_cand. Repeat this selection until the number of samples in Tr reaches the set amount, and take the remaining samples as the test set Te.
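The selection procedure just described (the Kennard-Stone max-min criterion) can be sketched in Python. This is an illustrative implementation, not code from the patent; the function name and the NumPy-based distance computation are assumptions.

```python
import numpy as np

def kennard_stone(X, n_train):
    """Kennard-Stone sample selection (steps 1.1-1.4).

    Returns the indices of the training samples and of the test samples.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all candidate samples.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Step 1.2: start with the two most distant samples.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    # Step 1.3: repeatedly add the remaining sample whose nearest
    # selected neighbour is farthest away (max-min criterion).
    while len(selected) < n_train and remaining:
        min_d = D[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

On four collinear points, for example, the two extremes are picked first, then the point maximizing the minimum distance to the selected set.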
The method for calculating the distance between the sample subsets comprises the following steps: firstly, calculating the similarity between samples in a first training set Tr and a first test set Te, wherein the similarity comprises the similarity k (Tr, Tr) of the samples in Tr, the similarity k (Te, Te) of the samples in Te and the similarity k (Tr, Te) of the samples between Tr and Te;
The kernel function for the similarity calculation may be a polynomial kernel or a radial basis kernel:
k(x, y) = (xᵀy + ξ)^p (polynomial kernel)

k(x, y) = exp(−‖x − y‖² / (2σ²)) (radial basis kernel)
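The two kernels can be written directly in Python. A minimal sketch; the function names and default parameter values (ξ = 1, p = 2, σ = 1) are illustrative choices, not values from the patent.

```python
import numpy as np

def poly_kernel(x, y, xi=1.0, p=2):
    """Polynomial kernel k(x, y) = (x^T y + xi)^p."""
    return float(np.dot(x, y) + xi) ** p

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis (Gaussian) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = float(np.sum((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))
```

Both return 1-like maxima when x = y (exactly 1 for the RBF kernel), which is the usual sanity check for similarity functions.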
All sample similarities are arranged into a Gram matrix. Taking the first test set as an example:

K_Te = [k(x_i, x_j)]_{N×N},

where N is the sample size of the first test set. Taking the first training set as an example:

K_Tr = [k(x_i, x_j)]_{M×M},

where M is the first training set sample size.
Calculate the weight vector of each similarity matrix. Taking the first test set as an example:

s_{N×1} = N⁻¹ 1, where 1 is a vector of ones of length N;

taking the first training set as an example:

s_{M×1} = M⁻¹ 1.

Calculate the centering matrices. Taking the first test set as an example:

J_{N×N} = N^{−1/2} (I_N − s 1ᵀ);

taking the first training set as an example:

J_{M×M} = M^{−1/2} (I_M − s 1ᵀ).
Center the Gram matrices. Taking the first training set as an example, let

K̃_Tr = J_{M×M} K_Tr J_{M×M}ᵀ,

and taking the first test set as an example,

K̃_Te = J_{N×N} K_Te J_{N×N}ᵀ.

Perform an eigenvalue decomposition of the centered Gram matrix. Taking the first training set as an example, keep the first r1 eigenvectors, denoted V_r1 = [v₁, v₂, …, v_r1] (M×r1), with eigenvalues Λ_r1 = diag[λ₁, λ₂, …, λ_r1] (r1×r1). Taking the first test set as an example, keep the first r2 eigenvectors, denoted V_r2 = [v₁, v₂, …, v_r2] (N×r2), with eigenvalues Λ_r2 = diag[λ₁, λ₂, …, λ_r2] (r2×r2).
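The Gram-matrix construction, centering, and truncated eigendecomposition can be sketched with NumPy for a linear kernel. An illustrative sketch only: the function name is an assumption, and the patent may instead use a polynomial or radial basis kernel.

```python
import numpy as np

def centered_gram_eig(X, r):
    """Build a linear-kernel Gram matrix, center it with
    J = N^(-1/2) (I - s 1^T) where s = N^(-1) 1, and keep the
    r largest eigenpairs of the centered matrix."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    K = X @ X.T                                # Gram matrix of similarities
    s = np.full((N, 1), 1.0 / N)               # weight vector s = N^(-1) 1
    ones = np.ones((N, 1))
    J = N ** -0.5 * (np.eye(N) - s @ ones.T)   # centering matrix
    Kc = J @ K @ J.T                           # centered Gram matrix
    lam, V = np.linalg.eigh(Kc)                # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:r]          # indices of the r largest
    return lam[order], V[:, order]
```

For one-dimensional data [0, 1, 2], the centered Gram matrix has rank 1 and its single nonzero eigenvalue equals the centered sum of squares divided by N, i.e., 2/3.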
Let [the quantity defined by a formula shown only as an image in the source], where q is the number of retained eigenvectors, and [a second quantity, also shown only as an image], with i, j, k ∈ {1, 2}, 1 and 2 denoting the first training set and the first test set respectively.
Then the distance between the first training set and the first test set is defined as
D(Tr, Te) [formula shown only as an image in the source],
where the first quantity [image] approximates the covariance of the projection vectors of sample set j on mean i and mean k, with i, j, k ∈ {1, 2}, 1 denoting Tr and 2 denoting Te, and the second quantity [image] is the covariance of the reconstructed Gram matrix.
In step 3), the random division method is as follows: first determine the sample size K of the data set and generate a non-repeating random number sequence of length K with maximum value K (i.e., a random permutation of 1..K); then take the first n values of the random sequence as the indices of the training set, extract the corresponding samples to form the training set, and form the test set from the remaining samples.
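This random division step can be sketched as follows (illustrative; the seed parameter is an addition for reproducibility, and the indices are 0-based where the text counts from 1):

```python
import numpy as np

def random_split(K, n, seed=None):
    """Random division: draw a non-repeating random sequence of length K
    (a permutation of 0..K-1); the first n indices form the training set,
    the rest the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(K)
    return order[:n], order[n:]
```

Every sample lands in exactly one of the two subsets, which is the property the repeated-sampling steps 3) to 5) rely on.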
In steps 5) and 6), a kernel density estimation method is used for the probability estimation. If a Gaussian kernel is chosen, the density function can be written as

f(x) = (1 / (N h)) Σᵢ φ((x − xᵢ) / h),

where N is the sample size and φ denotes the Gaussian kernel; the bandwidth h (i.e., the variance of the Gaussian kernel) can be determined by cross-validation, or chosen to minimize the asymptotic integrated squared error. After a kernel is fitted at each point xᵢ (each data set distance), the final density curve is obtained as the accumulation of the kernels at all points. From this curve the probability P of a given data set distance can be calculated. The size of P is taken as the index for evaluating partition quality; that is, the partition with the lower probability is preferred. The kernel function may also be chosen as Uniform, Triangle, Epanechnikov, and so on.
The method provided by the invention overcomes the difficulty of quantifying conventional error analysis by building on the basic assumption that the training set and the test set must be mutually independent and drawn from the same distribution: it estimates the mean and variance of the sample set by decomposing the inter-sample distance matrix (the Gram matrix) and computes the distance between the distributions of the training set and the test set. A probability distribution is then estimated from the distances obtained by repeated random splitting, the probability of different partitions is calculated, and this exact quantitative index is used to evaluate the quality of a data partitioning method or its suitability for specific data. Simple and practical, the method provides an assessment of the effectiveness of sample set partitioning methods, helping researchers in the biomedical field select an appropriate data partitioning method and determine the true generalization performance of a modeling method.
The present invention is further described below using near-infrared spectroscopy data of Epsilon, in conjunction with the accompanying figures.
The data comprise 310 samples, 10 samples per batch, for a total of 31 batches: 12 batches were prepared in the laboratory, 12 in a pilot plant, and 7 in mass production. For each sample, 128 spectra were collected and averaged, at a resolution of 16 cm-1 over the wavelength range 700-.
1) The data set is first partitioned into a first training set and a first test set using the Kennard-Stone method.
The index of the first training set sample is:
281 2 18 149 276 41 252 76 262 221 182 224 131 250 201 102 205 298
142 84 175 295 296 168 11 60 231 264 203 178 66 270 21 193 209 215
192 260 225 91 254 172 155 308 268 272 302 164 188 13 58 306 171 57
218 106 74 244 151 3 286 4 253 271 33 79 220 279 154 137 173 242
101 12 241 116 186 227 267 72 283 285 110 54 124 179 280 83 237 255
99 143 289 32 153 211 132 126 157 261 62 284 28 120 274 166 236 69
223 282 35 249 238 109 245 204 52 46 219 127 47 246 240 42 232 214
138 163 31 94 239 61 159 304 14 174 207 170 156 16 258 147 36 19
287 217 140 130 167 305 169 310 139 293 25 20 299 216 105 210 117 266
162 1 5 8 34 78 114 98 243 134 195 158 95 300 206 7 257 146
the remaining samples were taken as the first test set.
2) Calculate the distance between the two sample sets based on the KL divergence in the reproducing kernel Hilbert space, with the parameters set as follows: kernel function, linear kernel; number of retained components for the first data set, r1 = 1; number of retained components for the second data set, r2 = 1. Note that r1 and r2 may be any integers and need not be equal. The kernel function may also be a Gaussian kernel, a polynomial kernel, etc. Under these parameter settings, the distance between the first training set and the first test set was 0.0082.
3) Dividing the data set into a second training set and a second test set using random division;
4) calculating the distance between sample sets (a second training set and a second testing set) obtained by random division by adopting the same parameter setting as the step 2);
5) Repeat steps 3) and 4) 500 times and estimate the distance distribution under random division using kernel density estimation. The optimal bandwidth of the kernel function is 0.0011.
6) The probability of sample division obtained by the KS method was 0.7542.
Changing the sample set division method in step 1) to the SPXY method yields a distance of 0.1210 between the training set and the test set; the probability of the SPXY partition, estimated with the probability density function obtained in step 5), is 1.
Ideally, the two data sets are drawn independently from the same distribution, and their distance should be close to 0. The smaller the value within the probability distribution generated by random division, the better the quality of the sample division. For this data, therefore, the quality of the KS division is superior to that of the SPXY method.
In error analysis, the prediction error of an ideal model on the test set is generally expected to be similar to its training error, but this closeness lacks quantification. In the embodiment of the invention, 21% of the models obtained by random division have a training-to-test error ratio greater than 2, and 20% have a ratio less than 0.5. The probability of a training-to-test prediction error ratio greater than 1.2 is 31.31%. For the model obtained from the KS division, the training-to-test error ratio is 0.7647, i.e., the test set prediction error is slightly higher than the training error; the probability of this ratio, estimated by kernel density estimation, is 0.3898. For the model obtained from the SPXY division, the ratio is 0.0947, i.e., the test error is much larger than the training error. Although error analysis also leads to the conclusion that the division quality of the KS method is superior to SPXY on this data set, the errors and their ratios are affected by the data, the differences obtained from different divisions do not correspond directly to degrees of quality difference, and interpreting the results involves some subjectivity. Constructing the distribution function of data set distances by applying random division to each data set avoids the difficulty that distance values obtained from different data sets are not directly comparable.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (4)

1. A method for evaluating sample set partition quality based on data set distance, characterized by comprising the following steps:
1) dividing a sample set, according to a sample partition method, into a first training set and a first test set, which are two disjoint sample subsets; the sample partition method is not a random partition method;
2) calculating the distance between the first training set and the first test set using a KL divergence method in a reproducing kernel Hilbert space;
3) dividing the data set into a second training set and a second test set using a random partition method;
4) calculating the distance between the second training set and the second test set using a KL divergence method in a reproducing kernel Hilbert space;
5) repeating steps 3) and 4) a preset number of times to obtain the probability distribution of the data set distance under random partitioning;
6) calculating the probability of the distance from step 2) under the distribution obtained in step 5), and taking the result P as the evaluation index of partition quality; a smaller P indicates higher partition quality.
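The evaluation loop of steps 1)–6) can be sketched as follows. This is an illustrative reading, not the patented implementation: `distance_fn` and `split_fn` are hypothetical stand-ins for the RKHS distance of step 2) and the non-random partition method of step 1), and P is computed as the empirical probability that a random split yields a distance no larger than the observed one.

```python
import numpy as np

def evaluate_split(distance_fn, split_fn, X, n_train, n_repeats=1000, rng=None):
    """Permutation-style evaluation of a proposed train/test split.

    distance_fn(A, B) -> scalar distance between two sample subsets.
    split_fn(X, n_train) -> (train, test) for the method under evaluation.
    """
    rng = np.random.default_rng(rng)
    # Steps 1)-2): distance for the proposed (non-random) split.
    tr, te = split_fn(X, n_train)
    d_obs = distance_fn(tr, te)
    # Steps 3)-5): distribution of the distance under random splitting.
    d_rand = np.empty(n_repeats)
    for i in range(n_repeats):
        idx = rng.permutation(len(X))
        d_rand[i] = distance_fn(X[idx[:n_train]], X[idx[n_train:]])
    # Step 6): empirical probability of a distance <= the observed one.
    p = float(np.mean(d_rand <= d_obs))
    return d_obs, p
```

Any subset-to-subset distance (the RKHS divergence of claim 3, or a simpler proxy) can be plugged in as `distance_fn`.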
2. The method for evaluating sample set partition quality based on data set distance according to claim 1, wherein said step 1) comprises:
1.1) denoting the first training set as Tr, initializing it as an empty set, and placing all samples into a candidate sample set denoted Tr_cand;
1.2) calculating the Euclidean distances between the samples in Tr_cand, moving the two samples in Tr_cand with the largest Euclidean distance into the first training set Tr, and deleting them from Tr_cand;
1.3) for each remaining sample in Tr_cand, calculating its distance to every selected sample in Tr and taking the minimum of these distances; moving the remaining sample whose minimum distance is largest into Tr, and deleting it from Tr_cand;
1.4) repeating step 1.3) until the number of samples in Tr reaches the set sample size, and taking the remaining samples as the first test set Te.
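Steps 1.1)–1.4) describe a max-min (Kennard-Stone-style) selection. A minimal sketch, assuming plain Euclidean distances and NumPy; the function and variable names are illustrative, not from the patent:

```python
import numpy as np

def max_min_split(X, n_train):
    """Select n_train samples by the max-min criterion of claim 2."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # 1.2) start from the two most distant samples.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    # 1.3)-1.4) repeatedly add the candidate whose nearest selected
    # sample is farthest away, until n_train samples are chosen.
    while len(selected) < n_train:
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected), np.array(remaining)  # Tr indices, Te indices
```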
3. The method for evaluating sample set partition quality based on data set distance according to claim 2, wherein said step 2) comprises:
calculating the similarities between the samples of the first training set Tr and the first test set Te, the similarities including: the similarity k(Tr, Tr) among the samples in Tr, the similarity k(Te, Te) among the samples in Te, and the similarity k(Tr, Te) between the samples of Tr and Te;
selecting a polynomial kernel function or a radial basis kernel function as the kernel function for the similarity calculation;
arranging all the sample similarities into Gram matrices;
calculating a weight vector for each similarity matrix and calculating a centering matrix;
centering the Gram matrices, and defining the distance between the training set and the test set by the formula given as image FDA0002320283370000021 in the original publication, in which the term shown as image FDA0002320283370000022 approximates the covariance of the projection of sample set j onto means i and k, where i, j, k ∈ {1, 2}, 1 denotes Tr and 2 denotes Te, and the term shown as image FDA0002320283370000023 is the covariance of the reconstructed Gram matrix.
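The exact RKHS divergence of claim 3 is given only as formula images in the publication, so it cannot be reproduced here. The sketch below shows the ingredients the claim does name, a radial-basis Gram matrix and its centering, together with the squared maximum mean discrepancy as one well-known stand-in for an RKHS distance between Tr and Te; the patent's own KL-divergence formula may differ:

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    """Gram matrix of the radial basis kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def center_gram(K):
    """Double-center a Gram matrix: subtract row and column means, add the grand mean."""
    return K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()

def mmd2(tr, te, gamma=1.0):
    """Squared maximum mean discrepancy between Tr and Te in the RKHS."""
    return (rbf_gram(tr, tr, gamma).mean()
            + rbf_gram(te, te, gamma).mean()
            - 2.0 * rbf_gram(tr, te, gamma).mean())
```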
4. The method for evaluating sample set partition quality based on data set distance according to claim 1, wherein said step 3) comprises:
determining the sample size K of the data set, and generating a random number sequence of length K with maximum value K and no repetitions;
taking the first n values of the random sequence as the indices of the training set, extracting the corresponding samples to form the training set, and forming the test set from the remaining samples.
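The random partition of claim 4 amounts to drawing a random permutation of the K sample indices. A minimal sketch (names are illustrative):

```python
import numpy as np

def random_split(X, n_train, rng=None):
    """Random train/test split: permute the K indices, take the first n_train."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(X))  # K distinct indices, i.e. no repetition
    return X[order[:n_train]], X[order[n_train:]]
```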
CN201911300236.6A 2019-12-16 2019-12-16 Method for evaluating sample set partition quality based on data set distance Pending CN111191884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911300236.6A CN111191884A (en) 2019-12-16 2019-12-16 Method for evaluating sample set partition quality based on data set distance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911300236.6A CN111191884A (en) 2019-12-16 2019-12-16 Method for evaluating sample set partition quality based on data set distance

Publications (1)

Publication Number Publication Date
CN111191884A true CN111191884A (en) 2020-05-22

Family

ID=70707392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911300236.6A Pending CN111191884A (en) 2019-12-16 2019-12-16 Method for evaluating sample set partition quality based on data set distance

Country Status (1)

Country Link
CN (1) CN111191884A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241832A (en) * 2020-09-28 2021-01-19 北京科技大学 Product quality grading evaluation standard design method and system
CN112241832B (en) * 2020-09-28 2024-03-05 北京科技大学 Product quality grading evaluation standard design method and system
CN112417113A (en) * 2020-11-10 2021-02-26 绿瘦健康产业集团有限公司 Intelligent question-answering method and system based on voice recognition technology

Similar Documents

Publication Publication Date Title
Wang et al. Combination of hyperband and Bayesian optimization for hyperparameter optimization in deep learning
CN112382352B (en) Method for quickly evaluating structural characteristics of metal organic framework material based on machine learning
CN111242206B (en) High-resolution ocean water temperature calculation method based on hierarchical clustering and random forests
CN108985335B (en) Integrated learning prediction method for irradiation swelling of nuclear reactor cladding material
CN111311401A (en) Financial default probability prediction model based on LightGBM
CN109643085A (en) Real-time industrial equipment production forecast and operation optimization
CN110751101B (en) Fatigue driving judgment method based on multiple clustering algorithm of unsupervised extreme learning machine
CN111191884A (en) Method for evaluating sample set partition quality based on data set distance
CN112289391B (en) Anode aluminum foil performance prediction system based on machine learning
Peltola et al. Hierarchical Bayesian Survival Analysis and Projective Covariate Selection in Cardiovascular Event Risk Prediction.
CN114169110B (en) Motor bearing fault diagnosis method based on feature optimization and GWAA-XGboost
Steingroever et al. Bayes factors for reinforcement-learning models of the Iowa gambling task.
CN113240113B (en) Method for enhancing network prediction robustness
Bağbaba Improving collective i/o performance with machine learning supported auto-tuning
CN109960146A (en) The method for improving soft measuring instrument model prediction accuracy
CN117861961A (en) Temperature control method and system of dispensing machine
CN113822336A (en) Cloud hard disk fault prediction method, device and system and readable storage medium
Dumont et al. Hyperparameter optimization of generative adversarial network models for high-energy physics simulations
CN112819085B (en) Model optimization method, device and storage medium based on machine learning
CN115344386A (en) Method, device and equipment for predicting cloud simulation computing resources based on sequencing learning
CN111160464B (en) Industrial high-order dynamic process soft measurement method based on multi-hidden-layer weighted dynamic model
JP2023521757A (en) Using a genetic algorithm to determine a model for identifying sample attributes based on Raman spectra
CN111523685A (en) Method for reducing performance modeling overhead based on active learning
CN117494573B (en) Wind speed prediction method and system and electronic equipment
CN113312988B (en) Signal feature screening and dimension reduction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200522