CN109933579A

CN109933579A - Local K-nearest-neighbor missing value imputation system and method

Info

Publication number: CN109933579A
Application number: CN201910104623.6A
Authority: CN
Inventors: 周毅; 杨日东
Original assignee: National Sun Yat Sen University
Current assignee: National Sun Yat Sen University
Priority date: 2019-02-01
Filing date: 2019-02-01
Publication date: 2019-06-25
Anticipated expiration: 2039-02-01
Also published as: CN109933579B

Abstract

The invention provides a local K-nearest-neighbor (KNN) missing value imputation system. The judgment module identifies each missing attribute j of an incomplete sample T_i. The projection module projects the data set T and obtains from it a data set T' that satisfies two conditions: the attributes that are not missing in T_i are also not missing in T', and the attribute j currently being imputed for T_i is also not missing in T'. The data computation module then finds the K nearest neighbors T_iK of T_i in T'. The logic processing module analyzes the missing attribute j of the incomplete sample T_i: if j is a categorical attribute, the mode of the values of attribute j in T_iK is imputed into the missing attribute j of T_i; otherwise, the mean of the values of attribute j in T_iK is imputed into the missing attribute j of T_i. When the missing rate is low, the local KNN imputation system provided by the invention performs slightly better than conventional KNN imputation; when the missing rate is high, it performs far better.

Description

Local K-nearest-neighbor missing value imputation system and method

Technical field

The invention belongs to the technical field of data analysis preprocessing, and in particular relates to a local K-nearest-neighbor missing value imputation system and method.

Background art

With the development of the information age, many fields have accumulated massive amounts of data, and how to use these data effectively has become a major research topic. In practice, however, data are often missing, noisy, duplicated, or inconsistent, which greatly affects the stability of data mining methods. Handling data sets with missing values is therefore particularly important. Data may be missing because they could not be obtained or were omitted during operation, and without missing value handling some machine learning methods cannot be applied directly. Missing value imputation is therefore a practical and challenging problem in data mining and machine learning.

K-nearest-neighbor imputation (K Nearest Neighbor Imputation, KNNI) is an imputation method based on local data similarity proposed by Olga Troyanskaya. Its basic idea is that, for a sample containing missing values, the missing data can be inferred from the K samples most similar to it. Specifically, KNN imputation divides the data set into two sets: one contains all complete samples (samples without missing values), and the other contains all incomplete samples (samples with missing values). For each incomplete sample, its K nearest neighbors are found in the complete sample set. If the missing value belongs to a categorical attribute, the mode of that attribute over the K nearest neighbors is imputed; if it belongs to a numerical attribute, the mean of that attribute over the K nearest neighbors is imputed. Because the missing values of an incomplete sample are derived from "neighboring" samples, KNN imputation does not introduce much new sample information.
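As a rough illustration of the conventional procedure described above, the following Python sketch imputes one incomplete sample from the complete samples only. The function name `knn_impute_complete_only`, the use of `None` for a missing value, the `categorical` index set, and the mixed distance (squared difference for numerical attributes, 0/1 mismatch for categorical ones) are assumptions made for this example; they are not taken from the cited method.

```python
import math
from statistics import mean, mode


def knn_impute_complete_only(data, i, k, categorical=frozenset()):
    """Impute row i of `data` using only fully observed rows (conventional KNN)."""
    target = data[i]
    # Reference set: samples with no missing value at all.
    complete = [row for row in data if all(v is not None for v in row)]
    observed = [a for a, v in enumerate(target) if v is not None]

    def dist(row):
        # Assumed mixed distance over the attributes observed in the target.
        total = 0.0
        for a in observed:
            if a in categorical:
                total += 0.0 if row[a] == target[a] else 1.0
            else:
                total += (row[a] - target[a]) ** 2
        return math.sqrt(total)

    neighbors = sorted(complete, key=dist)[:k]
    filled = list(target)
    if not neighbors:
        return filled  # no complete sample available; leave values missing
    for a, v in enumerate(target):
        if v is None:
            values = [row[a] for row in neighbors]
            # Mode for categorical attributes, mean for numerical ones.
            filled[a] = mode(values) if a in categorical else mean(values)
    return filled
```

With few complete rows, `complete` becomes very small, which is the weakness the following paragraphs discuss.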

Although KNN imputation is an excellent imputation method, its effectiveness depends heavily on the missing rate. When the missing rate of the data set is high, there are very few complete samples, which means that the K nearest neighbors computed among the complete samples may not be true neighbors of the incomplete sample. The neighbors referred to during imputation are then still some distance from the sample itself, and the imputed values have large errors.

KNN imputation therefore needs to be improved so that it retains good imputation performance when the missing rate is high.

Summary of the invention

To solve the problem that the K nearest neighbors referred to when imputing an incomplete sample are in fact still some distance from the sample itself, the invention proposes a local K-nearest-neighbor missing value imputation system.

A local K-nearest-neighbor missing value imputation system comprises an input module, a normalization module, a judgment module, a projection module, a data computation module, and an output module;

The input module is used to input the data set T containing incomplete samples and the KNN parameter K;

The normalization module is used to normalize the data set T;

The judgment module is used to determine the missing attributes of an incomplete sample T_i in the data set T, where j denotes the missing attribute currently being imputed;

The projection module is used to project the data set T: it traverses T and finds the sample set T' in T that meets the following requirements: the attributes that are not missing in sample T_i are also not missing in T', and the currently missing attribute j of sample T_i is also not missing in T';

The data computation module is used to calculate the distance between the incomplete sample T_i and each sample T'_s in the sample set T', and to obtain the K nearest samples T_iK of T_i from the calculated distances;

The logic processing module is used to analyze the missing attribute j of the incomplete sample T_i: if j is a categorical attribute, the mode of the values of attribute j in the K nearest samples T_iK is imputed into the missing attribute j of T_i; if j is a numerical attribute, the mean of the values of attribute j in T_iK is imputed into the missing attribute j of T_i;

The output module is used to output the data set T after imputation is complete.
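The projection requirement is the core of the system, so a minimal Python sketch of it may help. The function name `project`, the list-of-lists layout, and `None` as the missing-value marker are assumptions made for illustration; excluding T_i itself from T' is also an assumption, although it follows from attribute j being missing in T_i.

```python
def project(data, i, j):
    """Return the rows of `data` usable as references for sample i and attribute j.

    A row qualifies when it is observed on every attribute observed in
    sample i and also on attribute j, the attribute being imputed.
    """
    target = data[i]
    required = [a for a, v in enumerate(target) if v is not None] + [j]
    return [
        row
        for s, row in enumerate(data)
        if s != i and all(row[a] is not None for a in required)
    ]
```

Note that a qualifying row may itself be incomplete on attributes outside `required`, which is what distinguishes T' from the set of complete samples.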

Preferably, the normalization operation of the normalization module is as follows:

where T_ij denotes the raw value in row i, column j; T'_ij denotes the value in row i, column j after normalization; Min(T_·j) denotes the minimum of column j; and Max(T_·j) denotes the maximum of column j.
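The formula itself is not reproduced in this text. A min-max form consistent with the symbol definitions above, stated here as an assumption rather than as the exact expression of the invention, is:

```latex
T'_{ij} = \frac{T_{ij} - \operatorname{Min}(T_{\cdot j})}{\operatorname{Max}(T_{\cdot j}) - \operatorname{Min}(T_{\cdot j})}
```

Under this form every numerical attribute is mapped to the interval [0, 1], so that no single attribute dominates the distance calculation because of its scale.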

Preferably, the data computation module calculates the distance between the incomplete sample T_i and each sample T'_s in T' as follows:

where N is the number of numerical attributes not missing in the incomplete sample T_i, M is the number of categorical attributes not missing in T_i, and I(T_im, T'_sm) is an indicator function that equals 0 when T_im and T'_sm are equal and 1 when they are not.
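The distance formula is likewise absent from this text. One mixed-attribute form that uses exactly the quantities N, M, and I defined above is sketched below; the square root and the unweighted sum of the two parts are assumptions, and the original expression may differ.

```latex
d(T_i, T'_s) = \sqrt{\sum_{n=1}^{N} \left(T_{in} - T'_{sn}\right)^2 + \sum_{m=1}^{M} I\left(T_{im}, T'_{sm}\right)}
```

Here n runs over the numerical attributes observed in T_i and m over the categorical attributes observed in T_i, so attributes missing in T_i never enter the distance.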

The invention also provides a local K-nearest-neighbor missing value imputation method, comprising the following steps:

S1. The input module inputs the data set T containing missing data and the KNN parameter K;

S2. The normalization module normalizes the numerical attributes of the data set T;

S3. The judgment module determines the missing attributes of an incomplete sample T_i in the data set T, where j denotes the missing attribute currently being imputed;

S4. The projection module performs the projection: it traverses T and finds the sample set T' in T that meets the following requirements: the attributes that are not missing in sample T_i are also not missing in T', and the currently missing attribute j of sample T_i is also not missing in T';

S5. The data computation module calculates the distance between the incomplete sample T_i and each sample in the sample set T', and obtains the K nearest neighbors T_iK from the resulting distance values;

S6. The logic processing module analyzes the missing attribute j of the incomplete sample T_i: if j is a categorical attribute, the mode of the values of attribute j in the K nearest samples T_iK is imputed into the missing attribute j of T_i; if j is a numerical attribute, the mean of the values of attribute j in T_iK is imputed into the missing attribute j of T_i;

S7. The judgment module determines whether the imputed sample T_i is still an incomplete sample; if so, return to S3; otherwise proceed to S8;

S8. The judgment module determines whether the data set T still contains incomplete samples; if so, take another incomplete sample and return to S3; otherwise proceed to S9;

S9. The output module outputs the data set T after imputation is complete.
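Steps S1 to S9 can be sketched end to end in Python as follows. This is an illustrative reading, not the reference implementation: the name `local_knn_impute`, the `None` marker, the min-max normalization, and the mixed distance are assumptions. Two further assumptions are labeled in the comments: imputed values are written to a separate copy so that projection and distances rely only on originally observed values, and a value stays missing when the projected set is empty.

```python
import math
from statistics import mean, mode


def local_knn_impute(data, k, categorical=frozenset()):
    """Sketch of S1-S9; returns a new, normalized and imputed data set."""
    n_attrs = len(data[0])
    # S2: min-max normalize numerical attributes over their observed values.
    norm = [list(row) for row in data]
    for a in range(n_attrs):
        if a in categorical:
            continue
        seen = [row[a] for row in data if row[a] is not None]
        if not seen:
            continue
        lo, hi = min(seen), max(seen)
        for row in norm:
            if row[a] is not None:
                row[a] = (row[a] - lo) / (hi - lo) if hi > lo else 0.0

    out = [list(row) for row in norm]  # assumption: imputed values go to a copy
    for i, target in enumerate(norm):  # S8: visit every sample
        observed = [a for a, v in enumerate(target) if v is not None]
        for j in range(n_attrs):  # S3/S7: each missing attribute j of T_i
            if target[j] is not None:
                continue
            # S4: projection onto T_i's observed attributes plus attribute j.
            required = observed + [j]
            ref = [row for s, row in enumerate(norm)
                   if s != i and all(row[a] is not None for a in required)]
            if not ref:
                continue  # assumption: leave the value missing

            def dist(row):
                # S5: assumed mixed distance over attributes observed in T_i.
                total = 0.0
                for a in observed:
                    if a in categorical:
                        total += 0.0 if row[a] == target[a] else 1.0
                    else:
                        total += (row[a] - target[a]) ** 2
                return math.sqrt(total)

            neighbors = sorted(ref, key=dist)[:k]
            # S6: mode for a categorical attribute, mean for a numerical one.
            values = [row[j] for row in neighbors]
            out[i][j] = mode(values) if j in categorical else mean(values)
    return out  # S9
```

Writing results to a copy makes the outcome independent of the order in which samples and attributes are processed; an in-place variant that reuses imputed values is equally compatible with the step list.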

Preferably, the normalization operation in S2 is as follows:
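The formula for S2 is not reproduced in this text. Assuming the min-max form suggested by the Min and Max definitions given for the normalization module, the step could be sketched in Python as below. The function name `min_max_normalize`, the `numerical` index list, and the handling of constant columns are assumptions for this example.

```python
def min_max_normalize(data, numerical):
    """Scale each listed numerical column to [0, 1], skipping missing values."""
    out = [list(row) for row in data]
    for j in numerical:
        seen = [row[j] for row in data if row[j] is not None]
        if not seen:
            continue  # column entirely missing; nothing to scale
        lo, hi = min(seen), max(seen)
        span = hi - lo
        for row in out:
            if row[j] is not None:
                # A constant column is mapped to 0.0 to avoid dividing by zero.
                row[j] = (row[j] - lo) / span if span else 0.0
    return out
```

Missing entries stay `None`, and the minimum and maximum are taken over observed values only.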

Preferably, the sample set T' in S4 meets the following requirements: the attributes that are not missing in sample T_i are also not missing in T', and the currently missing attribute j of sample T_i is also not missing in T'.

Preferably, S5 specifically comprises the following steps:

where N is the number of numerical attributes not missing in sample T_i, M is the number of categorical attributes not missing in T_i, and I(T_im, T'_sm) is an indicator function that equals 0 when T_im and T'_sm are equal and 1 when they are not.
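Because the S5 formula is missing here, the following Python sketch shows only one plausible distance built from N, M, and the indicator function I. The name `mixed_distance`, the square root, and the equal weighting of the numerical and categorical parts are assumptions.

```python
import math


def mixed_distance(target, other, categorical=frozenset()):
    """Distance between T_i (`target`) and a projected sample T'_s (`other`)."""
    total = 0.0
    for a, v in enumerate(target):
        if v is None:
            continue  # only attributes observed in T_i contribute
        if a in categorical:
            total += 0.0 if other[a] == v else 1.0  # indicator I
        else:
            total += (other[a] - v) ** 2  # squared numerical difference
    return math.sqrt(total)
```

The function expects `other` to come from the projected set T', so every attribute observed in `target` is also observed in `other`.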

The main difference between the local KNN imputation of the invention and conventional KNN imputation is as follows. Conventional KNN imputation fills an incomplete sample using its K nearest neighbors among the complete samples. Local KNN imputation instead performs a projection over the entire data set T according to the missing attribute currently being processed and the attributes not missing in the sample, finds the K nearest neighbors of the current sample in the projected data set, and then performs the corresponding imputation. Compared with conventional KNN imputation, local KNN imputation lets an incomplete sample find its K nearest neighbors in a larger sample set, so the nearest-neighbor search has more samples to draw on and the neighbors found are closer to the incomplete sample being processed.
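The size argument can be checked on a toy data set. The Python sketch below counts, for one incomplete sample and one missing attribute, how many reference samples each approach can use. The function name `reference_set_sizes` and the `None` marker are assumptions for illustration.

```python
def reference_set_sizes(data, i, j):
    """Return (number of complete samples, size of the projected set T')."""
    target = data[i]
    required = [a for a, v in enumerate(target) if v is not None] + [j]
    complete = sum(
        1 for s, row in enumerate(data)
        if s != i and all(v is not None for v in row)
    )
    projected = sum(
        1 for s, row in enumerate(data)
        if s != i and all(row[a] is not None for a in required)
    )
    return complete, projected
```

Every complete sample also satisfies the projection requirement, so the projected count is never smaller than the complete count.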

In particular, when the missing rate of the data set is high, conventional KNN imputation has very few complete samples to draw on, so the K nearest neighbors computed among the complete samples may not be true neighbors of the incomplete sample. The neighbors referred to during imputation are then still some distance from the sample itself, and the imputed values have large errors. Local KNN imputation does not depend on complete samples, and the reference data set obtained by projection does not shrink significantly as the missing rate increases. Local KNN imputation therefore performs better than conventional KNN imputation when the missing rate is high.

Compared with the prior art, the beneficial effects of the technical solution of the invention are:

When the missing rate is low, local KNN imputation performs slightly better than conventional KNN imputation; when the missing rate is high, it performs far better.

Brief description of the drawings

Fig. 1 is an overall flowchart of the local K-nearest-neighbor missing value imputation method provided by the invention.

Fig. 2 compares the imputation performance of local KNN imputation, KNN imputation, and multiple imputation on the Breast Cancer Coimbra data set under different missing rates.

Fig. 3 compares the imputation performance of local KNN imputation, KNN imputation, and multiple imputation on the Parkinsons data set under different missing rates.

In the figures, LKNNI denotes local KNN imputation, KNNI denotes KNN imputation, and MI denotes multiple imputation.

Specific embodiment

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention; they are for illustration only and shall not be construed as limiting this patent. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.

The technical solution of the invention is further described below with reference to the drawings and embodiments.

Embodiment 1

This embodiment provides a local K-nearest-neighbor missing value imputation system, comprising an input module, a normalization module, a judgment module, a projection module, a data computation module, and an output module;

The normalization module is used to normalize the data set T;

The projection module is used to project the data set T: it traverses T and finds the sample set T' in T that meets the following requirements: the attributes that are not missing in sample T_i are also not missing in T', and the currently missing attribute j of sample T_i is also not missing in T';

The output module is used to output the data set T after imputation is complete.

In this embodiment, the normalization operation of the normalization module is as follows:

In this embodiment, the data computation module calculates the distance between the incomplete sample T_i and each sample T'_s in T' as follows:

Embodiment 2

This embodiment provides a local K-nearest-neighbor missing value imputation method which, as shown in Fig. 1, comprises the following steps:

S4. The projection module performs the projection: it traverses T and finds the sample set T' in T that meets the corresponding requirements;

S9. The output module outputs the data set T after imputation is complete.

In this embodiment, the normalization operation in S2 is as follows:

In this embodiment, the sample set T' in S4 meets the following requirements: the attributes that are not missing in sample T_i are also not missing in T', and the currently missing attribute j of sample T_i is also not missing in T'.

In this embodiment, S5 specifically comprises the following steps:

Embodiment 3

This embodiment is consistent with Embodiment 1. To evaluate the performance of the local K-nearest-neighbor missing value imputation method accurately on different data sets, all numerical attributes are normalized when the evaluation metrics are calculated. When measuring method performance, the imputation quality of numerical attributes is expressed as mean squared error, and that of categorical attributes as accuracy. The calculation is as follows:

where L is the total number of missing numerical attribute values, T_l is the original value of the l-th numerical datum to be imputed, and T'_l is the l-th imputed numerical value.
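The metric formulas are not reproduced in this text. Assuming the usual definitions, mean squared error as the average of (T_l - T'_l)^2 over the L missing numerical values and accuracy as the fraction of categorical values imputed correctly, a Python sketch would be as follows. The function names are hypothetical.

```python
def mean_squared_error(original, filled):
    """Average squared difference between original and imputed numerical values."""
    return sum((t - f) ** 2 for t, f in zip(original, filled)) / len(original)


def accuracy(original, filled):
    """Fraction of imputed categorical values equal to the original values."""
    return sum(1 for t, f in zip(original, filled) if t == f) / len(original)
```

Both functions take only the cells that were missing and then imputed, paired in the same order, so observed cells do not dilute the scores.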

Embodiment 4

In this embodiment, to measure the similarity between imputed values and actual values, the Breast Cancer Coimbra and Parkinsons data sets from UCI are used, both of which are complete data sets. Multivariate missingness at random is simulated by random deletion, and the imputed values are compared with the original attribute values after imputation.
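The random deletion step can be sketched in Python as follows. The text does not specify the deletion mechanism in detail, so the uniform cell-wise sampling, the function name `delete_at_random`, and the `seed` parameter are assumptions made for this example.

```python
import random


def delete_at_random(data, missing_rate, seed=None):
    """Blank out a fraction of cells, chosen uniformly at random.

    Returns the masked copy and the list of (row, column) positions removed,
    so that imputed values can later be compared with the originals.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r in range(len(data)) for c in range(len(data[0]))]
    n_missing = round(missing_rate * len(cells))
    masked = [list(row) for row in data]
    removed = rng.sample(cells, n_missing)
    for r, c in removed:
        masked[r][c] = None
    return masked, removed
```

Keeping the removed positions makes it straightforward to collect the original and imputed values for the evaluation metrics, and a fixed `seed` makes a repeated run reproducible.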

This embodiment compares the imputation performance of local KNN imputation with KNN imputation and multiple imputation on the Breast Cancer Coimbra and Parkinsons data sets under different missing rates. To ensure the credibility of the experimental results and reduce the error introduced by simulated missingness, 30 experiments are carried out for each combination of imputation method and missing rate, and the average of the evaluation metrics is taken as the experimental result, as shown in Figs. 2-3.

Obviously, the above embodiments of the invention are merely examples given to illustrate the invention clearly, and do not limit its embodiments. A person of ordinary skill in the art may make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the invention.

Claims

1. A local K-nearest-neighbor missing value imputation system, characterized by comprising an input module, a normalization module, a judgment module, a projection module, a data computation module, and an output module;

The normalization module is used to normalize the data set T;

The judgment module is used to determine the missing attributes of an incomplete sample T_i in the data set T, where j denotes the missing attribute currently being imputed;

The projection module is used to project the data set T: it traverses T and finds the sample set T' in T that meets the following requirements: the attributes that are not missing in sample T_i are also not missing in T', and the currently missing attribute j of sample T_i is also not missing in T';

The data computation module is used to calculate the distance between the incomplete sample T_i and each sample T'_s in the sample set T', and to obtain the K nearest samples T_iK of T_i from the calculated distances;

The logic processing module is used to analyze the missing attribute j of the incomplete sample T_i: if j is a categorical attribute, the mode of the values of attribute j in the K nearest samples T_iK is imputed into the missing attribute j of T_i; if j is a numerical attribute, the mean of the values of attribute j in T_iK is imputed into the missing attribute j of T_i;

The output module is used to output the data set T after imputation is complete.

2. The local K-nearest-neighbor missing value imputation system according to claim 1, characterized in that the normalization operation of the normalization module is as follows:

where T_ij denotes the raw value in row i, column j; T'_ij denotes the value in row i, column j after normalization; Min(T_·j) denotes the minimum of column j; and Max(T_·j) denotes the maximum of column j.

3. The local K-nearest-neighbor missing value imputation system according to claim 1, characterized in that the data computation module calculates the distance between the incomplete sample T_i and each sample T'_s in T' as follows:

where N is the number of numerical attributes not missing in the incomplete sample T_i, M is the number of categorical attributes not missing in T_i, and I(T_im, T'_sm) is an indicator function that equals 0 when T_im and T'_sm are equal and 1 when they are not.

4. A local K-nearest-neighbor missing value imputation method, characterized by comprising the following steps:

S5. The data computation module calculates the distance between the incomplete sample T_i and each sample in the sample set T', and obtains the K nearest neighbors T_iK from the resulting distance values;

S6. The logic processing module analyzes the missing attribute j of the incomplete sample T_i: if j is a categorical attribute, the mode of the values of attribute j in the K nearest samples T_iK is imputed into the missing attribute j of T_i; if j is a numerical attribute, the mean of the values of attribute j in T_iK is imputed into the missing attribute j of T_i;

S7. The judgment module determines whether the imputed sample T_i is still an incomplete sample; if so, return to S3; otherwise proceed to S8;

S8. The judgment module determines whether the data set T still contains incomplete samples; if so, take another incomplete sample and return to S3; otherwise proceed to S9;

S9. The output module outputs the data set T after imputation is complete.

5. The local K-nearest-neighbor missing value imputation method according to claim 4, characterized in that the normalization operation in S2 is as follows:

6. The local K-nearest-neighbor missing value imputation method according to claim 4, characterized in that the sample set T' in S4 meets the following requirements: the attributes that are not missing in sample T_i are also not missing in T', and the currently missing attribute j of sample T_i is also not missing in T'.

7. The local K-nearest-neighbor missing value imputation method according to claim 4, characterized in that S5 specifically comprises the following steps:

where N is the number of numerical attributes not missing in sample T_i, M is the number of categorical attributes not missing in T_i, and I(T_im, T'_sm) is an indicator function that equals 0 when T_im and T'_sm are equal and 1 when they are not.