CN103544218A - Nearest neighbor filling method of non-fixed k values - Google Patents
Nearest neighbor filling method of non-fixed k values
- Publication number
- CN103544218A CN103544218A CN201310452387.XA CN201310452387A CN103544218A CN 103544218 A CN103544218 A CN 103544218A CN 201310452387 A CN201310452387 A CN 201310452387A CN 103544218 A CN103544218 A CN 103544218A
- Authority
- CN
- China
- Prior art keywords
- value
- attribute
- disappearance
- log
- type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Complex Calculations (AREA)
Abstract
The invention provides a nearest neighbor filling method with non-fixed k values, aimed mainly at overcoming the defects of existing nearest neighbor filling methods. The method comprises the following steps: first, reasonable definitions are given for the distance calculation formulas of the various attribute types; then, an appropriate k value is selected for each missing instance by means of sparse coding, and at the same time the attributes that best fit the missing instance are selected; finally, according to the obtained k value, the k complete instances closest to the missing instance are selected to fill the missing value. The method solves the per-instance problem of missing data filling, enhances the reasonableness of missing value filling, and improves filling quality without increasing filling complexity. The method is easy to implement and involves only simple mathematical models when coded.
Description
Technical field
The present invention relates to the fields of computer science and information technology, and in particular to a method for filling missing data with a nearest neighbor method that uses a non-fixed k value.
Background technology
The principle of the nearest neighbor algorithm (kNN) can be described as follows: the two instances with the smallest distance between them are the most closely related. Therefore, if an instance has a missing value (whether in a conditional attribute or in the decision attribute), its distance to every complete instance in the data set can be calculated and the nearest instances found; the missing value is then replaced by the value of the nearest instance on that attribute (for categorical attributes) or by the mean value (for continuous attributes).
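For illustration only, a minimal sketch of this principle in Python, assuming purely numeric attributes, NaN-coded missing values, and the plain Euclidean distance (which the present invention later replaces with mixed-type distances); the function name and the default k are illustrative:

```python
import numpy as np

def knn_impute(data, k=5):
    """Fill NaN entries of each row using the k complete rows that are
    closest in Euclidean distance over the attributes present in that row."""
    data = data.copy()
    complete = data[~np.isnan(data).any(axis=1)]              # rows with no missing values
    for i, row in enumerate(data):
        missing = np.isnan(row)
        if not missing.any():
            continue
        observed = ~missing
        dists = np.sqrt(((complete[:, observed] - row[observed]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(dists)[:k]]              # k nearest complete rows
        data[i, missing] = nearest[:, missing].mean(axis=0)    # mean value of the neighbours
    return data
```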
Because the nearest neighbor method is a lazy learning method based on instances (Lazy Learning), it does not actually construct a classifier from the training samples; it first stores all training samples and performs the computation only when a classification is needed. Of course, if the user cannot specify the k value, k must first be learned from the training samples. Compared with active learning (Active Learning) methods such as decision tree induction and neural networks, which construct a classification model before classifying, the nearest neighbor method, being a lazy learning method, sees its computation grow sharply when the number of training samples increases sharply. With the support of effective indexing methods, this problem can be resolved. Nearest neighbor algorithms are therefore widely used, for example for filling missing data and for classification. Being easy to understand, simple to operate, and effective, they are widely applied in scientific research and in real life. For example, when classifying various instances, the classification accuracy of the nearest neighbor algorithm is high for both two-class and multi-class problems. For filling missing data, the nearest neighbor method is the most popular cold-deck imputation method; it was first proposed in 1967 and is currently embedded in common software such as SAS.
However, the nearest neighbor filling algorithm has some obvious shortcomings: 1. the Euclidean distance calculation; 2. the choice of the value of k; 3. the use of an identical k value for different instances.
Most nearest neighbor filling algorithms use the Euclidean distance formula to calculate the distance between two instances. However, many studies have shown that the Euclidean distance formula cannot handle discrete, continuous, or mixed-type attributes well. Moreover, in practical applications attributes of different types coexist, for example continuous attributes, binary attributes, unordered discrete attributes, and ordered discrete attributes (in the present invention, non-continuous attributes are collectively called categorical attributes).
The value of the parameter k in the kNN filling method is a problem that deserves close attention. In experiments, if k is chosen too large, excessive randomness is easily introduced; if k is chosen too small, the number of samples is insufficient and does not reach the standard of a large sample in the statistical sense (from a non-rigorous point of view, a large sample is expected to contain at least 30 instances). Moreover, the best k value differs from data set to data set, and the optimal k is usually obtained by experiment, which inevitably increases the complexity of the experiments. This is a widely recognized difficult problem, so the choice of k has attracted much attention from experts; one suggestion is k=5 (when n>100, where n is the number of missing values in the data set). A careful reader will notice that all missing instances in the whole data set are filled with the same predetermined k. This is clearly unreasonable, because some instances may be filled very well when k=5, while the 5th neighbor of another instance may already be an outlier. Therefore, using the same k value for an entire data set is unreasonable, and such a k is very difficult to choose.
Summary of the invention
The object of the present invention is to provide a simple and effective missing value filling method. The method solves both the problem of unreasonable distance calculation and the problem that the nearest neighbor k value is the same for all missing instances. The present invention first defines a simple and effective distance calculation method, then uses sparse coding to select a suitable k value for each missing instance, and finally, according to the obtained k value, selects the k complete instances nearest to the missing instance to fill the missing value.
The technical scheme of the present invention comprises the following steps:
(1) attributes are divided into five classes: continuous, symmetric binary, asymmetric binary, unordered discrete, and ordered discrete, and distance calculation formulas are defined for instances of each attribute class;
(2) for each missing instance, the nearest k training instances are selected, and at the same time the attributes that best fit the missing instance are chosen;
(3) the distances between the missing instance and all training instances are calculated, the nearest k complete instances are chosen, and these k complete instances are then used to fill the missing value of the missing instance.
Wherein, the distance calculation formulas for instances of the different attribute classes are as follows:
Mixed type: $d(i,j) = \frac{\sum_{f=1}^{n} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{n} \delta_{ij}^{(f)}}$, where $\delta_{ij}^{(f)}$ indicates whether instance i or j has a missing value on attribute f (0 if there is a missing value, 1 otherwise), f denotes the f-th attribute among the five attribute classes, n is the number of attributes, and $d_{ij}^{(f)}$ is the distance between instances i and j on attribute f;
Continuous type: the distance between two instances over their continuous attributes, where n is the number of continuous attributes in instances i and j, $A_{i,k}$ is the value of the k-th attribute of instance i, and $\bar{A}_i$ is the mean value of the n continuous attributes of instance i;
Symmetric binary type: $d(i,j) = \frac{r+s}{q+r+s+t}$; asymmetric binary type: $d(i,j) = \frac{r+s}{q+r+s}$, where q represents the number of attributes on which both instance i and instance j take the value "1", r represents the number on which instance i takes "0" and instance j takes "1", s represents the number on which instance i takes "1" and instance j takes "0", and t represents the number on which both instance i and instance j take the value "0";
Unordered discrete type: $d(i,j) = \frac{p-m}{p}$, where p is the number of unordered discrete attributes in the data set and m is the number of attributes on which the two instances take the same value;
Ordered discrete type: the distance between two attribute values A and B is defined in terms of the probabilities P(common(A,B)) and P(description(A,B)), as described in the embodiment below.
Step (1) of the present invention solves the problem of calculating distances between attributes of different classes. This resolves the unreasonableness of earlier algorithms, which calculated all attribute classes uniformly with the Euclidean distance.
Step (2) solves the problem that the nearest neighbor filling algorithm uses a single k value while different missing instances should have different k values. The closest neighbors of each missing instance are different. This method uses sparse coding to determine the number of neighbors of the current missing instance, an idea that better matches practical applications. Moreover, in the process of selecting neighbors, this step uses the attribute reduction method of the present invention to remove the interference of certain attributes.
Step (3) is the ordinary nearest neighbor filling algorithm, except that a different k-nearest-neighbor filling is applied to each missing instance.
This method solves the per-instance problem of missing data filling; it also enhances the reasonableness of missing value filling and improves filling quality without increasing filling complexity. The present invention is easy to implement and involves only simple mathematical models when the code is written.
Embodiment
First, the calculation of the various mixed distances. The attributes commonly encountered in research are divided into five classes: continuous, symmetric binary, asymmetric binary, unordered discrete, and ordered discrete. The distance defined by the present invention is:
$d(i,j) = \frac{\sum_{f=1}^{n} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{n} \delta_{ij}^{(f)}}$
where $\delta_{ij}^{(f)}$ indicates whether instance i or j has a missing value on attribute f (0 if there is a missing value, 1 otherwise), f denotes the f-th attribute among the five attribute classes, n is the number of attributes, and $d_{ij}^{(f)}$ is the distance between instances i and j on attribute f.
A. Distance calculation for continuous values
The distance formula for two instances over their continuous values uses the following quantities: n, the number of continuous attributes in instances i and j; $A_{i,k}$, the value of the k-th attribute of instance i; and $\bar{A}_i$, the mean value of the n continuous attributes of instance i.
B. Distance calculation for symmetric and asymmetric binary attributes
If the two values of an attribute are distributed uniformly, the attribute is called a symmetric binary attribute; otherwise it is an asymmetric binary attribute. For example, in the attribute "gender", "male" and "female" are the two attribute values; because males and females are distributed uniformly, gender is a symmetric binary attribute, and when calculating the distance between two instances on this attribute the two values can be given the same weight. By contrast, the attribute "has AIDS" has the two values "yes" and "no"; it is known from reality that the probability of "yes" is far smaller than the probability of "no", so when calculating the distance the weight of the value "no" should be greater than the weight of the value "yes". The distance of binary attributes is calculated here from a contingency table; in the table below, q represents the number of attributes on which both instance i and instance j take the value "1", and the other counts are defined analogously (see the following table).

 | instance j = 1 | instance j = 0 |
---|---|---|
instance i = 1 | q | s |
instance i = 0 | r | t |
The present invention defines the symmetric binary distance formula as $d(i,j) = \frac{r+s}{q+r+s+t}$ and the asymmetric binary distance formula as $d(i,j) = \frac{r+s}{q+r+s}$, where q represents the number of attributes on which both instance i and instance j take the value "1", r represents the number on which instance i takes "0" and instance j takes "1", s represents the number on which instance i takes "1" and instance j takes "0", and t represents the number on which both instance i and instance j take the value "0". Comparing the two formulas, the difference lies in the denominator; the reason is that the probabilities of the two values of an asymmetric binary attribute differ, so their weights should also differ.
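As an illustration of these two formulas, a small Python sketch (the function and variable names are illustrative; the inputs are 0/1 vectors of equal length):

```python
import numpy as np

def binary_distances(x_i, x_j):
    """Return (symmetric, asymmetric) binary distances between two 0/1 vectors,
    using the contingency counts q, r, s, t defined above."""
    x_i, x_j = np.asarray(x_i), np.asarray(x_j)
    q = np.sum((x_i == 1) & (x_j == 1))
    r = np.sum((x_i == 0) & (x_j == 1))
    s = np.sum((x_i == 1) & (x_j == 0))
    t = np.sum((x_i == 0) & (x_j == 0))
    symmetric = (r + s) / (q + r + s + t)
    asymmetric = (r + s) / (q + r + s) if (q + r + s) > 0 else 0.0
    return symmetric, asymmetric
```

For example, binary_distances([1, 0, 1, 0], [1, 1, 0, 0]) gives a symmetric distance of 0.5 and an asymmetric distance of about 0.67.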
C. Unordered discrete attributes
The values of some attributes (for example, color) can be "red", "blue", and so on; there is no order relation among these values, and such attributes are called unordered discrete attributes. If the data set contains p unordered discrete attributes in total and the two instances take the same value on m of them, the distance between them can be defined as $d(i,j) = \frac{p-m}{p}$.
D. Ordered attributes
The values of some categorical attributes are ordered, for example the attribute "rank": "1" and "2" are ordered, yet such an attribute differs from a continuous attribute because there are no data between 1 and 2. Therefore, calculating the distance of such attributes must combine the characteristics of unordered categorical attributes and continuous attributes. For example, the attribute "quality" has five ordered categorical values: excellent, good, average, bad, and awful. Obviously "excellent" is better than "good", but we cannot determine by how much. The present invention defines the distance between two attribute values A and B in terms of P(common(A,B)) and P(description(A,B)), where "common" and "description" are interpreted according to the specific field; in the present invention, for example, common(A,B) represents the logical union of A and B, P(common(A,B)) represents the probability of common(A,B), and P(description(A,B)) represents the difference between the probability of A and the probability of B. From this formula it can be computed that the similarity between the two ordered discrete attribute values "excellent" and "good" is greater than the similarity between "excellent" and "average".
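To illustrate how the per-class distances are combined, a sketch of the mixed-type combination under the assumption that the per-attribute distances $d_{ij}^{(f)}$ have already been computed with the formulas above; the function and argument names are illustrative:

```python
import numpy as np

def mixed_distance(per_attr_dist, i_missing, j_missing):
    """Combine per-attribute distances d_ij^(f) into a single mixed-type distance,
    ignoring attributes that are missing in either instance (delta = 0)."""
    per_attr_dist = np.asarray(per_attr_dist, dtype=float)
    delta = (~(np.asarray(i_missing) | np.asarray(j_missing))).astype(float)
    if delta.sum() == 0:
        return np.nan                                    # no attribute is observed in both instances
    contrib = np.where(delta > 0, per_attr_dist, 0.0)    # drop distances on missing attributes
    return (delta * contrib).sum() / delta.sum()
```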
Second, for each missing instance the nearest k training instances are selected, and at the same time the attributes that best fit the missing instance are chosen. The present invention adopts the principle of sparse coding and uses the set A of all complete instances to reconstruct the current missing instance by regression. During the regression analysis, redundant or noisy attributes in A are automatically deleted, i.e., attribute reduction is performed. The regression result is sparse, so many regression parameters are 0. This means that a complete instance whose regression parameter is 0 need not be used to fill the current missing instance. The number of nonzero regression parameters is exactly the number k used by the nearest neighbor algorithm in the next step, and the k obtained for each missing instance is different.
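As an illustration of this step, a sketch under the assumption that the sparse reconstruction is solved as an L1-regularized (lasso) regression of the missing instance's observed attributes on the complete instances; the patent does not prescribe a particular solver, and the function name and the alpha value are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

def choose_k_by_sparse_coding(missing_row, complete_rows, alpha=0.1):
    """Reconstruct the observed part of a missing instance as a sparse linear
    combination of the complete instances; the number of nonzero coefficients
    is taken as the k value for this instance."""
    observed = ~np.isnan(missing_row)
    X = complete_rows[:, observed].T           # columns = complete instances, rows = observed attributes
    y = missing_row[observed]
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    nonzero = np.flatnonzero(np.abs(model.coef_) > 1e-8)
    return max(len(nonzero), 1), nonzero        # at least one neighbour; nonzero indexes candidate donors
```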
Third, the distances between the missing instance and all training instances are calculated, and the nearest k complete instances are chosen. These k complete instances are then used to fill the missing value of the missing instance.
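A sketch of this final step, assuming attributes are numerically coded with NaN for missing values, a row-to-row mixed-type distance function, and the per-instance k obtained in the previous step; categorical attributes are filled with the most frequent neighbour value and continuous ones with the mean (all names are illustrative):

```python
import numpy as np
from collections import Counter

def fill_with_k_neighbours(missing_row, complete_rows, k, distance, categorical):
    """Fill the missing entries of one instance from its k nearest complete instances.
    distance(a, b) returns the mixed-type distance between two rows;
    categorical[f] is True when attribute f is categorical."""
    dists = np.array([distance(missing_row, row) for row in complete_rows])
    neighbours = complete_rows[np.argsort(dists)[:k]]
    filled = missing_row.copy()
    for f in np.flatnonzero(np.isnan(missing_row)):
        values = neighbours[:, f]
        if categorical[f]:
            filled[f] = Counter(values).most_common(1)[0][0]   # most frequent value among the neighbours
        else:
            filled[f] = values.mean()                          # mean value among the neighbours
    return filled
```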
Claims (3)
1. A nearest neighbor filling method with a non-fixed k value, characterized by comprising the following steps:
(1) dividing attributes into five classes: continuous, symmetric binary, asymmetric binary, unordered discrete, and ordered discrete, and defining distance calculation formulas for instances of each attribute class;
(2) selecting, for each missing instance, the nearest k training instances, and at the same time choosing the attributes that best fit the missing instance;
(3) calculating the distances between the missing instance and all training instances, choosing the nearest k complete instances, and then using these k complete instances to fill the missing value of the missing instance.
2. The method according to claim 1, characterized in that the distance calculation formulas for instances of the different attribute classes are as follows:
Mixed type: $d(i,j) = \frac{\sum_{f=1}^{n} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{n} \delta_{ij}^{(f)}}$, where $\delta_{ij}^{(f)}$ indicates whether instance i or j has a missing value on attribute f (0 if there is a missing value, 1 otherwise), f denotes the f-th attribute among the five attribute classes, n is the number of attributes, and $d_{ij}^{(f)}$ is the distance between instances i and j on attribute f;
Continuous type: the distance between two instances over their continuous attributes, where n is the number of continuous attributes in instances i and j, $A_{i,k}$ is the value of the k-th attribute of instance i, and $\bar{A}_i$ is the mean value of the n continuous attributes of instance i;
Symmetric binary type: $d(i,j) = \frac{r+s}{q+r+s+t}$; asymmetric binary type: $d(i,j) = \frac{r+s}{q+r+s}$, where q represents the number of attributes on which both instance i and instance j take the value "1", r represents the number on which instance i takes "0" and instance j takes "1", s represents the number on which instance i takes "1" and instance j takes "0", and t represents the number on which both instance i and instance j take the value "0";
Unordered discrete type: $d(i,j) = \frac{p-m}{p}$, where p is the number of unordered discrete attributes in the data set and m is the number of attributes on which the two instances take the same value;
Ordered discrete type: the distance between two values A and B is defined in terms of the probabilities P(common(A,B)) and P(description(A,B)).
3. The method according to claim 1, characterized in that in step (2) the principle of sparse coding is adopted: the set A of all complete instances is used to reconstruct the current missing instance by regression, redundant or noisy attributes in A are automatically deleted, and the number of nonzero regression parameters is the number k used by the nearest neighbor algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310452387.XA CN103544218A (en) | 2013-09-29 | 2013-09-29 | Nearest neighbor filling method of non-fixed k values |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310452387.XA CN103544218A (en) | 2013-09-29 | 2013-09-29 | Nearest neighbor filling method of non-fixed k values |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103544218A true CN103544218A (en) | 2014-01-29 |
Family
ID=49967670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310452387.XA Pending CN103544218A (en) | 2013-09-29 | 2013-09-29 | Nearest neighbor filling method of non-fixed k values |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103544218A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250431A (en) * | 2016-07-25 | 2016-12-21 | 华南师范大学 | A kind of Color Feature Extraction Method based on classification clothing and costume retrieval system |
CN107193876A (en) * | 2017-04-21 | 2017-09-22 | 美林数据技术股份有限公司 | A kind of missing data complementing method based on arest neighbors KNN algorithms |
CN107273429A (en) * | 2017-05-19 | 2017-10-20 | 哈工大大数据产业有限公司 | A kind of Missing Data Filling method and system based on deep learning |
CN110097170A (en) * | 2019-04-25 | 2019-08-06 | 深圳市豪斯莱科技有限公司 | Information pushes object prediction model acquisition methods, terminal and storage medium |
CN111737463A (en) * | 2020-06-04 | 2020-10-02 | 江苏名通信息科技有限公司 | Big data missing value filling method, device and computer program |
CN111784799A (en) * | 2020-06-30 | 2020-10-16 | 北京百度网讯科技有限公司 | Image filling method, device, equipment and storage medium |
-
2013
- 2013-09-29 CN CN201310452387.XA patent/CN103544218A/en active Pending
Non-Patent Citations (4)
Title |
---|
SHICHAO ZHANG et al.: "Missing data imputation by utilizing information within incomplete instances", 《THE JOURNAL OF SYSTEMS AND SOFTWARE》 *
SHICHAO ZHANG: "Shell-neighbor method and its application in missing data", 《APPLIED INTELLIGENCE》 *
刘星毅 et al.: "基于欧式距离的最近邻改进算法" (An improved nearest neighbor algorithm based on Euclidean distance), 《广西科学院学报》 (Journal of Guangxi Academy of Sciences) *
庄连生 et al.: "非负稀疏局部线性编码" (Non-negative sparse local linear coding), 《软件学报》 (Journal of Software) *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250431A (en) * | 2016-07-25 | 2016-12-21 | 华南师范大学 | A kind of Color Feature Extraction Method based on classification clothing and costume retrieval system |
CN106250431B (en) * | 2016-07-25 | 2019-03-22 | 华南师范大学 | A kind of Color Feature Extraction Method and costume retrieval system based on classification clothes |
CN107193876A (en) * | 2017-04-21 | 2017-09-22 | 美林数据技术股份有限公司 | A kind of missing data complementing method based on arest neighbors KNN algorithms |
CN107193876B (en) * | 2017-04-21 | 2020-10-09 | 美林数据技术股份有限公司 | Missing data filling method based on nearest neighbor KNN algorithm |
CN107273429A (en) * | 2017-05-19 | 2017-10-20 | 哈工大大数据产业有限公司 | A kind of Missing Data Filling method and system based on deep learning |
CN110097170A (en) * | 2019-04-25 | 2019-08-06 | 深圳市豪斯莱科技有限公司 | Information pushes object prediction model acquisition methods, terminal and storage medium |
CN111737463A (en) * | 2020-06-04 | 2020-10-02 | 江苏名通信息科技有限公司 | Big data missing value filling method, device and computer program |
CN111737463B (en) * | 2020-06-04 | 2024-02-09 | 江苏名通信息科技有限公司 | Big data missing value filling method, device and computer readable memory |
CN111784799A (en) * | 2020-06-30 | 2020-10-16 | 北京百度网讯科技有限公司 | Image filling method, device, equipment and storage medium |
CN111784799B (en) * | 2020-06-30 | 2024-01-12 | 北京百度网讯科技有限公司 | Image filling method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Mienye et al. | Prediction performance of improved decision tree-based algorithms: a review | |
CN103544218A (en) | Nearest neighbor filling method of non-fixed k values | |
CN110781406B (en) | Social network user multi-attribute inference method based on variational automatic encoder | |
KR20210040248A (en) | Generative structure-property inverse computational co-design of materials | |
CN114239560B (en) | Three-dimensional image classification method, apparatus, device, and computer-readable storage medium | |
CN104866578A (en) | Hybrid filling method for incomplete data | |
CN109829065B (en) | Image retrieval method, device, equipment and computer readable storage medium | |
US20200250894A1 (en) | Forming a dataset for inference of editable feature trees | |
CN104317838B (en) | Cross-media Hash index method based on coupling differential dictionary | |
US20220414144A1 (en) | Multi-task deep hash learning-based retrieval method for massive logistics product images | |
CN113704082A (en) | Model evaluation method and device, electronic equipment and storage medium | |
CN110619364B (en) | Wavelet neural network three-dimensional model classification method based on cloud model | |
CN110647995A (en) | Rule training method, device, equipment and storage medium | |
CN113254810B (en) | Search result output method and device, computer equipment and readable storage medium | |
CN105808582A (en) | Parallel generation method and device of decision tree on the basis of layered strategy | |
KR101467707B1 (en) | Method for instance-matching in knowledge base and device therefor | |
CN103077228A (en) | Set characteristic vector-based quick clustering method and device | |
CN113516019A (en) | Hyperspectral image unmixing method and device and electronic equipment | |
Bruzzese et al. | DESPOTA: DEndrogram slicing through a pemutation test approach | |
CN103310027B (en) | Rules extraction method for map template coupling | |
CN105138527B (en) | A kind of data classification homing method and device | |
CN114238746A (en) | Cross-modal retrieval method, device, equipment and storage medium | |
de Sá et al. | A novel approach to estimated Boulingand-Minkowski fractal dimension from complex networks | |
CN107944045B (en) | Image search method and system based on t distribution Hash | |
WO2023050461A1 (en) | Data clustering method and system, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20140129 |