CN103544218A - Nearest neighbor filling method of non-fixed k values - Google Patents

Nearest neighbor filling method of non-fixed k values

Info

Publication number
CN103544218A
Authority
CN
China
Prior art keywords
value
attribute
missing
log
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310452387.XA
Other languages
Chinese (zh)
Inventor
张师超
朱晓峰
刘星毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201310452387.XA priority Critical patent/CN103544218A/en
Publication of CN103544218A publication Critical patent/CN103544218A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a nearest neighbor filling method with non-fixed k values, aimed at overcoming the shortcomings of existing nearest neighbor filling methods. First, reasonable distance formulas are defined for the various attribute types; then, for each missing instance, an appropriate k value is selected by means of sparse coding, and at the same time the attributes that best fit the missing instance are chosen; finally, according to the obtained k, the k complete instances closest to the missing instance are selected to fill the missing value. The method solves the per-instance problem of missing data filling, makes the filling of missing values more reasonable, and improves filling quality without increasing filling complexity. The method is easy to implement and involves only some simple mathematical models when the code is written.

Description

Nearest neighbor filling method of non-fixed k values
Technical field
The present invention relates to the fields of computer science and technology and information technology, and in particular to a method for filling missing data with a nearest neighbor method that uses non-fixed k values.
Background art
The principle of the k-nearest-neighbor algorithm (kNN) can be described as follows: the two instances with the smallest distance between them are the most closely related. Therefore, if an instance has a missing value (whether on a conditional attribute or on the decision attribute), its distance to every complete instance in the data set can be computed and the nearest instances found; the missing value is then replaced by the value of the nearest instances on that attribute (for categorical attributes) or by their mean value (for continuous attributes).
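As a rough illustration of this background (not the method claimed below), a minimal kNN imputation sketch in Python could look as follows; the Euclidean distance, the fixed k and all function and argument names are placeholders chosen here for illustration only.

```python
import numpy as np

def knn_impute_one(x, complete_rows, miss_col, k=5, categorical=False):
    """Fill x[miss_col] from the k complete rows nearest to x.

    x             : 1-D float array with np.nan in the missing position
    complete_rows : 2-D array of instances without missing values
    Distances are computed only over the observed columns of x; Euclidean
    distance and a fixed k are used purely for illustration.
    """
    obs = [j for j in range(len(x)) if j != miss_col and not np.isnan(x[j])]
    dists = np.sqrt(((complete_rows[:, obs] - x[obs]) ** 2).sum(axis=1))
    nearest = complete_rows[np.argsort(dists)[:k], miss_col]
    if categorical:
        vals, counts = np.unique(nearest, return_counts=True)
        return vals[np.argmax(counts)]   # most frequent value among the k neighbors
    return float(nearest.mean())         # mean value among the k neighbors
```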
The nearest neighbor method is a lazy learning method based on instances (Lazy Learning): it does not actually build a classifier from the training samples in advance, but stores all training samples and performs the computation only when a classification is needed. Of course, if the user cannot specify a k value, k must first be learned from the training samples. Compared with eager learning methods such as decision tree induction or neural networks, which construct a classification model before classifying, the nearest neighbor method, being a lazy learner, sees its computational cost rise sharply as the number of training samples grows; with the support of effective indexing methods, however, this problem can be resolved. The nearest neighbor algorithm is therefore widely used, for example for filling missing data and for classification. Because it is easy to understand, simple to operate and effective, it is widely applied both in scientific research and in real life. For example, when classifying various instances, the classification accuracy of the nearest neighbor algorithm is high in both two-class and multi-class problems. For filling missing data, the nearest neighbor method is the most popular hot deck imputation method; it was first proposed in 1967 and is currently embedded in some common software packages, for example SAS.
However, the nearest neighbor filling algorithm has some obvious shortcomings: (1) the way distances are computed (Euclidean distance); (2) the choice of the value of k; (3) the same k value is used for different instances.
Most nearest neighbor filling algorithms use the Euclidean distance formula to compute the distance between two instances, but many studies have shown that the Euclidean distance formula cannot handle discrete, continuous or mixed-type attributes well. In practical applications, attributes of different types occur together, for example continuous attributes, binary attributes, unordered discrete attributes and ordered discrete attributes (in the present invention, non-continuous attributes are collectively referred to as categorical attributes).
The value of the parameter k in the kNN filling method is a problem that deserves close attention. In experiments, if k is set too large, the result easily becomes too random; if k is set too small, the number of samples is not sufficient and does not reach the standard of a large sample in the statistical sense (from a non-rigorous point of view, a large sample should contain at least 30 instances). Moreover, the best k value differs from data set to data set, and the optimal k is usually determined by experiment, which inevitably increases the complexity of the experiments. This is a generally acknowledged difficult problem, so the choice of k has attracted the attention of many experts; one suggestion is k=5 when n>100, where n is the number of missing values in the data set. A careful reader will notice that all missing instances in the whole data set are then filled with the same predetermined k. This is obviously unreasonable: some instances may be filled well with k=5, while the fifth neighbor of another instance may already be an outlier with respect to it. Using the same k value for a whole data set is therefore unreasonable, and such a k is very difficult to choose.
Summary of the invention
The object of the present invention is to provide a simple and effective missing value filling method that solves both the unreasonable distance computation and the use of the same nearest neighbor k value for all missing instances. The present invention first defines a simple and effective distance calculation method, then uses sparse coding to select a suitable k value for each missing instance, and finally, according to the obtained k, selects the k complete instances nearest to the missing instance to fill the missing value.
The technical scheme of the present invention comprises the following steps:
(1) Attributes are divided into five classes: continuous, symmetric binary, asymmetric binary, unordered discrete and ordered discrete, and distance formulas are defined for instances of the different attribute classes;
(2) For each missing instance, the nearest k training instances are selected, and at the same time the attributes that best fit this missing instance are chosen;
(3) The distances between the missing instance and all training instances are computed, the nearest k complete instances are chosen, and these k complete instances are then used to fill the missing value of the missing instance.
The distance formulas for instances of the different attribute classes are as follows:
Mixed type: d(i,j) = ( Σ_{f=1}^{n} δ_ij^(f) · d_ij^(f) ) / ( Σ_{f=1}^{n} δ_ij^(f) ), where δ_ij^(f) indicates whether instances i and j are missing on attribute f (0 if either value is missing, 1 otherwise), f ranges over the attributes of the five classes, n is the number of attributes, and d_ij^(f) is the distance between instances i and j on attribute f;
Continuous type: [formula not reproduced], where n is the number of continuous attributes in instances i and j, A_{i,k} is the value of the k-th continuous attribute of instance i, and Ā_i is the mean value of the n continuous attributes of instance i;
Symmetric binary type: d(i,j) = (r + s) / (q + r + s + t);
Asymmetric binary type: d(i,j) = (r + s) / (q + r + s),
where q is the number of attributes on which both instance i and instance j take the value "1", r is the number on which instance i takes "0" and instance j takes "1", s is the number on which instance i takes "1" and instance j takes "0", and t is the number on which both take "0";
Unordered discrete type: d(i,j) = (p - m) / p, where p is the number of unordered discrete attributes in the data set and m is the number of attributes on which the two instances take the same value;
Ordered discrete type: the distance between two values A and B is dist(A,B) = 2 × log P(common(A,B)) / log P(description(A,B)) = 2 × log p(A∪B) / ( log p(A) + log p(B) ), where common and description are interpreted according to the specific field; in the present invention, for example, common(A,B) denotes the logical union of A and B, P(common(A,B)) denotes the probability of common(A,B), and P(description(A,B)) denotes the product of the probabilities of A and B. In detail, in step (2) the principle of sparse coding is adopted: the set A of all complete instances is used to reconstruct the current missing instance by regression, redundant or noisy attributes in A are deleted automatically, and the number of non-zero regression coefficients is exactly the number k used by the nearest neighbor algorithm.
Step (1) of the present invention solves the problem of computing distances between attributes of different types, which overcomes the unreasonableness of earlier algorithms that apply a single Euclidean distance uniformly to all attribute classes.
Step (2) solves the problem that existing nearest neighbor filling algorithms use one k value even though different missing instances should use different k values. The closest neighbors of each missing instance are different, so this method uses sparse coding to determine the number of neighbors of the current missing instance, which is closer to practical applications. Moreover, while selecting neighbors, this step uses the attribute reduction of the present invention to remove the interference of some attributes.
Step (3) is the ordinary nearest neighbor filling algorithm, except that each missing instance is filled with its own k nearest neighbors.
This method solves the per-instance problem of missing data filling, makes the filling of missing values more reasonable and improves filling quality without increasing filling complexity. The present invention is easy to implement and involves only some simple mathematical models when the code is written.
Embodiment
First, the various mixed distances are computed. The attributes commonly studied are divided into five classes: continuous, symmetric binary, asymmetric binary, unordered discrete and ordered discrete. The distance defined by the present invention is d(i,j) = ( Σ_{f=1}^{n} δ_ij^(f) · d_ij^(f) ) / ( Σ_{f=1}^{n} δ_ij^(f) ), where δ_ij^(f) indicates whether instances i and j are missing on attribute f (0 if either value is missing, 1 otherwise), f ranges over the attributes of the five classes, n is the number of attributes, and d_ij^(f) is the distance between instances i and j on attribute f.
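To make the aggregation concrete, the following Python sketch combines per-attribute distances using the weighted form above; the function name, the data layout (one per-attribute distance function per attribute, np.nan marking missing values) and the use of NumPy are assumptions made here for illustration.

```python
import numpy as np

def mixed_distance(xi, xj, attr_dist_fns):
    """d(i,j) = sum_f(delta_ij_f * d_ij_f) / sum_f(delta_ij_f).

    xi, xj        : 1-D float arrays holding one instance each (np.nan = missing)
    attr_dist_fns : one per-attribute distance function d_f(a, b), ideally in [0, 1]
    """
    num, den = 0.0, 0.0
    for f, dist_f in enumerate(attr_dist_fns):
        delta = 0.0 if (np.isnan(xi[f]) or np.isnan(xj[f])) else 1.0
        if delta:
            num += dist_f(xi[f], xj[f])
            den += 1.0
    return num / den if den else np.nan   # undefined when no attribute is comparable
```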
A. Distance between continuous values
The distance between two continuous-valued instances is computed as: [formula not reproduced], where n is the number of continuous attributes in instances i and j, A_{i,k} is the value of the k-th continuous attribute of instance i, and Ā_i is the mean value of the n continuous attributes of instance i.
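Since the exact continuous-attribute formula is only available as an image, the sketch below uses a mean-normalized Manhattan distance built from the same quantities the text names (the values A_{i,k} and the per-instance means Ā_i, Ā_j); this normalization is an assumption for illustration, not the patent's formula.

```python
import numpy as np

def continuous_distance(ai, aj):
    """Assumed stand-in for the continuous-attribute distance.

    ai, aj : 1-D arrays of the n continuous attribute values of instances i and j.
    Scales |A_ik - A_jk| by the per-instance means so that attributes of very
    different magnitudes contribute comparably.
    """
    ai, aj = np.asarray(ai, float), np.asarray(aj, float)
    scale = ai.mean() + aj.mean()              # corresponds to the means of instance i and j
    if scale == 0:
        return 0.0
    return float(np.abs(ai - aj).sum() / (len(ai) * scale))
```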
B. Distance between symmetric and asymmetric binary attributes
If the two values of a binary attribute are distributed roughly uniformly, the attribute is called a symmetric binary attribute; otherwise it is an asymmetric binary attribute. For example, the attribute "sex" has the two values "male" and "female"; since males and females are roughly uniformly distributed, "sex" is a symmetric binary attribute, and when computing the distance between two instances on this attribute the two values can be given the same weight. By contrast, the attribute "has AIDS" has the values "yes" and "no"; from reality it is known that the probability of "yes" is far smaller than the probability of "no", so when computing the distance the value "no" carries a larger weight than the value "yes". The distance of a binary attribute is computed here from a contingency table: q denotes the number of attributes on which both instance i and instance j take the value "1", and the other entries are defined analogously (see the table below).

                 instance j = 1   instance j = 0
instance i = 1         q                s
instance i = 0         r                t

The present invention defines the symmetric binary distance as d(i,j) = (r + s) / (q + r + s + t) and the asymmetric binary distance as d(i,j) = (r + s) / (q + r + s), where q is the number of attributes on which both instance i and instance j take the value "1", r is the number on which instance i takes "0" and instance j takes "1", s is the number on which instance i takes "1" and instance j takes "0", and t is the number on which both take "0". Comparing the two formulas, the difference lies in the denominator; the reason is that the two values of an asymmetric binary attribute have different probabilities, so their weights should also differ.
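A minimal sketch of the two binary distances, assuming the binary attributes of the two instances are given as 0/1 vectors; the function and argument names are illustrative only.

```python
import numpy as np

def binary_distance(bi, bj, symmetric=True):
    """Contingency-table distance for 0/1 attribute vectors bi and bj.

    q: both 1, r: i is 0 and j is 1, s: i is 1 and j is 0, t: both 0.
    Symmetric:  (r + s) / (q + r + s + t)
    Asymmetric: (r + s) / (q + r + s)   (negative matches t are ignored)
    """
    bi, bj = np.asarray(bi, int), np.asarray(bj, int)
    q = int(np.sum((bi == 1) & (bj == 1)))
    r = int(np.sum((bi == 0) & (bj == 1)))
    s = int(np.sum((bi == 1) & (bj == 0)))
    t = int(np.sum((bi == 0) & (bj == 0)))
    den = (q + r + s + t) if symmetric else (q + r + s)
    return (r + s) / den if den else 0.0
```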
C. Unordered discrete (nominal) attributes
The values of some attributes (for example, color) can be "red", "blue" and so on; there is no order relation between these values, and such attributes are called unordered discrete attributes. If the data set contains p unordered discrete attributes in total and the two instances take the same value on m of them, the distance between them can be defined as d(i,j) = (p - m) / p.
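A short sketch of this matching-based distance, assuming the nominal values of the two instances are given as equal-length sequences; the names are illustrative.

```python
def nominal_distance(vi, vj):
    """(p - m) / p over the p unordered discrete attributes.

    vi, vj : equal-length sequences of nominal values for instances i and j;
    m is the number of positions where the two instances agree.
    """
    p = len(vi)
    m = sum(1 for a, b in zip(vi, vj) if a == b)
    return (p - m) / p if p else 0.0
```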
D. Ordered (ordinal) attributes
The values of some categorical attributes are ordered, for example the attribute "rank": "1" and "2" are ordered, yet the attribute still differs from a continuous attribute because there are no values between 1 and 2. Computing the distance between such attributes therefore combines features of unordered categorical attributes and of continuous attributes. For example, the attribute "quality" has five ordered categorical values: excellent, good, average, bad and awful. Obviously "excellent" is better than "good", but how much better cannot be determined. The present invention defines the distance between two attribute values A and B as
dist(A,B) = 2 × log P(common(A,B)) / log P(description(A,B)) = 2 × log p(A∪B) / ( log p(A) + log p(B) ),
where 'common' and 'description' are interpreted according to the specific field; in the present invention, for example, common(A,B) denotes the logical union of A and B, P(common(A,B)) denotes the probability of common(A,B), and P(description(A,B)) denotes the product of the probabilities of A and B. From the formula above,
dist('excellent','good') = 2 × log P('excellent'∪'good') / ( log P('excellent') + log P('good') ) = 2 × log(0.1+0.2) / ( log 0.1 + log 0.2 ) = 0.62
dist('excellent','average') = 2 × log P('excellent'∪'good'∪'average') / ( log P('excellent') + log P('good') + log P('average') ) = 2 × log(0.1+0.2+0.4) / ( log 0.1 + log 0.2 + log 0.4 ) = 0.15
The results show that the similarity between the two ordinal values "excellent" and "good" is greater than the similarity between "excellent" and "average".
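The worked example above can be reproduced with the short sketch below; only the three probabilities used in the example (0.1, 0.2, 0.4) are given in the source, so the probability table and the ordering are assumptions restricted to those values.

```python
import math

# value probabilities taken from the worked example; the list gives the ordinal ranking
PROBS = {'excellent': 0.1, 'good': 0.2, 'average': 0.4}
ORDER = ['excellent', 'good', 'average']

def ordinal_distance(a, b, probs=PROBS, order=ORDER):
    """dist(A,B) = 2*log P(union of values from A to B) / sum of log P(value)."""
    i, j = sorted((order.index(a), order.index(b)))
    span = order[i:j + 1]                            # A, B and every value between them
    union_p = sum(probs[v] for v in span)            # P(common(A,B))
    denom = sum(math.log10(probs[v]) for v in span)  # log P(description(A,B))
    return 2 * math.log10(union_p) / denom

print(round(ordinal_distance('excellent', 'good'), 2))     # 0.62
print(round(ordinal_distance('excellent', 'average'), 2))  # 0.15
```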
Second, for each missing instance the nearest k training instances are selected, and at the same time the attributes that best fit this missing instance are chosen. The present invention adopts the principle of sparse coding and uses the set A of all complete instances to reconstruct the current missing instance by regression. During the regression analysis, redundant or noisy attributes in A are deleted automatically, i.e. attribute reduction is performed. The regression result is sparse: many regression coefficients are 0, which means that the complete instances corresponding to zero coefficients need not be used to fill the current missing instance. The number of non-zero regression coefficients is exactly the k used by the nearest neighbor algorithm in the next step, and the k obtained for each missing instance is different.
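As one plausible realization of this step (the patent does not name a specific sparse solver), an L1-regularized regression such as scikit-learn's Lasso can reconstruct the observed attributes of the missing instance from the complete instances; the regularization strength alpha, the choice of Lasso and the function name are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_k_by_sparse_coding(x_miss, complete_rows, miss_col, alpha=0.1):
    """Reconstruct the observed part of x_miss as a sparse combination of complete rows.

    Returns the indices of the complete instances with non-zero coefficients;
    their count is the per-instance (non-fixed) k.
    """
    obs = [j for j in range(len(x_miss)) if j != miss_col and not np.isnan(x_miss[j])]
    # columns = complete instances, rows = observed attributes of the missing instance
    D = complete_rows[:, obs].T
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(D, x_miss[obs])
    donors = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)
    return donors
```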
Third, the distances between the missing instance and all training instances are computed, the nearest k complete instances are chosen, and these k complete instances are then used to fill the missing value of the missing instance.
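Putting the three steps together, a minimal end-to-end sketch could look as follows; it reuses the hypothetical helpers sketched above (mixed_distance, select_k_by_sparse_coding) and is a rough illustration rather than the patent's exact procedure.

```python
import numpy as np

def impute_instance(x_miss, complete_rows, miss_col, attr_dist_fns, categorical=False):
    """Fill x_miss[miss_col] with a per-instance k chosen by sparse coding."""
    donors = select_k_by_sparse_coding(x_miss, complete_rows, miss_col)
    k = max(len(donors), 1)                     # fall back to one neighbor if none selected
    dists = np.array([mixed_distance(x_miss, row, attr_dist_fns) for row in complete_rows])
    nearest = complete_rows[np.argsort(dists)[:k], miss_col]
    if categorical:
        vals, counts = np.unique(nearest, return_counts=True)
        return vals[np.argmax(counts)]          # most frequent value among the k donors
    return float(nearest.mean())                # mean value among the k donors
```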

Claims (3)

1. A nearest neighbor filling method of non-fixed k values, characterized in that it comprises the following steps:
(1) attributes are divided into five classes: continuous, symmetric binary, asymmetric binary, unordered discrete and ordered discrete, and distance formulas are defined for instances of the different attribute classes;
(2) for each missing instance, the nearest k training instances are selected, and at the same time the attributes that best fit this missing instance are chosen;
(3) the distances between the missing instance and all training instances are computed, the nearest k complete instances are chosen, and these k complete instances are then used to fill the missing value of the missing instance.
2. The method according to claim 1, characterized in that the distance formulas for instances of the different attribute classes are as follows:
Mixed type: d(i,j) = ( Σ_{f=1}^{n} δ_ij^(f) · d_ij^(f) ) / ( Σ_{f=1}^{n} δ_ij^(f) ), where δ_ij^(f) indicates whether instances i and j are missing on attribute f (0 if either value is missing, 1 otherwise), f ranges over the attributes of the five classes, n is the number of attributes, and d_ij^(f) is the distance between instances i and j on attribute f;
Continuous type: [formula not reproduced], where n is the number of continuous attributes in instances i and j, A_{i,k} is the value of the k-th continuous attribute of instance i, and Ā_i is the mean value of the n continuous attributes of instance i;
Symmetric binary type: d(i,j) = (r + s) / (q + r + s + t);
Asymmetric binary type: d(i,j) = (r + s) / (q + r + s), where q is the number of attributes on which both instance i and instance j take the value "1", r is the number on which instance i takes "0" and instance j takes "1", s is the number on which instance i takes "1" and instance j takes "0", and t is the number on which both take "0";
Unordered discrete type: d(i,j) = (p - m) / p, where p is the number of unordered discrete attributes in the data set and m is the number of attributes on which the two instances take the same value;
Ordered discrete type: the distance between two values A and B is dist(A,B) = 2 × log P(common(A,B)) / log P(description(A,B)) = 2 × log p(A∪B) / ( log p(A) + log p(B) ), where common and description are interpreted according to the specific field.
3. The method according to claim 1, characterized in that in step (2) the principle of sparse coding is adopted: the set A of all complete instances is used to reconstruct the current missing instance by regression, redundant or noisy attributes in A are deleted automatically, and the number of non-zero regression coefficients is exactly the number k used by the nearest neighbor algorithm.
CN201310452387.XA 2013-09-29 2013-09-29 Nearest neighbor filling method of non-fixed k values Pending CN103544218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310452387.XA CN103544218A (en) 2013-09-29 2013-09-29 Nearest neighbor filling method of non-fixed k values


Publications (1)

Publication Number Publication Date
CN103544218A true CN103544218A (en) 2014-01-29

Family

ID=49967670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310452387.XA Pending CN103544218A (en) 2013-09-29 2013-09-29 Nearest neighbor filling method of non-fixed k values

Country Status (1)

Country Link
CN (1) CN103544218A (en)


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SHICHAO ZHANG 等: "Missing data imputation by utilizing information within incomplete instances", 《THE JOURNAL OF SYSTEMS AND SOFTWARE》 *
SHICHAO ZHANG: "Shell-neighbor method and its application in missing data", 《APPLIED INTELLIGENCE》 *
刘星毅 et al.: "An improved nearest neighbor algorithm based on Euclidean distance", 《广西科学院学报》 *
庄连生 et al.: "Non-negative sparse locally linear coding", 《软件学报》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250431A (en) * 2016-07-25 2016-12-21 华南师范大学 A kind of Color Feature Extraction Method based on classification clothing and costume retrieval system
CN106250431B (en) * 2016-07-25 2019-03-22 华南师范大学 A kind of Color Feature Extraction Method and costume retrieval system based on classification clothes
CN107193876A (en) * 2017-04-21 2017-09-22 美林数据技术股份有限公司 A kind of missing data complementing method based on arest neighbors KNN algorithms
CN107193876B (en) * 2017-04-21 2020-10-09 美林数据技术股份有限公司 Missing data filling method based on nearest neighbor KNN algorithm
CN107273429A (en) * 2017-05-19 2017-10-20 哈工大大数据产业有限公司 A kind of Missing Data Filling method and system based on deep learning
CN110097170A (en) * 2019-04-25 2019-08-06 深圳市豪斯莱科技有限公司 Information pushes object prediction model acquisition methods, terminal and storage medium
CN111737463A (en) * 2020-06-04 2020-10-02 江苏名通信息科技有限公司 Big data missing value filling method, device and computer program
CN111737463B (en) * 2020-06-04 2024-02-09 江苏名通信息科技有限公司 Big data missing value filling method, device and computer readable memory
CN111784799A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Image filling method, device, equipment and storage medium
CN111784799B (en) * 2020-06-30 2024-01-12 北京百度网讯科技有限公司 Image filling method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Mienye et al. Prediction performance of improved decision tree-based algorithms: a review
CN103544218A (en) Nearest neighbor filling method of non-fixed k values
CN110781406B (en) Social network user multi-attribute inference method based on variational automatic encoder
KR20210040248A (en) Generative structure-property inverse computational co-design of materials
CN114239560B (en) Three-dimensional image classification method, apparatus, device, and computer-readable storage medium
CN104866578A (en) Hybrid filling method for incomplete data
CN109829065B (en) Image retrieval method, device, equipment and computer readable storage medium
US20200250894A1 (en) Forming a dataset for inference of editable feature trees
CN104317838B (en) Cross-media Hash index method based on coupling differential dictionary
US20220414144A1 (en) Multi-task deep hash learning-based retrieval method for massive logistics product images
CN113704082A (en) Model evaluation method and device, electronic equipment and storage medium
CN110619364B (en) Wavelet neural network three-dimensional model classification method based on cloud model
CN110647995A (en) Rule training method, device, equipment and storage medium
CN113254810B (en) Search result output method and device, computer equipment and readable storage medium
CN105808582A (en) Parallel generation method and device of decision tree on the basis of layered strategy
KR101467707B1 (en) Method for instance-matching in knowledge base and device therefor
CN103077228A (en) Set characteristic vector-based quick clustering method and device
CN113516019A (en) Hyperspectral image unmixing method and device and electronic equipment
Bruzzese et al. DESPOTA: DEndrogram slicing through a pemutation test approach
CN103310027B (en) Rules extraction method for map template coupling
CN105138527B (en) A kind of data classification homing method and device
CN114238746A (en) Cross-modal retrieval method, device, equipment and storage medium
de Sá et al. A novel approach to estimated Boulingand-Minkowski fractal dimension from complex networks
CN107944045B (en) Image search method and system based on t distribution Hash
WO2023050461A1 (en) Data clustering method and system, and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140129