CN103544218A - Nearest neighbor filling method of non-fixed k values - Google Patents
Nearest neighbor filling method of non-fixed k values
- Publication number
- CN103544218A CN103544218A CN201310452387.XA CN201310452387A CN103544218A CN 103544218 A CN103544218 A CN 103544218A CN 201310452387 A CN201310452387 A CN 201310452387A CN 103544218 A CN103544218 A CN 103544218A
- Authority
- CN
- China
- Prior art keywords
- value
- attribute
- disappearance
- log
- type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Complex Calculations (AREA)
Abstract
The invention provides a nearest neighbor filling method with non-fixed k values, aimed mainly at overcoming the defects of existing nearest neighbor filling methods. The method comprises the following steps: first, reasonable definitions are given for the distance calculation formulas of the various attribute types; then, an appropriate k value is selected for each missing instance by means of sparse coding, and at the same time the attributes that best fit the missing instance are selected; finally, according to the obtained k value, the k complete instances closest to the missing instance are selected to fill the missing value. The method solves the per-instance problem of missing data filling, enhances the reasonableness of missing value filling, and improves filling quality without increasing filling complexity. The method is easy to implement and involves only simple mathematical models when coded.
Description
Technical field
The present invention relates to the fields of computer science and information technology, and in particular to a method for filling missing data with a nearest neighbor method that uses a non-fixed k value.
Background technology
The principle of the nearest neighbor algorithm (kNN) can be described as follows: the two instances with the smallest distance between them are the most closely related. Therefore, if an instance has a missing value (whether in a conditional attribute or in the decision attribute), its distance to every complete instance in the data set can be calculated and the nearest instances found; the missing value is then replaced by the value of the nearest instance on that attribute (for categorical attributes) or by the mean value (for continuous attributes).
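For illustration only, a minimal sketch of this principle in Python, assuming purely numeric attributes, NaN-coded missing values, and the plain Euclidean distance (which the present invention later replaces with mixed-type distances); the function name and the default k are illustrative:

```python
import numpy as np

def knn_impute(data, k=5):
    """Fill NaN entries of each row using the k complete rows that are
    closest in Euclidean distance over the attributes present in that row."""
    data = data.copy()
    complete = data[~np.isnan(data).any(axis=1)]              # rows with no missing values
    for i, row in enumerate(data):
        missing = np.isnan(row)
        if not missing.any():
            continue
        observed = ~missing
        dists = np.sqrt(((complete[:, observed] - row[observed]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(dists)[:k]]              # k nearest complete rows
        data[i, missing] = nearest[:, missing].mean(axis=0)    # mean value of the neighbours
    return data
```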
Because the nearest neighbor method is a lazy learning method based on instances (Lazy Learning), it does not actually construct a classifier from the training samples; it first stores all training samples and performs the computation only when a classification is needed. Of course, if the user cannot specify the k value, k must first be learned from the training samples. Compared with active learning (Active Learning) methods such as decision tree induction and neural networks, which construct a classification model before classifying, the nearest neighbor method, being a lazy learning method, sees its computation grow sharply when the number of training samples increases sharply. With the support of effective indexing methods, this problem can be resolved. Nearest neighbor algorithms are therefore widely used, for example for filling missing data and for classification. Being easy to understand, simple to operate, and effective, they are widely applied in scientific research and in real life. For example, when classifying various instances, the classification accuracy of the nearest neighbor algorithm is high for both two-class and multi-class problems. For filling missing data, the nearest neighbor method is the most popular cold-deck imputation method; it was first proposed in 1967 and is currently embedded in common software such as SAS.
However, the nearest neighbor filling algorithm has some obvious shortcomings: 1. the Euclidean distance calculation; 2. the choice of the value of k; 3. the use of an identical k value for different instances.
Most nearest neighbor filling algorithms use the Euclidean distance formula to calculate the distance between two instances. However, many studies have shown that the Euclidean distance formula cannot handle discrete, continuous, or mixed-type attributes well. Moreover, in practical applications attributes of different types coexist, for example continuous attributes, binary attributes, unordered discrete attributes, and ordered discrete attributes (in the present invention, non-continuous attributes are collectively called categorical attributes).
The value of the parameter k in the kNN filling method is a problem that deserves close attention. In experiments, if k is chosen too large, excessive randomness is easily introduced; if k is chosen too small, the number of samples is insufficient and does not reach the standard of a large sample in the statistical sense (from a non-rigorous point of view, a large sample is expected to contain at least 30 instances). Moreover, the best k value differs from data set to data set, and the optimal k is usually obtained by experiment, which inevitably increases the complexity of the experiments. This is a widely recognized difficult problem, so the choice of k has attracted much attention from experts; one suggestion is k=5 (when n>100, where n is the number of missing values in the data set). A careful reader will notice that all missing instances in the whole data set are filled with the same predetermined k. This is clearly unreasonable, because some instances may be filled very well when k=5, while the 5th neighbor of another instance may already be an outlier. Therefore, using the same k value for an entire data set is unreasonable, and such a k is very difficult to choose.
Summary of the invention
The object of the present invention is to provide a simple and effective missing value filling method. The method solves both the problem of unreasonable distance calculation and the problem that the nearest neighbor k value is the same for all missing instances. The present invention first defines a simple and effective distance calculation method, then uses sparse coding to select a suitable k value for each missing instance, and finally, according to the obtained k value, selects the k complete instances nearest to the missing instance to fill the missing value.
The technical scheme of the present invention comprises the following steps:
(1) attributes are divided into five classes: continuous, symmetric binary, asymmetric binary, unordered discrete, and ordered discrete, and distance calculation formulas are defined for instances of each attribute class;
(2) for each missing instance, the nearest k training instances are selected, and at the same time the attributes that best fit the missing instance are chosen;
(3) the distances between the missing instance and all training instances are calculated, the nearest k complete instances are chosen, and these k complete instances are then used to fill the missing value of the missing instance.
Wherein, the distance calculation formulas for instances of the different attribute classes are as follows:
Mixed type: $d(i,j) = \frac{\sum_{f=1}^{n} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{n} \delta_{ij}^{(f)}}$, where $\delta_{ij}^{(f)}$ indicates whether instance i or j has a missing value on attribute f (0 if there is a missing value, 1 otherwise), f denotes the f-th attribute among the five attribute classes, n is the number of attributes, and $d_{ij}^{(f)}$ is the distance between instances i and j on attribute f;
Continuous type: the distance between two instances over their continuous attributes, where n is the number of continuous attributes in instances i and j, $A_{i,k}$ is the value of the k-th attribute of instance i, and $\bar{A}_i$ is the mean value of the n continuous attributes of instance i;
Symmetric binary type: $d(i,j) = \frac{r+s}{q+r+s+t}$; asymmetric binary type: $d(i,j) = \frac{r+s}{q+r+s}$, where q represents the number of attributes on which both instance i and instance j take the value "1", r represents the number on which instance i takes "0" and instance j takes "1", s represents the number on which instance i takes "1" and instance j takes "0", and t represents the number on which both instance i and instance j take the value "0";
Unordered discrete type: $d(i,j) = \frac{p-m}{p}$, where p is the number of unordered discrete attributes in the data set and m is the number of attributes on which the two instances take the same value;
Ordered discrete type: the distance between two attribute values A and B is defined in terms of the probabilities P(common(A,B)) and P(description(A,B)), as described in the embodiment below.
Step (1) of the present invention solves the problem of calculating distances between attributes of different classes. This resolves the unreasonableness of earlier algorithms, which calculated all attribute classes uniformly with the Euclidean distance.
Step (2) solves the problem that the nearest neighbor filling algorithm uses a single k value while different missing instances should have different k values. The closest neighbors of each missing instance are different. This method uses sparse coding to determine the number of neighbors of the current missing instance, an idea that better matches practical applications. Moreover, in the process of selecting neighbors, this step uses the attribute reduction method of the present invention to remove the interference of certain attributes.
Step (3) is the ordinary nearest neighbor filling algorithm, except that a different k-nearest-neighbor filling is applied to each missing instance.
This method solves the per-instance problem of missing data filling; it also enhances the reasonableness of missing value filling and improves filling quality without increasing filling complexity. The present invention is easy to implement and involves only simple mathematical models when the code is written.
Embodiment
First, the calculation of the various mixed distances. The attributes commonly encountered in research are divided into five classes: continuous, symmetric binary, asymmetric binary, unordered discrete, and ordered discrete. The distance defined by the present invention is:
$d(i,j) = \frac{\sum_{f=1}^{n} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{n} \delta_{ij}^{(f)}}$
where $\delta_{ij}^{(f)}$ indicates whether instance i or j has a missing value on attribute f (0 if there is a missing value, 1 otherwise), f denotes the f-th attribute among the five attribute classes, n is the number of attributes, and $d_{ij}^{(f)}$ is the distance between instances i and j on attribute f.
A. Distance calculation for continuous values
The distance formula for two instances over their continuous values uses the following quantities: n, the number of continuous attributes in instances i and j; $A_{i,k}$, the value of the k-th attribute of instance i; and $\bar{A}_i$, the mean value of the n continuous attributes of instance i.
B. Distance calculation for symmetric and asymmetric binary attributes
If the two values of an attribute are distributed uniformly, the attribute is called a symmetric binary attribute; otherwise it is an asymmetric binary attribute. For example, in the attribute "gender", "male" and "female" are the two attribute values; because males and females are distributed uniformly, gender is a symmetric binary attribute, and when calculating the distance between two instances on this attribute the two values can be given the same weight. By contrast, the attribute "has AIDS" has the two values "yes" and "no"; it is known from reality that the probability of "yes" is far smaller than the probability of "no", so when calculating the distance the weight of the value "no" should be greater than the weight of the value "yes". The distance of binary attributes is calculated here from a contingency table; in the table below, q represents the number of attributes on which both instance i and instance j take the value "1", and the other counts are defined analogously (see the following table).

 | instance j = 1 | instance j = 0 |
---|---|---|
instance i = 1 | q | s |
instance i = 0 | r | t |
The present invention defines the symmetric binary distance formula as $d(i,j) = \frac{r+s}{q+r+s+t}$ and the asymmetric binary distance formula as $d(i,j) = \frac{r+s}{q+r+s}$, where q represents the number of attributes on which both instance i and instance j take the value "1", r represents the number on which instance i takes "0" and instance j takes "1", s represents the number on which instance i takes "1" and instance j takes "0", and t represents the number on which both instance i and instance j take the value "0". Comparing the two formulas, the difference lies in the denominator; the reason is that the probabilities of the two values of an asymmetric binary attribute differ, so their weights should also differ.
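As an illustration of these two formulas, a small Python sketch (the function and variable names are illustrative; the inputs are 0/1 vectors of equal length):

```python
import numpy as np

def binary_distances(x_i, x_j):
    """Return (symmetric, asymmetric) binary distances between two 0/1 vectors,
    using the contingency counts q, r, s, t defined above."""
    x_i, x_j = np.asarray(x_i), np.asarray(x_j)
    q = np.sum((x_i == 1) & (x_j == 1))
    r = np.sum((x_i == 0) & (x_j == 1))
    s = np.sum((x_i == 1) & (x_j == 0))
    t = np.sum((x_i == 0) & (x_j == 0))
    symmetric = (r + s) / (q + r + s + t)
    asymmetric = (r + s) / (q + r + s) if (q + r + s) > 0 else 0.0
    return symmetric, asymmetric
```

For example, binary_distances([1, 0, 1, 0], [1, 1, 0, 0]) gives a symmetric distance of 0.5 and an asymmetric distance of about 0.67.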
C. Unordered discrete attributes
The values of some attributes (for example, color) can be "red", "blue", and so on; there is no order relation among these values, and such attributes are called unordered discrete attributes. If the data set contains p unordered discrete attributes in total and the two instances take the same value on m of them, the distance between them can be defined as $d(i,j) = \frac{p-m}{p}$.
D. Ordered attributes
The values of some categorical attributes are ordered, for example the attribute "rank": "1" and "2" are ordered, yet such an attribute differs from a continuous attribute because there are no data between 1 and 2. Therefore, calculating the distance of such attributes must combine the characteristics of unordered categorical attributes and continuous attributes. For example, the attribute "quality" has five ordered categorical values: excellent, good, average, bad, and awful. Obviously "excellent" is better than "good", but we cannot determine by how much. The present invention defines the distance between two attribute values A and B in terms of P(common(A,B)) and P(description(A,B)), where "common" and "description" are interpreted according to the specific field; in the present invention, for example, common(A,B) represents the logical union of A and B, P(common(A,B)) represents the probability of common(A,B), and P(description(A,B)) represents the difference between the probability of A and the probability of B. From this formula it can be computed that the similarity between the two ordered discrete attribute values "excellent" and "good" is greater than the similarity between "excellent" and "average".
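To illustrate how the per-class distances are combined, a sketch of the mixed-type combination under the assumption that the per-attribute distances $d_{ij}^{(f)}$ have already been computed with the formulas above; the function and argument names are illustrative:

```python
import numpy as np

def mixed_distance(per_attr_dist, i_missing, j_missing):
    """Combine per-attribute distances d_ij^(f) into a single mixed-type distance,
    ignoring attributes that are missing in either instance (delta = 0)."""
    per_attr_dist = np.asarray(per_attr_dist, dtype=float)
    delta = (~(np.asarray(i_missing) | np.asarray(j_missing))).astype(float)
    if delta.sum() == 0:
        return np.nan                                    # no attribute is observed in both instances
    contrib = np.where(delta > 0, per_attr_dist, 0.0)    # drop distances on missing attributes
    return (delta * contrib).sum() / delta.sum()
```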
Second, for each missing instance the nearest k training instances are selected, and at the same time the attributes that best fit the missing instance are chosen. The present invention adopts the principle of sparse coding and uses the set A of all complete instances to reconstruct the current missing instance by regression. During the regression analysis, redundant or noisy attributes in A are automatically deleted, i.e., attribute reduction is performed. The regression result is sparse, so many regression parameters are 0. This means that a complete instance whose regression parameter is 0 need not be used to fill the current missing instance. The number of nonzero regression parameters is exactly the number k used by the nearest neighbor algorithm in the next step, and the k obtained for each missing instance is different.
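As an illustration of this step, a sketch under the assumption that the sparse reconstruction is solved as an L1-regularized (lasso) regression of the missing instance's observed attributes on the complete instances; the patent does not prescribe a particular solver, and the function name and the alpha value are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

def choose_k_by_sparse_coding(missing_row, complete_rows, alpha=0.1):
    """Reconstruct the observed part of a missing instance as a sparse linear
    combination of the complete instances; the number of nonzero coefficients
    is taken as the k value for this instance."""
    observed = ~np.isnan(missing_row)
    X = complete_rows[:, observed].T           # columns = complete instances, rows = observed attributes
    y = missing_row[observed]
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    nonzero = np.flatnonzero(np.abs(model.coef_) > 1e-8)
    return max(len(nonzero), 1), nonzero        # at least one neighbour; nonzero indexes candidate donors
```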
Third, the distances between the missing instance and all training instances are calculated, and the nearest k complete instances are chosen. These k complete instances are then used to fill the missing value of the missing instance.
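A sketch of this final step, assuming attributes are numerically coded with NaN for missing values, a row-to-row mixed-type distance function, and the per-instance k obtained in the previous step; categorical attributes are filled with the most frequent neighbour value and continuous ones with the mean (all names are illustrative):

```python
import numpy as np
from collections import Counter

def fill_with_k_neighbours(missing_row, complete_rows, k, distance, categorical):
    """Fill the missing entries of one instance from its k nearest complete instances.
    distance(a, b) returns the mixed-type distance between two rows;
    categorical[f] is True when attribute f is categorical."""
    dists = np.array([distance(missing_row, row) for row in complete_rows])
    neighbours = complete_rows[np.argsort(dists)[:k]]
    filled = missing_row.copy()
    for f in np.flatnonzero(np.isnan(missing_row)):
        values = neighbours[:, f]
        if categorical[f]:
            filled[f] = Counter(values).most_common(1)[0][0]   # most frequent value among the neighbours
        else:
            filled[f] = values.mean()                          # mean value among the neighbours
    return filled
```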
Claims (3)
1. A nearest neighbor filling method with a non-fixed k value, characterized by comprising the following steps:
(1) dividing attributes into five classes: continuous, symmetric binary, asymmetric binary, unordered discrete, and ordered discrete, and defining distance calculation formulas for instances of each attribute class;
(2) selecting, for each missing instance, the nearest k training instances, and at the same time choosing the attributes that best fit the missing instance;
(3) calculating the distances between the missing instance and all training instances, choosing the nearest k complete instances, and then using these k complete instances to fill the missing value of the missing instance.
2. The method according to claim 1, characterized in that the distance calculation formulas for instances of the different attribute classes are as follows:
Mixed type: $d(i,j) = \frac{\sum_{f=1}^{n} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{n} \delta_{ij}^{(f)}}$, where $\delta_{ij}^{(f)}$ indicates whether instance i or j has a missing value on attribute f (0 if there is a missing value, 1 otherwise), f denotes the f-th attribute among the five attribute classes, n is the number of attributes, and $d_{ij}^{(f)}$ is the distance between instances i and j on attribute f;
Continuous type: the distance between two instances over their continuous attributes, where n is the number of continuous attributes in instances i and j, $A_{i,k}$ is the value of the k-th attribute of instance i, and $\bar{A}_i$ is the mean value of the n continuous attributes of instance i;
Symmetric binary type: $d(i,j) = \frac{r+s}{q+r+s+t}$; asymmetric binary type: $d(i,j) = \frac{r+s}{q+r+s}$, where q represents the number of attributes on which both instance i and instance j take the value "1", r represents the number on which instance i takes "0" and instance j takes "1", s represents the number on which instance i takes "1" and instance j takes "0", and t represents the number on which both instance i and instance j take the value "0";
Unordered discrete type: $d(i,j) = \frac{p-m}{p}$, where p is the number of unordered discrete attributes in the data set and m is the number of attributes on which the two instances take the same value;
Ordered discrete type: the distance between two values A and B is defined in terms of the probabilities P(common(A,B)) and P(description(A,B)).
3. The method according to claim 1, characterized in that in step (2) the principle of sparse coding is adopted: the set A of all complete instances is used to reconstruct the current missing instance by regression, redundant or noisy attributes in A are automatically deleted, and the number of nonzero regression parameters is the number k used by the nearest neighbor algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310452387.XA CN103544218A (en) | 2013-09-29 | 2013-09-29 | Nearest neighbor filling method of non-fixed k values |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310452387.XA CN103544218A (en) | 2013-09-29 | 2013-09-29 | Nearest neighbor filling method of non-fixed k values |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103544218A true CN103544218A (en) | 2014-01-29 |
Family
ID=49967670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310452387.XA Pending CN103544218A (en) | 2013-09-29 | 2013-09-29 | Nearest neighbor filling method of non-fixed k values |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103544218A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250431A (en) * | 2016-07-25 | 2016-12-21 | 华南师范大学 | A kind of Color Feature Extraction Method based on classification clothing and costume retrieval system |
CN107193876A (en) * | 2017-04-21 | 2017-09-22 | 美林数据技术股份有限公司 | A kind of missing data complementing method based on arest neighbors KNN algorithms |
CN107273429A (en) * | 2017-05-19 | 2017-10-20 | 哈工大大数据产业有限公司 | A kind of Missing Data Filling method and system based on deep learning |
CN110097170A (en) * | 2019-04-25 | 2019-08-06 | 深圳市豪斯莱科技有限公司 | Information pushes object prediction model acquisition methods, terminal and storage medium |
CN111737463A (en) * | 2020-06-04 | 2020-10-02 | 江苏名通信息科技有限公司 | Big data missing value filling method, device and computer program |
CN111784799A (en) * | 2020-06-30 | 2020-10-16 | 北京百度网讯科技有限公司 | Image filling method, device, equipment and storage medium |
-
2013
- 2013-09-29 CN CN201310452387.XA patent/CN103544218A/en active Pending
Non-Patent Citations (4)
Title |
---|
SHICHAO ZHANG et al.: "Missing data imputation by utilizing information within incomplete instances", 《THE JOURNAL OF SYSTEMS AND SOFTWARE》 *
SHICHAO ZHANG: "Shell-neighbor method and its application in missing data", 《APPLIED INTELLIGENCE》 *
刘星毅 et al.: "基于欧式距离的最近邻改进算法" (An improved nearest neighbor algorithm based on Euclidean distance), 《广西科学院学报》 (Journal of Guangxi Academy of Sciences) *
庄连生 et al.: "非负稀疏局部线性编码" (Non-negative sparse local linear coding), 《软件学报》 (Journal of Software) *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250431A (en) * | 2016-07-25 | 2016-12-21 | 华南师范大学 | A kind of Color Feature Extraction Method based on classification clothing and costume retrieval system |
CN106250431B (en) * | 2016-07-25 | 2019-03-22 | 华南师范大学 | A kind of Color Feature Extraction Method and costume retrieval system based on classification clothes |
CN107193876A (en) * | 2017-04-21 | 2017-09-22 | 美林数据技术股份有限公司 | A kind of missing data complementing method based on arest neighbors KNN algorithms |
CN107193876B (en) * | 2017-04-21 | 2020-10-09 | 美林数据技术股份有限公司 | Missing data filling method based on nearest neighbor KNN algorithm |
CN107273429A (en) * | 2017-05-19 | 2017-10-20 | 哈工大大数据产业有限公司 | A kind of Missing Data Filling method and system based on deep learning |
CN110097170A (en) * | 2019-04-25 | 2019-08-06 | 深圳市豪斯莱科技有限公司 | Information pushes object prediction model acquisition methods, terminal and storage medium |
CN111737463A (en) * | 2020-06-04 | 2020-10-02 | 江苏名通信息科技有限公司 | Big data missing value filling method, device and computer program |
CN111737463B (en) * | 2020-06-04 | 2024-02-09 | 江苏名通信息科技有限公司 | Big data missing value filling method, device and computer readable memory |
CN111784799A (en) * | 2020-06-30 | 2020-10-16 | 北京百度网讯科技有限公司 | Image filling method, device, equipment and storage medium |
CN111784799B (en) * | 2020-06-30 | 2024-01-12 | 北京百度网讯科技有限公司 | Image filling method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Mienye et al. | Prediction performance of improved decision tree-based algorithms: a review | |
CN103544218A (en) | Nearest neighbor filling method of non-fixed k values | |
CN110781406B (en) | Social network user multi-attribute inference method based on variational automatic encoder | |
KR20210040248A (en) | Generative structure-property inverse computational co-design of materials | |
CN114239560B (en) | Three-dimensional image classification method, apparatus, device, and computer-readable storage medium | |
CN104866578A (en) | Hybrid filling method for incomplete data | |
CN109829065B (en) | Image retrieval method, device, equipment and computer readable storage medium | |
US20200250894A1 (en) | Forming a dataset for inference of editable feature trees | |
CN104317838B (en) | Cross-media Hash index method based on coupling differential dictionary | |
US20220414144A1 (en) | Multi-task deep hash learning-based retrieval method for massive logistics product images | |
CN113704082A (en) | Model evaluation method and device, electronic equipment and storage medium | |
CN110619364B (en) | Wavelet neural network three-dimensional model classification method based on cloud model | |
CN110647995A (en) | Rule training method, device, equipment and storage medium | |
CN113254810B (en) | Search result output method and device, computer equipment and readable storage medium | |
CN105808582A (en) | Parallel generation method and device of decision tree on the basis of layered strategy | |
KR101467707B1 (en) | Method for instance-matching in knowledge base and device therefor | |
CN103077228A (en) | Set characteristic vector-based quick clustering method and device | |
CN113516019A (en) | Hyperspectral image unmixing method and device and electronic equipment | |
Bruzzese et al. | DESPOTA: DEndrogram slicing through a pemutation test approach | |
CN103310027B (en) | Rules extraction method for map template coupling | |
CN105138527B (en) | A kind of data classification homing method and device | |
CN114238746A (en) | Cross-modal retrieval method, device, equipment and storage medium | |
de Sá et al. | A novel approach to estimated Boulingand-Minkowski fractal dimension from complex networks | |
CN107944045B (en) | Image search method and system based on t distribution Hash | |
WO2023050461A1 (en) | Data clustering method and system, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20140129 |