CN102622446A - Hadoop based parallel k nearest neighbor classification method - Google Patents
- Publication number
- CN102622446A
- Authority
- CN
- China
- Prior art keywords
- data
- neighbour
- hadoop
- test data
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Hadoop-based parallel k-nearest-neighbor classification method comprising the following steps: preprocessing the data; computing in parallel, at the Mapper end of each Hadoop node, the distances between a test datum and the training data; using a selection algorithm at the Mapper ends to determine the local k-nearest-neighbor data of the test datum, and transmitting all local k-nearest-neighbor data to the Reducer end of each Hadoop node; receiving all local k-nearest-neighbor data of the test datum at the Reducer ends and using the selection algorithm to determine the global k-nearest-neighbor data; classifying the test datum with the global k-nearest-neighbor data to obtain its classification result; and repeating these steps to obtain the classification results of all test data. The method can effectively solve the classification problem for massive data and greatly increases classification speed.
Description
Technical field
The present invention relates to a parallel k-nearest-neighbor classification method based on Hadoop.
Background art
The k-nearest-neighbor classification method is a widely used and effective classification technique. How to classify data at massive scale, and how to complete such data mining tasks efficiently, is of great research value. In the existing technique, the Mapper end of Hadoop computes in parallel the distances between the test data and all training data. The Reducer end then receives records whose key is the test datum and whose value is the distance together with the class label, determines the k nearest neighbors of each test datum, and finally determines the classification result of the test datum. This technique is simple and easy to use, but the volume of transmitted data is enormous. For massive data in particular it is practically infeasible, so the efficiency problem must be solved before the technique can be used in practice.
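The existing approach described above can be sketched in plain Python, without Hadoop, to make the data flow concrete. This is a minimal illustration only: the function names `baseline_map` and `baseline_reduce`, the use of Euclidean distance, and the in-memory data layout are assumptions made for the example and do not come from the patent.

```python
import math
from collections import defaultdict


def baseline_map(test_points, train_split):
    """Emit one (test_id, (distance, label)) record per training point.

    Every distance is shipped to the reducer, which is what makes the
    transmitted data so large in the existing technique.
    """
    for test_id, x in test_points.items():
        for features, label in train_split:
            yield test_id, (math.dist(x, features), label)


def baseline_reduce(records, k):
    """Group the records by test_id and keep the k smallest distances."""
    grouped = defaultdict(list)
    for test_id, pair in records:
        grouped[test_id].append(pair)
    return {test_id: sorted(pairs)[:k] for test_id, pairs in grouped.items()}
```

Note that the mapper output grows with the number of training points for every test datum, regardless of k.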
Summary of the invention
Object of the invention: in view of the problems and shortcomings of the prior art described above, the object of the invention is to provide a Hadoop-based parallel k-nearest-neighbor classification method that can effectively solve the classification problem for massive data and greatly increase classification speed.
Technical solution: to achieve the above object, the invention adopts a Hadoop-based parallel k-nearest-neighbor classification method comprising the following steps:
(1) preprocessing the data;
(2) at the Mapper end of each Hadoop node, computing in parallel the distances between a test datum and the training data located on that node;
(3) at the Mapper end, determining the local k-nearest-neighbor data of the test datum with a selection algorithm, and sending all local k-nearest-neighbor data to the Reducer end of each Hadoop node;
(4) at the Reducer end, receiving all local k-nearest-neighbor data of the test datum and determining the global k-nearest-neighbor data with the selection algorithm;
(5) classifying the test datum using the global k-nearest-neighbor data to obtain its classification result;
(6) repeating steps (2) to (5) to obtain the classification results of all test data.
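Steps (2) to (6) can be sketched in plain Python as follows. This is an illustrative simulation rather than Hadoop code, and several choices are assumptions not taken from the patent: `heapq.nsmallest` stands in for the unspecified selection algorithm, Euclidean distance is used, the final decision is a simple majority vote, and all function names are hypothetical.

```python
import heapq
import math
from collections import Counter


def map_local_knn(test_id, x, train_split, k):
    """Steps (2)-(3): compute distances on one node, emit only the local k nearest."""
    pairs = ((math.dist(x, features), label) for features, label in train_split)
    for pair in heapq.nsmallest(k, pairs):
        yield test_id, pair  # key = test datum id, value = (distance, class label)


def reduce_global_knn(test_id, local_pairs, k):
    """Steps (4)-(5): merge local candidates, keep the global k, then vote."""
    nearest = heapq.nsmallest(k, local_pairs)
    votes = Counter(label for _, label in nearest)
    return test_id, votes.most_common(1)[0][0]  # majority vote is an assumption


def classify_all(test_points, train_splits, k):
    """Step (6): repeat for every test datum; each split plays the role of a node."""
    results = {}
    for test_id, x in test_points.items():
        shuffled = []
        for split in train_splits:
            shuffled.extend(pair for _, pair in map_local_knn(test_id, x, split, k))
        _, results[test_id] = reduce_global_knn(test_id, shuffled, k)
    return results
```

The global k nearest neighbors are always contained in the union of the local k nearest neighbors, so the reducer sees at most k candidates per node and still returns the same neighbors as a full scan.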
The local k-nearest-neighbor data may take the form (key, value), where key is the test datum and value is the combination of the distance and the class label of the training datum.
In step (5), the classification may be based on D-F theory.
Beneficial effects: the invention makes full use of data-intensive computation, calculating the local k-nearest-neighbor data in parallel and then deriving the global k-nearest-neighbor data from them. Experimental results show that the method significantly reduces the volume of data transmitted between nodes and thereby greatly improves classification efficiency. It handles large-scale classification problems well and achieves a good speedup.
Description of drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 compares the classification time of the method of the invention with that of the existing method, for 4.4M of test data and 39.7M of training data (ordinate: classification time in seconds; abscissa: number of nodes);
Fig. 3 compares the running speed of the method of the invention with that of the existing method, shown as classification time for 10.8M of test data and 97.8M of training data (ordinate: classification time in seconds; abscissa: number of nodes);
Fig. 4 compares the speedup achieved by the method of the invention at different node counts with the ideal speedup when the data volume is very large, for 4.4M of test data and 3.9G of training data (ordinate: speedup factor; abscissa: number of nodes).
Embodiment
The invention is further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments serve only to illustrate the invention and not to limit its scope. After reading this disclosure, modifications of the invention in various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.
As shown in Fig. 1, the steps of the method are described in detail below:
Step 2: at the Mapper end of each Hadoop node, compute in parallel the distances between a test datum and the training data located on that node (for example, the cosine distance for text data and the Euclidean distance for vector data).
Step 6: repeat steps 2 to 5 to obtain the classification results of all test data.
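Step 2 names two distance measures. A minimal Python sketch of both is given here for illustration. The representation of a text as a sparse dictionary of term weights, the convention of returning 1.0 for an all-zero vector, and the function names are assumptions of this example, not details from the patent.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse term-weight dicts (text data)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumed convention: an empty document is maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

Either function can be passed to the Mapper as the distance measure, since the selection of the k nearest neighbors only needs distances that can be compared.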
The original Hadoop-based parallel k-nearest-neighbor classification algorithm simply computes the distances between the test data and the training data in parallel at the Mapper end, then receives them at the Reducer end and determines the k-nearest-neighbor data, thereby completing the final classification. The present method improves on this by significantly reducing the volume of intermediate data transmitted, which greatly increases classification speed. Unlike the original method, the Mapper end in the method of the invention not only computes distances but also filters out the local k-nearest-neighbor data. This makes the computation more data-intensive, reduces data transmission, and yields a speedup closer to the ideal.
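The reduction in intermediate data can be made concrete with a small count of the records sent from Mappers to Reducers. The sketch below is a back-of-the-envelope illustration: the function name and the figures in the usage note are hypothetical and are not measurements from the patent.

```python
def shuffled_records(num_test, num_train, num_mappers, k):
    """Return (original, improved) counts of Mapper-to-Reducer records.

    original: every (test, training) distance is transmitted.
    improved: each Mapper transmits at most k records per test datum,
              so the count shown is an upper bound.
    """
    original = num_test * num_train
    improved = num_test * min(k * num_mappers, num_train)
    return original, improved
```

For example, with the hypothetical values of 1,000 test data, 1,000,000 training data, 8 Mappers, and k = 10, `shuffled_records(1000, 1_000_000, 8, 10)` gives 1,000,000,000 records for the original scheme against at most 80,000 for the improved one.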
As shown in Fig. 2 and Fig. 3, across different numbers of nodes, the classification time of the method of the invention is about 1/2 to 4/5 shorter than that of the prior art.
As shown in Fig. 4, when the data volume is very large, the gap between the speedup achieved by the method at different node counts and the ideal speedup is small. With no more than 3 nodes there is almost no difference.
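The text does not define the speedup it reports. Assuming the conventional definition, with T(n) denoting the classification time on n nodes (a symbol introduced here for illustration), the measured and ideal speedups would be:

```latex
S(n) = \frac{T(1)}{T(n)}, \qquad S_{\text{ideal}}(n) = n
```

Under this assumption, a curve close to the ideal one means that doubling the number of nodes roughly halves the classification time.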
Claims (3)
1. A Hadoop-based parallel k-nearest-neighbor classification method, comprising the following steps:
(1) preprocessing the data;
(2) at the Mapper end of each node of the Hadoop platform, computing in parallel the distances between a test datum and the training data located on that node;
(3) at the Mapper end, determining the local k-nearest-neighbor data of the test datum with a selection algorithm, and sending all local k-nearest-neighbor data to the Reducer end of each Hadoop node;
(4) at the Reducer end, receiving all local k-nearest-neighbor data of the test datum and determining the global k-nearest-neighbor data with the selection algorithm;
(5) classifying the test datum using the global k-nearest-neighbor data to obtain its classification result;
(6) repeating steps (2) to (5) to obtain the classification results of all test data.
2. The Hadoop-based parallel k-nearest-neighbor classification method according to claim 1, characterized in that the local k-nearest-neighbor data take the form (key, value), where key is the test datum and value is the combination of the distance and the class label of the training datum.
3. The Hadoop-based parallel k-nearest-neighbor classification method according to claim 1, characterized in that, in step (5), the classification is based on D-F theory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210071445XA CN102622446A (en) | 2012-03-19 | 2012-03-19 | Hadoop based parallel k nearest neighbor classification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210071445XA CN102622446A (en) | 2012-03-19 | 2012-03-19 | Hadoop based parallel k nearest neighbor classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102622446A true CN102622446A (en) | 2012-08-01 |
Family
ID=46562365
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210071445XA Pending CN102622446A (en) | 2012-03-19 | 2012-03-19 | Hadoop based parallel k nearest neighbor classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102622446A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1953442A (en) * | 2006-09-14 | 2007-04-25 | 浙江大学 | Method of k-neighbour query based on data mesh |
CN101799748A (en) * | 2009-02-06 | 2010-08-11 | 中国移动通信集团公司 | Method for determining data sample class and system thereof |
CN102243641A (en) * | 2011-04-29 | 2011-11-16 | 西安交通大学 | Method for efficiently clustering massive data |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291954A (en) * | 2017-07-28 | 2017-10-24 | 南京邮电大学 | A kind of OCL parallel query methods based on MapReduce |
CN107291954B (en) * | 2017-07-28 | 2020-07-31 | 南京邮电大学 | OC L parallel query method based on MapReduce |
CN111814892A (en) * | 2020-07-16 | 2020-10-23 | 贵州民族大学 | Design method for constructing parallel KNN classifier by distributed objects |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105956093B (en) | A kind of personalized recommendation method based on multiple view anchor point figure Hash technology | |
Goyal | A survey on travelling salesman problem | |
CN106254321A (en) | A kind of whole network abnormal data stream sorting technique | |
Jiang et al. | Effects of efficient edge rewiring strategies on network transport efficiency | |
CN104102833B (en) | Based on the tax index normalization found between compact district and fusion calculation method | |
CN118376258B (en) | Vehicle path planning method and system based on cloud network edge | |
CN113422695A (en) | Optimization method for improving robustness of topological structure of Internet of things | |
CN104751200B (en) | A kind of method of SVM network traffic classification | |
CN102622446A (en) | Hadoop based parallel k nearest neighbor classification method | |
Wang et al. | Evolutionary algorithm-based and network architecture search-enabled multiobjective traffic classification | |
CN103544328A (en) | Parallel k mean value clustering method based on Hadoop | |
Bahmanikashkooli et al. | Application of particle swarm optimization algorithm for computing critical depth of horseshoe cross section tunnel | |
CN105138527A (en) | Data classification regression method and data classification regression device | |
CN103218441B (en) | A kind of content-based image search method with feeding back | |
CN104217118A (en) | Vessel pilot scheduling problem model and solving method | |
Yeo et al. | Big data: Cloud computing in genomics applications | |
CN109600359A (en) | Data distributing method based on node data processing capacity and node operation load | |
CN103336963A (en) | Method and device for image feature extraction | |
CN105631210A (en) | Directed digraph strongly-connected component analysis method based on MapReduce | |
CN104881530A (en) | Hobbing dry-cutting processing method based on optimized technical parameter | |
CN107018027B (en) | Link prediction method based on Bayesian estimation and common neighbor node degree | |
CN104639606B (en) | A kind of optimization method of differentiation contrast piecemeal | |
CN103500359A (en) | Radar radiation source identification method based on structure equivalence type fuzzy neural network | |
Chatterjee et al. | Pattern matching based algorithms for graph compression | |
CN105653501A (en) | Kriging interpolation acceleration method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20120801 |