CN102622446A - Hadoop based parallel k nearest neighbor classification method - Google Patents

Hadoop based parallel k nearest neighbor classification method

Info

Publication number
CN102622446A
Authority
CN
China
Prior art keywords
data
neighbour
hadoop
test data
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210071445XA
Other languages
Chinese (zh)
Inventor
高阳
杨育彬
王灵江
商琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY
Nanjing University
Original Assignee
JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY, Nanjing University filed Critical JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY
Priority to CN201210071445XA priority Critical patent/CN102622446A/en
Publication of CN102622446A publication Critical patent/CN102622446A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Hadoop-based parallel k-nearest-neighbor classification method, which includes the steps of: preprocessing the data; computing, in parallel at the Mapper end of each Hadoop node, the distances between a test datum and the training data; using a selection algorithm at the Mapper end to determine the local k nearest neighbors of the test datum, and transmitting all local k-nearest-neighbor data to the Reducer end of each Hadoop node; receiving all local k-nearest-neighbor data of the test datum at the Reducer end, and using the selection algorithm to determine the global k nearest neighbors; classifying the test datum with the global k-nearest-neighbor data to obtain its classification result; and repeating the above steps to obtain the classification results of all test data. The method effectively addresses the problem of classifying massive data and greatly increases classification speed.

Description

A parallel k-nearest-neighbor classification method based on Hadoop
Technical field
The present invention relates to a parallel k-nearest-neighbor classification method based on Hadoop.
Background technology
The k-nearest-neighbor (k-NN) classification method is a widely used and effective classification technique. How to handle classification at massive data scales and accomplish this kind of data mining task efficiently is of great research value. In the existing technique, the Mapper end of Hadoop computes, in parallel, the distances between a test datum and all training data; the Reducer end then receives key-value pairs whose key is the test datum and whose value is a distance together with a class label, determines the k nearest neighbors of each test datum, and finally determines its classification result. This technique is simple and easy to use, but the volume of data transmitted between nodes is enormous; for massive data it is essentially infeasible, so its efficiency must be improved for practical application.
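As an illustration of the transmission problem just described, the following sketch shows what the prior-art Mapper amounts to in Hadoop's Java API: every training record produces one (test datum, <distance, class label>) pair, so the entire distance table is shuffled to the Reducers, which must then sift through all of it to find the k nearest neighbors. The class name, record format, and in-memory test set are assumptions made only for illustration and are not taken from the patent.

```java
// Minimal sketch of the prior-art Mapper (illustrative only). Training records are
// assumed to look like "label<TAB>f1,f2" with the same dimensionality as the test set.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NaiveKnnMapper extends Mapper<LongWritable, Text, Text, Text> {

    // For illustration only: a tiny in-memory test set; a real job would load it
    // in setup(), e.g. from the distributed cache.
    private final double[][] testSet = { {1.0, 2.0}, {3.5, 0.5} };

    @Override
    protected void map(LongWritable offset, Text trainingRecord, Context context)
            throws IOException, InterruptedException {
        String[] parts = trainingRecord.toString().split("\t");
        String label = parts[0];
        String[] fs = parts[1].split(",");
        double[] train = new double[fs.length];
        for (int i = 0; i < fs.length; i++) train[i] = Double.parseDouble(fs[i]);

        // The prior art emits one (testId, "distance,label") pair for EVERY training
        // record -- this is the transmission bottleneck the invention removes.
        for (int t = 0; t < testSet.length; t++) {
            double d = 0.0;
            for (int i = 0; i < train.length; i++) {
                double diff = testSet[t][i] - train[i];
                d += diff * diff;
            }
            context.write(new Text("test-" + t), new Text(Math.sqrt(d) + "," + label));
        }
    }
}
```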
Summary of the invention
Goal of the invention: in view of the problems and shortcomings of the above prior art, the purpose of the present invention is to provide a parallel k-nearest-neighbor classification method based on Hadoop that can effectively solve the problem of classifying massive data and greatly increase classification speed.
Technical scheme: to achieve the above goal, the technical scheme adopted by the present invention is a parallel k-nearest-neighbor classification method based on Hadoop, comprising the following steps:
(1) preprocess the data;
(2) at the Mapper end of each Hadoop node, compute in parallel the distances between a test datum and the training data located on that node;
(3) at said Mapper end, determine the local k nearest neighbors of this test datum with a selection algorithm, and send all local k-nearest-neighbor data to the Reducer end of each Hadoop node;
(4) at said Reducer end, receive all local k-nearest-neighbor data of this test datum and determine the global k nearest neighbors with the selection algorithm;
(5) classify this test datum using said global k-nearest-neighbor data to obtain its classification result;
(6) repeat steps (2) to (5) to obtain the classification results of all test data.
Said local k-nearest-neighbor data may take the form (key, value), where the key is the test datum and the value is the combination of said distance and the class label of the training datum.
In said step (5), the classification may be based on D-F theory.
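For orientation, the following is a minimal sketch of how such a job might be configured with Hadoop's MapReduce API. The class names LocalKnnMapper and GlobalKnnReducer and the configuration key knn.k are illustrative assumptions (the patent names no concrete classes); they refer to the Mapper and Reducer sketches given in the detailed description below.

```java
// Hedged sketch of a driver for steps (1)-(6); not taken from the patent text.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelKnnDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("knn.k", 5);                       // k, read by the Mapper and Reducer
        Job job = Job.getInstance(conf, "parallel k-NN classification");
        job.setJarByClass(ParallelKnnDriver.class);
        job.setMapperClass(LocalKnnMapper.class);      // steps (2)-(3): local k-NN per node
        job.setReducerClass(GlobalKnnReducer.class);   // steps (4)-(5): global k-NN and classification
        job.setOutputKeyClass(Text.class);             // key: test datum id
        job.setOutputValueClass(Text.class);           // value: "distance,label" pairs / final label
        FileInputFormat.addInputPath(job, new Path(args[0]));   // preprocessed training data
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // classification results
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Submitted with the standard `hadoop jar` command, such a job would read the preprocessed training data from the first argument and write the results for the test data to the second.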
Beneficial effects: the present invention makes full use of the data-intensive nature of the task, computing the local k-nearest-neighbor data in parallel and then deriving the global k-nearest-neighbor data from them. Experimental results show that the method significantly reduces the amount of data transmitted between nodes and thereby greatly improves classification efficiency; it performs well on large-scale classification problems and exhibits a good speed-up ratio.
Description of drawings
Fig. 1 is the flow chart of the method of the invention;
Fig. 2 is a schematic comparison of the classification time of the method of the invention and the existing method, with 4.4 MB of test data and 39.7 MB of training data (ordinate: classification time in s; abscissa: number of nodes);
Fig. 3 is a schematic comparison of the running speed of the method of the invention and the existing method, with 10.8 MB of test data and 97.8 MB of training data (ordinate: classification time in s; abscissa: number of nodes);
Fig. 4 is a schematic comparison of the speed-up ratio achieved by the method of the invention under different node counts on a very large data volume with the ideal speed-up ratio, with 4.4 MB of test data and 3.9 GB of training data (ordinate: speed-up multiple; abscissa: number of nodes).
Embodiment
The present invention is further illustrated below in conjunction with the accompanying drawings and a specific embodiment. It should be understood that this embodiment is only intended to illustrate the present invention and not to limit its scope; after reading the present disclosure, modifications of various equivalent forms of the present invention made by those skilled in the art all fall within the scope defined by the appended claims of this application.
As shown in Fig. 1, the steps of the method of the invention are detailed below:
Step 1: preprocess the data, for example by computing the TF-IDF values of text data.
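A minimal sketch of the TF-IDF computation mentioned in Step 1, in plain Java; in practice the preprocessing could itself be run as a MapReduce job. The class and method names are illustrative assumptions.

```java
// tf-idf(t, d) = tf(t, d) * log(N / df(t)), computed per tokenized document.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdf {
    public static List<Map<String, Double>> compute(List<List<String>> docs) {
        int n = docs.size();
        Map<String, Integer> df = new HashMap<>();        // document frequency per term
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc))
                df.merge(term, 1, Integer::sum);

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();    // term frequency in this document
            for (String term : doc) tf.merge(term, 1, Integer::sum);
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                weights.put(e.getKey(), e.getValue() * idf);
            }
            result.add(weights);
        }
        return result;
    }
}
```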
Step 2: at the Mapper end of each Hadoop node, compute in parallel the distances between a test datum and the training data located on that node (for example, compute the cosine distance for text data and the Euclidean distance for vector-type data).
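The two distances named in Step 2 can be sketched as follows, with cosine distance computed on sparse TF-IDF term-weight maps and Euclidean distance on dense feature vectors; the class name and signatures are illustrative assumptions.

```java
import java.util.Map;

public class Distances {
    /** Cosine distance = 1 - cosine similarity, on sparse term-weight maps. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) nb += w * w;
        if (na == 0.0 || nb == 0.0) return 1.0;           // empty vector or no overlap
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Euclidean distance on dense feature vectors of equal length. */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```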
Step 3: at the Mapper end, determine the local k nearest neighbors of this test datum with a selection algorithm; take the test datum as the key and the combination of the distance and the class label of the training datum (in the form <distance, class label>) as the value to form key-value pairs, and send all local k-nearest-neighbor data to the Reducer end.
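The patent only specifies "a selection algorithm" here. One common way to realize it inside a Mapper is a bounded max-heap per test datum, so that each Mapper emits at most k <distance, class label> pairs per test datum in cleanup(). The sketch below uses that approach with Euclidean distance; the class name, record format, and in-memory test set are assumptions made for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LocalKnnMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static class Neighbor {
        final double dist; final String label;
        Neighbor(double dist, String label) { this.dist = dist; this.label = label; }
    }

    private int k;
    // For illustration only: a tiny in-memory test set with the same dimensionality as
    // the training records; a real job would load it in setup(), e.g. from the cache.
    private final double[][] testSet = { {1.0, 2.0}, {3.5, 0.5} };
    // One bounded max-heap per test datum: the farthest kept neighbor sits on top,
    // so it can be evicted as soon as a closer training datum is seen.
    private List<PriorityQueue<Neighbor>> localKnn;

    @Override
    protected void setup(Context context) {
        k = context.getConfiguration().getInt("knn.k", 5);
        localKnn = new ArrayList<>();
        for (int t = 0; t < testSet.length; t++) {
            localKnn.add(new PriorityQueue<>(
                    Comparator.comparingDouble((Neighbor n) -> n.dist).reversed()));
        }
    }

    @Override
    protected void map(LongWritable offset, Text trainingRecord, Context context) {
        String[] parts = trainingRecord.toString().split("\t");   // "label<TAB>f1,f2"
        String label = parts[0];
        String[] fs = parts[1].split(",");
        double[] train = new double[fs.length];
        for (int i = 0; i < fs.length; i++) train[i] = Double.parseDouble(fs[i]);

        for (int t = 0; t < testSet.length; t++) {
            double d = 0.0;
            for (int i = 0; i < train.length; i++) {
                double diff = testSet[t][i] - train[i];
                d += diff * diff;
            }
            d = Math.sqrt(d);
            PriorityQueue<Neighbor> heap = localKnn.get(t);
            if (heap.size() < k) {
                heap.add(new Neighbor(d, label));
            } else if (d < heap.peek().dist) {
                heap.poll();                              // drop the current farthest neighbor
                heap.add(new Neighbor(d, label));
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Only the local k nearest neighbors per test datum leave this node.
        for (int t = 0; t < testSet.length; t++) {
            for (Neighbor n : localKnn.get(t)) {
                context.write(new Text("test-" + t), new Text(n.dist + "," + n.label));
            }
        }
    }
}
```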
Step 4: at the Reducer end, receive all values belonging to the same key and, based on the distances contained in those values, determine the final global k nearest neighbors with the selection algorithm.
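A corresponding sketch of the Reducer in Step 4: for each test datum it merges the at-most-k local candidates arriving from every Mapper and keeps the k globally closest, again using a bounded max-heap as the selection step. The "distance,label" value format matches the Mapper sketch above and is an assumption, not a format prescribed by the patent.

```java
import java.io.IOException;
import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GlobalKnnReducer extends Reducer<Text, Text, Text, Text> {

    private int k;

    @Override
    protected void setup(Context context) {
        k = context.getConfiguration().getInt("knn.k", 5);
    }

    @Override
    protected void reduce(Text testId, Iterable<Text> localNeighbors, Context context)
            throws IOException, InterruptedException {
        // Max-heap of "distance,label" strings with the farthest kept neighbor on top.
        PriorityQueue<String> global = new PriorityQueue<>(
                Comparator.comparingDouble(
                        (String s) -> Double.parseDouble(s.split(",")[0])).reversed());

        for (Text value : localNeighbors) {
            String s = value.toString();                  // "distance,label"
            double d = Double.parseDouble(s.split(",")[0]);
            if (global.size() < k) {
                global.add(s);
            } else if (d < Double.parseDouble(global.peek().split(",")[0])) {
                global.poll();                            // drop the current farthest neighbor
                global.add(s);
            }
        }
        // Emit the global k nearest neighbors; Step 5 turns them into a class label.
        context.write(testId, new Text(String.join(";", global)));
    }
}
```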
Step 5: based on D-F theory, classify this test datum using the global k-nearest-neighbor data and determine its class label.
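The patent does not spell out the D-F theory used in Step 5, so the sketch below substitutes a simple distance-weighted majority vote over the global k nearest neighbors purely to illustrate the interface of this step; it should not be read as the patent's actual classification criterion.

```java
import java.util.HashMap;
import java.util.Map;

public class KnnVote {
    /** distances[i] and labels[i] describe the i-th global nearest neighbor of one test datum. */
    public static String classify(double[] distances, String[] labels) {
        Map<String, Double> score = new HashMap<>();
        for (int i = 0; i < distances.length; i++) {
            double w = 1.0 / (distances[i] + 1e-9);       // closer neighbors weigh more
            score.merge(labels[i], w, Double::sum);
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : score.entrySet()) {
            if (e.getValue() > bestScore) { bestScore = e.getValue(); best = e.getKey(); }
        }
        return best;
    }
}
```

For example, classify(new double[]{0.2, 0.5, 0.9}, new String[]{"A", "B", "A"}) returns "A", since the two "A" neighbors together outweigh the single, nearer-than-one-of-them "B" neighbor.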
Step 6: repeat steps 2 to 5 to obtain the classification results of all test data.
The original Hadoop-based parallel k-NN classification algorithm simply computes the distances between the test data and the training data in parallel at the Mapper end, then receives them at the Reducer end, determines the k nearest neighbors there, and completes the final classification. The present method improves on this scheme by drastically reducing the volume of intermediate data transmitted, and thus greatly increases classification speed. Unlike the original method, the method of the present invention not only computes distances at the Mapper end but also filters out the local k-nearest-neighbor data there, which makes the computation more compact, reduces data transmission, and yields a speed-up ratio closer to the ideal.
As shown in Figs. 2 and 3, under different node counts the method of the invention saves about 1/2 to 4/5 of the classification time compared with the prior art.
As shown in Fig. 4, on very large data volumes the gap between the speed-up ratio achieved by the method of the invention under different node counts and the ideal speed-up ratio is very small, and with no more than 3 nodes there is almost no difference.

Claims (3)

1. A parallel k-nearest-neighbor classification method based on Hadoop, comprising the steps of:
(1) preprocessing the data;
(2) at the Mapper end of each node of the Hadoop platform, computing in parallel the distances between a test datum and the training data located on that node;
(3) at said Mapper end, determining the local k nearest neighbors of this test datum with a selection algorithm, and sending all local k-nearest-neighbor data to the Reducer end of each Hadoop node;
(4) at said Reducer end, receiving all local k-nearest-neighbor data of this test datum and determining the global k nearest neighbors with the selection algorithm;
(5) classifying this test datum using said global k-nearest-neighbor data to obtain its classification result;
(6) repeating steps (2) to (5) to obtain the classification results of all test data.
2. The parallel k-nearest-neighbor classification method based on Hadoop according to claim 1, characterized in that said local k-nearest-neighbor data are (key, value) pairs, wherein the key is the test datum and the value is the combination of said distance and the class label of the training datum.
3. The parallel k-nearest-neighbor classification method based on Hadoop according to claim 1, characterized in that in said step (5), the classification is based on D-F theory.
CN201210071445XA 2012-03-19 2012-03-19 Hadoop based parallel k nearest neighbor classification method Pending CN102622446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210071445XA CN102622446A (en) 2012-03-19 2012-03-19 Hadoop based parallel k nearest neighbor classification method


Publications (1)

Publication Number Publication Date
CN102622446A true CN102622446A (en) 2012-08-01

Family

ID=46562365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210071445XA Pending CN102622446A (en) 2012-03-19 2012-03-19 Hadoop based parallel k nearest neighbor classification method

Country Status (1)

Country Link
CN (1) CN102622446A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1953442A (en) * 2006-09-14 2007-04-25 浙江大学 Method of k-neighbour query based on data mesh
CN101799748A (en) * 2009-02-06 2010-08-11 中国移动通信集团公司 Method for determining data sample class and system thereof
CN102243641A (en) * 2011-04-29 2011-11-16 西安交通大学 Method for efficiently clustering massive data

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291954A (en) * 2017-07-28 2017-10-24 南京邮电大学 A kind of OCL parallel query methods based on MapReduce
CN107291954B (en) * 2017-07-28 2020-07-31 南京邮电大学 OCL parallel query method based on MapReduce
CN111814892A (en) * 2020-07-16 2020-10-23 贵州民族大学 Design method for constructing parallel KNN classifier by distributed objects


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120801