CN102622446A - Hadoop based parallel k nearest neighbor classification method - Google Patents

Hadoop based parallel k nearest neighbor classification method

Info

Publication number
CN102622446A
Authority
CN
China
Prior art keywords
data
neighbour
hadoop
test data
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210071445XA
Other languages
Chinese (zh)
Inventor
高阳
杨育彬
王灵江
商琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY
Nanjing University
Original Assignee
JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY, Nanjing University filed Critical JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY
Priority to CN201210071445XA priority Critical patent/CN102622446A/en
Publication of CN102622446A publication Critical patent/CN102622446A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Hadoop-based parallel k-nearest-neighbor classification method, which includes the steps of: preprocessing the data; computing, in parallel at the Mapper end of each Hadoop node, the distances between a test datum and the training data; using a selection algorithm at the Mapper end to determine the local k nearest neighbors of the test datum, and transmitting all local k-nearest-neighbor data to the Reducer end of each Hadoop node; receiving all local k-nearest-neighbor data of the test datum at the Reducer end, and using the selection algorithm to determine the global k nearest neighbors; classifying the test datum with the global k-nearest-neighbor data to obtain its classification result; and repeating the above steps to obtain the classification results of all test data. The method effectively addresses the problem of classifying massive data and greatly increases classification speed.

Description

A parallel k-nearest-neighbor classification method based on Hadoop
Technical field
The present invention relates to a parallel k-nearest-neighbor classification method based on Hadoop.
Background technology
The k-nearest-neighbor (k-NN) classification method is a widely used and effective classification technique. How to handle classification at massive data scales and accomplish this kind of data mining task efficiently is of great research value. In the existing technique, the Mapper end of Hadoop computes, in parallel, the distances between a test datum and all training data; the Reducer end then receives key-value pairs whose key is the test datum and whose value is a distance together with a class label, determines the k nearest neighbors of each test datum, and finally determines its classification result. This technique is simple and easy to use, but the volume of data transmitted between nodes is enormous; for massive data it is essentially infeasible, so its efficiency must be improved for practical application.
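As an illustration of the transmission problem just described, the following sketch shows what the prior-art Mapper amounts to in Hadoop's Java API: every training record produces one (test datum, <distance, class label>) pair, so the entire distance table is shuffled to the Reducers, which must then sift through all of it to find the k nearest neighbors. The class name, record format, and in-memory test set are assumptions made only for illustration and are not taken from the patent.

```java
// Minimal sketch of the prior-art Mapper (illustrative only). Training records are
// assumed to look like "label<TAB>f1,f2" with the same dimensionality as the test set.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NaiveKnnMapper extends Mapper<LongWritable, Text, Text, Text> {

    // For illustration only: a tiny in-memory test set; a real job would load it
    // in setup(), e.g. from the distributed cache.
    private final double[][] testSet = { {1.0, 2.0}, {3.5, 0.5} };

    @Override
    protected void map(LongWritable offset, Text trainingRecord, Context context)
            throws IOException, InterruptedException {
        String[] parts = trainingRecord.toString().split("\t");
        String label = parts[0];
        String[] fs = parts[1].split(",");
        double[] train = new double[fs.length];
        for (int i = 0; i < fs.length; i++) train[i] = Double.parseDouble(fs[i]);

        // The prior art emits one (testId, "distance,label") pair for EVERY training
        // record -- this is the transmission bottleneck the invention removes.
        for (int t = 0; t < testSet.length; t++) {
            double d = 0.0;
            for (int i = 0; i < train.length; i++) {
                double diff = testSet[t][i] - train[i];
                d += diff * diff;
            }
            context.write(new Text("test-" + t), new Text(Math.sqrt(d) + "," + label));
        }
    }
}
```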
Summary of the invention
Goal of the invention: in view of the problems and shortcomings of the above prior art, the purpose of the present invention is to provide a parallel k-nearest-neighbor classification method based on Hadoop that can effectively solve the problem of classifying massive data and greatly increase classification speed.
Technical scheme: to achieve the above goal, the technical scheme adopted by the present invention is a parallel k-nearest-neighbor classification method based on Hadoop, comprising the following steps:
(1) preprocess the data;
(2) at the Mapper end of each Hadoop node, compute in parallel the distances between a test datum and the training data located on that node;
(3) at said Mapper end, determine the local k nearest neighbors of this test datum with a selection algorithm, and send all local k-nearest-neighbor data to the Reducer end of each Hadoop node;
(4) at said Reducer end, receive all local k-nearest-neighbor data of this test datum and determine the global k nearest neighbors with the selection algorithm;
(5) classify this test datum using said global k-nearest-neighbor data to obtain its classification result;
(6) repeat steps (2) to (5) to obtain the classification results of all test data.
Said local k-nearest-neighbor data may take the form (key, value), where the key is the test datum and the value is the combination of said distance and the class label of the training datum.
In said step (5), the classification may be based on D-F theory.
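For orientation, the following is a minimal sketch of how such a job might be configured with Hadoop's MapReduce API. The class names LocalKnnMapper and GlobalKnnReducer and the configuration key knn.k are illustrative assumptions (the patent names no concrete classes); they refer to the Mapper and Reducer sketches given in the detailed description below.

```java
// Hedged sketch of a driver for steps (1)-(6); not taken from the patent text.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelKnnDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("knn.k", 5);                       // k, read by the Mapper and Reducer
        Job job = Job.getInstance(conf, "parallel k-NN classification");
        job.setJarByClass(ParallelKnnDriver.class);
        job.setMapperClass(LocalKnnMapper.class);      // steps (2)-(3): local k-NN per node
        job.setReducerClass(GlobalKnnReducer.class);   // steps (4)-(5): global k-NN and classification
        job.setOutputKeyClass(Text.class);             // key: test datum id
        job.setOutputValueClass(Text.class);           // value: "distance,label" pairs / final label
        FileInputFormat.addInputPath(job, new Path(args[0]));   // preprocessed training data
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // classification results
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Submitted with the standard `hadoop jar` command, such a job would read the preprocessed training data from the first argument and write the results for the test data to the second.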
Beneficial effects: the present invention makes full use of the data-intensive nature of the task, computing the local k-nearest-neighbor data in parallel and then deriving the global k-nearest-neighbor data from them. Experimental results show that the method significantly reduces the amount of data transmitted between nodes and thereby greatly improves classification efficiency; it performs well on large-scale classification problems and exhibits a good speed-up ratio.
Description of drawings
Fig. 1 is the flow chart of the method of the invention;
Fig. 2 is a schematic comparison of the classification time of the method of the invention and the existing method, with 4.4 MB of test data and 39.7 MB of training data (ordinate: classification time in s; abscissa: number of nodes);
Fig. 3 is a schematic comparison of the running speed of the method of the invention and the existing method, with 10.8 MB of test data and 97.8 MB of training data (ordinate: classification time in s; abscissa: number of nodes);
Fig. 4 is a schematic comparison of the speed-up ratio achieved by the method of the invention under different node counts on a very large data volume with the ideal speed-up ratio, with 4.4 MB of test data and 3.9 GB of training data (ordinate: speed-up multiple; abscissa: number of nodes).
Embodiment
The present invention is further illustrated below in conjunction with the accompanying drawings and a specific embodiment. It should be understood that this embodiment is only intended to illustrate the present invention and not to limit its scope; after reading the present disclosure, modifications of various equivalent forms of the present invention made by those skilled in the art all fall within the scope defined by the appended claims of this application.
As shown in Fig. 1, the steps of the method of the invention are detailed below:
Step 1: preprocess the data, for example by computing the TF-IDF values of text data.
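A minimal sketch of the TF-IDF computation mentioned in Step 1, in plain Java; in practice the preprocessing could itself be run as a MapReduce job. The class and method names are illustrative assumptions.

```java
// tf-idf(t, d) = tf(t, d) * log(N / df(t)), computed per tokenized document.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdf {
    public static List<Map<String, Double>> compute(List<List<String>> docs) {
        int n = docs.size();
        Map<String, Integer> df = new HashMap<>();        // document frequency per term
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc))
                df.merge(term, 1, Integer::sum);

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();    // term frequency in this document
            for (String term : doc) tf.merge(term, 1, Integer::sum);
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                weights.put(e.getKey(), e.getValue() * idf);
            }
            result.add(weights);
        }
        return result;
    }
}
```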
Step 2: at the Mapper end of each Hadoop node, compute in parallel the distances between a test datum and the training data located on that node (for example, compute the cosine distance for text data and the Euclidean distance for vector-type data).
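The two distances named in Step 2 can be sketched as follows, with cosine distance computed on sparse TF-IDF term-weight maps and Euclidean distance on dense feature vectors; the class name and signatures are illustrative assumptions.

```java
import java.util.Map;

public class Distances {
    /** Cosine distance = 1 - cosine similarity, on sparse term-weight maps. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) nb += w * w;
        if (na == 0.0 || nb == 0.0) return 1.0;           // empty vector or no overlap
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Euclidean distance on dense feature vectors of equal length. */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```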
Step 3: at the Mapper end, determine the local k nearest neighbors of this test datum with a selection algorithm; take the test datum as the key and the combination of the distance and the class label of the training datum (in the form <distance, class label>) as the value to form key-value pairs, and send all local k-nearest-neighbor data to the Reducer end.
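The patent only specifies "a selection algorithm" here. One common way to realize it inside a Mapper is a bounded max-heap per test datum, so that each Mapper emits at most k <distance, class label> pairs per test datum in cleanup(). The sketch below uses that approach with Euclidean distance; the class name, record format, and in-memory test set are assumptions made for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LocalKnnMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static class Neighbor {
        final double dist; final String label;
        Neighbor(double dist, String label) { this.dist = dist; this.label = label; }
    }

    private int k;
    // For illustration only: a tiny in-memory test set with the same dimensionality as
    // the training records; a real job would load it in setup(), e.g. from the cache.
    private final double[][] testSet = { {1.0, 2.0}, {3.5, 0.5} };
    // One bounded max-heap per test datum: the farthest kept neighbor sits on top,
    // so it can be evicted as soon as a closer training datum is seen.
    private List<PriorityQueue<Neighbor>> localKnn;

    @Override
    protected void setup(Context context) {
        k = context.getConfiguration().getInt("knn.k", 5);
        localKnn = new ArrayList<>();
        for (int t = 0; t < testSet.length; t++) {
            localKnn.add(new PriorityQueue<>(
                    Comparator.comparingDouble((Neighbor n) -> n.dist).reversed()));
        }
    }

    @Override
    protected void map(LongWritable offset, Text trainingRecord, Context context) {
        String[] parts = trainingRecord.toString().split("\t");   // "label<TAB>f1,f2"
        String label = parts[0];
        String[] fs = parts[1].split(",");
        double[] train = new double[fs.length];
        for (int i = 0; i < fs.length; i++) train[i] = Double.parseDouble(fs[i]);

        for (int t = 0; t < testSet.length; t++) {
            double d = 0.0;
            for (int i = 0; i < train.length; i++) {
                double diff = testSet[t][i] - train[i];
                d += diff * diff;
            }
            d = Math.sqrt(d);
            PriorityQueue<Neighbor> heap = localKnn.get(t);
            if (heap.size() < k) {
                heap.add(new Neighbor(d, label));
            } else if (d < heap.peek().dist) {
                heap.poll();                              // drop the current farthest neighbor
                heap.add(new Neighbor(d, label));
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Only the local k nearest neighbors per test datum leave this node.
        for (int t = 0; t < testSet.length; t++) {
            for (Neighbor n : localKnn.get(t)) {
                context.write(new Text("test-" + t), new Text(n.dist + "," + n.label));
            }
        }
    }
}
```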
Step 4: at the Reducer end, receive all values belonging to the same key and, based on the distances contained in those values, determine the final global k nearest neighbors with the selection algorithm.
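A corresponding sketch of the Reducer in Step 4: for each test datum it merges the at-most-k local candidates arriving from every Mapper and keeps the k globally closest, again using a bounded max-heap as the selection step. The "distance,label" value format matches the Mapper sketch above and is an assumption, not a format prescribed by the patent.

```java
import java.io.IOException;
import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GlobalKnnReducer extends Reducer<Text, Text, Text, Text> {

    private int k;

    @Override
    protected void setup(Context context) {
        k = context.getConfiguration().getInt("knn.k", 5);
    }

    @Override
    protected void reduce(Text testId, Iterable<Text> localNeighbors, Context context)
            throws IOException, InterruptedException {
        // Max-heap of "distance,label" strings with the farthest kept neighbor on top.
        PriorityQueue<String> global = new PriorityQueue<>(
                Comparator.comparingDouble(
                        (String s) -> Double.parseDouble(s.split(",")[0])).reversed());

        for (Text value : localNeighbors) {
            String s = value.toString();                  // "distance,label"
            double d = Double.parseDouble(s.split(",")[0]);
            if (global.size() < k) {
                global.add(s);
            } else if (d < Double.parseDouble(global.peek().split(",")[0])) {
                global.poll();                            // drop the current farthest neighbor
                global.add(s);
            }
        }
        // Emit the global k nearest neighbors; Step 5 turns them into a class label.
        context.write(testId, new Text(String.join(";", global)));
    }
}
```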
Step 5: based on D-F theory, classify this test datum using the global k-nearest-neighbor data and determine its class label.
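The patent does not spell out the D-F theory used in Step 5, so the sketch below substitutes a simple distance-weighted majority vote over the global k nearest neighbors purely to illustrate the interface of this step; it should not be read as the patent's actual classification criterion.

```java
import java.util.HashMap;
import java.util.Map;

public class KnnVote {
    /** distances[i] and labels[i] describe the i-th global nearest neighbor of one test datum. */
    public static String classify(double[] distances, String[] labels) {
        Map<String, Double> score = new HashMap<>();
        for (int i = 0; i < distances.length; i++) {
            double w = 1.0 / (distances[i] + 1e-9);       // closer neighbors weigh more
            score.merge(labels[i], w, Double::sum);
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : score.entrySet()) {
            if (e.getValue() > bestScore) { bestScore = e.getValue(); best = e.getKey(); }
        }
        return best;
    }
}
```

For example, classify(new double[]{0.2, 0.5, 0.9}, new String[]{"A", "B", "A"}) returns "A", since the two "A" neighbors together outweigh the single, nearer-than-one-of-them "B" neighbor.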
Step 6: repeat steps 2 to 5 to obtain the classification results of all test data.
The original Hadoop-based parallel k-NN classification algorithm simply computes the distances between the test data and the training data in parallel at the Mapper end, then receives them at the Reducer end, determines the k nearest neighbors there, and completes the final classification. The present method improves on this scheme by drastically reducing the volume of intermediate data transmitted, and thus greatly increases classification speed. Unlike the original method, the method of the present invention not only computes distances at the Mapper end but also filters out the local k-nearest-neighbor data there, which makes the computation more compact, reduces data transmission, and yields a speed-up ratio closer to the ideal.
As shown in Figs. 2 and 3, under different node counts the method of the invention saves about 1/2 to 4/5 of the classification time compared with the prior art.
As shown in Fig. 4, on very large data volumes the gap between the speed-up ratio achieved by the method of the invention under different node counts and the ideal speed-up ratio is very small, and with no more than 3 nodes there is almost no difference.

Claims (3)

1. A parallel k-nearest-neighbor classification method based on Hadoop, comprising the steps of:
(1) preprocessing the data;
(2) at the Mapper end of each node of the Hadoop platform, computing in parallel the distances between a test datum and the training data located on that node;
(3) at said Mapper end, determining the local k nearest neighbors of this test datum with a selection algorithm, and sending all local k-nearest-neighbor data to the Reducer end of each Hadoop node;
(4) at said Reducer end, receiving all local k-nearest-neighbor data of this test datum and determining the global k nearest neighbors with the selection algorithm;
(5) classifying this test datum using said global k-nearest-neighbor data to obtain its classification result;
(6) repeating steps (2) to (5) to obtain the classification results of all test data.
2. The parallel k-nearest-neighbor classification method based on Hadoop according to claim 1, characterized in that said local k-nearest-neighbor data are (key, value) pairs, wherein the key is the test datum and the value is the combination of said distance and the class label of the training datum.
3. The parallel k-nearest-neighbor classification method based on Hadoop according to claim 1, characterized in that in said step (5), the classification is based on D-F theory.
CN201210071445XA 2012-03-19 2012-03-19 Hadoop based parallel k nearest neighbor classification method Pending CN102622446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210071445XA CN102622446A (en) 2012-03-19 2012-03-19 Hadoop based parallel k nearest neighbor classification method


Publications (1)

Publication Number Publication Date
CN102622446A true CN102622446A (en) 2012-08-01

Family

ID=46562365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210071445XA Pending CN102622446A (en) 2012-03-19 2012-03-19 Hadoop based parallel k nearest neighbor classification method

Country Status (1)

Country Link
CN (1) CN102622446A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1953442A (en) * 2006-09-14 2007-04-25 浙江大学 Method of k-neighbour query based on data mesh
CN101799748A (en) * 2009-02-06 2010-08-11 中国移动通信集团公司 Method for determining data sample class and system thereof
CN102243641A (en) * 2011-04-29 2011-11-16 西安交通大学 Method for efficiently clustering massive data

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291954A (en) * 2017-07-28 2017-10-24 南京邮电大学 A kind of OCL parallel query methods based on MapReduce
CN107291954B (en) * 2017-07-28 2020-07-31 南京邮电大学 OCL parallel query method based on MapReduce
CN111814892A (en) * 2020-07-16 2020-10-23 贵州民族大学 Design method for constructing parallel KNN classifier by distributed objects


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120801