CN106203508A

CN106203508A - An image classification method based on the Hadoop platform

Info

Publication number: CN106203508A
Application number: CN201610543066.4A
Authority: CN
Inventors: 侯春萍; 张倩楠; 王宝亮; 常鹏
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2016-07-11
Filing date: 2016-07-11
Publication date: 2016-12-07

Abstract

The present invention relates to an image classification method based on the Hadoop platform, comprising: extracting SIFT features from the images to generate a SIFT feature library of the training images; generating a BoVW visual dictionary from the SIFT features; after the dictionary of the BoVW model has been extracted, comparing the feature-extracted training images against this dictionary and representing each training image as a dictionary-based histogram vector; using the histogram vectors of the training images as the training input of a random forest classifier, and designing the parallel generation of the classifier on Hadoop; and, for a test image to be classified, performing feature extraction and histogram vectorization in turn, inputting the result into the classifier, and performing parallel classification on the Hadoop platform. The present invention not only achieves good classification accuracy but also effectively reduces classification time, and is well suited to large-scale image classification scenarios.

Description

An image classification method based on the Hadoop platform

Technical field

The present invention relates to image classification technology, and in particular to a distributed image classification method.

Background

1. Image classification

Image classification technology uses computers to analyze and classify images automatically, and is the basis of fields such as object detection and recognition and image retrieval. Image classification generally consists of two parts: extraction of image features and feature-based classification.

In feature extraction, most current work focuses on the local features of an image, extracting local feature vectors with algorithms such as the Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF). The Bag of Visual Words (BoVW) model goes one step further: it clusters the large number of extracted feature vectors to generate a visual dictionary and, according to that dictionary, maps each image to a histogram over the dictionary words. This both reduces the number of feature vectors and makes the image vector more expressive.

In feature-based classification, machine learning methods are most commonly used. A mainstream classifier is the Support Vector Machine (SVM), which performs well on small-sample, nonlinear, and high-dimensional classification problems. Another is the Adaboost classifier, an iterative algorithm that adds a new weak classifier in each round until a predetermined, sufficiently small error rate is reached. However, when the set of images to be processed is large, its massive number of samples and high-dimensional vectors pose a great challenge to the file systems and computing architectures on which these traditional classification algorithms rely. The random forest algorithm introduces randomness into the decision tree model and uses a number of decision trees to build an ensemble classifier; it offers a new approach to image classification and is one of the popular directions in bioinformatics and data mining. However, when the data volume is very large, the random forest classifier also suffers from long classification times.

2. Cloud computing platforms

Hadoop is the current mainstream open-source distributed cloud computing platform. It is commonly used for large-scale data processing tasks such as web access log analysis, inverted index construction, document clustering, statistical machine translation, and generating the index of an entire search engine. The Hadoop platform is developed by the Apache Software Foundation, and many well-known Internet companies, such as Taobao, eBay, and Baidu, deploy internal applications on the Hadoop framework.

Some companies have also designed their own big data processing platforms and provide complete commercial big data solutions, such as Microsoft Azure and China Mobile's BigCloude. With the powerful distributed computing capability of a big data platform, the storage and computation of massive data can be better realized. How to port the processing of massive image collections to a big data processing platform is therefore a highly significant problem.

Summary of the invention

An object of the present invention is to provide a distributed image classification method suited to the Hadoop platform. The method makes full use of the distributed computing capability of the Hadoop platform, overcomes the long processing time and the storage and file system bottlenecks of large-scale image classification, and improves its efficiency. The technical solution is as follows.

A distributed image classification algorithm based on the Hadoop platform, comprising the following technical steps:

Step 1. Extract the SIFT features of the images:

Input a number of training images, and design a parallel extraction of the SIFT features of each training image on the Hadoop platform, generating a SIFT feature library of the training images;

Step 2. Generate the BoVW visual dictionary from the SIFT features:

On the Hadoop platform, perform distributed clustering on the SIFT vectors in the SIFT feature library to obtain a number of visual words, which serve as the dictionary of the BoVW model;

Step 3. After the dictionary of the BoVW model has been extracted, compare the feature-extracted training images against this dictionary and represent each training image as a dictionary-based histogram vector;

Step 4. Use the histogram vectors of the training images from step 3 as the training input of a random forest classifier, and design the parallel generation of the classifier on Hadoop;

Step 5. For a test image to be classified, perform feature extraction and histogram vectorization in turn, input the result into the classifier obtained in step 4, and perform parallel classification on the Hadoop platform.

The present invention addresses the excessive time consumption of large-scale image classification and the outdated file systems and storage architectures on which it relies, and proposes an image classification method based on the Hadoop platform. Experiments show that the present invention not only achieves good classification accuracy but also effectively reduces classification time, and is well suited to large-scale image classification scenarios.

Brief description of the drawings

Fig. 1 is a structure diagram of the Hadoop platform.

Fig. 2 is a flow chart of the present invention.

Fig. 3 is a schematic diagram of the BoVW model.

Detailed description of the embodiments

The present invention divides the image classification process into two stages, extraction of image features and training of the random forest classifier, and carries out parallel design and programming in each stage, so that no step of the overall image processing has to operate on all of the image data at once. In addition, the BoVW model is introduced in the first stage to give each image a simplified representation, which improves the accuracy of image classification.

The present invention is tested on the classic Caltech-101 image library, from which eight image classes such as brain, bonsai, and leopards are randomly selected for classification. In each class, 30 images are chosen as training images and 20 as test images, and each experiment is run 10 times. The present invention is further described below.

(1) Extraction of image features

The SIFT descriptors extracted by the SIFT algorithm are invariant to image scaling, rotation, and brightness changes, and also remain reasonably stable under viewpoint changes and affine transformations. SIFT feature extraction for a large number of images is carried out on Hadoop, with the following parallelization steps:

Step 1. The training image data set {img(x, y)} is input into the Hadoop cluster and stored block-wise on the slave nodes. Because the images are independent of one another, each image img(x, y) within a block is treated as a separate split.

Step 2. Each Map task node takes as input a key/value pair <Key, Value>, where Key is the ID of img(x, y) and Value is the image data img(x, y). Each Map task performs SIFT feature extraction on its image, and every image yields a feature set {X_1, X_2, ...}. Each X_i has the form <(x_i, y_i), X_n(x_i, y_i)>, where (x_i, y_i) is the position coordinate of the feature point and X_n(x_i, y_i) is the 128-dimensional SIFT feature vector at that point.

Step 3. The SIFT features produced by each Map node are stored in a distributed manner on the HDFS cluster. No Reduce processing is needed, so the number of Reduce tasks is set to 0.
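The record flow of the three steps above can be sketched in plain Python, outside Hadoop. This is only an illustration under stated assumptions: `map_extract`, `run_map_only`, and `toy_extractor` are hypothetical names, and `toy_extractor` is a stand-in that emits one fake 128-dimensional vector per nonzero pixel rather than real SIFT descriptors.

```python
def map_extract(image_id, image, extractor):
    """Map step: one <Key, Value> = <image ID, image data> record in,
    one <image ID, feature list> record out."""
    yield image_id, list(extractor(image))


def run_map_only(records, extractor):
    """Simulate a job with zero reducers: mapper output is the job output."""
    output = []
    for image_id, image in records:
        output.extend(map_extract(image_id, image, extractor))
    return output


def toy_extractor(image):
    """Hypothetical stand-in for SIFT (NOT a real descriptor).

    Emits <(x, y), vector> for every nonzero pixel, with a 128-dimensional
    vector so the record shape matches the description above.
    """
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value:
                yield (float(x), float(y)), [float(value)] * 128


if __name__ == "__main__":
    images = [("img-a", [[0, 1], [2, 0]]), ("img-b", [[0, 0], [0, 0]])]
    for image_id, features in run_map_only(images, toy_extractor):
        print(image_id, len(features), "features")
```

Because every image is processed independently, no record ever needs to meet another in a reducer, which is why the job can run with the Reduce count set to 0.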

(2) Representing images with the BoVW model

Classifying directly on the large number of image SIFT vectors extracted in (1) suffers from large data volume and complex computation. Introducing the BoVW model to simplify the image representation reduces the data volume and gives a better classification result. On the basis of a large number of SIFT features, the BoVW model produces a limited number of visual words (VW) that make up a visual dictionary, and then uses the visual dictionary to represent each image as a vector whose dimension equals the dictionary size. The clustering step uses the K-means algorithm, which partitions the N feature points in the vector space into K specified classes according to the principle of minimizing the within-cluster sum of squares, as shown in formula (1).

\min \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(\mu_i, x)^2 \qquad (1)

In the formula, C_i denotes the i-th cluster, μ_i is its center, and x is a data point belonging to that cluster.
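To make formula (1) concrete, the following is a minimal single-machine sketch of the K-means loop and its objective. It is not the Mahout implementation used by the invention; the function names, the random initialization, and the fixed seed are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance, i.e. dist(mu_i, x)^2 in formula (1)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]


def objective(points, centers, labels):
    """Within-cluster sum of squares, the quantity minimized in formula (1)."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == k]
        if not members:
            new_centers.append(old)  # keep the old center if a cluster is empty
            continue
        new_centers.append([sum(m[d] for m in members) / len(members)
                            for d in range(len(old))])
    return new_centers


def kmeans(points, k, max_iter=10, seed=0):
    """Lloyd iterations, capped at max_iter (10 mirrors the default cited in the text)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


if __name__ == "__main__":
    pts = [[0, 0], [0, 1], [10, 10], [10, 11]]
    centers, labels = kmeans(pts, k=2)
    print(centers, labels, objective(pts, centers, labels))
```

In the distributed setting the assignment step maps naturally onto Map tasks and the center update onto Reduce tasks, which is the division of labor a MapReduce K-means typically follows.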

The present invention uses the open-source Apache Mahout machine learning library to run the K-means algorithm. The large number of SIFT features produced in (1) are first converted to SequenceFile format; the default maximum number of iterations is 10, and the number of cluster centers is set to 200. KMeansMapper, KMeansCombiner, KMeansReducer, and KMeansDriver under org.apache.mahout.clustering.kmeans are used to produce the cluster centers, which are stored on HDFS and used as the dictionary of the BoVW model.

After the dictionary of size 200 is obtained, each SIFT vector of an image is mapped to its nearest visual word according to the Euclidean distance between the dictionary words and the vector. The frequency of each visual word is then counted, and the resulting frequency histogram is the 200-dimensional BoVW-based image vector.
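The mapping from descriptors to a word-frequency histogram can be sketched as follows. The function names are illustrative assumptions, and the example uses a 3-word dictionary of 2-dimensional vectors instead of 200 words of 128 dimensions so that it stays readable.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (same ordering as Euclidean distance)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_word(descriptor, dictionary):
    """Index of the visual word closest to one descriptor."""
    return min(range(len(dictionary)),
               key=lambda k: sq_dist(descriptor, dictionary[k]))


def bovw_histogram(descriptors, dictionary):
    """Count how many descriptors of one image fall on each visual word.

    The result has one bin per dictionary word, so a 200-word dictionary
    gives a 200-dimensional image vector.
    """
    histogram = [0] * len(dictionary)
    for descriptor in descriptors:
        histogram[nearest_word(descriptor, dictionary)] += 1
    return histogram


if __name__ == "__main__":
    dictionary = [[0, 0], [10, 10], [0, 10]]
    descriptors = [[1, 1], [9, 9], [0, 1], [1, 9]]
    print(bovw_histogram(descriptors, dictionary))
```

Whatever the number of descriptors an image produces, its histogram has a fixed length, which is what allows images of different sizes to be fed to the same classifier.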

(3) Random forest classifier training

A random forest is built from multiple decision trees constructed in a randomized manner, combining multiple weak classifiers into one high-accuracy classifier. Each decision tree in the forest consists of a root node, branch nodes, and leaf nodes. When a node is split, a number of attributes are randomly selected from the candidate attribute set of the training samples, the Gini impurity of each attribute is computed, and the attribute that yields the largest decrease in impurity is used as the split attribute of that node. The Gini impurity is defined as:

Gini(N_{node}) = 1 - \sum_{i=1}^{H} p_i^2 \qquad (2)

In the formula, p_i is the proportion of class i at the node, and the sum runs over the H classes. After the split, if the original set is divided into L parts, the average Gini coefficient after the split is:

Gini(\alpha) = \sum_{i=1}^{L} \frac{|T_i|}{|T|} \times Gini(i) \qquad (3)

In the formula, L is the number of child nodes, |T_i| is the number of samples at child node i, |T| is the number of samples at the node before the split, and Gini(i) is the Gini impurity of child node i.
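Formulas (2) and (3) translate directly into a few lines of Python. The sketch below is illustrative only; the function names `gini` and `split_gini` are assumptions, and class labels are given as plain lists.

```python
from collections import Counter


def gini(labels):
    """Formula (2): 1 minus the sum of squared class proportions at a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(children):
    """Formula (3): child impurities weighted by |T_i| / |T|.

    `children` is a list of label lists, one per child node.
    """
    total = sum(len(child) for child in children)
    if total == 0:
        return 0.0
    return sum(len(child) / total * gini(child) for child in children)


if __name__ == "__main__":
    parent = ["brain", "brain", "bonsai", "bonsai"]
    pure_split = [["brain", "brain"], ["bonsai", "bonsai"]]
    useless_split = [["brain", "bonsai"], ["brain", "bonsai"]]
    print(gini(parent))               # impurity before the split
    print(split_gini(pure_split))     # a split that separates the classes
    print(split_gini(useless_split))  # a split that changes nothing
```

The decrease in impurity for a candidate attribute is `gini(parent) - split_gini(children)`; the attribute with the largest decrease is the one chosen for the node.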

With the image SIFT vectors and the visual dictionary obtained in (1) and (2), each image can be represented as a 200-dimensional vector. These vectors serve as the training data of the random forest, which is generated on Hadoop in a distributed manner with the following parallel steps:

Step 1. Set the number of trees in the random forest to T = 200. On the original training sample set, use the Bootstrap method to obtain T training sample subsets {S_1, S_2, ..., S_200} as the inputs of the individual trees.

Step 2. Start T Map tasks, equal in number to the size of the forest, each taking a key-value pair <Key, Value> as input, where Key is the forest identifier and Value is a training subset S_i. The Map function computes the optimal split attribute at each node and generates the left and right branches in turn. It outputs a key-value pair <Key, Value>, where Key is the forest identifier and Value stores the information DT_i of the tree.

Step 3. After all Map tasks have finished, their results are input to the Reduce task, which integrates the information of all trees and outputs <Key, Value>, where Key is the forest identifier and Value is the whole random forest, expressed as {DT_1, DT_2, ..., DT_200}.
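The three parallel steps can be simulated on one machine to show how the key-value pairs flow. This is a sketch under explicit assumptions: all function names are hypothetical, and `train_tree_stub` is a placeholder that only records the majority label of its subset, standing in for the real Gini-based tree induction.

```python
import random
from collections import Counter


def bootstrap_subsets(samples, n_trees, seed=0):
    """Step 1: draw n_trees subsets, each sampled with replacement."""
    rng = random.Random(seed)
    n = len(samples)
    return [[samples[rng.randrange(n)] for _ in range(n)]
            for _ in range(n_trees)]


def train_tree_stub(subset):
    """Hypothetical placeholder for decision tree induction (NOT a real tree)."""
    labels = [label for _, label in subset]
    return {"majority": Counter(labels).most_common(1)[0][0], "n": len(subset)}


def map_task(forest_id, subset, train_tree):
    """Step 2: <forest ID, S_i> in, <forest ID, DT_i> out."""
    return forest_id, train_tree(subset)


def reduce_task(forest_id, trees):
    """Step 3: gather every DT_i emitted under the same forest ID."""
    return forest_id, list(trees)


def build_forest(samples, n_trees, train_tree=train_tree_stub,
                 forest_id="forest-0"):
    subsets = bootstrap_subsets(samples, n_trees)
    mapped = [map_task(forest_id, s, train_tree) for s in subsets]
    trees = [tree for key, tree in mapped if key == forest_id]
    return reduce_task(forest_id, trees)


if __name__ == "__main__":
    data = [([1, 0], "brain"), ([0, 1], "bonsai"), ([1, 1], "brain")]
    key, forest = build_forest(data, n_trees=5)
    print(key, len(forest), "trees")
```

Since all Map outputs share the same key, a single Reduce call receives every tree, which matches the description of one Reduce task assembling the whole forest.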

(4) Classification testing with the random forest

A test image passes through three stages: SIFT feature extraction, BoVW-based image representation, and random forest classification. The final classification result is produced by the vote of the random forest classifier.
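The voting stage can be illustrated with a short sketch. Here each tree is modeled as a plain Python callable that maps a histogram vector to a class label; this representation and the name `forest_vote` are assumptions for illustration, and the source does not specify how ties are broken (the sketch leaves that to `Counter.most_common`).

```python
from collections import Counter


def forest_vote(trees, histogram):
    """Return the label predicted by the largest number of trees."""
    votes = Counter(tree(histogram) for tree in trees)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Three toy "trees" that look at single histogram bins (hypothetical rules).
    trees = [
        lambda h: "brain" if h[0] > 2 else "bonsai",
        lambda h: "brain" if h[1] > 0 else "leopards",
        lambda h: "bonsai",
    ]
    print(forest_vote(trees, [5, 1, 0]))
```

Because each test image is voted on independently, the test set can be split across Map tasks in the same way as the training images.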

The distributed experimental environment is a Hadoop cluster of 4 Dell R730 servers, of which 1 is the NameNode and the other 3 are DataNodes. All four servers run Ubuntu 14.04, and the Hadoop version is CDH 5.5.0. The test images are taken from 8 classes of the Caltech-101 image library, such as brain, bonsai, and leopards, with 20 images per class chosen as test images.

The experiment compares the running time of the distributed and single-machine forms of this method: in the three sub-processes of BoVW dictionary generation, random forest classifier training, and test image classification, the execution time of the distributed cluster is clearly shorter than that of the single machine, with an average speedup of about 2.4.

In addition, the experiments show how the number of trees affects the classification performance of this method. When the number of trees is small, classification accuracy increases as the number of trees grows; when the number of trees reaches 200, the performance of the classifier stabilizes, with accuracy close to 75%.

In summary, the present invention achieves good classification results while making full use of the advantages of the Hadoop environment in distributed storage and parallel computation, greatly reducing the time overhead of classification.

Claims

1. An image classification method based on the Hadoop platform, comprising the following steps:

Step 1. Extract the SIFT features of the images:

Input a number of training images, and design a parallel extraction of the SIFT features of each training image on the Hadoop platform, generating a SIFT feature library of the training images;

Step 2. Generate the BoVW visual dictionary from the SIFT features:

On the Hadoop platform, perform distributed clustering on the SIFT vectors in the SIFT feature library to obtain a number of visual words, which serve as the dictionary of the BoVW model;

Step 3. After the dictionary of the BoVW model has been extracted, compare the feature-extracted training images against this dictionary and represent each training image as a dictionary-based histogram vector;

Step 4. Use the histogram vectors of the training images from step 3 as the training input of a random forest classifier, and design the parallel generation of the classifier on Hadoop;

Step 5. For a test image to be classified, perform feature extraction and histogram vectorization in turn, input the result into the classifier obtained in step 4, and perform parallel classification on the Hadoop platform.