CN106203508A - An image classification method based on the Hadoop platform - Google Patents

An image classification method based on the Hadoop platform

Info

Publication number
CN106203508A
Authority
CN
China
Prior art keywords
image
dictionary
training
hadoop
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610543066.4A
Other languages
Chinese (zh)
Inventor
侯春萍
张倩楠
王宝亮
常鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201610543066.4A
Publication of CN106203508A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to an image classification method based on the Hadoop platform, comprising: extracting SIFT features from images to build a SIFT feature library of the training images; generating a BoVW visual dictionary from the SIFT features; after the dictionary of the BoVW model is obtained, comparing each feature-extracted training image against the dictionary and representing it as a dictionary-based histogram vector; using the histogram vectors of the training images as the training input of a random forest classifier, whose generation is parallelized on Hadoop; and, for each test image to be classified, performing feature extraction and histogram vectorization in turn, feeding the result to the classifier, and classifying in parallel on the Hadoop platform. The invention achieves good classification accuracy while effectively reducing classification time, and is well suited to large-scale image classification.

Description

An image classification method based on the Hadoop platform
Technical field
The present invention relates to image classification technology, and in particular to a distributed image classification method.
Background
1. Image classification
Image classification technology uses computers to analyze and classify images automatically, and underlies fields such as object detection and recognition and image retrieval. Image classification generally consists of two parts: extraction of image features and feature-based classification.
For feature extraction, most current work focuses on local image features, extracting local feature vectors with algorithms such as the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF). The Bag of Visual Words (BoVW) model goes a step further: it clusters the large number of extracted feature vectors to generate a visual dictionary and maps each image to a histogram over the dictionary words, which both reduces the number of feature vectors and makes the image vector more expressive.
For feature-based classification, machine learning methods are most commonly used. A mainstream classifier is the Support Vector Machine (SVM), which performs well on small-sample, nonlinear, and high-dimensional classification problems. Another is AdaBoost, an iterative algorithm that adds a new weak classifier in each round until a predetermined, sufficiently small error rate is reached. However, when the image set to be processed is large, its massive sample count and high-dimensional vectors pose a serious challenge to the file systems and computing architectures on which these traditional classification algorithms rely. The random forest algorithm introduces randomness into the decision tree model and builds an ensemble classifier from multiple decision trees; it offers a new approach to image classification and is a popular research direction in bioinformatics and data mining. When the data volume is very large, however, the random forest classifier also suffers from long classification times.
2. Cloud computing platforms
Hadoop is the current mainstream open-source distributed cloud computing platform. It is commonly used for large-scale data processing such as web access log analysis, inverted index construction, document clustering, statistical machine translation, and building the index of an entire search engine. Hadoop is developed by the Apache Software Foundation, and many well-known Internet companies, such as Taobao, eBay, and Baidu, deploy internal applications on the Hadoop framework.
Some companies also design big data processing platforms and offer complete commercial big data solutions, such as Microsoft Azure and China Mobile's BigCloude. With the powerful distributed computing capability of big data platforms, the storage and computation of massive data can be handled better. How to port the processing of massive image sets to a big data processing platform is therefore a highly significant problem.
Summary of the invention
The object of the present invention is to provide a distributed image classification method suited to the Hadoop platform. The method makes full use of Hadoop's distributed computing capability to overcome the long running time and the storage and file system bottlenecks of large-scale image classification, and thereby improves its efficiency. The technical solution is as follows.
A distributed image classification algorithm based on the Hadoop platform, comprising the following steps:
Step 1. Extract image SIFT features:
Input a number of training images, extract the SIFT features of each training image in parallel on the Hadoop platform, and generate the SIFT feature library of the training images;
Step 2. Generate the BoVW visual dictionary from the SIFT features:
On the Hadoop platform, perform distributed clustering on the SIFT vectors in the SIFT feature library to obtain a number of visual words, which form the dictionary of the BoVW model;
Step 3. After the dictionary of the BoVW model is obtained, compare each feature-extracted training image against the dictionary and represent it as a dictionary-based histogram vector;
Step 4. Use the histogram vectors of the training images from step 3 as the training input of a random forest classifier, and parallelize the generation of the classifier on Hadoop;
Step 5. For each test image to be classified, perform feature extraction and histogram vectorization in turn, input the result to the classifier obtained in step 4, and classify in parallel on the Hadoop platform.
The present invention addresses the excessive running time and the outdated file system and storage architecture of large-scale image classification by proposing an image classification method based on the Hadoop platform. Experiments show that the invention achieves good classification accuracy while effectively reducing classification time, and is well suited to large-scale image classification.
Brief description of the drawings
Fig. 1 is a structure diagram of the Hadoop platform
Fig. 2 is a flow chart of the present invention
Fig. 3 is a schematic diagram of the BoVW model
Detailed description of the invention
The present invention divides image classification into two stages, image feature extraction and random forest classifier training, and designs and programs each stage for parallel execution, so that no step of the image processing has to operate on the entire image data set at once. In addition, the BoVW model is introduced in the first stage to give each image a compact representation, which improves classification accuracy.
The invention is tested on the classic Caltech-101 image library, classifying eight randomly selected image classes such as brain, bonsai, and leopards. For each class, 30 images are chosen as training images and 20 as test images, and each experiment is run 10 times. The invention is further described below.
(1) Extraction of image features
SIFT descriptors extracted by the SIFT algorithm are invariant to image scaling, rotation, and brightness changes, and remain reasonably stable under viewpoint changes and affine transformations. SIFT feature extraction for a large number of images is carried out on Hadoop in the following parallel steps:
Step 1. The training image data set {img(x, y)} is input to the Hadoop cluster and stored in blocks on the slave nodes. Because the images are independent of one another, each image img(x, y) in a block is treated as a separate split.
Step 2. Each Map task node takes a key/value pair <Key, Value> as input, where Key is the ID of img(x, y) and Value is the image data img(x, y). Each Map task performs SIFT feature extraction on its image, yielding for every image a feature set {X1, X2, ...}. Each Xi has the form <(xi, yi), Xn(xi, yi)>, where (xi, yi) is the position of the feature point and Xn(xi, yi) is the 128-dimensional SIFT feature vector at that point.
Step 3. The SIFT features produced by each Map node are stored in a distributed manner on HDFS. No Reduce phase is needed, so the number of Reduce tasks is set to 0.
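The map-only job in steps 1 to 3 can be sketched in plain Python, outside Hadoop. This is a minimal illustration under stated assumptions: `toy_extractor` is a hypothetical stand-in that emits a 2-dimensional gradient descriptor per interior pixel (it is not SIFT, which yields 128-dimensional vectors), and `run_map_only` merely simulates the framework calling the mapper once per image with no Reduce phase.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Image = List[List[int]]                        # grayscale image as rows of pixels
Feature = Tuple[Tuple[int, int], List[float]]  # ((x, y), descriptor)


def toy_extractor(img: Image) -> List[Feature]:
    """Hypothetical stand-in for a SIFT extractor (NOT SIFT).

    Emits one 2-D descriptor (horizontal and vertical intensity
    differences) for every interior pixel of the image.
    """
    feats: List[Feature] = []
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            dx = float(img[y][x + 1] - img[y][x - 1])
            dy = float(img[y + 1][x] - img[y - 1][x])
            feats.append(((x, y), [dx, dy]))
    return feats


def map_extract(key: str, value: Image,
                extractor: Callable[[Image], List[Feature]]
                ) -> Iterator[Tuple[str, List[Feature]]]:
    """Map function: Key is the image ID, Value is the image data."""
    yield key, extractor(value)


def run_map_only(records: Iterable[Tuple[str, Image]],
                 extractor: Callable[[Image], List[Feature]]
                 ) -> Dict[str, List[Feature]]:
    """Simulates a job with zero reducers: mapper output is stored as-is."""
    stored: Dict[str, List[Feature]] = {}
    for key, value in records:
        for out_key, out_value in map_extract(key, value, extractor):
            stored[out_key] = out_value
    return stored
```

In a real deployment the extractor would be an actual SIFT implementation and the output would be written to HDFS rather than to an in-memory dictionary.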
(2) Representing images with the BoVW model
Classifying directly on the large number of SIFT vectors extracted in (1) suffers from large data volume and high computational complexity. Introducing the BoVW model to give each image a compact representation reduces the data volume and yields better classification results. From a large number of SIFT features, the BoVW model produces a limited set of visual words (VW) that form a visual dictionary, and then uses the dictionary to represent each image as a vector whose dimension equals the dictionary size. The clustering step uses the K-means algorithm, which partitions the N feature points in the vector space into K specified classes so as to minimize the within-cluster sum of squared distances, as shown in formula (1).
$$\min \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(\mu_i, x)^2 \tag{1}$$
In the formula, C_i denotes the i-th cluster, μ_i is its center, and x is a data point belonging to that cluster.
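The objective in formula (1) can be made concrete with a small single-machine sketch. The code below is a plain-Python Lloyd iteration, not Mahout's distributed implementation; the function names are hypothetical, squared Euclidean distance is assumed for dist, and the default of 10 iterations simply mirrors the iteration cap mentioned in the text.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance, the dist(mu_i, x)^2 term of formula (1)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def assign(points: List[Point], centers: List[Point]) -> List[int]:
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda i: sq_dist(centers[i], x))
            for x in points]


def objective(points: List[Point], centers: List[Point],
              labels: List[int]) -> float:
    """Within-cluster sum of squared distances, i.e. formula (1)."""
    return sum(sq_dist(centers[label], x) for x, label in zip(points, labels))


def kmeans(points: List[Point], init_centers: List[Point],
           max_iter: int = 10) -> Tuple[List[List[float]], List[int]]:
    """Alternate assignment and center update until stable or max_iter."""
    centers = [list(map(float, c)) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [x for x, label in zip(points, labels) if label == i]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)
```

Each pass can only lower or keep the value returned by `objective`, which is why a fixed iteration cap is a reasonable stopping rule.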
The present invention implements K-means with the open-source Apache Mahout machine learning library. The large number of SIFT features produced in (1) are first converted to SequenceFile format; the maximum number of iterations is left at the default of 10 and the number of cluster centers is set to 200. The classes KMeansMapper, KMeansCombiner, KMeansReducer, and KMeansDriver under org.apache.mahout.clustering.kmeans produce the cluster centers, which are stored on HDFS and used as the dictionary of the BoVW model.
Once the dictionary of size 200 is obtained, each SIFT vector of an image is mapped to its nearest visual word according to the Euclidean distance between the vector and the dictionary words. Counting the occurrences of each visual word gives a frequency histogram, which is the 200-dimensional BoVW image vector.
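The mapping from descriptors to a word-frequency histogram can be sketched as follows. This is an illustrative single-machine version with a hypothetical function name; the toy dictionary in the usage note has 2 words of dimension 2, whereas the text uses 200 words of dimension 128.

```python
from typing import List, Sequence


def bovw_histogram(descriptors: List[Sequence[float]],
                   dictionary: List[Sequence[float]]) -> List[int]:
    """Count, for each visual word, how many descriptors are nearest to it."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance gives the same nearest word as
        # Euclidean distance and avoids the square root.
        return sum((p - q) ** 2 for p, q in zip(a, b))

    hist = [0] * len(dictionary)
    for d in descriptors:
        nearest = min(range(len(dictionary)),
                      key=lambda i: sq_dist(dictionary[i], d))
        hist[nearest] += 1
    return hist
```

For example, with the dictionary `[[0, 0], [10, 10]]`, the descriptors `[[1, 1], [9, 9], [0, 2]]` fall nearest to the first, second, and first word respectively. The histogram length always equals the dictionary size, regardless of how many descriptors an image has.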
(3) Random forest classifier training
A random forest is built from multiple decision trees constructed in a randomized manner, combining multiple weak classifiers into one high-accuracy classifier. Each decision tree in the forest consists of a root node, branch nodes, and leaf nodes. When a node is split, several attributes are selected at random from the candidate attribute set of the training samples, the Gini impurity of each is computed, and the attribute giving the largest decrease in impurity is used as the split attribute of the node. The Gini impurity is defined as:
$$\mathrm{Gini}(N_{node}) = 1 - \sum_{i=1}^{H} p_i^2 \tag{2}$$
In the formula, p_i is the proportion of samples at the node that belong to class i, and H is the number of classes. If the split divides the original set into L parts, the average Gini impurity after the split is:
$$\mathrm{Gini}(\alpha) = \sum_{i=1}^{L} \frac{|T_i|}{|T|} \times \mathrm{Gini}(i) \tag{3}$$
In the formula, L is the number of child nodes, |T_i| is the number of samples at child node i, |T| is the number of samples at the node being split, and Gini(i) is the Gini impurity of child node i.
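Formulas (2) and (3) translate directly into a few lines of Python. The sketch below uses hypothetical function names and represents each node simply as the list of class labels of its samples.

```python
from collections import Counter
from typing import Hashable, List


def gini(labels: List[Hashable]) -> float:
    """Formula (2): 1 minus the sum of squared class proportions at a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(children: List[List[Hashable]]) -> float:
    """Formula (3): size-weighted average Gini impurity of the child nodes."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini(child) for child in children)
```

A node holding two samples of each of two classes has impurity 0.5. A split that separates the classes perfectly scores 0, while one that leaves both children evenly mixed stays at 0.5, so the first split would be preferred.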
From the image SIFT vectors and the visual dictionary obtained in (1) and (2), each image can be represented as a 200-dimensional vector. These vectors serve as the training data of the random forest, which is generated on Hadoop in a distributed manner in the following parallel steps:
Step 1. Set the number of trees in the random forest to T = 200. From the original training sample set, use the bootstrap method to draw T training sample subsets {S1, S2, ..., S200}, one as the input of each tree.
Step 2. Start T Map tasks, one per tree in the forest. Each takes a key-value pair <Key, Value> as input, where Key is the forest identifier and Value is a training subset Si. The Map function computes the optimal split attribute at each node and generates the left and right branches in turn. It outputs a key-value pair <Key, Value>, where Key is the forest identifier and Value stores the information DTi of the tree.
Step 3. After all Map tasks have completed, their results are input to the Reduce task, which integrates the information of all trees and outputs <Key, Value>, where Key is the forest identifier and Value is the whole random forest, expressed as {DT1, DT2, ..., DT200}.
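The three steps above can be sketched on a single machine as follows. Several simplifications are assumptions made for brevity and are not taken from the source: each "tree" is a depth-1 stump rather than a full decision tree, the number of randomly tried attributes is the integer square root of the attribute count, and the dictionary layout of a tree and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels, fallback):
    """Most common label, or the fallback when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else fallback


def bootstrap(samples, rng):
    """Step 1: draw len(samples) samples with replacement."""
    return [rng.choice(samples) for _ in samples]


def train_stump(subset, rng):
    """Depth-1 stand-in for a tree: best split among random attributes."""
    n_features = len(subset[0][0])
    n_try = max(1, math.isqrt(n_features))
    best = None
    for f in rng.sample(range(n_features), n_try):
        for thr in sorted({x[f] for x, _ in subset}):
            left = [y for x, y in subset if x[f] <= thr]
            right = [y for x, y in subset if x[f] > thr]
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(subset)
            if best is None or score < best[0]:
                best = (score, f, thr, left, right)
    _, f, thr, left, right = best
    overall = majority([y for _, y in subset], None)
    return {"feature": f, "threshold": thr,
            "left": majority(left, overall),
            "right": majority(right, overall)}


def map_train(key, subset, rng):
    """Step 2: Key is the forest identifier, Value is one subset S_i."""
    yield key, train_stump(subset, rng)


def reduce_forest(key, trees):
    """Step 3: gather every tree emitted under the forest identifier."""
    return key, list(trees)


def train_forest(samples, n_trees, seed=0):
    rng = random.Random(seed)
    subsets = [bootstrap(samples, rng) for _ in range(n_trees)]
    mapped = [kv for s in subsets for kv in map_train("forest", s, rng)]
    return reduce_forest("forest", [tree for _, tree in mapped])
```

Because every tree depends only on its own bootstrap subset, the map calls are independent of one another, which is what makes one Map task per tree a natural fit.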
(4) Classification testing with the random forest
A test image passes through three stages: SIFT feature extraction, BoVW-based image representation, and random forest classification. The final classification result is produced by the vote of the random forest classifier.
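The voting stage can be illustrated with a short sketch. It assumes, purely for illustration, that each tree is a depth-1 stump stored as a dictionary with the hypothetical keys `feature`, `threshold`, `left`, and `right`; a real forest would walk full decision trees, but the majority vote works the same way.

```python
from collections import Counter
from typing import Dict, List, Sequence


def predict_tree(tree: Dict, x: Sequence[float]) -> str:
    """Follow the single split of a stump and return the branch label."""
    branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
    return tree[branch]


def forest_vote(forest: List[Dict], x: Sequence[float]) -> str:
    """Majority vote over the predictions of all trees."""
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```

Here `x` would be the BoVW histogram vector of the test image. Since each test vector is classified independently, batches of test images can be distributed across Map tasks in the same way as training.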
The distributed experimental environment is a Hadoop cluster of four Dell R730 servers, of which one is the NameNode and the other three are DataNodes. All four servers run Ubuntu 14.04, and the Hadoop version is CDH 5.5.0. The test images are drawn from eight classes of the Caltech-101 image library, such as brain, bonsai, and leopards, with 20 test images chosen per class.
The experiment compares the running time of the distributed and single-machine forms of the method. In each of the three subprocesses (BoVW dictionary generation, random forest classifier training, and test image classification), the distributed cluster runs markedly faster than the single machine, with an average speedup of about 2.4.
The experiments also show how the number of trees affects classification performance: when there are few trees, classification accuracy rises as the number of trees increases; once the number of trees reaches 200, classifier performance levels off, with accuracy close to 75%.
In summary, the present invention achieves good classification results while making full use of the distributed storage and parallel computation advantages of the Hadoop environment, substantially reducing the time cost of classification.

Claims (1)

1. An image classification method based on the Hadoop platform, comprising the following steps:
Step 1. Extract image SIFT features:
Input a number of training images, extract the SIFT features of each training image in parallel on the Hadoop platform, and generate the SIFT feature library of the training images;
Step 2. Generate the BoVW visual dictionary from the SIFT features:
On the Hadoop platform, perform distributed clustering on the SIFT vectors in the SIFT feature library to obtain a number of visual words, which form the dictionary of the BoVW model;
Step 3. After the dictionary of the BoVW model is obtained, compare each feature-extracted training image against the dictionary and represent it as a dictionary-based histogram vector;
Step 4. Use the histogram vectors of the training images from step 3 as the training input of a random forest classifier, and parallelize the generation of the classifier on Hadoop;
Step 5. For each test image to be classified, perform feature extraction and histogram vectorization in turn, input the result to the classifier obtained in step 4, and classify in parallel on the Hadoop platform.
CN201610543066.4A 2016-07-11 2016-07-11 An image classification method based on the Hadoop platform Pending CN106203508A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201610543066.4A | 2016-07-11 | 2016-07-11 | An image classification method based on the Hadoop platform

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201610543066.4A | 2016-07-11 | 2016-07-11 | An image classification method based on the Hadoop platform

Publications (1)

Publication Number | Publication Date
CN106203508A (en) | 2016-12-07

Family

ID=57476325

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201610543066.4A | An image classification method based on the Hadoop platform (Pending, CN106203508A (en)) | 2016-07-11 | 2016-07-11

Country Status (1)

Country Link
CN (1) CN106203508A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874199A (en) * 2017-02-10 2017-06-20 腾讯科技(深圳)有限公司 Test case treating method and apparatus
CN109492682A (en) * 2018-10-30 2019-03-19 桂林电子科技大学 A kind of multi-branched random forest data classification method
CN109657711A (en) * 2018-12-10 2019-04-19 广东浪潮大数据研究有限公司 A kind of image classification method, device, equipment and readable storage medium storing program for executing
CN110175546A (en) * 2019-05-15 2019-08-27 深圳市商汤科技有限公司 Image processing method and device, electronic equipment and storage medium
CN111385655A (en) * 2018-12-29 2020-07-07 武汉斗鱼网络科技有限公司 Advertisement bullet screen detection method and device, server and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207889A (en) * 2013-01-31 2013-07-17 重庆大学 Method for retrieving massive face images based on Hadoop
CN104239897A (en) * 2014-09-04 2014-12-24 天津大学 Visual feature representing method based on autoencoder word bag
CN104392250A (en) * 2014-11-21 2015-03-04 浪潮电子信息产业股份有限公司 Image classification method based on MapReduce
CN104933445A (en) * 2015-06-26 2015-09-23 电子科技大学 Mass image classification method based on distributed K-means

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874199A (en) * 2017-02-10 2017-06-20 腾讯科技(深圳)有限公司 Test case treating method and apparatus
CN106874199B (en) * 2017-02-10 2022-10-18 腾讯科技(深圳)有限公司 Test case processing method and device
CN109492682A (en) * 2018-10-30 2019-03-19 桂林电子科技大学 A kind of multi-branched random forest data classification method
CN109657711A (en) * 2018-12-10 2019-04-19 广东浪潮大数据研究有限公司 A kind of image classification method, device, equipment and readable storage medium storing program for executing
CN111385655A (en) * 2018-12-29 2020-07-07 武汉斗鱼网络科技有限公司 Advertisement bullet screen detection method and device, server and storage medium
CN110175546A (en) * 2019-05-15 2019-08-27 深圳市商汤科技有限公司 Image processing method and device, electronic equipment and storage medium
WO2020228163A1 (en) * 2019-05-15 2020-11-19 深圳市商汤科技有限公司 Image processing method and apparatus, and electronic device and storage medium
JP2021528715A (en) * 2019-05-15 2021-10-21 シェンチェン センスタイム テクノロジー カンパニー リミテッドShenzhen Sensetime Technology Co.,Ltd Image processing methods and devices, electronic devices and storage media
JP7128906B2 (en) 2019-05-15 2022-08-31 シェンチェン センスタイム テクノロジー カンパニー リミテッド Image processing method and apparatus, electronic equipment and storage medium
TWI785267B (en) * 2019-05-15 2022-12-01 大陸商深圳市商湯科技有限公司 Method and electronic apparatus for image processing and storage medium thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161207
