CN106203508A - A kind of image classification method based on Hadoop platform - Google Patents
A kind of image classification method based on Hadoop platform
- Publication number
- CN106203508A
- Authority
- CN
- China
- Prior art keywords
- image
- dictionary
- training
- hadoop
- classifier
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to an image classification method based on the Hadoop platform, comprising: extracting SIFT features from images to generate a SIFT feature library for the training images; generating a BoVW visual dictionary from the SIFT features; after the BoVW dictionary is obtained, comparing each feature-extracted training image against the dictionary and representing it as a dictionary-based histogram vector; using the histogram vectors of the training images as training input for a random forest classifier, and generating the classifier in parallel on Hadoop; and, for each test image to be classified, performing feature extraction and histogram vectorization in turn, feeding the result to the classifier, and classifying in parallel on the Hadoop platform. The invention not only achieves good classification accuracy but also effectively reduces classification time, and is well suited to large-scale image classification scenarios.
Description
Technical field
The present invention relates to image classification technology, and in particular to a distributed image classification method.
Background
1. Image classification
Image classification technology uses computers to analyze and classify images automatically, and underlies fields such as object detection and recognition and image retrieval. Image classification generally consists of two parts: extracting image features and classifying based on those features.
For feature extraction, most current work focuses on local image features, using algorithms such as the Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF) to extract local feature vectors. The Bag of Visual Words (BoVW) model goes a step further: it clusters the large number of extracted feature vectors to generate a visual dictionary and maps each image to a histogram over the dictionary words. This both reduces the number of feature vectors and makes the image vector more expressive.
For feature-based classification, machine learning methods are most commonly used. A mainstream classifier is the Support Vector Machine (SVM), which performs well on small-sample, nonlinear, and high-dimensional classification problems. Another is AdaBoost, an iterative algorithm that adds a new weak classifier in each round until a predetermined, sufficiently small error rate is reached. However, when the set of images to be processed is large, the massive number of samples and the high dimensionality of the vectors pose a serious challenge to the file systems and computing architectures on which these traditional classification algorithms rely. The random forest algorithm introduces randomness into the decision tree model and builds an ensemble classifier from a number of decision trees; it offers a new approach to image classification and is a popular direction in bioinformatics and data mining. Even so, when the data volume is very large, the random forest classifier also suffers from long classification times.
2. Cloud computing platforms
Hadoop is a mainstream open-source distributed cloud computing platform, commonly used for large-scale data processing tasks such as web access log analysis, inverted index construction, document clustering, statistical machine translation, and building the index of an entire search engine. Hadoop is developed by the Apache Software Foundation, and many well-known Internet companies, such as Taobao, eBay, and Baidu, deploy internal applications on the Hadoop framework.
Some companies also design big data processing platforms and provide complete commercial big data solutions, such as Microsoft Azure and China Mobile's BigCloude. With the powerful distributed computing capability of big data platforms, the storage and computation of massive data can be handled better. How to port the processing of massive numbers of images onto a big data processing platform is therefore a highly meaningful problem.
Summary of the invention
The object of the present invention is to provide a distributed image classification method suited to the Hadoop platform. The method makes full use of the distributed computing capability of Hadoop, overcomes the long running time and the storage and file system bottlenecks of large-scale image classification, and improves its efficiency. The technical solution is as follows.
A distributed image classification algorithm based on the Hadoop platform, comprising the following steps:
Step 1. Extract SIFT features from the images:
Given a number of training images, extract the SIFT features of each training image in parallel on the Hadoop platform to generate a SIFT feature library for the training images.
Step 2. Generate the BoVW visual dictionary from the SIFT features:
On the Hadoop platform, perform distributed clustering of the SIFT vectors in the feature library to obtain a number of visual words, which serve as the dictionary of the BoVW model.
Step 3. After the BoVW dictionary is obtained, compare each feature-extracted training image against the dictionary and represent it as a dictionary-based histogram vector.
Step 4. Use the histogram vectors of the training images from step 3 as training input for the random forest classifier, and generate the classifier in parallel on Hadoop.
Step 5. For each test image to be classified, perform feature extraction and histogram vectorization in turn, feed the result to the classifier obtained in step 4, and classify in parallel on the Hadoop platform.
The present invention addresses the excessive running time of large-scale image classification and the limitations of conventional file system and storage architectures by proposing an image classification method based on the Hadoop platform. Experiments show that the invention not only achieves good classification accuracy but also effectively reduces classification time, and is well suited to large-scale image classification scenarios.
Brief description of the drawings
Fig. 1 is a structure diagram of the Hadoop platform.
Fig. 2 is a flow chart of the present invention.
Fig. 3 is a schematic diagram of the BoVW model.
Detailed description
The invention divides the image classification process into two stages, extraction of image features and training of the random forest classifier, and designs and implements each stage in parallel, so that no step of the processing needs to operate on the whole image data set at once. In addition, the BoVW model is introduced in the first stage to give images a compact representation, which improves classification accuracy.
The invention is tested on the classic Caltech-101 image library, from which eight image classes, including brain, bonsai, and leopards, are randomly selected for classification. In each class, 30 images are chosen as training images and 20 as test images, and each experiment is run 10 times. The invention is further described below.
(1) Extraction of image features
The SIFT descriptors extracted by the SIFT algorithm are invariant to image scaling, rotation, and brightness changes, and remain reasonably stable under viewpoint changes and affine transformations. SIFT feature extraction for a large number of images is carried out on Hadoop in the following parallel steps:
Step 1. The training image data set {img(x, y)} is loaded into the Hadoop cluster and stored in blocks on the slave nodes. Because the images are independent of one another, each image img(x, y) within a block is treated as a separate split.
Step 2. Each Map task node takes a key/value pair <Key, Value> as input, where Key is the ID of img(x, y) and Value is the image data img(x, y). Each Map task performs SIFT feature extraction on an image, and every image yields a feature set {X1, X2, ...}. Each Xi has the form <(xi, yi), Xn(xi, yi)>, where (xi, yi) is the position of the feature point and Xn(xi, yi) is the 128-dimensional SIFT feature vector at that point.
Step 3. The SIFT features produced by each Map node are stored in a distributed manner on HDFS. No Reduce processing is needed, so the number of Reduce tasks is set to 0.
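The map-only structure of these three steps can be illustrated with a small single-process sketch. It is not the patent's Hadoop job: `toy_extractor`, `map_extract`, and `run_map_only` are hypothetical names, the image format (a list of grayscale rows) is an assumption, and the extractor is a deliberately trivial stand-in for SIFT that emits a 2-dimensional gradient instead of a 128-dimensional descriptor. Only the key/value shape follows the text: image ID in, list of <(x, y), descriptor> out, and no reduce phase.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Sequence, Tuple

Point = Tuple[int, int]
Feature = Tuple[Point, List[float]]      # ((x_i, y_i), descriptor at that point)
Image = Sequence[Sequence[float]]        # grayscale rows (assumed format)


def toy_extractor(img: Image) -> List[Feature]:
    """Hypothetical stand-in for SIFT: emits a 2-D central-difference
    gradient for every interior pixel whose gradient is non-zero.
    A real SIFT extractor would return 128-D descriptors."""
    feats: List[Feature] = []
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            dx = img[y][x + 1] - img[y][x - 1]
            dy = img[y + 1][x] - img[y - 1][x]
            if dx or dy:
                feats.append(((x, y), [float(dx), float(dy)]))
    return feats


def map_extract(
    key: str, value: Image, extract: Callable[[Image], List[Feature]] = toy_extractor
) -> Iterator[Tuple[str, List[Feature]]]:
    """Map task: Key is the image ID, Value is the image data.
    Emits the image ID together with its feature set."""
    yield key, extract(value)


def run_map_only(
    records: Iterable[Tuple[str, Image]],
    extract: Callable[[Image], List[Feature]] = toy_extractor,
) -> Dict[str, List[Feature]]:
    """Simulates a job with zero reducers: map output is the final output."""
    output: Dict[str, List[Feature]] = {}
    for key, value in records:
        for out_key, feats in map_extract(key, value, extract):
            output[out_key] = feats
    return output
```

Because each image is its own split and nothing is shuffled between nodes, the map calls are independent and could run on any number of nodes without coordination.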
(2) Representing images with the BoVW model
Classifying directly on the large number of SIFT vectors extracted in (1) involves a large data volume and complex computation. Introducing the BoVW model to give images a compact representation reduces the data volume and yields better classification results. Based on the large set of SIFT features, the BoVW model produces a limited number of visual words (VW) that make up a visual dictionary, and then uses the dictionary to represent each image as a vector whose dimension equals the dictionary size. The clustering step uses the K-means algorithm, which partitions the N feature points in the vector space into K specified classes by minimizing the within-cluster sum of squared deviations, as shown in formula (1).
$$ J = \sum_{i=1}^{K} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2 \tag{1} $$
In the formula, $C_i$ denotes the i-th cluster, $\mu_i$ is its center, and $x_j$ is a data point belonging to that cluster.
The invention uses the open-source Apache Mahout machine learning library to run K-means. The large number of SIFT features produced in (1) are first converted to SequenceFile format. The default maximum number of iterations is 10, and the number of cluster centers is set to 200. The KMeansMapper, KMeansCombiner, KMeansReducer, and KMeansDriver classes under org.apache.mahout.clustering.kmeans produce the cluster centers, which are stored on HDFS and used as the dictionary of the BoVW model.
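To make the mapper/combiner/reducer/driver division concrete, here is a minimal plain-Python sketch of one way distributed K-means can be organized. It is not Mahout's code and does not use Mahout's API; the function names (`kmeans_map`, `kmeans_combine`, `kmeans_reduce`, `kmeans_driver`) and the convergence tolerance are assumptions chosen to mirror the four roles named above.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Partial = Tuple[List[float], int]        # (sum of vectors, number of vectors)


def nearest(point: Vector, centers: List[List[float]]) -> int:
    """Index of the center with the smallest Euclidean distance to point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans_map(split: List[Vector], centers: List[List[float]]):
    """Map: emit (cluster index, (vector, 1)) for every point in one split."""
    return [(nearest(p, centers), (list(p), 1)) for p in split]


def kmeans_combine(pairs) -> List[Tuple[int, Partial]]:
    """Combine: pre-sum vectors and counts per cluster on the map node,
    so only one partial sum per cluster leaves the node."""
    acc: Dict[int, Partial] = {}
    for idx, (vec, n) in pairs:
        if idx in acc:
            total, count = acc[idx]
            acc[idx] = ([a + b for a, b in zip(total, vec)], count + n)
        else:
            acc[idx] = (list(vec), n)
    return list(acc.items())


def kmeans_reduce(partials: List[Partial]) -> List[float]:
    """Reduce: merge the partial sums of one cluster into its new center."""
    total = [0.0] * len(partials[0][0])
    count = 0
    for vec, n in partials:
        total = [t + v for t, v in zip(total, vec)]
        count += n
    return [t / count for t in total]


def kmeans_driver(splits, centers, max_iter: int = 10, tol: float = 1e-6):
    """Driver: repeat map/combine/reduce until centers stop moving
    or max_iter is reached (10 matches the default quoted in the text)."""
    centers = [list(map(float, c)) for c in centers]
    for _ in range(max_iter):
        grouped = defaultdict(list)
        for split in splits:                     # one split = one map task
            for idx, partial in kmeans_combine(kmeans_map(split, centers)):
                grouped[idx].append(partial)
        new_centers = [list(c) for c in centers]  # empty clusters keep old center
        for idx, partials in grouped.items():
            new_centers[idx] = kmeans_reduce(partials)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers
```

The combiner is what keeps network traffic small: with 200 centers, each map node ships at most 200 partial sums per iteration regardless of how many SIFT vectors it holds.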
Once the dictionary of size 200 is obtained, each SIFT vector of an image is mapped to its nearest visual word according to the Euclidean distance between the vector and the dictionary words. Counting the occurrences of each visual word gives a frequency histogram, which is the 200-dimensional BoVW image vector.
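The mapping from descriptors to a histogram can be sketched in a few lines. This is an illustrative sketch only: `bovw_histogram` is a hypothetical name, and the toy call in the usage note uses 2-dimensional vectors and a 2-word dictionary where the text uses 128-dimensional SIFT vectors and 200 words.

```python
import math
from typing import List, Sequence


def bovw_histogram(
    descriptors: Sequence[Sequence[float]],
    dictionary: Sequence[Sequence[float]],
) -> List[int]:
    """Assign each descriptor to its nearest visual word (Euclidean
    distance) and count how often each word is used. The result has
    one entry per dictionary word."""
    hist = [0] * len(dictionary)
    for d in descriptors:
        word = min(range(len(dictionary)), key=lambda k: math.dist(d, dictionary[k]))
        hist[word] += 1
    return hist
```

For example, `bovw_histogram([(1, 1), (9, 9), (0, 2)], [(0, 0), (10, 10)])` assigns two descriptors to the first word and one to the second. Every image ends up with a vector of the same length, whatever number of feature points it had, which is what lets the classifier treat images uniformly.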
(3) Random forest classifier training
A random forest is built from multiple decision trees constructed in a randomized way, combining many weak classifiers into one high-accuracy classifier. Each decision tree in the forest consists of a root node, branch nodes, and leaf nodes. When a node is split, a number of attributes are randomly selected from the candidate attribute set of the training samples, the Gini impurity of each is computed, and the attribute giving the largest decrease in impurity is chosen as the split attribute for that node. The Gini impurity is defined as:
$$ \mathrm{Gini} = 1 - \sum_{i} p_i^2 $$
where $p_i$ is the proportion of class $i$ at the node. After the split, if the original set is divided into $L$ parts, the average Gini impurity after the split is:
$$ \mathrm{Gini}_{\mathrm{split}} = \sum_{i=1}^{L} \frac{T_i}{\sum_{j=1}^{L} T_j}\, \mathrm{Gini}(i) $$
where $L$ is the number of child nodes, $T_i$ is the number of samples at child node $i$, and $\mathrm{Gini}(i)$ is the Gini impurity of child node $i$.
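As a worked illustration of the two quantities, the following sketch computes the Gini impurity of a node and the sample-weighted average impurity of its children. The function names `gini` and `split_gini` are hypothetical, and labels are assumed to be given as plain lists.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity of one node: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(children: List[Sequence[Hashable]]) -> float:
    """Average impurity after a split: each child's impurity weighted
    by its share of the samples."""
    total = sum(len(child) for child in children)
    if total == 0:
        return 0.0
    return sum(len(child) / total * gini(child) for child in children)
```

A node holding two samples of each of two classes has impurity 0.5; a split that separates the classes perfectly brings the average to 0, so that split gives the largest possible decrease.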
From the SIFT vectors and the visual dictionary obtained in (1) and (2), each image can be represented as a 200-dimensional vector, which serves as training data for the random forest. The forest is generated on Hadoop in a distributed manner, in the following parallel steps:
Step 1. Set the number of trees in the random forest to T = 200. From the original training sample set, use the bootstrap method to draw T training sample subsets {S1, S2, ..., S200}, one as the input of each tree.
Step 2. Start T Map tasks, as many as there are trees in the forest. Each takes a key-value pair <Key, Value> as input, where Key is the forest identifier and Value is a training subset Si. The Map function computes the best split attribute at each node and generates the left and right branches in turn. It outputs a key-value pair <Key, Value>, where Key is the forest identifier and Value stores the information DTi of the tree.
Step 3. After all Map tasks have finished, their results are passed to the Reduce task, which merges the information of all trees and outputs <Key, Value>, where Key is the forest identifier and Value is the whole random forest, expressed as {DT1, DT2, ..., DT200}.
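The bootstrap, map, and reduce steps can be sketched as follows. This is a simplified single-process illustration under stated assumptions, not the patent's implementation: each "tree" is only a depth-1 stump (a full tree would recurse on both branches), the dictionary layout of a tree and all function names are hypothetical, and per-task seeds are added only to make the sketch reproducible.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]     # (feature vector, class label)


def gini(labels: Sequence[str]) -> float:
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels: Sequence[str]) -> str:
    return Counter(labels).most_common(1)[0][0]


def bootstrap(samples: List[Sample], rng: random.Random) -> List[Sample]:
    """Step 1: draw len(samples) items with replacement."""
    return [rng.choice(samples) for _ in samples]


def train_stump(subset: List[Sample], n_attrs: int, rng: random.Random) -> Dict:
    """Simplified tree: among a random subset of attributes, pick the
    (attribute, threshold) pair with the lowest weighted Gini impurity."""
    dim = len(subset[0][0])
    best = None
    for a in rng.sample(range(dim), min(n_attrs, dim)):
        for thr in sorted({x[a] for x, _ in subset}):
            left = [y for x, y in subset if x[a] <= thr]
            right = [y for x, y in subset if x[a] > thr]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(subset)
            if best is None or score < best[0]:
                best = (score, a, thr, majority(left), majority(right))
    if best is None:                     # no usable split: constant leaf
        label = majority([y for _, y in subset])
        return {"attr": 0, "thr": float("inf"), "left": label, "right": label}
    _, a, thr, left_label, right_label = best
    return {"attr": a, "thr": thr, "left": left_label, "right": right_label}


def map_train(key: str, subset: List[Sample], n_attrs: int, seed: int):
    """Step 2: one Map task per tree; Key is the forest identifier."""
    return key, train_stump(subset, n_attrs, random.Random(seed))


def reduce_forest(key: str, trees: List[Dict]):
    """Step 3: the Reduce task gathers every tree under the forest key."""
    return key, list(trees)


def train_forest(samples: List[Sample], n_trees: int, n_attrs: int, seed: int = 0):
    rng = random.Random(seed)
    subsets = [bootstrap(samples, rng) for _ in range(n_trees)]
    mapped = [map_train("forest", s, n_attrs, seed + i) for i, s in enumerate(subsets)]
    return reduce_forest("forest", [tree for _, tree in mapped])
```

Since every Map task shares the same key, a single reducer receives all trees, which matches the description of one Reduce task assembling the whole forest.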
(4) Classification testing with the random forest
A test image passes through three stages: SIFT feature extraction, BoVW image representation, and random forest classification. The final classification result is produced by the vote of the random forest classifier.
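The voting step can be illustrated with a short sketch. It assumes, for illustration only, that each trained tree is available as a callable that maps a histogram vector to a class label; `forest_vote` and `classify_batch` are hypothetical names.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Tree = Callable[[Sequence[float]], str]


def forest_vote(trees: List[Tree], x: Sequence[float]) -> str:
    """Each tree predicts a label; the most frequent label wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


def classify_batch(trees: List[Tree], images: Dict[str, Sequence[float]]) -> Dict[str, str]:
    """Classify every test image independently, keyed by image ID.
    Images do not depend on each other, so this loop is the part that
    could be spread across map tasks."""
    return {image_id: forest_vote(trees, vec) for image_id, vec in images.items()}
```

With three trees voting "brain", "bonsai", and "brain", the image is labeled "brain".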
The distributed experimental environment is a Hadoop cluster of 4 Dell R730 servers, of which 1 is the NameNode and the other 3 are DataNodes. All four servers run Ubuntu 14.04, and the Hadoop distribution is CDH 5.5.0. The test images are drawn from 8 classes of the Caltech-101 image library, including brain, bonsai, and leopards, with 20 images per class chosen as test images.
The experiment compares the running time of the distributed and single-machine versions of the method. In all three subprocesses, BoVW dictionary generation, random forest classifier training, and test image classification, the distributed cluster runs markedly faster than a single machine, with an average speedup of about 2.4.
The experiments also show how the number of trees affects classification performance. When there are too few trees, classification accuracy increases with the number of trees; once the number of trees reaches 200, classifier performance levels off, with accuracy close to 75%.
In summary, the invention achieves good classification results while making full use of the advantages of the Hadoop environment in distributed storage and parallel computation, greatly reducing the time cost of classification.
Claims (1)
1. An image classification method based on the Hadoop platform, comprising the following steps:
Step 1. Extract SIFT features from the images:
Given a number of training images, extract the SIFT features of each training image in parallel on the Hadoop platform to generate a SIFT feature library for the training images.
Step 2. Generate the BoVW visual dictionary from the SIFT features:
On the Hadoop platform, perform distributed clustering of the SIFT vectors in the feature library to obtain a number of visual words, which serve as the dictionary of the BoVW model.
Step 3. After the BoVW dictionary is obtained, compare each feature-extracted training image against the dictionary and represent it as a dictionary-based histogram vector.
Step 4. Use the histogram vectors of the training images from step 3 as training input for the random forest classifier, and generate the classifier in parallel on Hadoop.
Step 5. For each test image to be classified, perform feature extraction and histogram vectorization in turn, feed the result to the classifier obtained in step 4, and classify in parallel on the Hadoop platform.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610543066.4A CN106203508A (en) | 2016-07-11 | 2016-07-11 | A kind of image classification method based on Hadoop platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610543066.4A CN106203508A (en) | 2016-07-11 | 2016-07-11 | A kind of image classification method based on Hadoop platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106203508A true CN106203508A (en) | 2016-12-07 |
Family
ID=57476325
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610543066.4A Pending CN106203508A (en) | 2016-07-11 | 2016-07-11 | A kind of image classification method based on Hadoop platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106203508A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106874199A (en) * | 2017-02-10 | 2017-06-20 | 腾讯科技(深圳)有限公司 | Test case treating method and apparatus |
CN109492682A (en) * | 2018-10-30 | 2019-03-19 | 桂林电子科技大学 | A kind of multi-branched random forest data classification method |
CN109657711A (en) * | 2018-12-10 | 2019-04-19 | 广东浪潮大数据研究有限公司 | A kind of image classification method, device, equipment and readable storage medium storing program for executing |
CN110175546A (en) * | 2019-05-15 | 2019-08-27 | 深圳市商汤科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN111385655A (en) * | 2018-12-29 | 2020-07-07 | 武汉斗鱼网络科技有限公司 | Advertisement bullet screen detection method and device, server and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103207889A (en) * | 2013-01-31 | 2013-07-17 | 重庆大学 | Method for retrieving massive face images based on Hadoop |
CN104239897A (en) * | 2014-09-04 | 2014-12-24 | 天津大学 | Visual feature representing method based on autoencoder word bag |
CN104392250A (en) * | 2014-11-21 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | Image classification method based on MapReduce |
CN104933445A (en) * | 2015-06-26 | 2015-09-23 | 电子科技大学 | Mass image classification method based on distributed K-means |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103207889A (en) * | 2013-01-31 | 2013-07-17 | 重庆大学 | Method for retrieving massive face images based on Hadoop |
CN104239897A (en) * | 2014-09-04 | 2014-12-24 | 天津大学 | Visual feature representing method based on autoencoder word bag |
CN104392250A (en) * | 2014-11-21 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | Image classification method based on MapReduce |
CN104933445A (en) * | 2015-06-26 | 2015-09-23 | 电子科技大学 | Mass image classification method based on distributed K-means |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106874199A (en) * | 2017-02-10 | 2017-06-20 | 腾讯科技(深圳)有限公司 | Test case treating method and apparatus |
CN106874199B (en) * | 2017-02-10 | 2022-10-18 | 腾讯科技(深圳)有限公司 | Test case processing method and device |
CN109492682A (en) * | 2018-10-30 | 2019-03-19 | 桂林电子科技大学 | A kind of multi-branched random forest data classification method |
CN109657711A (en) * | 2018-12-10 | 2019-04-19 | 广东浪潮大数据研究有限公司 | A kind of image classification method, device, equipment and readable storage medium storing program for executing |
CN111385655A (en) * | 2018-12-29 | 2020-07-07 | 武汉斗鱼网络科技有限公司 | Advertisement bullet screen detection method and device, server and storage medium |
CN110175546A (en) * | 2019-05-15 | 2019-08-27 | 深圳市商汤科技有限公司 | Image processing method and device, electronic equipment and storage medium |
WO2020228163A1 (en) * | 2019-05-15 | 2020-11-19 | 深圳市商汤科技有限公司 | Image processing method and apparatus, and electronic device and storage medium |
JP2021528715A (en) * | 2019-05-15 | 2021-10-21 | シェンチェン センスタイム テクノロジー カンパニー リミテッドShenzhen Sensetime Technology Co.,Ltd | Image processing methods and devices, electronic devices and storage media |
JP7128906B2 (en) | 2019-05-15 | 2022-08-31 | シェンチェン センスタイム テクノロジー カンパニー リミテッド | Image processing method and apparatus, electronic equipment and storage medium |
TWI785267B (en) * | 2019-05-15 | 2022-12-01 | 大陸商深圳市商湯科技有限公司 | Method and electronic apparatus for image processing and storage medium thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108564129B (en) | Trajectory data classification method based on generation countermeasure network | |
CN100583101C (en) | Text categorization feature selection and weight computation method based on field knowledge | |
CN106203508A (en) | A kind of image classification method based on Hadoop platform | |
CN110059807A (en) | Image processing method, device and storage medium | |
CN104392250A (en) | Image classification method based on MapReduce | |
CN102663401B (en) | Image characteristic extracting and describing method | |
CN102663447B (en) | Cross-media searching method based on discrimination correlation analysis | |
Gabryel et al. | The image classification with different types of image features | |
CN110287311B (en) | Text classification method and device, storage medium and computer equipment | |
CN112819023A (en) | Sample set acquisition method and device, computer equipment and storage medium | |
Mehdipour Ghazi et al. | Open-set plant identification using an ensemble of deep convolutional neural networks | |
Deng et al. | Citrus disease recognition based on weighted scalable vocabulary tree | |
CN111813939A (en) | Text classification method based on representation enhancement and fusion | |
Abir et al. | Bangla handwritten character recognition with multilayer convolutional neural network | |
Zhuang et al. | A handwritten Chinese character recognition based on convolutional neural network and median filtering | |
Li | A review of machine learning algorithms for text classification | |
Li et al. | Spatial-temporal dynamic hand gesture recognition via hybrid deep learning model | |
Pengcheng et al. | Fast Chinese calligraphic character recognition with large-scale data | |
Zhang et al. | Large-scale clustering with structured optimal bipartite graph | |
Kuhn et al. | Brcars: a dataset for fine-grained classification of car images | |
Zhang et al. | Semi-supervised graph based embedding with non-convex sparse coding techniques | |
CN112579783B (en) | Short text clustering method based on Laplace atlas | |
Su et al. | Cross-modality based celebrity face naming for news image collections | |
David et al. | Authentication of Vincent van Gogh’s work | |
Becattini et al. | Indexing quantized ensembles of exemplar-SVMs with rejecting taxonomies |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161207 |