CN110659608A - Scene classification method based on multi-feature fusion


Info

Publication number
CN110659608A
Authority
CN
China
Prior art keywords
features, feature, scene, classification, fusion
Prior art date
Legal status
Pending
Application number
CN201910901697.2A
Other languages
Chinese (zh)
Inventor
轩靖奇
蔡春花
王峰
Current Assignee
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date
Filing date
Publication date
Application filed by Henan University of Technology
Priority claimed from CN201910901697.2A
Publication of CN110659608A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

Addressing the weak discriminative power and limited generalization ability of single image features in the field of scene recognition, the invention investigates a feature fusion method for scene classification. First, the GIST, HOG (histogram of oriented gradients), SIFT (scale-invariant feature transform) and PLBP (pyramid local binary pattern) features of the scene image are extracted, and the SIFT features are encoded with VLAD (vector of locally aggregated descriptors). The extracted features are then analyzed and fused in different combinations using a serial fusion method. Finally, the fused features are fed into a multi-class SVM to classify the scene images, and the average accuracy and classification speed of the final recognition are evaluated through extensive experiments. The experimental results show that the proposed method exploits the advantages of the different features so that their information complements one another, achieving better classification performance while keeping both feature extraction and classification times low.

Description

Scene classification method based on multi-feature fusion
Technical Field
The invention belongs to the field of scene recognition, and particularly relates to a scene classification method based on multi-feature fusion.
Background
The goal of scene recognition is to identify the scene to which an image belongs by extracting and analyzing its features to obtain information about the scene. As an important research direction in computer vision, it is applied in fields such as image and video retrieval, security and surveillance systems, robot vision systems, and intelligent transportation. Because images of the same scene class can differ greatly in background, scale, viewing angle and illumination, while images of different scene classes can be similar, classifying and recognizing scene images is difficult.
Scene recognition is an important and difficult research topic in computer vision. Before 2010, classification and recognition were mainly based on low-level features such as texture, shape and color. However, such simple global features are not sufficient to describe a whole image, and classification performance degrades in complex environments. To overcome this problem, some researchers turned to local low-level features, processing the color and texture of local regions. David Lowe proposed SIFT, a scale-space-based local feature descriptor invariant to image scaling, rotation and affine transformation, in IJCV in 2004. In 2005, Dalal et al. proposed the histogram of oriented gradients (HOG) feature at the CVPR conference, which describes an image by accumulating gradient direction statistics over its local regions. Oliva and Torralba adopted and refined GIST, a global feature that reflects scene properties such as the naturalness and openness of an image, although GIST is less effective on complex indoor scenes. Philbin proposed a bag-of-visual-words (BoVW) model based on SIFT features, which expresses the extracted features as combinations of visual words to form a dictionary and classifies samples by analyzing the frequency of visual words in each sample. The BoVW model is simple and effectively reduces the feature dimensionality of a sample, but it ignores the spatial position of the feature points. To address this drawback, Lazebnik et al. proposed the spatial pyramid matching (SPM) model in 2006, which partitions the sample space at several levels so that the spatial position of features is fully considered, greatly improving the performance of the BoVW model.
Due to the complexity of scene images, a single feature can hardly describe all the information in an image. How to combine the strengths of several features to mine richer information, and thereby achieve classification performance beyond that of any single feature, has therefore become a popular research direction.
Disclosure of Invention
The invention aims to provide a multi-feature fusion method for scene classification. A fusion scheme of VLAD features built on SIFT local descriptors, GIST features, PLBP features and HOG features is proposed. Further encoding the local features mines the correlations among them, enhances discriminability and speeds up classification; fusing the HOG features captures edge and gradient information and thus the local shape of the image; fusing the GIST features improves the global description of the image; and fusing the PLBP features alleviates the insufficient spatial expressiveness of plain texture features. Finally, a support vector machine with an RBF kernel classifies the scene images after feature fusion.
To solve the above technical problems, the invention provides the following technical scheme, which comprises the following steps in order:
(1) scene image preprocessing
In the preprocessing stage of the experiments, the scene images undergo gray-level conversion and related processing. For GIST feature extraction the images are resized to 256 × 256; for all other features they are resized to 300 × 300.
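A minimal preprocessing sketch along these lines is shown below; OpenCV is an assumed library choice (the patent names none), and `preprocess` is a hypothetical helper:

```python
# Hedged sketch: grayscale conversion plus the two resize targets stated above.
import cv2

def preprocess(path, for_gist=False):
    img = cv2.imread(path)                         # load the scene image (BGR)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # gray-level conversion
    size = (256, 256) if for_gist else (300, 300)  # 256x256 for GIST, 300x300 otherwise
    return cv2.resize(gray, size)
```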
(2) Feature extraction
SIFT, GIST, PLBP and HOG features are extracted from the scene image. The local SIFT features are then further encoded with the VLAD algorithm to mine the correlations among them, enhancing discriminability and speeding up classification; the HOG features capture edge and gradient information and thus the local shape of the image; the GIST features improve the global description of the image; and the PLBP features address the insufficient spatial expressiveness of plain texture features. Step (2) is characterized as follows:
1) GIST features: the image is divided into a 4 × 4 grid; each block is processed by a Gabor filter bank with 4 scales and 8 orientations and the responses are averaged, giving a 32-dimensional vector per block; the GIST vectors of all blocks are then concatenated to form the GIST feature of the whole image, with dimensionality 4 × 4 × 32 = 512.
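A rough sketch of this GIST extractor follows; the skimage Gabor kernels and the four frequency values are assumptions standing in for the filter bank, which the patent does not parameterize:

```python
# Hedged GIST sketch: 4 scales x 8 orientations, responses averaged over a 4x4 grid.
import numpy as np
from scipy.ndimage import convolve
from skimage.filters import gabor_kernel

def gist(gray_256):                                    # 256x256 grayscale image
    feats = []
    for freq in (0.05, 0.1, 0.2, 0.4):                 # 4 scales (illustrative values)
        for k in range(8):                             # 8 orientations
            kern = np.real(gabor_kernel(freq, theta=k * np.pi / 8))
            resp = np.abs(convolve(gray_256.astype(float), kern))
            # average the filter response over a 4x4 grid of 64x64 blocks
            feats.append(resp.reshape(4, 64, 4, 64).mean(axis=(1, 3)).ravel())
    # 16 blocks x (4 scales x 8 orientations) = 512 dimensions, block-major order
    return np.stack(feats, axis=1).ravel()
```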
2) HOG features: HOG is formed by computing and accumulating histograms of gradient orientation over local regions of the image; in essence it represents the image by the statistics of its gradients. The grayscale image is first normalized and the gradient at every pixel is computed. Several pixels form a cell, within which a gradient histogram is accumulated; several adjacent cells form a block, whose histogram is obtained by concatenating and normalizing the cell histograms; the block histograms describe an image patch, and the HOG feature of the image is the concatenation of all block features. The invention divides the image into 50 × 50-pixel cells, computes a 40-bin gradient histogram per cell, and groups adjacent 2 × 2 cells into blocks. For a 300 × 300 image this gives 6 cells in each direction and, with 2 × 2 cells per block, 5 blocks in each direction, so the final HOG feature vector has 5 × 5 × 40 × 2 × 2 = 4000 dimensions.
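Since skimage's `hog` implements exactly this cell/block scheme, the stated configuration can be sketched as below (the library choice and block norm are assumptions):

```python
# Hedged HOG sketch: 50x50-pixel cells, 40 bins, 2x2-cell blocks on a 300x300 image
# give 5 x 5 blocks x 2 x 2 cells x 40 bins = 4000 dimensions.
from skimage.feature import hog

def hog_4000(gray_300):
    return hog(gray_300,
               orientations=40,
               pixels_per_cell=(50, 50),
               cells_per_block=(2, 2),
               block_norm='L2')  # cell histograms concatenated and normalized per block
```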
3) SIFT (VLAD) features: SIFT features are first extracted from the scene image, and k-means produces a codebook of k centers; each local feature is then assigned to its nearest center, and finally the residuals between the local features and their assigned centers are accumulated as the final image representation. Concretely, the nearest codebook center is found for every feature in the image, and the differences between the features and their centers are accumulated into a K × D VLAD matrix, where K is the number of cluster centers and D is the feature dimension (128 for SIFT). The matrix is then flattened into a (K × D)-dimensional vector and L2-normalized; the resulting vector is the VLAD representation (K is set to 78 and D to 128). VLAD effectively reduces the amount of computation, making it an algorithm that balances accuracy and efficiency.
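A minimal VLAD encoding sketch with the stated K = 78 and D = 128 follows; sklearn's k-means is an assumed choice for codebook training:

```python
# Hedged VLAD sketch: nearest-center assignment, residual accumulation, L2 normalization.
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(all_sift, k=78):
    return KMeans(n_clusters=k, n_init=10).fit(all_sift)  # all training SIFT descriptors

def vlad(sift_desc, codebook):                  # sift_desc: (n, 128) for one image
    k, d = codebook.n_clusters, sift_desc.shape[1]
    assign = codebook.predict(sift_desc)        # nearest center for each descriptor
    v = np.zeros((k, d))
    for c in range(k):                          # accumulate residuals per center
        members = sift_desc[assign == c]
        if len(members):
            v[c] = (members - codebook.cluster_centers_[c]).sum(axis=0)
    v = v.ravel()                               # flatten to K x D = 78 x 128 = 9984 dims
    return v / (np.linalg.norm(v) + 1e-12)      # L2 normalization
```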
4) PLBP features: the PLBP feature is obtained by concatenating the LBP histograms of every pyramid level and normalizing the concatenated LBP feature vector uniformly, so that it reflects the pixel information of the whole image. First, edge detection and pyramid partitioning are applied: the image is divided into 4 levels, where the first level is the whole image, the second level splits it into 4 sub-regions, and the third and fourth levels each further split the previous sub-regions into 4 smaller blocks. Next, the LBP features of each sub-region are computed, quantizing each sub-region into a K-bin histogram. Finally, all LBP feature vectors are concatenated into the PLBP feature vector of the image. The invention sets the number of bins to 40 and uses 4 pyramid levels, so the final feature dimensionality is (1 + 4 + 16 + 64) × 40 = 3400.
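A possible PLBP sketch is given below; the 8-neighbour LBP operator (P = 8, R = 1) and skimage are assumptions, since the patent does not fix the exact LBP variant:

```python
# Hedged PLBP sketch: 4 pyramid levels, 40-bin LBP histogram per sub-region,
# (1 + 4 + 16 + 64) x 40 = 3400 dimensions in total.
import numpy as np
from skimage.feature import local_binary_pattern

def plbp(gray_300, bins=40, levels=4):
    lbp = local_binary_pattern(gray_300, P=8, R=1)  # LBP codes in [0, 255]
    feats = []
    for lv in range(levels):                        # level lv has 2^lv x 2^lv regions
        n = 2 ** lv
        h, w = lbp.shape[0] // n, lbp.shape[1] // n
        for i in range(n):
            for j in range(n):
                cell = lbp[i*h:(i+1)*h, j*w:(j+1)*w]
                hist, _ = np.histogram(cell, bins=bins, range=(0, 256))
                feats.append(hist / (hist.sum() + 1e-12))  # normalized histogram
    return np.concatenate(feats)
```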
(3) Feature fusion
Assume three feature vectors α, β and γ in three feature spaces A, B and C, where α ∈ A, β ∈ B and γ ∈ C. Serial fusion forms the combined vector δ = (kα, lβ, jγ); if α, β and γ are m-, n- and q-dimensional feature vectors respectively, then δ has dimensionality m + n + q, where k, l and j are the weight coefficients of the corresponding feature vectors. The invention adopts serial fusion with all weight coefficients set to 1, so the final fused dimensionality is the sum of the individual feature dimensionalities (m + n + q + …). SIFT features are extracted first and encoded with the VLAD algorithm to generate the coded features; the invention mainly adopts VLAD feature coding, and the PLBP, GIST and HOG features of the scene image are extracted at the same time, producing a feature matrix file for each picture. The feature matrices are then loaded according to 10 randomly generated training-set and test-set files, serial fusion is performed with the NumPy library as sketched below, and processing continues with step (4).
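With all weight coefficients equal to 1, serial fusion reduces to plain vector concatenation, e.g. with NumPy (the library named above); the feature variables in the usage line are hypothetical:

```python
# Hedged serial-fusion sketch: weighted concatenation with k = l = j = 1 by default.
import numpy as np

def serial_fuse(*features, weights=None):
    weights = weights or [1.0] * len(features)
    return np.concatenate([w * f for w, f in zip(weights, features)])

# e.g. fused = serial_fuse(vlad_vec, gist_vec, hog_vec, plbp_vec)
# dimensionality = 9984 + 512 + 4000 + 3400
```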
(4) Normalization process
After feature extraction, to eliminate possible effects of differing dimensions, extreme values or noisy data, and differing value ranges among the features, and to improve the convergence speed of the model, the output of step (3) is standardized by the standard-deviation (z-score) method, so that the processed feature data has mean 0 and standard deviation 1.
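For instance, with sklearn's StandardScaler (an assumed choice; fitting on the training set and reusing its statistics on the test set is standard practice rather than a detail stated in the patent, and `X_train_fused` / `X_test_fused` are hypothetical variables):

```python
# Hedged z-score standardization sketch.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                      # zero mean, unit standard deviation
X_train_std = scaler.fit_transform(X_train_fused)
X_test_std = scaler.transform(X_test_fused)
```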
(5) Classifying the scene images with a support vector machine based on the RBF kernel function.
The model is evaluated by average classification accuracy, recall, feature extraction time and classification time. The higher the average classification accuracy and the lower the feature extraction and classification times, the stronger the predictive ability of the model. Comparing the average prediction accuracies (Fig. 1) shows that scene classification based on a single feature performs poorly, whereas the feature fusion method achieves relatively good recognition (Tables 2-4); in particular, the scene recognition system using serial fusion of SIFT (VLAD), GIST, HOG and PLBP features reaches a recognition accuracy of 87.27%.
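A sketch of this classification and evaluation step with sklearn follows; the C and gamma values are illustrative defaults, not parameters disclosed in the patent:

```python
# Hedged RBF-SVM sketch with the evaluation metrics named above.
import time
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X_train_std, y_train)

t0 = time.time()
y_pred = clf.predict(X_test_std)
print('classification time (s):', time.time() - t0)
print('accuracy:', accuracy_score(y_test, y_pred))
print('macro recall:', recall_score(y_test, y_pred, average='macro'))
print(confusion_matrix(y_test, y_pred))
```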
Drawings
Fig. 1 shows the classification accuracy of single features on the OT data set; Fig. 2 shows the confusion matrix of the SIFT (VLAD), GIST, HOG and PLBP fusion on the OT data set; Fig. 3 shows the corresponding confusion matrix on the FP data set; and Fig. 4 shows the corresponding confusion matrix on the LSP data set.
Detailed Description
To verify the performance of the proposed model, experiments were performed on three data sets: Scene-8 (OT-8), Scene-13 (FP) and Scene-15 (LSP). Each category in the data sets contains 200 to 400 pictures, with an average size of 300 × 250 pixels. The composition of the data sets is shown in Table 1.
TABLE 1 Experimental data set
The experiments adopt a strategy of averaging over multiple runs. For each scene, 100 images are randomly selected as the training set and the remaining images form the test set. The experiment is repeated 10 times for each data set and the results are averaged to obtain the final experimental result.
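A sketch of this protocol is shown below; `features`, `labels` and `run_pipeline` are hypothetical placeholders for the fused feature matrix, the class labels and the train/evaluate routine of steps (3)-(5):

```python
# Hedged sketch of the averaging protocol: 10 random splits, 100 training images per class.
import numpy as np

rng = np.random.default_rng()
accs = []
for run in range(10):
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        train_idx.extend(idx[:100])            # 100 training images per scene class
        test_idx.extend(idx[100:])             # remaining images form the test set
    accs.append(run_pipeline(features[train_idx], labels[train_idx],
                             features[test_idx], labels[test_idx]))
print('mean accuracy over 10 runs:', np.mean(accs))
```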
As can be seen from the tables, the serial fusion of SIFT (VLAD), GIST, HOG and PLBP features achieves classification accuracies of 87.27%, 83.50% and 79.30% on the OT, FP and LSP data sets respectively, and the average time for feature extraction and classification on the three data sets is 1.1393 s, 1.3651 s and 1.4529 s, respectively.
The experiments also show that recognition performance declines as the scale of the data set grows: the FP data set has more categories and adds indoor scenes, so its classification accuracy is lower than on the OT data set; the LSP data set further adds the more complex store and industrial scenes, reducing the accuracy further.
TABLE 2 Performance indicators corresponding to different fusion modes in OT data set
[Table 2 is reproduced only as an image in the original publication]
TABLE 3 Performance indicators corresponding to different fusion modes in FP data set
[Table 3 is reproduced only as an image in the original publication]
TABLE 4 Performance indicators corresponding to different fusion modes in LSP dataset
[Table 4 is reproduced only as an image in the original publication]
The confusion matrices of the best-performing method from Tables 2-4 on the three data sets are shown in Figs. 2-4. On the OT data set, the highest recognition accuracy reaches 98%, two further categories, including opencountry, reach 92%, and the worst category, coast, still reaches a 78% classification rate. On the FP data set, opencountry reaches 96%, one newly added indoor category reaches 97% and the kitchen category reaches 95%, while the street category drops markedly, to only 61%. On the LSP data set the best category reaches 96% and mountain reaches 95%, and the newly added store and industrial categories reach 79% and 94%, respectively.

Claims (1)

1. A scene classification method based on multi-feature fusion, used mainly for accurately predicting scene images, comprising the following steps:
(1) scene image preprocessing
Completing preprocessing operations such as resizing and gray-level conversion of the scene image;
(2) feature extraction
Extracting the SIFT, GIST, PLBP and HOG features of the scene image, and then further encoding the local SIFT features with the VLAD algorithm to mine the correlations among them, enhance discriminability and improve classification speed; meanwhile, the HOG features capture edge and gradient information so as to describe local shape; the GIST features improve the global description of the image; and the PLBP features address the insufficient spatial expression of texture features;
(3) feature fusion
Storing the scene image features extracted in step (2) for fusion, then loading the feature matrices according to 10 randomly generated training-set and test-set files, and finally setting the feature fusion weight coefficients to 1 and performing serial fusion, characterized in that for step (3): assuming three feature vectors α, β and γ in three feature spaces A, B and C, where α ∈ A, β ∈ B and γ ∈ C, serial fusion forms
δ = (kα, lβ, jγ)
and if α, β and γ are m-, n- and q-dimensional feature vectors respectively, δ has dimensionality m + n + q, where k, l and j are the weight coefficients of the corresponding feature vectors; the method adopts serial fusion with the weight coefficients set to 1, so the final fused dimensionality is the sum of the individual feature dimensionalities (m + n + q + …);
(4) normalization process
After feature extraction, to eliminate possible effects of differing dimensions, extreme values or noisy data, and differing value ranges among the features, and to improve the convergence speed of the model, standardizing the output of step (3) by the standard-deviation (z-score) method, so that the processed feature data has mean 0 and standard deviation 1;
(5) classifying the scene images by using a support vector machine based on an RBF kernel function;
Splitting the processed features into training and test sets according to the rules above and feeding them into the support vector machine with the RBF kernel, which produces performance indicators such as the confusion matrix, the classification result of each class, the accuracy and recall of each run, the feature extraction time, the classification and feature fusion time, and the average accuracy over all runs.
CN201910901697.2A 2019-09-23 2019-09-23 Scene classification method based on multi-feature fusion Pending CN110659608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910901697.2A CN110659608A (en) 2019-09-23 2019-09-23 Scene classification method based on multi-feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910901697.2A CN110659608A (en) 2019-09-23 2019-09-23 Scene classification method based on multi-feature fusion

Publications (1)

Publication Number Publication Date
CN110659608A true CN110659608A (en) 2020-01-07

Family

ID=69039179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910901697.2A Pending CN110659608A (en) 2019-09-23 2019-09-23 Scene classification method based on multi-feature fusion

Country Status (1)

Country Link
CN (1) CN110659608A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242223A (en) * 2020-01-15 2020-06-05 中国科学院地理科学与资源研究所 Street space quality evaluation method based on streetscape image multi-feature fusion
CN111553893A (en) * 2020-04-24 2020-08-18 成都飞机工业(集团)有限责任公司 Method for identifying automatic wiring and cutting identifier of airplane wire harness
CN111723763A (en) * 2020-06-29 2020-09-29 深圳市艾为智能有限公司 Scene recognition method based on image information statistics
CN111723763B (en) * 2020-06-29 2024-02-13 深圳市艾为智能有限公司 Scene recognition method based on image information statistics
CN112287769A (en) * 2020-10-09 2021-01-29 江汉大学 Face detection method, device, equipment and storage medium
CN112287769B (en) * 2020-10-09 2024-03-12 江汉大学 Face detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110659608A (en) Scene classification method based on multi-feature fusion
Bosch et al. Representing shape with a spatial pyramid kernel
Everingham et al. Pascal visual object classes challenge results
CN112633382B (en) Method and system for classifying few sample images based on mutual neighbor
CN107085731B (en) Image classification method based on RGB-D fusion features and sparse coding
Marszałek et al. Accurate object recognition with shape masks
CN103679192A (en) Image scene type discrimination method based on covariance features
CN110738672A (en) image segmentation method based on hierarchical high-order conditional random field
Karmakar et al. Improved tamura features for image classification using kernel based descriptors
CN112784722B (en) Behavior identification method based on YOLOv3 and bag-of-words model
Wilber et al. Exemplar codes for facial attributes and tattoo recognition
Mannan et al. Optimized segmentation and multiscale emphasized feature extraction for traffic sign detection and recognition
CN111414958B (en) Multi-feature image classification method and system for visual word bag pyramid
Dunlop Scene classification of images and video via semantic segmentation
Ahmad et al. SSH: Salient structures histogram for content based image retrieval
CN108536772B (en) Image retrieval method based on multi-feature fusion and diffusion process reordering
CN112818779B (en) Human behavior recognition method based on feature optimization and multiple feature fusion
Caputo et al. A performance evaluation of exact and approximate match kernels for object recognition
Tang et al. Rapid forward vehicle detection based on deformable Part Model
Krig et al. Local Feature Design Concepts, Classification, and Learning
Vinoharan et al. An Efficient BoF Representation for Object Classification
Chen et al. Indoor/outdoor classification with multiple experts
Rapantzikos et al. On the use of spatiotemporal visual attention for video classification
Sarkar et al. A meta-algorithm for classification by feature nomination
Xu et al. Integrated patch model: A generative model for image categorization based on feature selection

Legal Events

Code Title/Description
PB01: Publication
WD01: Invention patent application deemed withdrawn after publication (application publication date: 20200107)