CN107085731B - Image classification method based on RGB-D fusion features and sparse coding - Google Patents


Info

Publication number
CN107085731B
Authority
CN
China
Prior art keywords
image
feature
fusion
features
coding
Prior art date
Legal status
Active
Application number
CN201710328468.7A
Other languages
Chinese (zh)
Other versions
CN107085731A (en
Inventor
周彦 (Zhou Yan)
向程谕 (Xiang Chengyu)
王冬丽 (Wang Dongli)
Current Assignee
Xiangtan University
Original Assignee
Xiangtan University
Priority date
Filing date
Publication date
Application filed by Xiangtan University filed Critical Xiangtan University
Priority to CN201710328468.7A
Publication of CN107085731A
Application granted
Publication of CN107085731B
Status: Active

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention discloses an image classification method based on RGB-D fusion features and sparse coding, whose specific implementation steps are: (1) extracting the dense SIFT features and PHOG features of the color image and the depth image; (2) performing feature fusion on the extracted features of the two images in a linear series-connection mode, finally obtaining four different fusion features; (3) clustering the different fusion features with the K-means++ clustering method to obtain four different visual dictionaries; (4) performing locally constrained linear coding on each visual dictionary to obtain different image expression sets; (5) classifying the different image expression sets with a linear SVM (support vector machine) and determining the final classification from the multiple classification results by a voting decision method. The invention achieves high classification accuracy.

Description

Image classification method based on RGB-D fusion features and sparse coding
Technical Field
The invention relates to the technical fields of computer vision, pattern recognition and the like, and in particular to an image classification method based on RGB-D fusion features and sparse coding.
Background
Today's society is in an era of information explosion: besides massive text information, the multimedia information (pictures, videos and the like) that humans contact has also grown explosively. Computers are therefore required to understand image content the way humans do, so that images can be accurately and efficiently utilized, managed and retrieved. Image classification is an important way to solve the image understanding problem and strongly promotes the development of multimedia retrieval technology. Because an acquired image may be influenced by multiple factors such as viewpoint change, illumination, occlusion and background, image classification has long been a challenging problem in the fields of computer vision and artificial intelligence, and many image feature description and classification techniques have consequently developed rapidly.
In current image feature description and classification technology, the mainstream algorithms are based on the Bag-of-Features (BoF) model. S. Lazebnik proposed the Spatial Pyramid Matching (SPM) framework based on BoF in the article "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories"; the framework recovers the spatial information lost by the BoF algorithm and effectively improves the accuracy of image classification. However, BoF-based algorithms all encode features with Vector Quantization (VQ), and this hard-coding mode ignores the interrelation between visual words in the visual dictionary, so the encoded image features carry a large error, which degrades the performance of the whole image classification algorithm.
In recent years, as Sparse Coding (SC) theory has matured, it has become one of the most popular techniques in the field of image classification. Yang proposed Sparse coding Spatial Pyramid Matching (ScSPM) in the article "Linear spatial pyramid matching using sparse coding for image classification"; the model replaces hard assignment with sparse coding and optimizes the weight coefficients of the visual dictionary, thereby quantizing image features better and greatly improving the accuracy and efficiency of image classification. However, because the codebook is over-complete, highly similar original features may be expressed completely differently, so the stability of ScSPM is poor. Wang et al. improved ScSPM in the article "Locality-constrained Linear Coding for image classification", pointing out that locality is more important than sparsity: a feature descriptor is represented by multiple bases in the visual dictionary, and similar feature descriptors obtain similar codes by sharing their local bases, which greatly mitigates the instability of ScSPM.
The above methods classify color images only and ignore the depth information of the object or scene. Depth information is one of the important cues for image classification, because it easily separates foreground from background by distance and directly reflects the three-dimensional information of an object or scene. With the rise of Kinect, depth images have become easier to acquire, and algorithms that combine depth information for image classification have become popular. The article "Kernel descriptors for visual recognition" by Liefeng Bo et al. extracts image features from the perspective of kernel methods and classifies the image; however, the algorithm must first model the object in three dimensions, which is time-consuming and has low real-time performance. In the article "Indoor scene segmentation using a structured light sensor", Silberman first extracts features of the Depth image and the color (RGB) image separately with the Scale-Invariant Feature Transform (SIFT) algorithm, then performs feature fusion, and classifies images with SPM (Spatial Pyramid Matching). In the article "A Category-Level 3D Object Dataset: Putting the Kinect to Work", Janoch extracts features of the depth image and the color image with the Histogram of Oriented Gradients (HOG) algorithm and realizes final image classification after feature fusion. Mirdanies et al., in the article "Object recognition system in remote controlled weapon station using SIFT and SURF methods", fuse the SIFT features extracted from the RGB image with the SURF features of the depth image and use the fused features for object classification. All these algorithms fuse RGB features and depth features at the feature level, which can effectively improve image classification accuracy. However, they share a defect: the features extracted from the RGB image and the depth image are single features. A single feature extracts insufficient information from the image, so the resulting fusion feature cannot fully express the image content. The reason is that the RGB image is easily affected by illumination change, viewing-angle change, geometric deformation of the image, shadow, occlusion and the like, while the depth image is easily affected by the imaging device, producing holes, noise and other problems in the image; extracting a single image feature cannot remain robust to all these factors, and information in the image is inevitably lost.
Therefore, it is necessary to design an image classification method with higher classification accuracy.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, an image classification method based on RGB-D fusion features and sparse coding that has high accuracy and good stability.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
an image classification method based on RGB-D fusion features and sparse coding comprises a training stage and a testing stage:
the training phase comprises the steps of:
step A1, extracting a Scale-invariant feature transform (Scale-invariant feature transform) and a photo (histogram of layered gradient directions) feature of an RGB image and a Depth image (color image and Depth image) of each sample data; the number of sample data is n;
Step A2, performing feature fusion on the extracted features of the two images of each sample datum in a pairwise linear series mode to obtain four different fusion features; the same kind of fusion feature obtained from the n sample data forms a set, yielding four fusion feature sets;
Through the above feature extraction, the dense SIFT and PHOG features of the RGB image and the dense SIFT and PHOG features of the Depth image are obtained; the obtained features are then normalized so that all features have similar scales. To reduce the complexity of feature fusion, the invention fuses features in a pairwise linear series-connection mode, namely:
f = K1·α + K2·β   (1)

where K1 and K2 are the weights of the corresponding features with K1 + K2 = 1; in the present invention, K1 = K2 = 0.5. α denotes the extracted feature of the RGB image and β the extracted feature of the Depth image. Four different fusion features are finally obtained, namely the RGBD-denseSIFT feature, the RGB-denseSIFT + D-PHOG feature, the RGB-PHOG + D-denseSIFT feature and the RGBD-PHOG feature, i.e., the fusion features generated respectively by the dense SIFT features of the RGB image and the Depth image, by the dense SIFT feature of the RGB image and the PHOG feature of the Depth image, by the PHOG feature of the RGB image and the dense SIFT feature of the Depth image, and by the PHOG features of the RGB image and the Depth image.
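As an illustration only (the patent contains no source code), the following Python sketch implements step A2 under the assumption that the "linear series connection" of formula (1) means concatenating the weighted feature vectors, since α and β generally have different dimensions; all function and variable names are invented for this sketch.

```python
import numpy as np

def fuse(alpha: np.ndarray, beta: np.ndarray,
         k1: float = 0.5, k2: float = 0.5) -> np.ndarray:
    """Pairwise linear series fusion of formula (1), read as a weighted
    concatenation f = [K1*alpha, K2*beta]."""
    assert abs(k1 + k2 - 1.0) < 1e-9, "weights must satisfy K1 + K2 = 1"
    return np.concatenate([k1 * alpha, k2 * beta])

# The four fusion features of one sample (dsift_*/phog_* are the
# normalized step-A1 descriptors; the names are illustrative):
# f1 = fuse(dsift_rgb, dsift_depth)  # RGBD-denseSIFT
# f2 = fuse(dsift_rgb, phog_depth)   # RGB-denseSIFT + D-PHOG
# f3 = fuse(phog_rgb, dsift_depth)   # RGB-PHOG + D-denseSIFT
# f4 = fuse(phog_rgb, phog_depth)    # RGBD-PHOG
```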
Step A3, clustering fusion features in the four fusion feature sets respectively to obtain four different visual dictionaries;
Step A4, performing feature coding of the fusion features on each visual dictionary by adopting a locally constrained linear coding model to obtain four different image expression sets;
Step A5, constructing classifiers from the four different fusion feature sets, the image expression sets and the class labels of the corresponding sample data to obtain four different classifiers.
The testing phase comprises the following steps:
Step B1, extracting and fusing the features of the image to be classified according to the method of steps A1-A2 to obtain the four fusion features of the image to be classified;
step B2, respectively carrying out feature coding on the four fusion features obtained in the step B1 on the four visual dictionaries obtained in the step A3 by adopting a local constraint linear coding model to obtain four different image expressions of the image to be classified;
Step B3, classifying the four image expressions obtained in step B2 with the four classifiers obtained in step A5, respectively, to obtain four class labels (the four class labels may include identical labels or may all be different);
Step B4, based on the four obtained class labels, obtaining the final class label of the image to be classified with a voting decision method, namely selecting the class label with the largest number of votes among the four class labels as the final class label.
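A minimal Python sketch of the step-B4 voting decision (an assumption for illustration; the patent gives no code). It includes the random tie-break described later in the text:

```python
from collections import Counter
import random

def vote(labels):
    """Majority vote over the four classifier outputs (step B4).

    Ties (several labels sharing the largest vote count) are broken
    by random choice, as described further below."""
    counts = Counter(labels)
    top = max(counts.values())
    winners = [lab for lab, c in counts.items() if c == top]
    return random.choice(winners)

# vote(["sofa", "sofa", "table", "chair"])  ->  "sofa"
```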
Further, in the step A3, the K-means++ clustering method is used to perform clustering processing on the fusion features in each fusion feature set.
The traditional K-means algorithm for establishing the visual dictionary has the advantages of simplicity, high performance and high efficiency. However, it has a notable limitation: the initial cluster centers are selected at random, so the clustering result depends strongly on the initial center points, and a poor initial selection can trap the algorithm in a local optimum, which is fatal to correct image classification. To address this defect, the invention establishes the visual dictionary with the K-means++ algorithm, replacing the random selection of initial cluster centers with a probability-based selection. The specific method for clustering any fusion feature set into its corresponding visual dictionary is as follows:
3.1) The fusion features obtained from the n sample data are combined into a set, namely the fusion feature set H_I = {h1, h2, h3, …, hn}, and the cluster number is set to m;
3.2) From the fusion feature set H_I, one point is randomly selected as the first initial cluster center S1; the count value is set to t = 1;
3.3) For every point h_i ∈ H_I in the fusion feature set, the distance d(h_i) between it and S_t is calculated;
3.4) The next initial cluster center S_{t+1} is selected: based on the formula

P(h_i′) = d(h_i′)² / Σ_{h∈H_I} d(h)²

the probability that a point h_i′ ∈ H_I is selected as the next initial cluster center is calculated, and the point with the maximum probability is selected as the next initial cluster center S_{t+1};
3.5) Let t = t + 1 and repeat steps 3.3) and 3.4) until t = m, i.e., until m initial cluster centers have been selected;
3.6) Run the K-means algorithm with the selected initial cluster centers, finally generating m cluster centers;
3.7) Each cluster center is defined as a visual word of the visual dictionary; the cluster number m is the size of the visual dictionary.
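For illustration, a Python sketch of the initialization in steps 3.1)-3.6) (an assumption; the patent gives no code). It takes d(h_i) as the distance to the nearest already-selected center, as in standard K-means++, and follows the patent in picking the highest-probability point deterministically rather than sampling:

```python
import numpy as np

def kmeanspp_init(H: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """Initial-center selection of steps 3.1)-3.5).

    H: (n, d) array of fusion features; m: dictionary size."""
    rng = np.random.default_rng(seed)
    centers = [H[rng.integers(len(H))]]       # step 3.2): random first center
    while len(centers) < m:                   # steps 3.3)-3.5)
        # distance of every point to its nearest already-chosen center
        d = np.min(np.linalg.norm(H[:, None, :] - np.array(centers)[None, :, :],
                                  axis=2), axis=1)
        p = d ** 2 / np.sum(d ** 2)           # selection probability
        centers.append(H[np.argmax(p)])       # most probable point
    return np.array(centers)

# The m centers then seed a standard K-means run (step 3.6)), e.g.
# sklearn.cluster.KMeans(n_clusters=m, init=kmeanspp_init(H, m), n_init=1).
```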
Further, in the step A4, a locally constrained linear coding (LLC) model is used to perform feature coding on the fusion features; the model expression is:

min_C Σ_{i=1…n} ‖h_i − B·c_i‖² + λ‖d_i ⊙ c_i‖²   (2)
s.t. 1ᵀ·c_i = 1, ∀i

In the formula: h_i is a fused feature in the fusion feature set H_I, i.e., the feature vector to be encoded, h_i ∈ R^d, where d denotes the dimension of the fused feature; B = [b1, b2, b3, …, bm] is the visual dictionary established by the K-means++ algorithm, b1~bm are the m visual words of the visual dictionary, b_j ∈ R^d; C = [c1, c2, c3, …, cn] is the coded image expression set, where c_i ∈ R^m is the sparse-coding representation of an image after coding is finished; λ is the penalty factor of LLC; ⊙ denotes element-wise multiplication; in the constraint 1ᵀ·c_i = 1, 1 denotes a vector whose elements are all 1, and the constraint makes the LLC code translation invariant; d_i is defined as:

d_i = exp(dist(h_i, B) / σ)   (3)

where dist(h_i, B) = [dist(h_i, b1), dist(h_i, b2), …, dist(h_i, bm)]ᵀ, dist(h_i, b_j) denotes the Euclidean distance between h_i and b_j, and σ adjusts the decay speed of the locality constraint weights.
The present invention employs locally constrained linear coding (LLC). Locality is more important than sparsity, because a locality constraint on features necessarily yields sparsity, while sparsity does not necessarily yield locality. By replacing the sparsity constraint with a locality constraint, LLC obtains good performance.
Further, in the step A4, the fusion features are feature-coded by using the approximate locally constrained linear coding model. When the coding model of formula (2) is solved for c_i, the feature vector h_i to be encoded tends to select the visual words in the visual dictionary that are close to it to form a local coordinate system. According to this rule, a simple approximate LLC feature coding mode can be used to accelerate the coding process: instead of solving formula (2), for any feature vector h_i to be encoded, a k-nearest-neighbor search selects the k closest visual words in the visual dictionary B as a local visual word matrix B_i, and the code is obtained by solving a linear system of much smaller scale:

min_C̃ Σ_{i=1…n} ‖h_i − B_i·c̃_i‖²   (4)
s.t. 1ᵀ·c̃_i = 1, ∀i

where C̃ = [c̃1, c̃2, …, c̃n] is the image expression set obtained by approximate coding, and c̃_i is the sparse-coding representation of an image after approximate coding is finished. According to the analytic solution of formula (4), approximate LLC feature coding reduces the computational complexity from O(n²) to O(n + k²) with k << n, while its final performance differs little from full LLC feature coding. Because the approximate LLC coding mode retains locality while still guaranteeing coding sparsity, the invention uses the approximate LLC model for feature coding.
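A Python sketch of the approximate LLC encoder (illustrative only; variable names and the regularization constant are assumptions). It follows the analytic solution given in Wang et al.'s LLC paper:

```python
import numpy as np

def approx_llc(h: np.ndarray, B: np.ndarray, k: int = 50,
               reg: float = 1e-4) -> np.ndarray:
    """Approximate LLC code of formula (4) for a single feature vector h.

    h: (d,) fused feature to encode; B: (m, d) visual dictionary,
    one visual word per row; `reg` is a small numerical stabilizer."""
    m = B.shape[0]
    # k-nearest visual words form the local matrix B_i (k-NN search)
    idx = np.argsort(np.linalg.norm(B - h, axis=1))[:k]
    Bi = B[idx]                      # (k, d)
    z = Bi - h                       # shift words to the feature
    C = z @ z.T                      # local covariance, (k, k)
    C += reg * np.trace(C) * np.eye(k)
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                     # enforce the constraint 1^T c = 1
    code = np.zeros(m)
    code[idx] = w                    # sparse code over the full dictionary
    return code
```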
Further, k is taken to be 50.
Further, in the step A1, the dense SIFT feature divides the image into feature blocks of equal size with a grid, adjacent blocks overlapping; the center of each feature block is taken as a feature point, the SIFT feature descriptor of that feature point (the same gradient histogram as in traditional SIFT) is formed from all pixel points in the block, and the feature points with their SIFT descriptors finally form the dense SIFT feature of the whole image.
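For illustration, a Python/OpenCV sketch of the dense SIFT extraction just described (an assumption, not the patent's code; it requires an OpenCV build that includes SIFT, e.g. OpenCV >= 4.4). The step and block-size defaults match the experimental settings given later:

```python
import cv2
import numpy as np

def dense_sift(gray: np.ndarray, step: int = 8, size: int = 16) -> np.ndarray:
    """Dense SIFT: descriptors at the centers of overlapping grid blocks."""
    sift = cv2.SIFT_create()
    kps = [cv2.KeyPoint(float(x), float(y), float(size))
           for y in range(size // 2, gray.shape[0] - size // 2 + 1, step)
           for x in range(size // 2, gray.shape[1] - size // 2 + 1, step)]
    _, desc = sift.compute(gray, kps)  # one 128-dim SIFT descriptor per block
    return desc                        # (num_blocks, 128)
```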
the specific steps of the PHOG feature extraction are as follows:
1.1) Count the edge information of the image: extract the edge contour of the image with the Canny edge detection operator and use the contour to describe the shape of the image;
1.2) Perform pyramid-level segmentation of the image, the number of image blocks depending on the number of pyramid layers. In the invention the image is divided into 3 layers: layer 1 is the whole image; layer 2 divides the image into 4 sub-regions of equal size; layer 3 divides each of the 4 layer-2 sub-regions into 4 further sub-regions, finally yielding 4 × 4 sub-regions;
1.3) Extract the HOG (Histogram of Oriented Gradients) feature vector of each sub-region in each layer;
1.4) Finally, concatenate the HOG feature vectors of the sub-regions in every layer of the image; after obtaining the concatenated HOG data, perform data normalization, finally obtaining the PHOG feature of the whole image.
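The steps above can be sketched as follows in Python (illustrative; the Canny thresholds and the 9-bin, 3-level defaults are assumptions consistent with the experimental settings):

```python
import cv2
import numpy as np

def phog(gray: np.ndarray, bins: int = 9, levels: int = 3) -> np.ndarray:
    """PHOG per steps 1.1)-1.4): Canny edges, then orientation histograms
    over a 3-level spatial pyramid (1, 2x2, 4x4 cells), concatenated and
    L1-normalized."""
    edges = cv2.Canny(gray, 100, 200)                  # step 1.1)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy, angleInDegrees=True)
    mag = np.where(edges > 0, mag, 0)                  # keep edge pixels only
    feats = []
    for lvl in range(levels):                          # steps 1.2)-1.3)
        cells = 2 ** lvl                               # 1, 2, 4 cells per side
        for ys in np.array_split(range(gray.shape[0]), cells):
            for xs in np.array_split(range(gray.shape[1]), cells):
                a = ang[np.ix_(ys, xs)] % 180.0
                w = mag[np.ix_(ys, xs)]
                hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=w)
                feats.append(hist)
    f = np.concatenate(feats)                          # step 1.4)
    return f / (np.abs(f).sum() + 1e-12)               # L1 normalization
```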
Further, in the step A5, the classifier is a linear SVM classifier.
Further, the voting decision method in step B4 may encounter a tie, in which several different class labels receive the same, largest number of votes; in this case, one of the tied class labels is randomly selected as the final class label.
The beneficial effects of the invention are:
The invention selects multiple fusion features, which compensates for the insufficient information content of any single fusion feature of the image and effectively improves the accuracy of image classification. The K-means++ algorithm is selected to establish the visual dictionary, replacing the random selection of initial cluster centers with a probability-based selection, which effectively prevents the algorithm from falling into a local optimum. Finally, a voting decision method votes on the individual classification results, fusing classification results that differ considerably, and the final classification is determined by the voting decision, which guarantees the stability of the results.
Drawings
FIG. 1 is a flow chart of an image classification method integrating RGB-D fusion features with sparse coding.
FIG. 2 shows the LLC feature coding model in step A4 of the training phase of the present invention.
FIG. 3 illustrates the test image classification decision module in step B4 of the testing phase of the present invention.
FIG. 4 is a recognition confusion matrix on an RGB-D Scenes data set according to the present invention.
Detailed Description
The invention is described in further detail below with reference to specific examples and with reference to the accompanying drawings. The described examples are intended to be illustrative of the invention and are not intended to be limiting in any way.
FIG. 1 is a flow chart of the image classification system integrating RGB-D fusion features with sparse coding; the specific implementation steps are as follows:
step S1: extracting dense SIFT features and PHOG features of the RGB image and the Depth image;
step S2: performing feature fusion on the extracted features of the two images in a serial connection mode to finally obtain four different fusion features;
step S3: clustering different fusion characteristics by using a K-means + + clustering method to obtain four different visual dictionaries;
step S4: performing local constraint linear coding on each visual dictionary to obtain different image expression sets;
step S5: and constructing classifiers for different image expression sets by using a linear SVM (support vector machine), and finally determining the final classification by voting on classification results of the four classifiers.
The image classification method integrating RGB-D fusion features and sparse coding provided by the invention is verified below using experimental data.
The experimental data set adopted by the invention is the RGB-D Scenes data set, a multi-view scene picture data set provided by the University of Washington. It consists of 8 scene categories with 5972 pictures in total; all images were captured by a Kinect camera at a size of 640 × 480.
In the RGB-D Scenes dataset, all images were used for the experiment, and the image size was adjusted to 256 × 256. For feature extraction, the sampling interval of the dense SIFT features extracted from the image was set to 8 pixels with 16 × 16 image blocks. The PHOG feature extraction parameters were set as follows: image block size 16 × 16, sampling interval 8 pixels, and 9 gradient-direction bins. When building the visual dictionary, the dictionary size was set to 200. For SVM classification, the LIBSVM 3.12 toolbox was adopted; 80% of the pictures in the data set were used for training and 20% for testing.
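As a sketch of this training/evaluation protocol (illustrative; the experiments used LIBSVM, for which scikit-learn's LinearSVC is substituted here, and all names are assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def train_and_eval(X_f: np.ndarray, y: np.ndarray, seed: int = 0):
    """Train one linear SVM per fusion feature and score it.

    X_f: (n_images, code_dim) LLC image expressions for one fusion
    feature; y: class labels."""
    Xtr, Xte, ytr, yte = train_test_split(
        X_f, y, test_size=0.2, stratify=y, random_state=seed)  # 80/20 split
    clf = LinearSVC().fit(Xtr, ytr)
    return clf, clf.score(Xte, yte)
```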
The experiment considers the method from two aspects: first, comparison with the methods of researchers that achieve high classification accuracy; second, comparison of the classification effect of different RGB-D fusion features with that of the proposed method.
TABLE 1 Comparison of classification results on the RGB-D Scenes dataset

Classification method       Accuracy (%)
Linear SVM                  89.6
Gaussian kernel SVM         90.0
Random forest               90.1
HOG                         77.2
SIFT + SPM                  84.2
Method of the invention     91.7
Table 1 compares the classification accuracy with other methods. Liefeng Bo integrates three features in the article "Kernel descriptors for visual recognition" and trains and classifies with a linear SVM, a Gaussian kernel SVM and a Random Forest, obtaining accuracies of 89.6%, 90.0% and 90.1% respectively in this experiment. Janoch uses the HOG algorithm to extract features of the depth image and the color image separately in the article "A Category-Level 3D Object Dataset: Putting the Kinect to Work" and realizes the final classification with an SVM classifier after feature fusion, obtaining an accuracy of 77.2% in this experiment. In the article "Indoor scene segmentation using a structured light sensor", Silberman first extracts the features of the depth image and the color image separately with the SIFT algorithm, then performs feature fusion, encodes the features with SPM, and finally classifies with an SVM; in this experiment the algorithm obtains a classification accuracy of 84.2%. The algorithm provided by the invention obtains an accuracy of 91.7%, an improvement of 1.6% over the best previous result, showing that the proposed algorithm has good classification performance.
TABLE 2 comparison of classification results of different fusion characteristics of RGB-D Scenes data set
As can be seen from Table 2, when depth information is combined for image classification, classification algorithms based on a single fusion feature achieve lower accuracy than those based on multiple fusion features; an image classification algorithm based on multi-feature fusion obtains better classification accuracy, but is still slightly inferior to the proposed image classification algorithm based on decision fusion of multiple fusion features.
The foregoing description of specific embodiments of the present invention has been presented. It should be understood that the invention is not limited to the particular embodiments described above, but is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the invention.

Claims (7)

1. An image classification method based on RGB-D fusion features and sparse coding is characterized by comprising a training stage and a testing stage:
the training phase comprises the steps of:
A1, extracting the dense SIFT and PHOG features of the RGB image and the Depth image of each sample datum; the number of sample data is n;
A2, performing feature fusion on the extracted features of the two images of each sample datum in a pairwise linear series mode to obtain four different fusion features; combining the same kind of fusion features obtained from the n sample data into a set to obtain four fusion feature sets;
A3, clustering the fusion features in the four fusion feature sets by using the K-means++ clustering method to obtain four different visual dictionaries; the method of clustering the fusion features in a given fusion feature set with the K-means++ clustering method to establish the corresponding visual dictionary comprises the following steps:
3.1) recording the fusion feature set as H_I = {h1, h2, h3, …, hn} and setting the cluster number as m;
3.2) in H_I, randomly selecting one fusion feature as the first initial cluster center S1; setting the count value t = 1;
3.3) for each fusion feature h_i ∈ H_I, calculating the distance d(h_i) between it and S_t;
3.4) selecting the next initial cluster center S_{t+1}: based on the formula

P(h_i′) = d(h_i′)² / Σ_{h∈H_I} d(h)²

calculating the probability that a point h_i′ ∈ H_I is selected as the next initial cluster center; selecting the fusion feature with the maximum probability as the next initial cluster center S_{t+1};
3.5) letting t = t + 1 and repeating steps 3.3) and 3.4) until t = m, i.e., until m initial cluster centers are selected;
3.6) running the K-means algorithm with the selected initial cluster centers, finally generating m cluster centers;
3.7) defining each cluster center as a visual word of the visual dictionary, the cluster number m being the size of the visual dictionary;
A4, performing feature coding of the fusion features on each visual dictionary by adopting a locally constrained linear coding model to obtain four different image expression sets;
A5, constructing classifiers from the four different fusion feature sets, the image expression sets and the class labels of the corresponding sample data to obtain four different classifiers;
the testing phase comprises the following steps:
step B1, extracting and fusing the features of the image to be classified according to the method of steps A1-A2 to obtain the four fusion features of the image to be classified;
step B2, respectively carrying out feature coding on the four fusion features obtained in the step B1 on the four visual dictionaries obtained in the step A3 by adopting a local constraint linear coding model to obtain four different image expressions of the image to be classified;
step B3, classifying the four image expressions obtained in the step B2 by using the four classifiers obtained in the step A5 respectively to obtain four class labels;
and step B4, based on the obtained four class labels, obtaining the final class label of the image to be classified by using a voting decision method, namely selecting the class label with the largest number of votes from the four class labels as the final class label.
2. The method for classifying images based on RGB-D fusion features and sparse coding as claimed in claim 1, wherein in said step A4, a locally constrained linear coding model is used to perform feature coding on the fusion features, the model expression being:

min_C Σ_{i=1…n} ‖h_i − B·c_i‖² + λ‖d_i ⊙ c_i‖²   (2)
s.t. 1ᵀ·c_i = 1, ∀i

in the formula: h_i is a fused feature in the fusion feature set H_I, i.e., the feature vector to be encoded, h_i ∈ R^d, with d denoting the dimension of the fused feature; B = [b1, b2, b3, …, bm] is the visual dictionary established by the K-means++ algorithm, b1~bm being the m visual words of the visual dictionary, b_j ∈ R^d; C = [c1, c2, c3, …, cn] is the coded image representation set, where c_i ∈ R^m is the coding coefficient of the fused feature h_i after encoding is completed; λ is the penalty factor of LLC; ⊙ denotes element-wise multiplication; in the constraint 1ᵀ·c_i = 1, 1 denotes a vector whose elements are all 1, the constraint making the locally constrained linear coding model translation invariant; d_i is defined as:

d_i = exp(dist(h_i, B) / σ)   (3)

wherein dist(h_i, B) = [dist(h_i, b1), dist(h_i, b2), …, dist(h_i, bm)]ᵀ, dist(h_i, b_j) represents the Euclidean distance between h_i and b_j, and σ is used to adjust the decay speed of the locality constraint weights.
3. The method for classifying images based on RGB-D fusion features and sparse coding as claimed in claim 1, wherein in said step A4, the fusion features are feature-coded by using the approximate locally constrained linear coding model, the model expression being:

min_C̃ Σ_{i=1…n} ‖h_i − B_i·c̃_i‖²   (4)
s.t. 1ᵀ·c̃_i = 1, ∀i

wherein B_i is the local visual word matrix of the k visual words in the visual dictionary B closest to the feature vector h_i to be coded, selected by a k-nearest-neighbor search; C̃ = [c̃1, c̃2, …, c̃n] is the image representation set obtained by approximate coding, where c̃_i is the coding coefficient of the fused feature h_i after approximate coding is completed.
4. The RGB-D fusion features and sparse coding based image classification method according to claim 3, wherein k is 50.
5. The image classification method based on RGB-D fusion features and sparse coding according to any one of claims 1 to 4, wherein in the step A1, the specific steps of PHOG feature extraction are as follows:
1.1) counting edge information of the image; extracting an edge contour of the image by using a Canny edge detection operator, and using the contour to describe the shape of the image;
1.2) carrying out pyramid-level segmentation on the image: dividing the image into 3 layers, wherein layer 1 is the whole image; layer 2 divides the image into 4 sub-regions of equal size; layer 3 divides each of the 4 layer-2 sub-regions into 4 further sub-regions, finally obtaining 4 × 4 sub-regions;
1.3) extracting the HOG feature vector of each sub-region in each layer;
1.4) finally, concatenating the HOG feature vectors of the sub-regions in every layer of the image and, after obtaining the concatenated HOG data, carrying out data normalization to finally obtain the PHOG feature of the whole image.
6. The image classification method based on RGB-D fusion features and sparse coding according to any one of claims 1 to 4, wherein in the step A5, a linear SVM classifier is adopted as the classifier.
7. The image classification method based on RGB-D fusion features and sparse coding according to any one of claims 1 to 4, wherein in step B4, if there are a plurality of class labels with the largest number of votes, one of the several class labels is randomly selected as the final class label.
CN201710328468.7A 2017-05-11 2017-05-11 Image classification method based on RGB-D fusion features and sparse coding Active CN107085731B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710328468.7A CN107085731B (en) 2017-05-11 2017-05-11 Image classification method based on RGB-D fusion features and sparse coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710328468.7A CN107085731B (en) 2017-05-11 2017-05-11 Image classification method based on RGB-D fusion features and sparse coding

Publications (2)

Publication Number Publication Date
CN107085731A (en) 2017-08-22
CN107085731B (en) 2020-03-10

Family

ID=59611626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710328468.7A Active CN107085731B (en) 2017-05-11 2017-05-11 Image classification method based on RGB-D fusion features and sparse coding

Country Status (1)

Country Link
CN (1) CN107085731B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090926A (en) * 2017-12-31 2018-05-29 厦门大学 A kind of depth estimation method based on double dictionary study
CN108596256B (en) * 2018-04-26 2022-04-01 北京航空航天大学青岛研究院 Object recognition classifier construction method based on RGB-D
CN108805183B (en) * 2018-05-28 2022-07-26 南京邮电大学 Image classification method fusing local aggregation descriptor and local linear coding
CN108875080B (en) * 2018-07-12 2022-12-13 百度在线网络技术(北京)有限公司 Image searching method, device, server and storage medium
CN109741484A (en) * 2018-12-24 2019-05-10 南京理工大学 Automobile data recorder and its working method with image detection and voice alarm function
CN110443298B (en) * 2019-07-31 2022-02-15 华中科技大学 Cloud-edge collaborative computing-based DDNN and construction method and application thereof
CN111160387B (en) * 2019-11-28 2022-06-03 广东工业大学 Graph model based on multi-view dictionary learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005786A (en) * 2015-06-19 2015-10-28 南京航空航天大学 Texture image classification method based on BoF and multi-feature fusion

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005786A (en) * 2015-06-19 2015-10-28 南京航空航天大学 Texture image classification method based on BoF and multi-feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Action recognition algorithm based on Kinect and pyramid features; Shen Xiaoxia et al.; Journal of Optoelectronics·Laser; 2014-02-28; pp. 357-362 *
Image classification based on RGB-D fusion features; Xiang Chengyu; Computer Engineering and Applications; 2017-03-16; pp. 178-182 *

Also Published As

Publication number Publication date
CN107085731A (en) 2017-08-22

Similar Documents

Publication Publication Date Title
CN107085731B (en) Image classification method based on RGB-D fusion features and sparse coding
Liu et al. Recognizing human actions using multiple features
Wang et al. Mmss: Multi-modal sharable and specific feature learning for rgb-d object recognition
US9633282B2 (en) Cross-trained convolutional neural networks using multimodal images
Kim et al. A hierarchical image clustering cosegmentation framework
Reddy Mopuri et al. Object level deep feature pooling for compact image representation
Negrel et al. Evaluation of second-order visual features for land-use classification
CN106126581A (en) Cartographical sketching image search method based on degree of depth study
Tabia et al. Compact vectors of locally aggregated tensors for 3D shape retrieval
Hu et al. RGB-D semantic segmentation: a review
CN109657704B (en) Sparse fusion-based coring scene feature extraction method
CN112163114B (en) Image retrieval method based on feature fusion
CN110659608A (en) Scene classification method based on multi-feature fusion
Chen et al. Multi-view feature combination for ancient paintings chronological classification
Morioka et al. Learning Directional Local Pairwise Bases with Sparse Coding.
Zhu et al. Traffic sign classification using two-layer image representation
CN111414958B (en) Multi-feature image classification method and system for visual word bag pyramid
Jampour et al. Pairwise linear regression: an efficient and fast multi-view facial expression recognition
Montazer et al. Farsi/Arabic handwritten digit recognition using quantum neural networks and bag of visual words method
James et al. Interactive video asset retrieval using sketched queries
Ren et al. A multi-scale UAV image matching method applied to large-scale landslide reconstruction
Guo et al. Integrating shape and color cues for textured 3D object recognition
Kastaniotis et al. HEp-2 cells classification using locally aggregated features mapped in the dissimilarity space
Gupta et al. Video scene categorization by 3D hierarchical histogram matching
Montazer et al. Scene classification based on local binary pattern and improved bag of visual words

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant