CN107085731A - An Image Classification Method Based on RGB‑D Fusion Feature and Sparse Coding - Google Patents
- Publication number
- CN107085731A (application CN201710328468.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- fusion
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image classification method based on RGB-D fusion features and sparse coding. The specific implementation steps are: (1) extract the dense SIFT and PHOG features of the color image and the depth image; (2) fuse the features extracted from the two images by linear concatenation, finally obtaining four different fusion features; (3) cluster each of the fusion features with the K-means++ clustering method to obtain four different visual dictionaries; (4) perform locality-constrained linear coding on each visual dictionary to obtain different image representation sets; (5) classify the different image representation sets with linear SVMs and apply a voting decision to the multiple classification results to determine the final class. The invention achieves high classification accuracy.
Description
Technical Field
The invention relates to the technical fields of computer vision and pattern recognition, and in particular to an image classification method based on RGB-D fusion features and sparse coding.
Background Art
Today's society is in an era of information explosion. Besides large amounts of text, the multimedia information (pictures, videos, etc.) that people encounter is also growing explosively. To use, manage and retrieve images accurately and efficiently, computers must understand image content in the way humans do. Image classification is an important route to image understanding and strongly promotes the development of multimedia retrieval technology. However, the acquired images may be affected by viewpoint changes, illumination, occlusion, background and other factors, which has made image classification a long-standing, challenging problem in computer vision and artificial intelligence and has driven the rapid development of many image feature description and classification techniques.
Among current image feature description and classification techniques, the dominant algorithms are based on the Bag-of-Features (BOF) model. S. Lazebnik proposed the Spatial Pyramid Matching (SPM) framework built on BOF in the article "Spatial pyramid matching for recognizing natural scene categories"; this algorithm recovers the spatial information lost by BOF and effectively improves the accuracy of image classification. However, BOF-based algorithms encode features by Vector Quantization (VQ), and this hard-assignment scheme ignores the relationships between the visual words of the visual dictionary, leading to large coding errors that in turn degrade the performance of the whole image classification algorithm.
In recent years, as sparse coding (SC) theory has matured, it has become one of the most popular techniques in image classification. In the article "Linear spatial pyramid matching using sparse coding for image classification", Yang proposed Sparse coding Spatial Pyramid Matching (ScSPM), which replaces hard assignment with sparse coding and optimizes the weight coefficients over the visual dictionary, quantizing image features better and greatly improving both the accuracy and the efficiency of image classification. However, because of the over-complete codebook, features that are in fact highly similar may be represented very differently, so the stability of the ScSPM model is poor. Wang et al. improved ScSPM and proposed Locality-constrained Linear Coding (LLC) in the article "Locality-constrained linear coding for image classification", pointing out that locality is more important than sparsity: each feature descriptor is represented by several bases of the visual dictionary, and similar descriptors obtain similar codes by sharing their local bases, which greatly alleviates the instability of ScSPM.
The above methods all target the classification of color images and ignore the depth information of the object or scene. Depth information is one of the important cues for image classification: it easily separates foreground from background according to distance and directly reflects the three-dimensional information of the object or scene. With the rise of the Kinect, depth images have become much easier to acquire, and algorithms that combine depth information for image classification have become popular. Liefeng Bo et al., in the article "Kernel descriptors for visual recognition", extracted image features from the perspective of kernel methods and used them for image classification; the drawback of this algorithm is that the object must first be modeled in 3D, which is very time-consuming, so the algorithm is far from real-time. N. Silberman, in the article "Indoor scene segmentation using a structured light sensor", first used the Scale Invariant Feature Transform (SIFT) algorithm to extract features from the depth (Depth) image and the color (RGB) image separately, then fused the features and applied SPM coding for image classification. A. Janoch, in the article "A Category-Level 3D Object Dataset: Putting the Kinect to Work", used the Histogram of Oriented Gradients (HOG) algorithm to extract features from the depth image and the color image respectively and performed the final classification after feature fusion. Mirdanies M. et al., in the article "Object recognition system in remote controlled weapon station using SIFT and SURF methods", fused the SIFT features extracted from the RGB image with the SURF features of the depth image and used the fused features for target classification. All these algorithms fuse RGB features with depth features at the feature level and can effectively improve the accuracy of image classification. However, this class of algorithms also has a shortcoming: only a single feature is extracted from the RGB image and from the depth image, and a single feature extracts insufficient information, so the resulting fusion feature cannot fully describe the image content. The reason is that RGB images are easily affected by illumination changes, viewpoint changes, geometric deformation, shadows and occlusion, while depth images are easily affected by the imaging device, which introduces holes, noise and other problems; a single image feature cannot remain robust to all of these factors, so information in the image is inevitably lost.
Therefore, it is necessary to design an image classification method that classifies more accurately.
Summary of the Invention
The technical problem to be solved by the invention is, in view of the deficiencies of the prior art, to provide an image classification method that integrates RGB-D fusion features and sparse coding, with high accuracy and good stability.
To solve the above technical problem, the technical solution provided by the invention is:
An image classification method based on RGB-D fusion features and sparse coding, comprising a training phase and a testing phase.
The training phase comprises the following steps:
Step A1: for each sample, extract the dense SIFT (Scale-Invariant Feature Transform) and PHOG (Pyramid Histogram of Oriented Gradients) features of its RGB image and Depth image (color image and depth image); the number of samples is n.
Step A2: for each sample, fuse the features extracted from the two images by pairwise linear concatenation, obtaining four different fusion features; the fusion features of the same kind obtained from the n samples form one set, giving four fusion feature sets.
The above feature extraction yields the dense SIFT and PHOG features of the RGB image and the dense SIFT and PHOG features of the Depth image. The obtained features are then normalized so that all features have similar scales. To reduce the complexity of feature fusion, the invention fuses the features by pairwise linear concatenation, i.e.:
f = K1·α + K2·β  (1)
where K1 and K2 are the weights of the corresponding features, K1 + K2 = 1, and K1 = K2 in the invention; α denotes a feature extracted from the RGB image and β denotes a feature extracted from the Depth image. Four different fusion features are finally obtained: the RGBD-dense SIFT feature, the RGB-dense SIFT + D-PHOG feature, the RGB-PHOG + D-dense SIFT feature and the RGBD-PHOG feature, i.e. the fusion of the dense SIFT features of the RGB and Depth images, the fusion of the dense SIFT feature of the RGB image with the PHOG feature of the Depth image, the fusion of the PHOG feature of the RGB image with the dense SIFT feature of the Depth image, and the fusion of the PHOG features of the RGB and Depth images, respectively.
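The fusion of equation (1) can be illustrated with a short sketch. The snippet below is a minimal illustration only: the function name fuse_features, the use of NumPy and the reading of the weighted "linear concatenation" as stacking the two weighted, normalized vectors are assumptions; the equal weights K1 = K2 follow the text above.

```python
import numpy as np

def fuse_features(alpha, beta, k1=0.5, k2=0.5):
    """Fuse an RGB-image descriptor (alpha) with a Depth-image descriptor (beta).

    Reads the weighted linear concatenation of equation (1) as stacking the two
    weighted, L2-normalised vectors (an assumption; alpha and beta generally
    have different dimensions)."""
    alpha = alpha / (np.linalg.norm(alpha) + 1e-12)   # normalise to a similar scale
    beta = beta / (np.linalg.norm(beta) + 1e-12)
    return np.concatenate([k1 * alpha, k2 * beta])

# The four fusion features of step A2 (variable names are illustrative):
# rgbd_dsift     = fuse_features(rgb_dsift, depth_dsift)
# rgbdsift_dphog = fuse_features(rgb_dsift, depth_phog)
# rgbphog_ddsift = fuse_features(rgb_phog, depth_dsift)
# rgbd_phog      = fuse_features(rgb_phog, depth_phog)
```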
Step A3: cluster the fusion features of each of the four fusion feature sets to obtain four different visual dictionaries.
Step A4: on each visual dictionary, encode the fusion features with the locality-constrained linear coding model to obtain four different image representation sets.
Step A5: construct classifiers from the four different fusion feature sets, the image representation sets and the class labels of the corresponding samples, obtaining four different classifiers.
The testing phase comprises the following steps:
Step B1: extract and fuse the features of the image to be classified following the method of steps A1–A2, obtaining the four fusion features of the image to be classified.
Step B2: on the four visual dictionaries obtained in step A3, encode the four fusion features obtained in step B1 with the locality-constrained linear coding model, obtaining four different image representations of the image to be classified.
Step B3: classify the four image representations obtained in step B2 with the four classifiers obtained in step A5, obtaining four class labels (the four class labels may include identical labels or may all be different).
Step B4: based on the four class labels obtained, use a voting decision to determine the final class label of the image to be classified, i.e. select the class label that receives the most votes among the four as the final class label.
Further, in step A3 the K-means++ clustering method is used to cluster the fusion features of a given fusion feature set.
The traditional K-means algorithm for building a visual dictionary has the advantages of simplicity and efficiency, but it also has a limitation: the initial cluster centres are chosen at random, so the clustering result depends heavily on the initial centres, and a poor choice may trap the algorithm in a local optimum, which is fatal to correct image classification. To address this deficiency, the invention uses the K-means++ algorithm to build the visual dictionary, replacing random selection of the initial cluster centres with a probabilistic selection. The specific procedure for clustering any one kind of fusion feature to obtain the corresponding visual dictionary is as follows (a short sketch follows the steps):
3.1) Collect the fusion features obtained from the n samples into one set, the fusion feature set HI = {h1, h2, h3, …, hn}, and set the number of clusters to m.
3.2) Randomly select one point of the fusion feature set HI = {h1, h2, h3, …, hn} as the first initial cluster centre S1; set the counter t = 1.
3.3) For each point hi of the fusion feature set HI, hi ∈ HI, compute the distance d(hi) between hi and St.
3.4) Select the next initial cluster centre St+1:
compute the probability that a point hi′ ∈ HI is selected as the next initial cluster centre according to P(hi′) = d(hi′)^2 / Σ_{hi∈HI} d(hi)^2;
select the point with the largest probability as the next initial cluster centre St+1.
3.5) Set t = t + 1 and repeat steps 3.3) and 3.4) until t = m, i.e. until m initial cluster centres have been selected.
3.6) Run the K-means algorithm with the selected initial cluster centres, finally producing m cluster centres.
3.7) Define each cluster centre as one visual word of the visual dictionary; the number of clusters m is the size of the visual dictionary.
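A minimal sketch of steps 3.1)–3.7) is given below. It assumes the fusion features of one set are the rows of a NumPy array; the function name kmeanspp_dictionary and the use of scikit-learn's KMeans for the final refinement are assumptions. Note that, as described above, the next centre is the point with the largest selection probability (the classical k-means++ algorithm instead samples a point at random with that probability).

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeanspp_dictionary(H, m, seed=0):
    """Build a visual dictionary of m visual words from the fusion feature set H (n x d),
    following steps 3.1)-3.7): probabilistic choice of the initial centres,
    then a standard K-means refinement."""
    rng = np.random.default_rng(seed)
    centres = [H[rng.integers(H.shape[0])]]          # 3.2) first initial centre chosen at random
    while len(centres) < m:                          # 3.5) until m initial centres are selected
        d = np.linalg.norm(H - centres[-1], axis=1)  # 3.3) distance of every point to S_t
        p = d ** 2 / np.sum(d ** 2)                  # 3.4) selection probability of each point
        centres.append(H[np.argmax(p)])              #      point with the largest probability
    km = KMeans(n_clusters=m, init=np.asarray(centres), n_init=1).fit(H)  # 3.6)
    return km.cluster_centers_                       # 3.7) the m visual words
```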
Further, in step A4 the locality-constrained linear coding model is used to encode the fusion features; the model is expressed as:

min_C Σ_{i=1..n} ||hi − B·ci||^2 + λ·||di ⊙ ci||^2,  subject to 1^T·ci = 1, ∀i  (2)

where hi is a fusion feature of the fusion feature set HI, i.e. the feature vector to be encoded, hi ∈ R^d, with d the dimension of the fusion feature; B = [b1, b2, b3, …, bm] is the visual dictionary built by the K-means++ algorithm, b1–bm being its m visual words, bj ∈ R^d; C = [c1, c2, c3, …, cn] is the image representation set produced by the coding, where ci ∈ R^m is the sparse-coded representation of one image; λ is the penalty factor of LLC; ⊙ denotes element-wise multiplication; the 1 in 1^T·ci denotes the all-ones vector, and the constraint 1^T·ci = 1 makes the LLC codes translation invariant. di is defined as

di = exp(dist(hi, B)/σ)  (3)

where dist(hi, B) = [dist(hi, b1), dist(hi, b2), …, dist(hi, bm)]^T, dist(hi, bj) denotes the Euclidean distance between hi and bj, and σ adjusts the decay rate of the locality constraint weights.
The invention adopts locality-constrained linear coding (LLC). A locality constraint on the features necessarily yields sparsity, whereas sparsity does not necessarily satisfy the locality constraint, so locality is more important than sparsity. By using a locality constraint instead of a sparsity constraint, LLC achieves good performance.
Further, in step A4 an approximate locality-constrained linear coding model is used to encode the fusion features. When solving equation (2) for ci, the feature vector hi to be encoded tends to select the visual words of the dictionary that lie close to it, forming a local coordinate system. Based on this observation, a simple approximate LLC coding scheme can be used to accelerate the coding: instead of solving equation (2), for any feature vector hi to be encoded, a k-nearest-neighbour search selects the k visual words of dictionary B closest to hi as the local visual word matrix Bi, and the code is obtained by solving a much smaller linear system:

C̃ = argmin Σ_{i=1..n} ||hi − Bi·c̃i||^2,  subject to 1^T·c̃i = 1, ∀i  (4)

where C̃ = [c̃1, c̃2, c̃3, …, c̃n] is the image representation set obtained by the approximate coding and c̃i is the sparse-coded representation of one image after the approximate coding. Through the analytical solution of equation (4), approximate LLC coding reduces the computational complexity from O(n^2) to O(n + k^2) with k << n, while its final performance differs little from that of full LLC coding. Approximate LLC coding preserves local features while still meeting the sparsity requirement, so the invention uses the approximate LLC model for feature coding.
Further, k = 50 is taken.
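The approximate LLC coding of equation (4) has a simple analytical solution per feature; the sketch below follows it. The function name, the use of NumPy and the small ridge regulariser reg (a common implementation detail, not stated above) are assumptions.

```python
import numpy as np

def approx_llc_encode(h, B, k=50, reg=1e-4):
    """Approximate LLC code of one fusion feature h (d,) on the visual dictionary B (m, d):
    solve the small constrained least-squares system of equation (4) over the k visual
    words nearest to h, then place the weights back into a sparse length-m code."""
    dist = np.linalg.norm(B - h, axis=1)
    idx = np.argsort(dist)[:k]            # k-nearest-neighbour search in the dictionary
    z = B[idx] - h                        # shift the local words to the origin
    G = z @ z.T                           # local covariance (k x k)
    G += reg * np.trace(G) * np.eye(k)    # small regulariser for numerical stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                          # enforce the constraint 1^T c = 1
    c = np.zeros(B.shape[0])
    c[idx] = w                            # non-zero only on the k selected words
    return c
```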
Further, in step A1 the dense SIFT feature divides the image with a grid into feature blocks of equal size, with adjacent blocks overlapping; the centre of each block is taken as a feature point, and the SIFT descriptor of that feature point (the same descriptor as in conventional SIFT: a gradient orientation histogram) is formed from all the pixels of the block; finally, these feature points with their SIFT descriptors constitute the dense SIFT feature of the whole image.
The specific steps of PHOG feature extraction are as follows (a short sketch follows the steps):
1.1) Compute the edge information of the image: the Canny edge detection operator is used to extract the edge contour of the image, and this contour is used to describe the shape of the image.
1.2) Partition the image into pyramid levels; the number of blocks depends on the number of pyramid levels. In the invention the image is divided into 3 levels: level 1 is the whole image; level 2 divides the image into 4 sub-regions of equal size; level 3 further divides each of the 4 sub-regions of level 2 into 4 sub-regions, finally giving 4×4 sub-regions.
1.3) Extract the HOG (Histogram of Oriented Gradients) feature vector of every sub-region at every level.
1.4) Finally, concatenate the HOG feature vectors of the sub-regions of all levels of the image; after the concatenated HOG data are obtained, normalize the data to obtain the PHOG feature of the whole image.
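The two descriptors of step A1 can be sketched with standard libraries. The snippet below is illustrative only: it assumes OpenCV and scikit-image are available, the Canny thresholds and the HOG cell parameters are assumptions, and the grid step, block size and three-level pyramid follow the text above.

```python
import cv2
import numpy as np
from skimage.feature import hog

def dense_sift(gray, step=8, size=16):
    """Dense SIFT: a keypoint at the centre of every 16x16 block on an 8-pixel grid."""
    sift = cv2.SIFT_create()
    kps = [cv2.KeyPoint(float(x), float(y), float(size))
           for y in range(size // 2, gray.shape[0] - size // 2, step)
           for x in range(size // 2, gray.shape[1] - size // 2, step)]
    _, desc = sift.compute(gray, kps)
    return desc                                    # (number of grid points, 128)

def phog(gray, levels=3, orientations=9):
    """PHOG: Canny edge map, pyramid of 1, 2x2 and 4x4 sub-regions, HOG per sub-region."""
    edges = cv2.Canny(gray, 100, 200)              # thresholds are illustrative
    feats = []
    for level in range(levels):                    # level 1: whole image; then 2x2; then 4x4
        cells = 2 ** level
        hs, ws = edges.shape[0] // cells, edges.shape[1] // cells
        for r in range(cells):
            for c in range(cells):
                sub = edges[r * hs:(r + 1) * hs, c * ws:(c + 1) * ws]
                feats.append(hog(sub, orientations=orientations,
                                 pixels_per_cell=(8, 8), cells_per_block=(1, 1)))
    f = np.concatenate(feats)
    return f / (np.linalg.norm(f) + 1e-12)         # final normalisation (step 1.4)
```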
Further, in step A5 the classifiers are linear SVM classifiers.
Further, the voting decision of step B4 may produce a tie in which several different class labels receive the same, highest number of votes. In this case random selection is used: one of the tied class labels is chosen at random as the final class label.
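Steps A5 and B3–B4, including the random tie-break just described, can be summarised in a short sketch; scikit-learn's LinearSVC is assumed here as the linear SVM (the experiments below use LIBSVM), and the function names are illustrative.

```python
import random
from collections import Counter
from sklearn.svm import LinearSVC

def train_classifiers(representation_sets, labels):
    """Step A5: one linear SVM per image representation set (four in total)."""
    return [LinearSVC().fit(X, labels) for X in representation_sets]

def vote(classifiers, representations):
    """Steps B3-B4: classify the four representations of one test image and vote;
    a tie on the highest vote count is broken by random selection."""
    votes = [clf.predict(rep.reshape(1, -1))[0]
             for clf, rep in zip(classifiers, representations)]
    counts = Counter(votes)
    best = max(counts.values())
    winners = [label for label, n in counts.items() if n == best]
    return random.choice(winners)                  # random tie-break of step B4
```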
The beneficial effects of the invention are:
The invention uses multiple fusion features, which compensates for the insufficient information carried by any single fusion feature of the image and effectively improves the accuracy of image classification. The K-means++ algorithm is used to build the visual dictionary, replacing random selection of the initial cluster centres with probabilistic selection, which effectively prevents the algorithm from falling into a local optimum. Finally, a voting decision is applied to the result of each classifier: classification results with large differences are fused and the final classification is decided by the vote, which guarantees the stability of the result.
Description of the Drawings
Fig. 1 is a flowchart of the image classification method integrating RGB-D fusion features and sparse coding.
Fig. 2 shows the LLC feature coding model used in step A5 of the training phase of the invention.
Fig. 3 shows the classification decision module for test images in step B4 of the testing phase of the invention.
Fig. 4 is the recognition confusion matrix of the invention on the RGB-D Scenes dataset.
Detailed Description of the Embodiments
The invention is described in further detail below with reference to specific examples and the accompanying drawings. The described examples are intended only to aid understanding of the invention and do not limit it in any way.
Fig. 1 is the system flowchart of image classification integrating RGB-D fusion features and sparse coding; the specific implementation steps are as follows:
Step S1: extract the dense SIFT and PHOG features of the RGB image and the Depth image.
Step S2: fuse the features extracted from the two images by concatenation, finally obtaining four different fusion features.
Step S3: cluster the different fusion features with the K-means++ clustering method to obtain four different visual dictionaries.
Step S4: perform locality-constrained linear coding on each visual dictionary to obtain different image representation sets.
Step S5: construct classifiers for the different image representation sets with linear SVMs, and finally determine the final classification by a voting decision over the classification results of the four classifiers.
Based on the image classification method integrating RGB-D fusion features and sparse coding, the method of the invention is verified with experimental data.
The experimental dataset used by the invention is the RGB-D Scenes dataset, a multi-view scene image dataset provided by the University of Washington. The dataset consists of 8 scene categories with 5972 images in total; all images were acquired with a Kinect camera and have a size of 640×480.
In the RGB-D Scenes dataset, all images are used in the experiments and resized to 256×256. For feature extraction, the sampling interval of the dense SIFT features is set to 8 pixels with 16×16 image blocks. The PHOG feature extraction parameters are: image block size 16×16, sampling interval 8 pixels, and 9 gradient orientations. When building the visual dictionaries, the dictionary size is set to 200. SVM classification uses the libsvm-3.12 toolbox of the LIBSVM toolkit; 80% of the images of the dataset are used for training and 20% for testing.
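For reference, the experimental settings above can be collected in one place; the values are copied from the text, while the key names and layout of this configuration block are illustrative only.

```python
# Experimental settings on the RGB-D Scenes dataset, as stated above.
CONFIG = {
    "image_size": (256, 256),                     # all 5972 images resized from 640x480
    "dense_sift": {"step_px": 8, "block": (16, 16)},
    "phog": {"block": (16, 16), "step_px": 8, "orientations": 9},
    "dictionary_size": 200,                       # visual words per fusion feature
    "llc_k": 50,                                  # k of the approximate LLC coding
    "svm": "LIBSVM (libsvm-3.12), linear kernel",
    "train_fraction": 0.8,                        # 80% training / 20% testing
}
```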
In this experiment the method of the invention is examined from two aspects: first, it is compared with the methods of several researchers that currently achieve high classification accuracy; second, the classification performance of different RGB-D fusion features is compared with that of the method of the invention.
Table 1. Comparison of classification results on the RGB-D Scenes dataset
Method | Accuracy (%)
---|---
Kernel descriptors + linear SVM (Liefeng Bo) | 89.6
Kernel descriptors + kernel SVM (Liefeng Bo) | 90.0
Kernel descriptors + random forest (Liefeng Bo) | 90.1
HOG fusion features + SVM (A. Janoch) | 77.2
SIFT fusion features + SPM + SVM (N. Silberman) | 84.2
Method of the invention | 91.7
The comparison of classification accuracy with other methods is shown in Table 1. Liefeng Bo, in the article "Kernel descriptors for visual recognition", integrated three features and trained and classified them with a linear SVM, a Gaussian-kernel SVM and a random forest, obtaining accuracies of 89.6%, 90.0% and 90.1% respectively in this experiment. A. Janoch, in the article "A Category-Level 3D Object Dataset: Putting the Kinect to Work", used the HOG algorithm to extract features from the depth image and the color image respectively and used an SVM classifier for the final classification after feature fusion; this method obtains an accuracy of 77.2% in this experiment. N. Silberman, in the article "Indoor scene segmentation using a structured light sensor", first extracted the features of the depth image and the color image with the SIFT algorithm, then fused the features, encoded them with SPM and finally classified them with an SVM; this algorithm obtains a classification accuracy of 84.2% in this experiment. The algorithm proposed by the invention obtains an accuracy of 91.7%, an improvement of 1.6% over the previous best result, which shows that the algorithm of the invention has good classification performance.
Table 2. Comparison of classification results of different fusion features on the RGB-D Scenes dataset
As can be seen from Table 2, when depth information is combined for image classification, the accuracy of classification algorithms based on a single fusion feature is lower than that of algorithms based on multiple fusion features; an image classification algorithm based on multi-feature fusion achieves good classification accuracy, but it is still slightly lower than the image classification algorithm based on decision-level fusion of multiple fusion features.
Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the above specific embodiments; any modifications, equivalent replacements, improvements and the like made within the spirit and principles of the invention shall be included in the scope of protection of the invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710328468.7A CN107085731B (en) | 2017-05-11 | 2017-05-11 | Image classification method based on RGB-D fusion features and sparse coding |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710328468.7A CN107085731B (en) | 2017-05-11 | 2017-05-11 | Image classification method based on RGB-D fusion features and sparse coding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107085731A true CN107085731A (en) | 2017-08-22 |
CN107085731B CN107085731B (en) | 2020-03-10 |
Family
ID=59611626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710328468.7A Active CN107085731B (en) | 2017-05-11 | 2017-05-11 | Image classification method based on RGB-D fusion features and sparse coding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107085731B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090926A (en) * | 2017-12-31 | 2018-05-29 | 厦门大学 | A kind of depth estimation method based on double dictionary study |
CN108596256A (en) * | 2018-04-26 | 2018-09-28 | 北京航空航天大学青岛研究院 | One kind being based on RGB-D object identification grader building methods |
CN108805183A (en) * | 2018-05-28 | 2018-11-13 | 南京邮电大学 | A kind of image classification method of fusion partial polymerization descriptor and local uniform enconding |
CN108875080A (en) * | 2018-07-12 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | A kind of image search method, device, server and storage medium |
CN109741484A (en) * | 2018-12-24 | 2019-05-10 | 南京理工大学 | Driving recorder with image detection and voice alarm functions and its working method |
CN110443298A (en) * | 2019-07-31 | 2019-11-12 | 华中科技大学 | It is a kind of based on cloud-edge cooperated computing DDNN and its construction method and application |
CN111160387A (en) * | 2019-11-28 | 2020-05-15 | 广东工业大学 | Graph model based on multi-view dictionary learning |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105005786A (en) * | 2015-06-19 | 2015-10-28 | 南京航空航天大学 | Texture image classification method based on BoF and multi-feature fusion |
-
2017
- 2017-05-11 CN CN201710328468.7A patent/CN107085731B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105005786A (en) * | 2015-06-19 | 2015-10-28 | 南京航空航天大学 | Texture image classification method based on BoF and multi-feature fusion |
Non-Patent Citations (2)
Title |
---|
- XIANG CHENGYU: "Image classification based on RGB-D fusion features", COMPUTER ENGINEERING AND APPLICATIONS *
- SHEN XIAOXIA ET AL.: "Action recognition algorithm based on Kinect and pyramid features", JOURNAL OF OPTOELECTRONICS·LASER *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090926A (en) * | 2017-12-31 | 2018-05-29 | 厦门大学 | A kind of depth estimation method based on double dictionary study |
CN108596256A (en) * | 2018-04-26 | 2018-09-28 | 北京航空航天大学青岛研究院 | One kind being based on RGB-D object identification grader building methods |
CN108805183A (en) * | 2018-05-28 | 2018-11-13 | 南京邮电大学 | A kind of image classification method of fusion partial polymerization descriptor and local uniform enconding |
CN108875080A (en) * | 2018-07-12 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | A kind of image search method, device, server and storage medium |
CN109741484A (en) * | 2018-12-24 | 2019-05-10 | 南京理工大学 | Driving recorder with image detection and voice alarm functions and its working method |
CN110443298A (en) * | 2019-07-31 | 2019-11-12 | 华中科技大学 | It is a kind of based on cloud-edge cooperated computing DDNN and its construction method and application |
CN110443298B (en) * | 2019-07-31 | 2022-02-15 | 华中科技大学 | A DDNN based on cloud-edge collaborative computing and its construction method and application |
CN111160387A (en) * | 2019-11-28 | 2020-05-15 | 广东工业大学 | Graph model based on multi-view dictionary learning |
CN111160387B (en) * | 2019-11-28 | 2022-06-03 | 广东工业大学 | A Graph Model Based on Multi-View Dictionary Learning |
Also Published As
Publication number | Publication date |
---|---|
CN107085731B (en) | 2020-03-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Face feature extraction: a complete review | |
Yu et al. | Structure-preserving binary representations for RGB-D action recognition | |
CN107085731B (en) | Image classification method based on RGB-D fusion features and sparse coding | |
Liu et al. | Recognizing human actions using multiple features | |
Wang et al. | Motionlets: Mid-level 3d parts for human motion recognition | |
Strecha et al. | LDAHash: Improved matching with smaller descriptors | |
Bai et al. | Shape vocabulary: A robust and efficient shape representation for shape matching | |
Trzcinski et al. | Learning image descriptors with boosting | |
Su et al. | Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories | |
Zhang et al. | Image classification using spatial pyramid robust sparse coding | |
Tabia et al. | Compact vectors of locally aggregated tensors for 3D shape retrieval | |
CN111414958B (en) | Multi-feature image classification method and system for visual word bag pyramid | |
Dai et al. | Metric imitation by manifold transfer for efficient vision applications | |
CN106529586A (en) | Image classification method based on supplemented text characteristic | |
Zhu et al. | Traffic sign classification using two-layer image representation | |
Hsu et al. | Image Classification Using Naive Bayes Classifier With Pairwise Local Observations. | |
Leng et al. | Cascade shallow CNN structure for face verification and identification | |
Zhou et al. | Novel Gaussianized vector representation for improved natural scene categorization | |
Bhattacharya et al. | Covariance of motion and appearance featuresfor spatio temporal recognition tasks | |
Li et al. | Action recognition with spatio-temporal augmented descriptor and fusion method | |
Morioka et al. | Learning Directional Local Pairwise Bases with Sparse Coding. | |
Mahantesh et al. | Content based image retrieval-inspired by computer vision & deep learning techniques | |
Gupta et al. | Video scene categorization by 3D hierarchical histogram matching | |
Gu et al. | Visual Saliency Detection Based Object Recognition. | |
Kastaniotis et al. | HEp-2 cells classification using locally aggregated features mapped in the dissimilarity space |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |